This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Can't login after reboot

Hi,

A customer's Sophos HA Active/Passive cluster was misbehaving a few days ago - logging had stopped and some other things were going on. I rebooted the primary and then could no longer connect to anything. I didn't check beforehand whether HA was functioning correctly :(

The next morning I power cycled both devices and booted up, and while everything appears to be working (packets forwarding, APs working, etc.) I can't actually log in to check anything remotely. I can reach the login for both the web console and SSH, but the active and the passive devices are both refusing my password.

This isn't the first time this has happened with this client - a few things stop working, I reboot, and then I can no longer connect.

As far as I am aware there is nothing strange about their config.

My plan at this stage is to rebuild it all (maybe on 17.5.4; it's currently running 17.5.3) and restore from backup.

The hardware is SG210 and is running XG. I think the hardware might be rev1 but can't confirm at the moment. I've definitely had some issues with XG on rev1 hardware at other clients but not like this.

Any other suggestions? Any idea why the password might have been corrupted?

Thanks

James



  • I'm pretty sure in this case I've found the root cause. One of the nodes has a failed SSD. The disk test passes when run from SFLoader, but after a few hours of running I get this:

    Apr 14 13:11:06 (none) user.err kernel: [276396.518959] end_request: I/O error, dev sda, sector 162339046
    Apr 14 13:11:06 (none) user.info kernel: [276396.518967] sd 0:0:0:0: [sda] Unhandled error code
    Apr 14 13:11:06 (none) user.info kernel: [276396.518968] sd 0:0:0:0: [sda]
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518969] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    Apr 14 13:11:06 (none) user.info kernel: [276396.518970] sd 0:0:0:0: [sda] CDB:
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518970] Write(10): 2a 00 09 ad 18 f6 00 00 18 00
    Apr 14 13:11:06 (none) user.err kernel: [276396.518974] end_request: I/O error, dev sda, sector 162339062
    Apr 14 13:11:06 (none) user.info kernel: [276396.518978] sd 0:0:0:0: [sda] Unhandled error code
    Apr 14 13:11:06 (none) user.info kernel: [276396.518979] sd 0:0:0:0: [sda]
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518980] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    Apr 14 13:11:06 (none) user.info kernel: [276396.518981] sd 0:0:0:0: [sda] CDB:
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518981] Write(10): 2a 00 09 ad 19 16 00 00 18 00
    Apr 14 13:11:06 (none) user.err kernel: [276396.518984] end_request: I/O error, dev sda, sector 162339094
    Apr 14 13:11:06 (none) user.info kernel: [276396.518989] sd 0:0:0:0: [sda] Unhandled error code
    Apr 14 13:11:06 (none) user.info kernel: [276396.518990] sd 0:0:0:0: [sda]
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518990] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    Apr 14 13:11:06 (none) user.info kernel: [276396.518991] sd 0:0:0:0: [sda] CDB:
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518992] Write(10): 2a 00 09 ad 19 46 00 00 18 00

    Which suggests that the SSD is going offline and freezing the cluster.

    It's a bit disappointing that this brings down the whole A/P HA cluster, but at least I know what the problem is and that there's an easy solution.
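    As a cross-check, the failing sector can be read straight out of the Write(10) CDB bytes the kernel dumps: bytes 2-5 are the logical block address (big-endian) and bytes 7-8 the transfer length. A minimal sketch (the `decode_write10` helper is hypothetical, not a tool on the appliance):

    ```python
    def decode_write10(cdb_hex: str):
        """Decode a SCSI Write(10) CDB string as dumped by the kernel."""
        b = bytes.fromhex(cdb_hex.replace(" ", ""))
        assert b[0] == 0x2A, "not a Write(10) opcode"
        lba = int.from_bytes(b[2:6], "big")     # bytes 2-5: logical block address
        blocks = int.from_bytes(b[7:9], "big")  # bytes 7-8: transfer length in blocks
        return lba, blocks

    # First CDB from the log above:
    lba, blocks = decode_write10("2a 00 09 ad 18 f6 00 00 18 00")
    print(lba, blocks)  # 162339062 24
    ```

    The decoded LBA (162339062) matches one of the "end_request" sectors in the log, which is consistent with the same region of the SSD failing on write.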

    James

  • If only I were that lucky. My client's unit has locked up twice in three days with no failed SFLoader tests and no disk errors in syslog...

    I agree. It's called an HA cluster; if one node starts acting wonky for any reason, the other should take over. Period. Not take down the whole cluster...

