This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Can't login after reboot

Hi,

A customer's Sophos HA Active/Passive cluster was misbehaving a few days ago - logging had stopped and some other things were going on. I rebooted the primary and then could no longer connect to anything. I didn't check beforehand whether HA was functioning correctly :(

The next morning I power cycled both devices and booted up, and while everything appears to be working (packets forwarding, APs working, etc.) I can't actually log in to check anything remotely. I can reach the login for both the web console and SSH, but the active and the passive devices are both refusing my password.

This isn't the first time this has happened with this client - a few things stop working, I reboot, and then I can no longer connect.

As far as I am aware there is nothing strange about their config.

My plan at this stage is to rebuild it all (maybe on 17.5.4; it's currently running 17.5.3) and restore from backup.

The hardware is SG210 and is running XG. I think the hardware might be rev1 but can't confirm at the moment. I've definitely had some issues with XG on rev1 hardware at other clients but not like this.

Any other suggestions? Any idea why the password might have been corrupted?

Thanks

James



  • I'm pretty sure in this case I've found the root cause. One of the nodes has a failed SSD. The disk test passes when run from SFLoader, but after a few hours of running I get this:

    Apr 14 13:11:06 (none) user.err kernel: [276396.518959] end_request: I/O error, dev sda, sector 162339046
    Apr 14 13:11:06 (none) user.info kernel: [276396.518967] sd 0:0:0:0: [sda] Unhandled error code
    Apr 14 13:11:06 (none) user.info kernel: [276396.518968] sd 0:0:0:0: [sda]
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518969] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    Apr 14 13:11:06 (none) user.info kernel: [276396.518970] sd 0:0:0:0: [sda] CDB:
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518970] Write(10): 2a 00 09 ad 18 f6 00 00 18 00
    Apr 14 13:11:06 (none) user.err kernel: [276396.518974] end_request: I/O error, dev sda, sector 162339062
    Apr 14 13:11:06 (none) user.info kernel: [276396.518978] sd 0:0:0:0: [sda] Unhandled error code
    Apr 14 13:11:06 (none) user.info kernel: [276396.518979] sd 0:0:0:0: [sda]
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518980] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    Apr 14 13:11:06 (none) user.info kernel: [276396.518981] sd 0:0:0:0: [sda] CDB:
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518981] Write(10): 2a 00 09 ad 19 16 00 00 18 00
    Apr 14 13:11:06 (none) user.err kernel: [276396.518984] end_request: I/O error, dev sda, sector 162339094
    Apr 14 13:11:06 (none) user.info kernel: [276396.518989] sd 0:0:0:0: [sda] Unhandled error code
    Apr 14 13:11:06 (none) user.info kernel: [276396.518990] sd 0:0:0:0: [sda]
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518990] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    Apr 14 13:11:06 (none) user.info kernel: [276396.518991] sd 0:0:0:0: [sda] CDB:
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518992] Write(10): 2a 00 09 ad 19 46 00 00 18 00

    Which suggests that the SSD is going offline and freezing the cluster.

    It's a bit disappointing that this brings down the whole A/P HA cluster, but at least I know what the problem is and that there's an easy solution.
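    As a cross-check, the failing sector can be read straight out of the Write(10) CDB bytes the kernel dumps: bytes 2-5 are the logical block address (big-endian) and bytes 7-8 the transfer length. A minimal sketch (the `decode_write10` helper is hypothetical, not a tool on the appliance):

    ```python
    def decode_write10(cdb_hex: str):
        """Decode a SCSI Write(10) CDB string as dumped by the kernel."""
        b = bytes.fromhex(cdb_hex.replace(" ", ""))
        assert b[0] == 0x2A, "not a Write(10) opcode"
        lba = int.from_bytes(b[2:6], "big")     # bytes 2-5: logical block address
        blocks = int.from_bytes(b[7:9], "big")  # bytes 7-8: transfer length in blocks
        return lba, blocks

    # First CDB from the log above:
    lba, blocks = decode_write10("2a 00 09 ad 18 f6 00 00 18 00")
    print(lba, blocks)  # 162339062 24
    ```

    The decoded LBA (162339062) matches one of the "end_request" sectors in the log, which is consistent with the same region of the SSD failing on write.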

    James

  • If only I were that lucky. My client's unit has locked up twice in three days with no failed SFLoader tests and no disk errors in syslog...

    I agree. It's called an HA cluster; if one node starts acting wonky for any reason, the other should take over. Period. Not take down the whole cluster...

