SFOS HA Cluster Primary hangs without Failover

Our customer has an HA cluster consisting of 2x SG330 Rev 1 running SFOS 18.5.1 MR-1-Build326. Within the same week, the primary device failed twice and did not fail over to the auxiliary. It was necessary to manually power off the primary to trigger a failover to the aux. After the primary power-cycled, it switched back to primary, since "failback to primary" is enabled.

In both cases, there is not much I can find in the logs.

From the primary device's point of view, it looks like it froze completely; the LED display was unresponsive at the time.

From the aux device's point of view, it looks like it never noticed that the primary had died.

The few logs I can see are from after the reload of the primary. The crash happened around 03:00 and the power cycle around 07:00; there are no logs in between.
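
For completeness, this is how I check the HA state from the device console (SSH menu, option 4); a sketch, the exact output varies between releases:

system ha show details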

Primary device:

1974 2022-01-20 07:08:15.204 GMT LOG: database system was interrupted; last known up at 2022-01-20 02:09:39 GMT
1974 2022-01-20 07:08:19.441 GMT LOG: database system was not properly shut down; automatic recovery in progress

This message keeps spamming the syslog:

Jan 20 02:57:04 localhost kernel: [10488717.715382] netlink: 153776 bytes leftover after parsing attributes in process `ipsetelite'.
Jan 20 03:00:04 localhost kernel: [10488897.692808] netlink: 153776 bytes leftover after parsing attributes in process `ipsetelite'.
Jan 20 03:03:04 localhost kernel: [10489077.710459] netlink: 153776 bytes leftover after parsing attributes in process `ipsetelite'.
Jan 20 03:06:04 localhost kernel: [10489257.660675] netlink: 153776 bytes leftover after parsing attributes in process `ipsetelite'.
Jan 20 03:09:03 localhost kernel: [10489437.533547] netlink: 153776 bytes leftover after parsing attributes in process `ipsetelite'.
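
If you want to verify the gap yourself, grepping the syslog from the advanced shell works; a rough sketch (the /log/syslog.log path is an assumption based on our boxes, adjust if yours differs):

grep 'leftover after parsing attributes' /log/syslog.log | tail -3
# the timestamp of the last match shows roughly when kernel logging stopped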

The network logs after the reload indicate that both devices were active at one point.

I assume the appliances played ping-pong a couple of times until the cluster was back up again.

Jan 24 07:15:34: %IP-4-DUPADDR: Duplicate address 192.168.25.1 on Vlan25, sourced by 00e0.2011.0976
Jan 24 07:15:34: %SW_MATM-4-MACFLAP_NOTIF: Host 00e0.2011.0976 in vlan 309 is flapping between port Po44 and port Po45
Jan 24 07:15:35: %SW_MATM-4-MACFLAP_NOTIF: Host 00e0.2011.0976 in vlan 310 is flapping between port Po44 and port Po45
Jan 24 07:15:35: %SW_MATM-4-MACFLAP_NOTIF: Host 00e0.2011.0976 in vlan 75 is flapping between port Po44 and port Po45
Jan 24 07:15:40: %LINEPROTO-5-UPDOWN: Line protocol on Interface TwentyFiveGigE2/0/45, changed state to down
Jan 24 07:15:40: %LINEPROTO-5-UPDOWN: Line protocol on Interface TwentyFiveGigE1/0/45, changed state to down
Jan 24 07:15:40: %LINEPROTO-5-UPDOWN: Line protocol on Interface Port-channel45, changed state to down
Jan 24 07:15:41: %LINK-3-UPDOWN: Interface TwentyFiveGigE2/0/45, changed state to down
Jan 24 07:15:41: %LINK-3-UPDOWN: Interface Port-channel45, changed state to down
Jan 24 07:15:41: %LINK-3-UPDOWN: Interface TwentyFiveGigE1/0/45, changed state to down
Jan 24 07:17:27: %LINK-3-UPDOWN: Interface TwentyFiveGigE1/0/45, changed state to up
Jan 24 07:17:28: %LINK-3-UPDOWN: Interface TwentyFiveGigE2/0/45, changed state to up
Jan 24 07:17:36: %ETC-5-L3DONTBNDL2: Twe1/0/45 suspended: LACP currently not enabled on the remote port.
Jan 24 07:17:37: %ETC-5-L3DONTBNDL2: Twe2/0/45 suspended: LACP currently not enabled on the remote port.
Jan 24 07:18:01: %ETC-5-L3DONTBNDL2: Twe2/0/45 suspended: LACP currently not enabled on the remote port.
Jan 24 07:18:02: %ETC-5-L3DONTBNDL2: Twe1/0/45 suspended: LACP currently not enabled on the remote port.
Jan 24 07:19:50: %LINEPROTO-5-UPDOWN: Line protocol on Interface TwentyFiveGigE1/0/45, changed state to up
Jan 24 07:19:50: %LINEPROTO-5-UPDOWN: Line protocol on Interface TwentyFiveGigE2/0/45, changed state to up
Jan 24 07:19:51: %LINK-3-UPDOWN: Interface Port-channel45, changed state to up
Jan 24 07:19:52: %LINEPROTO-5-UPDOWN: Line protocol on Interface Port-channel45, changed state to up
Jan 24 07:20:18: %SW_MATM-4-MACFLAP_NOTIF: Host 00e0.2011.0976 in vlan 309 is flapping between port Po45 and port Po44
Jan 24 07:20:20: %SW_MATM-4-MACFLAP_NOTIF: Host 00e0.2011.0976 in vlan 310 is flapping between port Po44 and port Po45
Jan 24 07:20:21: %SW_MATM-4-MACFLAP_NOTIF: Host 00e0.2011.0976 in vlan 309 is flapping between port Po44 and port Po45
Jan 24 07:20:23: %SW_MATM-4-MACFLAP_NOTIF: Host 00e0.2011.0976 in vlan 310 is flapping between port Po45 and port Po44
Jan 24 07:20:24: %SW_MATM-4-MACFLAP_NOTIF: Host 00e0.2011.0976 in vlan 309 is flapping between port Po45 and port Po44
Jan 24 07:20:36: %SW_MATM-4-MACFLAP_NOTIF: Host 00e0.2011.0976 in vlan 80 is flapping between port Po45 and port Po44
Jan 24 07:20:38: %SW_MATM-4-MACFLAP_NOTIF: Host 00e0.2011.0976 in vlan 310 is flapping between port Po45 and port Po44
Jan 24 07:20:39: %SW_MATM-4-MACFLAP_NOTIF: Host 00e0.2011.0976 in vlan 309 is flapping between port Po45 and port Po44

The HA link is a direct back-to-back cable, and the ports on both units look fine:

SG330_WP01_SFOS 18.5.1 MR-1-Build326# ifconfig PortE3
PortE3 Link encap:Ethernet HWaddr 00:1A:8C:5F:E2:47
inet addr:192.0.2.1 Bcast:192.0.2.3 Mask:255.255.255.252
inet6 addr: fe80::21a:8cff:fe5f:e247/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:643351 errors:0 dropped:0 overruns:0 frame:0
TX packets:1032066 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:492887169 (470.0 MiB) TX bytes:284445208 (271.2 MiB)

PortE3 Link encap:Ethernet HWaddr 00:1A:8C:60:70:C3
inet addr:192.0.2.2 Bcast:192.0.2.3 Mask:255.255.255.252
inet6 addr: fe80::21a:8cff:fe60:70c3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:945885 errors:0 dropped:0 overruns:0 frame:0
TX packets:378813 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:278501212 (265.5 MiB) TX bytes:120476213 (114.8 MiB)
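
I also pulled the NIC counters on the HA port; a sketch, assuming ethtool ships in the advanced shell of your build (it may not):

ethtool PortE3       # link state, speed, duplex
ethtool -S PortE3    # driver statistics; look for rx/tx error counters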

Are there any known issues?

Has anyone else experienced cluster hangs?

(This is the 4th or 5th time for this customer. Both appliances have already been replaced since the initial installation due to disk failures.)



  • The question is why it hangs. The underlying Linux system could potentially still be "online" and respond to certain packets while other subsystems hang.

    To actually find the reason, you need to attach a serial cable / monitor and look at the system while it is hung.
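
    A minimal way to leave a capture running from a Linux jump host; a sketch (the device path is an example, these appliances typically talk 38400 8N1):

    screen -L -Logfile sg330-console.log /dev/ttyUSB0 38400
    # or with minicom:
    minicom -D /dev/ttyUSB0 -b 38400 -C sg330-console.log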


  • The fact that there are zero logs between 03:00 and 07:00 tells me that this box crashed completely. The only real bummer is that the auxiliary did not take over.

    Unfortunately, the customer doesn't have the option of leaving it in a broken state, and since I don't have remote access, I can't get to the console server to troubleshoot.

    I opened a case, and they immediately requested an upgrade to 18.5.2 without any evidence or explanation.

    We will switch to the standby device tonight to limit the risk, and then RMA both devices. That seems like the only reasonable option, since Sophos is neither able nor willing to troubleshoot this.

  • If you enable PuTTY session logging on the serial connection, you will get output that helps find the RCA.

    www.eye4software.com/.../
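
    For an unattended capture from a Windows host, plink (PuTTY's command-line sibling) can redirect everything to a file; a sketch, the COM port and file name are examples:

    plink -serial COM3 -sercfg 38400,8,n,1,N > sg330-console.log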


  • Hi!

    We might have the same kind of symptoms on one of our customers' XG active/passive clusters (2x XG310, SFOS 18.5.2 MR-2-Build380).

    About every two to three weeks, traffic partially stops flowing (i.e. TeamViewer works, but nothing else does; no traffic between VLANs, etc.) and I cannot access the management GUI (SSH / web), so I need to manually power off the firewalls. I have created a ticket with Sophos, and we are now trying to capture logs from those firewalls via serial/PuTTY... (but I cannot say yet whether we have any logs; I cannot access my logging device in the datacenter right now :( )

    I have done this three times now and it's getting frustrating :(

  • The device crashed again; luckily, it happened after we switched the primary to auxiliary.

    This time I was able to console into the frozen device remotely; unfortunately, it just repeats "can't run '/bin" when you try to log in.

    Fun fact: the primary still thinks the auxiliary is alive. You can ping the peer via the HA link, but SSH is refused.

    I can't believe that "ping" is the only keepalive mechanism the XG considers... anyway.

    Unfortunately, there aren't any logs on the aux from the moment it crashed until it rebooted.

    If I had to guess, I would say the SSD stopped responding, so the device wasn't able to write anything. But since it reboots properly and fsck succeeds, there is no way to be sure it isn't a software bug.
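
    If the SSD stays the prime suspect, SMART data would be the next thing to pull; a sketch, assuming smartmontools is present on the appliance (it may not be) and the disk enumerates as /dev/sda:

    smartctl -H /dev/sda    # overall health verdict
    smartctl -A /dev/sda    # attributes such as reallocated sectors and CRC errors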

  • Sorry - this is the COM port. As expected, the device is in a weird Linux state. It seems like certain modules are dead, but the underlying Linux is still operating. Have you already replaced this unit?


  • I'll wait for feedback from the support engineer, and we will probably RMA this device or the whole cluster.

    We already RMA'd both the primary and the auxiliary in the past. Both had a disk failure and were unable to boot, unlike this time.

  • Hello Samuel,

    did you ever find a solution to this problem?

    I have a similar problem: two units in an A/P cluster which sometimes hang (most recently after about 3 weeks), and I am not able to connect to (or even ping) the units. It seems like both units are running and each thinks it is the master. The workaround is to shut down the port at the switch, to force the link down on one of the XG units, roughly as sketched below. This makes that unit restart, after which the master is reachable and working again. The secondary unit rejoins the cluster after the restart, and the cluster then works for some time.
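
    On a Cisco IOS switch, for example, that looks roughly like this (interface name borrowed from the logs above; adjust to your setup):

    configure terminal
     interface TwentyFiveGigE1/0/45
      shutdown
    ! ...wait until the hung unit has restarted, then:
      no shutdown
     end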

    I updated to 18.5.2, but this did not solve the problem.

    Best regards,

    Petr
