HA cluster problem

Hi,

We have an HA cluster that is in standalone/faulty state. The faulty device (standby) is still reachable through SSH over the HA link, but as far as I can see it has the same IP configured on the LAN interface, so I cannot reach it through the peer administration IP. It seems both nodes believed they were primary after a power interruption.

Is there any way to disable/enable HA through SSH so I can bring back the HA cluster?
I do not have physical access to the device, as it is located 1500 km (three country borders) away.

I had the customer reboot the aux. device. Central then reported "Both HA nodes are now connected and at full health.", directly followed by "One of the HA nodes is down or in a degraded state, and high availability is not degraded."

Regards,

Kevin



This thread was automatically locked due to age.
  • Hi kerobra

    Please follow the "How to troubleshoot HA issues" guide linked below and share your findings from the logs:

    https://docs.sophos.com/nsg/sophos-firewall/18.5/Help/en-us/webhelp/onlinehelp/HighAvailablityStartupGuide/HATroubleshooting/index.html 

    Thanks and Regards

    "Sophos Partner: Infrassist Technologies Pvt Ltd".

    If a post solves your question please use the 'Verify Answer' button.

  • ifconfig will always show the IP of the primary. HA does not expose the peer address through ifconfig in this scenario. Instead, it tags an alias as the peer administration IP, so you do not see the alias. There is a known limitation: you cannot access the peer administration IP through the primary (for example, via an IPsec tunnel). So you need a device on site to access the peer administration IP.

    BTW: If you break the HA, you would be able to access the peer administration remotely. 

    If you can access via SSH, check the applogs of both appliances to see the reason for the failure in the first place, before starting to break the HA.

    For more insight, run the following on both appliances:

    # grep ha: applog.log | less


  • Should the port status in dmesg, run on the backup device through the SSH connection over the HA interface, show all interfaces up?
    Here is the output for Port1 (LAN):

    [    8.470997] igb_nm 0000:02:00.0 Port1: renamed from eth0
    [   62.745131] 505.242015 [2311] netmap_do_regif           vale0:Port1: lut ffffaa2a41fb1000 bufs 33792 size 2048
    [   62.745135] 505.242021 [2334] netmap_do_regif           vale0:Port1: mtu 1500 rx_buf_maxsize 2048 netmap_buf_size 2048
    [   63.423428] vfp info: vale_ports_map_table_init:326: Adding LIF for Port1 index 0
    [   63.423445] vfp info: vale_ports_map_table_init:377:   Port   0:  "vale0:Port1", Phys port 0. <=> Vale Stack 1, vale0:Port1^
    [   63.423446] vfp info: vale_ports_map_table_init:382:   Port   1:  "vale0:Port1^", Stack port. <=> Vale Phys 0, vale0:Port1
    [   63.423457] vfp info: vale_ports_map_table_init:394:   Phys 0 <-> Vale 0 (vale0:Port1)
    [   66.934095] IPv6: ADDRCONF(NETDEV_UP): Port1: link is not ready
    [   66.934097] 8021q: adding VLAN 0 to HW filter on device Port1
    

    and here for Port2 (WAN):

    [    8.492227] igb_nm 0000:03:00.0 Port2: renamed from eth1
    [   62.828708] 505.325592 [2311] netmap_do_regif           vale0:Port2: lut ffffaa2a41fb1000 bufs 33792 size 2048
    [   62.828711] 505.325597 [2334] netmap_do_regif           vale0:Port2: mtu 1500 rx_buf_maxsize 2048 netmap_buf_size 2048
    [   63.423431] vfp info: vale_ports_map_table_init:326: Adding LIF for Port2 index 1
    [   63.423447] vfp info: vale_ports_map_table_init:377:   Port   2:  "vale0:Port2", Phys port 1. <=> Vale Stack 3, vale0:Port2^
    [   63.423447] vfp info: vale_ports_map_table_init:382:   Port   3:  "vale0:Port2^", Stack port. <=> Vale Phys 2, vale0:Port2
    [   63.423457] vfp info: vale_ports_map_table_init:394:   Phys 1 <-> Vale 2 (vale0:Port2)
    [   68.303971] IPv6: ADDRCONF(NETDEV_UP): Port2: link is not ready
    [   68.303973] 8021q: adding VLAN 0 to HW filter on device Port2
    [   72.656850] igb_nm 0000:03:00.0 Port2: igb: Port2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [   72.657097] IPv6: ADDRCONF(NETDEV_CHANGE): Port2: link becomes ready

    With my very basic Linux skills, I would say that the backup appliance is not connected on all the interfaces it should be, although the customer on site says all links are up. Port1 logs "link is not ready" and never reports "NIC Link is Up", whereas Port2 does. That would explain why I cannot reach the faulty appliance through the designated peer admin interface (LAN) from a server inside the LAN.

    Would you agree with my suggestion of a failed link?

    Regards,

    Kevin

    Sophos CE/CA (XG, UTM, Central Endpoint)
    Gold Partner
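
The link check discussed in the post above can also be done mechanically on a saved copy of the dmesg output. The following is only a minimal sketch, meant to run on an admin workstation rather than on the appliance. The function name `last_link_state` is hypothetical, the "up" and "not ready" patterns are taken from the excerpts quoted in this thread, and the "NIC Link is Down" pattern is an assumption based on the igb driver's usual wording, since it does not appear in the thread.

```python
import re

# Link events as they appear in the dmesg excerpts quoted in this thread.
# "NIC Link is Down" is NOT in the thread; it is assumed here as the igb
# driver's usual counterpart to "NIC Link is Up".
EVENTS = [
    (re.compile(r"(Port\d+): link is not ready"), "down"),
    (re.compile(r"(Port\d+): link becomes ready"), "up"),
    (re.compile(r"(Port\d+) NIC Link is Up"), "up"),
    (re.compile(r"(Port\d+) NIC Link is Down"), "down"),
]


def last_link_state(dmesg_text):
    """Return {port: "up" | "down"} from the last link event seen per port."""
    state = {}
    for line in dmesg_text.splitlines():
        for pattern, value in EVENTS:
            match = pattern.search(line)
            if match:
                # Later events overwrite earlier ones for the same port.
                state[match.group(1)] = value
                break
    return state


# Link-related lines copied from the Port1 and Port2 output in the post.
SAMPLE = """\
[   66.934095] IPv6: ADDRCONF(NETDEV_UP): Port1: link is not ready
[   68.303971] IPv6: ADDRCONF(NETDEV_UP): Port2: link is not ready
[   72.656850] igb_nm 0000:03:00.0 Port2: igb: Port2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   72.657097] IPv6: ADDRCONF(NETDEV_CHANGE): Port2: link becomes ready
"""

if __name__ == "__main__":
    for port, value in sorted(last_link_state(SAMPLE).items()):
        print(port, value)
```

On the sample lines this reports Port1 as down and Port2 as up, which matches the reading that Port1 never got a link. Because dmesg is a ring buffer, the result only reflects events still present in the captured output.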

  • Is there any way to disable/enable HA through SSH so I can bring back the HA cluster?

    console>system ha disable 

    The above command will disable HA

    Make sure to take regular backups of the existing configuration.

    Before you try to disable HA, please run the following commands on both appliances, from SSH as well as from the console with a serial cable:

    console>system ha show details

    console>system ha show logs lines 10000

    If you want to re-enable HA, that can be done from the GUI only, as described at https://www.sophos.com/en-us/medialibrary/PDFs/documentation/SophosFirewall/Pocket-Guides/Active-Passive-HA-Configuration.pdf 

    Please share the output 

    Regards

    "Sophos Partner: Infrassist Technologies Pvt Ltd".


  • But that is the device console, not the advanced shell. I can only use the advanced shell/SSH through the dedicated HA link.
    As I said, physical access is not possible because the devices are located in Romania and we are located in Germany, which makes it a bit difficult to plug in cables.

    If I had physical access, I would have reinitiated the HA services by now and could also check the device cabling. But I am limited to remote assistance only, and that is my problem.

    Regards,

    Kevin

    Sophos CE/CA (XG, UTM, Central Endpoint)
    Gold Partner

  • From SSH, go to option 4 (Device Console) in the menu and share the output of:

    console>system ha show details

    console>system ha show logs lines 10000

    Also, share the status under CONFIGURE > System Services > High Availability, and the Device Access status under System > Administration.

    Regards

    "Sophos Partner: Infrassist Technologies Pvt Ltd".


  • OK, we can stop here...
    I asked the customer to take a picture of both firewalls, and guess what? Port1 had no link.

    They seem to have fixed the issue, since both nodes are now available and synced, and the peer admin IP is reachable too.

    Regards,

    Kevin

    Sophos CE/CA (XG, UTM, Central Endpoint)
    Gold Partner