HA cluster problem

Hi,

we have an HA cluster that is in a standalone/faulty state. The faulty (standby) device is still reachable via SSH over the HA link, but as far as I can see it has the same IP configured on the LAN interface, so I cannot reach it through the peer administration IP. It seems both nodes decided they were primary after a power interruption.

Is there any way to disable/enable HA through SSH so I can bring the HA cluster back?
I do not have physical access to the device, as it is located 1500 km and three country borders away.
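For context, these are the console commands I am hoping to use for this (assuming SFOS; command names are from memory and should be verified against the CLI guide):

```
console> system ha show details    # show the current HA state of this node
console> system ha disable         # break HA so the node becomes standalone
```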

I had the customer reboot the auxiliary device. Central then reported "Both HA nodes are now connected and at full health.", directly followed by "One of the HA nodes is down or in a degraded state, and high availability is not degraded."

Regards,

Kevin



This thread was automatically locked due to age.
  • Ifconfig will always show the IP of the primary; HA does not work with ifconfig in this scenario. Instead it tags an alias as the peer administration IP, so you do not see the alias in the output. There is also a known limitation: you cannot access the peer administration IP through the primary (for example over an IPsec tunnel), so you need a device on site to reach the peer administration IP.

    BTW: If you break the HA, you would be able to access the peer administration IP remotely.

    If you can access via SSH, check the applog of both appliances to find the reason for the failure in the first place, before breaking the HA.

    You can run grep 'ha:' applog.log | less on both appliances for more insight.

    __________________________________________________________________________________________________________________
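    The filter above can be sketched locally. The sample log lines below are an assumption for illustration (the real applog.log messages will differ); only the "ha:" tag matters for the grep:

```shell
# Sample applog.log content; the exact HA message wording is invented
# here for illustration, only the "ha:" tag is what the filter keys on.
cat > applog.log <<'EOF'
Jan 10 04:12:01 ha: lost heartbeat from peer on dedicated HA link
Jan 10 04:12:05 ha: taking over as PRIMARY, peer unreachable
Jan 10 04:12:06 applog: unrelated service message
EOF

# Keep only the HA-related entries (pipe through less on the appliance for paging).
grep 'ha:' applog.log
```

    On both appliances this should show, in order, when each node lost contact with its peer and when it promoted itself.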

  • Should the port status in dmesg, executed on the backup device through the SSH connection on the HA interface, show all interfaces as up?
    Here is the output for Port1 (LAN):

    [    8.470997] igb_nm 0000:02:00.0 Port1: renamed from eth0
    [   62.745131] 505.242015 [2311] netmap_do_regif           vale0:Port1: lut ffffaa2a41fb1000 bufs 33792 size 2048
    [   62.745135] 505.242021 [2334] netmap_do_regif           vale0:Port1: mtu 1500 rx_buf_maxsize 2048 netmap_buf_size 2048
    [   63.423428] vfp info: vale_ports_map_table_init:326: Adding LIF for Port1 index 0
    [   63.423445] vfp info: vale_ports_map_table_init:377:   Port   0:  "vale0:Port1", Phys port 0. <=> Vale Stack 1, vale0:Port1^
    [   63.423446] vfp info: vale_ports_map_table_init:382:   Port   1:  "vale0:Port1^", Stack port. <=> Vale Phys 0, vale0:Port1
    [   63.423457] vfp info: vale_ports_map_table_init:394:   Phys 0 <-> Vale 0 (vale0:Port1)
    [   66.934095] IPv6: ADDRCONF(NETDEV_UP): Port1: link is not ready
    [   66.934097] 8021q: adding VLAN 0 to HW filter on device Port1
    

    and here for Port2 (WAN):

    [    8.492227] igb_nm 0000:03:00.0 Port2: renamed from eth1
    [   62.828708] 505.325592 [2311] netmap_do_regif           vale0:Port2: lut ffffaa2a41fb1000 bufs 33792 size 2048
    [   62.828711] 505.325597 [2334] netmap_do_regif           vale0:Port2: mtu 1500 rx_buf_maxsize 2048 netmap_buf_size 2048
    [   63.423431] vfp info: vale_ports_map_table_init:326: Adding LIF for Port2 index 1
    [   63.423447] vfp info: vale_ports_map_table_init:377:   Port   2:  "vale0:Port2", Phys port 1. <=> Vale Stack 3, vale0:Port2^
    [   63.423447] vfp info: vale_ports_map_table_init:382:   Port   3:  "vale0:Port2^", Stack port. <=> Vale Phys 2, vale0:Port2
    [   63.423457] vfp info: vale_ports_map_table_init:394:   Phys 1 <-> Vale 2 (vale0:Port2)
    [   68.303971] IPv6: ADDRCONF(NETDEV_UP): Port2: link is not ready
    [   68.303973] 8021q: adding VLAN 0 to HW filter on device Port2
    [   72.656850] igb_nm 0000:03:00.0 Port2: igb: Port2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    [   72.657097] IPv6: ADDRCONF(NETDEV_CHANGE): Port2: link becomes ready

    With my very basic Linux skills I would say that the backup appliance does not have link on all the interfaces it should: Port1 (LAN) never reports "link becomes ready", while Port2 (WAN) does. The customer on site, however, says all links are up. That would explain why I cannot reach the faulty appliance through the designated peer administration interface (LAN) from a server inside the LAN.

    Would you agree that a failed link is the likely cause?

    Regards,

    Kevin

    Sophos CE/CA (XG, UTM, Central Endpoint)
    Gold Partner
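    The link-state reading above can be checked mechanically; a minimal sketch over the two decisive ADDRCONF lines copied from the dmesg output above:

```shell
# The two ADDRCONF lines copied from the dmesg output above:
# Port1 never became ready, Port2 did.
cat > dmesg_ports.txt <<'EOF'
[   66.934095] IPv6: ADDRCONF(NETDEV_UP): Port1: link is not ready
[   72.657097] IPv6: ADDRCONF(NETDEV_CHANGE): Port2: link becomes ready
EOF

# Ports whose last ADDRCONF entry says "link is not ready" have no carrier.
grep 'link is not ready' dmesg_ports.txt
```

    Only Port1 matches, which supports the failed-LAN-link reading: Port2 additionally logs "NIC Link is Up 1000 Mbps", and no such line ever appears for Port1.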