HA failover taking a long time (Hyper-V)

I'm in the process of trying to add a second XG VM in my production environment to allow for HA. As part of that I spun up two test machines to ensure I was comfortable with how everything failed over, the behavior of IPSEC tunnels, etc.

In my environment I have two brand-new XG VMs running on Hyper-V, both on version 18.0.5 MR-5-Build586.

HA itself seems to work fine; the devices see each other, report status, etc. However, if I fail over from the primary to the secondary (either via a hard power event or by clicking the "Switch to passive device" button on the HA page), it takes between 1 and 2 minutes for the aux device to take over and begin responding properly on the network. The formerly primary device continues booting and, usually around 4-5 minutes after its boot, reboots again. At that point the aux device usually loses connectivity for another 2-3 minutes before finally recovering and responding properly. After everything sits for 5 minutes or so it appears stable: the formerly primary device is now aux and the formerly aux device is primary.

To test this further I ran the same process in a lab environment I have, and there everything works as expected: the failover takes 5-10 seconds or so and the machine only reboots once. Lab and production were set up identically for this test at the VM level (XG VMs created from the same base VHD, running the same software, etc.), so it doesn't appear to be an issue with the XG software itself but rather some environmental difference. Unfortunately I can't duplicate the full production config in my lab, so I can't experiment freely; I'm hoping to pull some logs from the XG VMs to understand why the aux device takes upwards of 1-2 minutes to take over.

Things I've noticed in production:

- When the primary reboots, the aux keeps reporting itself as the aux for the whole 1-2 minute window according to "system ha show details": Current HA State is Aux and Peer HA State is Primary. Once it takes over, it moves to Current HA State: Standalone, Peer HA State: Failure, as you'd generally expect while the other device finishes its reboot cycle.

- When the formerly primary device reboots the first time, it seems to come back online thinking it's still the primary. If I connect to the console during those first few minutes it shows Current HA State: Standalone, Peer HA State: Failure, so both devices are simultaneously claiming to be standalone with the peer in a fault state.

- Once the former primary device completes its second reboot, it reports itself as the aux device, as it should.

- If I keep the primary device shut down until after the aux has fully taken over, and only then boot it, it comes up on the very first boot showing that it's the aux device. That seems to imply the initial boot into standalone mode is related to how long the aux takes to take over.

- HA config on the VMs was done via the quick discovery mode.

One additional thing that complicates troubleshooting is that the production XG VMs run in a managed environment where we don't directly control the network layer. I need to identify what the devices may be seeing that causes the delay and, if necessary, bring that to the managed services team to resolve. We've previously had a couple of issues in this environment with our standalone production XG VM not responding on the network after a boot, which turned out to be the ARP cache holding an old MAC address for its interfaces. I'm not sure whether that could also be a factor here, nor how to verify whether it is.
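
For what it's worth, one check I'm considering is capturing ARP traffic for the firewall's IPs from another VM on the same L2 segment during a failover, to see when (or whether) the MAC the network learns actually changes. This is only a rough sketch (Python 3 with scapy, run with root privileges); the interface name and IPs are placeholders, not values from the real setup:

    # Rough sketch: log ARP activity for the firewall's IPs during a failover test.
    # Assumes scapy is installed and this runs on a VM in the same L2 segment.
    # "eth0" and WATCHED_IPS are placeholders for the real environment.
    from datetime import datetime
    from scapy.all import sniff, ARP

    WATCHED_IPS = {"192.0.2.1", "192.0.2.2"}  # placeholder: XG interface/virtual IPs

    def log_arp(pkt):
        if ARP not in pkt:
            return
        arp = pkt[ARP]
        if arp.psrc in WATCHED_IPS:
            # Sender IP == target IP is the gratuitous "I own this IP" announcement.
            kind = "gratuitous" if arp.psrc == arp.pdst else "regular"
            print(f"{datetime.now().isoformat()} {kind} ARP op={arp.op} "
                  f"ip={arp.psrc} mac={arp.hwsrc}")

    sniff(filter="arp", prn=log_arp, store=False, iface="eth0")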

I'm looking for help identifying why the aux device takes so long to complete the takeover in my production environment, and which logs I can pull to pinpoint the difference between the lab and production environments so I can work with the relevant teams to resolve it.

Thanks,



  • Are there any specific logs I can pull that would hint at this (for example, the VM waiting for an ARP response)? As I mentioned, this is in a managed datacenter, so I'd like to take as many specifics as possible to the vendor/team.

  • Whenever we've had HA issues with v17.5 or now v18 and went to support, it was never possible to discover the root cause from the logs we provided or from what they gathered themselves on the (physical) XGs. They always needed debug levels enabled to diagnose, and always more tests.

    As your problem seems to be reproducible, support should be able to catch it. But I'd expect a long-running support case and lots of re-creations of the issue (each time with downtime...) if you decide to open a case.

  • It depends heavily on the customer setup and the switch infrastructure. Sophos works with a virtual MAC and a virtual IP. The virtual MAC can be disabled (see the checkbox above), but HA still works with a virtual IP. After the takeover, the aux sends out a gratuitous ARP (https://wiki.wireshark.org/Gratuitous_ARP), which simply means "I am here, give me my packets". Some switch setups cannot process this failover quickly, so it takes a long time until packets are sent to the new host (see the sketch after this reply).

    Just to set expectations: the webadmin will not load (fully) until HA is back in a stable state. Network traffic should work, but the webadmin takes some time until the old primary is back online and back in HA (fully synced).
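
    To make the gratuitous ARP idea concrete: it's an ordinary ARP frame in which the sender announces its own IP (sender IP equals target IP), broadcast so switches and hosts can update their tables. A minimal illustration with scapy; the MAC and IP here are made-up placeholders, not the pair's actual virtual MAC/IP:

        # Illustration only: what a gratuitous ARP announcement looks like on the wire.
        # The MAC/IP values are placeholders, not taken from the HA pair in question.
        from scapy.all import Ether, ARP, sendp

        virtual_mac = "00:16:3e:00:00:01"   # placeholder virtual MAC
        virtual_ip  = "192.0.2.1"           # placeholder virtual IP

        garp = (
            Ether(src=virtual_mac, dst="ff:ff:ff:ff:ff:ff") /
            ARP(op=1,                              # request form of gratuitous ARP
                hwsrc=virtual_mac, psrc=virtual_ip,
                hwdst="00:00:00:00:00:00", pdst=virtual_ip)  # sender IP == target IP
        )
        garp.show()
        # sendp(garp, iface="eth0")  # only send deliberately, e.g. in a lab

    If the datacenter can capture on the upstream switch during a failover, comparing the timestamp of this announcement against when traffic actually shifts should show whether the delay is on the XG side or on the switch side.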

  • I'm basing the "outage" on pinging the management IPs and WAN IPs, as well as pings across an IPsec tunnel established on the pair.

    For the first 1-2 minutes the aux device responds only on the aux management IP, while the WAN and other interfaces the pair handles do not respond. Once it recognizes that it should take over, it starts responding on the primary management IP within a few seconds and all relevant IPs come back (a rough timing sketch is below).

    I'll reach out to the datacenter regarding the ARP side of things and see whether they can spot anything in the traffic.
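
    In case it helps anyone reproduce the measurement, the timing I'm doing is roughly this: ping each address once a second and timestamp every up/down transition. This is only a sketch (Python 3 on a Linux host, using Linux-style ping flags; the target IPs are placeholders):

        # Sketch: timestamp up/down transitions for several IPs during a failover test.
        # TARGETS are placeholders for the management IPs, a WAN IP and a host across
        # the IPsec tunnel; assumes a Linux-style "ping -c 1 -W 1".
        import subprocess
        import time
        from datetime import datetime

        TARGETS = ["192.0.2.1", "192.0.2.2", "198.51.100.1"]  # placeholders
        state = {ip: None for ip in TARGETS}

        while True:
            for ip in TARGETS:
                up = subprocess.run(
                    ["ping", "-c", "1", "-W", "1", ip],
                    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                ).returncode == 0
                if up != state[ip]:
                    print(f"{datetime.now().isoformat()} {ip} {'UP' if up else 'DOWN'}")
                    state[ip] = up
            time.sleep(1)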