This discussion has been locked.

HA failover taking a long time (Hyper-V)

I'm in the process of trying to add a second XG VM in my production environment to allow for HA. As part of that I spun up two test machines to ensure I was comfortable with how everything failed over, the behavior of IPSEC tunnels, etc.

So in my environment I have two brand-new XG VMs running on Hyper-V, both on 18.0.5 MR-5-Build586.

They seem to work in HA fine, report to each other, etc. However, if I attempt to fail over from the primary to the secondary (either via a hard power event or just by clicking the "Switch to passive device" button on the HA page), it takes between 1 and 2 minutes for the aux device to take over and begin properly responding on the network. The formerly primary device continues booting and, usually around 4-5 minutes after its boot, reboots again. At this time the aux device usually loses connection for another 2-3 minutes before finally recovering and responding properly. If everything sits for 5 minutes or so, it seems to be stable and good: the formerly primary device is aux and the formerly aux device is primary.

To test this further I ran the same process in a lab environment I have. In that environment everything seems to work properly. The failover takes 5-10 seconds or so and the machine only boots once. Both lab and production were set up identically for this test at the VM level: XG VMs created off the same base VHD, running the same software, etc. So it doesn't seem to be any sort of issue with the XG software itself, but rather some environment difference. Unfortunately I can't duplicate the full config of production in my lab, so I can't experiment with it fully. I'm hoping to be able to pull some logs from the XG VMs to understand why the aux device takes 1-2 minutes to take over.

Things I've noticed in production:

- When the primary reboots, the aux keeps thinking it's the aux for the whole 1-2 minute duration, based on "system ha show details": Current HA State is Aux and Peer HA State is Primary. Once it takes over, it goes to Current HA State: Standalone, Peer HA State: Failure, as you'd generally expect while the other device finishes its reboot cycle.

- When the formerly primary device reboots the first time, it seems to come back online thinking it's still the primary. If I connect to the console during those first few minutes it will say Current HA State: Standalone, Peer HA State: Failure. Therefore both devices are simultaneously saying they're standalone with the peer in failure.

- Once the former primary device does its second reboot, it says it's the aux device, as it should.

- If I shut down the primary device, wait until the aux has fully taken over, and then boot it, it comes up on the very first boot showing that it's the aux device. This seems to imply that the first boot in standalone mode is related to how long the aux is taking to take over.

- HA config on the VMs was done via the quick discovery mode.

One additional thing that makes this more complicated for me to troubleshoot is that the XG VMs in production are running in a managed environment where we don't directly control the network layer. Thus I need to be able to identify what the devices may be seeing that's causing the delay and, if necessary, bring that to the managed team to potentially resolve. In this environment we've previously had a couple of issues with our actual standalone production XG VM not responding on the network after a boot, and it was due to the ARP cache holding an old MAC address for the interfaces. I'm not sure if that could also be causing an issue here, nor am I sure how to verify whether it is or isn't.
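One possible way to check for a stale ARP entry, sketched under assumptions: if a Linux host with iproute2 sits on the same segment as the XG interfaces, the output of `ip neigh show` can be compared against the MAC addresses the firewall is expected to answer with. The IP and MAC values below are placeholders, and `find_stale_entries` is a hypothetical helper name, not part of any Sophos tooling. A managed switch or router would need its own equivalent command.

```python
def parse_ip_neigh(text):
    """Map IP -> MAC from `ip neigh show` output (Linux iproute2 format assumed)."""
    entries = {}
    for line in text.splitlines():
        tokens = line.split()
        # Entries in FAILED or INCOMPLETE state carry no lladdr, so skip them.
        if "lladdr" not in tokens:
            continue
        idx = tokens.index("lladdr")
        if idx + 1 < len(tokens):
            entries[tokens[0]] = tokens[idx + 1].lower()
    return entries


def find_stale_entries(neigh_text, expected):
    """Return {ip: (cached_mac, expected_mac)} for entries that disagree."""
    cached = parse_ip_neigh(neigh_text)
    stale = {}
    for ip, mac in expected.items():
        seen = cached.get(ip)
        if seen is not None and seen != mac.lower():
            stale[ip] = (seen, mac.lower())
    return stale


# Placeholder data: the cache still holds the old MAC for the first IP.
sample_output = (
    "203.0.113.1 dev eth0 lladdr 00:15:5d:00:00:01 STALE\n"
    "203.0.113.2 dev eth0 FAILED\n"
)
expected_macs = {
    "203.0.113.1": "00:15:5D:00:00:02",
    "203.0.113.2": "00:15:5D:00:00:03",
}
stale = find_stale_entries(sample_output, expected_macs)
```

Running the comparison right after a failover and again a few minutes later would show whether the neighbor cache lags behind the takeover.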

I'm looking for assistance in identifying why the aux device takes so long in my production environment to take over for the primary, and which logs I can pull to help identify the difference between the lab and production environments, so I can work with the relevant teams to resolve it.

Thanks,

0 LuCar Toni over 5 years ago

Have you enabled the checkbox in HA for virtual systems?

And is the network traffic affected, or only the webadmin?

0 Sean Patterson over 5 years ago in reply to LuCar Toni

Yes, I have the "Use host or hypervisor-assigned MAC address" checkbox checked.

It is both network traffic and webadmin traffic that has the outage times.

0 LuCar Toni over 5 years ago in reply to Sean Patterson

Sounds like a network issue. Did you check on the switch whether the packets are routed correctly, and whether spoof protection is being triggered in this scenario?

0 LHerzog over 5 years ago in reply to Sean Patterson

In our experience, an HA failover with XG was never really quick or very reliable, at least not in a sophisticated network environment. So we do expect minutes of outage when this happens, because it really takes its time.

Since, as you say, this is much quicker in your lab, it may have to do with learned ARP entries, existing sessions, etc. I'm excited for any changes in HA.

0 Sean Patterson over 5 years ago in reply to LuCar Toni

Are there any specific logs I can pull that would hint towards this (like the VM waiting for the ARP response or something)? As I mentioned, this is in a managed datacenter, so I'd like as many specifics as possible to take to the vendor/team.

0 Sean Patterson over 5 years ago in reply to LHerzog

Yeah, the way it is behaving now isn't the end of the world. 2 minutes of downtime for a failover is dramatically better than a manual recovery process, or the 30-45 minutes of downtime for a full firmware upgrade that we have now, but I'd love to be able to make it work the same as it does in the lab, where there's essentially no noticeable interruption to the business.

0 LHerzog over 5 years ago in reply to Sean Patterson

Whenever we've had HA issues with v17.5 or now v18 and went to support to complain, it was never possible to discover the root cause from all the logs that we provided or that they searched themselves on the (physical) XGs. They needed to have some debug levels enabled to diagnose, and always more tests.

As your problem seems to be reproducible, support should be able to catch it. But I'd expect a very long support case and lots of recreations of the issue (each time with downtime) if you decide to open a case.

0 LuCar Toni over 5 years ago in reply to LHerzog

It highly depends on the customer setup and the switch infrastructure. Sophos works with a virtual MAC and a virtual IP. The virtual MAC can be disabled (see the checkbox above), but HA still works with a virtual IP. After the takeover, the aux will send out a gratuitous ARP: https://wiki.wireshark.org/Gratuitous_ARP It simply means "I am here, give me my packets". Some switch setups cannot process this "failover" quickly, hence it takes a long time until the packets are sent to the new host.

Just to set expectations: HA will not (fully) load the webadmin until HA is in a stable state. This means the network traffic should work, but the webadmin takes some time, until the old primary is back online and back in the HA (fully synced).
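To make the gratuitous ARP idea concrete, here is a minimal sketch of what such a frame generally looks like on the wire: a broadcast ARP request in which the sender IP and the target IP are the same address. This illustrates the generic protocol layout only; it is not taken from the XG firmware, the MAC and IP are placeholders, and the code builds the bytes without sending anything.

```python
import socket
import struct


def build_gratuitous_arp(mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP request (sketch)."""
    sender_mac = bytes.fromhex(mac.replace(":", ""))
    sender_ip = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    ethernet = b"\xff" * 6 + sender_mac + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, then zeroed target MAC and target IP equal to sender IP.
    arp += sender_mac + sender_ip + b"\x00" * 6 + sender_ip

    return ethernet + arp


# Placeholder values for illustration only.
frame = build_gratuitous_arp("00:15:5d:00:00:02", "203.0.113.1")
```

In a packet capture taken on the switch side, a frame of this shape arriving from the new primary right after takeover would suggest the announcement is being sent, and the delay lies in how the network reacts to it.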

0 Sean Patterson over 5 years ago in reply to LuCar Toni

I'm basing the "outage" on pinging management IPs, WAN IPs, as well as pings over an IPsec tunnel that is established on the pair.

The aux device responds on the aux management IP for the first 1-2 minutes, during which the WAN and other interfaces the pair handles are not responding. Then, once it seems to recognize it should take over, it starts responding on the primary management IP within a few seconds and all relevant IPs start responding.

I'll reach out to the datacenter regarding the ARP stuff and see if there's anything they can see in terms of traffic.
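To hand the datacenter team exact timestamps rather than rough durations, the ping results could be reduced to outage windows per IP. The following is a minimal sketch, assuming each probe has already been recorded as a (timestamp, success) pair; `outage_windows` is a hypothetical helper and the sample data is made up.

```python
def outage_windows(samples):
    """Collapse (timestamp, ok) probe results into (start, end) outage windows.

    start is the first failed probe; end is the first successful probe after
    it, or None if the target never recovered within the samples.
    """
    windows = []
    start = None
    for timestamp, ok in sorted(samples):
        if not ok and start is None:
            start = timestamp
        elif ok and start is not None:
            windows.append((start, timestamp))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# Made-up probe results, in seconds since the failover was triggered.
wan_samples = [(0, True), (1, False), (2, False), (3, True), (4, False)]
wan_outages = outage_windows(wan_samples)
```

Comparing the windows for the aux management IP, the primary management IP, the WAN IPs, and the tunnel endpoint side by side would show which addresses go dark first and which recover together.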