I'm in the process of trying to add a second XG VM in my production environment to allow for HA. As part of that I spun up two test machines to ensure I was comfortable with how everything failed over, the behavior of IPSEC tunnels, etc.
So in my environment I have two brand new XG VMs running on Hyper-V, both running 18.0.5 MR-5-Build586.
They seem to work in HA fine, report to each other, etc. However, if I attempt to fail over from the primary to the secondary (either via a hard power event or just by clicking the "Switch to passive device" button on the HA page), it takes between 1 and 2 minutes for the aux device to take over and begin properly responding on the network. The formerly primary device continues booting and, usually around 4-5 minutes after its boot, reboots again. At this time the aux device usually loses connection for another 2-3 minutes before finally recovering and responding properly. If everything sits for 5 minutes or so, it seems to be stable and good: the formerly primary device is aux and the formerly aux device is primary.
In order to test this further I ran the same process in a lab environment I have. In that environment everything seems to work properly. The failover takes 5-10 seconds or so and the machine only boots once. Both lab and production were set up identically for this test at the VM level. XG VMs created off the same base VHD, running the same software, etc. so it doesn't seem to be any sort of issue with the XG software itself, but rather some environment difference. Unfortunately I can't duplicate the full config of production in my lab, so I can't play with it fully, thus I'm hoping to be able to pull some logs from the XG VMs to understand why the aux device seems to take upwards of 1-2 minutes to take over.
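To put comparable numbers on the lab and production failovers, the outage could be measured from a timestamped reachability log rather than by eye. Below is a minimal sketch, assuming the probes (for example, one ping per second against the firewall's LAN address from a neighboring host) have already been collected as `(timestamp, reachable)` pairs; the function name and the sample data are hypothetical and not taken from any Sophos tooling.

```python
from typing import Iterable, List, Tuple


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for each run of failed probes.

    samples: (timestamp_seconds, reachable) pairs in time order.
    start is the timestamp of the first failed probe in a run; end is the
    timestamp of the first successful probe after it, or the last timestamp
    seen if the log ends while the target is still unreachable.
    """
    windows: List[Tuple[float, float]] = []
    down_since = None
    last_ts = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts                    # outage begins
        elif ok and down_since is not None:
            windows.append((down_since, ts))   # outage ends
            down_since = None
        last_ts = ts
    if down_since is not None:
        windows.append((down_since, last_ts))  # still down at end of log
    return windows


if __name__ == "__main__":
    # Hypothetical probe log: reachable, down for a while, reachable again.
    log = [(0.0, True), (1.0, False), (2.0, False), (3.0, True)]
    for start, end in outage_windows(log):
        print(f"down from t={start}s to t={end}s ({end - start}s)")
```

Running the same probe against both environments during a failover would show whether production has one long outage or the two separate ones described above.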
Things I've noticed in production:
- When the primary reboots, the aux keeps thinking it's the aux for the whole 1-2 minute duration based on system ha show details. Current HA State is Aux and Peer HA state is Primary. Once it takes over, it goes to Current state: Standalone Peer HA state: Failure as you'd generally expect while the other device finishes its reboot cycle.
- When the formerly primary device reboots the first time, it seems to come back online thinking it's still the primary. If I connect to the console during those first few minutes it will say Current HA State: Standalone, Peer HA State: Failure. Therefore both devices are simultaneously saying they're standalone with the peer in fault.
- Once the former primary device does its second reboot, it says it's the aux device, as it should.
- If I shut down the primary device, wait until the aux has fully taken over, and then boot it, it comes up on the very first boot showing that it's the aux device. That seems to imply the reason for that first boot in standalone mode is related to how long the aux is taking to take over.
- HA config on the VMs was done via the quick discovery mode.
One additional thing that makes this more complicated for me to troubleshoot is that the XG VMs in production are running in a managed environment where we don't directly control the network layer. Thus I need to be able to identify what the devices may be seeing that's causing the delay and, if necessary, bring that to the managed team to potentially resolve. I know that in this environment we've previously had a couple of issues with our actual standalone production XG VM not responding on the network after a boot, and it was due to the ARP cache having an old MAC address for the interfaces. I'm not sure if that could also be causing an issue here, nor am I sure how to verify if it is/isn't.
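One idea for checking the stale-ARP theory would be to snapshot the ARP table on a neighboring host before and after a failover and see whether the MAC address recorded for the firewall's IPs changes. The following is a rough sketch only, assuming the snapshots have already been turned into plain IP-to-MAC dictionaries (how they are collected is left out, since that depends on the host OS); the function names and the addresses in the example are hypothetical.

```python
from typing import Dict, Tuple


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC address and use ':' separators so formats compare equal."""
    return mac.strip().lower().replace("-", ":")


def changed_macs(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {ip: (old_mac, new_mac)} for IPs present in both snapshots
    whose MAC address differs between them."""
    changes: Dict[str, Tuple[str, str]] = {}
    for ip, old in before.items():
        new = after.get(ip)
        if new is not None and normalize_mac(new) != normalize_mac(old):
            changes[ip] = (normalize_mac(old), normalize_mac(new))
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after a failover.
    before = {"10.0.0.1": "00-15-5D-00-00-01", "10.0.0.2": "aa:bb:cc:dd:ee:ff"}
    after = {"10.0.0.1": "00:15:5d:00:00:02", "10.0.0.2": "AA:BB:CC:DD:EE:FF"}
    for ip, (old, new) in changed_macs(before, after).items():
        print(f"{ip}: {old} -> {new}")
```

If the MAC behind the firewall's IP does change on failover, upstream devices holding the old entry could keep sending traffic to the wrong MAC until that entry is refreshed, which would be something concrete to take to the managed network team.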
I'm looking for assistance in identifying why the aux device seems to take so long in my production environment to complete the process of taking over for the primary and what logs I can pull to help to identify the difference between the lab and production environments so I can get with the relevant teams to resolve.
Thanks,