Issues with HA Failover and 9.003-16

I've recently seen some strange behaviour on a couple of HA failover setups. I haven't changed the network configuration, which has been running for a few years; the only change was the update from 8.305 to 9.003-16. The production example also involved a hardware upgrade from 120 to 220 in the main office. The second example is my home network, with nearly the same setup. The home network has only been running in HA failover for the last 5 days, in the hope of replicating or ruling out the problem, and it has occurred twice since. Here's the overview.

PROD HA config is done via isolated VLANs (OUTSIDE doesn't see INSIDE)
  - VLAN-OUTSIDE: port1 to node1, port2 to node2, port3 to ISP-WAN
  - VLAN-INSIDE: port4 to node1, port5 to node2, port6 connected to the downstream L3 routing switch via untagged VLAN INSIDE (the switch is a Procurve 2824 on I.10.73, one version behind; the latest version is I.10.77 but it has no documented fix for this, plus it's worked for years with this config)
  - ISSUE: One of the nodes falls off the network, followed by the second node, or the nodes switch roles. On investigation, the switch shows no MAC address associated with the ports connected to the nodes. The switch has been swapped for a spare and the problem continues. It appears the nodes no longer present their MAC address for eth0 or eth1. (Next time this happens I will plug the node into an alternate switch to see if that switch learns the MAC, and console into the node to see if it still lists a MAC via ifconfig.) Clearing ARP on the switch or rebooting the switch doesn't help; the only fix is to reboot the UTM.

HOME HA config is done via isolated VLANs (OUTSIDE doesn't see INSIDE)
  - VLAN-OUTSIDE: port1 to node1, port2 to node2, port3 to ISP-WAN
  - VLAN-INSIDE: port4 to node1, port5 to node2 (the home network uses tagged VLANs 98 and 99 as the VLAN INSIDE, no routing; the switch is a Procurve 2810-24G on N.11.52, the current version)
  - ISSUE: Same as in production: the nodes fall off the network or switch roles, and the switch shows no MAC address associated with the ports connected to the nodes. Home doesn't have the luxury of a spare switch to try. It appears the nodes no longer present their MAC address for eth0 or eth1. (Next time this happens I will console into the node to see if it still lists a MAC via ifconfig.) Clearing ARP on the switch or rebooting the switch doesn't help; the only fix is to reboot the UTM.
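
The check planned above for the next failure (does the node still report a MAC address and link on eth0/eth1?) can be scripted so the state is captured with a timestamp. What follows is only a minimal sketch, not something taken from the UTM itself. It assumes a Linux shell on the node where `/sys/class/net/<iface>/address`, `operstate` and `carrier` are readable, and that a Python 3 interpreter is available there (neither has been verified on the UTM). The function names `read_iface_state` and `format_state` are made up for illustration.

```python
import os
import time


def read_iface_state(iface, sysfs_root="/sys/class/net"):
    """Return MAC address, operstate and carrier flag for one interface.

    Values are read from the standard Linux sysfs files. A value of None
    means the file was missing or could not be read.
    """
    base = os.path.join(sysfs_root, iface)
    state = {"iface": iface}
    for key in ("address", "operstate", "carrier"):
        try:
            with open(os.path.join(base, key)) as fh:
                state[key] = fh.read().strip()
        except OSError:
            # Missing interface/file, or the kernel refused the read
            # (carrier can be unreadable while the interface is down).
            state[key] = None
    return state


def format_state(state, now=None):
    """Format one state dict as a single timestamped log line."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "{} {iface} mac={address} operstate={operstate} carrier={carrier}".format(
        stamp, **state
    )


if __name__ == "__main__":
    # Interface names follow the post (eth0/eth1); adjust as needed.
    for name in ("eth0", "eth1"):
        print(format_state(read_iface_state(name)))
```

Run from cron or a loop, the output would give a timeline to compare against the point at which the switch loses the MAC entries, and would show whether the node itself still holds an address when that happens.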

I've reviewed the system and high-availability logs and don't see any errors that would help with troubleshooting.

(I haven't escalated to Sophos Support yet; I want additional info from the next failures first. I'm also hesitant to make things fail again, as it's unpredictable when it will fail, other than to say it will be at the most inconvenient time.)

A further point of clarification: I have 3 other networks in production and none of them are experiencing issues.
  - ASG 120 HA Failover (8.305) with Procurve 2824
  - UTM 50 IP HA Failover (9.003-16) with PowerConnect 6248 in stacking mode
  - UTM 50 IP HA Failover (9.003-16) with PowerConnect 6248 in stacking mode

Is anyone else having issues with HA failover since moving to 9.003-16?
Is there something wrong with my setup at the network connection level?
Any suggestions?

BAlfson over 13 years ago

Darcy, Sophos Support should be able to analyze this from your logs. I'm surprised that you could do an in-place upgrade to V9 for your 120HA. In fact, I wasn't aware that the in-place upgrade for a standalone 120 had been released.

Good luck!

Cheers - Bob

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA

darcym over 13 years ago in reply to BAlfson

Hi Bob,

Yes, it's getting close to the point of engaging Sophos Support. Being a veteran of dealing with support from many companies, I'm attempting to eliminate the switch from the equation first. I've also been lucky enough to replicate the issue on my home network, which only ticks off the family rather than the entire corporate network ;-)

The latest update is that eth1 on the master is losing link (as shown by ethtool eth1), and the master then drops to slave. Unplugging and replugging the cable doesn't bring the link back, nor does toggling the switch port. I updated to 9.004 today and see the same behaviour. I'm waiting for it to occur again, and will then get a command line on the slave and try manually setting eth1 to other speed/duplex combinations. If that doesn't provide any additional insight, the plan is to use a dumb switch rather than a VLAN to consolidate eth1, and if that fails, to drop back to 9.002.
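
For the speed/duplex experiment, here is a rough sketch of how the `ethtool eth1` output could be summarised and how the list of forced modes to try could be generated. It is an illustration only: `parse_ethtool` and `forced_mode_commands` are made-up helper names, the `SAMPLE` text is an invented example of what ethtool typically prints for a port without link (not output captured from these nodes), and Python 3 on the machine running it is an assumption. The commands are only printed, not executed. Forced modes are limited to 10/100 because gigabit over copper normally requires autonegotiation.

```python
import re


def parse_ethtool(output):
    """Extract speed, duplex, autoneg and link state from `ethtool <iface>` text."""
    wanted = {
        "Speed": "speed",
        "Duplex": "duplex",
        "Auto-negotiation": "autoneg",
        "Link detected": "link",
    }
    result = {}
    for line in output.splitlines():
        match = re.match(r"\s*([^:]+):\s*(.*)$", line)
        if match and match.group(1) in wanted:
            result[wanted[match.group(1)]] = match.group(2).strip()
    return result


def forced_mode_commands(iface, speeds=(10, 100), duplexes=("half", "full")):
    """Build `ethtool -s` argument lists for each forced speed/duplex combo.

    The final entry switches autonegotiation back on.
    """
    commands = []
    for speed in speeds:
        for duplex in duplexes:
            commands.append(
                ["ethtool", "-s", iface, "speed", str(speed),
                 "duplex", duplex, "autoneg", "off"]
            )
    commands.append(["ethtool", "-s", iface, "autoneg", "on"])
    return commands


# Invented sample of ethtool output for a port with no link (illustrative only).
SAMPLE = """Settings for eth1:
\tSpeed: Unknown!
\tDuplex: Unknown! (255)
\tAuto-negotiation: on
\tLink detected: no
"""

if __name__ == "__main__":
    print(parse_ethtool(SAMPLE))
    for cmd in forced_mode_commands("eth1"):
        print(" ".join(cmd))
```

Trying each printed command in turn and re-checking `Link detected` after each one would show whether any forced mode brings the link back, which would point towards a negotiation problem rather than a dead port.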

BAlfson wrote: "I'm surprised that you could do an in-place upgrade to V9 for your 120HA."

We upgraded from an HA 120 pair (8.305) to an HA 220 pair (9.003-12): we just loaded the restore file onto the new 220s and reused the previous connections. The two changes were the hardware migration and a later software version. Argghh!