This discussion has been locked.

HA Active-Passive VLANs Do Not Fail Over XG310

Hi, I'm at the beginning of a new deployment (not live) and have run into a problem. I hope it's something I've misconfigured but I cannot see how/where.

Short version: at HA failover the native VLAN continues to work, but any other VLANs do not. The configuration for those VLANs still appears correct, yet they remain inactive through all subsequent tests.

  • I have 2x XG310 devices configured in HA Active-Passive.
  • SFOS 16.05.6 MR-6
  • Each XG has a 4 port 10 GbE SFP+ FleXi Port module.
  • Port A1 is the designation for the first port in the FleXi Port module.
  • LAN network is set up on A1, VLANs are added to A1.
  • A1 is the only monitored port for HA.
  • Each XG is linked with SFP+ to the same switch via A1.
  • I have a PC connected to that switch, which I use to ping-test the VLAN gateways on the firewall.

With the setup as described here, everything works OK.

Test 1:

  1. Start pinging the interface address.
  2. Unpatch the SFP+ link from the primary device.
  3. Everything works as expected. Auxiliary takes control, I lose 1 or 2 pings, things are fine.

Point to note: the documentation for this situation states: "Once the primary device becomes functional, it automatically takes over from the auxiliary device." This does not occur for me; maybe that line is meant for the Active-Active manual. Once the aux becomes primary, it stays primary until it goes faulty or offline.

At this point, once the original primary is repatched and back online, I can fail over back the other way, and pings to the native VLAN gateway continue to work.

Test 2: 

  1. Create a new VLAN off port A1 (switches are already configured).
  2. Start pinging that gateway - everything is fine.
  3. Fail the primary device by unpatching A1 again.
  4. Pings all fail.
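For anyone repeating Test 1 and Test 2, here is a minimal Python sketch for recording ping results during a failover and telling a brief blip (the 1 or 2 lost pings seen on the native VLAN) apart from a gateway that never comes back. The names `ping_once`, `outage_runs` and `classify` are hypothetical helpers, not part of any Sophos tooling; `ping_once` assumes Linux iputils `ping` flags, and the threshold of 2 lost pings is simply taken from the Test 1 observation.

```python
import subprocess
from typing import List, Tuple


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo request and report whether a reply arrived.

    Assumption: Linux iputils ping (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def outage_runs(replies: List[bool]) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run of consecutive lost pings."""
    runs = []
    start = None
    for i, ok in enumerate(replies):
        if not ok and start is None:
            start = i  # a new outage begins here
        elif ok and start is not None:
            runs.append((start, i - start))  # outage ended on the previous ping
            start = None
    if start is not None:
        runs.append((start, len(replies) - start))  # still down at end of log
    return runs


def classify(replies: List[bool], max_lost: int = 2) -> str:
    """Summarize a ping log recorded across a failover test."""
    runs = outage_runs(replies)
    if not runs:
        return "no loss"
    if not replies[-1]:
        return "did not recover"
    if max(length for _, length in runs) <= max_lost:
        return "recovered"
    return "slow recovery"
```

In use, one would collect `replies` by calling `ping_once` against each VLAN gateway address once per second while unpatching A1, then pass the list to `classify`. A native-VLAN log like the one in Test 1 should come out as "recovered", while the Test 2 behaviour corresponds to "did not recover".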

I have tested this with multiple VLANs, failing over in both directions from either XG, always with the same result.

I cannot get the VLAN(s) in question responding again without deleting and recreating them, which is of course unacceptable in production.

As soon as I recreate a VLAN everything starts working immediately.

Any ideas? Thanks for your time.



This thread was automatically locked due to age.
  • Ticket has been submitted. 

    I've moved on to other setup work and am leaving HA alone until I hear back.

    Thought I'd do another quick test today, as I've been setting up a lot of VLANs and zones.

    Previously I wrote that I can't get a VLAN back online after a failover. However, I'm getting inconsistent results when disabling HA in this state.

    The last two times the VLANs went down during failover, they came back online and appeared OK once I disabled HA.

     

  • Quick update on this one: after two remote sessions with Sophos support, the problem has been confirmed and escalated again, I believe to development.

    I've confirmed that in MR7 the problem occurs on enabling HA. All VLANs on the FleXi Port module are disabled on HA activation (I have the 4 port SFP+ module).

    I have no idea what happens to a live system in a failover situation when HA was already active and the firmware got upgraded.

    You can see the VLANs are disabled by querying the database and looking at the enabled flag for the interfaces. It looks like it also somehow affects the physical interface: I recall seeing the enabled flag for FleXi Port PortA1 as 0 even though the links were up.

    This bug is preventing my deployment.

    I'll update when I hear back from support. 
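The enabled-flag check described in the update above can be illustrated roughly as follows. This is only a sketch: the thread does not give the real database, table or column names on the XG, so the `interfaces_demo` table, its columns, the helper `disabled_interfaces`, and the sample rows are all made up for illustration, and an in-memory SQLite database stands in for the appliance's own database.

```python
import sqlite3


def disabled_interfaces(conn: sqlite3.Connection, parent: str) -> list:
    """List a physical port and its VLAN sub-interfaces whose enabled flag is 0.

    Hypothetical schema: interfaces_demo(name, parent, enabled).
    """
    cur = conn.execute(
        "SELECT name FROM interfaces_demo "
        "WHERE enabled = 0 AND (name = ? OR parent = ?) "
        "ORDER BY name",
        (parent, parent),
    )
    return [row[0] for row in cur.fetchall()]


# Illustrative stand-in data only; not taken from a real device.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE interfaces_demo ("
    "name TEXT PRIMARY KEY, parent TEXT, enabled INTEGER)"
)
conn.executemany(
    "INSERT INTO interfaces_demo VALUES (?, ?, ?)",
    [
        ("PortA1", None, 0),         # physical port flagged disabled
        ("PortA1.10", "PortA1", 0),  # VLAN 10 on A1
        ("PortA1.20", "PortA1", 0),  # VLAN 20 on A1
        ("Port1", None, 1),          # unaffected onboard port
    ],
)

print(disabled_interfaces(conn, "PortA1"))
```

The point of the sketch is the shape of the check: compare the stored enabled flag against the observed link state, and treat a port that is physically up but flagged 0 as a symptom of the bug.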

  • Final update hopefully - it was the driver.

    I've had contact with support techs every week. Last week Sophos replicated the issue in house.

    The problem was the driver version used by the FleXi Port interfaces (only tested on 10G, as far as I know).

    The tech updated the driver on both primary and aux devices and everything is looking ok.

    I'm not 100% sure, but I believe I was told this driver wouldn't be updated (rolled back) in minor updates; however, the correct driver should be deployed in the next major update (v17).

    Thanks for your input.
