
No failover on HA cluster if NIC is in error state

Hello,

We are using two SG 310 appliances as an active/passive cluster.

Unfortunately, we have found that HA failover is not working.

Approximately once a month we lose our Internet connection.

The external WAN interface shows "ERROR" under "Link" and "UP" under "State".

Unfortunately, the cluster doesn't switch to the other node, and the whole office is cut off from the Internet.

If we turn the interface off and on again, everything works fine.

Rebooting the affected node works too.

Has anyone experienced similar behaviour?

We have to sort this problem out, and support has not been very helpful in this case.

Tibor



  • Unfortunately, the error is not reproducible.

    A manual failover solves the problem but does not address its cause.

    In the meantime, I read a thread that mentioned a problem at the ARP level between the UTM and a router.

    I've arranged to have the provider's router set to a fixed 100 Mbit/s.

    Then I will set the UTM to 100 Mbit/s as well.

    Let's see...

  • When the cluster fails over, the auxiliary node sends a gratuitous ARP as it takes over. Perhaps that is what makes it work again, but it's better to find out why it's not working when it happens.

    I would recommend connecting to the device while it's in the broken state and running captures with tcpdump on the external interface to see whether it can ARP for the default gateway, reach it, and reach past it.
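
To make the ARP part of that check quicker during an outage, the text output of a capture can be summarized offline. The following is only a rough sketch: it assumes the capture was taken with something like `tcpdump -n -i eth1 arp` (where `eth1` is a placeholder for the external interface) and saved as text, that the lines follow tcpdump's usual one-line ARP format, and that `192.0.2.1` stands in for the real default gateway. The function name `summarize_arp` is made up for illustration.

```python
import re

# Patterns for tcpdump's usual one-line ARP output (assumed format), e.g.
#   12:00:01.000000 ARP, Request who-has 192.0.2.1 tell 192.0.2.10, length 28
#   12:00:01.000400 ARP, Reply 192.0.2.1 is-at 00:11:22:33:44:55, length 46
REQUEST = re.compile(r"ARP, Request who-has (\S+).* tell ([^,\s]+)")
REPLY = re.compile(r"ARP, Reply (\S+) is-at ([0-9A-Fa-f:]+)")


def summarize_arp(lines, gateway_ip):
    """Count ARP requests for and replies from gateway_ip in tcpdump text lines."""
    requests = 0
    replies = 0
    macs = set()
    for line in lines:
        match = REQUEST.search(line)
        if match and match.group(1) == gateway_ip:
            requests += 1
            continue
        match = REPLY.search(line)
        if match and match.group(1) == gateway_ip:
            replies += 1
            macs.add(match.group(2).lower())
    return {"requests": requests, "replies": replies, "gateway_macs": sorted(macs)}


if __name__ == "__main__":
    # Hypothetical sample lines; replace with the saved capture output.
    sample = [
        "12:00:01.000000 ARP, Request who-has 192.0.2.1 tell 192.0.2.10, length 28",
        "12:00:02.000000 ARP, Request who-has 192.0.2.1 tell 192.0.2.10, length 28",
    ]
    print(summarize_arp(sample, "192.0.2.1"))
```

If requests for the gateway show up but no replies do, the problem is probably at or below the ARP level (link, duplex, or the router itself). If replies do arrive, the next step would be testing reachability of the gateway and of hosts beyond it.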

  • Unfortunately, we do not have a second to investigate why it's down when it's down.

    That's why we have a cluster ;-)

    We have to be online 24/7.

    Maybe at midnight we'll have a chance, but I cannot reproduce it.

  • Like Dirk, I also suspect the speed/duplex settings.  I think the discussion with MasterRoshi confirms this.  I've had several clients with a similar problem.  See the solution in #7.7 in Rulz (last updated 2019-04-17).

    Cheers - Bob
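
Since several replies point at speed/duplex negotiation, it may help to record what the kernel reports for the interface at the moment the link shows "ERROR". Below is a minimal sketch, assuming a Linux host with Python 3 and the standard sysfs layout under `/sys/class/net`. The interface name `eth1` is a placeholder, the helper `read_link_settings` is hypothetical, and Python may not be available on the UTM itself.

```python
from pathlib import Path


def read_link_settings(iface, sysfs_root="/sys/class/net"):
    """Return operstate, carrier, speed and duplex as reported by sysfs.

    Values are the raw strings from the kernel (speed is in Mbit/s);
    an attribute that cannot be read (for example speed while the link
    is down) is returned as None.
    """
    base = Path(sysfs_root) / iface

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            return None

    return {
        "operstate": read("operstate"),
        "carrier": read("carrier"),
        "speed": read("speed"),
        "duplex": read("duplex"),
    }


if __name__ == "__main__":
    # "eth1" is an assumed name for the external WAN interface.
    print(read_link_settings("eth1"))
```

Comparing the output from a healthy period with the output during an outage could show whether the negotiated speed or duplex has changed, which would support fixing both ends to the same setting.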