Problems with Meshed Cluster Setup

Hello,

I am trying to set up a meshed cluster as described on page 13 of the Astaro Cluster Deployment Guide. As long as all cables are patched, everything works fine.

Now I want to simulate what happens if a cable or device fails. As long as I remove only one cable from one Astaro to one internal switch, everything works as described. But if one internal switch fails, the whole link aggregation group goes down on all Astaros. All Astaros switch to state 'unlinked' and no traffic is processed.

Can anybody please advise me on what I did wrong?

Thank you in advance,
Stephan

This thread was automatically locked due to age.

0 BAlfson over 15 years ago

Hi, Stephan. That description confuses me. I would have thought you would want to use Link Aggregation for the HA link on every unit in the cluster - is that what you have?

Cheers - Bob

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA

0 TPok over 15 years ago
Hi Bob,

Sorry for the confusion. So here is what I did:
I have two internal LAN switches. Both ASGs are connected to both switches in the following way:

eth0 of both ASGs is connected to LAN switch 1
eth1 of both ASGs is connected to LAN switch 2
eth4 of both ASGs is connected to the internet connection
eth7 is the direct HA connection between the two ASGs

eth0 and eth1 are aggregated to a link aggregation group (lag0).

As long as all connections are up everything is working fine.
But the following happens:

If I disconnect one cable (eth0 or eth1) on the Slave, it changes its state to "unlinked". --> Working
If I disconnect one cable (eth0 or eth1) on the Master, it changes its state to "unlinked" and the Slave becomes Master. --> Working
If I turn one of the LAN switches off (the same as disconnecting one cable (eth0 or eth1) on both ASGs at the same time), both ASGs change their state to "unlinked" and the connection from the clients to the ASGs is lost (they are no longer pingable). I think this is not the intended behavior.
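
The expectation behind that last point can be written down as a small model. This is only a sketch, assuming that a node stays reachable from the LAN as long as at least one member of lag0 is still cabled to a working switch. It does not describe how the Astaro HA daemon actually evaluates link state, and the names `CABLING` and `node_reachable` are made up for illustration.

```python
# Cabling as described in the post: on both ASGs,
# eth0 goes to LAN switch 1 and eth1 goes to LAN switch 2.
CABLING = {"eth0": "switch1", "eth1": "switch2"}
LAG_MEMBERS = ("eth0", "eth1")  # members of lag0


def node_reachable(failed_switches=(), unplugged=()):
    """Assumed rule: the node is reachable from the LAN while at least
    one lag0 member is plugged in and its switch is still running."""
    return any(
        nic not in unplugged and CABLING[nic] not in failed_switches
        for nic in LAG_MEMBERS
    )


if __name__ == "__main__":
    # One switch powered off: the other LAG member should still carry traffic.
    print("switch1 off:", node_reachable(failed_switches={"switch1"}))
    # Both switches off: no path is left.
    print("both off:", node_reachable(failed_switches={"switch1", "switch2"}))
```

Under this assumption, powering off a single switch should leave both nodes reachable through the remaining link, which is why the observed "unlinked" state on both nodes looks wrong.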

I hope you can understand my problem now.

0 BAlfson over 15 years ago

OK, change the HA backup interface to the External (eth4) instead of one of the LANs.

You didn't say, but I assume that eth0 and eth1 are in a Link Aggregation Group - Correct?

Cheers - Bob

0 TPok over 15 years ago

There is no HA backup interface configured.
The HA sync NIC is eth7 and this is a direct cable connection between the two ASGs.

As stated above eth0 and eth1 are in a Link Aggregation Group.

0 UrsWeiss over 15 years ago in reply to TPok

Hmmmm... Funny, I have a similar configuration, but in my case it only works if I detach one of the aggregated interfaces. If both are connected it works for some time, then fails until I unplug one of the cables. Unfortunately, I have not found a working solution in the last six months.

I think that the Astaros were not made to use a cluster together with Link Aggregation.

I still hope that it will work after upgrading to V8. But that will not be before the end of the year.

Urs

0 BAlfson over 15 years ago

Ahh, yes, lag0, I see that now. I don't understand why you're having this problem, but please try setting the backup interface to lag0 to see if that changes anything. If not, then try eth4.

Whity, have you asked Astaro Support to look at your box and analyze the logs?

Cheers - Bob

0 TPok over 15 years ago

I played a little bit with the settings and here is what I got:

Setting the HA backup interface to the LAG (lag0) doesn't work. I get the error "The HA backup interface requires an object reference.".

Setting the HA backup interface to eth4 (external interface) works but doesn't solve my problem.

But there is something much more interesting:
If I disconnect eth0 on both devices, they stay in the "ACTIVE" state and the connection stays up. The HA log doesn't even notice that the link on eth0 is down.

If I disconnect eth1 on both devices they switch to "UNLINKED" state and the connection goes down. No pings or anything else possible.
The HA log shows the following messages:

2010:07:09-10:27:11 fw-1 ha_daemon[7468]: id="38A3" severity="debug" sys="System" sub="ha" name="Netlink: Lost link beat on eth1!"
2010:07:09-10:27:11 fw-2 ha_daemon[8584]: id="38A3" severity="debug" sys="System" sub="ha" name="Netlink: Lost link beat on eth1!"
2010:07:09-10:27:15 fw-1 ha_daemon[7468]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost link on interface lag0"
2010:07:09-10:27:17 fw-2 ha_daemon[8584]: id="38A0" severity="info" sys="System" sub="ha" name="Node 1 changed state: ACTIVE -> UNLINKED"
2010:07:09-10:27:17 fw-2 ha_daemon[8584]: id="38A1" severity="warn" sys="System" sub="ha" name="Lost link on interface lag0"
2010:07:09-10:27:18 fw-1 ha_daemon[7468]: id="38A0" severity="info" sys="System" sub="ha" name="Node 2 changed state: ACTIVE -> UNLINKED"
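
For anyone comparing HA logs like these across nodes, a minimal Python sketch that pulls the timestamp, node name and key/value fields out of each line may help. The regular expressions are an assumption inferred only from the six lines above, not a documented Astaro log format, and the function name `parse_ha_line` is made up for illustration.

```python
import re

# Line layout inferred from the sample lines in this post (an assumption):
#   <timestamp> <node> ha_daemon[<pid>]: key="value" key="value" ...
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) '
    r'(?P<node>\S+) ha_daemon\[(?P<pid>\d+)\]: '
    r'(?P<fields>.*)$'
)
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_ha_line(line):
    """Return a dict for one ha_daemon log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = {
        "ts": match.group("ts"),
        "node": match.group("node"),
        "pid": int(match.group("pid")),
    }
    # Adds id, severity, sys, sub and name as found in the line.
    record.update(dict(FIELD_RE.findall(match.group("fields"))))
    return record


if __name__ == "__main__":
    sample = (
        '2010:07:09-10:27:15 fw-1 ha_daemon[7468]: id="38A1" severity="warn" '
        'sys="System" sub="ha" name="Lost link on interface lag0"'
    )
    rec = parse_ha_line(sample)
    print(rec["ts"], rec["node"], rec["severity"], rec["name"])
```

Filtering the parsed records on `severity` or on the text in `name` makes it easier to see that both nodes report the lost link on lag0 within a few seconds of each other.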

My first thought was that there is a problem with my internal LAN switches so I changed the connections of eth0 and eth1 on both devices. The behavior stays the same.

So I assume that there is a problem with link monitoring on my ASGs. I have no idea what to do next. Any hint is greatly appreciated.

Thanks,
Stephan

0 BAlfson over 15 years ago

I think your reseller needs to get Astaro support on this one. I think you've demonstrated that there's a problem that you shouldn't see. Also, although I haven't tried it myself, I don't think you should have gotten that error when trying to use lag0 for the backup interface.

The only other thing I can guess here is that you might put a LAG on two other Astaro interfaces and see if the problem occurs there, too.

Cheers - Bob

0 TPok over 15 years ago

Finally I solved the problem.

I removed the slave device from the cluster, rebooted the master, added the slave again and reconfigured everything from scratch. Then I rebooted the whole system again.

Now everything works fine. The log shows the right messages and a maximum of 3 or 4 pings get lost until the connection is back up again.

It's like a dream come true. [:D]

Thank you very much for your assistance.