ASG 8.2 HA under vSphere 5 -> WAN issues

Okay, so this is probably a bit special...

I am running two virtual ASG 8.202 instances in HA (Hot Standby) on two ESXi 5.0 hosts.
There are two WAN Uplinks - both "Cable Modem" DHCP.
Due to previous issues on the LAN I have disabled HA MACs.

The two WAN DHCP leases bind to MAC addresses.
So I have spoofed the WAN MACs in VMware: the WAN1 MAC is identical on both ASGs, and the WAN2 MAC is identical on both ASGs.

The setup worked fine on a combination of ESXi 4.1 and ASG 8.1.
But I have not tested failover since upgrading ASG and ESXi - until now.

When I power down the master, the slave does not bring up the two WAN interfaces.

In the Dashboard WAN1 and WAN2 display as "State=Error" and "Link=Down".
But if I go to "Advanced" -> "Support" -> "Interfaces Table", both interfaces are listed as "up" and seem to have assigned IP-addresses.

If I power on the master, it will become active again and everything will run fine.

If I instead reboot the slave, the interfaces will come up, and everything runs fine.

I am not sure whether this is an ESXi 5 issue or an ASG 8.2 issue.
Maybe something to do with the spoofed MACs?

Any ideas about where to start looking?

Best regards
Martin

0 da_merlin over 13 years ago
Probably this one; it will be fixed in 8.300. In the meantime, you have to replace dhcp_updown.plx:

If you are affected by this issue, you can run the following commands as root on your ASG:
wget http://people.astaro.com/uweber/mantis_19139/dhcp_updown.plx
mv /var/chroot-dhcpc/usr/sbin/dhcp_updown.plx /var/chroot-dhcpc/usr/sbin/dhcp_updown.plx.org
mv dhcp_updown.plx /var/chroot-dhcpc/usr/sbin/dhcp_updown.plx
chmod a+x /var/chroot-dhcpc/usr/sbin/dhcp_updown.plx
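
A slightly more defensive variant of the same steps, as a minimal sketch: it refuses to swap the file if the download is missing or empty, and it will not overwrite an existing .org backup. The helper name replace_with_backup is hypothetical and not part of the ASG; the paths in the usage comment are the ones from the commands above.

```shell
#!/bin/sh
# Sketch only: replace_with_backup is a hypothetical helper, not an ASG tool.
# It keeps the original file as <target>.org, like the commands above.
replace_with_backup() {
    new="$1"      # freshly downloaded file
    target="$2"   # file that should be replaced

    # Refuse to continue if the download is missing or empty.
    [ -s "$new" ] || { echo "download missing or empty: $new" >&2; return 1; }

    # Do not clobber a backup from an earlier run.
    [ -e "$target.org" ] && { echo "backup already exists: $target.org" >&2; return 1; }

    mv "$target" "$target.org" || return 1
    mv "$new" "$target" || return 1
    chmod a+x "$target"
}

# Usage (as root, after the wget above):
#   replace_with_backup dhcp_updown.plx /var/chroot-dhcpc/usr/sbin/dhcp_updown.plx
```

To roll back, move the .org file back over the replaced script.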

Cheers
Ulrich

0 martinh_dk over 13 years ago in reply to da_merlin

Hi Ulrich,

Thank you for the answer.
The fix changed HA from "not working at all" to "flaky" :)

Both interfaces still come up as "Error" and "Down" in the dashboard.
But WAN2 was actually up and running.

A manual DHCP Renew on both interfaces brought them up and corrected the status in the dashboard.

I will run a few tests and report if the behavior is consistent.

Best regards
Martin

0 martinh_dk over 13 years ago in reply to martinh_dk

Strange indeed:

Both interfaces seem to bind to DHCP assigned IPs.
But WAN1 is only up for about 20 seconds - and then it stops responding.

If I renew the lease on WAN1 manually, it will come up shortly after.
But simultaneously WAN2 will stop responding - until I renew the lease on WAN2 as well.

Here is a log of the initial DHCP bind:
2011:11:14-12:31:25 ASTARO-2 dhclient: Listening on LPF/eth1/00:50:56:11:22:33
2011:11:14-12:31:25 ASTARO-2 dhclient: Sending on   LPF/eth1/00:50:56:11:22:33
2011:11:14-12:31:25 ASTARO-2 dhclient: Sending on   Socket/fallback
2011:11:14-12:31:25 ASTARO-2 dhclient: DHCPREQUEST on eth1 to 255.255.255.255 port 67
2011:11:14-12:31:25 ASTARO-2 dhclient: DHCPACK from 87.55.253.***
2011:11:14-12:31:28 ASTARO-2 dhclient: Listening on LPF/eth2/00:50:56:33:44:55
2011:11:14-12:31:28 ASTARO-2 dhclient: Sending on   LPF/eth2/00:50:56:33:44:55
2011:11:14-12:31:28 ASTARO-2 dhclient: Sending on   Socket/fallback
2011:11:14-12:31:28 ASTARO-2 dhclient: DHCPREQUEST on eth2 to 255.255.255.255 port 67
2011:11:14-12:31:28 ASTARO-2 dhclient: DHCPACK from 172.27.0.***
2011:11:14-12:31:30 ASTARO-2 dhclient: bound to 176.21.36.*** -- renewal in 2940 seconds.
2011:11:14-12:31:31 ASTARO-2 dhclient: bound to 87.104.147.*** -- renewal in 30214 second

Here is a log of the subsequent manual renew on WAN1:
2011:11:14-12:42:29 ASTARO-2 dhclient: Listening on LPF/eth1/00:50:56:11:22:33
2011:11:14-12:42:29 ASTARO-2 dhclient: Sending on   LPF/eth1/00:50:56:11:22:33
2011:11:14-12:42:29 ASTARO-2 dhclient: Sending on   Socket/fallback
2011:11:14-12:42:29 ASTARO-2 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 6
2011:11:14-12:42:29 ASTARO-2 dhclient: DHCPREQUEST on eth1 to 255.255.255.255 port 67
2011:11:14-12:42:29 ASTARO-2 dhclient: DHCPOFFER from 87.55.253.***
2011:11:14-12:42:29 ASTARO-2 dhclient: DHCPACK from 87.55.253.***
2011:11:14-12:42:30 ASTARO-2 dhclient: bound to 176.21.36.*** -- renewal in 2756 seconds.
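
The two excerpts differ in one visible way: the bind after failover goes straight to DHCPREQUEST, while the manual renew starts with DHCPDISCOVER. A small filter makes that easier to see in longer logs. This is a minimal sketch that assumes the line layout shown above (timestamp, hostname, "dhclient:", message); dhcp_events is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: print just the DHCP protocol messages and "bound" lines
# from dhclient log output, dropping the hostname and the socket chatter.
# Assumes fields: $1 = timestamp, $2 = hostname, $3 = "dhclient:", $4.. = message.
dhcp_events() {
    awk '$3 == "dhclient:" && ($4 ~ /^DHCP/ || $4 == "bound") {
        line = $1
        for (i = 4; i <= NF; i++) line = line " " $i
        print line
    }' "$@"
}

# Usage: pass a log file as argument, or pipe log lines in on stdin:
#   dhcp_events < saved-dhcp-log.txt    (file name is a placeholder)
```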

Any ideas?

BTW: The issue only occurs with "forced" failover.
When the master node is up and running again, it gracefully takes over.

/Martin

0 da_merlin over 13 years ago

Do you have preempt mode set?

By default, VMware refuses to let two systems use the same MAC address.
Have you disabled that check?

Cheers
Ulrich

0 martinh_dk over 13 years ago in reply to da_merlin

Hello Ulrich,

Preempt? Is that not related to Active/Active cluster?
I cannot seem to find the setting under my HA settings. (Hot Standby).

I have set ignoreMACAddressConflict = "true" for all interfaces in the VMware .vmx configuration files of the two ASGs.

I am unsure if there are other ways to allow duplicate MACs in ESXi?
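
One way to double-check that the flag really is present for every adapter is to scan the .vmx file. The following is a minimal sketch; it assumes the usual ethernetN.key = "value" layout of .vmx files, and vmx_missing_ignore_flag is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: list every ethernetN adapter in a .vmx file that does NOT
# carry ignoreMACAddressConflict = "true". Prints nothing if all have it.
vmx_missing_ignore_flag() {
    awk -F'[. =]+' '
        /^ethernet[0-9]+\./ {
            nic[$1] = 1    # remember every adapter that appears in the file
            if (tolower($2) == "ignoremacaddressconflict" && tolower($0) ~ /"true"/)
                ok[$1] = 1
        }
        END { for (n in nic) if (!(n in ok)) print n }
    ' "$1" | sort
}

# Usage (path is a placeholder):
#   vmx_missing_ignore_flag /path/to/asg.vmx
```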

Also, I tend to think that it may not be an ESXi but an ASG issue?

I tried to leave the ASG until the lease for WAN1 expired - and it actually renewed the lease without problems.
But both interfaces are still displayed as "Down" and "Error" in the Dashboard, and no traffic besides DHCP is passed through the WAN1 interface.
So some part of the ASG must not recognize that the interface is up and running?

A manual renew on WAN1 from the ASG still fixes the problem for WAN1.
But then WAN2 stops passing traffic, until I have manually renewed DHCP lease on that interface as well.
After 2x manual renew both interfaces are working - and they display as "Up".

What do you think?

Best regards
Martin

0 da_merlin over 13 years ago

You wrote about a graceful takeover, so I thought you were talking about HA preempt mode (HA Advanced tab).

Is your Slave node always in the UNLINKED state?
If Link is DOWN, then the Ethernet port seems not to be connected.
This should be fixed first...

How did you disable HA virtual MAC? Did you set ha->advanced->virtual_mac to 0
via confd-client?

Cheers
Ulrich

0 martinh_dk over 13 years ago in reply to da_merlin

Sorry if I used the wrong terms :)
I am running Hot Standby.
What I meant was: When the Preferred Master has been down and is turned back on, it takes over the Master role without problems, once sync'ed.

The problems only occur on forced failover - e.g. when Master is shut down and Slave must take over.

Regarding interface state: If I run ifconfig on the slave node's console, all interfaces are listed as "UP" and "RUNNING".
That is also the case after a forced failover - even though the two WAN interfaces are listed as "Down" in the Dashboard.

HA virtual MACs were disabled with the command "cc set ha advanced virtual_mac 0" from console on both ASGs.

Best regards
Martin

0 da_merlin over 13 years ago

Ok, but the old Master should not take over automatically.
That only happens if preempt mode is on (which does not seem to be the case)
or if the Slave node is UNLINKED.

Just have a look under Management >> High Availability >> Status.
What is the Slave's status?

On the console, what is the output of "ethtool eth1"?

0 martinh_dk over 13 years ago in reply to da_merlin

Status on both nodes is always "ACTIVE". (Except after a failover, where the failed node is "SYNCING".)

When nodes are in sync, the Master always takes over. Is this not intended?
I have not (to the best of my knowledge) configured anything besides "Automatic Configuration" and "Preferred Master".

If I run "ethtool eth1" on slave's console, it reports:
Link detected: yes
Speed: 1000Mb/s
Duplex: Full

And a lot more..
Do you need more detailed output?
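
To compare the link state of both WAN ports right after a forced failover without reading the full output, the relevant line can be pulled out of ethtool. This is a minimal sketch; link_detected is a hypothetical helper that only parses the "Link detected:" line shown above, and eth1/eth2 are the interface names from the DHCP logs earlier in the thread.

```shell
#!/bin/sh
# Sketch only: read `ethtool <iface>` output on stdin and print the value
# of its "Link detected:" line, or "unknown" if no such line is present.
link_detected() {
    awk -F': *' '
        /Link detected/ { print $2; found = 1 }
        END { if (!found) print "unknown" }
    '
}

# Usage on the console:
#   for i in eth1 eth2; do
#       printf '%s: %s\n' "$i" "$(ethtool "$i" | link_detected)"
#   done
```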

/Martin

0 da_merlin over 13 years ago

But then I don't get why there is an automatic failover after the old Master returns.

Can you mail me your high-availability.log from an affected day to uweber@astaro.com?

Thanks