HA installation steps

To install ASG v7.4 in a two-server HA setup (hot standby), should I just install both servers normally, as if they were two standalone servers, put the same license (with HA) on both, and then enable HA? Will I need three sets of IP addresses: one for server A, one for server B and one for the virtual IP? (I believe that's how Check Point works: you can manage/access each server via its own IP, and the cluster exposes the virtual IP.)

I've read some discussion about issues with v7.4 and its re-release. Is the ISO file at http://download.astaro.com/Astaro_Security_Gateway/v7/software_appliance/iso/ dated Feb 26 good to use for a new installation?


BAlfson over 16 years ago

You only configure the Master; it configures the Slave. There's only one set of IPs that you need to use or be aware of; the Astaro manages the additional IPs it uses to communicate between the two devices. If you are comfortable with SSH, you can indeed use these IPs, but usually only Astaro Support would need to look at a slave in Hot-Standby mode.

Connect the internal interfaces to your network, the external interfaces to a switch leading to the public Internet, and the boxes to each other through an extra switch or a crossover cable.

Load the software on one server. Start up Astaro, apply the license that allows HA, and configure High Availability. Load the software on the other, identical server; as soon as it has started, the first server will make it its slave and give it all of the configuration and copies of its data.

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA

KKnecht over 16 years ago in reply to BAlfson

You can also check the Astaro KnowledgeBase. You can find the Cluster-Guide and more under
\ASG Version 7 \Management \High Availability
Good luck! ;)

liug over 16 years ago in reply to KKnecht

OK, I have both machines up with v7.401. eth0 and eth1 are Broadcom, and eth2-5 are Intel.
I use a crossover cable on eth0 for heartbeat, eth1 for external (default gateway) and eth2 for internal.
The master/slave HA setup seems to be working.
When I log in to the console on the two machines, "ifconfig -a" shows that both use the same set of IP addresses and both are in the UP state. Is that normal?
Also, when I view the HA log from the WebAdmin page, I see

ha_daemon[3268]: id="38A0" severity="info" sys="System" sub="ha" name="Monitoring interfaceses for link beat: eth1 eth2 "

I thought I had only configured eth0 for link beat. Does that message mean something is misconfigured? Why would it monitor eth1 and eth2 instead of eth0?
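To compare this message between the two nodes, or to see whether it changes after an HA setting is edited, it can help to pull the interface list out of the log line. A minimal sketch, assuming every ha_daemon line uses the key="value" layout shown above; the function names are hypothetical and not part of Astaro:

```python
import re

# Matches key="value" pairs as they appear in the ha_daemon line quoted above.
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')

# Prefix copied verbatim from the log, including its "interfaceses" spelling.
LINK_BEAT_PREFIX = "Monitoring interfaceses for link beat:"


def parse_ha_line(line):
    """Return a dict of the key="value" fields found in one log line."""
    return dict(FIELD_RE.findall(line))


def monitored_interfaces(line):
    """Return the interface names from a link-beat message, or None if the
    line is some other message."""
    name = parse_ha_line(line).get("name", "")
    if not name.startswith(LINK_BEAT_PREFIX):
        return None
    return name[len(LINK_BEAT_PREFIX):].split()
```

Running `monitored_interfaces` over the HA log of each node and diffing the results would show at a glance whether both nodes watch the same interfaces.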

BAlfson over 16 years ago in reply to liug

On the 'Configuration' tab of 'Management >> High Availability', the 'Backup Interface' is how your two Astaros communicate if the HA connection on the 'Sync NIC' fails. If everything looks right on that page, I wouldn't be concerned. If it turns out that you hadn't selected a backup interface, then I'd be interested to know if the message above changes after you select an interface.

Cheers - Bob

liug over 16 years ago in reply to BAlfson

On that page, under "Advanced", I have "Enable auto config of new devices" checked, and "Backup interface" is set to "no backup interface".

I have a more severe problem right now, and can't test HA further:

Initially, node 1 is master and node 2 is slave. After about 3 hours, node 1 goes dead (no console, no heartbeat), and node 2 takes over as master. All I can do is power-cycle node 1; it boots fine and goes into slave mode. Node 1 then runs fine for about 3 hours and dies again. This happened 3 times yesterday. I thought it might be bad hardware, so I logged in to the shell and checked the files in /var/log, but couldn't find any mention of a hardware problem. I suspected the cluster was somehow killing node 1, so I finally shut down node 2, and node 1 became master. That was yesterday around 3:30pm PDT, and node 1 is still running now, almost 24 hours later!

Since HA is now running in single-node mode (node 1) and node 2 is still off, I can't really test your other suggestion about selecting a backup interface.

BAlfson over 16 years ago in reply to liug

Have you confirmed on the HCL that your Broadcom adapter is approved for heartbeat?

liug over 16 years ago in reply to BAlfson

No, they are not on the HCL, but based on the suggestions I got here, I'm using them for testing anyway. Does the fact that it works fine for 3 hours mean the NIC is working? Is there a log or anything else I can check to confirm whether there are issues with the NIC?
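One place to look, independent of Astaro's own logs, is the kernel's per-interface error counters. ASG is Linux-based, so a sketch like the following can compare two snapshots and report which counters grew. It assumes the standard `/sys/class/net/<iface>/statistics/` files are present; which counters a given driver actually fills in varies, and missing files are simply skipped. The function names are hypothetical.

```python
from pathlib import Path

# Standard Linux sysfs statistics counters related to errors and drops.
COUNTERS = ("rx_errors", "rx_dropped", "rx_crc_errors", "rx_missed_errors",
            "tx_errors", "tx_dropped")


def read_counters(iface, root="/sys/class/net"):
    """Read error/drop counters for one interface; absent files are skipped."""
    stats_dir = Path(root) / iface / "statistics"
    values = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.exists():
            values[name] = int(path.read_text().strip())
    return values


def counter_deltas(before, after):
    """Return only the counters that increased between two snapshots."""
    return {name: after[name] - before[name]
            for name in after
            if name in before and after[name] > before[name]}
```

Taking one snapshot of the heartbeat NIC while the box is idle and another after a period of high CPU load would show whether drops correlate with load; a rising `rx_dropped` or `rx_missed_errors` on eth0 would be a hint, not proof, that the NIC is losing heartbeats.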

liug over 16 years ago in reply to BAlfson

BAlfson wrote:
"If it turns out that you hadn't selected a backup interface, then I'd be interested to know if the message above changes after you select an interface.

Cheers - Bob"

OK, I powered up node 2 and configured HA to use the internal interface (eth2) as the backup interface. Here is what's in the HA log. The message seems unchanged; it is still monitoring eth1 (WAN) and eth2 (internal), just like before.


2009:03:23-10:16:40 vfw1n1-1 ha_daemon[3254]: id="38A0" severity="info" sys="System" sub="ha" name="Backup interface changed: none -> eth2"
2009:03:23-10:16:40 vfw1n1-2 ha_daemon[3252]: id="38A0" severity="info" sys="System" sub="ha" name="Backup interface changed: none -> eth2"
2009:03:23-10:16:41 vfw1n1-2 ha_daemon[3252]: id="38A1" severity="warn" sys="System" sub="ha" name="Received backup heartbeats from master node!"
2009:03:23-10:16:45 vfw1n1-2 ha_daemon[3252]: id="38A0" severity="info" sys="System" sub="ha" name="Monitoring interfaceses for link beat: eth1 eth2 "
2009:03:23-10:17:00 vfw1n1-1 ha_daemon[3254]: id="38A0" severity="info" sys="System" sub="ha" name="Monitoring interfaceses for link beat: eth1 eth2 "

liug over 16 years ago in reply to liug

BTW, I see some FATAL errors in the node 1 HA log after I powered on node 2. Are they normal?


2009:03:23-09:49:04 vfw1n1-1 ha_daemon[3254]: id="38A3" severity="debug" sys="System" sub="ha" name="Netlink: Found link beat on eth0 again!"
2009:03:23-09:49:12 vfw1n1-1 ha_daemon[3254]: id="38A0" severity="info" sys="System" sub="ha" name="Access granted to remote node 2!"
2009:03:23-09:49:15 vfw1n1-1 ha_daemon[3254]: id="38C0" severity="info" sys="System" sub="ha" name="Node 2 is alive!"
2009:03:23-09:49:15 vfw1n1-1 ha_daemon[3254]: id="38A0" severity="info" sys="System" sub="ha" name="Node 2 changed state: DEAD -> ACTIVE"
2009:03:23-09:49:15 vfw1n1-1 ha_daemon[3254]: id="38A0" severity="info" sys="System" sub="ha" name="Node 2 joined with version 7.401"
2009:03:23-09:49:15 vfw1n1-1 slon_control[3451]: Started slon process 14357 for reporting
2009:03:23-09:49:15 vfw1n1-1 slon_control[3451]: Started slon process 14358 for pop3
2009:03:23-09:49:15 vfw1n1-1 slon[14357]: [1-1] CONFIG main: slon version 1.2.15 starting up
2009:03:23-09:49:15 vfw1n1-1 slon[14358]: [1-1] CONFIG main: slon version 1.2.15 starting up
2009:03:23-09:49:15 vfw1n1-1 slon[14360]: [2-1] ERROR  cannot get sl_local_node_id - ERROR:  schema "_asg_cluster" does not exist
2009:03:23-09:49:15 vfw1n1-1 slon[14360]: [3-1] FATAL  main: Node is not initialized properly - sleep 10s
2009:03:23-09:49:15 vfw1n1-1 slon[14359]: [2-1] CONFIG main: local node id = 1
2009:03:23-09:49:15 vfw1n1-1 slon[14359]: [3-1] CONFIG main: launching sched_start_mainloop
2009:03:23-09:49:15 vfw1n1-1 slon[14359]: [4-1] CONFIG main: loading current cluster configuration
2009:03:23-09:49:15 vfw1n1-1 slon[14359]: [5-1] CONFIG storeSet: set_id=1 set_origin=1 set_comment='reporting tables'
2009:03:23-09:49:15 vfw1n1-1 slon[14359]: [6-1] CONFIG main: configuration complete - starting threads
2009:03:23-09:49:23 vfw1n1-1 ctsyncd: new node detected (ID 2, state 223)
2009:03:23-09:49:23 vfw1n1-1 ctsyncd: sync request (node 2)
2009:03:23-09:49:23 vfw1n1-1 ctsyncd: starting initial sync
2009:03:23-09:49:23 vfw1n1-1 ctsyncd: initial sync done
2009:03:23-09:49:25 vfw1n1-1 slon[14358]: [1-1] CONFIG main: slon version 1.2.15 starting up
2009:03:23-09:49:25 vfw1n1-1 slon[14388]: [2-1] ERROR  cannot get sl_local_node_id - ERROR:  schema "_asg_cluster" does not exist
2009:03:23-09:49:25 vfw1n1-1 slon[14388]: [3-1] FATAL  main: Node is not initialized properly - sleep 10s
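Whether these FATAL lines are a startup transient or a persistent fault mostly comes down to whether they keep repeating. In the excerpt above, the two FATAL lines are 10 seconds apart, which matches the "sleep 10s" in the message. A minimal sketch for checking this from a saved log, assuming the `YYYY:MM:DD-HH:MM:SS` timestamp format shown above; the 5-minute quiet window is an arbitrary choice, and the check only says whether the message has stopped, not whether replication is healthy:

```python
from datetime import datetime, timedelta

# Timestamp format used in the log above, e.g. 2009:03:23-09:49:15
TS_FORMAT = "%Y:%m:%d-%H:%M:%S"


def fatal_times(lines, marker="FATAL"):
    """Return the timestamps of all log lines that contain the marker."""
    times = []
    for line in lines:
        if marker in line:
            times.append(datetime.strptime(line.split()[0], TS_FORMAT))
    return times


def still_repeating(lines, now, quiet=timedelta(minutes=5)):
    """True if a FATAL line was logged less than `quiet` before `now`."""
    times = fatal_times(lines)
    return bool(times) and now - max(times) < quiet
```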

BAlfson over 16 years ago in reply to liug

After that, does the cluster come up, or does the error keep repeating?

liug over 16 years ago in reply to BAlfson

Yes, the cluster did come up according to the web GUI.

BAlfson over 16 years ago in reply to liug

This is one of my gripes about the Astaro; it is really chatty. I waste a lot of time explaining to Astaro admins that such-and-such a message is normal. The programmers have a lot to do that's more fun than cleaning up the error messages or preventing the unimportant ones from being sent.

I don't know if that's abnormal; it looks like what I would expect, and if the Dashboard says the cluster is active, then that's good enough for me!

liug over 16 years ago in reply to BAlfson

BAlfson wrote:
"This is one of my gripes about the Astaro; it is really chatty. I waste a lot of time explaining to Astaro admins that such-and-such a message is normal. The programmers have a lot to do that's more fun than cleaning up the error messages or preventing the unimportant ones from being sent.

I don't know if that's abnormal; it looks like what I would expect, and if the Dashboard says the cluster is active, then that's good enough for me!"

It seems those errors are real, and they are coming back to bite me now.
I am getting tons of bogus failovers every day. The slave (node 2) thinks the master is dead and becomes master itself. Node 1 then finds two masters and asks node 2 to shut off, so node 2 becomes slave again and HA functions normally afterwards. This cycle happens every 30 minutes, and I am getting tons of emails every night.

liug over 16 years ago in reply to liug

It turns out the NIC doesn't work well for heartbeat. Although it worked initially, it starts losing packets when the CPU goes over 50%.
Lesson learned: you can't say a NIC works just because it works initially.
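Why a NIC that only drops packets under load produces intermittent bogus failovers, rather than a clean failure, can be shown with a simplified model: the standby promotes itself only after several consecutive heartbeats are missed, so scattered losses are tolerated, but a burst of losses during a CPU spike trips the detector. This sketch assumes a fixed heartbeat interval and a hypothetical `dead_after` threshold; the thread does not say what heartbeat interval or dead time ASG actually uses.

```python
def failover_points(heartbeats, dead_after):
    """Return the indices at which a standby would declare the master dead.

    heartbeats: sequence of booleans, one per heartbeat interval;
                True = heartbeat received, False = heartbeat lost.
    dead_after: consecutive losses tolerated before promotion
                (hypothetical value, not an actual ASG setting).
    """
    points = []
    missed = 0
    for i, received in enumerate(heartbeats):
        if received:
            missed = 0          # any received heartbeat resets the dead timer
            continue
        missed += 1
        if missed == dead_after:
            points.append(i)    # promote once per outage, not every interval
    return points


def burst_probability(loss_rate, dead_after):
    """Chance that dead_after heartbeats in a row are all lost, assuming
    each loss is independent (optimistic for load-induced loss)."""
    return loss_rate ** dead_after
```

Under the independence assumption, the chance of a false promotion in any given window is `loss_rate ** dead_after`, which is small; loss caused by CPU load is likely to be bursty instead, so the real rate of bogus failovers can be far higher than that estimate suggests.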

Any recommendations for a good quad-port gigabit NIC based on your experience? The HCL is a good start, but I have already been bitten by vendors changing the chipset or revision without changing the model number.