HA failover flip-flopping

Morning,

Over the last two days I've noticed that a pair of ASG120s I manage have started flip-flopping. I get reports that the slave has taken over, and then 30 minutes to an hour later the master takes over again. During this process I get several emails saying "HA ctsync daemon not running - restarted" and a couple of reboot notifications (one for each node). The system then stays quiet for about 24 hours and the cycle repeats.

Also, while the nodes are showing as sync'd & active I keep getting this error:

2010:10:30-09:29:22 dclfw1-2 slon[8481]: [1-1] CONFIG main: slon version 1.2.20 starting up 

2010:10:30-09:29:22 dclfw1-2 slon[8801]: [2-1] ERROR cannot get sl_local_node_id - ERROR: schema "_asg_cluster" does not exist 

2010:10:30-09:29:22 dclfw1-2 slon[8801]: [3-1] FATAL main: Node is not initialized properly - sleep 10s

The nodes were set up per the instructions in the HA guide, with some help from threads on here. The only other thing I've noticed during these events is that the system load on the node that dies (per the email) is typically above 0.80.

Everything works fine from the user standpoint, but getting emails about these nodes is a bit annoying. Any ideas?

0 BAlfson over 15 years ago

Definitely submit a ticket to Astaro Support.

It would be interesting to know what top says is causing the load when the switchover occurs.

Cheers - Bob

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA
0 TheDrew over 15 years ago in reply to BAlfson
I've submitted a case to Astaro, and I've got top running in an SSH shell with logging, so I'm getting snapshots of the top output every 3 seconds.

Assuming it crashes again in about 4hrs we'll see what the logs say.
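For anyone who wants to capture the same kind of data, here is a minimal Python sketch of timestamped snapshot logging. It assumes a Linux shell where top supports batch mode (`top -b -n 1`). The function names, the log file name, and the defaults are illustrative only and are not what was actually run on the ASG.

```python
import subprocess
import time
from datetime import datetime

# Assumption: a procps-style top that supports batch mode (-b) and
# a single iteration (-n 1). Swap in another command if needed.
DEFAULT_CMD = ("top", "-b", "-n", "1")

def take_snapshot(cmd=DEFAULT_CMD):
    """Run cmd once and return its output under a timestamp header."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    result = subprocess.run(list(cmd), capture_output=True, text=True, check=False)
    return f"=== {stamp} ===\n{result.stdout}"

def log_snapshots(path, interval=3.0, count=None, cmd=DEFAULT_CMD):
    """Append one snapshot to path every `interval` seconds.

    count=None keeps going until interrupted (Ctrl-C).
    Returns the number of snapshots written.
    """
    taken = 0
    with open(path, "a") as log:
        while count is None or taken < count:
            log.write(take_snapshot(cmd))
            log.flush()  # keep the file current in case the node reboots
            taken += 1
            if count is None or taken < count:
                time.sleep(interval)
    return taken

# Example (hypothetical file name): log_snapshots("top_snapshots.log", interval=3.0)
```

Flushing after every write matters here because the node may reboot mid-run, and the last few snapshots before the failover are the interesting ones.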

Also, as a related note for posterity, I'm seeing these in the system logs too:

2010:10:30-20:30:47 dclfw1-2 postgres[14864]: [3-1] ERROR: schema "_asg_cluster" does not exist
2010:10:30-20:30:47 dclfw1-2 postgres[14864]: [3-2] STATEMENT: select last_value::int4 from "_asg_cluster".sl_local_node_id
2010:10:30-20:30:52 dclfw1-2 postgres[14862]: [4-1] LOG: unexpected EOF on client connection

I have no idea whether the flip-flopping and the HA error logs are related, but something is wrong with postgres on the slave; that much I do know. Call it intuition. [;)]
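One way to check whether the log errors line up with the failovers is to parse the log lines and look at what was logged around each switchover time. The Python sketch below assumes every line follows the `YYYY:MM:DD-HH:MM:SS host process[pid]: message` layout seen in the excerpts in this thread. The function names and the 5-minute default window are hypothetical choices, not anything provided by Astaro.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed layout, taken from the excerpts in this thread:
#   2010:10:30-20:30:47 dclfw1-2 postgres[14864]: [3-1] ERROR: ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<msg>.*)$"
)
SEVERE_RE = re.compile(r"\b(ERROR|FATAL)\b")

def parse_line(line):
    """Return a dict for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y:%m:%d-%H:%M:%S"),
        "host": m.group("host"),
        "proc": m.group("proc"),
        "pid": int(m.group("pid")),
        "msg": m.group("msg"),
    }

def severe_counts(lines):
    """Count lines containing ERROR or FATAL, grouped by process name."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and SEVERE_RE.search(entry["msg"]):
            counts[entry["proc"]] += 1
    return counts

def entries_near(lines, when, window=timedelta(minutes=5)):
    """Return parsed entries logged within `window` of the time `when`."""
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry and abs(entry["time"] - when) <= window:
            found.append(entry)
    return found
```

Calling `entries_near` with the time from a failover notification email would show whether slon or postgres errors cluster around the switchover or appear independently of it.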
0 TheDrew over 15 years ago in reply to TheDrew

Minor update.

No joy on the logs. The flip-flop didn't occur the last two nights so nothing.

The database schema errors I was seeing also appear to have cleared up. I'm wondering if they're related to the cutover, a sort of remnant that takes a while to clean up.
0 BAlfson over 15 years ago

You mean, you're so good that it started behaving when it saw that you weren't going to let it act up? [;)]

Good news!

Cheers - Bob

0 TheDrew over 15 years ago in reply to BAlfson
BAlfson wrote: "You mean, you're so good that it started behaving when it saw that you weren't going to let it act up? [;)]"

I wish. [8-)] I did, however, get 7 hours' worth of useless top output. [:D]

Astaro has an open ticket and will be on my box over the next day or so to see if they can find anything.

It looks like something happened that killed the process on the slave. The process dying triggered a replication from the master. Since then, not a peep out of the HA system.

2010:10:31-00:01:42 dclfw1-2 slon_control[3601]: Slonik error, process exited with value 255
2010:10:31-00:01:43 dclfw1-2 slon_control[3601]: Resetting reporting
2010:10:31-00:01:43 dclfw1-2 slon_control[3601]: Starting replication from Node 1 to 2
0 TheDrew over 15 years ago in reply to TheDrew

Bob,

The answer back from Astaro is that the slon process occasionally has issues when the system is updated. Typically, though, the system resolves the issue on its own.

It still doesn't answer why the system flip-flopped a few times, but as the system has behaved as expected during the failures (I had some nightly 2-3GB backup sets running over the VPN tunnels during the swaps), I'll write it off as a random glitch for now.
0 BAlfson over 15 years ago

Thanks for posting "the rest of the story" for us, Andrew.
TheDrew wrote: "I'm wondering if they're related to the cutover. Sort of a remnant that takes a while to clean up."

I just now internalized that your HA is new. I think you can expect this same phenomenon with any major Up2Date that changes the structure of one of the PostgreSQL databases. One way to reduce this effect seems to be to cut down on the number of months of data retained in the Reporting Settings.

Cheers - Bob
