I have configured master/slave HA with the latest 7.401 software. It has mostly been working fine for a few weeks, but from time to time I get emails saying node 2 (slave) has become master, then another email saying it has become slave again, then another saying HA is fully functional again. It has been crazy in the last two days: I got 50+ such emails each day.
If I connect to the web interface and look at the Management -> HA status page, I can see:
1 MASTER node1 ACTIVE 7.401 Tue Mar 24 12:13:51 2009
2 SLAVE Node2 ACTIVE 7.401 Tue Apr 22 xx 2009 (usually the last time I got the email).
I thought node 2 must have rebooted or something, but if I ssh to node 1 and then ssh to node 2 via the linkbeat, I can check the system uptime of both nodes. Both have been up for many weeks.
Looking at the heartbeat log, I see lots of errors about node 2:
2009:04:21-21:49:32 fw1n1-2 slon[13826]: [2-1] ERROR cannot get sl_local_node_id - ERROR: schema "_asg_cluster" does not exist
2009:04:21-21:49:32 fw1n1-2 slon[13826]: [3-1] FATAL main: Node is not initialized properly - sleep 10s
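To see whether these slon errors line up with the times of the notification emails, something like the following could tally them per hour. This is only a rough sketch: it assumes every log line follows the same format as the two lines quoted above, and the function name and regular expression are made up for illustration, not part of any Astaro tooling.

```python
import re
from collections import Counter

# Assumed line format, inferred from the two sample lines above:
# YYYY:MM:DD-HH:MM:SS host slon[pid]: [n-m] LEVEL message
LINE_RE = re.compile(
    r"^(?P<date>\d{4}:\d{2}:\d{2})-(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"(?P<host>\S+)\s+slon\[\d+\]:\s+\[\d+-\d+\]\s+(?P<level>ERROR|FATAL)\b"
)

def count_slon_errors(lines):
    """Count slon ERROR/FATAL lines per (host, date, hour)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[(m["host"], m["date"], m["hour"])] += 1
    return counts

# The two lines quoted above, used as sample input.
sample = [
    '2009:04:21-21:49:32 fw1n1-2 slon[13826]: [2-1] ERROR cannot get '
    'sl_local_node_id - ERROR: schema "_asg_cluster" does not exist',
    '2009:04:21-21:49:32 fw1n1-2 slon[13826]: [3-1] FATAL main: Node is '
    'not initialized properly - sleep 10s',
]

for key, n in sorted(count_slon_errors(sample).items()):
    print(key, n)
```

Feeding it the full heartbeat log instead of `sample` would give one count per host and hour, which can then be compared with the email timestamps.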