2x SG430 - After starting Up2Date process, HA communication broke down - Slave in unknown state

Question

Hello, 
 On Friday at noon we tried to install the last two updates. 
 Unfortunately, the cluster crashed in the process. 
 Since then there have been no new entries at all in the High Availability log (HA log). 
 The last entries look like this: 
 
 2018:03:16-13:10:01 dialin-2 ha_daemon[4836]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 466 01.166" name="HA control: cmd = 'up2date 9.508010'"
2018:03:16-13:10:01 dialin-2 ha_daemon[4836]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 467 01.166" name="Initiating up2date on node 1 to version 9.508010"
2018:03:16-13:10:01 dialin-1 ha_daemon[4808]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 438 01.166" name="state change ACTIVE(0) -> UP2DATE(256)"
2018:03:16-13:10:01 dialin-1 ha_daemon[4808]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 439 01.166" name="Starting local up2date 9.506002 -> 9.508010"
2018:03:16-13:10:01 dialin-1 ha_daemon[4808]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 440 01.167" name="Executing (nowait) /etc/init.d/ha_mode disable"
2018:03:16-13:10:01 dialin-1 ha_daemon[4808]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 441 01.167" name="--- Node is disabled ---"
2018:03:16-13:10:01 dialin-1 ha_mode[9587]: calling disable
2018:03:16-13:10:01 dialin-1 ha_mode[9587]: disable: waiting for last ha_mode done
2018:03:16-13:10:01 dialin-1 ha_mode[9587]: Switching disable mode
2018:03:16-13:10:01 dialin-1 ha_mode[9587]: disable done (started at 13:10:01)
2018:03:16-13:10:01 dialin-1 ha_up2date[9586]: already running(9592) (exit 3)
2018:03:16-13:10:01 dialin-1 repctl[19687]: [i] execute(1768): waiting for server to shut down...
2018:03:16-13:10:01 dialin-1 repctl[19687]: [i] execute(1768): .
2018:03:16-13:10:02 dialin-2 ha_daemon[4836]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 468 02.010" name="Node 1 changed state: ACTIVE(0) -> UP2DATE(256)"
2018:03:16-13:10:02 dialin-2 ha_daemon[4836]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 469 02.011" name="Executing (nowait) /etc/init.d/ha_mode topology_changed"
2018:03:16-13:10:02 dialin-2 ha_mode[31156]: calling topology_changed
2018:03:16-13:10:02 dialin-2 ha_mode[31156]: topology_changed: waiting for last ha_mode done
2018:03:16-13:10:02 dialin-2 ha_mode[31156]: topology_changed done (started at 13:10:02)
2018:03:16-13:10:02 dialin-1 repctl[19687]: [i] execute(1768): done
2018:03:16-13:10:02 dialin-1 repctl[19687]: [i] execute(1768): server stopped
2018:03:16-13:10:02 dialin-1 repctl[19687]: [i] execute(1768): waiting for server to start....
2018:03:16-13:10:03 dialin-2 repctl[8098]: [i] terminate(2321): exit due to signal TERM
2018:03:16-13:10:03 dialin-1 repctl[19687]: [i] execute(1768): done
2018:03:16-13:10:03 dialin-1 repctl[19687]: [i] execute(1768): server started
2018:03:16-13:10:03 dialin-1 repctl[19687]: [i] terminate(2321): exit due to signal TERM
2018:03:16-14:30:10 dialin-2 ha_daemon[4836]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 470 10.609" name="Monitoring interfaces for link beat: eth10 eth11 eth6 eth0 eth13 eth16 eth5 eth4 eth2 eth14 eth3 eth15 eth7 eth12 eth1 eth17"

What can we do here?

BAlfson · Accepted Answer

Hello Patrick, 
 Try a hard reboot of the Slave (node 1). If that doesn't work, try disabling HA and then re-enabling it. 
 If it still doesn't work, I would be tempted to re-image it from ISO. If you do that, remember that it has to be on the same version as the Master (node 2). It's been about five years since this happened to one of my clients - I don't remember if we also had to disable/enable HA. 
 Any luck with any of those ideas? 
 Cheers - Bob
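
For anyone triaging a similar HA log offline, here is a minimal Python sketch that flags long silences between entries and lists node state changes. It assumes only the line layout visible in the excerpt above (timestamp, host, process[pid], message). The names `parse_line`, `find_gaps` and `state_changes` and the five-minute threshold are made up for illustration and are not part of any Sophos tooling.

```python
import re
from datetime import datetime, timedelta

# Layout assumed from the excerpt in this thread:
# "2018:03:16-13:10:01 dialin-1 ha_daemon[4808]: message..."
LINE_RE = re.compile(
    r"^(\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) (\S+) (\S+?)\[\d+\]: (.*)$"
)
# Matches e.g. "ACTIVE(0) -> UP2DATE(256)"
STATE_RE = re.compile(r"(\w+)\(\d+\) -> (\w+)\(\d+\)")


def parse_line(line):
    """Return (timestamp, host, process, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y:%m:%d-%H:%M:%S")
    return ts, m.group(2), m.group(3), m.group(4)


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous, next) timestamp pairs that are more than `threshold` apart."""
    gaps = []
    prev = None
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        if prev is not None and rec[0] - prev > threshold:
            gaps.append((prev, rec[0]))
        prev = rec[0]
    return gaps


def state_changes(lines):
    """Return (host, old_state, new_state) for every logged state transition."""
    changes = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        m = STATE_RE.search(rec[3])
        if m:
            changes.append((rec[1], m.group(1), m.group(2)))
    return changes


# Three lines copied from the log in the question.
SAMPLE = [
    '2018:03:16-13:10:01 dialin-1 ha_daemon[4808]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 438 01.166" name="state change ACTIVE(0) -> UP2DATE(256)"',
    '2018:03:16-13:10:03 dialin-1 repctl[19687]: [i] terminate(2321): exit due to signal TERM',
    '2018:03:16-14:30:10 dialin-2 ha_daemon[4836]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 470 10.609" name="Monitoring interfaces for link beat: eth10 eth11 eth6 eth0 eth13 eth16 eth5 eth4 eth2 eth14 eth3 eth15 eth7 eth12 eth1 eth17"',
]

if __name__ == "__main__":
    for start, end in find_gaps(SAMPLE):
        print("silence from", start, "to", end)
    for host, old, new in state_changes(SAMPLE):
        print(host, old, "->", new)
```

Run against the full excerpt, this would point at the silence after 13:10:03, when node 1 (dialin-1) entered UP2DATE and stopped logging. That fits the advice above to start with the Slave.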