HA Cluster (2x SG430) ausgefallen

Question

Moin 
 Wir haben am Samstag zwei neue SG430 mit Fullgurad und Sandstorm aufgebaut und live geschaltet, um unsere alten BareMetal Installationen abzul&ouml;sen. 
 Nun scheinen wir jedoch ein recht erhebliches Problem zu haben. 
 Denn etwa eine halbe Stunde nachdem wir die Maschinen live geschaltet hatten ist offensichtlich die E-Mail Protection ausgefallen. Zumindest stammt der letzte Log Eintrag von 10:32 gestern fr&uuml;h. 
 Ich wurde heute darauf aufmerksam gemacht und habe in der UTM in der E-Mail Protection nachschauen wollen. Jedoch wurde mir dort &uuml;berhaupt nichts angezeigt. alle Graphen weg. Keine Statistiken - nichts. 
 Also dachte ich, ich boote beide Systeme einmal durch. Habe erst den HA Master neu gestartet. Nach einigen Sekunden kam ich dann auch wieder auf die Weboberfl&auml;che. Dort stellte ich fest, dass sehr viele E-Mails in der Warteschlange sind. Aber nicht, weil sie nicht weiterkommen, sondern weil der Mail-Server die Daten-Dateien f&uuml;r diese E-Mails niche mehr finden kann. Daraufhin habe ich in die HA-Cluster Konfig geschaut und festgestellt, dass die neu gestartete Node nach &uuml;ber 6 Stunden immer noch auf Sync steht. 
 Das HA-Live Log zeigte mir dann dieses debakel: 
 
 2017:10:22-20:52:04 dialin-2 ha_daemon[4798]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 76 04.614" name="HA control: cmd = 'sync start 1 database'" 
 2017:10:22-20:52:04 dialin-2 ha_daemon[4798]: id="38A0" severity="info" sys="System" sub="ha" seq="S: 77 04.614" name="Activating sync process for database on node 1" 
 2017:10:22-20:52:04 dialin-2 repctl[4846]: [i] execute(1768): pg_ctl: PID file "/var/storage/pgsql92/data/postmaster.pid" does not exist 
 2017:10:22-20:52:04 dialin-2 repctl[4846]: [i] execute(1768): Is server running? 
 2017:10:22-20:52:04 dialin-2 repctl[4846]: [c] sql_execute(1834): SQL execute: fetch failed 
 2017:10:22-20:52:04 dialin-2 repctl[4846]: [c] start_backup_mode(741): cannot start backup mode 
 2017:10:22-20:52:04 dialin-2 repctl[4846]: [c] standby_clone(885): standby_clone failed: cannot start backup mode 
 2017:10:22-20:52:04 dialin-2 repctl[4846]: [e] prepare_secondary(346): prepare_secondary: clone failed 
 2017:10:22-20:52:04 dialin-2 repctl[4846]: [c] sql_execute(1834): SQL execute: fetch failed 
 2017:10:22-20:52:04 dialin-2 repctl[4846]: [c] start_backup_mode(741): cannot start backup mode 
 2017:10:22-20:52:04 dialin-2 repctl[4846]: [c] standby_clone(885): standby_clone failed: cannot start backup mode 
 2017:10:22-20:52:04 dialin-2 repctl[4846]: [e] prepare_secondary(346): prepare_secondary: clone failed 
 2017:10:22-20:52:05 dialin-2 repctl[4846]: [c] sql_execute(1834): SQL execute: fetch failed 
 2017:10:22-20:52:05 dialin-2 repctl[4846]: [c] start_backup_mode(741): cannot start backup mode 
 2017:10:22-20:52:05 dialin-2 repctl[4846]: [c] standby_clone(885): standby_clone failed: cannot start backup mode 
 2017:10:22-20:52:05 dialin-2 repctl[4846]: [e] prepare_secondary(346): prepare_secondary: clone failed 
 2017:10:22-20:52:05 dialin-2 repctl[4846]: [c] prepare_secondary(360): failed to get database up, waiting for retry 
 2017:10:22-20:52:05 dialin-2 repctl[4846]: [c] setup_replication(274): setup_replication was not properly executed 
 2017:10:22-20:52:05 dialin-2 repctl[4846]: [i] setup_replication(278): checkinterval 300 
 2017:10:22-20:52:08 dialin-1 ha_daemon[4811]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 729 08.359" name="Set syncing.files for node 2" 
 2017:10:22-20:52:09 dialin-1 ha_daemon[4811]: id="38A0" severity="info" sys="System" sub="ha" seq="M: 730 09.528" name="Clear syncing.files for node 2" 
 2017:10:22-20:52:10 dialin-1 repctl[9946]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: Connection refused 
 2017:10:22-20:52:10 dialin-1 repctl[9946]: [e] db_connect(2206): Is the server running on host "198.19.250.2" and accepting 
 2017:10:22-20:52:10 dialin-1 repctl[9946]: [e] db_connect(2206): TCP/IP connections on port 5432? 
 2017:10:22-20:52:10 dialin-1 repctl[9946]: [e] master_connection(2045): could not connect to server: Connection refused 
 2017:10:22-20:52:10 dialin-1 repctl[9946]: [e] master_connection(2045): Is the server running on host "198.19.250.2" and accepting 
 2017:10:22-20:52:10 dialin-1 repctl[9946]: [e] master_connection(2045): TCP/IP connections on port 5432? 
 2017:10:22-20:52:13 dialin-1 repctl[9946]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: Connection refused 
 2017:10:22-20:52:13 dialin-1 repctl[9946]: [e] db_connect(2206): Is the server running on host "198.19.250.2" and accepting 
 2017:10:22-20:52:13 dialin-1 repctl[9946]: [e] db_connect(2206): TCP/IP connections on port 5432? 
 2017:10:22-20:52:13 dialin-1 repctl[9946]: [e] master_connection(2045): could not connect to server: Connection refused 
 2017:10:22-20:52:13 dialin-1 repctl[9946]: [e] master_connection(2045): Is the server running on host "198.19.250.2" and accepting 
 2017:10:22-20:52:13 dialin-1 repctl[9946]: [e] master_connection(2045): TCP/IP connections on port 5432? 
 2017:10:22-20:52:16 dialin-1 repctl[9946]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: Connection refused 
 2017:10:22-20:52:16 dialin-1 repctl[9946]: [e] db_connect(2206): Is the server running on host "198.19.250.2" and accepting 
 2017:10:22-20:52:16 dialin-1 repctl[9946]: [e] db_connect(2206): TCP/IP connections on port 5432? 
 2017:10:22-20:52:16 dialin-1 repctl[9946]: [e] master_connection(2045): could not connect to server: Connection refused 
 2017:10:22-20:52:16 dialin-1 repctl[9946]: [e] master_connection(2045): Is the server running on host "198.19.250.2" and accepting 
 2017:10:22-20:52:16 dialin-1 repctl[9946]: [e] master_connection(2045): TCP/IP connections on port 5432? 
 2017:10:22-20:52:20 dialin-1 repctl[9946]: [e] db_connect(2206): error while connecting to database(DBI:Pg:dbname=repmgr;host=198.19.250.2): could not connect to server: Connection refused 
 2017:10:22-20:52:20 dialin-1 repctl[9946]: [e] db_connect(2206): Is the server running on host "198.19.250.2" and accepting 
 2017:10:22-20:52:20 dialin-1 repctl[9946]: [e] db_connect(2206): TCP/IP connections on port 5432? 
 2017:10:22-20:52:20 dialin-1 repctl[9946]: [e] master_connection(2045): could not connect to server: Connection refused 
 2017:10:22-20:52:20 dialin-1 repctl[9946]: [e] master_connection(2045): Is the server running on host "198.19.250.2" and accepting 
 2017:10:22-20:52:20 dialin-1 repctl[9946]: [e] master_connection(2045): TCP/IP connections on port 5432? 
 
 Offensichtlich gibt es auf der neu gestarteten Node (Node2) Probleme mit dem SQL Server. Ein "netstat -n | grep 5432" gab nur Verbindungen von der Node2 auf Node1 (Master) 
 
 Wir brauchen echt dringend hilfe. 
 Denn ohne E-Mail versand gibt es richtig Probleme. 
 
 [EDIT] 
 Ein "PS AUX | grep postgres" auf der Node2 gab folgendes Ergebnis: 
 ps aux | grep postgres postgres 4846 0.0 0.0 14516 9700 ? Ss 20:27 0:00 /usr/local/bin/repctl 100 17234 0.0 0.0 5672 748 pts/0 S+ 21:06 0:00 grep postgres 
 
 Das ist quasi nichts... kann ich die DB manuell neu starten? oder sollte das lieber der Support von Sophos &uuml;bernehmen? 
 Kann ich diesen den mit dem Standard-Support &uuml;berhaupt direkt kontaktieren, oder muss ich da erst &uuml;ber die Firma gehen, die uns die SG430er verkauft hat? 
 [/EDIT] 
 
 [EDIT2] 
 [4-1] LOG: invalid primary checkpoint record 2017:10:22-13:14:13 dialin-2 postgres[19268]: [5-1] LOG: could not open file "pg_xlog/0000000100000001000000FE" (log file 1, segment 254): No such file or directory 2017:10:22-13:14:13 dialin-2 postgres[19268]: [6-1] LOG: invalid secondary checkpoint record 2017:10:22-13:14:13 dialin-2 postgres[19268]: [7-1] PANIC: could not locate a valid checkpoint record 2017:10:22-13:14:13 dialin-1 postgres[20097]: [3-1] FATAL: remaining connection slots are reserved for non-replication superuser connections 2017:10:22-13:14:14 dialin-2 postgres[19275]: [2-1] FATAL: the database system is starting up 2017:10:22-13:14:14 dialin-1 postgres[20100]: [3-1] FATAL: remaining connection slots are reserved for non-replication superuser connections 2017:10:22-13:14:15 dialin-2 postgres[19277]: [2-1] FATAL: the database system is starting up 2017:10:22-13:14:15 dialin-2 postgres[19267]: [2-1] LOG: startup process (PID 19268) was terminated by signal 6: Aborted 2017:10:22-13:14:15 dialin-2 postgres[19267]: [3-1] LOG: aborting startup due to startup process failure 2017:10:22-13:14:21 dialin-1 postgres[20727]: [3-1] ERROR: a backup is already in progress 2017:10:22-13:14:21 dialin-1 postgres[20727]: [3-2] HINT: Run pg_stop_backup() and try again. 2017:10:22-13:14:21 dialin-1 postgres[20727]: [3-3] STATEMENT: select pg_xlogfile_name(pg_start_backup($1, $2)) 2017:10:22-13:14:21 dialin-1 postgres[20730]: [3-1] ERROR: a backup is already in progress 2017:10:22-13:14:21 dialin-1 postgres[20730]: [3-2] HINT: Run pg_stop_backup() and try again. 2017:10:22-13:14:21 dialin-1 postgres[20730]: [3-3] STATEMENT: select pg_xlogfile_name(pg_start_backup($1, $2)) 2017:10:22-13:14:21 dialin-1 postgres[20733]: [3-1] ERROR: a backup is already in progress 2017:10:22-13:14:21 dialin-1 postgres[20733]: [3-2] HINT: Run pg_stop_backup() and try again. 2017:10:22-13:14:21 dialin-1 postgres[20733]: [3-3] STATEMENT: select pg_xlogfile_name(pg_start_backup($1, $2)) 
 Das habe ich aus dem system.log 
 
 Beide Nodes haben offenbar FATAL Meldungen bez&uuml;glich des Postgres Servers 
 
 FATAL: the database system is shutting down 
 
 Diese "Shutting down" Eintrag wird f&uuml;r Node2 mehrere hundert male angezeigt... 
 Bis ich diese Node neu gestartet hatte.. 
 [/EDIT2]

patrick.hackert@preventgr · Accepted Answer

Der Support hat uns heute dann weitergeholfen. 
 Die Datenbank war offenbar aus einem nicht nachvollziehbaren Grund w&auml;hrend eines FailOvers besch&auml;digt worden. 
 Hier half nur ein Rebuild. 
 Da die UTM ja erst ein paar Stunden lief bevor die Datenbank kaputt ging. 
 
 Jetzt l&auml;uft wieder alles. Auch ohne Neuinstallation. 
 
 Noch mal ein gro&szlig;es DANKE an den Sophos Support!