We have a problem with SFOS 18.x
Primary Device stucks in bootloop after failover to auxilary device. It does not failback.
this is the second time this happens. the first time was with 18.0.0 GA-Build339, now same with 18.0.1 MR-1
Both devices are just freshly reimaged from ISO image and backup restore was done on primary only. HA Cluster was running stable until reload.
Devices are connected via Switch and isolated vlan over Port6. Interface was renamed to "Port6 HA". Interface was assigned to a Zone called "HA", which has SSH enabled.
Primary Device has: 192.168.0.10 /24
Aux has: 192.168.0.20 /24
Primary device does not bootloop if we disconnect HA peer link, it works in standalone mode, but immediately reboots if we reconnect auxiliary.
during the boot loop it looks there is a sync between primary and aux, which eventually forces the primary to reload.
Primary console output during bootloop:
Use the ^ and v keys to select which entry is highlighted.
Press enter to boot the selected OS, `e' to edit the commands
Booting `18_0_1_367'Loading configuration
Performing automated file system integrity checks. It will take some time before your system is available.
Examining Config partition.....
Examining Signature partition.....
Examining Report partition.....
System will reboot
Sent SIGKILL to all processes
[ 59.449190] reboot: Restarting system
GNU GRUB version 2.02Use the ^ and v keys to select which entry is highlighted.
Press enter to boot the selected OS, `e' to edit the commands
Booting `18_0_1_367'Loading configuration
Performing automated file system integrity checks. It will take some time before your system is available.
Examining Config partition.....
Examining Signature partition.....
Examining Report partition.....
System will reboot
The system is gipsec_starter[5297]: charon stopped after 400 msipsec_starter[5297]: ipsec starter stopped
Sent SIGKILL to all processes
[ 126.601181] reboot: Restarting system
Auxilary devices logfiles::
cat /log/ha_tunnel.log
Apr 29 12:40:40 ssh: connect to host 192.168.0.10 port 22: Connection timed out
Apr 29 12:40:46 ssh: connect to host 192.168.0.10 port 22: Connection timed out
XG230_WP02_SFOS 18.0.1 MR-1# Apr 29 12:41:09 Timeout, server 192.168.0.10 not responding.
Apr 29 12:41:14 ssh: connect to host 192.168.0.10 port 22: Connection timed out
Apr 29 12:41:19 ssh: connect to host 192.168.0.10 port 22: Connection timed out
Apr 29 12:41:24 ssh: connect to host 192.168.0.10 port 22: Connection timed out
Apr 29 12:41:29 ssh: connect to host 192.168.0.10 port 22: Connection timed out
XG230_WP02_SFOS 18.0.1 MR-1# Connection to 192.168.0.10 closed by remote host.
Connection to 192.168.0.10 closed.
Apr 29 12:42:02 ssh: connect to host 192.168.0.10 port 22: Connection refused
Apr 29 12:42:03 ssh: connect to host 192.168.0.10 port 22: Connection refused
Apr 29 12:42:08 ssh: connect to host 192.168.0.10 port 22: Connection timed out
Apr 29 12:42:13 ssh: connect to host 192.168.0.10 port 22: Connection timed out
cat /log/csc.log | grep 'ERROR Apr 29 12:'
ERROR Apr 29 12:54:45 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'File exists'
PAckage ::::system::systemgraphERROR Apr 29 12:54:46 [listener:1177]: worker 21136 killed
ERROR Apr 29 12:54:46 [listener:1177]: csc_waitpid: pid(21136) exited with signal 9
ERROR Apr 29 12:54:46 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'File exists'
PAckage ::::system::systemgraphERROR Apr 29 12:54:47 [listener:1177]: worker 21138 killed
ERROR Apr 29 12:54:47 [listener:1177]: csc_waitpid: pid(21138) exited with signal 9
ERROR Apr 29 12:54:47 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'File exists'
ERROR Apr 29 12:54:47 [listener:1177]: worker 21126 killed
It's possible to connect via SSH over Peerlink for a 1-2 minutes. so far we were able to get some informations from the csc.log.
here are the imo most supecious parts:
XG230_WP02_SFOS 18.0.1 MR-1# tail -f /log/csc.log
ERROR Apr 29 14:22:27 [listener:1177]: worker 1192 killed
ERROR Apr 29 14:22:27 [listener:1177]: csc_waitpid: pid(1192) exited with signal 9
MESSAGE Apr 29 14:22:27 [postgres:1196]: Child exited with status 1
WARNING Apr 29 14:22:27 [postgres:1196]: action with nofail failed
MESSAGE Apr 29 14:22:27 [postgres:1196]: Child exited with status 1
WARNING Apr 29 14:22:27 [postgres:1196]: action with nofail failed
MESSAGE Apr 29 14:22:27 [system_startup:1504]: Child exited with status 1
WARNING Apr 29 14:22:27 [system_startup:1504]: action with nofail failed
MESSAGE Apr 29 14:22:27 [ha_tunnel:1242]: do_stop(): status = EXITING
MESSAGE Apr 29 14:22:27 [ha_tunnel:1242]: do_waitpid: pid 1522 exited with status 255
IFNAME_PREFIX: Port
IFNAME_SUFFIX: NUM
IFLIST: 8_PortA8
MESSAGE Apr 29 14:22:28 [system_startup:1504]: Child exited with status 1
WARNING Apr 29 14:22:28 [system_startup:1504]: action with nofail failed
ERROR Apr 29 14:22:29 [system_startup:1504]: nvram_get(utm9): failed with -12
WARNING Apr 29 14:22:29 [system_startup:1504]: action with nofail failed
ERROR Apr 29 14:22:29 [system_startup:1504]: nvram_eget(wz.restore): failed with -16
WARNING Apr 29 14:22:29 [system_startup:1504]: action with nofail failed
ERROR Apr 29 14:22:29 [system_startup:1504]: nvram_eget(wz.mandatory): failed with -16
WARNING Apr 29 14:22:29 [system_startup:1504]: action with nofail failed
ERROR Apr 29 14:22:30 [listener:1177]: worker 1191 killed
PAckage ::::cli::alertERROR Apr 29 14:22:30 [listener:1177]: csc_waitpid: pid(1191) exited with signal 9
ERROR Apr 29 14:22:30 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'Success'
ERROR Apr 29 14:22:30 [listener:1177]: worker 1615 killed
ERROR Apr 29 14:22:30 [listener:1177]: csc_waitpid: pid(1615) exited with signal 9
########## Package: firewall::atp
********** atp Read through ORM
********** objEntityDef: HASH(0xa2c6ba0)
ERROR Apr 29 14:22:30 [listener:1177]: worker 1680 killed
ERROR Apr 29 14:22:30 [listener:1177]: csc_waitpid: pid(1680) exited with signal 9
ERROR Apr 29 14:22:30 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'Success'
MESSAGE Apr 29 14:22:30 [ha_tunnel:1242]: do_stop(): status = EXITING
MESSAGE Apr 29 14:22:30 [ha_tunnel:1242]: do_waitpid: pid 1672 exited with status 255
ERROR Apr 29 14:22:31 [listener:1177]: worker 1681 killed
ERROR Apr 29 14:22:31 [listener:1177]: csc_waitpid: pid(1681) exited with signal 9
ERROR Apr 29 14:22:31 [system_startup:1504]: nvram_get(migrated_utm9): failed with -12
WARNING Apr 29 14:22:31 [system_startup:1504]: action with nofail failed
ERROR Apr 29 14:22:31 [listener:1177]: worker 1693 killed
ERROR Apr 29 14:22:31 [listener:1177]: csc_waitpid: pid(1693) exited with signal 9
ERROR Apr 29 14:22:31 [listener:1177]: worker 1730 killed
ERROR Apr 29 14:22:31 [listener:1177]: csc_waitpid: pid(1730) exited with signal 9
ERROR Apr 29 14:22:31 [listener:1177]: worker 1763 killed
ERROR Apr 29 14:22:31 [listener:1177]: csc_waitpid: pid(1763) exited with signal 9
MESSAGE Apr 29 14:22:56 [ospfd:1214]: do_stop(): status = UNTOUCHED
ERROR Apr 29 14:22:56 [listener:1177]: worker 2917 killed
ERROR Apr 29 14:22:56 [listener:1177]: csc_waitpid: pid(2917) exited with signal 9
ERROR Apr 29 14:22:56 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'File exists'
ERROR Apr 29 14:22:56 [timer:1274]: status_sched_node: timer pa_Port3 do not exist
MESSAGE Apr 29 14:22:56 [dev2cpumap:2447]: Child exited with status 255
WARNING Apr 29 14:22:56 [dev2cpumap:2447]: action with nofail failed
ERROR Apr 29 14:22:56 [timer:1274]: status_sched_node: timer pa_Port4 do not exist
MESSAGE Apr 29 14:22:56 [dev2cpumap:2447]: Child exited with status 255
WARNING Apr 29 14:22:56 [dev2cpumap:2447]: action with nofail failed
ERROR Apr 29 14:22:56 [listener:1177]: worker 2907 killed
ERROR Apr 29 14:22:56 [listener:1177]: csc_waitpid: pid(2907) exited with signal 9
ERROR Apr 29 14:22:56 [timer:1274]: status_sched_node: timer pa_Port5 do not exist
MESSAGE Apr 29 14:22:56 [dev2cpumap:2447]: Child exited with status 255
WARNING Apr 29 14:22:56 [dev2cpumap:2447]: action with nofail failed
ERROR Apr 29 14:22:56 [timer:1274]: status_sched_node: timer pa_Port6 do not exist
MESSAGE Apr 29 14:22:56 [dev2cpumap:2447]: Child exited with status 255
WARNING Apr 29 14:22:56 [dev2cpumap:2447]: action with nofail failed
ERROR Apr 29 14:22:56 [timer:1274]: status_sched_node: timer pa_Port7 do not exist
-----
DEBUG Apr 29 14:23:25 [showfw:5136]: exec: filename '/bin/sh'
DEBUG Apr 29 14:23:25 [showfw:5136]: exec: argv[0] = '/bin/sh'
DEBUG Apr 29 14:23:25 [showfw:5136]: exec: argv[1] = '-c'
DEBUG Apr 29 14:23:25 [showfw:5136]: exec: argv[2] = 'showfw |grep -v FW0|awk -F '=' '{printf("%s,", $2);}''
MESSAGE Apr 29 14:23:25 [ha_tunnel:1242]: do_stop(): status = EXITING
MESSAGE Apr 29 14:23:25 [ha_tunnel:1242]: do_waitpid: pid 4935 exited with status 255
INFO Apr 29 14:23:25 [showfw:5136]: Child exited with status 0
INFO Apr 29 14:23:25 [showfw:5136]: create_act_out_perl_obj: varname=out
INFO Apr 29 14:23:25 [showfw:5136]: create_act_out_perl_obj: out.status=0
INFO Apr 29 14:23:25 [showfw:5136]: create_act_out_perl_obj: TEXT out.output=18_0_0_339,18_0_1_367,
INFO Apr 29 14:23:25 [showfw:5136]: ACTION: CALL lcd_format
INFO Apr 29 14:23:25 [showfw:5136]: ACTION: DLOPEN(do_nvram_get, FW1)
INFO Apr 29 14:23:25 [showfw:5136]: do_nvram_get: FW1 in dv var
INFO Apr 29 14:23:25 [showfw:5136]: ACTION: CALL lcd_dversion
INFO Apr 29 14:23:25 [showfw:5136]: ACTION: DLOPEN(do_nvram_get, FW2)
INFO Apr 29 14:23:25 [showfw:5136]: do_nvram_get: FW2 in dv var
INFO Apr 29 14:23:25 [showfw:5136]: ACTION: CALL lcd_dversion
INFO Apr 29 14:23:25 [showfw:5136]: opcode 'showfw': time taken: 0.570195756 seconds
DEBUG Apr 29 14:23:25 [worker:5136]: worker_proc: going to release dbhandle for 5136(1,513601)
DEBUG Apr 29 14:23:25 [worker:5136]: Main TLV:{ 4, data:{ 1, 4, 41D670} data:{ 2, 4, 101400} data:{ 5, 4, 0000} , 27}
DEBUG Apr 29 14:23:25 [worker:5136]: response:
-----------
DEBUG Apr 29 14:23:25 [worker:5136]: request method = opcode
DEBUG Apr 29 14:23:25 [worker:5136]: request name = showfw
DEBUG Apr 29 14:23:25 [worker:5136]: request version = 1.1
DEBUG Apr 29 14:23:25 [worker:5136]: content length = 64
DEBUG Apr 29 14:23:25 [worker:5136]: content type = text
DEBUG Apr 29 14:23:25 [worker:5136]: response code = 200
DEBUG Apr 29 14:23:25 [worker:5136]: response string len = 2
DEBUG Apr 29 14:23:25 [worker:5136]: response string is = 'OK'
DEBUG Apr 29 14:23:25 [worker:5136]: response content = 'FW1=SFOS 18.0.0# GA-Build339 V#FW2=SFOS 18.0.1# MR-1 '
DEBUG Apr 29 14:23:25 [worker:5136]:
-----------
DEBUG Apr 29 14:23:25 [worker:5136]: # OPCODE Exited: 'showfw' with Status: '200'
DEBUG Apr 29 14:23:25 [worker:5136]: write_packet: 52+64+2
ERROR Apr 29 14:23:25 [listener:1177]: worker 5136 killed
ERROR Apr 29 14:23:25 [listener:1177]: csc_waitpid: pid(5136) exited with signal 9
ERROR Apr 29 14:23:25 [listener:1177]: worker 5150 killed
ERROR Apr 29 14:23:25 [listener:1177]: csc_waitpid: pid(5150) exited with signal 9
######### Package: system::webadminsettings
********** webadminsettings Read through ORM
********** objEntityDef: HASH(0xa30eac0)
ERROR Apr 29 14:23:25 [listener:1177]: worker 5139 killed
ERROR Apr 29 14:23:25 [listener:1177]: csc_waitpid: pid(5139) exited with signal 9
ERROR Apr 29 14:23:25 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'File exists'
########## Package: system::localservice
********** localservice Read through ORM
********** objEntityDef: HASH(0xa2c7450)
IFLIST: 8_PortA8
ERROR Apr 29 14:23:56 [listener:1177]: worker 7821 killed
ERROR Apr 29 14:23:56 [listener:1177]: csc_waitpid: pid(7821) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7825 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7825) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7832 killed
ERROR Apr 29 14:23:57 [lic_appliance_update:5094]: nvram_get(ha_update): failed with -12
WARNING Apr 29 14:23:57 [lic_appliance_update:5094]: action with nofail failed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7832) exited with signal 9
ERROR Apr 29 14:23:57 [lic_appliance_update:5094]: nvram_get(app_update): failed with -12
WARNING Apr 29 14:23:57 [lic_appliance_update:5094]: action with nofail failed
ERROR Apr 29 14:23:57 [listener:1177]: worker 5094 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(5094) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7833 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7833) exited with signal 9
ERROR Apr 29 14:23:57 [nasm:1201]: do_real_start: before_start failed. aborting!
MESSAGE Apr 29 14:23:57 [nasm:1201]: do_stop(): status = UNTOUCHED
ERROR Apr 29 14:23:57 [nasm:1201]: do_start: service failed to start
ERROR Apr 29 14:23:57 [listener:1177]: worker 7844 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7844) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7860 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7860) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7876 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7876) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7887 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7887) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7903 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pidipsec_starter[5173]: charon stopped after 400 ms
ipsec_starter[5173]: ipsec starter stopped
Sent SIGKILL to all processes
--> reload
any ideas? any other logs to check?
This thread was automatically locked due to age.