
XG HA: Primary Device Boot Loop after Failover to Auxiliary

We have a problem with SFOS 18.x

The primary device gets stuck in a boot loop after failing over to the auxiliary device, and it does not fail back.

This is the second time this has happened: the first time was with 18.0.0 GA-Build339, and now the same occurs with 18.0.1 MR-1.

Both devices were freshly reimaged from the ISO image, and the backup restore was done on the primary only. The HA cluster was running stable until the reload.

 

 

The devices are connected via a switch and an isolated VLAN over Port6. The interface was renamed to "Port6 HA" and assigned to a zone called "HA", which has SSH enabled.

 

Primary device: 192.168.0.10/24

Auxiliary device: 192.168.0.20/24

 

The primary device does not boot loop if we disconnect the HA peer link; it works in standalone mode, but it immediately reboots as soon as we reconnect the auxiliary.
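
A rough sketch of the basic reachability checks we can run from the auxiliary's advanced shell over the peer link (assuming ping and ssh are available there and the HA addresses above):

ping -c 3 192.168.0.10      # is the primary reachable on the HA VLAN at all?
ssh 192.168.0.10            # does anything answer on TCP/22 over the peer link?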

 

During the boot loop it looks like there is a sync between the primary and the auxiliary, which eventually forces the primary to reload.
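
While the primary is briefly up between reboots, it would also help to capture the HA state from both consoles; if I remember the SFOS console syntax correctly, it is something like:

console> system ha show details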

 

Primary console output during the boot loop:

 

Use the ^ and v keys to select which entry is highlighted.
Press enter to boot the selected OS, `e' to edit the commands
Booting `18_0_1_367'

Loading configuration
Performing automated file system integrity checks. It will take some time before your system is available.
Examining Config partition.....
Examining Signature partition.....
Examining Report partition.....
System will reboot
Sent SIGKILL to all processes
[ 59.449190] reboot: Restarting system
GNU GRUB version 2.02

Use the ^ and v keys to select which entry is highlighted.
Press enter to boot the selected OS, `e' to edit the commands
Booting `18_0_1_367'

Loading configuration
Performing automated file system integrity checks. It will take some time before your system is available.
Examining Config partition.....
Examining Signature partition.....
Examining Report partition.....
System will reboot
The system is gipsec_starter[5297]: charon stopped after 400 ms

ipsec_starter[5297]: ipsec starter stopped

Sent SIGKILL to all processes
[ 126.601181] reboot: Restarting system

 

 

 

 

 

Auxiliary device log files:

 

cat /log/ha_tunnel.log  

Apr 29 12:40:40 ssh: connect to host 192.168.0.10 port 22: Connection timed out
Apr 29 12:40:46 ssh: connect to host 192.168.0.10 port 22: Connection timed out
XG230_WP02_SFOS 18.0.1 MR-1# Apr 29 12:41:09 Timeout, server 192.168.0.10 not responding.
Apr 29 12:41:14 ssh: connect to host 192.168.0.10 port 22: Connection timed out
Apr 29 12:41:19 ssh: connect to host 192.168.0.10 port 22: Connection timed out
Apr 29 12:41:24 ssh: connect to host 192.168.0.10 port 22: Connection timed out
Apr 29 12:41:29 ssh: connect to host 192.168.0.10 port 22: Connection timed out
XG230_WP02_SFOS 18.0.1 MR-1# Connection to 192.168.0.10 closed by remote host.
Connection to 192.168.0.10 closed.
Apr 29 12:42:02 ssh: connect to host 192.168.0.10 port 22: Connection refused
Apr 29 12:42:03 ssh: connect to host 192.168.0.10 port 22: Connection refused
Apr 29 12:42:08 ssh: connect to host 192.168.0.10 port 22: Connection timed out
Apr 29 12:42:13 ssh: connect to host 192.168.0.10 port 22: Connection timed out

 

 

cat /log/csc.log  | grep 'ERROR     Apr 29 12:'

ERROR Apr 29 12:54:45 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'File exists'
PAckage ::::system::systemgraphERROR Apr 29 12:54:46 [listener:1177]: worker 21136 killed
ERROR Apr 29 12:54:46 [listener:1177]: csc_waitpid: pid(21136) exited with signal 9
ERROR Apr 29 12:54:46 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'File exists'
PAckage ::::system::systemgraphERROR Apr 29 12:54:47 [listener:1177]: worker 21138 killed
ERROR Apr 29 12:54:47 [listener:1177]: csc_waitpid: pid(21138) exited with signal 9
ERROR Apr 29 12:54:47 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'File exists'
ERROR Apr 29 12:54:47 [listener:1177]: worker 21126 killed

 

It's possible to connect to the primary via SSH over the peer link for one to two minutes; so far we were able to get some information from csc.log that way.
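
Since that window is only a minute or two, this is roughly what can be pulled in one go during that time (using the same log files quoted in this post):

grep 'ERROR' /log/csc.log | tail -n 200     # most recent ERROR lines from csc.log
tail -n 100 /log/ha_tunnel.log              # recent HA tunnel connection attempts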

Here are, in my opinion, the most suspicious parts:

 

 

XG230_WP02_SFOS 18.0.1 MR-1# tail -f /log/csc.log
ERROR Apr 29 14:22:27 [listener:1177]: worker 1192 killed
ERROR Apr 29 14:22:27 [listener:1177]: csc_waitpid: pid(1192) exited with signal 9
MESSAGE Apr 29 14:22:27 [postgres:1196]: Child exited with status 1
WARNING Apr 29 14:22:27 [postgres:1196]: action with nofail failed
MESSAGE Apr 29 14:22:27 [postgres:1196]: Child exited with status 1
WARNING Apr 29 14:22:27 [postgres:1196]: action with nofail failed
MESSAGE Apr 29 14:22:27 [system_startup:1504]: Child exited with status 1
WARNING Apr 29 14:22:27 [system_startup:1504]: action with nofail failed
MESSAGE Apr 29 14:22:27 [ha_tunnel:1242]: do_stop(): status = EXITING
MESSAGE Apr 29 14:22:27 [ha_tunnel:1242]: do_waitpid: pid 1522 exited with status 255
IFNAME_PREFIX: Port
IFNAME_SUFFIX: NUM
IFLIST: 8_PortA8
MESSAGE Apr 29 14:22:28 [system_startup:1504]: Child exited with status 1
WARNING Apr 29 14:22:28 [system_startup:1504]: action with nofail failed
ERROR Apr 29 14:22:29 [system_startup:1504]: nvram_get(utm9): failed with -12
WARNING Apr 29 14:22:29 [system_startup:1504]: action with nofail failed
ERROR Apr 29 14:22:29 [system_startup:1504]: nvram_eget(wz.restore): failed with -16
WARNING Apr 29 14:22:29 [system_startup:1504]: action with nofail failed
ERROR Apr 29 14:22:29 [system_startup:1504]: nvram_eget(wz.mandatory): failed with -16
WARNING Apr 29 14:22:29 [system_startup:1504]: action with nofail failed
ERROR Apr 29 14:22:30 [listener:1177]: worker 1191 killed

PAckage ::::cli::alertERROR Apr 29 14:22:30 [listener:1177]: csc_waitpid: pid(1191) exited with signal 9
ERROR Apr 29 14:22:30 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'Success'
ERROR Apr 29 14:22:30 [listener:1177]: worker 1615 killed
ERROR Apr 29 14:22:30 [listener:1177]: csc_waitpid: pid(1615) exited with signal 9
########## Package: firewall::atp
********** atp Read through ORM
********** objEntityDef: HASH(0xa2c6ba0)
ERROR Apr 29 14:22:30 [listener:1177]: worker 1680 killed
ERROR Apr 29 14:22:30 [listener:1177]: csc_waitpid: pid(1680) exited with signal 9
ERROR Apr 29 14:22:30 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'Success'
MESSAGE Apr 29 14:22:30 [ha_tunnel:1242]: do_stop(): status = EXITING
MESSAGE Apr 29 14:22:30 [ha_tunnel:1242]: do_waitpid: pid 1672 exited with status 255
ERROR Apr 29 14:22:31 [listener:1177]: worker 1681 killed
ERROR Apr 29 14:22:31 [listener:1177]: csc_waitpid: pid(1681) exited with signal 9
ERROR Apr 29 14:22:31 [system_startup:1504]: nvram_get(migrated_utm9): failed with -12
WARNING Apr 29 14:22:31 [system_startup:1504]: action with nofail failed
ERROR Apr 29 14:22:31 [listener:1177]: worker 1693 killed
ERROR Apr 29 14:22:31 [listener:1177]: csc_waitpid: pid(1693) exited with signal 9
ERROR Apr 29 14:22:31 [listener:1177]: worker 1730 killed
ERROR Apr 29 14:22:31 [listener:1177]: csc_waitpid: pid(1730) exited with signal 9
ERROR Apr 29 14:22:31 [listener:1177]: worker 1763 killed
ERROR Apr 29 14:22:31 [listener:1177]: csc_waitpid: pid(1763) exited with signal 9

 

MESSAGE Apr 29 14:22:56 [ospfd:1214]: do_stop(): status = UNTOUCHED
ERROR Apr 29 14:22:56 [listener:1177]: worker 2917 killed
ERROR Apr 29 14:22:56 [listener:1177]: csc_waitpid: pid(2917) exited with signal 9
ERROR Apr 29 14:22:56 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'File exists'
ERROR Apr 29 14:22:56 [timer:1274]: status_sched_node: timer pa_Port3 do not exist
MESSAGE Apr 29 14:22:56 [dev2cpumap:2447]: Child exited with status 255
WARNING Apr 29 14:22:56 [dev2cpumap:2447]: action with nofail failed
ERROR Apr 29 14:22:56 [timer:1274]: status_sched_node: timer pa_Port4 do not exist
MESSAGE Apr 29 14:22:56 [dev2cpumap:2447]: Child exited with status 255
WARNING Apr 29 14:22:56 [dev2cpumap:2447]: action with nofail failed
ERROR Apr 29 14:22:56 [listener:1177]: worker 2907 killed
ERROR Apr 29 14:22:56 [listener:1177]: csc_waitpid: pid(2907) exited with signal 9
ERROR Apr 29 14:22:56 [timer:1274]: status_sched_node: timer pa_Port5 do not exist
MESSAGE Apr 29 14:22:56 [dev2cpumap:2447]: Child exited with status 255
WARNING Apr 29 14:22:56 [dev2cpumap:2447]: action with nofail failed
ERROR Apr 29 14:22:56 [timer:1274]: status_sched_node: timer pa_Port6 do not exist
MESSAGE Apr 29 14:22:56 [dev2cpumap:2447]: Child exited with status 255
WARNING Apr 29 14:22:56 [dev2cpumap:2447]: action with nofail failed
ERROR Apr 29 14:22:56 [timer:1274]: status_sched_node: timer pa_Port7 do not exist

 

-----


DEBUG Apr 29 14:23:25 [showfw:5136]: exec: filename '/bin/sh'
DEBUG Apr 29 14:23:25 [showfw:5136]: exec: argv[0] = '/bin/sh'
DEBUG Apr 29 14:23:25 [showfw:5136]: exec: argv[1] = '-c'
DEBUG Apr 29 14:23:25 [showfw:5136]: exec: argv[2] = 'showfw |grep -v FW0|awk -F '=' '{printf("%s,", $2);}''
MESSAGE Apr 29 14:23:25 [ha_tunnel:1242]: do_stop(): status = EXITING
MESSAGE Apr 29 14:23:25 [ha_tunnel:1242]: do_waitpid: pid 4935 exited with status 255
INFO Apr 29 14:23:25 [showfw:5136]: Child exited with status 0
INFO Apr 29 14:23:25 [showfw:5136]: create_act_out_perl_obj: varname=out
INFO Apr 29 14:23:25 [showfw:5136]: create_act_out_perl_obj: out.status=0
INFO Apr 29 14:23:25 [showfw:5136]: create_act_out_perl_obj: TEXT out.output=18_0_0_339,18_0_1_367,
INFO Apr 29 14:23:25 [showfw:5136]: ACTION: CALL lcd_format
INFO Apr 29 14:23:25 [showfw:5136]: ACTION: DLOPEN(do_nvram_get, FW1)
INFO Apr 29 14:23:25 [showfw:5136]: do_nvram_get: FW1 in dv var
INFO Apr 29 14:23:25 [showfw:5136]: ACTION: CALL lcd_dversion
INFO Apr 29 14:23:25 [showfw:5136]: ACTION: DLOPEN(do_nvram_get, FW2)
INFO Apr 29 14:23:25 [showfw:5136]: do_nvram_get: FW2 in dv var
INFO Apr 29 14:23:25 [showfw:5136]: ACTION: CALL lcd_dversion
INFO Apr 29 14:23:25 [showfw:5136]: opcode 'showfw': time taken: 0.570195756 seconds
DEBUG Apr 29 14:23:25 [worker:5136]: worker_proc: going to release dbhandle for 5136(1,513601)
DEBUG Apr 29 14:23:25 [worker:5136]: Main TLV:{ 4, data:{ 1, 4, 41D670} data:{ 2, 4, 101400} data:{ 5, 4, 0000} , 27}
DEBUG Apr 29 14:23:25 [worker:5136]: response:
-----------
DEBUG Apr 29 14:23:25 [worker:5136]: request method = opcode
DEBUG Apr 29 14:23:25 [worker:5136]: request name = showfw
DEBUG Apr 29 14:23:25 [worker:5136]: request version = 1.1
DEBUG Apr 29 14:23:25 [worker:5136]: content length = 64
DEBUG Apr 29 14:23:25 [worker:5136]: content type = text
DEBUG Apr 29 14:23:25 [worker:5136]: response code = 200
DEBUG Apr 29 14:23:25 [worker:5136]: response string len = 2
DEBUG Apr 29 14:23:25 [worker:5136]: response string is = 'OK'
DEBUG Apr 29 14:23:25 [worker:5136]: response content = 'FW1=SFOS 18.0.0# GA-Build339 V#FW2=SFOS 18.0.1# MR-1 '
DEBUG Apr 29 14:23:25 [worker:5136]:
-----------
DEBUG Apr 29 14:23:25 [worker:5136]: # OPCODE Exited: 'showfw' with Status: '200'
DEBUG Apr 29 14:23:25 [worker:5136]: write_packet: 52+64+2
ERROR Apr 29 14:23:25 [listener:1177]: worker 5136 killed
ERROR Apr 29 14:23:25 [listener:1177]: csc_waitpid: pid(5136) exited with signal 9
ERROR Apr 29 14:23:25 [listener:1177]: worker 5150 killed
ERROR Apr 29 14:23:25 [listener:1177]: csc_waitpid: pid(5150) exited with signal 9
######### Package: system::webadminsettings
********** webadminsettings Read through ORM
********** objEntityDef: HASH(0xa30eac0)
ERROR Apr 29 14:23:25 [listener:1177]: worker 5139 killed
ERROR Apr 29 14:23:25 [listener:1177]: csc_waitpid: pid(5139) exited with signal 9
ERROR Apr 29 14:23:25 [listener:1177]: ln_recvfrom: fd '109.TCP.INET.auxilary': peer connection closed'File exists'
########## Package: system::localservice
********** localservice Read through ORM
********** objEntityDef: HASH(0xa2c7450)

 

 

IFLIST: 8_PortA8
ERROR Apr 29 14:23:56 [listener:1177]: worker 7821 killed
ERROR Apr 29 14:23:56 [listener:1177]: csc_waitpid: pid(7821) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7825 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7825) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7832 killed
ERROR Apr 29 14:23:57 [lic_appliance_update:5094]: nvram_get(ha_update): failed with -12
WARNING Apr 29 14:23:57 [lic_appliance_update:5094]: action with nofail failed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7832) exited with signal 9
ERROR Apr 29 14:23:57 [lic_appliance_update:5094]: nvram_get(app_update): failed with -12
WARNING Apr 29 14:23:57 [lic_appliance_update:5094]: action with nofail failed
ERROR Apr 29 14:23:57 [listener:1177]: worker 5094 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(5094) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7833 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7833) exited with signal 9
ERROR Apr 29 14:23:57 [nasm:1201]: do_real_start: before_start failed. aborting!
MESSAGE Apr 29 14:23:57 [nasm:1201]: do_stop(): status = UNTOUCHED
ERROR Apr 29 14:23:57 [nasm:1201]: do_start: service failed to start
ERROR Apr 29 14:23:57 [listener:1177]: worker 7844 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7844) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7860 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7860) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7876 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7876) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7887 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pid(7887) exited with signal 9
ERROR Apr 29 14:23:57 [listener:1177]: worker 7903 killed
ERROR Apr 29 14:23:57 [listener:1177]: csc_waitpid: pidipsec_starter[5173]: charon stopped after 400 ms

ipsec_starter[5173]: ipsec starter stopped

Sent SIGKILL to all processes

 

--> reload

 

 

 

 

Any ideas? Any other logs we should check?



  • FormerMember

    Hi

    Apologies for the inconvenience caused. Thank you for providing the logs; we are analyzing them.

    If you have an open case with technical support, please PM me the case number. 

    Thanks,

  • Hi Patel 

    Thank you for your fast response. 

    We were able to solve the issue; it was related to a network misconfiguration.

     

    Short answer:

    Spanning tree was enabled on the HA switch port, which caused a 15-20 second delay until the switch port moved through the "blocking, listening, learning" states into the "forwarding" state.

    In this time frame the primary XG device booted as standalone, which eventually led to a dual-active situation once the link came up.

    Since there was already an active auxiliary with higher uptime, the primary reloaded.

     

    Long Answer:  

    We have two Sophos XGs located in different datacenters, so the HA port is not directly connected with a single cable, as you would normally do if both devices were at the same location.

    We use a dedicated VLAN to transport the HA traffic between the devices.

    The switch ports on our Cat9400 switches were configured as access ports with spanning tree enabled:

    interface GigabitEthernet1/1/0/46
    description *****************FW-HA
    switchport access vlan 4000
    switchport mode access
    spanning-tree guard root
    switchport port-security
    switchport port-security maximum 2
    switchport port-security violation protect
    storm-control broadcast level pps 100 1000
    storm-control action trap
    load-interval 30

     

    When the link comes up during the boot sequence, the switch runs spanning tree on the port to detect whether there is a loop in the network.

    show spanning-tree interface gigabitEthernet 1/1/0/46

    VLAN4000            BLK 4         128.814  P2p Edge 

    VLAN4000            LST 4         128.814  P2p Edge 

    VLAN4000            LRN 4         128.814  P2p Edge 

     

    If no loop is detected, the switch eventually puts the port into the forwarding state:

    VLAN4000            FWD 4         128.814  P2p Edge 

     

    During this process the switch does not forward frames received on the port; the whole sequence takes around 15-20 seconds.

    Unfortunately, it looks like the primary XG device boots "too fast": it either sees no link on Port6, or it has a link but cannot reach the auxiliary, so it puts itself into standalone/master mode.
    Once spanning tree starts forwarding traffic on the link, the primary detects that there is already a master with higher uptime, which triggers a reload. Because of the reload and the resulting port flap, the whole process starts over again.
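
    To confirm the timing from the switch side, the same show command with the "detail" keyword is helpful on IOS/IOS-XE, since it also counts how often the port has transitioned into the forwarding state:

    show spanning-tree interface gigabitEthernet 1/1/0/46 detail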

     

    We were able to solve this by configuring "portfast" on the switch port, which effectively skips the spanning-tree listening and learning states and puts the port into forwarding right away. The command may differ with other switch vendors, but essentially the goal is to remove the spanning-tree delay on the HA port:

    interface GigabitEthernet1/1/0/46
    switchport access vlan 4000
    switchport mode access
    spanning-tree portfast
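
    After this change the port should go to forwarding as soon as the link comes up, which can be checked with something like the following (on newer IOS-XE releases the portfast keyword may be "portfast edge"):

    show spanning-tree interface gigabitEthernet 1/1/0/46 portfast
    show spanning-tree interface gigabitEthernet 1/1/0/46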

     

     

     
