Hello community,
This is my first post here. This weekend we updated our cluster to 18.0.1 MR-1-Build396.
After the update, both devices restarted. One took the master role and the other the auxiliary role, and all was fine... for about a minute.
Then the auxiliary device dropped into the Fault state. When I checked its local console, the following came up:
BUG: unable to handle kernel NULL pointer dereference at
[ 163.527682] IP: (null)
[ 163.537399] PGD 800000039b846067 P4D 800000039b846067 PUD 0
[ 163.554381] Oops: 0010 [#1] SMP PTI
[ 163.564855] Modules linked in: nf_conntrack_ipslb nfnetmap_queue(O) xt_master YN_DATA ip6t_ADVERTISEMENT ip6t_SOLICITATION xt_LBS ip6table_filter iptable_filt onntrack_tftp nf_nat_h323 nf_conntrack_h323 nf_nat_pptp
[ 163.777028] nf_conntrack_pptp cfg80211 usbhid hid_generic hid ohci_pci ohci_ ort xfrm4_mode_tunnel xfrm4_tunnel xfrm_user af_key xfrm_algo aesni_intel glue_h er ipt_rpfilter ebt_nflog ebt_pkttype xt_serviceset
[ 163.988262] xt_appset xt_hostset xt_pkttype xt_recent xt_state xt_status xt_ et ip_set_bitmap_fwrule ip_set_bitmap_ctrxss ip_set_bitmap_user sp2fp_api ip_set ip6_udp_tunnel ptp pps_core mdio i2c_i801 i2c_dev i2c_core
[ 164.201353] netmap(O) ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw es nfnetlink button evdev [last unloaded: nfnetmap_queue]
[ 164.315505] CPU: 5 PID: 21751 Comm: winbindd Tainted: G O 4.14.3
[ 164.337970] Hardware name: Sophos XG/XG, BIOS 5.11 06/01/2018
[ 164.355246] task: ffff8803b0257080 task.stack: ffffc90008304000
[ 164.373031] RIP: 0010: (null)
[ 164.384308] RSP: 0000:ffff88046dd43e18 EFLAGS: 00010202
[ 164.400016] RAX: ffffffffa083f700 RBX: ffff8803e5c0e780 RCX: ffff88044dde0400
[ 164.421468] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8803e5c0e780
[ 164.442919] RBP: ffff88044dde0410 R08: 0000000000000001 R09: 0000000000000001
[ 164.464343] R10: 0000000000000000 R11: ffffc90008307bf0 R12: ffff8804546b2000
[ 164.485767] R13: ffff8804546b2078 R14: ffff8804546b20a0 R15: 0000000000000008
[ 164.507190] FS: 0000000000000000(0000) GS:ffff88046dd40000(0063) knlGS:00000
[ 164.531487] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 164.548737] CR2: 0000000000000000 CR3: 000000039b924003 CR4: 00000000001606e0
[ 164.570164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 164.591585] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 164.613049] Call Trace:
[ 164.620439] <IRQ>
[ 164.626523] ? ip_rcv+0x316/0x4c0
[ 164.636509] ? ip_local_deliver_finish+0x1d0/0x1d0
[ 164.650941] ? __netif_receive_skb_core+0x3ec/0xac0
[ 164.665632] ? enqueue_task_fair+0x320/0x440
[ 164.678475] ? process_backlog+0x86/0x120
[ 164.690537] ? process_backlog+0x86/0x120
[ 164.702601] ? net_rx_action+0xcc/0x270
[ 164.714148] ? __do_softirq+0xc5/0x1ec
[ 164.725428] ? do_softirq_own_stack+0x2a/0x40
[ 164.738533] </IRQ>
[ 164.744901] ? do_softirq.part.2+0x3c/0x40
[ 164.757227] ? netif_rx_ni+0x1d/0x30
[ 164.767992] ? dev_loopback_xmit+0xa3/0xc0
[ 164.780317] ? ip_mc_output+0x176/0x240
[ 164.791860] ? ip_finish_output2+0x3b0/0x3b0
[ 164.804704] ? ip_send_skb+0x10/0x40
[ 164.815469] ? udp_send_skb+0x94/0x240
[ 164.826750] ? udp_sendmsg+0x2f8/0x8c0
[ 164.838037] ? release_sock+0x3b/0x90
[ 164.849059] ? sock_sendmsg+0xe/0x20
[ 164.859822] ? SyS_sendto+0xad/0x150
[ 164.870587] ? ep_poll_wakeup_proc+0x20/0x20
[ 164.883432] ? compat_SyS_socketcall+0x12c/0x210
[ 164.897319] ? do_int80_syscall_32+0x58/0x110
[ 164.910421] ? entry_INT80_compat+0x48/0x50
[ 164.923002] Code: Bad RIP value.
[ 164.933012] RIP: (null) RSP: ffff88046dd43e18
[ 164.948719] CR2: 0000000000000000
[ 164.958727] ---[ end trace 0c3cc4f11b5d6136 ]---
[ 164.958728] BUG: unable to handle kernel NULL pointer dereference at
[ 164.958729] IP: (null)
[ 164.958729] PGD 80000003a7ba4067 P4D 80000003a7ba4067 PUD 0
[ 164.958731] Oops: 0010 [#2] SMP PTI
[ 164.958732] Modules linked in: nf_conntrack_ipslb nfnetmap_queue(O) xt_master YN_DATA ip6t_ADVERTISEMENT ip6t_SOLICITATION xt_LBS ip6table_filter iptable_filt onntrack_tftp nf_nat_h323 nf_conntrack_h323 nf_nat_pptp
[ 164.958744] nf_conntrack_pptp cfg80211 usbhid hid_generic hid ohci_pci ohci_ ort xfrm4_mode_tunnel xfrm4_tunnel xfrm_user af_key xfrm_algo aesni_intel glue_h er ipt_rpfilter ebt_nflog ebt_pkttype xt_serviceset
[ 164.958758] xt_appset xt_hostset xt_pkttype xt_recent xt_state xt_status xt_ et ip_set_bitmap_fwrule ip_set_bitmap_ctrxss ip_set_bitmap_user sp2fp_api ip_set ip6_udp_tunnel ptp pps_core mdio i2c_i801 i2c_dev i2c_core
[ 164.958771] netmap(O) ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw es nfnetlink button evdev [last unloaded: nfnetmap_queue]
[ 164.958780] CPU: 6 PID: 21752 Comm: winbindd Tainted: G D O 4.14.3
[ 164.958780] Hardware name: Sophos XG/XG, BIOS 5.11 06/01/2018
[ 164.958780] task: ffff8803b0250000 task.stack: ffffc9000830c000
[ 164.958781] RIP: 0010: (null)
[ 164.958781] RSP: 0000:ffff88046dd83e18 EFLAGS: 00010202
[ 164.958782] RAX: ffffffffa083f700 RBX: ffff88039b9a03c0 RCX: ffff8803ae590200
[ 164.958782] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88039b9a03c0
[ 164.958782] RBP: ffff8803ae590210 R08: 0000000000000001 R09: 0000000000000001
[ 164.958783] R10: 0000000000000000 R11: ffffc9000830fbf0 R12: ffff8804546b2000
[ 164.958783] R13: ffff8804546b2078 R14: ffff8804546b20a0 R15: 0000000000000008
[ 164.958784] FS: 0000000000000000(0000) GS
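For anyone reading the two oopses above: the header reports the kernel as `Tainted: G O` on the first crash and `G D O` on the second. The following is a minimal Python sketch that maps those letters to their meanings as given in the Linux kernel's "Tainted kernels" documentation. It covers only the three letters seen here, and `decode_taint` and `TAINT_FLAGS` are hypothetical names I made up for illustration, not anything shipped with SFOS.

```python
import re

# Subset of the kernel's taint flags, limited to the letters that
# appear in the oops headers above (see the kernel "Tainted kernels" docs).
TAINT_FLAGS = {
    "G": "all loaded modules are GPL-compatible (a proprietary one would show 'P')",
    "D": "kernel died recently, i.e. there was a previous OOPS or BUG",
    "O": "an externally built (out-of-tree) module has been loaded",
}


def decode_taint(line: str) -> list:
    """Return 'letter: meaning' strings for the flags in a 'Tainted:' header.

    Hypothetical helper: assumes the flag letters sit between 'Tainted:'
    and the kernel version, separated by whitespace, as in this console output.
    """
    match = re.search(r"Tainted:\s+(.*?)\s+\d+\.\d+", line)
    if not match:
        return []
    letters = match.group(1).split()
    return [f"{c}: {TAINT_FLAGS.get(c, 'not covered by this sketch')}" for c in letters]


first = "[ 164.315505] CPU: 5 PID: 21751 Comm: winbindd Tainted: G O 4.14.3"
second = "[ 164.958780] CPU: 6 PID: 21752 Comm: winbindd Tainted: G D O 4.14.3"
for header in (first, second):
    print(decode_taint(header))
```

The extra `D` on the second oops only reflects that the first oops had already happened. The out-of-tree modules behind the `O` flag are the ones marked `(O)` in the module list: `nfnetmap_queue` and `netmap`.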
console> system ha show details
HA status : Enabled
Current Appliance Key : C430-------------
Peer Appliance Key : C430-----------
Current HA state : Standalone
Peer HA state : Fault
HA Config Mode : Active-Passive
Load Balancing : Not Applicable
Dedicated Port : Port4
Current Dedicated IP : 5.5.5.1
Peer Dedicated IP : 5.5.5.2
Monitoring Port :
Auxiliary Admin Port : bond1
Auxiliary Admin IP : 10.0.5.28
Auxiliary Admin IPv6 :
HA Cluster ID : 10
Keepalive request interval : 250
Keepalive attempts : 16
Hypervisor assigned MAC addresses : Disabled
HA preemption : Disabled
less /log/applog.log | grep ha:
Sep 29 13:07:27 ha_port_down_notification: message_id : log_data : Interface Port4 went down. Appliance HA state MAST
Sep 29 13:07:28 ha: handle_stat_change: 2:3 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
Sep 29 13:07:28 ha: handle_stat_change: g_ha_hsc=1 is set.
Sep 29 13:07:28 ha: g_ha_transmode=0 [ CONFIG=1 INIT=2 EVENT=0 ]
Sep 29 13:07:28 ha: start tracking the device
Sep 29 13:07:28 ha: fwm:disablearpha successfully done
Sep 29 13:07:28 ha: msync:applyha: no network changes reqd
Sep 29 13:07:28 ha: fwm:applyha successfully done
Sep 29 13:07:28 ha: fwm:enablearpha successfully done
Sep 29 13:07:29 ha: mail sent successfully
Sep 29 13:07:30 ha: syncing conntracks
Sep 29 13:07:30 ha: handle_stat_change: 2:3 done.
Sep 29 13:07:30 ha: handle_stat_change: g_ha_hsc=0 is set.
Sep 29 13:08:07 ha: handle_stat_change: 3:2 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
Sep 29 13:08:07 ha: handle_stat_change: g_ha_hsc=1 is set.
Sep 29 13:08:07 ha: g_ha_transmode=0 [ CONFIG=1 INIT=2 EVENT=0 ]
Sep 29 13:08:07 ha: start tracking the device
Sep 29 13:08:07 ha: fwm:disablearpha successfully done
Sep 29 13:08:07 ha: ctsyncd commited
Sep 29 13:08:07 ha: ctsyncd external cache flushed
Sep 29 13:08:07 ha: msync:applyha: prim->stand, so no network changes reqd
Sep 29 13:08:08 ha: fwm:applyha successfully done
Sep 29 13:08:10 ha: msync:garpha: send_arp
<lots of arps here>
Sep 29 13:08:10 ha: fwm:enablearpha successfully done
Sep 29 13:08:11 ha: mail sent successfully
Sep 29 13:08:11 ha: syncing conntracks
Sep 29 13:08:11 ha: handle_stat_change: 3:2 done.
Sep 29 13:08:11 ha: handle_stat_change: g_ha_hsc=0 is set.
Sep 29 13:09:15 ha: appcached_ha_sync function is called...!!!!
Sep 29 13:13:20 ha: redis DB dump file sync is done !!
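The `handle_stat_change: X:Y` lines encode each HA state change as numeric codes, and the legend is printed on the same line. Here is a minimal Python sketch that turns them into readable transitions. It assumes X is the old state and Y the new one, which matches the `msync:applyha: prim->stand` message logged next to `3:2` at 13:08:07. `STATES`, `PATTERN` and `decode_transitions` are hypothetical names for illustration only.

```python
import re

# Legend copied from the log line itself:
# [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]
STATES = {0: "NA", 1: "AUX", 2: "STAND", 3: "PRIM", 4: "FAULT", 5: "READY", 6: "GOTO_PRIM"}

# Assumption: "X:Y" means old state X -> new state Y.
# Only the line that starts a transition carries the "[ ... ]" legend,
# so the matching "X:Y done." lines are deliberately not matched.
PATTERN = re.compile(r"^(\w{3} +\d+ [\d:]+) ha: handle_stat_change: (\d+):(\d+) \[")


def decode_transitions(log_lines):
    """Yield (timestamp, old_state, new_state) for each state change that starts."""
    for line in log_lines:
        m = PATTERN.search(line)
        if m:
            ts, old, new = m.group(1), int(m.group(2)), int(m.group(3))
            yield ts, STATES.get(old, f"unknown({old})"), STATES.get(new, f"unknown({new})")


sample = [
    "Sep 29 13:07:28 ha: handle_stat_change: 2:3 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]",
    "Sep 29 13:07:30 ha: handle_stat_change: 2:3 done.",
    "Sep 29 13:08:07 ha: handle_stat_change: 3:2 [ NA=0 AUX=1 STAND=2 PRIM=3 FAULT=4 READY=5 GOTO_PRIM=6 ]",
]
for ts, old, new in decode_transitions(sample):
    print(f"{ts}: {old} -> {new}")
```

On the log above, this gives STAND -> PRIM at 13:07:28 (right after Port4, the dedicated HA port, went down) and PRIM -> STAND at 13:08:07. That is consistent with `system ha show details` reporting this appliance as Standalone and the peer as Fault.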
If anyone has managed to fix this on their cluster, any help would be appreciated.
Kind regards,
Sascha