Hi everyone,
This morning my colleague noticed that no internet traffic was getting through at all. It looked as if both HA nodes were in the active state at the same time. After we shut down one of the nodes, everything started working again. In the logs I found the following:
2021:07:19-23:04:04 m-2 ha_daemon[4300]: id="38A2" severity="error" sys="System" sub="ha" seq="M: 407 04.766" name="send_backup_heartbeat(): send(): No buffer space available"
2021:07:19-23:00:31 m-2 kernel: [437910.124002] ------------[ cut here ]------------
2021:07:19-23:00:31 m-2 kernel: [437910.124014] WARNING: CPU: 3 PID: 6214 at net/sched/sch_generic.c:264 dev_watchdog+0xe6/0x181()
2021:07:19-23:00:31 m-2 kernel: [437910.124016] NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out
2021:07:19-23:00:31 m-2 kernel: [437910.124104] CPU: 3 PID: 6214 Comm: sasi Tainted: G O 3.12.74-0.377903089.g4999875.rb3-smp64 #1
2021:07:19-23:00:31 m-2 kernel: [437910.124106] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
2021:07:19-23:00:31 m-2 kernel: [437910.124107] 0000000000000000 ffffffff8136c181 ffffffff813074b0 ffffffff813074b0
2021:07:19-23:00:31 m-2 kernel: [437910.124109] ffff88023fd83dd0 ffffffff81046a60 ffff880235358000 0000000000000000
2021:07:19-23:00:31 m-2 kernel: [437910.124111] ffff880235358000 ffff880235358348 ffffffff813073ca ffffffff81046b11
2021:07:19-23:00:31 m-2 kernel: [437910.124113] Call Trace:
2021:07:19-23:00:31 m-2 kernel: [437910.124115] <IRQ> [<ffffffff8136c181>] ? dump_stack+0x61/0x80
2021:07:19-23:00:31 m-2 kernel: [437910.124122] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181
2021:07:19-23:00:31 m-2 kernel: [437910.124125] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181
2021:07:19-23:00:31 m-2 kernel: [437910.124131] [<ffffffff81046a60>] ? warn_slowpath_common+0x74/0x8b
2021:07:19-23:00:31 m-2 kernel: [437910.124133] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124135] [<ffffffff81046b11>] ? warn_slowpath_fmt+0x45/0x4a
2021:07:19-23:00:31 m-2 kernel: [437910.124137] [<ffffffff8130738f>] ? netif_tx_lock+0x43/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124143] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124145] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181
2021:07:19-23:00:31 m-2 kernel: [437910.124152] [<ffffffff81050bc3>] ? call_timer_fn+0x6a/0x10e
2021:07:19-23:00:31 m-2 kernel: [437910.124154] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e
2021:07:19-23:00:31 m-2 kernel: [437910.124156] [<ffffffff81050ddd>] ? run_timer_softirq+0x176/0x1bd
2021:07:19-23:00:31 m-2 kernel: [437910.124160] [<ffffffff811cf36c>] ? timerqueue_add+0x79/0x94
2021:07:19-23:00:31 m-2 kernel: [437910.124163] [<ffffffff8104ae7a>] ? __do_softirq+0x128/0x24c
2021:07:19-23:00:31 m-2 kernel: [437910.124166] [<ffffffff813772dc>] ? call_softirq+0x1c/0x30
2021:07:19-23:00:31 m-2 kernel: [437910.124173] [<ffffffff8100f6c2>] ? do_softirq+0x3f/0x79
2021:07:19-23:00:31 m-2 kernel: [437910.124174] [<ffffffff8104ac7e>] ? irq_exit+0x46/0xa1
2021:07:19-23:00:31 m-2 kernel: [437910.124180] [<ffffffff810336f6>] ? smp_apic_timer_interrupt+0x22/0x2d
2021:07:19-23:00:31 m-2 kernel: [437910.124184] [<ffffffff8137661d>] ? apic_timer_interrupt+0x6d/0x80
2021:07:19-23:00:31 m-2 kernel: [437910.124185] <EOI>
2021:07:19-23:00:31 m-2 kernel: [437910.124187] ---[ end trace 2ab76b7259a68d8d ]---
2021:07:19-23:00:31 m-2 kernel: [437910.124197] e1000 0000:02:00.0 eth0: Reset adapter
2021:07:19-23:02:03 m-1 kernel: [437746.005143] IPv4: martian source 192.168.173.15 from 192.168.173.15, on dev lo
2021:07:19-23:02:03 m-1 kernel: [437746.005158] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 08 00 ..............

I had never seen the name="send_backup_heartbeat(): send(): No buffer space available" message in the HA logs until now. Has anyone else run into this behaviour, or does anyone have an explanation for what might have happened here? I've attached the full HA log of the firewall that was active after the incident.
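When comparing the logs of the two nodes, it can help to pull out only the lines of interest and put them in time order. The following is a minimal Python sketch, assuming every line follows the `YYYY:MM:DD-HH:MM:SS host process: message` layout seen in the excerpt above. The `MARKERS` table and the `timeline` helper are hypothetical names chosen for illustration; they are not part of any UTM tool, and the substrings to flag are only a guess at what is relevant.

```python
import re
from datetime import datetime

# Assumed line layout, inferred from the excerpt above:
#   YYYY:MM:DD-HH:MM:SS <host> <process>[<pid>]: <message>
LINE_RE = re.compile(r"^(\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) (\S+) ([^:]+): (.*)$")

# Hypothetical substrings worth flagging, mapped to short labels.
MARKERS = {
    "NETDEV WATCHDOG": "tx queue timeout",
    "Reset adapter": "NIC reset",
    "No buffer space available": "heartbeat send failed (ENOBUFS)",
    "martian source": "martian packet",
}


def timeline(lines):
    """Return (timestamp, host, label) tuples for flagged lines, in time order."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not match the assumed layout
        ts, host, _proc, msg = m.groups()
        for needle, label in MARKERS.items():
            if needle in msg:
                when = datetime.strptime(ts, "%Y:%m:%d-%H:%M:%S")
                events.append((when, host, label))
                break  # one label per line is enough
    # Sort by timestamp only; Python's sort is stable, so lines sharing
    # a timestamp keep the order in which they appeared in the input.
    return sorted(events, key=lambda e: e[0])
```

For example, `timeline(open("ha.log"))` on an exported log file (the file name is a placeholder) would list the watchdog timeout, the adapter reset and the failed heartbeat sends side by side with their host names, which makes it easier to see which node logged what and when.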