This discussion has been locked.

Why did HA fail on 9.707-5?

Hi everyone,

This morning my colleague noticed that all internet traffic had stopped working. It seemed that both HA nodes were in the active state. After shutting down one of the nodes, things started working again. Looking at the logs, I can see this:

2021:07:19-23:04:04 m-2 ha_daemon[4300]: id="38A2" severity="error" sys="System" sub="ha" seq="M: 407 04.766" name="send_backup_heartbeat(): send(): No buffer space available"
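For anyone who wants to filter the attached HA log for this kind of entry, here is a minimal Python sketch that splits such a line into its fields. It assumes every line follows the same layout as the one above (timestamp, host, process[pid], then key="value" pairs); the function name `parse_ha_line` is made up for this example and is not part of any Sophos tooling.

```python
import re

# Prefix as seen in the line above: timestamp, host, process name and PID.
PREFIX = re.compile(
    r'^(?P<timestamp>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) '
    r'(?P<host>\S+) (?P<process>[^\[\s]+)\[(?P<pid>\d+)\]: (?P<rest>.*)$'
)
# key="value" pairs that follow the prefix.
FIELD = re.compile(r'(\w+)="([^"]*)"')


def parse_ha_line(line):
    """Split one ha_daemon log line into a dict; return None if it does not match."""
    m = PREFIX.match(line.strip())
    if m is None:
        return None
    record = m.groupdict()
    rest = record.pop("rest")
    record.update(dict(FIELD.findall(rest)))
    return record


line = ('2021:07:19-23:04:04 m-2 ha_daemon[4300]: id="38A2" severity="error" '
        'sys="System" sub="ha" seq="M: 407 04.766" '
        'name="send_backup_heartbeat(): send(): No buffer space available"')
rec = parse_ha_line(line)
print(rec["timestamp"], rec["host"], rec["severity"], rec["name"])
```

With the fields separated, it becomes easy to list all entries with severity="error" per node and compare their timestamps with the kernel log.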

Kernel log shows this:

2021:07:19-23:00:31 m-2 kernel: [437910.124002] ------------[ cut here ]------------

2021:07:19-23:00:31 m-2 kernel: [437910.124014] WARNING: CPU: 3 PID: 6214 at net/sched/sch_generic.c:264 dev_watchdog+0xe6/0x181()

2021:07:19-23:00:31 m-2 kernel: [437910.124016] NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out

2021:07:19-23:00:31 m-2 kernel: [437910.124104] CPU: 3 PID: 6214 Comm: sasi Tainted: G           O 3.12.74-0.377903089.g4999875.rb3-smp64 #1

2021:07:19-23:00:31 m-2 kernel: [437910.124106] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018

2021:07:19-23:00:31 m-2 kernel: [437910.124107]  0000000000000000 ffffffff8136c181 ffffffff813074b0 ffffffff813074b0

2021:07:19-23:00:31 m-2 kernel: [437910.124109]  ffff88023fd83dd0 ffffffff81046a60 ffff880235358000 0000000000000000

2021:07:19-23:00:31 m-2 kernel: [437910.124111]  ffff880235358000 ffff880235358348 ffffffff813073ca ffffffff81046b11

2021:07:19-23:00:31 m-2 kernel: [437910.124113] Call Trace:

2021:07:19-23:00:31 m-2 kernel: [437910.124115] <IRQ> [<ffffffff8136c181>] ? dump_stack+0x61/0x80

2021:07:19-23:00:31 m-2 kernel: [437910.124122] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181

2021:07:19-23:00:31 m-2 kernel: [437910.124125] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181

2021:07:19-23:00:31 m-2 kernel: [437910.124131] [<ffffffff81046a60>] ? warn_slowpath_common+0x74/0x8b

2021:07:19-23:00:31 m-2 kernel: [437910.124133] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e

2021:07:19-23:00:31 m-2 kernel: [437910.124135] [<ffffffff81046b11>] ? warn_slowpath_fmt+0x45/0x4a

2021:07:19-23:00:31 m-2 kernel: [437910.124137] [<ffffffff8130738f>] ? netif_tx_lock+0x43/0x7e

2021:07:19-23:00:31 m-2 kernel: [437910.124143] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e

2021:07:19-23:00:31 m-2 kernel: [437910.124145] [<ffffffff813074b0>] ? dev_watchdog+0xe6/0x181

2021:07:19-23:00:31 m-2 kernel: [437910.124152] [<ffffffff81050bc3>] ? call_timer_fn+0x6a/0x10e

2021:07:19-23:00:31 m-2 kernel: [437910.124154] [<ffffffff813073ca>] ? netif_tx_lock+0x7e/0x7e

2021:07:19-23:00:31 m-2 kernel: [437910.124156] [<ffffffff81050ddd>] ? run_timer_softirq+0x176/0x1bd

2021:07:19-23:00:31 m-2 kernel: [437910.124160] [<ffffffff811cf36c>] ? timerqueue_add+0x79/0x94

2021:07:19-23:00:31 m-2 kernel: [437910.124163] [<ffffffff8104ae7a>] ? __do_softirq+0x128/0x24c

2021:07:19-23:00:31 m-2 kernel: [437910.124166] [<ffffffff813772dc>] ? call_softirq+0x1c/0x30

2021:07:19-23:00:31 m-2 kernel: [437910.124173] [<ffffffff8100f6c2>] ? do_softirq+0x3f/0x79

2021:07:19-23:00:31 m-2 kernel: [437910.124174] [<ffffffff8104ac7e>] ? irq_exit+0x46/0xa1

2021:07:19-23:00:31 m-2 kernel: [437910.124180] [<ffffffff810336f6>] ? smp_apic_timer_interrupt+0x22/0x2d

2021:07:19-23:00:31 m-2 kernel: [437910.124184] [<ffffffff8137661d>] ? apic_timer_interrupt+0x6d/0x80

2021:07:19-23:00:31 m-2 kernel: [437910.124185] <EOI>

2021:07:19-23:00:31 m-2 kernel: [437910.124187] ---[ end trace 2ab76b7259a68d8d ]---

2021:07:19-23:00:31 m-2 kernel: [437910.124197] e1000 0000:02:00.0 eth0: Reset adapter

2021:07:19-23:02:03 m-1 kernel: [437746.005143] IPv4: martian source 192.168.173.15 from 192.168.173.15, on dev lo

2021:07:19-23:02:03 m-1 kernel: [437746.005158] ll header: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 08 00        ..............

The last two lines keep repeating.
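To see how the watchdog timeout and the repeating martian messages line up in time, a small Python sketch like the following could be run over the kernel log. It only matches the two message formats quoted above; the function name `summarize` and the idea of counting per (source, device) pair are my own choices for illustration.

```python
import re
from collections import Counter

# Message formats taken from the kernel log lines quoted above.
WATCHDOG = re.compile(
    r'NETDEV WATCHDOG: (?P<dev>\S+) \((?P<driver>[^)]+)\): '
    r'transmit queue (?P<queue>\d+) timed out'
)
MARTIAN = re.compile(
    r'martian source (?P<src>\S+) from (?P<sender>\S+), on dev (?P<dev>\S+)'
)


def summarize(lines):
    """Return (watchdog events, martian counts) found in the given log lines."""
    watchdog = []
    martians = Counter()
    for line in lines:
        m = WATCHDOG.search(line)
        if m:
            timestamp = line.split(" ", 1)[0]
            watchdog.append((timestamp, m["dev"], m["driver"], int(m["queue"])))
            continue
        m = MARTIAN.search(line)
        if m:
            martians[(m["src"], m["dev"])] += 1
    return watchdog, martians


sample = [
    "2021:07:19-23:00:31 m-2 kernel: [437910.124016] NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out",
    "2021:07:19-23:02:03 m-1 kernel: [437746.005143] IPv4: martian source 192.168.173.15 from 192.168.173.15, on dev lo",
    "2021:07:19-23:02:03 m-1 kernel: [437746.005143] IPv4: martian source 192.168.173.15 from 192.168.173.15, on dev lo",
]
events, counts = summarize(sample)
print(events)
print(counts)
```

If the watchdog events on one node come shortly before the heartbeat errors, that would at least be consistent with the transmit queue on eth0 stalling first.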

I haven't seen the name="send_backup_heartbeat(): send(): No buffer space available" message in the HA logs until now. Has anyone else seen this behaviour, or does anyone have an explanation of what might have happened here? I've attached the full HA log of the firewall that was active after the incident.

Regards

asc

ha-log-active-firewall.txt

This thread was automatically locked due to age.

Amodin over 4 years ago

That error could have a number of causes, ranging from ICMP problems to a NIC failure, maybe even proxy authentication issues. "No buffer space available" has a number of things tied to it. I'm no expert on HA, but yes, it looks like a communication issue there. (I haven't looked at the full log you posted yet; no time at the moment.)

XG 19.5 GA 64-bit | Intel Xeon 4-core v3 1225 3.20Ghz
16GB Memory | 500GB SSD HDD | GB Ethernet x5
dirkkotte over 4 years ago

As mentioned by Amodin, it appears to be a network-related issue.

Which hypervisor are you using? Is it a supported one? Are you also keeping the hypervisor up to date? SG needs jumbo frames for the HA link. Most of the time, it is the (virtual) switches that have problems here.
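One way to check whether large frames really pass the (virtual) switch is a ping with the "don't fragment" bit set and a payload sized to the MTU under test. The Python sketch below only builds the command for the Linux iputils `ping` (`-M do` forbids fragmentation, `-s` sets the ICMP payload size). The MTU of 9000 and the peer address 198.51.100.2 are placeholders I assumed for the example, not values from this thread, and the test would have to be run from a Linux host on the HA segment.

```python
IPV4_HEADER = 20   # bytes, IPv4 header without options
ICMP_HEADER = 8    # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits into one unfragmented IPv4 packet."""
    if mtu <= IPV4_HEADER + ICMP_HEADER:
        raise ValueError("MTU too small for an ICMP echo request")
    return mtu - IPV4_HEADER - ICMP_HEADER


def build_ping_command(peer, mtu, count=3):
    """Build an iputils ping command that fails if the path cannot carry `mtu`."""
    return ["ping", "-M", "do", "-s", str(icmp_payload_for_mtu(mtu)),
            "-c", str(count), peer]


# Placeholder values: replace with the MTU and peer address of your own HA link.
cmd = build_ping_command("198.51.100.2", 9000)
print(" ".join(cmd))
```

If the ping works with a standard-size payload but fails with the large one, the switch or port group in between is probably not passing jumbo frames.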

Dirk

Systema Gesellschaft für angewandte Datentechnik mbH // Sophos Platinum Partner
Sophos Solution Partner since 2003