[Solved] Trouble with HA over Intel Gigabit 4P I350-t Adapter

Hello all,

I am having trouble with the HA/Cluster interface on our Dell PowerEdge R420 servers, each with two onboard Broadcom NetXtreme BCM5720 NICs and an additional Intel I350 4-port card, since we updated from version 9.106-17 to 9.111-007.

We use one port of the Intel card as the HA/Cluster interface, with all settings left on "automatic".

Both machines are identical to each other.

What we have done so far:
1. replaced the cross-link cable
2. reinstalled UTM 9.111-007 directly from the ISO, without updating from previous versions
3. replaced the Intel network card on one machine, which had given an I/O error with mii-diag -s eth4
4. updated the network card firmware on both machines from 14.5.9 to 15.0.28
5. replaced the cable again
6. reinstalled the other machine from the ISO
7. hard-set the speed of the interface in the BIOS

The result is always the same: after a while (45-90 minutes), the dashboard shows:
Interface: eth4 Name: HA/Cluster Type: Ethernet Status: On Link: Down

ethtool eth4: link established: no / link speed: unknown (on both machines)
mii-diag -s eth4: Link not established OR SIOCGMIIREG on eth4 failed: Input/output error
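
The same link state can also be read without ethtool or mii-diag. The following is a minimal sketch in Python, assuming the standard Linux sysfs attributes under /sys/class/net are available on the appliance; the function name read_link_state is hypothetical and not part of any Sophos tooling.

```python
from pathlib import Path


def read_link_state(iface, sysfs_root="/sys/class/net"):
    """Return carrier, operstate, speed and MTU of iface as strings (or None)."""
    base = Path(sysfs_root) / iface
    state = {}
    for attr in ("carrier", "operstate", "speed", "mtu"):
        try:
            state[attr] = (base / attr).read_text().strip()
        except OSError:
            # Some attributes cannot be read while the link or interface is down,
            # and a missing interface has no attribute files at all.
            state[attr] = None
    return state
```

Calling read_link_state("eth4") on both nodes at regular intervals (for example from cron) would show which side loses the carrier first, and whether the MTU really matches on both ends.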

Sometimes this error occurs on node 1, sometimes on node 2 but NEVER on both nodes at the same time.

Also, this error only occurs on the cross-linked port of the Intel card and not on the other interfaces, which always carry traffic.

lsmod shows that the modules igb and tg3 are loaded and the driver version of the Intel card is 5.0.6

I have no idea what happened, but before the update everything worked fine.

I have searched for other topics on this problem and found that there was an Intel network driver update in one of the versions we had not applied before. I also found Mantis ID #30669 at https://community.sophos.com/products/unified-threat-management/astaroorg/f/81/t/65555, which sounds similar to our problem, but I have no idea where I can get this patch to try it out.

Does any of you have a hint for me? I could post millions of logs, but none of them pointed to a plausible cause. The only thing I can see in the system logs of one of the nodes is that the auto-negotiation switches from 1000 Mbps to 100 to 10 to 100 to 10 to down. The other node just reports "down".

Please help: this cluster is not in production at the moment but has to be in one or two weeks, so we need a working failover.

Kind regards

This thread was automatically locked due to age.

0 BrucekConvergent over 12 years ago

You need to open a case with Sophos Support. I suspect that your issue is related to a known Intel NIC driver issue. For previous versions they had a hotfix kernel that fixes it, but it is only available directly from support (I have actually installed it on one customer system, but support told me not to make it publicly available). Look in your kernel log for a message about the interface going down; if you see one, then it's that bug. 9.201 contains the bug too, so you would still need to contact support.

CTO, Convergent Information Security Solutions, LLC

https://www.convergesecurity.com

Advice given as posted on this forum does not construe a support relationship or other relationship with Convergent Information Security Solutions, LLC or its subsidiaries. Use the advice given at your own risk.

0 BerndR over 12 years ago

Thank you for your reply, but in the kernel log I can only see this notice on the master and the slave: "igb: eth4 NIC Link is Down"
Nothing more, nothing less.
Do you believe this points to a kernel bug?
On the other hand, if I reboot both machines, HA works... for a little while.
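
To see the whole flapping sequence (1000 to 100 to 10 to down) rather than single lines, the kernel log can be filtered for igb link messages. This is only a sketch: the "Link is Down" text is taken from the log quoted above, while the format of the "Link is Up <speed> Mbps" lines is an assumption based on typical igb driver output and should be checked against the real log. The name link_events is hypothetical.

```python
import re

# "Down" format is from the thread; the "Up <n> Mbps" form is assumed.
LINK_RE = re.compile(
    r"igb: (?P<iface>\S+) NIC Link is (?P<state>Up|Down)"
    r"(?: (?P<speed>\d+) Mbps)?"
)


def link_events(lines, iface="eth4"):
    """Yield the speed in Mbps (int) or 'down' for each link message on iface."""
    for line in lines:
        m = LINK_RE.search(line)
        if not m or m.group("iface") != iface:
            continue
        if m.group("state") == "Down":
            yield "down"
        else:
            yield int(m.group("speed")) if m.group("speed") else "up"
```

Feeding the kernel log of each node through list(link_events(open(path))) gives one compact sequence per node, which makes it easy to compare which side renegotiates and which side only sees the link drop.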

0 BrucekConvergent over 12 years ago in reply to BerndR

I would start a support case with Sophos Support ASAP. HA can be difficult to troubleshoot without the inside knowledge they have.

0 BAlfson over 12 years ago

Hi, Bernd, and welcome to the User BB!

What do you see in the High Availability log when the link goes down? Is the MTU the same on both interfaces? Have you tried setting them to Half-Duplex in addition to fixed 1 Gbps?

You do mean "Hot-Standby" instead of "Cluster," don't you?

Cheers - Bob

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA

0 BerndR over 12 years ago

Hello Bob,

I have opened a support request for this case.
You are right, I mean "Hot-Standby". I had forgotten that there is also a "Cluster" mode.
The MTU is 2000 on both machines and interfaces.

The log only shows this:
2014:05:05-12:46:04 fw-2 ha_daemon[4379]: id="38A0" severity="info" sys="System" sub="ha" name="Monitoring interfaces for link beat: lag0 lag1 "
2014:05:05-12:37:09 fw-1 ha_daemon[4431]: id="38A0" severity="info" sys="System" sub="ha" name="Initial synchronization finished!"
2014:05:05-12:49:56 fw-2 ha_daemon[4379]: id="38A0" severity="info" sys="System" sub="ha" name="Node 1 changed state: SYNCING -> ACTIVE"
2014:05:05-15:23:39 fw-2 ha_daemon[4379]: id="38A3" severity="debug" sys="System" sub="ha" name="Netlink: Lost link beat on eth4!"
2014:05:05-15:23:39 fw-2 conntrack-tools[4967]: no dedicated links available!
2014:05:05-15:23:41 fw-2 ha_daemon[4379]: id="38C1" severity="info" sys="System" sub="ha" name="Node 1 is dead, received no heart beats!"
2014:05:05-15:23:41 fw-2 repctl[20592]: daemonize_check(1864): trying to signal daemon
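
For longer excerpts it can help to parse these lines instead of reading them by eye. The sketch below assumes every line follows the layout visible above (timestamp, host, process[pid], then either key="value" pairs or free text); parse_line is a hypothetical helper, not a Sophos tool.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r'^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) '
    r'(?P<host>\S+) (?P<proc>[\w-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$'
)
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Split one log line into timestamp, host, process, pid, message and fields."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y:%m:%d-%H:%M:%S")
    # Lines without key="value" pairs (e.g. conntrack-tools) get an empty dict.
    rec["fields"] = dict(FIELD_RE.findall(rec["msg"]))
    return rec
```

Subtracting the parsed timestamps of the "SYNCING -> ACTIVE" line and the "Lost link beat on eth4!" line above gives 2 h 33 min 43 s, so in this excerpt the link held considerably longer than the 45-90 minutes seen earlier.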

I have not tried half-duplex mode, but let's face it: that can hardly be the answer.

0 BrucekConvergent over 12 years ago in reply to BerndR

I would try setting the MTU on the HA interfaces back to 1500 (this can be done via the HA_UTILS utility from the shell). I have at times had issues with higher MTUs set on the HA interfaces.

Other than that, time to have Sophos Support take a crack at it.

0 BerndR over 12 years ago

I know it has been a while since I asked for your help, but I really do not like threads without a solution, for the sake of people who may have the same problem.
In short: it works now, but I have no idea why.
Because I did not want to be asked whether the firmware and BIOS of both machines were up to date, I first updated the firmware of the Intel network adapter.
This did not solve my problem. The link was lost after 30 minutes at most.
After this I updated Sophos UTM to version 9.201-23.1 and restored our backup from the previous version. I switched the master and slave roles, manually updated the other host and rejoined the nodes into an HA Hot-Standby.
Same behavior.
Then we had a lot of trouble with one of the two machines (overheating), so we had to replace the complete motherboard of one HA member.
Because of this, and because Sophos advised me to reinstall the whole machine (due to the changed MAC addresses), I did so.
Remember: the HA link is on the Intel network card and not on the onboard devices!
After reinstalling everything, a new problem appeared. The sync of the slave failed with these error messages (fw-2 was in master state, fw-1 was the reinstalled machine):

2014:05:23-15:09:09 fw-2 postgres[23670]: [3-1] FATAL:  role "repmgr" does not exist
2014:05:23-15:09:09 fw-1 postgres[13856]: [3-1] FATAL:  could not connect to the primary server: FATAL:  role "repmgr" does not exist
2014:05:23-15:09:09 fw-1 postgres[13856]: [3-2]
2014:05:23-15:09:10 fw-2 postgres[23694]: [3-1] FATAL:  role "smtp" does not exist
2014:05:23-15:09:10 fw-2 postgres[23711]: [3-1] ERROR:  database "reporting" does not exist
2014:05:23-15:09:10 fw-2 postgres[23711]: [3-2] STATEMENT:  DROP DATABASE reporting;
2014:05:23-15:09:10 fw-2 postgres[23711]: [3-3]
2014:05:23-15:09:10 fw-2 postgres[23713]: [3-1] ERROR:  tablespace "reporting" does not exist
2014:05:23-15:09:10 fw-2 postgres[23713]: [3-2] STATEMENT:  DROP TABLESPACE reporting;
2014:05:23-15:09:10 fw-2 postgres[23715]: [3-1] ERROR:  role "reporting" does not exist
2014:05:23-15:09:10 fw-2 postgres[23715]: [3-2] STATEMENT:  DROP ROLE reporting;
2014:05:23-15:09:10 fw-2 postgres[23715]: [3-3]
2014:05:23-15:09:10 fw-2 postgres[23722]: [3-1] FATAL:  role "smtp" does not exist

2014:05:23-15:16:20 fw-2 postgres[23727]: [146-1] ERROR:  schema "repmgr_asg" does not exist at character 13
2014:05:23-15:16:20 fw-2 postgres[23727]: [146-2] STATEMENT:  insert into repmgr_asg.repl_monitor (
2014:05:23-15:16:20 fw-2 postgres[23727]: [146-3]   primary_node, standby_node,
2014:05:23-15:16:20 fw-2 postgres[23727]: [146-4]   last_monitor_time,
2014:05:23-15:16:20 fw-2 postgres[23727]: [146-5]   last_wal_primary_location, last_wal_standby_location,
2014:05:23-15:16:20 fw-2 postgres[23727]: [146-6]   replication_lag, apply_lag
2014:05:23-15:16:20 fw-2 postgres[23727]: [146-7] ) values (  $1, $2, $3, $4, $5,   pg_xlog_location_diff($4, $5), pg_xlog_location_diff($4, $6))
2014:05:23-15:16:23 fw-2 postgres[23727]: [147-1] ERROR:  schema "repmgr_asg" does not exist at character 13
2014:05:23-15:16:23 fw-2 postgres[23727]: [147-2] STATEMENT:  insert into repmgr_asg.repl_monitor (
2014:05:23-15:16:23 fw-2 postgres[23727]: [147-3]   primary_node, standby_node,
2014:05:23-15:16:23 fw-2 postgres[23727]: [147-4]   last_monitor_time,
2014:05:23-15:16:23 fw-2 postgres[23727]: [147-5]   last_wal_primary_location, last_wal_standby_location,
2014:05:23-15:16:23 fw-2 postgres[23727]: [147-6]   replication_lag, apply_lag
2014:05:23-15:16:23 fw-2 postgres[23727]: [147-7] ) values (  $1, $2, $3, $4, $5,   pg_xlog_location_diff($4, $5), pg_xlog_location_diff($4, $6))
2014:05:23-15:16:24 fw-1 postgres[14113]: [3-1] FATAL:  database system identifier differs between the primary and standby
2014:05:23-15:16:24 fw-1 postgres[14113]: [3-2] DETAIL:  The primary's identifier is 6016607257284791677, the standby's identifier is 6011361465832814575.
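
The excerpt above repeats a small number of distinct errors. A minimal sketch for condensing it, assuming the message wording shown in these lines (summarize is a hypothetical helper):

```python
import re

MISSING_RE = re.compile(
    r'(role|database|tablespace|schema) "([^"]+)" does not exist'
)
IDENT_RE = re.compile(
    r"primary's identifier is (\d+), the standby's identifier is (\d+)"
)


def summarize(lines):
    """Return (sorted list of missing objects, (primary_id, standby_id) or None)."""
    missing = set()
    idents = None
    for line in lines:
        m = MISSING_RE.search(line)
        if m:
            missing.add((m.group(1), m.group(2)))
        m = IDENT_RE.search(line)
        if m:
            idents = (m.group(1), m.group(2))
    return sorted(missing), idents
```

On the lines above this reports the roles repmgr, smtp and reporting, the database and tablespace reporting, the schema repmgr_asg, and two different system identifiers. PostgreSQL assigns the database system identifier when a cluster is initialized, and a streaming standby has to be a copy of the primary's cluster, so differing identifiers are consistent with the reinstalled node running its own freshly initialized database rather than a clone of the master's.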

So this looked like a corrupted database.
The command /etc/init.d/postgresql rebuild, as mentioned in the thread "New UTM9 install corrupted postgres db", did not work for me. The logs always told me that roles and schemas were missing.
I swear... if I had had a flamethrower at that time I would have burned down our "firewall". ;-)

So... after another day at our datacenter, between the romantic noises of the air-conditioning plant and fans and temperatures between ice cubes and melting iron (it was hot outside that day), I first updated all firmware and the BIOS of both machines. Then I reinstalled both machines with Sophos UTM 9.201.23-1 from a classic DVD instead of the ISO mounter of the Dell iDRAC system.

What can I say... now it works.
No more HA link errors, no database errors... all is fine.
Here is a short overview of the current BIOS and firmware versions of our Dell PowerEdge R420:
Version change detected for BIOS firmware. Previous version:1.5.2, Current version:2.2.0

Version change detected for PERC H310 Mini firmware. Previous version:20.12.0-0004, Current version:20.12.1-0002

Version change detected for Broadcom Gigabit Ethernet BCM5720 - 90:B1:1C:XX:XX:XX firmware. Previous version:7.6.14, Current version:7.8.16
Version change detected for Broadcom Gigabit Ethernet BCM5720 - 90:B1:1C:XX:XX:XX firmware. Previous version:7.6.14, Current version:7.8.16

Successfully updated iDRAC7, 1.56.55, A00.
Firmware version change detected on Lifecycle Controller, 1.4.0.128, A00. Previous version: 1.1.5.165, Current version: 1.4.0.128
Firmware version change detected on Enterprise UEFI Diagnostics, 4233A1, 4233.3. Previous version: 4225A2, Current version: 4233A1

Intel(R) Gigabit 4P I350-t Adapter - 15.0.28

Thank you all for this great Bulletin Board, your questions, tasks, answers and hints. :-)

One question left... how can I mark this thread as "solved"?

0 BAlfson over 12 years ago

Thanks for posting the result, Bernd. I'll add [Solved] to the title.

In V9, the PostgreSQL command has changed: /etc/init.d/postgresql92 rebuild

Cheers - Bob
