[7.911][BUG][DUPLICATE] Cluster not correctly processing traffic

Hi,

I've noticed that when traffic is routed via the HTTP/S proxy, the multipath rules are ignored and HTTP traffic leaves via the wrong outbound route.
When traffic is not routed via the HTTP/S proxy, multipath rules are correctly observed.

This has not been an issue until the latest upgrade.

Active/Active cluster.

Cheers,

Darren

0 Astaro Beta Bot over 15 years ago

Astaro Beta Report

--------------------------------

Version: 7.911

Type: BUG

State: CLOSED/DUPLICATE

Reporter: darrenl++

Contributor: 

MantisID: 13738

Target version: 7.912

Fixed in version: 7.912

--------------------------------

0 darrenl over 15 years ago

I tried shutting down the slave node (2). With the HTTP/S proxy back in transparent mode, the multipath rules are correctly observed.

0 darrenl over 15 years ago

Update:
Upon further investigation, this is *not* a multipath issue but an interface/cluster config issue:
Node 2 lost its config for Eth2 and placed the port into 'unassigned', so it could no longer route traffic over the 2nd WAN link (see attached pic).

When I restarted Node 2 after the previous tests, the log file shows that it went into a factory reset state (I had not told the cluster to do this!):

2010:05:14-23:51:06 mercury-1 ha_daemon[4154]: id="38A0" severity="info" sys="System" sub="ha" name="Activating sync process for config on node 2"
2010:05:14-23:51:07 mercury-1 ha_daemon[4154]: id="38A0" severity="info" sys="System" sub="ha" name="Activating sync process for config on node 2"
2010:05:14-23:51:09 mercury-2 ha_daemon[4376]: id="38A0" severity="info" sys="System" sub="ha" name="Reading cluster configuration"
2010:05:14-23:51:10 mercury-2 ha_factory_reset[5171]: id="38K0" severity="info" sys="System" sub="ha" name="HA factory reset runnnig on slave node! Aborting..."
2010:05:14-23:51:11 mercury-1 ha_daemon[4154]: id="38A0" severity="info" sys="System" sub="ha" name="HA daemon of node 2 is restarting, waiting 60 seconds before declaring node as dead"
2010:05:14-23:51:11 mercury-2 ha_daemon[4376]: id="38A0" severity="info" sys="System" sub="ha" name="--- Node is disabled ---"
2010:05:14-23:51:11 mercury-2 ha_daemon[4376]: id="38A0" severity="info" sys="System" sub="ha" name="HA daemon shutting down"
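
For anyone sifting through a longer log for the same symptoms, here is a minimal sketch that splits lines in the format shown above into fields and keeps only the HA events. It assumes every line follows the `timestamp host process[pid]: key="value" ...` layout seen in this excerpt; the names `parse_line` and `ha_events` are hypothetical helpers, not part of any Astaro tooling.

```python
import re
from typing import Dict, Iterable, Iterator, Optional

# key="value" pairs as they appear in the ha_daemon lines quoted above.
PAIR_RE = re.compile(r'(\w+)="([^"]*)"')
# "timestamp host process[pid]: message" prefix (layout assumed from the excerpt).
LINE_RE = re.compile(r'^(\S+) (\S+) ([\w-]+)\[(\d+)\]: (.*)$')


def parse_line(line: str) -> Dict[str, str]:
    """Return the fields of one log line, or an empty dict if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return {}
    timestamp, host, process, pid, rest = m.groups()
    record = {"timestamp": timestamp, "host": host, "process": process, "pid": pid}
    record.update(PAIR_RE.findall(rest))  # adds id, severity, sys, sub, name
    return record


def ha_events(lines: Iterable[str], host: Optional[str] = None) -> Iterator[Dict[str, str]]:
    """Yield parsed records whose sub field is "ha", optionally for one host only."""
    for line in lines:
        record = parse_line(line)
        if record.get("sub") != "ha":
            continue
        if host is not None and record["host"] != host:
            continue
        yield record


# Lines copied verbatim from the logs in this thread.
SAMPLE = [
    '2010:05:14-23:51:07 mercury-1 ha_daemon[4154]: id="38A0" severity="info" '
    'sys="System" sub="ha" name="Activating sync process for config on node 2"',
    '2010:05:14-23:51:10 mercury-2 ha_factory_reset[5171]: id="38K0" severity="info" '
    'sys="System" sub="ha" name="HA factory reset runnnig on slave node! Aborting..."',
    '2010:05:14-19:53:43 mercury-2 slon_control[4553]: PostgreSQL down!',
]

if __name__ == "__main__":
    for event in ha_events(SAMPLE, host="mercury-2"):
        print(event["timestamp"], event["process"], event["name"])
```

Filtering by host makes it easier to see what node 2 logged on its own, separate from the master's view of it.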

Following the second reboot of node 2 forced by the cluster, I logged into webadmin on node 1 and saw that the cluster configuration was only showing the master node (1); the slave node (2) was no longer present.

I connected a laptop to one of the network ports on node 2 and logged into webadmin. It still had its configuration data intact, but the cluster configuration was missing and Eth2 was in the unassigned state (see attached pic).
I reallocated Eth2 to the correct port setting and then manually rejoined the node to the cluster. The cluster finished syncing and came online.

All services are working but only Node 1 (master) is processing traffic - this issue has been seen and reported before.

0 darrenl over 15 years ago

Correction: Node 2 is not processing traffic via the HTTP/S proxy, only Node 1 is processing traffic.

0 darrenl over 15 years ago

This is when it seems to have gone pear-shaped:

2010:05:14-19:53:39 mercury-1 slon[11905]: [43-1] ERROR  slon_connectdb: PQconnectdb("dbname=pop3 host=198.19.250.2 user=ha_sync password=slony") failed - could not connect to
2010:05:14-19:53:39 mercury-1 slon[11905]: [43-2]  server: Connection refused
2010:05:14-19:53:39 mercury-1 slon[11905]: [43-3] Is the server running on host "198.19.250.2" and accepting
2010:05:14-19:53:39 mercury-1 slon[11905]: [43-4] TCP/IP connections on port 5432?
2010:05:14-19:53:39 mercury-1 slon[11905]: [44-1] WARN   remoteListenThread_2: DB connection failed - sleep 10 seconds
2010:05:14-19:53:39 mercury-1 slon[11906]: [41-1] CONFIG cleanupThread: thread starts
2010:05:14-19:53:39 mercury-1 slon[11906]: [42-1] CONFIG cleanupThread: bias = 35383
2010:05:14-19:53:39 mercury-1 slon[11906]: [43-1] ERROR  slon_connectdb: PQconnectdb("dbname=reporting host=198.19.250.2 user=ha_sync password=slony") failed - could not connect
2010:05:14-19:53:39 mercury-1 slon[11906]: [43-2]  to server: Connection refused
2010:05:14-19:53:39 mercury-1 slon[11906]: [43-3] Is the server running on host "198.19.250.2" and accepting
2010:05:14-19:53:39 mercury-1 slon[11906]: [43-4] TCP/IP connections on port 5432?
2010:05:14-19:53:39 mercury-1 slon[11906]: [44-1] WARN   remoteListenThread_2: DB connection failed - sleep 10 seconds
2010:05:14-19:53:39 mercury-1 slon[11905]: [45-1] CONFIG version for "dbname=pop3 user=ha_sync" is 80403
2010:05:14-19:53:39 mercury-1 slon[11905]: [46-1] CONFIG version for "dbname=pop3 user=ha_sync" is 80403
2010:05:14-19:53:39 mercury-1 slon[11906]: [45-1] CONFIG version for "dbname=reporting user=ha_sync" is 80403
2010:05:14-19:53:39 mercury-1 slon[11906]: [46-1] CONFIG version for "dbname=reporting user=ha_sync" is 80403
2010:05:14-19:53:39 mercury-1 slon[11906]: [47-1] CONFIG version for "dbname=reporting user=ha_sync" is 80403
2010:05:14-19:53:39 mercury-1 slon[11905]: [47-1] CONFIG version for "dbname=pop3 user=ha_sync" is 80403
2010:05:14-19:53:39 mercury-1 slon[11905]: [48-1] CONFIG remoteWorkerThread_2: update provider configuration
2010:05:14-19:53:39 mercury-1 slon[11906]: [48-1] CONFIG remoteWorkerThread_2: update provider configuration
2010:05:14-19:53:43 mercury-1 ha_daemon[4154]: id="38A0" severity="info" sys="System" sub="ha" name="Activating sync process for config on node 2"
2010:05:14-19:53:43 mercury-2 slon_control[4553]: Slonik error, process exited with value 255
2010:05:14-19:53:43 mercury-2 slon_control[4553]: Failed to drop slony schemas for reporting, process exited with value 2!
2010:05:14-19:53:43 mercury-2 slon_control[4553]: Slonik error, process exited with value 255
2010:05:14-19:53:43 mercury-2 slon_control[4553]: Failed to drop slony schemas for pop3, process exited with value 2!
2010:05:14-19:53:43 mercury-2 slon_control[4553]: PostgreSQL down!
2010:05:14-19:53:43 mercury-2 slon_control[4553]: PostgreSQL down!
2010:05:14-19:53:43 mercury-2 slon_control[4553]: Set mode to SLAVE
2010:05:14-19:53:43 mercury-2 slon_control[4553]: Slonik error, process exited with value 255
2010:05:14-19:53:43 mercury-2 slon_control[4553]: Failed to drop slony schemas for reporting, process exited with value 2!
2010:05:14-19:53:43 mercury-2 slon_control[4553]: Starting replication from Node 1 to 2
2010:05:14-19:53:43 mercury-2 slon_control[4553]: Found no tables for reporting! process exited with value 2
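
The "Connection refused" errors above mean that nothing was accepting TCP connections on 198.19.250.2 port 5432 when slon tried to connect. A minimal sketch for checking this by hand from the other node, assuming Python is available there; `port_reachable` is a hypothetical helper and the host and port are simply the ones that appear in the log:

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts alike.
        return False


# Example call, using the address and port from the slon errors above:
# port_reachable("198.19.250.2", 5432)
```

A `False` result only shows that the port is not reachable from where the check runs; it does not say why PostgreSQL was down on node 2.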

0 da_merlin over 15 years ago

The configuration seems to have become corrupted. Please install the workaround from the thread
[7.904][BUG] Interfaces unassigned themselves

0 darrenl over 15 years ago

I have not had any unscheduled power-downs on the servers, which seems to be the cause in the other thread; all of them are on protected power (UPS).
Node 2 was cleanly rebooted from webadmin and then ran into the issues above.

0 da_merlin over 15 years ago

It seems there were some driver issues with your Realtek network card on eth2 and eth3.
Unfortunately the boot.log was overwritten on each boot, so I'm unable to get the error message.

I have changed boot.log so that it is now appended to on each boot. The next time this issue happens, please mail me your boot.log.

0 darrenl over 15 years ago

Hi Ulrich,

Following your testing on the cluster yesterday, I checked another Astaro server that uses the same Realtek cards without a problem (although that server is stand-alone rather than clustered) and found that it had been configured with the plug and play setting in the BIOS switched off.
I adjusted the setting on the node (2) that kept losing its Eth2 config (plug and play is now off) and it's been stable through a number of reboots since last night.

It's a little bizarre that this problem has suddenly become so obvious - unless Astaro changed drivers/settings in the latest beta release :-)

Will continue to monitor....

0 darrenl over 15 years ago

Hi Ulrich,

I ran a new set of failover tests following some workarounds/changes to the cluster and have emailed you the results.