[7.950][BUG][FIXED] Cluster fails to upgrade to 7.950

Hi,

The cluster downloaded the updates and I tried to kick off the automatic update process. The slave node received the update first (as usual) and appeared to apply it without errors. It rebooted automatically but never left the UP2DATE state, even though the log file showed that a successful sync had occurred. I left it for about 8 hours with no change, then rebooted the slave node. It came back up, but the master continued to report it as being in UP2DATE mode.

Cheers,

Darren

0 Astaro Beta Bot over 15 years ago

Astaro Beta Report
--------------------------------
Version: 7.950
Type: BUG
State: MERGED/FIXED
Reporter: darrenl++
Contributor: 
MantisID: 14205
Target version: 8.000
Fixed in version: 8.000
--------------------------------

0 darrenl over 15 years ago

The only way I could get the cluster to update was to shut down the slave node (which had already updated to 7.950), reboot the master, then run up2date on the master node so that it updated to 7.950. Once the master had rebooted, I turned the slave node back on and the node sync completed successfully, which allowed both nodes to return to the active state.

The cluster had a config error again, which seems to occur every time either a new node is joined or the up2date process runs:
2010:06:20-01:10:35 mercury-3 slon[15430]: [30-1] ERROR  cannot get sl_local_node_id - ERROR:  schema "_asg_cluster" does not exist
2010:06:20-01:10:35 mercury-3 slon[15430]: [30-2] LINE 1: select last_value::int4 from "_asg_cluster".sl_local_node_id
2010:06:20-01:10:35 mercury-3 slon[15430]: [30-3]                                      ^
2010:06:20-01:10:35 mercury-3 slon[15430]: [31-1] FATAL  main: Node is not initialized properly - sleep 10s

This error is usually fixed during the early-morning automated database sync/clean-up process, but sometimes it takes two cycles to clear. Impact on environment: the node showing the error does not process traffic until it is fixed.
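For anyone who wants to spot this condition without reading the log by eye, here is a minimal Python sketch that scans log lines for the FATAL message quoted above. The regular expression is only inferred from the four lines in this post, and the function name `find_uninitialized_nodes` is made up for illustration; it is not part of any Astaro tooling.

```python
import re

# Line layout inferred from the log excerpt above (an assumption, not a spec):
# <timestamp> <host> slon[<pid>]: [<seq>] <LEVEL>  <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+slon\[(?P<pid>\d+)\]:\s+"
    r"\[(?P<seq>\d+-\d+)\]\s+(?P<level>ERROR|FATAL)\s+(?P<msg>.*)$"
)


def find_uninitialized_nodes(lines):
    """Return (timestamp, host) pairs for each 'not initialized' FATAL line."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # continuation lines such as "[30-2] LINE 1: ..." are skipped
        if (match.group("level") == "FATAL"
                and "Node is not initialized properly" in match.group("msg")):
            hits.append((match.group("ts"), match.group("host")))
    return hits
```

Passing the lines of a log file to `find_uninitialized_nodes` gives one entry per occurrence, which makes it easy to see whether the error cleared after the next clean-up cycle.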

0 da_merlin over 15 years ago

The UP2DATE state is used in two cases:
a) same version as Master: Slave is updating
b) higher version than Master: Slave cannot join the cluster because of its higher system version and waits for the Master to upgrade to the same version.

So normal up2date behavior is:
1) Slave updates
2) Slave reboots, takes over
3) Old Master installs up2date
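
The two meanings of the UP2DATE state described above come down to a version comparison. A rough Python sketch of that rule follows; the function names and the returned strings are made up for illustration, and this is not how the HA daemon itself is implemented.

```python
def parse_version(version):
    """Turn a version string such as '7.950' into a comparable tuple (7, 950)."""
    return tuple(int(part) for part in version.split("."))


def explain_up2date_state(slave_version, master_version):
    """Interpret a slave that reports UP2DATE, given both nodes' versions."""
    slave = parse_version(slave_version)
    master = parse_version(master_version)
    if slave == master:
        # Case a): same version as the master, the slave is still updating.
        return "slave is updating"
    if slave > master:
        # Case b): the slave is ahead and waits for the master to catch up.
        return "slave is waiting for master to upgrade"
    # Not covered in this thread: a slave that is older than the master.
    return "unexpected: slave is older than master"
```

In the situation reported here the slave was already on 7.950 while the master was still on the older version, which is case b).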

It seems something went wrong in your up2date process.
Can you mail me your high-availability.log file? Thanks!

Cheers
Ulrich

0 da_merlin over 15 years ago

Any news? Are both nodes now on 7.950 ?

0 da_merlin over 15 years ago

Ok found the error! Thanks for reporting!

Are both nodes running 7.950 now? If not, just trigger up2date
via WebAdmin again. After up2date has finished, reboot the node manually again.

Cheers
Ulrich

0 darrenl over 15 years ago

Hi Ulrich,

Sorry for the delay - I'm on holiday at the moment. The latest update failed as well, but I think that has more to do with the previously found network driver issue for the Realtek card. It seems this latest update triggered the driver issue, so the server won't go online because it has lost eth0/1. Any idea when this will be fixed?

Cheers,

Darren

0 da_merlin over 15 years ago

Hi Darren,

Please disable the virtual MAC address on your system because of the broken Realtek network card.
Via the command line: cc set ha advanced virtual_mac 0

Cheers
Ulrich