HA unstable?

I have configured master/slave HA with the latest 7.401 software. It has mostly worked fine for a few weeks, but from time to time I get emails saying node 2 (slave) has become master, then another email saying it has become slave again, then another saying HA is fully functional again. It has been crazy over the last two days: I got 50+ such emails each day.

If I connect to the web interface and look at the Management -> HA status page, I can see:
1 MASTER node1 ACTIVE 7.401 Tue Mar 24 12:13:51 2009
2 SLAVE Node2 ACTIVE 7.401 Tue Apr 22 xx 2009 (usually the last time I got the email).

I thought node 2 must have rebooted or something, but if I ssh to node 1 and then ssh to node 2 via the linkbeat, I can check the system uptime of both nodes. Both have been up for many weeks.

Looking at the heartbeat log, I see lots of errors about node 2:
2009:04:21-21:49:32 fw1n1-2 slon[13826]: [2-1] ERROR cannot get sl_local_node_id - ERROR: schema "_asg_cluster" does not exist
2009:04:21-21:49:32 fw1n1-2 slon[13826]: [3-1] FATAL main: Node is not initialized properly - sleep 10s
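To see whether these slon errors cluster around the times of the failover emails, the log can be tallied per host and hour. The following is only a rough sketch: the regular expression is inferred from the two lines quoted above, and the names `parse_line` and `errors_per_hour` are made up for illustration. Other lines in the real HA log may use a different layout and would simply be skipped.

```python
import re
from collections import Counter
from datetime import datetime

# Layout inferred from the two quoted lines only (an assumption):
# <YYYY:MM:DD-HH:MM:SS> <host> <process>[<pid>]: [<seq>] <LEVEL> <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<proc>[\w-]+)\[(?P<pid>\d+)\]: "
    r"\[(?P<seq>\d+-\d+)\] (?P<level>[A-Z]+) +(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for a line in the layout above, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y:%m:%d-%H:%M:%S")
    return rec


def errors_per_hour(lines, levels=("ERROR", "FATAL")):
    """Count ERROR/FATAL lines per (host, hour) bucket."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["level"] in levels:
            counts[(rec["host"], rec["ts"].strftime("%Y-%m-%d %H:00"))] += 1
    return counts


# The two lines quoted in the post above.
SAMPLE = [
    '2009:04:21-21:49:32 fw1n1-2 slon[13826]: [2-1] ERROR cannot get '
    'sl_local_node_id - ERROR: schema "_asg_cluster" does not exist',
    '2009:04:21-21:49:32 fw1n1-2 slon[13826]: [3-1] FATAL main: Node is '
    'not initialized properly - sleep 10s',
]

if __name__ == "__main__":
    for (host, hour), n in sorted(errors_per_hour(SAMPLE).items()):
        print(host, hour, n)
```

Feeding the function the lines of a saved copy of the log instead of `SAMPLE` gives an hourly count that can be compared by eye against the timestamps of the notification emails.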

0 BAlfson over 17 years ago

Liug, please tell us what hardware you are running, including RAM, and post pics of the CPU usage and memory usage graphs for yesterday.

Cheers - Bob

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA

0 BrucekConvergent over 17 years ago in reply to BAlfson

If this is using a licensed version of ASG and you have Gold or Platinum maintenance, I highly recommend starting a case with Astaro Support; in most cases HA troubleshooting is best carried out by them.

CTO, Convergent Information Security Solutions, LLC

https://www.convergesecurity.com

Advice given as posted on this forum does not construe a support relationship or other relationship with Convergent Information Security Solutions, LLC or its subsidiaries. Use the advice given at your own risk.

0 liug over 17 years ago in reply to BrucekConvergent

Thanks. I have opened a case with them and sent them the HA logs.
BTW, when the CPU is high, "top" shows that it is PostgreSQL. Could that mean the schema errors in the HA log do indicate a real problem?
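To put a number on what "top" shows, the CPU share can be summed per command name. This is a minimal sketch that assumes text in the form produced by `ps -eo pcpu,comm` on a typical Linux system; whether that command is available on the ASG shell is an assumption, and `cpu_by_command` is a hypothetical helper name. The sample values below are invented purely to show the format and are not measurements from this cluster.

```python
from collections import defaultdict


def cpu_by_command(ps_output):
    """Sum the %CPU column per command name, highest total first.

    Expects a header line followed by '<pcpu> <command>' rows.
    """
    totals = defaultdict(float)
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed row
        pcpu, comm = parts
        try:
            totals[comm.strip()] += float(pcpu)
        except ValueError:
            continue  # first column was not a number
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Invented illustration values, NOT taken from the poster's system.
SAMPLE = """%CPU COMMAND
45.0 postgres
30.5 postgres
 2.0 slon
 0.5 sshd
"""

if __name__ == "__main__":
    for command, pcpu in cpu_by_command(SAMPLE):
        print(f"{command:10s} {pcpu:6.1f}")
```

Capturing this once a minute during a bad spell, alongside the failover emails, would show whether the master/slave flips line up with the PostgreSQL load.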

0 BAlfson over 17 years ago in reply to liug

Here's a thread that might be of interest to you: https://community.sophos.com/products/unified-threat-management/astaroorg/f/51/t/20453

I have only one customer with clustered ASGs. We experienced the same phenomenon as you when the processor was pegged. Not every time. It seems that there's "100%" and "way over 100%" and it's only when it's way over that the funky stuff happens.

When we first upgraded them to 7.3, there was a glitch in a PostgreSQL database that caused the cluster to stay over 100% for an hour at times. Once that was repaired, the incidents of master/slave/master became very rare.

If Astaro doesn't get back to you by the end of your workday, you might try the fix mentioned in the link.

Cheers - Bob
PS Thanks, Bruce, for all I'm learning from you here.