[Solved] High CPU load after update 9.203-3

After the update I noticed that my UTM 220 HA cluster shows 98% CPU load on the dashboard, but only on the master node (the slave node behaves as usual). When I switch roles by rebooting the master, the new master goes up to 98% load, while the former master no longer shows the high load. Output of the top command: see attached screenshot. The syslog-ng daemon shows a constant 25% load, which is not normal, I guess.

Is anybody else experiencing this problem? Could it be caused by a switch to a new syslog daemon that is now going through the old logs and generating metadata or something? The firewall UI and internet traffic appear to be as responsive as before, so no crazy uncontrolled CPU hogging is going on.

This thread was automatically locked due to age.

0 DerBachmannRocker over 11 years ago

Looked at it further. The "System messages" log gets bombarded with the following entries:


2014:06:23-10:11:44 GwExt01-2 postgres[1519]: [3-1] ERROR: cache lookup failed for function 411182
2014:06:23-10:11:44 GwExt01-2 postgres[1519]: [3-2] STATEMENT: select ins_atp($1, $2, $3::int4, $4, $5, $6, $7, $8, $9, $10, $11::int4)
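
To see which daemon and which message is flooding the log, the lines can be tallied offline. The following is only a minimal sketch, assuming the line layout shown above (timestamp, host, `process[pid]:`, message); the function name `count_errors` and the regular expression are illustrative and not part of any UTM tooling.

```python
import re
from collections import Counter

# Assumed layout: "2014:06:23-10:11:44 host process[pid]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<proc>[^\s\[]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)

def count_errors(lines):
    """Count ERROR messages per (process, message) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        # Drop the postgres "[3-1]" sequence prefix if present.
        msg = re.sub(r"^\[\d+-\d+\]\s*", "", m.group("msg"))
        if msg.startswith("ERROR:"):
            counts[(m.group("proc"), msg)] += 1
    return counts

# The two lines quoted in the post above.
sample = [
    "2014:06:23-10:11:44 GwExt01-2 postgres[1519]: [3-1] ERROR: cache lookup failed for function 411182",
    "2014:06:23-10:11:44 GwExt01-2 postgres[1519]: [3-2] STATEMENT: select ins_atp($1, $2, $3::int4, $4, $5, $6, $7, $8, $9, $10, $11::int4)",
]

for (proc, msg), n in count_errors(sample).most_common():
    print(n, proc, msg)
```

Fed with a full copy of the system messages log instead of `sample`, the most frequent pair at the top of the output would show what is hammering the log.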

This, together with BAlfson's tip to rebuild the PostgreSQL DB, led me to execute the following command:
/etc/init.d/postgresql92 rebuild

After that I got a mismatching database ID error between the slave and master nodes of my HA cluster. That makes sense, as the database is only rebuilt on the master and the slave still has the old copy stored. So I rebooted the slave node and waited for the sync to complete. CPU load is now back to normal. BUT: I now get the following error in system messages, and RAM usage has rocketed to a constant 83%:


2014:06:23-11:26:42 GwExt01-2 postgres[18419]: [116-1] ERROR: schema "repmgr_asg" does not exist at character 13 
2014:06:23-11:26:42 GwExt01-2 postgres[18419]: [116-2] STATEMENT: insert into repmgr_asg.repl_monitor ( 
2014:06:23-11:26:42 GwExt01-2 postgres[18419]: [116-3] primary_node, standby_node, 
2014:06:23-11:26:42 GwExt01-2 postgres[18419]: [116-4] last_monitor_time, 
2014:06:23-11:26:42 GwExt01-2 postgres[18419]: [116-5] last_wal_primary_location, last_wal_standby_location, 
2014:06:23-11:26:42 GwExt01-2 postgres[18419]: [116-6] replication_lag, apply_lag 
2014:06:23-11:26:42 GwExt01-2 postgres[18419]: [116-7] ) values ( $1, $2, $3, $4, $5, pg_xlog_location_diff($4, $5), pg_xlog_location_diff($4, $6))

What does this error message even say? Does this mean replication between slave / master is broken at the moment?
HA Log:


2014:06:23-11:45:48 GwExt01-1 repctld[4661]: [c] sql_execute(2234): SQL execute: ERROR: schema "repmgr_asg" does not exist
2014:06:23-11:45:48 GwExt01-1 repctld[4661]: [c] update_monitor(1663): MONITOR: insert failed
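
On reading the postgres entry: the `[116-1]` to `[116-7]` prefixes mark the pieces of a single log entry split across syslog lines. Joined together, it says that an `insert into repmgr_asg.repl_monitor` failed because the schema `repmgr_asg` does not exist in the database; "character 13" is the position in the statement where that schema name starts. The repctld line in the HA log reports the same missing schema. A minimal sketch for stitching such entries back together, assuming the `[sequence-part]` prefix layout shown above (the helper name `reassemble` is illustrative):

```python
import re
from collections import defaultdict

# Assumed layout: "... postgres[pid]: [seq-part] text"
ENTRY_RE = re.compile(
    r"postgres\[(?P<pid>\d+)\]: \[(?P<seq>\d+)-(?P<part>\d+)\] (?P<text>.*)$"
)

def reassemble(lines):
    """Join multi-line postgres entries, keyed by (pid, sequence number)."""
    parts = defaultdict(dict)
    for line in lines:
        m = ENTRY_RE.search(line.rstrip())
        if m:
            key = (int(m["pid"]), int(m["seq"]))
            parts[key][int(m["part"])] = m["text"].strip()
    # Concatenate the parts of each entry in part-number order.
    return {
        key: " ".join(text for _, text in sorted(chunks.items()))
        for key, chunks in parts.items()
    }

# First three lines of the entry quoted in the post above.
sample = [
    '2014:06:23-11:26:42 GwExt01-2 postgres[18419]: [116-1] ERROR: schema "repmgr_asg" does not exist at character 13 ',
    "2014:06:23-11:26:42 GwExt01-2 postgres[18419]: [116-2] STATEMENT: insert into repmgr_asg.repl_monitor ( ",
    "2014:06:23-11:26:42 GwExt01-2 postgres[18419]: [116-3] primary_node, standby_node, ",
]

for (pid, seq), text in reassemble(sample).items():
    print(pid, seq, text)
```

Whether the missing schema also means that replication itself is broken cannot be read from these lines alone; they only show that the monitoring insert fails.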

What should I try next? Maybe rejoining the slave to the cluster?

0 BAlfson over 11 years ago

That sounds like a good idea since the rebuild only works on the Master.

Cheers - Bob

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA
0 DerBachmannRocker over 11 years ago

Tried it yesterday. No luck. After the sync I still got the same errors, so the sync did not work properly. I then pulled all the Ethernet cables from the slave (node 2) and rebuilt it using the current config from the master (node 1). After the rebuild I moved all Ethernet cables from node 1 over to node 2. So I now have the former slave working as a single UTM, with only minimal downtime from switching the cables. Currently there are no database-related errors anymore. The next step will be to rebuild node 1 as well and connect it back so that it forms an HA cluster again.

Lots of work, as I don't do this stuff every day. I hope the updates will be a bit more stable in the future.
0 DerBachmannRocker over 11 years ago

Ok, looking good so far. I rebuilt node 1 and joined it back to the HA cluster. So far there are no bad log entries either for HA or in System Messages. I guess the problem is "solved" by a complete rebuild of the firewall. Meh.
@Bob: tried to change the title of the thread. Didn't work. Could you please add [solved]?
0 BAlfson over 11 years ago

DBR, by any chance, do you have one of the nodes listed as 'Preferred master'? If so, it appears that that old bug is not yet fixed.

Cheers - Bob
PS Will change title.

0 DerBachmannRocker over 11 years ago in reply to BAlfson

Nope, I did not set any node to be the preferred master, neither before the rebuild nor afterwards.

Thanks.
0 SmallAdmin over 11 years ago in reply to DerBachmannRocker

Hello Bob,
what are the effects of this HA "preferred master" bug?
We had some node failovers going mad lately, maybe we are affected?

Thanks,
SA
0 BAlfson over 11 years ago

SA, try a Google on site:astaro.org "preferred master" and limit the search to the last year.

Cheers - Bob

0 SmallAdmin over 11 years ago in reply to BAlfson

Found it, thanks. We had a lot of problems with crashing HA-sync. Sophos support never mentioned anything about a possible problem with the "preferred master" setting.

We've set a "preferred master" because we are using a dedicated "backup" room, with all the failover equipment in one place...
0 BAlfson over 11 years ago

Just be sure to remove the Preferred Master setting prior to any Up2Dates.

Cheers - Bob
