Errors in HA Live Log after 7.504 patch

We have a 3-node HA cluster that was running 7.502.  On Tuesday I applied the 7.503 patch, and that went smoothly.  After all nodes were synced, I applied 7.504.  The master and worker report active, but the slave has been in the "syncing" phase for two days now.  The HA Live Log reports the errors below:
--------------------------------------------------------------
2010:03:18-08:49:15 secgate-an-2 slon[13662]: [3-1] FATAL main: Node is not initialized properly - sleep 10s
2010:03:18-08:49:16 secgate-an-3 slon[7669]: [25070-1] ERROR slon_connectdb: PQconnectdb("dbname=pop3 host=198.19.250.2 user=ha_sync password=slony") failed - could not create
2010:03:18-08:49:16 secgate-an-3 slon[7669]: [25070-2] socket: Too many open files
2010:03:18-08:49:16 secgate-an-3 slon[7669]: [25071-1] WARN remoteListenThread_2: DB connection failed - sleep 10 seconds
2010:03:18-08:49:25 secgate-an-2 slon[11691]: [1-1] CONFIG main: slon version 1.2.20 starting up
2010:03:18-08:49:25 secgate-an-2 slon[13904]: [2-1] ERROR cannot get sl_local_node_id - ERROR: schema "_asg_cluster" does not exist

--------------------------------------------------------------
 Any recommendations for troubleshooting?
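
For anyone triaging a similar excerpt, here is a minimal sketch that tallies the slon messages above by node and severity, which makes it easier to see which node is failing and how. The regular expression, the `summarize` helper, and the severity list are assumptions based only on the line format shown in this post; they are not part of any Sophos or Slony tool.

```python
import re
from collections import Counter

# Assumed line layout, inferred from the excerpt in this post:
#   <timestamp> <node> slon[<pid>]: [<seq>-<part>] <SEVERITY> <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2})\s+"
    r"(?P<node>\S+)\s+slon\[(?P<pid>\d+)\]:\s+"
    r"\[(?P<seq>\d+)-(?P<part>\d+)\]\s+(?P<rest>.*)$"
)

# Severities that appear in the excerpt; other levels may exist.
SEVERITIES = ("FATAL", "ERROR", "WARN", "CONFIG")


def summarize(lines):
    """Count slon messages per (node, severity).

    Continuation lines (such as the "[25070-2] socket: ..." line) do not
    start with a known severity and are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        first_word = match.group("rest").split(" ", 1)[0]
        if first_word in SEVERITIES:
            counts[(match.group("node"), first_word)] += 1
    return counts


# The log lines quoted in this post.
SAMPLE = """\
2010:03:18-08:49:15 secgate-an-2 slon[13662]: [3-1] FATAL main: Node is not initialized properly - sleep 10s
2010:03:18-08:49:16 secgate-an-3 slon[7669]: [25070-1] ERROR slon_connectdb: PQconnectdb("dbname=pop3 host=198.19.250.2 user=ha_sync password=slony") failed - could not create
2010:03:18-08:49:16 secgate-an-3 slon[7669]: [25070-2] socket: Too many open files
2010:03:18-08:49:16 secgate-an-3 slon[7669]: [25071-1] WARN remoteListenThread_2: DB connection failed - sleep 10 seconds
2010:03:18-08:49:25 secgate-an-2 slon[11691]: [1-1] CONFIG main: slon version 1.2.20 starting up
2010:03:18-08:49:25 secgate-an-2 slon[13904]: [2-1] ERROR cannot get sl_local_node_id - ERROR: schema "_asg_cluster" does not exist
"""

if __name__ == "__main__":
    for (node, severity), n in sorted(summarize(SAMPLE.splitlines()).items()):
        print(f"{node:15s} {severity:7s} {n}")
```

On this excerpt the tally separates two distinct symptoms: secgate-an-2 reports a missing "_asg_cluster" schema, while secgate-an-3 fails to connect because it has run out of file descriptors.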


  • Just an update, I rebooted the node that was stuck in syncing mode, and it came up and synced right away and the errors are gone.

  • Two days seems long enough if you don't have a substantial email quarantine and lots of logfile history.  I would submit a trouble ticket and give it more time.  Are you seeing in the Content Filter log that all three nodes are being used?

    Cheers - Bob
     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • The errors cleared up after rebooting the slave, and all 3 nodes are synced and active now.

    Yes, all 3 nodes are being used in the Content Filter log.

    The Mail Manager page takes a while to load, close to a minute.  Before the recent reboots, it would not load at all, and it had been that way for about a week.  We always seem to have trouble loading the Mail Manager page.

    We currently have close to 42,000 quarantined items.

  • 42,000?  That might take a while.  When you rebooted the Slave, wasn't the Worker promoted to Slave?  Did you notice what the transfer throughput was on the HA link while that sync was taking place?

    Have you rolled out the Enduser Portal so users can manage their own quarantines and whitelists?
     
  • >>

    CTO, Convergent Information Security Solutions, LLC

    https://www.convergesecurity.com

    Advice given as posted on this forum does not construe a support relationship or other relationship with Convergent Information Security Solutions, LLC or its subsidiaries.  Use the advice given at your own risk.

  • Sorry Bruce, this was a configuration mistake with the betabot. I mistyped the number of the post where the betabot should put the Known-Issues List. I hope I didn't delete too much of your work?
  • That explains it... for a minute I thought someone had nabbed my credentials.... no worries.
