SMTPD dead, heartbeat dead, firewalls locking up (XG330)

I have read a bunch of threads on this and have currently been waiting on hold for over an hour to talk to someone.

The problem is that all of a sudden today my SMTPD service on the firewall has failed with a status of DEAD. 

This happened a week ago as well, and someone just restarted the firewall in that case. However, of course, when I tried that, the second firewall locked up and I had to drive down to the site :( and power cycle it to get it back online. They are in an HA configuration.

I then tried to restart again, and the first firewall locked up. So it's not a hardware issue, as both locked up.

So right now, as a workaround, I have turned SMTPS scanning OFF and all the mail is flowing again. But my firewall is still in this broken state.

I will update when Sophos support tells me how to resolve this.

I have tried using the console and restarting the services, as that was mentioned in other posts, and that has not helped so far. A lovely thing to happen at 3pm on a Friday... clearly your call center is severely understaffed.



  • Hello Givemecontrol,

    Thank you for contacting the Sophos Community!

    So right now what is the status of the HA? Can you SSH to the unit?

    If you can SSH into it, from the Advanced Shell (5 > 3) paste the output of 

    # service -S | grep DEAD

    # csc custom status

    Check if there is any coredump in the XG related to SMTPd or Heartbeat

    # ls -lh /var/cores 

    Have you tried restarting the smtpd service from the backend?

    # service smtpd:restart -ds nosync

    # service heartbeat:restart -ds nosync

    If it gives you an error, please check applog.log and the smtpd logs.

    From what you describe, it sounds like some files are not syncing between the HA units. You might need to SSH to the AUX device to see if they are there. Run the following command on both units to see if the files match:

    # ls -lah /conf/sysfiles/heartbeatd/

    Regards,

  • Well, you are way faster than support, that's for sure. On hold for 1h44m atm...

    XG330_WP01_SFOS 18.0.1 MR-1-Build396# service -S | grep DEAD
    awarrensmtp          DEAD
    heartbeat            DEAD
    XG330_WP01_SFOS 18.0.1 MR-1-Build396# csc custom status
    
    =====
     Fri Oct  2 17:00:13 2020
     Listerner is in  UNFREEZED  STATE
     Freeze INIT wait val :  10
     Freeze wait val :  120
     Opcode queue len :  50
     Service queue len:  50
     Freeze INIT timeout : Not effective    Freeze timeout : Not effective
     HANDLE_STATE_CHANGE OPCODE RUNNING(2) ?:- 0
     HA is enabled. HA status:  HA_PRIM
    
    Free Workers:
     2355
     2344
     2339
     2280
     2236
     2225
     491
     481
     21552
     21547
     21544
     21542
     21541
     21539
     21538
    
    Busy workers:
     C_FLAG: 1-DB_CONTROL, 2-NO_WAIT
     R_FLAG: 1-DEBUG, 2-NO_SYNC
     PID     OPCODE NAME                       C_FLAG     R_FLAG     EXEC_TIME
     25893   restart_tomcat                    0x 0       0x 0       Fri Oct  2 17:00:00 2020
    
    Busy service:
    
    
    XG330_WP01_SFOS 18.0.1 MR-1-Build396# ls -lh /var/cores
    -rw-------    1 root     0         114.2K Sep 23 06:21 632898dc-9c5c-477e-a806519d-dde8b2d0.dmp
    -rw-------    1 root     0         104.7M May 27  2019 core.dnscache
    -rw-------    1 root     0          81.8M Jul  4  2019 core.garner
    -rw-------    1 root     0           2.5M Mar  5  2020 core.smtpd
    XG330_WP01_SFOS 18.0.1 MR-1-Build396# service smtpd:restart -ds nosync
    200 OK
    XG330_WP01_SFOS 18.0.1 MR-1-Build396# service heartbeat:restart -ds nosync
    503 Service Failed
    
    
    server1
    XG330_WP01_SFOS 18.0.1 MR-1-Build396# ls -lah /conf/sysfiles/heartbeatd/
    drwxrwx---    3 root     heartbea    1.0K Oct  2 15:33 .
    drwxr-xr-x    5 root     0           1.0K Dec 13  2018 ..
    drwxr-xr-x    2 root     0           1.0K Jul 15 07:22 ca-certificates
    -rw-r-----    1 root     heartbea  404.0K Jun  5 10:23 certificate_store.db
    -rw-r--r--    1 heartbea heartbea   79.0K Jun  5 10:16 endpoint_store.db
    -rw-r--r--    1 root     0              2 Oct  2 15:33 sophos-central.json
    XG330_WP01_SFOS 18.0.1 MR-1-Build396#
    
    
    server2
    XG330_WP01_SFOS 18.0.1 MR-1-Build396# ls -lah /conf/sysfiles/heartbeatd/
    drwxrwx---    3 root     heartbea    1.0K Oct  2 14:00 .
    drwxr-xr-x    5 root     0           1.0K May 14  2018 ..
    drwxr-xr-x    2 root     0           1.0K Jul 15 07:26 ca-certificates
    -rw-r-----    1 root     heartbea  404.0K Oct  5  2018 certificate_store.db
    -rw-r--r--    1 heartbea heartbea   79.0K Oct  5  2018 endpoint_store.db
    -r--r-----    1 root     0           3.9K Feb  5  2018 ep_cert.crt
    -r--r-----    1 root     0           1.6K May 19  2017 server.crt
    -r--------    1 root     0           3.2K May 19  2017 server.key
    -rw-r--r--    1 root     0              2 Oct  2 14:00 sophos-central.json
    XG330_WP01_SFOS 18.0.1 MR-1-Build396#

    Those are the results. Yes, some files don't match. Not sure what to do with that info though...
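For readers who want to check which files differ without eyeballing the two listings, here is a minimal sketch in Python that compares the names in the two `ls -lah` outputs pasted above. It runs off-box on the copied text; the function name `listed_names` and the variable names are made up for illustration, and it assumes file names contain no spaces.

```python
def listed_names(ls_output):
    """Collect entry names from `ls -l`-style rows, skipping '.' and '..'."""
    names = set()
    for line in ls_output.splitlines():
        fields = line.split()
        # A long-listing row starts with a 10-character mode string such as
        # "-rw-r--r--" and has 9 whitespace-separated fields. This skips
        # shell prompts and blank lines. Names with spaces are not handled.
        if len(fields) != 9 or len(fields[0]) != 10 or fields[0][0] not in "-dl":
            continue
        if fields[-1] not in (".", ".."):
            names.add(fields[-1])
    return names


# Listings copied from the post above (server1 and server2 as labelled there).
SERVER1 = """
drwxrwx---    3 root     heartbea    1.0K Oct  2 15:33 .
drwxr-xr-x    5 root     0           1.0K Dec 13  2018 ..
drwxr-xr-x    2 root     0           1.0K Jul 15 07:22 ca-certificates
-rw-r-----    1 root     heartbea  404.0K Jun  5 10:23 certificate_store.db
-rw-r--r--    1 heartbea heartbea   79.0K Jun  5 10:16 endpoint_store.db
-rw-r--r--    1 root     0              2 Oct  2 15:33 sophos-central.json
"""

SERVER2 = """
drwxrwx---    3 root     heartbea    1.0K Oct  2 14:00 .
drwxr-xr-x    5 root     0           1.0K May 14  2018 ..
drwxr-xr-x    2 root     0           1.0K Jul 15 07:26 ca-certificates
-rw-r-----    1 root     heartbea  404.0K Oct  5  2018 certificate_store.db
-rw-r--r--    1 heartbea heartbea   79.0K Oct  5  2018 endpoint_store.db
-r--r-----    1 root     0           3.9K Feb  5  2018 ep_cert.crt
-r--r-----    1 root     0           1.6K May 19  2017 server.crt
-r--------    1 root     0           3.2K May 19  2017 server.key
-rw-r--r--    1 root     0              2 Oct  2 14:00 sophos-central.json
"""

only_on_server2 = sorted(listed_names(SERVER2) - listed_names(SERVER1))
only_on_server1 = sorted(listed_names(SERVER1) - listed_names(SERVER2))

print("only on server2:", only_on_server2)
print("only on server1:", only_on_server1)
```

On the listings above this reports `ep_cert.crt`, `server.crt` and `server.key` as present on server2 only. It compares names only; files that exist on both units but differ in size or date (such as `certificate_store.db` here) would need a separate check.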

  • Does it have to do with certificates? I just renewed one last month and updated it on the firewall, but I don't think it was in use. It was a CA cert, *.domain.ca type. But like I said, I don't think it was active on anything. I did switch the SMTP cert to it about an hour ago, from the previously configured "Default" setting (whatever cert that was)...

    I should say that was well after the issue started. I didn't do anything with certificates except one month ago and one hour ago. Last Wednesday's failure and today's failure appear random, as it worked in between those two times.

  • Hello Givemecontrol,

    Thank you for the follow-up!

    Can you SSH into the AUX device and run the following commands to sync the heartbeat files?
    # syncfile /conf/sysfiles/heartbeatd/server.key
    # syncfile /conf/sysfiles/heartbeatd/server.crt
    # syncfile /conf/sysfiles/heartbeatd/ep_cert.crt


    You should see a Sync File success message.

    After that, run:

    # service heartbeat:restart -ds nosync

    That will most likely fix the heartbeat issue.

    As for SMTPd, I can see a coredump, which tells me that at some point there was an issue with SMTPd. Does the AUX device have any more recent core dumps?

  • Hello Givemecontrol,

    It shouldn't be an issue with certificates; I think it is more of a sync issue.

    Regards,

  • You got it, that did it, and my heartbeat is not dead anymore. So that's something.

    On the aux firewall, it does say this when I try to run that last command:

    XG330_WP01_SFOS 18.0.1 MR-1-Build396# service heartbeat:restart -ds nosync
    250 Unregistered
    

    But on the primary firewall it works fine.

    And now I get:

    XG330_WP01_SFOS 18.0.1 MR-1-Build396# service -S | grep DEAD
    awarrensmtp          DEAD
    

    And after almost 2 hours, Sophos disconnected me from their hold! hahaha *cries*

    So here I start in the queue again from zero. You guys should do that thing where you put in your phone number and get called BACK, because waiting two hours on hold and then being disconnected IS NOT COOL!

    Oh, and to your other question, here is the core dump listing from the aux machine:

    XG330_WP01_SFOS 18.0.1 MR-1-Build396# ls -lh /var/cores
    -rw-------    1 root     0          34.6M Mar 29  2020 core.ctasd.bin
    -rw-------    1 root     0         114.5K Oct  2 11:32 d897f913-b94b-4de8-1bd99687-fe2cd9b6.dmp
    

    and the primary:

    XG330_WP01_SFOS 18.0.1 MR-1-Build396# ls -lh /var/cores
    -rw-------    1 root     0         114.2K Sep 23 06:21 632898dc-9c5c-477e-a806519d-dde8b2d0.dmp
    -rw-------    1 root     0         104.7M May 27  2019 core.dnscache
    -rw-------    1 root     0          81.8M Jul  4  2019 core.garner
    -rw-------    1 root     0           2.5M Mar  5  2020 core.smtpd
    XG330_WP01_SFOS 18.0.1 MR-1-Build396#
    

    And the aux IS interesting, because 11:30am today is the last time I was successfully retrieving mail.

    And the 23rd was when we had the first failure! So those core dumps are certainly related. How do I pull them off to attach to my support case?

    Using WinSCP, I can't SFTP the files for some reason (access denied).
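To line up the core dumps with the outage times, the following rough sketch in Python pulls out only the recently written files from the two `ls -lh /var/cores` listings above. It relies on `ls` typically printing a clock time rather than a year for files modified within roughly the last six months; that the appliance's `ls` behaves this way is an assumption, and the function name `recent_entries` is made up for illustration.

```python
def recent_entries(ls_output):
    """Return (name, 'Mon D HH:MM') for file rows whose date shows a clock time."""
    recent = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Expect: mode, links, owner, group, size, month, day, time-or-year, name.
        if len(fields) != 9 or not fields[0].startswith("-"):
            continue
        month, day, time_or_year, name = fields[5], fields[6], fields[7], fields[8]
        # A colon means a clock time was printed, so the file is recent.
        if ":" in time_or_year:
            recent.append((name, f"{month} {day} {time_or_year}"))
    return recent


# Listings copied from the post above.
PRIMARY_CORES = """
-rw-------    1 root     0         114.2K Sep 23 06:21 632898dc-9c5c-477e-a806519d-dde8b2d0.dmp
-rw-------    1 root     0         104.7M May 27  2019 core.dnscache
-rw-------    1 root     0          81.8M Jul  4  2019 core.garner
-rw-------    1 root     0           2.5M Mar  5  2020 core.smtpd
"""

AUX_CORES = """
-rw-------    1 root     0          34.6M Mar 29  2020 core.ctasd.bin
-rw-------    1 root     0         114.5K Oct  2 11:32 d897f913-b94b-4de8-1bd99687-fe2cd9b6.dmp
"""

primary_recent = recent_entries(PRIMARY_CORES)
aux_recent = recent_entries(AUX_CORES)

for unit, entries in (("primary", primary_recent), ("aux", aux_recent)):
    for name, stamp in entries:
        print(f"{unit}: {stamp}  {name}")
```

On these listings it leaves one `.dmp` file per unit, dated Sep 23 06:21 on the primary and Oct 2 11:32 on the aux, which are the two dumps that match the failure dates described in the post.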

  • Hello Givemecontrol!

    Yes, on the AUX device it will show as Unregistered; that is normal.

    I would recommend you open a case with support and submit all of this information on the ticket.

    These coredumps need to be investigated. I don't think you can submit the coredumps themselves, but I would recommend that you submit the following logs:

    applog.log, smtpd_main.log, csc.log, garner.log, msync.log 

    To pull out the logs please follow this KB

    Also, once you have the Case ID, please share it so I can follow up there.

    Sorry to hear the call got disconnected, and thank you for the comments on how to improve support. I will pass them along to management.

    Regards,

  • K, will do, thanks. 03197282

    I'll probably log off for the night since the system is mostly up and it's not an emergency anymore. Will try to call back tomorrow.

  • PS: Sounds like you have a certificate issue.

    It is likely that SMTP can die if the certificate is somehow corrupt.

    When SMTP dies, do you see anything around that time in smtpd_main.log or smtpd_*.log?

  • Nothing in the logs really. All seemed to start fine. So I rebooted the firewalls and they came up fine now.

    Was on hold again with Sophos for the last hour with no one picking up again. This is at 8am PST. Sad.

    But the issue appears resolved. It must have been caused by those missing files. I can't really think of any other changes I made on Friday.

    I hate problems like this. Can missing files cause all these problems? Perhaps... why didn't the firewalls resync the files automatically? Sigh...

    EDIT: Decided to keep the case open and have an engineer talk to me now. I want them to view the core dump files, since this happened twice, and get to the real bottom of it. I mean, I pay so much for support.

    Will write back if they come up with anything further.
