Sophos XG all endpoints showing missing heartbeat

Good afternoon,

I have an XG 230 running 18.0.5 in an HA pair.

Yesterday it decided that all endpoints would show as missing heartbeats. I de-registered and re-registered in Central and now they are all showing connected. Why would that have happened?

Maybe related... yesterday the HTTPS and SSH access went offline as well. The only way we were able to regain access was to power cycle both units in the HA pair. It happened again this AM, so I am working on that. No errors in the log, no errors with HA.

Thoughts?

Thanks, Brent



This thread was automatically locked due to age.
  • Happened again this AM. Firewalls were down. Turned them both off and turned on 1 so no HA at the moment.

    ctsyncd.log shows

    [Wed Jul 28 07:06:56 2021] (pid=2528) [notice] using user-space event filtering
    [Wed Jul 28 07:06:56 2021] (pid=2528) [notice] netlink event socket buffer size has been set to 4194304 bytes
    [Wed Jul 28 07:06:56 2021] (pid=2528) [notice] initialization completed
    [Wed Jul 28 07:06:56 2021] (pid=2530) [notice] binded on cpu 0
    [Wed Jul 28 07:06:56 2021] (pid=2530) [notice] -- starting in daemon mode --
    [Wed Jul 28 07:06:56 2021] (pid=2530) [ERROR] no dedicated links available!
    [Wed Jul 28 07:06:57 2021] (pid=2530) [ERROR] no dedicated links available!
    [Wed Jul 28 07:06:57 2021] (pid=2530) [ERROR] no dedicated links available!
    [Wed Jul 28 07:06:59 2021] (pid=2530) [ERROR] no dedicated links available!
    [Wed Jul 28 07:06:59 2021] (pid=2530) [ERROR] no dedicated links available!
    [Wed Jul 28 07:07:00 2021] (pid=2530) [ERROR] no dedicated links available!

    msync.log just stopped at 18:29 yesterday and logged nothing until the reboot this AM:

    Tue Jul 27 18:29:13 2021:222634:1372:MAST:MAST:DEBUG:event.c:492 ses_cnt :3
    Tue Jul 27 18:29:13 2021:222661:1372:MAST:MAST:DEBUG:sync_entity.c:951sesid:33486: cmd ipset -D hostset fqdn,580,0,52.22
    Tue Jul 27 18:29:13 2021:225053:1372:MAST:MAST:DEBUG:sync.c:921sesid:33486:ipset -D hostset fqdn,580,0,52.22
    Tue Jul 27 18:29:13 2021:225079:1372:MAST:MAST:DEBUG:sync.c:903sesid:33486 ipset -D hostset fqdn,580,0,52.22 3
    Tue Jul 27 18:29:13 2021:996260:1372:MAST:MAST:DEBUG:event.c:492 ses_cnt :3
    Tue Jul 27 18:29:13 2021:996370:1372:MAST:MAST:DEBUG:sync_entity.c:938sesid:33487: opcode HBAddEacEpRel
    Tue Jul 27 18:29:14 2021:132740:1372:MAST:MAST:DEBUG:sync.c:921sesid:33487:HBAddEacEpRel
    Tue Jul 27 18:29:14 2021:132779:1372:MAST:MAST:DEBUG:sync.c:903sesid:33487 HBAddEacEpRel 3
    Tue Jul 27 18:29:15 2021:357960:1372:MAST:MAST:DEBUG:event.c:492 ses_cnt :3
    Tue Jul 27 18:29:15 2021:357995:1372:MAST:MAST:DEBUG:sync_entity.c:951sesid:33488: cmd ipset -D hostset fqdn,752,0,142.2
    Tue Jul 27 18:29:15 2021:361405:1372:MAST:MAST:DEBUG:sync.c:921sesid:33488:ipset -D hostset fqdn,752,0,142.2
    Tue Jul 27 18:29:15 2021:361439:1372:MAST:MAST:DEBUG:sync.c:903sesid:33488 ipset -D hostset fqdn,752,0,142.2 3
    Tue Jul 27 18:29:17 2021:251772:1372:MAST:MAST:DEBUG:event.c:492 ses_cnt :3
    Tue Jul 27 18:29:17 2021:251838:1372:MAST:MAST:DEBUG:sync_entity.c:938sesid:33489: opcode HBAddEacEpRel
    Tue Jul 27 18:29:17 2021:371736:1372:MAST:MAST:DEBUG:sync.c:921sesid:33489:HBAddEacEpRel
    Tue Jul 27 18:29:17 2021:371774:1372:MAST:MAST:DEBUG:sync.c:903sesid:33489 HBAddEacEpRel 3
    Tue Jul 27 18:29:20 2021:500211:1372:MAST:MAST:DEBUG:event.c:492 ses_cnt :3
    Tue Jul 27 18:29:20 2021:500277:1372:MAST:MAST:DEBUG:sync_entity.c:938sesid:33490: opcode HBAddEacEpRel
    Tue Jul 27 18:29:20 2021:621209:1372:MAST:MAST:DEBUG:sync.c:921sesid:33490:HBAddEacEpRel
    Tue Jul 27 18:29:20 2021:621244:1372:MAST:MAST:DEBUG:sync.c:903sesid:33490 HBAddEacEpRel 3
    T:DEBUG:sync.c:903sesid:33497 /scripts/ha/managetimeropcodes.sh 3
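A gap like the one in msync.log above can be found mechanically instead of by scrolling. The following is a minimal sketch, assuming both logs begin each line with a `ctime`-style timestamp as in the excerpts above (ctsyncd.log wraps it in square brackets). The function names and the 5-minute threshold are illustrative choices, not anything provided by the firewall.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Tue Jul 27 18:29:13 2021"
TS_LENGTH = 24  # number of characters the timestamp occupies


def parse_timestamp(line):
    """Return the datetime at the start of a log line, or None if there is none."""
    # ctsyncd.log puts the timestamp in brackets; msync.log does not.
    text = line.strip().lstrip("[")[:TS_LENGTH]
    try:
        return datetime.strptime(text, TS_FORMAT)
    except ValueError:
        return None  # truncated or non-timestamped line, skip it


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Yield (previous, current) timestamp pairs further apart than threshold."""
    previous = None
    for line in lines:
        current = parse_timestamp(line)
        if current is None:
            continue
        if previous is not None and current - previous > threshold:
            yield previous, current
        previous = current
```

Passing the lines of a log file to `find_gaps` yields each silent period as a pair of timestamps: the last entry before the log went quiet and the first entry after it resumed.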

  • Did you or someone else change something on a RED device before this began? Check the admin log.

    Also, please open a support case. Rebooting your complete HA pair to get it working again is a no-go for a firewall setup.

  • There were NO changes to the environment. They don't use REDs at all. It just decided to start failing in the middle of the night... every night! I agree nightly reboots should not be required!

    I am leaving HA off for now to see if it stabilizes. 

    I have a ticket open with support, but I usually get better answers here! :)

  • If this happens without changes, I think only support can figure it out.

    Is this the time when it died? Jul 28 07:06:56

    And are they really down, with no traffic flowing, or do they just stop responding to HTTPS and SSH management? Support will probably ask you to connect machines to the console to capture output in case they crash. You could already prepare some notebooks with serial cables.

    Btw, I fully agree with your last sentence.

  • 7:06 is when it was restarted and started logging again.

  • Hello Brent,

    I would recommend opening a case with Support to get this investigated; you can share the Case ID with me.

    This morning when the issue happened, were you not able to access the GUI or SSH into the XG either?

    Do you see anything under /var/cores?

    If this is happening every night, can you try disabling firewall acceleration:

    console > system firewall-acceleration disable

    In addition, as LHerzog suggested, once you get the serial connection going (console logging), you need to do the following:

    Using PuTTY, go to 'Session' - 'Logging.'

    Here, select 'All session output', and set the file name to a folder and name for later retrieval.

    Configure the Serial connection to use the proper COM port on your PC and a Speed of 38400.

    Start the session, and log in to confirm everything works.

    Once logged in, you can leave it there or log out and leave the session at the password prompt. Either way, leave the session active and allow it to capture the output from the next reboot.

    Once that reboot occurs, you can end the Serial connection and provide the logs to support further investigation.

    Whether or not this "solves" the issue, open a case with support.

    Regards,
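If PuTTY is not available on the capture machine, the same console logging can be scripted. This is only a sketch: `capture_console` is a hypothetical helper that copies lines from any byte stream into a log file and prefixes each with a timestamp. Feeding it a real serial port would need the third-party pyserial package, and the port name `COM3` in the comment is an assumption; the 38400 speed comes from the steps above.

```python
import io
from datetime import datetime


def capture_console(stream, log_file, clock=datetime.now):
    """Copy lines from a byte stream to log_file, one timestamp per line.

    Stops when the stream returns an empty read and returns the number
    of lines written.
    """
    count = 0
    while True:
        raw = stream.readline()
        if not raw:
            break
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        log_file.write(f"{clock().isoformat(timespec='seconds')} {text}\n")
        log_file.flush()  # keep the file current in case the capture PC is interrupted
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in for a serial port so the sketch runs without hardware.
    # With pyserial installed (an assumption), the stream could instead be
    # serial.Serial("COM3", 38400), where "COM3" is a placeholder port name.
    fake_console = io.BytesIO(b"login:\r\nexample console output\r\n")
    with open("console_capture.log", "a", encoding="ascii") as log:
        capture_console(fake_console, log)
```

The timestamps make it easier to line up console output with the times the firewall stopped responding.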

  • I have opened a ticket 04264810.

    I got some updated info from the client this AM. They rebooted before I was able to log in and check things out, but this morning's outage was a little different than the last few days. Sounds like VoIP was working, but HTTPS was failing. I have seen something like that in the past when A/V was failing.

  • Hello Brent,

    Thank you for the Case ID and the additional notes. 

    If you haven't already, I'd put csc in debug mode:

    # csc custom debug

    To disable it, run the same command again; it is also disabled automatically after the device reboots.

    Regards,
