XG continually crashing on any version past 17.5.0

We have a few dozen XGs deployed, ranging from 105s to 210s. Most of these have very similar configurations, nothing crazy: AD authentication, Sophos Central Heartbeat, SSL and IPSec VPN, IPS/IDS, etc.

At one specific site we have an XG125 that was deployed on some v16 release and upgraded over the years. We have found that the last stable release at this site is 17.5.0. On any version beyond this, the device crashes every few hours. When the crash happens, we can still access the device remotely over SSH and the web admin, but internally all communication stops. If we clear the ARP table, the device is unable to learn new MAC addresses. Rebooting the XG restores connectivity.

We've worked extensively with support, enabled detailed debugging, made configuration tweaks, moved the device to a new switch, swapped hardware, and wiped and restored the config from backup. As of last night, I wiped the device, upgraded to 18.0.3 MR3, and configured it from scratch with a bare minimum configuration, in the hope that the problem was corruption in the database that was being carried forward by the backup/restore process. Unfortunately this was not the case: I wrapped up around 2am, and the device crashed somewhere around 4am. I've called support a few times this morning, and there is a "Critical" case open, but I have yet to receive a response.

I'm really not sure where to go from here. There is clearly something peculiar to this site that is causing the crash, but I don't know what. The XG is connected directly to the ISP's CPE (the same ISP and CPE serve most of our other sites) and the LAN is connected to a Cisco SG350X switch (the same model we use at many other sites). All patch cords have been replaced, and the firewall has been connected to different switchports and different switches.

The only thing even remotely unique in this deployment (after simplifying the topology last night, and starting with a blank config) is that we have remote phones that connect via IPSec back to the XG. I don't see how this could be causing the trouble, but it is possible, and about the only thing that is different about this site compared to other deployments.

Interestingly, when the device is in its "crashed" state, we are unable to resolve DNS from the command line. This is despite the DNS servers being external, not located on the customer LAN, so a loss of LAN connectivity alone should not affect resolution.



  • FormerMember

    Hi,

    Thank you for reaching out to the Community! 

    We apologize for any inconvenience you have experienced. If you have concerns regarding a specific support case, please don’t hesitate to reach out to me via PM and I'll be happy to help follow up.

    Thanks,

  • Hello Jeremy,

    We have followed up with Management regarding your cases. 

    It seems you have a remote session today and a new engineer was assigned to the case and has provided information on the next steps.

    Regards,

  • Thanks. The device is currently crashed, and I've been on hold for an hour now waiting for someone to pick up, but no success yet. While I hurry up and wait some more, here is the current state of affairs:

    I am able to SSH to the device from outside the network, or log in to the admin page.

    The device cannot resolve FQDNs from the command line, whether via nslookup against the default server or against an explicitly specified remote server.

    The device cannot ping any LAN or WAN addresses from the command line, but if I specify the port (for instance, ping -I Port2 8.8.8.8 or ping -I Port1 192.168.1.2) I am able to get a response.

    Existing, established connections remain open. I have a remote connection to one of the servers on the LAN, and that session is still open. I am unable to establish a new session to another server though.

    tcpdump shows LAN and WAN traffic destined for the XG, but it is seemingly ignoring it.

    This is exactly the same as the failure mode we experienced on any software newer than 17.5.0.

    What we've done so far:

    Changed public IPs

    Connected the XG directly to the ISP CPEs (no switch in between)

    Swapped hardware

    Wiped the device, re-imaged, and restored from backup.

    Wiped the device, re-imaged, and built a basic configuration manually, restoring/importing no pre-existing configuration, entering everything manually.

    Moved the XG to a different LAN switch

    Swapped all patch cords

    Swapped power supplies/power source/UPS

    An hour and five minutes on hold now, on a supposedly "Priority 1" case.
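
The ping comparison described in this reply (plain ping fails, while ping bound to a port with -I succeeds) could be automated so the crashed state is recognised without someone logging in. Below is a minimal sketch, assuming a Linux shell with Python 3 and an iputils-style ping that accepts -c, -W and -I. The XG's own shell may not provide Python, in which case the same logic would need porting to a shell script. The function names and state labels are hypothetical; the port names and addresses are the ones quoted above.

```python
import subprocess


def ping_cmd(target, iface=None, count=1, timeout_s=2):
    """Build a ping command; -I binds the probe to a specific interface."""
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s)]
    if iface:
        cmd += ["-I", iface]
    cmd.append(target)
    return cmd


def ping_ok(target, iface=None):
    """Return True if ping exits with status 0 (at least one reply)."""
    try:
        result = subprocess.run(
            ping_cmd(target, iface),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def classify(unbound_ok, bound_ok):
    """Map the two probe results onto the states seen in this thread."""
    if unbound_ok:
        return "ok"
    if bound_ok:
        # Only the interface-bound ping works: the reported failure mode.
        return "matches_reported_failure"
    return "no_connectivity"


# Example use (hypothetical, run on the device or an equivalent host):
#   state = classify(ping_ok("8.8.8.8"), ping_ok("8.8.8.8", iface="Port2"))
#   print(state)
```

Keeping classify separate from the probes means the decision logic can be checked without touching the network.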

  • No, that appears to be a failure of the upgrade process. In this case, the upgrade goes fine, but the device later crashes and stops passing traffic. If the device is rebooted it will run fine for roughly two to twelve hours before crashing again. The only permanent fix we have found at this particular site is to roll back to 17.5.0, which is less than ideal.

  • One hour and thirty minutes on hold so far. You guys are really hitting it out of the park with this customer service.

  • At approximately one hour and forty-eight minutes someone answered, but they weren't on the XG team. They put me back on hold, and at the two-hour mark the call dropped. Now what? Do I call back and hope that someone answers within another two hours?

  • If you connect a display and keyboard to this appliance, can you notice the crash? Is there anything? 

  • I have not done that with v18, but we did with v17, and there was no output to the console.

    Is there a list of services on the XG and how to restart them? If possible, I'd love to restart the system services one by one while the device is in the crashed state, and see which one wakes it back up again. Once identified, I could even script the restart of the service to reduce the downtime to 60 seconds when it crashes, rather than waiting for someone to notice and then manually reboot.
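
The scripted restart idea above could look roughly like the watchdog below. This is only a sketch under stated assumptions: probe and restart are hypothetical callables supplied by whoever runs it (for example, a LAN-side ping check, and whatever restart command support confirms for the offending service), and nothing here is a Sophos API. With three consecutive failed probes at a 20 second interval, detection takes about 60 seconds, which matches the downtime target mentioned above.

```python
import time


def watchdog(probe, restart, max_failures=3, interval_s=20,
             sleep=time.sleep, max_cycles=None):
    """Call restart() after max_failures consecutive failed probes.

    probe:      hypothetical callable, returns True when traffic passes.
    restart:    hypothetical callable that restarts the suspect service.
    max_cycles: stop after this many probes (None means run forever).
    Returns the number of restarts triggered.
    """
    failures = 0
    restarts = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        cycles += 1
        if probe():
            failures = 0  # any success resets the failure streak
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0  # start counting again after the restart
        sleep(interval_s)
    return restarts
```

Requiring several consecutive failures avoids restarting a service over a single dropped probe. The sleep and max_cycles parameters exist only so the loop can be exercised without waiting.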

  • Ok, I have likely narrowed down the crashes to the dynamic IPSec VPN services (Sophos Connect Client/Cisco VPN Client). I have half a dozen Avaya 9600 IP phones that establish IPSec VPN connections back to the XG. This all worked fine in 17.5.0 or earlier, but after we upgraded to anything post 17.5.0, the XG is crashing as described above. As a process of elimination, I have disabled the "Sophos Connect Client" on the XG, and it has now been up for 36 hours - the longest it ran in the past was about hours.