We have a few dozen XGs deployed, ranging from 105s to 210s. Most of these have very similar configurations, nothing crazy: AD authentication, Sophos Central Heartbeat, SSL and IPSec VPN, IPS/IDS, etc.
At one specific site we have an XG125 that was deployed with v16 something and upgraded over the years. We have found that the last stable release at this site is 17.5.0. On any later version, the device crashes every few hours. When the crash happens, we can still access the device remotely over SSH and the web interface, but internally all communication stops. If we clear the ARP table, no new MAC addresses are learned. Rebooting the XG restores connectivity.
We've worked extensively with support, enabled detailed debugging, made configuration tweaks, moved the device to a new switch, swapped hardware, and wiped and restored the config from backup. As of last night, I wiped the device, upgraded to 18.0.3 MR3, and configured it from scratch with a bare minimum configuration, in the hope that the problem was corruption in the database being carried forward by the backup/restore process. Unfortunately this was not the case: I wrapped up around 2am, and the device crashed somewhere around 4am. I've called support a few times this morning, and there is a "Critical" case open, but I have yet to receive a response.
I'm really not sure where to go from here. There is clearly something peculiar to this site that is causing the crash, but I don't know what. The XG is connected directly to the ISP's CPE (most of our other sites have the same ISP/CPE), and the LAN is connected to a Cisco SG350X switch (the same model we use at many other sites). All patch cords have been replaced, and the firewall has been connected to different switchports and different switches.
The only thing even remotely unique in this deployment (after simplifying the topology last night and starting with a blank config) is that we have remote phones that connect via IPSec back to the XG. I don't see how this could be causing the trouble, but it is possible, and it is about the only thing that differs between this site and our other deployments.
Interestingly, when the device is in its "crashed" state, we are unable to resolve DNS from the command line. This is despite the DNS servers being external, not located on the customer LAN, so a loss of internal communication alone should not affect resolution.