This discussion has been locked.

Gateway 1 - Main ISP keeps getting disconnected, unable to find logs

Hi All,

I would like to seek your assistance regarding our issue.

We have a dual-WAN setup with ISP#1 and ISP#2.

Since Thursday, ISP#1 has been disconnecting every few hours and staying down for 15 to 20 minutes each time.

The problem is that even though we have failover, several of our web servers are not set up for it.

So during the ISP#1 downtime, we lose connectivity to those web servers.

I have confirmed that during each 15 to 20 minute outage, Internet traffic is still flowing through the Mikrotik router (the ISP#1 router).

I am 100% sure the Mikrotik router has Internet access for the whole time Sophos shows ISP#1 as down.

As I cannot find any meaningful errors in the Sophos logs, I replaced the Ethernet cables from ONT to Mikrotik Router, from Mikrotik to Sophos, and from Sophos to Switch.

I also upgraded to the latest SFOS 17.5.12 MR-12.HF052220.1 yesterday. 

However, this morning, the issue started happening again.

Is there a way to find any meaningful error besides the dgd.log?

I called Sophos Support and spent 2 days chasing them, including a total of 3 hours on calls, but the Escalation Engineer told me it's an ISP issue.

I am 100% certain that it is NOT an ISP issue, as everything works well on the Mikrotik router every time ISP#1 goes down in Sophos.

If you could just point me in the right direction, or to the logs that would help determine what is causing the problem, I would very much appreciate it.

Thank you.

 


  • Hello Sophos User1499,

    Would it be possible for you to share the Case ID?

    How are you checking that you still have internet connectivity when ISP#1 goes down? Have you tried connecting the WAN port that goes to ISP#1 to a computer while the issue is happening and running a ping test?

    I take it you have changed the failover rule to ping an IP address beyond the IP of the ISP router?

    Regards,

     

  • @emmosophos,

    I can confirm that the Internet is working on ISP#1.

    I have taken the Sophos Firewall out of the path, and the internet works directly from the Mikrotik router, which is the ISP#1 router.

    Also, across the 6 disconnects yesterday, I found a 20-minute window in which Sophos showed ISP#1 as down in both the GUI and the CLI, yet when I checked the public IP on all my devices, they were still getting the ISP#1 IP address. Most of the time, though, when ISP#1 shows as down, traffic fails over to ISP#2, leaving my web servers on ISP#1 down.

    I have told all this to the Sophos Escalation Engineer, but he keeps insisting that it is an ISP issue, which it clearly is NOT.

    I just want to know where to read more meaningful logs to understand why Sophos Firewall keeps disconnecting ISP#1.

    Case#: 9927707

  • Hi,

    That CPU usage, is that about 5% free?

    What does top show as to what is hogging the CPU?

    Ian

  •  

    I meant CPU idle is 95%, so only 5% was in use over this weekend.

    I generated a 48-hour report. So even with only 5% use on the CPU, I still experienced several disconnects.

    Without proper logs in Sophos Firewall, I am troubleshooting based on isolation only.

    I can confirm that when Sophos shows ISP1 as disconnected, I can still ping the Sophos public IP from an outside network.

    I have now used a different PORT interface in the Sophos Firewall for ISP1 and recreated all Firewall Policies pointing to the new port interface for ISP1.

    However, I still experience the disconnect.

  • Hi,

    The issue will be with the ping: the XG is not seeing the ping response, and as a result it thinks the link is down when in reality there is nothing wrong with it.

    I haven't configured failover, even though it is almost mandatory and you have to convince it not to activate on failure.

    Ian
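
For anyone wanting evidence to put next to the gateway-down times, here is a minimal sketch of an independent probe log. It is not how the XG's own gateway detection works. It only assumes a Linux host behind the firewall whose traffic leaves via ISP1, the system `ping` with Linux-style `-c`/`-W` flags, and a target IP of your choosing (ideally the same one the failover rule pings). The function names and the `min_failures` threshold are made up for illustration.

```python
import subprocess
from datetime import datetime


def probe(host, timeout_s=2):
    """Send one ICMP echo using the system ping (Linux-style flags assumed).

    Returns True when ping exits with status 0, i.e. a reply was received.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def outage_windows(samples, min_failures=3):
    """Group consecutive failed probes into (first_failure, last_failure) windows.

    samples: iterable of (timestamp, ok) pairs in chronological order.
    Runs shorter than min_failures are treated as noise and ignored.
    """
    windows = []
    run = []
    for ts, ok in samples:
        if not ok:
            run.append(ts)
            continue
        if len(run) >= min_failures:
            windows.append((run[0], run[-1]))
        run = []
    # A failure run may still be open when the samples end.
    if len(run) >= min_failures:
        windows.append((run[0], run[-1]))
    return windows


# Typical use (not executed here): call probe() once a minute, append
# (datetime.now(), probe(target)) to a list, then pass it to outage_windows().
```

If the windows this reports are empty while the XG shows the ISP1 gateway as down for 15 to 20 minutes, that supports the view that the link is fine and the firewall's own probe is what is failing.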

  • You are right. The issue is with the ping and the XG, as the link is actually NOT down. I really appreciate you staying with me on this one.

  • It is going to be something stupid like a watch timer failing to handle two different requests for the service correctly and the supervisor timer cutting in after a delay.

    But that will be up to the Devs to identify and fix. By the way, this is not the first time this issue has been raised.

    ian

  •  

    It is going to be another long night for me. I deleted the original Port interface of ISP1 and it is now on a completely different port interface.

    All Firewall rules related to ISP1 have been deleted and recreated.

    I am now playing a waiting game. The last disconnect was at 2:00pm today, so I will wait until 2:00am.

    Hopefully everything goes well and there are no more disconnects. If it happens again, though, I will completely remove the WAN Failover to see what happens next.

    Thanks

  • Okay, so despite all the changes, I still got the ISP1 disconnect after 8 hours.

    I have now removed DUAL WAN Failover and will have to wait and see what happens next.

  •  

    So here is the kicker: I removed DUAL WAN failover last night at 10:30pm, and just 2 hours later I started receiving this Email Notice:

    "User '-' failed to login from 'X.X.X.X' using ssh because of wrong credentials" (X.X.X.X - Public IP)

    The Email Notices arrive only 2 to 5 minutes apart, so I have received a lot of emails since last night.

    The thing is, SSH over WAN and Ping over WAN are both disabled (unchecked) under Device Access.

    Removing the Dual WAN Failover setup has triggered attacks on our Firewall.
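
As a rough aid for sizing up those notices, here is a small sketch that tallies them by source IP and service. It assumes only that the notification text matches the wording quoted above and that the email bodies have been saved as plain text lines. The regex, the function name, and the sample addresses are illustrative, not anything Sophos provides.

```python
import re
from collections import Counter

# Pattern built from the notice wording quoted in this post.
NOTICE = re.compile(
    r"User '(?P<user>[^']*)' failed to login from '(?P<ip>[^']*)' "
    r"using (?P<service>\w+) because of wrong credentials"
)


def count_failed_logins(lines):
    """Count failed-login notices per (source IP, service) pair."""
    counts = Counter()
    for line in lines:
        match = NOTICE.search(line)
        if match:
            counts[(match.group("ip"), match.group("service"))] += 1
    return counts


# Example with documentation-range addresses (not real attackers):
sample = [
    "User '-' failed to login from '203.0.113.7' using ssh because of wrong credentials",
    "User '-' failed to login from '203.0.113.7' using ssh because of wrong credentials",
    "User '-' failed to login from '198.51.100.9' using ssh because of wrong credentials",
]
for (ip, service), n in count_failed_logins(sample).most_common():
    print(ip, service, n)
```

A single source repeating every few minutes and many sources each trying once would point to rather different causes, so the tally is worth a look before concluding what triggered the attempts.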

  • Hi,

    sorry, that doesn't make sense. I disabled those notifications because like you I have disabled those external features.

    More than likely you now have stable external DNS registration which is being used by the attackers. 

    I put a rule in place at the top to drop all external connections to my XG. How to create the rule can be found in this thread.

    https://community.sophos.com/products/xg-firewall/f/firewall-and-policies/118893/geoip

    It has stopped a lot of junk appearing in the log viewer.

    Ian

  • You are right, it does not make sense at all. So I did the trick you mentioned: saving the config without changing anything on ISP1.

    I did the same under Device Access: I checked and then unchecked the SSH box on LAN, WAN, and VPN. It has been 30 minutes and the attack has not happened again yet.

    So far, it has been exactly 12 hours since I deleted DUAL WAN failover, and ISP1 has not disconnected yet.

    Something is going on with Sophos that I cannot find out.

  •  

    So it has been 18 hours since I deleted the ISP2 interface, which disables the DUAL WAN Failover.

    There has been no disconnect so far on ISP1. I am planning to add ISP2 interface back in the Sophos Firewall.

    My worry is that the ISP1 will be disconnected once I activate DUAL WAN failover.

    As it is clearly a problem on my Sophos device, is there anything else that I can look at in the XG to check why XG is failing over even though ISP1 line is active?

  •   

    It has been 23 hours since I disabled ISP2, and ISP1 has gone down again.

    I think something is really wrong with my Sophos but since Sophos Support is insisting it is an ISP issue without any evidence, I am unsure what else I can check on this.

    I can still ping the Sophos ISP1 WAN IP even though the gateway is showing DOWN.

    I can also confirm that the Mikrotik router for ISP1 is still working.

    TCPDump on the ISP1 port interface shows 0 packets dropped, so the line really is still active.
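
One caveat on that last point: the "packets dropped" figure in tcpdump's closing summary counts packets the capture itself could not keep up with, so 0 there does not show that the gateway probes were answered. A more direct check is to look for ICMP echo requests that never got a reply. Here is a minimal sketch, assuming the capture was saved as text from something like `tcpdump -n icmp` on the ISP1 port and that lines follow tcpdump's usual IPv4 ICMP format. The function name and sample addresses are hypothetical.

```python
import re

# Matches lines such as:
# 14:00:01.000000 IP 192.0.2.10 > 8.8.8.8: ICMP echo request, id 4321, seq 1, length 64
ICMP_LINE = re.compile(
    r"^(?P<time>\S+) IP (?P<src>\S+) > (?P<dst>[^:]+): "
    r"ICMP echo (?P<kind>request|reply), id (?P<id>\d+), seq (?P<seq>\d+)"
)


def unanswered_echo_requests(lines):
    """Return (time, target, seq) for each echo request with no matching reply."""
    pending = {}
    for line in lines:
        match = ICMP_LINE.match(line.strip())
        if not match:
            continue
        if match.group("kind") == "request":
            key = (match.group("src"), match.group("dst"),
                   match.group("id"), match.group("seq"))
            pending[key] = match.group("time")
        else:
            # A reply travels the other way, so swap src and dst to find its request.
            key = (match.group("dst"), match.group("src"),
                   match.group("id"), match.group("seq"))
            pending.pop(key, None)
    return [(time, key[1], int(key[3])) for key, time in pending.items()]


capture = [
    "14:00:01.000000 IP 192.0.2.10 > 8.8.8.8: ICMP echo request, id 4321, seq 1, length 64",
    "14:00:01.020000 IP 8.8.8.8 > 192.0.2.10: ICMP echo reply, id 4321, seq 1, length 64",
    "14:00:02.000000 IP 192.0.2.10 > 8.8.8.8: ICMP echo request, id 4321, seq 2, length 64",
]
print(unanswered_echo_requests(capture))
```

If requests from the XG's WAN IP to the failover target go unanswered during the window when the gateway shows DOWN, while other traffic on the same port keeps flowing, that narrows the problem to the probe or its target rather than the line.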