{dnscache} dnsd keeps restarting

Running XG 18.0 MR5-Build586 on a pair of SG230's in HA (Active-Passive). We use the XG as a local cache and DNS relay, since we rely on AD DNS hosted in our AWS Virtual Private Cloud. We have DNS request routing set up so that only internal domains go through the tunnel to the internal AD DNS servers; everything else should go to our ISP's DNS. The firewall, as the closest to all of our endpoints, should be

Most recently, we discovered that we were having problems with random CPU spikes in a variety of processes (notably awarrenhttp, httpd, dnscache, and even garner) that would actually cause major network problems and inability to access the. As of today, we shut off firewall-acceleration. While we've only been monitoring for a couple of hours, our SG230's running XG 18.x do seem a lot happier, so I can recommend that to others having trouble.

However, a remaining problem, and one that has been a constant thorn in our side since migrating to XG 18.0 from UTM 9.x, is the DNS problem. We seem to have isolated this as its own separate issue, and it may explain a variety of threads from people complaining about poor DNS performance on XG 18. Here are our symptoms:

Every couple of minutes, we'll notice DNS resolution tanks. This is easiest to replicate by simply opening up CMD in Windows and running nslookup, then continually looking up various domains. After a minute or two, you'll hit a request and get nothing but timeouts on this latest lookup. Try again (which ends up being about 10 seconds later), and the request goes through and is, for the most part, nice and fast.
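For anyone who wants to reproduce this without typing into nslookup by hand, here is a rough sketch of the same test in Python. It sends plain A-record queries over UDP straight to a DNS server and prints a timestamped line whenever a lookup times out. The function names, the 2-second timeout, and the server address in the usage comment are my own placeholders, not anything from the XG; point it at your firewall's LAN IP and your own list of domains.

```python
import socket
import struct
import time


def build_query(name: str, qid: int) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack(">HH", 1, 1)


def probe(server: str, name: str, timeout: float = 2.0):
    """Return the round-trip time in seconds, or None on timeout/error."""
    qid = int(time.time() * 1000) & 0xFFFF
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(name, qid), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return None
        # Only accept a reply whose ID matches the query we sent.
        if len(reply) < 2 or struct.unpack(">H", reply[:2])[0] != qid:
            return None
        return time.monotonic() - start


def watch(server: str, names, rounds: int, interval: float = 1.0) -> None:
    """Cycle through the names and log every lookup that gets no answer."""
    for i in range(rounds):
        name = names[i % len(names)]
        rtt = probe(server, name)
        stamp = time.strftime("%H:%M:%S")
        if rtt is None:
            print(f"{stamp} TIMEOUT {name}")
        else:
            print(f"{stamp} ok {name} {rtt * 1000:.0f} ms")
        time.sleep(interval)


# Hypothetical usage; 192.0.2.1 is a placeholder for the firewall's LAN address:
# watch("192.0.2.1", ["example.com", "example.org"], rounds=300)
```

Leaving this running for ten minutes or so should give timestamps for each dropout that can be lined up against the restarts in the firewall logs.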

During this period, watching the top command on the XG indicates that the dnscache's CPU is spiking and consuming all available resources on 1 of the CPUs.

Also, during these periods, the {dnscache} dnsd process appears to be restarting. I've never seen this process last longer than about 2 minutes or consume more than about 20 seconds of CPU time before restarting.

The big problem is that dnscache itself doesn't have its own log file. dnsgrabber.log appears to be unhelpful, and simply refers you to the FQDND process. fqdnd.log, if you watch it for a couple of minutes, can show errors that correspond with the timing of the process restart, but it's only a symptom (for example, we see a lot of:

dnsd.log does show that it is frequently restarting:

(Note the timestamps indicative of a restart every couple of minutes), and yes, this does correspond with the DNS resolution problems.

Finally, grepping csc.log for "dnsd" gives us the following:

Which again, corresponds with the restarts.
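To put a number on how often the restarts happen, the matching log lines can be reduced to the gaps between them. The following is only a sketch: it assumes each line starts with a timestamp in the form `YYYY-MM-DD HH:MM:SS`, which is a guess on my part and will need adjusting to whatever dnsd.log or csc.log actually writes. The names `TS_FORMAT`, `TS_LENGTH`, `matching_timestamps`, and `gaps_in_seconds` are made up for the example.

```python
from datetime import datetime

# ASSUMPTION: hypothetical timestamp layout at the start of each log line.
# Change both values to match the real log format.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_LENGTH = 19


def matching_timestamps(lines, needle="dnsd"):
    """Return the parsed timestamps of all lines containing `needle`."""
    stamps = []
    for line in lines:
        if needle not in line:
            continue
        try:
            stamps.append(datetime.strptime(line[:TS_LENGTH], TS_FORMAT))
        except ValueError:
            # Line does not start with a timestamp in the assumed format.
            continue
    return stamps


def gaps_in_seconds(stamps):
    """Return the time between consecutive timestamps, in seconds."""
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Hypothetical usage against a copy of the log pulled off the firewall:
# with open("csc.log", encoding="utf-8", errors="replace") as fh:
#     print(gaps_in_seconds(matching_timestamps(fh)))
```

Gaps clustering around 120 seconds would match the "every couple of minutes" pattern described above, and the same numbers make it easy to tell later whether a change actually helped.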

Okay, so now that I know what is happening, I simply cannot find out WHY it is happening. We've had several tickets open with support on this and have gotten nowhere. These restarts are impacting network performance and causing loading problems for users. They also seem to be at the root of our disconnects from Sophos Central (since one of the DNS queries it fails on is the Sophos Central Amazon server), as well as other issues.

Any takers on the issue?



  • OK, that's not it. I assume something is overloading the DNS cache and causing the constant restarts. Do you have any kind of DNS object which could have a "High frequently changed IP" in it?

    What about the VPN tunnels? Do you use an IP or an FQDN as the remote gateway?

  • The VPN tunnels are IP-based (as is standard with S2S AWS VPN). We were using the built-in Sophos/AWS VPC wizard under UTM, until we upgraded to XG. Since then, we've had to figure out how to live with a configuration we don't fully understand (admittedly, BGP isn't set up, so we're running off of one tunnel at a time at the moment).

    As for DNS, it's pretty standard AD DNS, with a few static entries entered into the firewall (things like printers). We've been doing some troubleshooting and it looks like there is some squirreliness between the firewall and our cloud-based AD DNS, but nothing we can easily put our finger on. Essentially, we know that the firewall should be reaching out to our Cox ISP DNS servers for anything other than what we've put in for request routing, but instead, everything is going through our AD DNS. However, that doesn't explain why the firewall is choking every couple of minutes. For your edification (since I'm guessing you'll ask), here are our DNS settings:

  • Bumping this to the top. We are planning on revamping our tunnels this weekend during a planned maintenance window, and addressing our BGP issues. I'll post our results then but until then, we're still seeing the dropouts.

  • So, I'm being cautiously optimistic, but after our weekend outage where we addressed our BGP tunnel problems, we seem to be seeing an improvement. dnsd was restarting a few times this morning, but since applying the fix for our SNAT described here:

    https://community.sophos.com/sophos-xg-firewall/f/discussions/129161/xg-authenticating-to-remote-aws-active-directory-servers-via-s2s-bgp-tunnels---nat-problem/474202#474202

    we've had no restarts in over half an hour. Before that, this morning, we were seeing restarts of dnsd about once every 10-20 minutes. I'm going to keep an eye on it, but that is already better than once every 2-4 minutes. Fingers crossed.

  • Just a quick update, dnsd has been stable for almost 24 hours now. It would appear that the issue was resolved by fixing our AWS VPC tunnels and properly implementing BGP, along with the system SNAT rule.