This discussion has been locked.

Strange drops

We have a customer with a phone switchboard application that periodically freezes, either at an application level (can't click anything), or it just won't show incoming calls. In both cases it can sometimes unfreeze, and then all the calls that have come in in the meantime suddenly flash on the screen. We've ruled out AV as the cause and are now looking into the problem being at the network layer.

drop-packet-capture shows this at the time of freezing:

2017-05-23 08:58:14 0101021 IP 10.10.90.2.8779 > 10.10.10.112.43470 : proto TCP: P 3007061919:3007062115(196) win 330 checksum : 55314
0x0000:  4500 00ec 18b4 4000 3f06 a9d2 0a0a 5a02  E.....@.?.....Z.
0x0010:  <remainder of the packet redacted>
Date=2017-05-23 Time=08:58:14 log_id=0101021 log_type=Firewall log_component=Firewall_Rule log_subtype=Denied log_status=N/A log_priority=Alert duration=N/A in_dev=Lag.90 out_dev=Lag.10 inzone_id=1 outzone_id=8 source_mac=00:1a:e8:8b:15:b4 dest_mac=00:e0:20:11:08:fc l3_protocol=IP source_ip=10.10.90.2 dest_ip=10.10.10.112 l4_protocol=TCP source_port=8779 dest_port=43470 fw_rule_id=0 policytype=1 live_userid=0 userid=0 user_gp=0 ips_id=0 sslvpn_id=0 web_filter_id=0 hotspot_id=0 hotspotuser_id=0 hb_src=0 hb_dst=0 dnat_done=0 proxy_flags=0 icap_id=0 app_filter_id=0 app_category_id=0 app_id=0 category_id=0 bandwidth_id=0 up_classid=0 dn_classid=0 source_nat_id=0 cluster_node=0 inmark=0x0 nfqueue=101 scanflags=0 gateway_offset=0 max_session_bytes=0 drop_fix=0 ctflags=33554472 connid=2341170016 masterid=0 status=398 state=3 sent_pkts=N/A recv_pkts=N/A sent_bytes=N/A recv_bytes=N/A tran_src_ip=N/A tran_src_port=N/A tran_dst_ip=N/A tran_dst_port=N/A

The same entry then appears again exactly 2 minutes later (even the checksum is the same).
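
The visible part of the hex dump can be decoded to cross-check the summary line. The following is a minimal sketch: it assumes the `0x0000` row starts at the IPv4 header, and the function and variable names are made up for illustration. Only the first 16 header bytes are decoded because the rest of the packet is redacted.

```python
import ipaddress
import struct

# First 16 bytes of the IPv4 header, copied from the 0x0000 line of the capture.
# The last 4 header bytes (destination address) sit on the redacted 0x0010 line.
HEADER_HEX = "4500 00ec 18b4 4000 3f06 a9d2 0a0a 5a02"

# Payload length reported by the capture summary: 3007061919:3007062115(196).
TCP_PAYLOAD_LEN = 196


def decode_ipv4_prefix(hex_words: str) -> dict:
    """Decode the first 16 bytes of an IPv4 header given as spaced hex words."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    (ver_ihl, _tos, total_len, ident, flags_frag,
     ttl, proto, hdr_checksum, src) = struct.unpack("!BBHHHBBH4s", raw)
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,   # IHL is counted in 32-bit words
        "total_len": total_len,
        "ident": ident,
        "dont_fragment": bool(flags_frag & 0x4000),
        "ttl": ttl,
        "protocol": proto,                    # 6 is TCP
        "header_checksum": hdr_checksum,
        "src": str(ipaddress.IPv4Address(src)),
    }


fields = decode_ipv4_prefix(HEADER_HEX)
# Whatever is left after the IP header and the payload is the TCP header.
tcp_header_len = fields["total_len"] - fields["header_len"] - TCP_PAYLOAD_LEN
print(fields)
print("TCP header bytes:", tcp_header_len)
```

The decoded source address and protocol agree with the summary line. The IP header checksum here is 0xa9d2 (43474), which differs from the 55314 on the summary line, so that value is presumably the TCP checksum. If the second logged packet is available unredacted, comparing its `ident` field with this one would suggest whether the sender built a fresh packet for the retry.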

The connection came good another minute later.

Any idea where to look next?

thanks

James



  • Yes, using STAS (and SATC), but not enforced: user information is logged if available but is not required by any rules. I had considered SATC as a possible cause as it integrates into the system quite deeply, but this is happening on PCs as well as Citrix.

    I do notice from your packet that you seem to be using VLAN on LAG. Are you using LACP too?

    I might try disabling a port so the LAG only uses one port. It's not the same as not using LAG at all but is easy enough to try - disabling the LAG would require an uncomfortable amount of reconfiguration on XG.

    One thing I finally managed to capture is another case that I've seen a few times, where users get "Page could not be displayed". A tcpdump on the PC shows SYN packets being sent but nothing is logged on the router; it's as if the packets never reach it.

    If that's happening mid connection too then that might explain the mid-stream packet drops we've both seen.

  • I have been running a tcpdump on one of the servers on this network, filtering by SYN packets, and looking for SYN retransmissions.

    There are the expected retransmissions here and there, but there are definite periods with an excessive number of retransmissions, both to internal hosts that I know are up and to external hosts. This matches my past experience where pages suddenly stop loading, even though already established connections seem to be okay (and my connection via Citrix appears unaffected).

    What could be causing this? I don't have any IPS or DoS turned on on these networks, and there is nothing in any logs at this time.

    The Sophos XG never logs any evidence of these packets, so I don't know if they are reaching the XG or not. My hope would be to run a tcpdump side by side on the server and the XG and compare results, but tcpdump is crippled on XG so this isn't possible.

    I have enough now to open a case though, so I guess I'll do that next.

    James
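
A minimal sketch of the SYN-retransmission check described in the post above, assuming the capture has already been reduced to one record per SYN. The `Syn` record, the function names, the 60-second window, and the sample addresses are all illustrative assumptions; none of them come from tcpdump or from the case.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Syn(NamedTuple):
    """One captured SYN; a hypothetical record, not a tcpdump output format."""
    ts: float   # capture time in seconds
    src: str
    sport: int
    dst: str
    dport: int
    seq: int    # initial sequence number carried by the SYN


def syn_retransmissions_per_window(syns: Iterable[Syn], window: int = 60) -> Counter:
    """Count repeated SYNs per time window.

    A SYN counts as a retransmission when the same 4-tuple and sequence
    number has already been seen earlier in the capture.
    """
    seen = set()
    per_window = Counter()
    for syn in sorted(syns, key=lambda s: s.ts):
        key = (syn.src, syn.sport, syn.dst, syn.dport, syn.seq)
        if key in seen:
            per_window[int(syn.ts // window) * window] += 1
        else:
            seen.add(key)
    return per_window


def busy_windows(per_window: Counter, threshold: int) -> List[int]:
    """Return the start times of windows with at least `threshold` retransmissions."""
    return sorted(start for start, n in per_window.items() if n >= threshold)


# Made-up sample using documentation addresses (192.0.2.0/24).
sample = [
    Syn(0.0, "192.0.2.10", 50000, "192.0.2.80", 443, 1000),
    Syn(1.0, "192.0.2.10", 50000, "192.0.2.80", 443, 1000),   # retransmission
    Syn(3.0, "192.0.2.10", 50000, "192.0.2.80", 443, 1000),   # retransmission
    Syn(70.0, "192.0.2.11", 50001, "192.0.2.80", 443, 2000),
    Syn(71.0, "192.0.2.11", 50001, "192.0.2.80", 443, 2000),  # retransmission
]
counts = syn_retransmissions_per_window(sample)
print(dict(counts))
print(busy_windows(counts, threshold=2))
```

Running the same count on captures taken on both sides of the firewall, where that is possible, would show whether the bursts line up in time.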

  • Hi,

    Could you share the case details so we can check on our end?

  • I have the exact same problem. I've already been told to look for an asymmetric routing problem, but the XG is the only path between my two networks as well. I've had 3 other sets of eyes on this, including 2 CCIEs, and they all point back at the XG.

    "Strange Drops" is a good way to put it.

  • Matthew, can you tell me a little about your setup? The other user who reported similar issues is using VLANs on top of a LAG interface. Are you doing this also? Are you using LACP? I have since tried disabling one of the LAG ports so it's now only on one port, but this isn't having any effect. If you are using LAG also then I might try removing the LAG setup to test if that is somehow affecting the problem.

    I have a case open with Sophos for this. I got a great response from support initially, but having tried all the easy things (check asymmetric routing, turn off micro app discovery, etc.) I haven't heard anything in 2 days.

    It's a HA setup so one of the devices will be removed shortly and will have SG put back on it if I don't get anywhere with this. I'm really reluctant to do this as it means the problem will likely not get solved as I have no way to test it.

    James

  • Traffic comes in through a pair of Cisco routers running BGP on the outside (I don't want to run 500Mbps BGP links on the XG; it would probably choke), then to a pair of Cisco core switches on a VLAN, then to the XGs, and from the XGs back to my core switching and on to my access switching on the native VLAN. There are no LAGs or port-channel groups anywhere except between the core and access switches. I have 2 XG 330s in HA.

    Again ISP(s) -> Router(s) -> Core on a VLAN -> XGs -> Core -> Access (This is a Channel from the Cores)

    What I keep seeing is that random clients lose connection to the Internet (LAN->WAN) for 30-45 seconds and during that time I'm dropping what appears to be valid traffic. I've tried the micro-app off, no IPS, no web, no app for a single test machine I generate some internet load on and it craps out randomly. It's not the same machine dropping every time and it's random machines. I get no drops of packets pinging between router/firewall/switch/etc at that time. Just no traffic from Client -> Internet for a random client at random times.

    We've already sat there and spanned some ports on my cores and sniffed out the traffic and it's not going asymmetrically. I am not incrementing errors on any interfaces, anywhere.

  • Hi guys, sorry for the delayed response. I haven't had much time in the last few days to look further into this. My case has been escalated to a tier 3 engineer who is going back and forth with the development team. The tier 3 engineer knows the XG very well, and what we've come up with is this: the XG seems to look for a user-based firewall rule whenever a packet comes through with what I'll call a user header on it, even if it matches an inter-VLAN rule and shouldn't need one. This seems to cause another connection to be tracked, which makes the firewall drop the packet because it is not in the right order. There also seems to be a bug where it checks for the user on a network even if that subnet is excluded in STAS. Right now it looks like multiple issues are causing this; I believe they have logged 2 bugs so far on my case. To me, it seems like a very poor design for identifying users if there is no way to properly exclude networks. Furthermore, I wish it wasn't forced. I don't care which user is using inter-VLAN routing. I created those rules by subnet, but the tier 3 engineer said that if I choose to use STAS, I will have to create 2 rules for everything: one network-based and one user-based. Seems stupid to me.

    James, I took down the LAG and it didn't change anything. If you can, disable STAS completely and see if your issue goes away. That will verify whether it is authentication related like mine. I know that is a lot to ask.

    Matthew, I believe your issue may be helped by changing some of your settings as described at this link: https://community.sophos.com/kb/en-us/125468. What you are describing is how the XG behaves when it is in "learning mode", if you too are using STAS. The default unauth-traffic drop-period is 120 seconds.

    Mike
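
One way to check whether repeated drops line up with the 120-second period mentioned above is to measure the gaps between denied-packet log entries. This is only a sketch: the helper names are hypothetical, the sample lines are shortened from the log entry in the first post, and the second line is reconstructed from "exactly 2 minutes later" rather than copied from a real log. A matching gap would be a hint worth following up, not proof of the cause.

```python
from datetime import datetime
from typing import Dict, List

# Default unauth-traffic drop-period quoted in the thread, in seconds.
UNAUTH_DROP_PERIOD_S = 120


def parse_kv(line: str) -> Dict[str, str]:
    """Split a 'key=value key=value ...' log line into a dict of strings."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def log_time(fields: Dict[str, str]) -> datetime:
    """Combine the Date= and Time= fields of one parsed log entry."""
    return datetime.strptime(f"{fields['Date']} {fields['Time']}", "%Y-%m-%d %H:%M:%S")


def gaps_seconds(lines: List[str]) -> List[float]:
    """Return the gaps, in seconds, between consecutive log entries."""
    times = sorted(log_time(parse_kv(line)) for line in lines)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Shortened sample; the second entry is an assumption based on the thread text.
drops = [
    "Date=2017-05-23 Time=08:58:14 log_subtype=Denied source_ip=10.10.90.2 "
    "dest_ip=10.10.10.112 fw_rule_id=0",
    "Date=2017-05-23 Time=09:00:14 log_subtype=Denied source_ip=10.10.90.2 "
    "dest_ip=10.10.10.112 fw_rule_id=0",
]
gaps = gaps_seconds(drops)
matches = [g for g in gaps if g == UNAUTH_DROP_PERIOD_S]
print(gaps, "gaps equal to the drop period:", len(matches))
```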

  • Oh yeah, I got STAS on. I'll try that.

  • When you say "using STAS", do you mean you have a rule that requires a user to have been identified? or just that "system auth cta enable" has been run?

    I have STAS and SATC enabled, but don't have any rules present that depend on a user having been identified. The user is just logged on a best effort basis.

    I have disabled "system auth cta" and will see what happens next.

  • From what support is telling me, just having STAS enabled causes some packets to have a "user tag" on them. If the packet has the user tag but no user rule exists, the XG can drop it for whatever reason. I really hope this is a bug and not the way it was designed. That is a terrible approach IMHO. I too did not have any user-based rules between networks. Like you, I was hoping the XG would just identify users on a best-effort basis, but that does not seem to be the logic of the device. Support had me create user-based rules to mimic the network rules and most drops have gone away. We are now trying to figure out why it is looking for user tags on traffic that should be excluded. I hope this helps you guys. It has been a very frustrating few weeks for me, once again using the great "XG". Please let me know how it goes. I will update this thread as well when I hear back from support.