This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

17.0.8 MR8 is out... Hundreds of "VPN Down" logs in just an evening ... Imagine how many I received since 17.0.8 was installed.

For those who asked, our VPN setup is the result of the intervention of 2 seniors engineers at Sophos, Boston.  I normally need no help to setup a IKEv2 VPN. I was doing it with no sweat on any firewall more than a decade ago.



This thread was automatically locked due to age.
Parents
  • We are starting to see this same issue on a few of our sites.  Did Sophos come up with any solution?  It is only going down for a few minutes here and there but really shouldn't be going down at all.  I've verified the gateway is staying up and I can also ping the WAN interface on the XG.  When it is down, I am not able to get into the web interface of the unit which is interesting.  

  • Hello

    I opened a case since this VPN falls every 4 minutes for around one minute.

    I received these answer for Sophos' senior engineer responsible for VPNs:

    Answer One:

    Hi Paul

    I took a look at the log files and it seems there’s massive packet loss (at least for the IKE packets). For the time and tunnel that we looked at the mails for it seems the connection was faulty for some reason.

    The log on QC looks like:

    2018-05-14 11:37:04 06[ENC] <QC_to_AA-1|323> generating INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:08 30[IKE] <QC_to_AA-1|323> retransmit 1 of request with message ID 0

    2018-05-14 11:37:15 22[IKE] <QC_to_AA-1|323> retransmit 2 of request with message ID 0

    2018-05-14 11:37:28 12[IKE] <QC_to_AA-1|323> retransmit 3 of request with message ID 0

    2018-05-14 11:37:52 12[IKE] <QC_to_AA-1|323> retransmit 4 of request with message ID 0

    2018-05-14 11:38:22 26[ENC] <QC_to_AA-1|323> parsed INFORMATIONAL request 16 [ ]

    2018-05-14 11:38:22 26[ENC] <QC_to_AA-1|323> generating INFORMATIONAL response 16 [ ]

    2018-05-14 11:38:34 20[IKE] <QC_to_AA-1|323> retransmit 5 of request with message ID 0

    2018-05-14 11:38:34 06[ENC] <QC_to_AA-1|323> parsed INFORMATIONAL response 0 [ ]

    The log on AA looks like:

    2018-05-14 11:37:04 09[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:04 09[ENC] <AA_to_QC-1|325> generating INFORMATIONAL response 0 [ ]

    2018-05-14 11:37:04 11[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (ADC08921) from other side

    2018-05-14 11:37:08 22[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:08 22[IKE] <AA_to_QC-1|325> received retransmit of request with ID 0, retransmitting response

    2018-05-14 11:37:08 25[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (ADC08921) from other side

    2018-05-14 11:37:15 10[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:15 10[IKE] <AA_to_QC-1|325> received retransmit of request with ID 0, retransmitting response

    2018-05-14 11:37:15 20[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (ADC08921) from other side

    2018-05-14 11:37:28 19[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:28 19[IKE] <AA_to_QC-1|325> received retransmit of request with ID 0, retransmitting response

    2018-05-14 11:37:28 27[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (ADC08921) from other side

    2018-05-14 11:37:52 12[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:37:52 12[IKE] <AA_to_QC-1|325> received retransmit of request with ID 0, retransmitting response

    2018-05-14 11:37:52 05[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (ADC08921) from other side

    2018-05-14 11:38:22 05[IKE] <AA_to_QC-1|325> sending DPD request

    2018-05-14 11:38:22 05[ENC] <AA_to_QC-1|325> generating INFORMATIONAL request 16 [ ]

    2018-05-14 11:38:22 28[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL response 16 [ ]

    2018-05-14 11:38:34 24[ENC] <AA_to_QC-1|325> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 11:38:34 24[IKE] <AA_to_QC-1|325> received retransmit of request with ID 0, retransmitting response

    Note the child_alert messages on the AA side. I think that’s where the issues come from. CB seems to never receive the responses before AA sends its own DPD message (here 16). After 16 was processed all of a sudden the response to message 0 comes through as well. I suppose in the case where a message like 16 doesn’t come in time before the 5th retransmit, QB would have detected the peer as dead and shut down the connection just to re-establish it right away again.

    I will discuss this with the team tomorrow. Currently I’m not sure who’s to blame here. I’ll enable a higher debug level for the daemon, maybe that will reveal more info.

    I’ll keep you updated on the proceedings in this case.

    Answer Two:

    Hi Paul

    I think something in your network is not working right. In the tcpdump from AA I see that the response to the DPD packet destined to QB is actually received by AA instead of QB (that’s also where the invalid SPIs come from):

    2018-05-14 14:13:05 03[NET] received packet: from 207.134.161.10[500] to 205.237.70.202[500]

    2018-05-14 14:13:05 03[NET] waiting for data on sockets

    2018-05-14 14:13:05 20[NET] <AA_to_QC-1|334> received packet: from 207.134.161.10[500] to 205.237.70.202[500] (96 bytes)

    2018-05-14 14:13:05 20[ENC] <AA_to_QC-1|334> parsed INFORMATIONAL request 0 [ ]

    2018-05-14 14:13:05 20[IKE] <AA_to_QC-1|334> received retransmit of request with ID 0, retransmitting response

    2018-05-14 14:13:05 20[NET] <AA_to_QC-1|334> sending packet: from 205.237.70.202[500] to 207.134.161.10[500] (96 bytes)

    2018-05-14 14:13:05 04[NET] sending packet: from 205.237.70.202[500] to 207.134.161.10[500]

    2018-05-14 14:13:05 03[NET] received packet: from 205.237.70.202[500] to 205.237.70.202[500]

    2018-05-14 14:13:05 03[NET] waiting for data on sockets

    2018-05-14 14:13:05 24[DMN] [GARNER-LOGGING] (child_alert) ALERT: received IKE message with invalid SPI (375D9B4E) from other side

    2018-05-14 14:13:18 31[APP] [COP-UPDOWN][STATUS] (db_status_update) conn_name: AA_to_QC count: 15

    14:13:28.740389 Port2, IN: IP 207.134.161.10.500 > 205.237.70.202.500: isakmp: parent_sa inf2

            0x0000:  4500 007c 598a 4000 3711 649e cf86 a10a

            0x0010:  cded 46ca 01f4 01f4 0068 8943 4e9b 5d37

            0x0020:  8c16 11a1 f2c7 99d4 ebf9 cea6 2e20 2500

            0x0030:  0000 0000 0000 0060 0000 0044 1792 d7e8

            0x0040:  416a 0e04 98e3 d72a deff 06c3 b4a1 10ff

            0x0050:  a53e 6010 c634 65cc e48a c74b 66dd cf88

            0x0060:  ac55 3cff e761 5673 fa62 2e67 8a4b 2632

            0x0070:  dfc0 3b59 3de4 f345 ce8c 7ef2

    14:13:28.740614 Port2, OUT: IP 205.237.70.202.500 > 207.134.161.10.500: isakmp: parent_sa inf2

            0x0000:  4500 007c 8945 4000 4011 2be3 cded 46ca

            0x0010:  cf86 a10a 01f4 01f4 0068 d616 4e9b 5d37

            0x0020:  8c16 11a1 f2c7 99d4 ebf9 cea6 2e20 2528

            0x0030:  0000 0000 0000 0060 0000 0044 78b4 8d4b

            0x0040:  a0d0 dbdf 3450 66c8 398f 8c12 118d 4875

            0x0050:  cc8c d769 a5f1 1b1e 4a7b 13f9 3373 5570

            0x0060:  e30a 8e4d 6161 f525 19cb 1ce9 00af 717f

            0x0070:  7fb6 e228 9b93 8c77 6afb cbae

    14:13:28.740880 Port2, IN: IP 205.237.70.202.500 > 205.237.70.202.500: isakmp: parent_sa inf2

            0x0000:  4500 007c 8945 4000 3f11 88bc cded 46ca

            0x0010:  cded 46ca 01f4 01f4 0068 31f0 4e9b 5d37

            0x0020:  8c16 11a1 f2c7 99d4 ebf9 cea6 2e20 2528

            0x0030:  0000 0000 0000 0060 0000 0044 78b4 8d4b

            0x0040:  a0d0 dbdf 3450 66c8 398f 8c12 118d 4875

            0x0050:  cc8c d769 a5f1 1b1e 4a7b 13f9 3373 5570

            0x0060:  e30a 8e4d 6161 f525 19cb 1ce9 00af 717f

            0x0070:  7fb6 e228 9b93 8c77 6afb cbae

    I believe once this issue is fixed your IPsec connections will be stable. One thing that could help you get to the root of this is the fact that the DPD response is actually delivered to QB correctly after AA itself sent a DPD request (16 from the mail below). I’m not an expert in routing or NAT, but would assume that some misplaced rule could be causing this behavior. Please contact support for further assistance.

    Here, I think the engineer meant QC instead of QB.

    Conclusion massive packet lost probably because of a misplaced rule or routing ... So ... Nat or Networking is screwed up ...  I do not have a single route installed, so. Only possible route are those done automatically by the firewall itself.

    A had VPN down very often before, but not to that stratospheric level.

    Go figure.  Paul Jr

  • I don't buy a routing issue because I don't have any setup either.  I also tested connecting to the same site from three other sites and none of them worked.  I can ping the external interface of the XG as well as the gateway without issue.  However I cannot get into the XG using the admin interface.  I believe something is freezing up the XG.  I'll be at that site later this week and will attempt to confirm that.

  • Brian, do you also see log lines in /log/charon.log where the source and destination address are the same, like:

    received packet: from 205.237.70.202[500] to 205.237.70.202[500]

Reply Children
  • Hello Heiko

    I contacted support.  They are asking me to run with an IKEv2 setp with DH 31 and 18 removed and with SHA1 activated.

    I am no encryption expert, but weakness in SHA1 have been demonstrated more than a decade ago, and Microsoft themselves have been refusing SHA1 certificates for almost two years now.  Same with anyone else in the industry. CERT-US have urged anyone to stop using it for years.

    Even my dog knows SHA1 is dead !!!

    What is going here ?  Why are they asking something that outdated and unsecured ?

    Paul Jr

  • Ok ...  Sophos support asked me to test this IKEv2 policy.

    Exact same behaviour.  VPN falls every 4 minutes ...  Besides, this policy is absolutely not recommended.  SHA1 is dead and buried.  Key life is now 15 hours (54000 secondes), DH groups has 31, and 18 disabled.  Which in the end means quite an insecure and outdated policy.

    But the thing is: reducing encryption load do not change the VPN falling every 4 minutes.  The theory that XG and its own routing is guilty is even more probable.

    I am now living with VPNs that falls every so often for more than a year now.

    Paul Jr 

  • For those who ask, and as a follow up, nope !!!  this case still unsolved.  Apparently, we are not alone afflicted by this.

  • From horrible to even worse.  Sophos support made some changes to STAS and shutdown Dead Peer Detection.  So, we are effectively now back to IKEv1 like 15 years ago.  The VPN falls down and do not come back automatically anymore.

  • I was asked to re-activate Dead Peer Detection at both ends.  Seems to bring back the VPN automatically.

    Paul Jr