Hi
After years of issues managing IPsec tunnels to Cisco ASAs, I have learned some things that I have not found documented anywhere, and thought I would share them in case others have similar issues. Our scenario is that we run a data warehouse which does daily ETL extracts from our remote customers' servers (we have 60+ XGs running in Azure). The issue is that we get almost daily complaints that the ETL jobs fail, and when we log into the XG WebAdmin, the VPN status is yellow, with many of the SAs down/red. For the longest time we have just reset the tunnel to bring it back up completely. We hold calls with the remote network engineers to review the configs and find that everything matches, but we still get these SAs that drop out.
Several weeks ago I set up my ASA and an XG in a lab and ran some Wireshark captures between them, as well as enabling full debug on the ASA. My findings were interesting, and they helped me implement what is now going on 3+ weeks of rock-solid VPN and much happier customers.
Root Cause
The Cisco ASA becomes the IKE initiator (100% of the time) on the first IKE rekey, and the XG then relies on the ASA to bring up the SAs, but Cisco only does this when there is traffic. XG does not follow the RFC, which states that either side can initiate the child_sa once Phase 1 is up, so many of the SAs never come back up and we are unable to reach the remote server.
Why does this happen? Most of our tunnels to the ASAs have a Phase 1 lifetime of 28800 or 86400 seconds. Cisco will rekey IKE at exactly 95% of the configured lifetime. If we do some quick math on a 28800s lifetime, that is 27360 seconds. On XG, the rekey margin (excluding the random fuzz) can be set to a maximum of 999 seconds, so if our random fuzz ended up being 0, we would not rekey until 27801 seconds (28800 - 999). Cisco WINS and becomes the new initiator every time.
The fix
According to the RFC, lifetimes are a local setting and do not need to match the remote peer. This means that the following fix can be applied on the XG alone or on both peers. If we set the lifetime to 14400 seconds and the margin to 999 (we use a fuzz of 10%, but you could use more), we will rekey first. Cisco would rekey at 95% of 14400, which is 13680 seconds. XG would rekey at 13401 seconds (14400 - 999) at the latest. XG WINS.
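To sanity-check the numbers, here is a minimal Python sketch of the arithmetic above. It takes the 95% ASA rekey point and the 999-second margin from this post; the function names are made up for illustration, and it assumes XG applies the fuzz the way strongSwan documents it (the margin is randomly increased by up to the fuzz percentage), which I have not confirmed for XG itself.

```python
def asa_rekey_at(lifetime_s: int) -> int:
    # From the post: the ASA rekeys IKE at 95% of the configured lifetime.
    return lifetime_s * 95 // 100


def xg_rekey_window(lifetime_s: int, margin_s: int, fuzz_pct: int) -> tuple[float, int]:
    # Assumption: strongSwan-style fuzz, i.e. the margin is randomly
    # increased by 0..fuzz_pct percent, so the rekey can only move earlier.
    max_margin = margin_s * (100 + fuzz_pct) / 100
    earliest = lifetime_s - max_margin
    latest = lifetime_s - margin_s
    return earliest, latest


def first_rekeyer(lifetime_s: int, margin_s: int, fuzz_pct: int) -> str:
    asa = asa_rekey_at(lifetime_s)
    earliest, latest = xg_rekey_window(lifetime_s, margin_s, fuzz_pct)
    if latest < asa:
        return "XG"      # XG always rekeys before the ASA
    if earliest > asa:
        return "ASA"     # ASA always rekeys before the XG
    return "either"      # windows overlap, depends on the random fuzz


for lifetime in (28800, 14400):
    print(lifetime, asa_rekey_at(lifetime),
          xg_rekey_window(lifetime, 999, 10),
          first_rekeyer(lifetime, 999, 10))
```

With a margin of 999 and a fuzz of 10%, this reports the ASA as the first to rekey at 28800 seconds and the XG as the first at 14400 seconds, matching the figures above.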
You might ask, why not just lower the lifetime on the XG and set the Cisco to something higher? Well, in my packet captures I observed that during IKEv1 Phase 1 initial contact, the lifetime is actually shared as part of the transform set, and Cisco will use the lower of the two settings. This means that if the ASA policy is set to 86400 and we set the XG to something much smaller, 5400 for example, the Cisco will use the value that the XG has.
Another important fix for ASAs is to set the idle SA timeout to unlimited. Without this, the ASA will tear down an SA that has not received traffic for 30 minutes. This causes an issue because the XG likes to keep the SAs up and will try to replace the deleted SA. For some unknown reason the ASA will not establish this new SA for 1-2 minutes, resulting in NO_PROPOSAL_CHOSEN. This is set on the group policy:
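As a rough illustration of that observation, the Python snippet below models the ASA as adopting the lower of the two Phase 1 lifetimes. The function name is hypothetical, and combining the result with the 95% rekey point is my own inference from the figures in this post, not something taken from Cisco documentation.

```python
def negotiated_ike_lifetime(asa_policy_s: int, xg_policy_s: int) -> int:
    # Observed in the IKEv1 captures described in the post:
    # the ASA ends up using the lower of the two lifetimes.
    return min(asa_policy_s, xg_policy_s)


effective = negotiated_ike_lifetime(asa_policy_s=86400, xg_policy_s=5400)

# Assumption: the ASA applies its 95% rekey point to the negotiated value.
asa_rekey = effective * 95 // 100

# Latest XG rekey with the maximum margin of 999 seconds and zero fuzz.
xg_latest_rekey = 5400 - 999

print(effective, asa_rekey, xg_latest_rekey)
```

Under those assumptions the ASA's own policy value stops mattering once the XG proposes something lower, which is why the fix can be applied from the XG side.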
group-policy GroupPolicy_1.1.1.1 attributes
 vpn-idle-timeout none
Closing
Sophos Support has told us numerous times to "set the ASA to be answer-only connection type". While that helps on the initial VPN establishment, the Cisco will still rekey the connection first if we do not configure the lifetimes as described above. It would also be nice if Sophos would follow the IPsec RFC and initiate the child_sa, using what strongSwan calls "traps". That is like moving a mountain. Trying to get them to see this issue has been impossible, so I thought I would share, hoping this might help others.
TL;DR
Set your Phase 1 lifetime to 14400 seconds and the margin to 999 (fuzz 10%). The Phase 2 lifetime must match or be less than this. We use 3600 seconds.