I thought I would post an update to my thread.
We found a bug in the Cisco software running the GetVPN key-server routers (Cisco bug ID CSCuq18492). Every so often the key servers would lose their connection to each other, despite being able to ping back and forth with no problems. This caused an outage when both key-server routers went primary just before a re-key. Some of the GetVPN group member routers were associated with the real primary and others with the new primary. When the re-key happened, the two routers sent out different keys, so once the group members received their new keys and began to use them, traffic flows stopped. But since EIGRP was excluded from encryption, all of the newly broken links continued to appear to the routers as valid paths, so traffic kept being sent down them. How did I find this out? By a major WAN outage that impacted the largest sites on my network.
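For anyone who wants to picture the failure, here is a rough toy model in Python. It is only a sketch and not how GetVPN is implemented: the class names, site names, and the random key are all made up for illustration. It just shows that when two key servers each act as primary and re-key their own members, sites behind different key servers end up with keys that don't match.

```python
import secrets


class GroupMember:
    """Hypothetical stand-in for a GetVPN group member router."""

    def __init__(self, name):
        self.name = name
        self.key = None  # no traffic key until a key server pushes one


class KeyServer:
    """Hypothetical stand-in for a key server that believes it is primary."""

    def __init__(self, name):
        self.name = name
        self.members = []

    def rekey(self):
        # Each "primary" generates its own key and pushes it to its members.
        key = secrets.token_hex(16)
        for member in self.members:
            member.key = key


def can_exchange_traffic(a, b):
    """Two members can decrypt each other's traffic only with the same key."""
    return a.key is not None and a.key == b.key


ks1, ks2 = KeyServer("ks1"), KeyServer("ks2")
site_a, site_b, site_c = GroupMember("site-a"), GroupMember("site-b"), GroupMember("site-c")

# Split brain: some members follow the real primary, others the new one.
ks1.members = [site_a, site_b]
ks2.members = [site_c]

ks1.rekey()
ks2.rekey()

print(can_exchange_traffic(site_a, site_b))  # same key server
print(can_exchange_traffic(site_a, site_c))  # different key servers
```

With EIGRP left unencrypted, the routing protocol never notices the second case, which is why the broken paths still looked valid.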
After the code issue was resolved by upgrading to a flavor of 15.4, I started testing a new design: putting EIGRP inside the encryption, so that if there were a problem with encryption the EIGRP traffic would also stop flowing correctly, any affected EIGRP adjacency would drop, and the path would no longer appear valid for traffic.
This brought up another issue: if there is no EIGRP adjacency, how will a remote site router know how to reach the GetVPN key-server router? Floating static routes fixed this. I'm using a static route with an AD of 220, so when EIGRP comes up its lower AD value edges out the static route and traffic flows through the normal paths established by EIGRP.
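To illustrate the floating static idea, here is a small Python sketch of route selection by administrative distance. It is a simplified model, not router code: the prefix and next-hop addresses are documentation placeholders I made up, and I'm assuming the EIGRP route is internal (default AD 90). The only value taken from my setup is the static AD of 220.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    source: str
    admin_distance: int
    next_hop: str


def best_route(candidates):
    """Return the candidate with the lowest administrative distance, or None."""
    return min(candidates, key=lambda r: r.admin_distance, default=None)


# Hypothetical key-server prefix and next hops (documentation address ranges).
KEY_SERVER = "192.0.2.10/32"
floating_static = Route(KEY_SERVER, "static", 220, "198.51.100.1")
eigrp_route = Route(KEY_SERVER, "eigrp", 90, "203.0.113.1")  # assumed internal EIGRP

# Before any EIGRP adjacency forms, only the floating static is available,
# so the router can still reach the key server and register.
before = best_route([floating_static])

# Once EIGRP comes up, its lower AD wins and the static route floats out.
after = best_route([floating_static, eigrp_route])

print(before.source, before.admin_distance)
print(after.source, after.admin_distance)
```

If the EIGRP adjacency drops again, the EIGRP route is withdrawn and the static is the only candidate left, so the router can still get to the key server to recover.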
This has now been back in production for a couple of weeks, after tons of testing with application owners and letting smaller sites burn in for a while to make sure. In fact, our provider had a fiber cut a week or so ago which dropped one of our GetVPN hub sites entirely. The remote site that was up in testing/validation at the time simply dropped that single EIGRP adjacency while staying connected to the other two hub sites, with zero dropped traffic and no reports of issues.
I figured I'd post this here since I couldn't find any other information on this anywhere else on the Internet in my searching. I'm not saying it doesn't exist... I just couldn't find it.