Failover between data centers

Started by Shura182, January 20, 2021, 08:36:40 AM

Shura182

Hello guys,


I'm working on a project to plan full redundancy for our client. There are going to be two data centers (primary and standby).

Each DC will have MPLS connectivity to the client's locations, Internet connectivity, and connectivity to different vendors (each vendor will install and manage its own router at each DC). There is going to be an L2 link between the two DCs.

How can I advertise the network(s) from the primary data center and, if a router there goes down, start advertising them from the second data center?



Many thanks! 

Otanx

How many networks are you advertising from each? The way we do it is probably the easiest, but it does not work for everyone. We have a /23 (x.y.0.0/23) and advertise that out of both data centers. We then advertise x.y.0.0/24 from DC A and x.y.1.0/24 from DC B. This way, under normal conditions, the more specific /24 route is primary for each. However, if either data center goes offline, the /23 will take over for the missing /24 and send the data to the other DC.
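The /23 plus /24 behavior comes down to longest-prefix matching, which can be sketched as below. This is only an illustration: 10.20.0.0/23 stands in for x.y.0.0/23, and the function names `best_route` and `advertised` are made up for the example. It models prefix length only, not any other route-selection attribute.

```python
import ipaddress

def best_route(dest, routes):
    """Return the (prefix, site) pair with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, site) for net, site in routes if addr in net]
    if not matches:
        return None
    # Longest prefix wins, exactly like a router's forwarding lookup.
    return max(matches, key=lambda m: m[0].prefixlen)

def advertised(dc_a_up=True, dc_b_up=True):
    """Routes visible to the outside world, depending on which DCs are up."""
    routes = []
    if dc_a_up:
        routes += [(ipaddress.ip_network("10.20.0.0/23"), "DC-A"),  # covering route
                   (ipaddress.ip_network("10.20.0.0/24"), "DC-A")]  # DC A's specific
    if dc_b_up:
        routes += [(ipaddress.ip_network("10.20.0.0/23"), "DC-B"),  # covering route
                   (ipaddress.ip_network("10.20.1.0/24"), "DC-B")]  # DC B's specific
    return routes

if __name__ == "__main__":
    print(best_route("10.20.0.10", advertised()))               # both DCs up
    print(best_route("10.20.0.10", advertised(dc_a_up=False)))  # DC A offline
```

With both DCs up, a host in the first /24 is reached through DC A's specific route. With DC A offline, only DC B's /23 still covers that address, so traffic lands at DC B.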

If you don't have a /23, it becomes much more difficult. If you have a link between the two DCs, you may just want to advertise the IP space from both and deal with routing between them internally. This can lead to asymmetric routing, but as long as you handle that outside the firewalls you should be fine. Otherwise you have to do weird tricks to get your traffic routed correctly, with no guarantee it will work the way you want. Typically, prepending your AS to the path is the first thing to look at if you go down that route.
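As a rough sketch of what AS-path prepending does: the standby DC announces the same prefix with a longer AS path, so a neighbor that compares path length prefers the primary. This is a simplification. Real BGP checks other attributes (such as local preference) before AS-path length, and remote networks can override your prepends, which is why there is no guarantee. The ASN 64500 is a documentation-range placeholder, and the function names are hypothetical.

```python
OWN_ASN = 64500  # placeholder ASN from the documentation range

def prepend(as_path, asn, times):
    """Return a copy of as_path with asn prepended `times` extra times."""
    return [asn] * times + list(as_path)

def announcements(primary_up=True, standby_up=True, prepends=3):
    """AS paths a neighbor would receive for the same prefix from each DC."""
    paths = {}
    if primary_up:
        paths["primary"] = [OWN_ASN]
    if standby_up:
        # The standby makes its path artificially longer.
        paths["standby"] = prepend([OWN_ASN], OWN_ASN, prepends)
    return paths

def preferred(paths):
    """Pick the site with the shortest AS path; None if nothing is announced."""
    if not paths:
        return None
    return min(paths, key=lambda site: len(paths[site]))

if __name__ == "__main__":
    print(preferred(announcements()))                  # normal operation
    print(preferred(announcements(primary_up=False)))  # primary withdrawn
```

When the primary's announcement is withdrawn, the prepended standby path is the only one left, so it wins by default.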

-Otanx

deanwebb

Are you having the DCs act as both being active or will the secondary be syncing up with the primary?

Will this also have to work with multiple outbound links from the sites?

Would you want to do this with a load balancing or link balancing solution inline?
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

Shura182

Thank you, Otanx! There will probably be 3-4 vendor networks (/28) and about 7-8 application networks (/24).

A /23 is a good option, but I see you have a different /24 subnet in each DC (x.y.0.0/24 vs x.y.1.0/24).
If a vendor's device (or circuit) goes down, they will send the traffic to our second network, right?

Can I advertise a network from the primary DC and have the same network advertised from the second DC only if the primary goes down?


Shura182

Quote from: deanwebb on January 20, 2021, 10:08:17 AM
Are you having the DCs act as both being active or will the secondary be syncing up with the primary?

Will this also have to work with multiple outbound links from the sites?

Would you want to do this with a load balancing or link balancing solution inline?


The second one will be standby and, yes, it will sync with the primary.

Yes, each DC will have:
MPLS - the telco will set up cascading failover between two MPLS circuits in the primary DC and another one in the second DC.
Internet - each location will have independent Internet connectivity.
Vendor external connectivity - each vendor will install at least one router at each location.

I was thinking of using an LB for internal applications, and a routing protocol to share the same subnets between the two data centers (for connectivity with the vendors).

What would be the best approach to put the redundancy in place?



icecream-guy

I would suggest a discussion on what constitutes a disaster and what the goals are when declaring one. Once you have a disaster plan, you can put forth more logical remediations, such as whether it is a partial data center DR failover or an entire data center DR failover. You can't just go at this willy-nilly and come up with valid DR scenarios. It will require a plan, a major plan.

My Moral Fibers have been cut.

wintermute000

yes you can do that, but the major issues are

1.) default gateway placement and behaviour
2.) routing symmetry, esp. if there are stateful devices (e.g. firewalls) in path
3.) failure domain isolation (or not)
4.) DCI split brain behaviour
5.) L2 stretch or not and how exactly (repeat above points)
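Point 2 can be illustrated with a toy model of a stateful firewall. It assumes one firewall per DC with no state sharing between them, and the class and addresses are invented for the example (addresses come from documentation and private ranges). A reply that returns through the other DC's firewall finds no matching flow and is dropped.

```python
class StatefulFirewall:
    """Toy stateful firewall: allows inbound packets only for flows it saw leave."""

    def __init__(self, name):
        self.name = name
        self.flows = set()  # (inside_host, outside_host) pairs seen outbound

    def outbound(self, src, dst):
        """Record the flow and let the packet out."""
        self.flows.add((src, dst))
        return True

    def inbound(self, src, dst):
        """Permit only replies to a flow this same firewall recorded."""
        return (dst, src) in self.flows

if __name__ == "__main__":
    fw_dc1 = StatefulFirewall("DC1")
    fw_dc2 = StatefulFirewall("DC2")

    # Request leaves through DC1's firewall.
    fw_dc1.outbound("10.20.0.10", "203.0.113.5")

    # Symmetric return path: the reply comes back through DC1.
    print(fw_dc1.inbound("203.0.113.5", "10.20.0.10"))
    # Asymmetric return path: the reply arrives at DC2, which has no state.
    print(fw_dc2.inbound("203.0.113.5", "10.20.0.10"))
```

This is why asymmetric paths need to be resolved outside the firewalls, or the firewalls need some form of state synchronization.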

this topic usually needs several solid whiteboard sessions to flesh out with a client, good luck

deanwebb

Put another way, there are some simple things that you can do now to get basic connectivity established. Otanx has given info on that end.

Myself, Ristau, and Wintermute know that those simple things can then lead to further questions. If this was just a homework question for a networking course, Otanx's answers would be all you need. But since this is a production environment, you'll need to go beyond those basic things very soon after implementing them.

If it's DC1 primary and DC2 syncs with it, then at a basic level, I'd have all traffic go to DC1 as a preferred route and DC2 ONLY when the route to DC1 is down.

That being said, there are two ways for the link to DC1 to go down. The first is when DC1 itself goes down. Then all traffic goes to DC2, and this is the disaster you were waiting for in the basic scenario.

The second way is for the WAN route to DC1 to be blocked. This can impact a single site. So now the rest of the enterprise is talking to DC1, but the single site is talking to DC2. Now updates have to flow from DC2 to DC1, and that's where the split brain issue that Wintermute mentioned comes into play.
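A small sketch of the two failure modes, assuming each site simply prefers DC1 and falls back to DC2 when its own path to DC1 is unusable. The site names and the `site_view` helper are hypothetical.

```python
def active_dc(dc1_up, path_to_dc1_up):
    """DC1 is preferred; DC2 is used only when DC1 is unreachable from this site."""
    return "DC1" if (dc1_up and path_to_dc1_up) else "DC2"

def site_view(wan_paths, dc1_up=True):
    """Return each site's chosen DC and whether the enterprise is split across both.

    wan_paths maps site name -> True if that site's WAN path to DC1 works.
    """
    chosen = {site: active_dc(dc1_up, ok) for site, ok in wan_paths.items()}
    split = len(set(chosen.values())) > 1
    return chosen, split

if __name__ == "__main__":
    healthy = {"site-a": True, "site-b": True, "site-c": True}
    one_bad_path = {"site-a": True, "site-b": False, "site-c": True}

    print(site_view(healthy))                # everyone on DC1
    print(site_view(healthy, dc1_up=False))  # full failover: everyone on DC2
    print(site_view(one_bad_path))           # site-b alone on DC2: split
```

The third case is the awkward one: DC1 is still live and serving most sites, so whatever site-b writes at DC2 has to make its way back to DC1.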

If this is for a small enterprise, some discussion here can probably flesh out a solution. If this is for a large enterprise, we can help you get discussions started, but we sure won't be able to solve the whole set of issues here.

Dieselboy

I'm sure the OP is talking about internal private network addressing, with the same addressing at both DCs and an L2 link between them. We have done this, and I suggest the OP lab it up, because if that L2 link goes down you have problems. If you're stretching subnets between DC1 and DC2 over that L2 link, you can't influence routing into each DC separately, because all subnets exist in both DCs and DC1 and DC2 are effectively one DC. So that one L2 link really needs to be two L2 links from different providers.

I didn't think anyone bought MPLS these days; it seems like hassle for no gain.

Shura182

Thanks guys for your input!
I will probably need to go with a DNS solution, like F5 GSLB. In that case both sites can be active, and traffic will be sent based on node/service availability.
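The general idea behind availability-based DNS answers can be sketched like this. It is a generic illustration, not F5 configuration or any vendor's API: the pool, the VIP addresses (documentation ranges), and the `resolve` function are all assumptions, and the health map would come from whatever monitoring is actually in place.

```python
# Hypothetical pool: one virtual IP per data center.
POOL = [("DC1", "192.0.2.10"), ("DC2", "198.51.100.10")]

def resolve(pool, health):
    """Return the VIPs of sites whose health check passes, in pool order.

    health maps site name -> True/False; unknown sites count as down.
    """
    return [vip for site, vip in pool if health.get(site, False)]

if __name__ == "__main__":
    print(resolve(POOL, {"DC1": True, "DC2": True}))   # both sites answer
    print(resolve(POOL, {"DC1": False, "DC2": True}))  # only DC2 is handed out
```

Failover speed with this approach depends on the DNS record TTL and on how clients cache answers, so that is worth testing in the lab as well.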

Thanks again for your help.