Posted 2019-12-13 | Updated 2020-09-25 | operation | 5 minutes read (About 805 words)

Practice datacenter failover in production

A distributed system is like the human body: it will have issues and break. There’s a theory that if we deliberately and constantly feed the body small problems, it becomes more stable and robust. The same applies to systems: inject some failures into data centers and let them fail over automatically.

Multiple data centers

Companies use data center redundancy to make their services highly available. Multiple data centers are usually deployed in one of 3 main ways:

Disaster Recovery
Live traffic is served by the primary data center, while the disaster recovery data center is a backup used to recover when the primary is down. Usually, normal operations are not run in the disaster recovery data center.
Hot Standby
The primary data center takes traffic; the hot standby data center is almost equivalent to the primary but doesn’t take traffic. You can switch to the hot standby whenever the primary data center is down.
Live traffic handling
There are multiple data centers, and they all take traffic simultaneously.

What is DC failover?

When a DC (data center) failure happens, the most urgent task is to replace the primary DC with the backup one so that the business keeps running. To do that, the technical team fails over from the primary data center to the backup.
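The decision itself can be sketched in a few lines. This is only an illustration under assumptions of mine: the `DataCenter` class, its `healthy` flag, and the `fail_over` function are hypothetical names, and a real failover also involves DNS, load balancers, and data replication that are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    healthy: bool
    role: str  # "primary" or "backup" (hypothetical labels)


def fail_over(dcs):
    """Return the data center that should serve traffic.

    If the primary is healthy it stays in place; otherwise the first
    healthy backup is promoted and the old primary is demoted.
    """
    primary = next(dc for dc in dcs if dc.role == "primary")
    if primary.healthy:
        return primary
    for dc in dcs:
        if dc.role == "backup" and dc.healthy:
            primary.role = "backup"
            dc.role = "primary"
            return dc
    raise RuntimeError("no healthy backup data center available")
```

For example, with an unhealthy primary `dc-a` and a healthy backup `dc-b`, `fail_over` promotes `dc-b` and demotes `dc-a`.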

Why do we need to do failover often?

In the deployments mentioned above, one or more data centers serve users. Once the primary one is down, the backup needs to take over as soon as possible. But the primary data center rarely goes down, so as time passes the backup data center is likely to drift until it cannot replace the primary, or can do so only with difficulty.
To keep the backup DC up to date, we should do DC failovers often, whether manually or automatically.
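One simple way to make “often” concrete is to track when the last failover was practiced and flag it once it gets too old. The sketch below is an assumption of mine, not a rule from any standard: the `drill_overdue` name and the 90-day default are placeholders that each company would replace with its own policy.

```python
from datetime import date, timedelta


def drill_overdue(last_drill: date, today: date, max_age_days: int = 90) -> bool:
    """Return True when the last failover drill is older than max_age_days.

    The 90-day default is an illustrative assumption.
    """
    return today - last_drill > timedelta(days=max_age_days)
```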

Preparation

Define the impact

Services exist for customers. Although failover practice is a long-term project that improves service availability, it’s better not to disrupt the customer experience suddenly. Each company therefore needs to define, according to its own considerations, how much impact it can accept.

Define the scope

After the company decides to practice data center failover in production, there are a lot of questions to answer: “What’s the scale of this failover?”, “Is it global or partial?”, “What consequences can we bear, and how much budget do we have?”, “Who should participate?”…
The answers define the scope of the failover: how many business lines, teams, and people will take part.

Define the goal

Also define the goal clearly. This is a long-term project to ensure that there are always multiple available data centers online. This failover is only the first one and will be repeated soon; eventually it will become continuous and automatic, like chaos engineering.

Team as unit

Every team involved should take care of the servers, services… it needs to fail over when the failover day comes. The team is the minimal unit.

Deadline

“Deadline is the first productivity.” Since the failover may not seem important to everyone, teams may not prioritize it, so a deadline is a clear signal that it is going to happen.

From an SRE’s perspective

As an SRE, I play a vital role in this failover and will execute the operations.

You should have a full list of your services.
Get your operations documentation ready, for example, the operation commands and monitoring dashboard addresses.
Do mini failovers gradually before the full one.
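The three points above can be sketched as a small inventory check plus a batching helper. Everything named here is hypothetical and only for illustration: the `Service` fields, the example services and paths, and the batch size are my assumptions, not a description of a real tool.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    runbook: str    # where the operation commands are documented
    dashboard: str  # monitoring dashboard address


def missing_docs(services):
    """Names of services that lack a runbook or a dashboard address."""
    return [s.name for s in services if not s.runbook or not s.dashboard]


def plan_mini_failovers(services, batch_size):
    """Split the service list into small batches to fail over one at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [services[i:i + batch_size] for i in range(0, len(services), batch_size)]


# Hypothetical inventory used only as an example.
inventory = [
    Service("checkout", "runbooks/checkout.md", "dashboards/checkout"),
    Service("search", "", "dashboards/search"),
    Service("profile", "runbooks/profile.md", "dashboards/profile"),
]
```

Here `missing_docs(inventory)` points at `search`, which has no runbook yet, and `plan_mini_failovers(inventory, 2)` yields two batches that can be failed over on different days before the full failover.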

When the day comes

A clear agenda

A clear agenda is a precondition for keeping everything under control, even if something unexpected happens. Take the unexpected into consideration and keep the agenda flexible.

Instant communication

Every team member and the team’s clients must understand the progress.
It’s also necessary to post everything unexpected in a global channel so that everyone is aware of it.

Roles we play

The coordinator is responsible for connecting all the team players and for recording what happens, both successes and failures. The other team players should report what they’ve done and seen to the coordinator.

After the failover

After the failover, the teams involved should look back and check what went well and what needs to be improved.
For the weaknesses, we can make plans and put them into the backlog, so we’ll be more confident facing the next failover.
I will spend more time on chaos engineering, the continuous injection of failures into production, which makes production more resilient.