Architecture Series: Disaster Recovery

Welcome back to my continuing series on Architecture. In this next installment, we will be going over a disaster recovery design.

Disaster Recovery (DR) is a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. There are 2 big pieces to this planning. What is the most downtime you can stand to incur, and how much times worth of data can you stand to lose in the recovery. This is known as Recovery Time Objective (RTO) and Recovery Point Objective (RPO). We’ll go over these more later.

Now, what does this mean when you are designing an environment? Everything. What good is your design to the customer when it is wiped off the face of the earth by an F5 tornado?


The answer is: It’s no good. So what do we do? We implement a Disaster Recovery solution into the design. For this current design, our DR plan gives us the availability to “failover” all of the protected critical workloads in the event of a catastrophe from one physical data center, to another physical data center in a completely different geographic location. This is a good.


So let’s start from the racks in our Primary Data Center (PDC). DR doesn’t essentially just mean continuity from our PDC to our DR site. We take steps to ensure that uptime requirements are met at the PDC by dual-homing all of our devices. This is for power, network, and storage. All of our infrastructure is setup with an A-side and a B-side. This allows for a point of failure at just about any where in the physical hardware design, and the opposite side can withstand the outage without downtime.  This also makes maintenance on any of these services easy, as we can simply use the opposite side while one side is being worked on.

We also utilize some vCenter-level recovery options which help us to withstand points of failure. For example, we have vSphere HA enabled on our clusters. In a nutshell, if an ESX host suddenly fails, vCenter can automatically reboot all of the VM’s on other hosts in the cluster. While there is a bit of downtime for the reboot, it is an automated process to bring VM’s back online as quickly as possible in the event of hardware failure.


Duncan Epping has written the gold standard in books on HA that you can read up on here.

Now let’s move on to the big stuff. Large-scale natural or human disaster. What do you do when your PDC is destroyed or completely loses power for X period of time?



You’re starting to ask yourself now, “Ok, I need to plan for emergencies, but how do I do it??” This is where our DR solution comes into place. For this design, we will be using two main products. vSphere Replication (vR) and VMware Site Recovery Manager (SRM). These are two different products, that when run in tandem, give you a solid means to recover in the event of a disaster.

vR enables the continuous replication of a virtual machine from one side to another. The decision to use vR instead of Array-Based Replication was made so that the choice of what to replicate could be made on granular VM basis, as opposed to an entire datastore / LUN. vR is where we can specify our RPO. You can specify how often you want to replicate a VM, after the initial full-seed. Our RPO for this design is 15 minutes, so we set the replication time in vR to 15 minutes.


The next setup in our design is the actual failover component. SRM. In SRM there is two major pieces that you need to do in order to be ready to go. Protection Groups and Recovery Plans.

Protection Groups are simply logical  groupings of VM’s that you are trying to protect. In a 3-tier application stack, you’d want to protect the web servers, app servers, and database servers. As the DR site is not 1:1 hardware, the design decision was made to only protect 1 of the DB clusters, 1 set of App servers, and 2 Web servers. The bare necessities to run. If we had chosen Array-Based Replication, then we wouldn’t need to specify what VM’s. It would simply replicate and protect all VM’s on the chosen volumes.


The second piece is the Recovery Plan. This is where you configure SRM’s logic. Where is the primary site? What VM’s am I failing over? Where am I failing it over to? Should I start them in a particular order? Now, the second metric we need to meet is RTO. How long does it take you to recover? As long as vR and SRM is setup right, failing over is a fairly quick process. One of the biggest constraints here is how long it takes your recovery VM’s to power on, validate, and move on. Meeting your RTO is not just a software goal. This will require monitoring / engineering response + SRM Recovery Plan to meet the total goal of 60 minutes Recovery Time Objective.


The recovery plan is configured exactly as the failover is needed to go. There is a step-by-step logic here. From finalizing replication (if PDC is still available) to bringing down the original VM’s to bringing up the recovery VM’s. Here, is where VM prerequisites (priority) are set. Our apps are 3-tier designs. Our DB servers start first. App servers are second. Then the Web servers comes up once all other prerequisites are met.


SRM allows you to run “test” failover scenarios that will validate all the replication, recovery VM’s, etc. It is a great way to validate your Disaster Recovery plan, without actually failing over. Though, doing live failover tests to DR is very important to test all the external variables such as monitoring and engineering response. I have an article about a particular test scenario with SRM and some duct tape here.


Thanks for reading!