Category Archives: Architecture Series

This section is about the Architecture of my current production environment.

Architecture Series: Disaster Recovery

Welcome back to my continuing series on Architecture. In this next installment, we will be going over a disaster recovery design.

Disaster Recovery (DR) is a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. There are 2 big pieces to this planning: what is the most downtime you can stand to incur, and how much time’s worth of data can you stand to lose in the recovery? These are known as Recovery Time Objective (RTO) and Recovery Point Objective (RPO), respectively. We’ll go over these more later.

Now, what does this mean when you are designing an environment? Everything. What good is your design to the customer when it is wiped off the face of the earth by an F5 tornado?


The answer is: It’s no good. So what do we do? We implement a Disaster Recovery solution into the design. For this current design, our DR plan gives us the ability to “fail over” all of the protected critical workloads in the event of a catastrophe from one physical data center to another physical data center in a completely different geographic location. This is a good thing.


So let’s start from the racks in our Primary Data Center (PDC). DR doesn’t just mean continuity from our PDC to our DR site. We take steps to ensure that uptime requirements are met at the PDC by dual-homing all of our devices. This is for power, network, and storage. All of our infrastructure is set up with an A-side and a B-side. This allows for a point of failure just about anywhere in the physical hardware design, and the opposite side can withstand the outage without downtime. This also makes maintenance on any of these services easy, as we can simply use the opposite side while one side is being worked on.

We also utilize some vCenter-level recovery options which help us to withstand points of failure. For example, we have vSphere HA enabled on our clusters. In a nutshell, if an ESX host suddenly fails, vSphere HA can automatically restart all of its VM’s on other hosts in the cluster. While there is a bit of downtime for the restart, it is an automated process to bring VM’s back online as quickly as possible in the event of hardware failure.


Duncan Epping has written the gold standard in books on HA that you can read up on here.

Now let’s move on to the big stuff. Large-scale natural or human disaster. What do you do when your PDC is destroyed or completely loses power for X period of time?



You’re starting to ask yourself now, “Ok, I need to plan for emergencies, but how do I do it??” This is where our DR solution comes into play. For this design, we will be using two main products: vSphere Replication (vR) and VMware Site Recovery Manager (SRM). These are two different products that, when run in tandem, give you a solid means to recover in the event of a disaster.

vR enables the continuous replication of a virtual machine from one site to another. The decision to use vR instead of Array-Based Replication was made so that the choice of what to replicate could be made on a granular, per-VM basis, as opposed to an entire datastore / LUN. vR is where we can specify our RPO. You can specify how often you want to replicate a VM after the initial full seed. Our RPO for this design is 15 minutes, so we set the replication time in vR to 15 minutes.
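
To make the RPO idea concrete, here is a minimal Python sketch of an RPO check. It is not how vR works internally and it does not call any VMware API. The VM names and timestamps are made up for illustration. All it does is flag any VM whose last completed replication is older than the 15 minute RPO from this design.

```python
from datetime import datetime, timedelta

RPO = timedelta(minutes=15)  # the RPO chosen for this design


def rpo_violations(last_sync, now, rpo=RPO):
    """Return the names of VMs whose last completed sync is older than the RPO."""
    return sorted(vm for vm, synced in last_sync.items() if now - synced > rpo)


# Hypothetical sample data: VM names and sync times are assumptions.
now = datetime(2024, 1, 1, 12, 0)
last_sync = {
    "web01": now - timedelta(minutes=5),
    "app01": now - timedelta(minutes=14),
    "db01": now - timedelta(minutes=22),
}

print(rpo_violations(last_sync, now))  # ['db01']
```

In this made-up example only db01 is out of policy, because a failover right now could lose up to 22 minutes of its data.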


The next step in our design is the actual failover component: SRM. In SRM there are two major pieces that you need to configure in order to be ready to go: Protection Groups and Recovery Plans.

Protection Groups are simply logical groupings of VM’s that you are trying to protect. In a 3-tier application stack, you’d want to protect the web servers, app servers, and database servers. As the DR site is not 1:1 hardware, the design decision was made to only protect 1 of the DB clusters, 1 set of App servers, and 2 Web servers: the bare necessities to run. If we had chosen Array-Based Replication, then we wouldn’t need to specify which VM’s to protect. It would simply replicate and protect all VM’s on the chosen volumes.


The second piece is the Recovery Plan. This is where you configure SRM’s logic. Where is the primary site? What VM’s am I failing over? Where am I failing them over to? Should I start them in a particular order? Now, the second metric we need to meet is RTO. How long does it take you to recover? As long as vR and SRM are set up right, failing over is a fairly quick process. One of the biggest constraints here is how long it takes your recovery VM’s to power on, validate, and move on. Meeting your RTO is not just a software goal. It requires monitoring, engineering response, and the SRM Recovery Plan together to meet the total goal of a 60 minute Recovery Time Objective.
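
Since the 60 minute RTO covers people as well as software, it can help to think of it as a budget. The sketch below is only an illustration: the phase names and the minutes assigned to each are assumptions I made up, not measurements from this environment. It adds up the phases and reports how much slack is left against the RTO.

```python
RTO_MINUTES = 60  # the total Recovery Time Objective for this design


def rto_budget(phases, rto=RTO_MINUTES):
    """Return (total minutes used, minutes of slack left against the RTO)."""
    total = sum(phases.values())
    return total, rto - total


# Hypothetical phase durations in minutes (assumptions, not measured values).
phases = {
    "monitoring_detects_outage": 5,
    "engineering_declares_disaster": 15,
    "srm_recovery_plan_runs": 25,
    "application_validation": 10,
}

total, slack = rto_budget(phases)
print(f"used {total} min, slack {slack} min")  # used 55 min, slack 5 min
```

A negative slack would mean the plan cannot meet the RTO on paper, before anything has even gone wrong.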


The recovery plan is configured exactly as the failover needs to go. There is a step-by-step logic here, from finalizing replication (if the PDC is still available), to bringing down the original VM’s, to bringing up the recovery VM’s. Here is where VM prerequisites (priority) are set. Our apps are 3-tier designs. Our DB servers start first. App servers are second. Then the Web servers come up once all other prerequisites are met.
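
The startup ordering described above can be sketched in a few lines of Python. This is a toy model, not SRM configuration: the VM names and the tier-to-priority mapping are assumptions for illustration. It groups VM’s into waves so that every DB server is listed before any App server, and every App server before any Web server.

```python
from itertools import groupby

# Lower number starts first: DB, then App, then Web (the order from this design).
PRIORITY = {"db": 1, "app": 2, "web": 3}


def startup_waves(vms):
    """Group (name, tier) pairs into power-on waves ordered by tier priority."""
    ordered = sorted(vms, key=lambda vm: (PRIORITY[vm[1]], vm[0]))
    return [
        [name for name, _tier in group]
        for _prio, group in groupby(ordered, key=lambda vm: PRIORITY[vm[1]])
    ]


# Hypothetical protected VM's (names are assumptions).
vms = [
    ("web01", "web"), ("db01", "db"), ("app01", "app"),
    ("web02", "web"), ("db02", "db"), ("app02", "app"),
]

for wave in startup_waves(vms):
    print(wave)
# ['db01', 'db02']
# ['app01', 'app02']
# ['web01', 'web02']
```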


SRM allows you to run “test” failover scenarios that will validate all the replication, recovery VM’s, etc. It is a great way to validate your Disaster Recovery plan without actually failing over. That said, doing live failover tests to DR is still very important, because it tests all the external variables such as monitoring and engineering response. I have an article about a particular test scenario with SRM and some duct tape here.


Thanks for reading!



Architecture Series: Storage

In my continuing efforts to grow in design, I am writing my next installment of the Architecture Series. This next bit is going to focus on storage.

Let’s start from the server and move through the physical fabric. As with the networking for this design, which uses 2 x dual-port NIC’s, storage uses 2 x 8Gb dual-port FC HBA’s. In our environment, we go with Emulex LPe1200’s. They are placed in the host as HBA, NIC on top and HBA, NIC on bottom:


The dual-ports on the HBA’s are split so that each HBA has a cable to the A-side and a cable to the B-side.


As shown in the diagram below, the fabric provides redundant paths to each side of the storage fabric. Each side of the storage fabric has a link to one of the controllers in the storage frame. This gives each server 2 links to each fabric side, each of which has 2 links into the storage controllers. This lets us tolerate the failure of a storage controller, or of an entire side of the FC fabric.
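
One way to convince yourself of the failure tolerance is to enumerate the paths. The sketch below is a simplified model, and the component names are made up. It assumes 2 HBA’s per host, each with a port on both fabric sides, and it assumes each fabric side reaches both storage controllers. It then checks that at least one path survives when any single component fails.

```python
from itertools import product

# Simplified model of the design; all names are illustrative assumptions.
HBAS = ("hba0", "hba1")
FABRICS = ("fabric-A", "fabric-B")
CONTROLLERS = ("ctrl-1", "ctrl-2")


def surviving_paths(failed=frozenset()):
    """List every host-to-storage path that avoids all failed components."""
    return [
        path
        for path in product(HBAS, FABRICS, CONTROLLERS)
        if not failed.intersection(path)
    ]


def tolerates_any_single_failure():
    """True if at least one path remains after any one component fails."""
    components = HBAS + FABRICS + CONTROLLERS
    return all(surviving_paths({c}) for c in components)


print(len(surviving_paths()))              # 8
print(len(surviving_paths({"fabric-A"})))  # 4
print(tolerates_any_single_failure())      # True
```

Losing both fabric sides at once leaves no paths in this model, which is the failure the A/B split cannot cover.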


The storage frames all offer multiple tiers of storage for the customer. Tier 2a, Tier 2, Tier 3, and Tier 4. We have not had any use-case that requires an all-flash array at this point, so it is not currently available in our environment.

We cut LUN’s in 2TB size from the storage frame, and present them to the hosts. This is much smaller than the 64TB maximum allowed. We name the LUN’s based on the frame brand, frame ID, cluster, tier, and LUN #.


These are then grouped into their datastore clusters. The datastore clusters are broken down by frame brand, cluster, tier, and then a cluster ID. We do allow mixed frame ID’s in the clusters. We limit our clusters to 32 LUN’s even though the maximum supported is 64. This is simply to make things easier to manage from our perspective.
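
The sizing limits above make the capacity math easy. Here is a small sketch of it. The 2TB LUN size and the 32 LUN limit come from this design, while the 150TB request is a made-up number for illustration.

```python
import math

LUN_SIZE_TB = 2            # LUN size used in this design
MAX_LUNS_PER_CLUSTER = 32  # our self-imposed limit per datastore cluster


def cluster_capacity_tb():
    """Raw capacity of one full datastore cluster in TB."""
    return LUN_SIZE_TB * MAX_LUNS_PER_CLUSTER


def clusters_needed(required_tb):
    """Return (LUN count, datastore cluster count) needed for a raw capacity."""
    luns = math.ceil(required_tb / LUN_SIZE_TB)
    return luns, math.ceil(luns / MAX_LUNS_PER_CLUSTER)


print(cluster_capacity_tb())  # 64
print(clusters_needed(150))   # (75, 3)
```

So a full datastore cluster tops out at 64TB raw, and a hypothetical 150TB request would need 75 LUN’s spread over 3 clusters.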



This has been another installment of the Architecture Series. Thanks for playing along!


Architecture Series: 10 Gig Pod

In an effort to transition myself to the Infrastructure side of the house, I decided to hit the white board a bit and explain the architecture of the current environment I am in. This is part of my Design Theory study (VCAP-Design) and is as much for reader benefit as it is for my own learning benefit. I hope this brings forth questions and discussions. As a preemptive note: I am not the Principal Architect of this specific design. I merely inherited this design, and am learning it while taking over. This is not a post that I am going to whittle down to be perfect as if I were submitting for VCDX. I will try my best to keep it clear, concise, and in a proper order from the top down.

So let’s get this party started. The environment that I support now has several vCenter servers. These are spread across several geographic locations. We do have one “Primary” location, that has 2 different buildings. The “main” building houses our primary vCenter. This vCenter houses a couple legacy 1-Gig clusters, and our primary 10-Gig environment.

Our 10-Gig environment is currently split into 2 pods. These pods were built to be scalable, as needed. As a vEUC guy, I equate this design to the Horizon View “pod and block” type architecture. Scalable Pods that can be built out as needed. It’s a popular concept these days. Maybe not in this exact design, but scalability is important.

Our Pods are built in sets of 3 racks. Unlike our 1-Gig environment where we run all cabling to the distribution switches, our 10-Gig pods utilize 2 x Force10 Z9000’s in a Top-of-Rack or TOR setup for each 3-rack pod. Each TOR switch and Server in the pod is dual-homed with A/B power to separate PDU’s. The building has multiple street-power providers, and is rated to withstand an F5 tornado. Here is a visual representation of the pod:


The switches are set up in an A/B configuration, cross-connected to each other. The switches reside in the center rack in each pod, as they service the cabinet they reside in, as well as the neighbors to the left and right:


The switch ports are all configured as Trunk. We handle all of our tagging at the vSwitch. Each ESX host (R710’s or R720’s) houses 2 x 8Gb HBA’s for storage, and 2 x Dual-Port 10-Gig NIC’s for networking. We use 1:4 fan-out 40-Gig cables for network connectivity, like the ones here. Each 40-Gig cable has 4 ends (A,B,C,D). Each 40-Gig cable services the “A” or “B” switch side of 2 hosts, with 2 connections each.


This leaves each host with an A + B (or C & D) from the “A Switch” and an A + B (or C & D) from the “B Switch”. These are split out on the host to 2 x Virtual Distributed Switches across the environment:



vCenter Networking

Two of the links (1 x A-side, 1 x B-side) go to the VM-Network Virtual Distributed Switch. The other 2 links go to the vKernel-Network Virtual Distributed Switch. The vKernel Switch for each host has 1 x Management and 1 x vMotion virtual adapter configured. The VM-Network Switch contains the tagged port-groups for all of the VLAN’s needed for our virtual machine traffic.
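
The cabling and uplink split can be summarized in a short sketch. This is a simplified model and not a vSphere API call. The design only says that each Virtual Distributed Switch gets one A-side and one B-side link, so which cable end lands on which switch is an assumption here, as is the idea of a numbered host “slot” on each shared cable.

```python
FANOUT_ENDS = ("A", "B", "C", "D")  # the 4 ends of one 40-Gig fan-out cable
VDS_NAMES = ("VM-Network", "vKernel-Network")


def host_uplinks(host_slot):
    """Return the 4 uplinks of a host sharing fan-out cables with one neighbor.

    host_slot 0 takes ends A+B of each cable, host_slot 1 takes ends C+D.
    The mapping of a given cable end to a given VDS is an assumption.
    """
    ends = FANOUT_ENDS[:2] if host_slot == 0 else FANOUT_ENDS[2:]
    uplinks = []
    for switch in ("A-switch", "B-switch"):
        for end, vds in zip(ends, VDS_NAMES):
            uplinks.append({"switch": switch, "cable_end": end, "vds": vds})
    return uplinks


for link in host_uplinks(0):
    print(link)
```

Each host ends up with 4 uplinks, and each Virtual Distributed Switch has one link on each physical switch, so losing either TOR switch still leaves both VM and vKernel traffic with a path.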


This concludes the first of what I hope to be many Design Theory / Architectural posts. Thanks for playing along!