
Achievement Unlocked: VCAP5-DCD

My day has come! I am finally part of the VCAP club. Today, I sat, and passed, the VCAP5-DCD exam. I have to say, it was the most brutal 3 hours of my life. So many ups and downs. I had no idea how things were going to go when I clicked “End Exam”. It took me a few seconds of staring when I saw the score for it to sink in. I sat in the chair staring at the screen for a couple minutes. I even forgot to finalize the exam ending and walked out to the proctor to get the score sheet. When it didn’t come out, we realized that I didn’t click to close out. Then there it was. My score sheet. Pass.

The score was, in all honesty, better than I expected. I got a 337 out of 500. While you only need 300 to pass, making 337 not that impressive, I truly only had hoped for a 301. I am thrilled with what I got.

Let’s go for a little exam prep / experience. My exam prep was not what I recommend for someone wanting to sit down and crank out the DCD. I’ve been working towards this for a couple years now. I’ve been working on vSphere / VDI Design for a couple years. So a lot of it was practical knowledge. I also went over the VCAP5-DCD Simulator until I felt I was only answering from memory. Then I didn’t go back for months, then I did it again. It’s great. 100% use the blueprint and go through some of the PDF’s on consulting (risk, constraint, assumption, requirement type stuff). There is a Google Plus group you can join with a ton of great info / people / stories.

Also, I really never sat back and said “Alright, let’s do this!” and put my nose to the grindstone. I really only went for it because I had been kicking it along for a while. I decided today was the day. I don’t recommend this.

If you’ve never done a Design exam, or even a VCAP, I highly recommend you mentally prepare. Sleep well, have a good breakfast, and take the exam early (unless you’re an afternoon kind of person). I did not spend my evening before studying, as that bums me out, and I figured I wouldn’t get any better in one last night. The exam is 22 questions. Most are drag-and-drop style questions. There are 9 Visio-style design tool questions, 1 of which is a “Master Design.” My master design item came about halfway through the exam. I immediately marked it and moved on. I highly recommend saving this for the end. They say to allow 30 minutes for it; I ended up starting it with exactly 1 hour left. I truly can’t remember how long it took, but I know I finished early.

As far as tips, I really have 1 big one. READ THE QUESTION. Did you read that? Did you get it? Read it again. Then read it again. Who knows, the question could give you part of the answer. You’d never know if you didn’t read it.

So that’s it. Not a lot of info, but enough. I just needed to tell anyone else that would listen that I passed. Best of luck for everyone else in your endeavors. Feel free to contact me if you have any questions about it. I’ll be glad to tell you what I can.



Architecture Series: Disaster Recovery

Welcome back to my continuing series on Architecture. In this next installment, we will be going over a disaster recovery design.

Disaster Recovery (DR) is a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. There are 2 big pieces to this planning: what is the most downtime you can stand to incur, and how much time’s worth of data can you stand to lose in the recovery? These are known as Recovery Time Objective (RTO) and Recovery Point Objective (RPO), respectively. We’ll go over these more later.

Now, what does this mean when you are designing an environment? Everything. What good is your design to the customer when it is wiped off the face of the earth by an F5 tornado?


The answer is: It’s no good. So what do we do? We implement a Disaster Recovery solution into the design. For this current design, our DR plan gives us the ability to fail over all of the protected critical workloads, in the event of a catastrophe, from one physical data center to another physical data center in a completely different geographic location. This is good.


So let’s start from the racks in our Primary Data Center (PDC). DR doesn’t necessarily just mean continuity from our PDC to our DR site. We take steps to ensure that uptime requirements are met at the PDC by dual-homing all of our devices. This is for power, network, and storage. All of our infrastructure is set up with an A-side and a B-side. This allows for a point of failure just about anywhere in the physical hardware design, and the opposite side can withstand the outage without downtime. This also makes maintenance on any of these services easy, as we can simply use the opposite side while one side is being worked on.

We also utilize some vCenter-level recovery options which help us to withstand points of failure. For example, we have vSphere HA enabled on our clusters. In a nutshell, if an ESX host suddenly fails, HA automatically restarts the VM’s that were running on it on the other hosts in the cluster. While there is a bit of downtime for the reboot, it is an automated process to bring VM’s back online as quickly as possible in the event of hardware failure.


Duncan Epping has written the gold standard in books on HA that you can read up on here.

Now let’s move on to the big stuff. Large-scale natural or human disaster. What do you do when your PDC is destroyed or completely loses power for X period of time?



You’re starting to ask yourself now, “Ok, I need to plan for emergencies, but how do I do it??” This is where our DR solution comes into play. For this design, we will be using two main products: vSphere Replication (vR) and VMware Site Recovery Manager (SRM). These are two different products that, when run in tandem, give you a solid means to recover in the event of a disaster.

vR enables the continuous replication of a virtual machine from one site to another. The decision to use vR instead of Array-Based Replication was made so that the choice of what to replicate could be made on a granular, per-VM basis, as opposed to an entire datastore / LUN. vR is where we specify our RPO: after the initial full seed, you choose how often you want each VM to replicate. Our RPO for this design is 15 minutes, so we set the replication interval in vR to 15 minutes.
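To make the RPO idea concrete, here is a minimal sketch of an RPO check. It does not talk to vR or any VMware API; the VM names and timestamps are made up for illustration, and only the 15-minute RPO comes from this design.

```python
from datetime import datetime, timedelta

# 15 minutes is the RPO from this design; everything else here is hypothetical.
RPO = timedelta(minutes=15)


def rpo_violations(last_sync_by_vm, now, rpo=RPO):
    """Return the sorted names of VMs whose newest replica is older than the RPO."""
    return sorted(
        vm for vm, last_sync in last_sync_by_vm.items() if now - last_sync > rpo
    )


# Hypothetical sample data: when each protected VM last finished replicating.
now = datetime(2024, 1, 1, 12, 0)
last_sync = {
    "web01": now - timedelta(minutes=5),
    "app01": now - timedelta(minutes=14),
    "db01": now - timedelta(minutes=22),
}

print(rpo_violations(last_sync, now))
```

In this sample, only the VM whose last replica is 22 minutes old falls outside the 15-minute window. If we failed over right now, that VM would lose more data than the design allows.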


The next step in our design is the actual failover component: SRM. In SRM there are two major pieces that you need to configure in order to be ready to go: Protection Groups and Recovery Plans.

Protection Groups are simply logical groupings of VM’s that you are trying to protect. In a 3-tier application stack, you’d want to protect the web servers, app servers, and database servers. As the DR site is not 1:1 hardware, the design decision was made to only protect 1 of the DB clusters, 1 set of App servers, and 2 Web servers: the bare necessities to run. If we had chosen Array-Based Replication, then we wouldn’t need to specify which VM’s to protect. It would simply replicate and protect all VM’s on the chosen volumes.


The second piece is the Recovery Plan. This is where you configure SRM’s logic. Where is the primary site? What VM’s am I failing over? Where am I failing them over to? Should I start them in a particular order? Now, the second metric we need to meet is RTO. How long does it take you to recover? As long as vR and SRM are set up right, failing over is a fairly quick process. One of the biggest constraints here is how long it takes your recovery VM’s to power on, validate, and move on. Meeting your RTO is not just a software goal. It requires monitoring and engineering response on top of the SRM Recovery Plan to meet the total Recovery Time Objective of 60 minutes.
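A rough way to reason about this is to treat the RTO as a time budget that every phase of the response has to fit inside. The sketch below assumes four phases with placeholder durations; the phase names and minute values are invented for illustration, and only the 60-minute RTO comes from this design.

```python
# 60 minutes is the RTO from this design.
RTO_MINUTES = 60


def rto_margin(phase_minutes, rto=RTO_MINUTES):
    """Return the minutes left in the RTO budget (negative means the RTO is missed)."""
    return rto - sum(phase_minutes.values())


# Hypothetical phase estimates; replace with numbers measured in real failover tests.
phases = {
    "monitoring_detects_outage": 10,
    "engineering_response_and_decision": 15,
    "srm_recovery_plan_run": 25,
    "application_validation": 5,
}

margin = rto_margin(phases)
print(f"Budget used: {sum(phases.values())} min, margin: {margin} min")
```

With these placeholder numbers, the SRM run is less than half of the total. That is the point: the human and monitoring steps count against the same 60 minutes.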


The recovery plan is configured exactly as the failover needs to go. There is a step-by-step logic here, from finalizing replication (if the PDC is still available) to bringing down the original VM’s to bringing up the recovery VM’s. Here is where VM prerequisites (priority) are set. Our apps are 3-tier designs. Our DB servers start first. App servers are second. Then the Web servers come up once all other prerequisites are met.
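The ordering above is really just a dependency graph, and it can be sketched with Python's standard `graphlib` module. This is not how SRM stores or evaluates its priorities; it is only an illustration of the logic, and the VM names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each tier maps to the tiers that must be up before it starts (from the post).
depends_on = {
    "db": set(),
    "app": {"db"},
    "web": {"app"},
}

# Hypothetical VM names for the protected subset of each tier.
vms_by_tier = {
    "db": ["db01", "db02"],
    "app": ["app01", "app02"],
    "web": ["web01", "web02"],
}


def power_on_batches(depends_on, vms_by_tier):
    """Return lists of VMs to power on together, in dependency order."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorter.get_ready()  # tiers whose prerequisites are all done
        batches.append(sorted(vm for tier in ready for vm in vms_by_tier[tier]))
        sorter.done(*ready)
    return batches


for step, batch in enumerate(power_on_batches(depends_on, vms_by_tier), start=1):
    print(step, batch)
```

Each batch only starts once the batch before it has been marked done, which mirrors "DB first, then App, then Web."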


SRM allows you to run “test” failover scenarios that will validate all the replication, recovery VM’s, etc. It is a great way to validate your Disaster Recovery plan, without actually failing over. Though, doing live failover tests to DR is very important to test all the external variables such as monitoring and engineering response. I have an article about a particular test scenario with SRM and some duct tape here.


Thanks for reading!



Architecture Series: Storage

In my continuing efforts to grow in design, I am writing my next installment of the Architecture Series. This next bit is going to focus on storage.

Let’s start from the server and move through the physical fabric. As with the networking for this design, which uses 2 x dual-port NIC’s, storage uses 2 x 8Gb dual-port FC HBA’s. In our environment, we go with Emulex LPe1200’s. They are placed in the host as HBA, NIC on top and HBA, NIC on bottom:


The dual-ports on the HBA’s are split so that each HBA has a cable to the A-side and a cable to the B-side.


As shown in the diagram below, the fabric provides redundant paths to each side of the storage fabric. Each side of the storage fabric has a link to one of the controllers in the storage frame. This gives each server 2 links to each fabric side, each of which has 2 links into the storage controllers. This lets us tolerate the failure of a storage controller, or of an entire side of the FC fabric.
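One way to sanity-check a layout like this is to enumerate every path and then knock out one component at a time. The sketch below simplifies the topology: it assumes 2 HBA's, 2 fabric sides, and 2 controllers, with every fabric side reaching both controllers. The component names are hypothetical.

```python
from itertools import product

# Simplified, assumed topology: every HBA reaches both fabric sides,
# and every fabric side reaches both storage controllers.
HBAS = ["hba0", "hba1"]
FABRICS = ["A", "B"]
CONTROLLERS = ["ctrl0", "ctrl1"]


def all_paths():
    """Every (hba, fabric side, controller) path from a host to the frame."""
    return list(product(HBAS, FABRICS, CONTROLLERS))


def surviving_paths(failed_component):
    """Paths that do not pass through the failed component."""
    return [path for path in all_paths() if failed_component not in path]


def single_points_of_failure():
    """Components whose loss alone would cut the host off from storage."""
    components = HBAS + FABRICS + CONTROLLERS
    return [c for c in components if not surviving_paths(c)]


print(len(all_paths()), "paths total")
print(len(surviving_paths("A")), "paths left if fabric side A is lost")
print("single points of failure:", single_points_of_failure())
```

Under these assumptions, losing any one HBA, fabric side, or controller still leaves half of the paths available.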


The storage frames all offer multiple tiers of storage for the customer. Tier 2a, Tier 2, Tier 3, and Tier 4. We have not had any use-case that requires an all-flash array at this point, so it is not currently available in our environment.

We cut LUN’s in 2TB size from the storage frame, and present them to the hosts. This is much smaller than the 64TB maximum allowed. We name the LUN’s based on the frame brand, frame ID, cluster, tier, and LUN #.


These are then grouped into their datastore clusters. The datastore clusters are broken down by frame brand, cluster, tier, and then a cluster ID. We do allow mixed frame ID’s in the clusters. We limit our clusters to 32 LUN’s even though the maximum supported is 64. This is simply to make things easier to manage from our perspective.
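The sizing that falls out of these two limits is simple arithmetic, sketched below. The 2TB LUN size and the 32-LUN limit are ours from this design; the 150TB request is a made-up example.

```python
import math

LUN_SIZE_TB = 2                   # LUN size we cut from the frame
MAX_LUNS_PER_DS_CLUSTER = 32      # our self-imposed limit per datastore cluster


def datastore_cluster_capacity_tb(lun_count):
    """Raw capacity of one datastore cluster holding lun_count LUN's."""
    if not 0 <= lun_count <= MAX_LUNS_PER_DS_CLUSTER:
        raise ValueError("LUN count is outside our per-cluster limit")
    return lun_count * LUN_SIZE_TB


def datastore_clusters_needed(required_tb):
    """How many datastore clusters a raw capacity request would need."""
    luns = math.ceil(required_tb / LUN_SIZE_TB)
    return math.ceil(luns / MAX_LUNS_PER_DS_CLUSTER)


print(datastore_cluster_capacity_tb(32), "TB in a full datastore cluster")
print(datastore_clusters_needed(150), "datastore clusters for a 150 TB request")
```

So a full datastore cluster under our limit tops out at 64TB raw, before any free-space headroom is taken into account.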



This has been another installment of the Architecture Series. Thanks for playing along!


Architecture Series: 10 Gig Pod

In efforts to transition myself to the Infrastructure side of the house, I decided to hit the white board a bit and explain the architecture of the current environment I am in. This is part of my Design Theory study (VCAP-Design) and is as much for reader benefit, as it is for my own learning benefit. I hope this brings forth questions and discussions. As a preemptive note: I am not the Principal Architect of this specific design. I merely inherited this design, and am learning it while taking over. This is not a post that I am going to whittle down to be perfect as if I was submitting for VCDX. I will try my best to keep it clear, concise, and in a proper order from the top down.

So let’s get this party started. The environment that I support now has several vCenter servers. These are spread across several geographic locations. We do have one “Primary” location, that has 2 different buildings. The “main” building houses our primary vCenter. This vCenter houses a couple legacy 1-Gig clusters, and our primary 10-Gig environment.

Our 10-Gig environment is currently split into 2 pods. These pods were built to be scalable, as needed. As a vEUC guy, I equate this design to the Horizon View “pod and block” type architecture. Scalable Pods that can be built out as needed. It’s a popular concept these days. Maybe not in this exact design, but scalability is important.

Our Pods are built in sets of 3 racks. Unlike our 1-Gig environment where we run all cabling to the distribution switches, our 10-Gig pods utilize 2 x Force10 Z9000’s in a Top-of-Rack or TOR setup for each 3-rack pod. Each TOR switch and Server in the pod are dual-homed with A/B power to separate PDU’s. The building has multiple street-power providers, and is rated to withstand an F5 tornado. Here is a visual representation of the pod:


The switches are set up in an A/B configuration, cross-connected to each other. The switches reside in the center rack in each pod, as they service the cabinet they reside in, as well as the neighbors to the left and right:


The switch ports are all configured as Trunk. We handle all of our tagging at the vSwitch. Each ESX host (R710’s or R720’s) houses 2 x 8Gb HBA’s for storage, and 2 x Dual-Port 10-Gig NIC’s for networking. We use 1:4 fan-out 40-Gig cables for network connectivity, like the ones here. Each 40-Gig cable has 4 ends (A,B,C,D). Each 40-Gig cable services the “A” or “B” switch side of 2 hosts, with 2 connections each.


This leaves each host with an A + B (or C + D) from the “A Switch” and an A + B (or C + D) from the “B Switch”. These are split out on the host to 2 x Virtual Distributed Switches across the environment:
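The fan-out cabling is easier to follow written out as a mapping. This sketch assumes the first host of a pair gets ends A and B and the second gets ends C and D; the host names are hypothetical.

```python
CABLE_ENDS = ["A", "B", "C", "D"]   # the 4 ends of one 1:4 fan-out cable
SWITCH_SIDES = ["A", "B"]           # one fan-out cable per switch side per host pair


def cable_host_pair(first_host, second_host):
    """Map each host in a pair to its (switch side, cable end) links."""
    links = {first_host: [], second_host: []}
    for side in SWITCH_SIDES:
        for index, end in enumerate(CABLE_ENDS):
            # Assumed split: ends A and B to the first host, C and D to the second.
            host = first_host if index < 2 else second_host
            links[host].append((side, end))
    return links


for host, host_links in cable_host_pair("esx01", "esx02").items():
    print(host, host_links)
```

Each host ends up with 4 links, 2 from each switch side, which matches the 2 x dual-port NIC's in each server.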



vCenter Networking

Two of the links (1 x A-side, 1 x B-side) go to the VM-Network Virtual Distributed Switch. The other 2 links go to the vKernel-Network Virtual Distributed Switch. The vKernel Switch for each host has 1 x Management and 1 x vMotion virtual adapter configured. The VM-Network Switch contains the tagged port-groups for all of the VLAN’s needed for our virtual machine traffic.
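As a sketch of that split, the function below takes a host's four links and gives each distributed switch one A-side and one B-side uplink. The link IDs are placeholders, and I am assuming the second pair is also split one per side. The post only states that explicitly for the VM-Network switch.

```python
def assign_uplinks(links):
    """Split a host's links so each distributed switch gets one A and one B uplink.

    links is a list of (switch_side, link_id) tuples for a single host.
    """
    a_side = [link for link in links if link[0] == "A"]
    b_side = [link for link in links if link[0] == "B"]
    if len(a_side) != 2 or len(b_side) != 2:
        raise ValueError("expected 2 A-side and 2 B-side links per host")
    return {
        "VM-Network": [a_side[0], b_side[0]],
        "vKernel-Network": [a_side[1], b_side[1]],
    }


# Hypothetical link IDs for one host.
host_links = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]
layout = assign_uplinks(host_links)
for switch_name, uplinks in layout.items():
    print(switch_name, uplinks)
```

Laid out this way, losing an entire switch side still leaves every distributed switch with one working uplink.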


This concludes the first of what I hope to be many Design Theory / Architectural posts. Thanks for playing along!


Is it REALLY Engineering?

This topic has been floating around my head the past week. The strange part about it is that my reason for thinking about this happened months ago over the summer. I really didn’t think anything of it at the time.

I was at a large multi-family function. My family was there as my wife is a cousin of a cousin or something along those lines. We were all sitting around a cooler late one night, just talking. One of the gentlemen there makes his living as a Civil Engineer. I brought up the fact that at my job, I am referred to as an Engineer. Systems Engineer, Virtualization Engineer. Whatever you wish to call it, it’s all the same. Now, while he was extremely polite about it, he seemed offended that people would go around calling themselves Engineers “when they are not.” At the time I really didn’t think anything of it. He stated that he went to school to learn what he knows, and has to keep up his certifications to keep doing what he does. I went to school for Systems Engineering. I have certifications that expire that I have to keep up with. Does that not count?

Let’s look at the definition of Engineering:

The branch of science and technology concerned with the design, building, and use of engines, machines, and structures.

Science and Technology. Technology being a key word there.

Design, building, and use of engines, machines, and structures.

I design, build, and use machines. Are compute / storage / networking / etc not considered machines in a sense?

I have been doing a lot of studying on consulting and design methodology lately, prepping for the advanced VMware Horizon design certification exam that is soon to be released. I didn’t start revisiting this conversation in my head until I started digging into these topics. I guess realizing what kind of “Engineering” went into designing VMware environments made my wheels start turning again.

Let’s look at how a Civil Engineer might go about widening an existing road. They’re not going to just run in and have at it with heavy machinery, are they? No. They’re going to do a study first. How many cars come through there on a given day? Is there a bad traffic backup during rush hour? Do we need 2 lanes or 3 lanes in each direction? That is the assessment. Do we think traffic will exponentially grow within X number of years? Should we plan the road bigger than we need to accommodate right now? That is capacity planning. How about the existing road. Is there any of it that we can re-use? Can we simply expand the existing road, or do we need to build all new road? Is there a creek that the existing road goes over that a bridge will need to be built for? This is design work. Can I re-use existing compute / networking / storage equipment, or do I need to go with all new hardware? This is also design work. Blueprints? Geographic Surveys? Soil Surveys? How about Pictures of the existing Data Center space? Visio documents of logical and physical design? Detailed information on the existing build? Sounds to me like a lot of the same info, just a little bit different in what they need. Could consulting information such as Requirements, Constraints, Risks, and Assumptions be needed in both the road and the system scenario? Absolutely.

Would a Civil Engineer that designs and builds roads and bridges think that an Aerospace Engineer who designs and builds rockets is not an Engineer? Probably not. So why would a Civil Engineer think that a Systems Engineer who designs and builds complex, highly-available computer systems, is not an Engineer?

Now I know that this 1 person doesn’t think or speak for all “real engineers”. I’m sure plenty of “real” engineers are perfectly fine with Systems Engineers being called as such. But I also believe that there are quite a few that think that simply “playing with computers” all day doesn’t involve any bit of “real” Engineering. Does that come from a simple lack of understanding about how complex technology can be? Or do they possibly just think that sitting at a computer all day simply isn’t the same as going outside and getting your hands dirty?

I’d love to hear what everybody’s input is on this topic. Any and all viewpoints are welcome. Thanks for reading!