Monthly Archives: January 2016

SRM: Fail Back With No Reprotect

As promised in my last “Job Role” related post, I have taken over Infrastructure Management responsibilities and transitioned the vEUC Management role to my second in command. Here is some new content related to one of my new roles: Disaster Recovery! I am not the SME for Disaster Recovery operations yet, but I am doing my best to take over, as my former colleague was the man in charge when it came to DR stuff.

Let me get this out of the way now. I do not like waking up early on a Saturday to participate in a critical application DR test. The part that I really don’t like? How the team running the application wants to do the test. They do not want to simply run the fail-over, let the app run for a while, then fail back. They don’t want the test to touch any production data at the Primary site. Now, you may be asking yourself why this is a problem. Let’s bring SRM into the picture. Note that this post assumes you’re already familiar with vR and SRM to a degree.

Site Recovery Manager, or SRM, is VMware’s tool for Disaster Recovery in the vCenter environment. Paired with vSphere Replication, or vR, it lets you continuously replicate active VMs over to your DR site and then, when prompted, fail over to that DR site. SRM allows you to set up what is called a Recovery Plan. This is essentially a script telling vCenter what to do in a fail-over. The Recovery Plan is tied to a Protection Group. A Protection Group is a logical grouping of servers that make up an application stack. In our case today, this is a database server, some app servers, and some web servers. The basic setup is pretty easy.

01 - Basic SRM

Now, in a normal SRM fail-over and fail back, the process would be pretty straightforward.

Fail-Over:

  1. Replicate last bit of data
  2. Power off Primary Site VMs
  3. Synchronize Storage
  4. Power on DR VMs
  5. Verify

Fail Back:

  1. Start “reverse replication” back to Primary Site
  2. Power off DR Site VMs
  3. Synchronize Storage
  4. Power on Primary VMs
  5. Verify

This is a very straightforward process. The process of reverse replicating data back from DR to the Primary site is called “Re-Protect” in SRM. This way, data is sent from Primary to DR, then from DR back to Primary to ensure no data was lost in the process of failing over and back. This is not what my application team wants to do. They want to fail-over to the DR site with all the replicated data, then do the testing, then fail back without replicating any of the test data back to the Primary Site.

02 - No Reprotect
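
To make the difference concrete, here is a toy Python model of the two approaches. It is not SRM code and touches no VMware API. The `Site` class, the function names, and the sample records are all made up for illustration. The only point is to show what ends up at the Primary site when the fail back does, or does not, reverse replicate.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical stand-in for a site and the data its VMs hold."""
    name: str
    data: list = field(default_factory=list)


def fail_over(primary: Site, dr: Site) -> None:
    # The last bit of data is replicated, so DR starts as a copy of Primary.
    dr.data = list(primary.data)


def fail_back(primary: Site, dr: Site, reprotect: bool) -> None:
    # With reprotect, changes made at DR are replicated back to Primary.
    # Without it, Primary resumes with exactly what it had at fail-over.
    if reprotect:
        primary.data = list(dr.data)


def run_dr_test(reprotect: bool) -> list:
    primary = Site("primary", ["order-1", "order-2"])
    dr = Site("dr")
    fail_over(primary, dr)
    dr.data.append("TEST-order")  # test data written while running at DR
    fail_back(primary, dr, reprotect)
    return primary.data


print(run_dr_test(reprotect=True))   # ['order-1', 'order-2', 'TEST-order']
print(run_dr_test(reprotect=False))  # ['order-1', 'order-2']
```

The second result is what my application team is after: the test record never reaches the Primary site.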

This is all fine and dandy, except for one major issue. SRM doesn’t support failing back without that reverse replication to the Primary site. This is where my problem is. Is there a solution? Of course there is. Is it as easy as hitting the play button for SRM? Not a chance. Let’s run through exactly how we’re accomplishing this DR test, the hard way.

06 - frown

The first phase is incredibly easy. SRM and vR are fantastic pieces of software magic. It takes me longer to confirm on the conference call that everyone is ready to fail-over than it does to start the fail-over. To begin, we go into SRM. Find the Protection Group you want to fail-over, and hit the big red play button to Run Recovery Plan:

03 - SRM

04 - SRM

05 - SRM

As you can see in the Recovery Plan progress area, each step is listed with its status, times, and progress bar. Very automated. Very cool. Now, a change is made to our external DNS, and the application is running at the DR site. Great.

Now, we’re ready to fail back. This is where the “unofficial procedure” comes into play. This is my first time completing this procedure, and I am not going to dive into every single “click next” along the way. As stated before, I assume you have a general working knowledge of vR and SRM.

First things first. Let’s shut down the VMs at the DR site. This leaves the VMs offline at both the Primary site and the DR site. At this time, you can also make the DNS change back to the Primary site, as it will need time to propagate.

07 - Power off VM's
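
Since the DNS change needs time to propagate, a small polling loop can tell you when a given resolver has picked it up. This is a minimal sketch in Python using only the standard library, not something from our actual runbook. The function name and its parameters are my own, and it only reflects what the machine running it resolves, not what every client out there sees.

```python
import socket
import time
from typing import Callable


def wait_for_dns(hostname: str,
                 expected_ip: str,
                 resolve: Callable[[str], str] = socket.gethostbyname,
                 attempts: int = 10,
                 delay_seconds: float = 30.0) -> bool:
    """Poll until hostname resolves to expected_ip, or give up.

    resolve is injectable so the loop can be exercised without real DNS.
    """
    for attempt in range(attempts):
        try:
            if resolve(hostname) == expected_ip:
                return True
        except OSError:
            # socket.gaierror is an OSError: the name did not resolve, retry.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

A call such as `wait_for_dns("app.example.com", "203.0.113.10")` would check every 30 seconds, up to 10 times. Both the hostname and the address here are placeholders, not our real records.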

Once that is done, let’s delete the protection group from SRM. It will ask you to confirm. Just go for it, as long as you know the settings that you will need to re-create it.

08 - Delete Protection Group
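
Because the Protection Group has to be re-created by hand later, it helps to write its settings down somewhere before deleting it. Here is one way to do that as a plain JSON note, sketched in Python. Nothing here is exported from SRM. The keys and values are hypothetical examples of what you might copy off the wizard screens yourself.

```python
import json
from pathlib import Path

# Hypothetical example values; record whatever your own SRM screens show.
EXAMPLE_SETTINGS = {
    "protection_group": "CriticalApp-PG",
    "recovery_plan": "CriticalApp-RP",
    "vms": ["db01", "app01", "app02", "web01", "web02"],
    "test_network": "DR-Test-Network",
}


def save_settings(path: str, settings: dict) -> None:
    """Write the noted settings to a JSON file for later reference."""
    text = json.dumps(settings, indent=2, sort_keys=True)
    Path(path).write_text(text, encoding="utf-8")


def load_settings(path: str) -> dict:
    """Read the noted settings back when re-creating the group."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

The same note is a handy place for the test networks you will need when the Recovery Plan is reconfigured at the end.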

Now, since we pushed a “fail-over” of the Primary site VMs in SRM, the old Primary site VMs were flagged so that we cannot simply power them back on. The way to get around this is to remove each VM from inventory. Make a note of its Datastore and Cluster / Resource Pool, as you will need to browse the datastore to manually re-add the VM to inventory. Once it is re-added, you’ll have full control of power. Go ahead and flip it back on now.

09 - Remove Inventory
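
If you have more than a couple of VMs, noting each one’s location before removing it from inventory saves some hunting in the datastore browser. Below is a small Python sketch of such a note. The class and field names are my own, the sample values are invented, and it assumes the .vmx file is named after the VM, which is not true for a VM that was renamed after it was created.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmPlacement:
    """Where a VM lived before it was removed from inventory."""
    name: str
    datastore: str
    folder: str            # folder on the datastore holding the VM's files
    compute_location: str  # cluster or resource pool the VM ran in

    def vmx_path(self) -> str:
        # Datastore paths take the form "[datastore] folder/file.vmx".
        # Assumption: the .vmx file shares the VM's name.
        return f"[{self.datastore}] {self.folder}/{self.name}.vmx"


# Invented sample values for illustration only.
note = VmPlacement(name="app01", datastore="DS01",
                   folder="app01", compute_location="Prod-Cluster")
print(note.vmx_path())  # [DS01] app01/app01.vmx
```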

Once you confirm all the VMs are back online, let’s get replication going again. Go into the Web Client. Yes, I could have done everything up to this point in there… but I don’t want to. You need the Web Client to do anything with vR or SRM.

Make sure when you go to Configure Replication that you select all the VMs at once (or at least more than one). When you do it one by one, the option to select an existing replication seed isn’t there. If you’re OK with completely replicating the seed over again, then do whatever you’d like.

10 - Configure Replication

Once that is set up, and all the configuration tasks are done, go into SRM and re-create the Protection Group. Use the same settings you had before.

11 - Configure Protection Group

Now that your Protection Group is set up, if you didn’t remove the DR site VMs from inventory, you’ll get an error that the Placeholder VM name already exists. I went into the DR vCenter and removed those VMs from inventory. Once that was complete, I went through the Protection Group one VM at a time and selected Re-Create Placeholder. I used defaults.

12 - ReCreate Placeholder

Once that is all complete, the only thing left to do is re-associate the Recovery Plan with the Protection Group. Go into the Recovery Plans, and Edit the one you want. During the wizard, just re-select the Protection Group you want to use. Make sure you have noted down the test networks that you’ll need for the Recovery Plan.

13 - Reconfigure Recovery Plan

Once the Recovery Plan is associated with the Protection Group again, SRM should show that your Recovery Plan is back in Ready status.

14 - All Done

We’re done! I hope this was good enough info to help someone else out in the future. If you want more detailed info, please feel free to reach out. Thanks!

vTimD

 

Quick Post – Work Update

So, there have been a LOT of changes going on at Dell Services recently. Maybe you have heard some rumors… While this is big news for my future, at this point it doesn’t affect my daily work with my contract. What DOES affect my daily work is the staffing changes that have gone down. Due to recent team shifting and loss, I am drifting from the VDI lead role to help cover our Infrastructure Manager position. This gets me away from View and way more into overall Infrastructure Architecture: host build outs, storage and network design, and the biggest project, our upgrade from 5.5 to 6.0. I hope to get a bunch of good blogs out of this transition. It’s not something I’m brand new to, but I have obviously been knee deep in Horizon for 2 years now. Gotta brush off some of the vSphere / vCenter skills. More to come. Sit tight!

-vTimD