Category Archives: vSphere

This section is about ESX / vCenter infrastructure components.

SRM: Fail Back With No Reprotect

As promised in my last “Job Role” related post, I have taken over Infrastructure Management responsibilities and transitioned the vEUC Management role to my second in command. Here is some new content related to one of my new roles: Disaster Recovery! I was not previously the SME for Disaster Recovery operations, as my former colleague was the man in charge when it came to DR stuff, but I am doing my best to take over.

Let me get this out of the way now: I do not like waking up early on a Saturday to participate in a critical application DR test. The part that I really don’t like? How the team running the application wants to do the test. They do not simply want to run the fail-over, let the app run for a while, then fail back. They also don’t want the test to touch any production data at the primary site. Now, you may be asking yourself why this is a problem. Let’s bring SRM into the picture. Note that this post assumes you’re already familiar with vR and SRM to a degree.

Site Recovery Manager, or SRM, is VMware’s tool for Disaster Recovery in the vCenter environment. Paired with vSphere Replication, or vR, the software set allows you to continuously replicate active VMs over to your DR site, then, when prompted, fail over to that DR site. SRM allows you to set up what is called a Recovery Plan. This is essentially a script telling vCenter what to do in the event of a fail-over. A Recovery Plan is associated with one or more Protection Groups, and a Protection Group is a logical grouping of servers that make up an application stack. In our case today, this is a database server, some app servers, and some web servers. The basic setup is pretty easy.

01 - Basic SRM

Now, in a normal SRM fail-over and fail back, the process would be pretty straightforward.

Fail-Over:

  1. Replicate the last bit of data
  2. Power off Primary Site VMs
  3. Synchronize Storage
  4. Power on DR VMs
  5. Verify

Fail Back:

  1. Start “reverse replication” back to the Primary Site
  2. Power off DR Site VMs
  3. Synchronize Storage
  4. Power on Primary VMs
  5. Verify

This is a very straightforward process. The process of reverse-replicating data from the DR site back to the Primary site is called “Reprotect” in SRM. This way, data is sent from Primary to DR, then from DR back to Primary, to ensure no data is lost in the process of failing over and back. This is not what my application team wants to do. They want to fail over to the DR site with all the replicated data, do the testing, then fail back without replicating any of the test data back to the Primary Site.

02 - No Reprotect

This is all fine and dandy, except for one major issue: SRM doesn’t support failing back without first reprotecting, which reverse-replicates the data back to the primary site. This is where my problem is. Is there a solution? Of course there is. Is it as easy as hitting the play button in SRM? Not a chance. Let’s run through exactly how we’re accomplishing this DR test, the hard way.

06 - frown

The first phase is incredibly easy. SRM and vR are fantastic pieces of software magic. It takes me longer to confirm on the conference call that everyone is ready to fail over than it does to start the fail-over. To begin, we go into SRM, find the Recovery Plan we want to run, and hit the big red play button to Run Recovery Plan:

03 - SRM

04 - SRM

05 - SRM

As you can see in the Recovery Plan progress area, each step is listed out with status, times, and progress bars. Very automated. Very cool. Now, a change is made to our external DNS, and the application is running at the DR site. Great.

Now, we’re ready to fail back. This is where the “unofficial procedure” comes into play. This is my first time completing this procedure, and I am not going to dive into every single “click next” along the way. As stated before, I assume you have a general working knowledge of vR and SRM.

First things first: let’s shut down the VMs at the DR site. At this point, the VMs are offline at both the Primary site and the DR site. You can also make the DNS change back to the Primary site now, as it will need time to propagate.

07 - Power off VM's
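
While you wait on DNS, a quick way to watch the change propagate is to query the record directly from a client. The hostname here is hypothetical; substitute your application’s actual record:

  nslookup app.example.com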

Once that is done, let’s delete the Protection Group from SRM. It will ask you to confirm; go for it, as long as you have noted the settings you will need to re-create it later.

08 - Delete Protection Group

Now, since we pushed a fail-over of the Primary site VMs in SRM, the old Primary site VMs were flagged so that we cannot simply power them back on. The way to get around this is to remove each VM from inventory. Make a note of its datastore and cluster / resource pool, as you will need to browse the datastore to manually re-add the VM to inventory. Once it is re-added, you’ll have full control of power. Go ahead and flip it back on now.

09 - Remove Inventory
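
If you prefer the ESXi shell to clicking through the datastore browser, the same re-registration can be done with vim-cmd. This is just a sketch; the datastore, folder, and VM names are placeholders for your own:

  vim-cmd solo/registervm /vmfs/volumes/&lt;datastore&gt;/&lt;vm-folder&gt;/&lt;vm-name&gt;.vmx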

Once you confirm all the VMs are back online, let’s get replication going again. Go into the Web Client. Yes, I could have done everything in it… but I don’t want to. You do need the Web Client to do anything with vR or SRM.

When you go to Configure Replication, make sure you select all the VMs at once (or at least more than one). When you configure them one by one, the option to select an existing replication seed isn’t there. If you’re OK with completely replicating everything over again instead of using the seeds, then do whatever you’d like.

10 - Configure Replication

Once that is set up and all the configuration tasks are done, go into SRM and re-create the Protection Group. Use the same settings you had before.

11 - Configure Protection Group

Now that your Protection Group is set up: if you didn’t remove the DR site VMs from inventory, you’ll get an error that the placeholder VM name already exists. I went into the DR vCenter and removed those VMs from inventory. Once that was complete, I went one by one through the Protection Group and selected Re-Create Placeholder, using the defaults.

12 - ReCreate Placeholder

Once that is all complete, the only thing left to do is re-associate the Recovery Plan with the Protection Group. Go into Recovery Plans and edit the one you want. During the wizard, just re-select the Protection Group you want to use. Make sure you noted down the test networks that you’ll need for the Recovery Plan.

13 - Reconfigure Recovery Plan

Once the Recovery Plan is associated with the Protection Group again, SRM should show your Recovery Plan back at Ready status.

14 - All Done

We’re done! I hope this was good enough info to help someone else out in the future. If you want more detailed info, please feel free to reach out. Thanks!

vTimD

 

vSAN ROBO – A Call to Arms! [Updated / Resolved]

Hey all,

We have a problem that we are stuck against a wall with. I am going to lay it out here, and any info that anybody can provide, or even some input, would be great! Here goes:

We have three Dell R710s for vSAN at a ROBO location. The servers have the following networking configuration:

 

Management/vMotion:

vSwitch     : vStandard Switch

IP Segment  : 172.18.18.1/26 GW 172.18.18.1

VMNics      : Two on each server, connected to switches S1 and S2

Switch Port : ACCESS Ports (VLAN 24)

 

vSAN:

vSwitch     : Tried both vSS and vDS

IP Segment  : 10.149.12.1/28 GW 10.149.12.1

VMNic       : Single port on each server connected to S1 only

Switch Port : ACCESS Ports (VLAN 12)

 

Cisco Switch Details:

Cisco WS-C6509-E switches running IOS 12.2(33)SXH5.

 

IGMP snooping is ON (but the querier is disabled); switch info below.
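
For reference, the output below is from the VLAN 12 SVI. Something like this should reproduce it (from memory; exact syntax may vary by IOS train):

  show ip igmp interface vlan 12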

 

Vlan12 is up, line protocol is up

Internet address is 10.149.12.3/28

IGMP is disabled on interface

Multicast routing is disabled on interface

Multicast TTL threshold is 0

No multicast groups joined by this system

IGMP snooping is globally enabled

IGMP snooping CGMP-AutoDetect is globally enabled

IGMP snooping is enabled on this interface

IGMP snooping fast-leave (for v2) is disabled and querier is disabled

IGMP snooping explicit-tracking is enabled

IGMP snooping last member query response interval is 1000 ms

IGMP snooping report-suppression is enabled

 

Note: VLAN 12 was newly created just for vSAN traffic.

 

Issue:

We can’t ping the vSAN IP segment gateway (10.149.12.1) from any of the ESXi hosts, and we can’t ping the vSAN vmk ports between ESXi hosts. Our network guy says he is not learning the MACs on the switch end either. At the same time, we have a Windows server in the same segment as the ESXi management network, and it can ping the vSAN GW (10.149.12.1) just fine. We tried multicast testing using the command “tcpdump-uw -i vmk2 -n -s0 -t -c 20 udp port 12345 or udp port 23451”, and it only returns UDP packets from the same ESXi host it is run on, not from the other two hosts in the cluster.
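
For anyone wanting to reproduce our basic checks, these are the kinds of commands we ran from the ESXi shell. vmk2 is the vSAN vmkernel port in our setup; substitute your own interface and peer addresses:

  # Confirm which vmk carries vSAN traffic
  esxcli vsan network list

  # Unicast tests from the vSAN vmk to the gateway and to a peer host
  vmkping -I vmk2 10.149.12.1
  vmkping -I vmk2 &lt;peer host vSAN IP&gt;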

 

Anyone who could possibly help out, or even if you know anyone who can, please share this around. Any and all input would be greatly appreciated, and I would love to buy you a beer for your troubles at VMworld this year!

-vTimD

 

 

UPDATE:

Thanks to @vkoolaid, @chuckhollis, @DuncanYB, and @leedilworth for all of the responses and info. After reading a troubleshooting document and then checking out the VMTN article listed below, I see that an IGMP querier must be present if IGMP snooping is enabled. We are going to need a code upgrade for this, so that is our next path. More updates to come.

https://communities.vmware.com/message/2367348

 

 

Update 2 / Resolution:

Thank you so much to every single person who pinged me, read the post, or provided some form of input. Cisco was engaged, and we were able to enable the querier without a code upgrade on the switch. The three hosts are now visible to each other, and the vSAN is passing traffic as it should!
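
For reference, enabling the snooping querier on a Catalyst 6500 SVI looks something like the following. This is a sketch of the kind of change Cisco made, not a transcript of our change window; the VLAN interface is the one from the output above:

  Switch(config)# interface Vlan12
  Switch(config-if)# ip igmp snooping querier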

vSAN Fault Tolerance Testing!

Ok, so… This is my first blog post. Welcome. Shout out to @vHipster for getting me into the social media game.

At the request of Twitter user @vKoolaid, I am blogging about my vSAN Fault Tolerance testing today. I have been meaning to set up a blog and go for it for a while, and this is a good reason to start. So let’s get some info out of the way:

I am a Sr. Virtualization Administrator for Dell Services, on the Cloud Services team for one of our biggest customers; there is a team of nine of us dedicated to this customer. This vSAN testing was done in our vLab, as we are testing this as a POC for a customer ROBO site with no existing storage devices. We have four Dell R720s on a 10 gig Ethernet network (Intel HBAs with Force10 switches). This is the same infrastructure setup as our main production VM and VDI infrastructure. The vSphere and vCenter version is 5.5 Update 2. Each of the four nodes has 1 x 200GB SAS SSD and 1 x 600GB SAS HDD dedicated to vSAN.

With the 800GB of raw SSD space and 2400GB of raw HDD space, we came up with 2.18TB of usable vSAN volume.
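
That number makes sense once you remember two things: in vSAN 5.5 the SSDs are a cache tier only, so they don’t count toward datastore capacity, and the datastore is reported in binary units. A quick sanity check:

  4 hosts x 600GB HDD = 2400GB raw capacity
  2400 x 10^9 bytes / 2^40 bytes per TB ≈ 2.18TB as reported

(Less a small amount of vSAN metadata overhead.)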

I placed one lab VDI Connection Server and one lab Workspace Linux appliance on the vSAN volume for testing. Our first test was a single-disk tolerance test. I set up a constant ping to both servers, and we pulled a disk from the first node. We found that the servers stayed online, exactly as expected. What we didn’t expect was that we didn’t receive any form of notification from the vSphere Web Client. We refreshed the Datastore Summary page, and the capacity of the vSAN datastore went from its original 2.18TB down to 1.64TB; it had automatically adjusted the capacity to account for the missing drive. Placing the drive back in, waiting a moment, then refreshing brought the capacity back up to the original 2.18TB.
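
If you’d rather watch this from the command line than refresh the Web Client, each host can report the vSAN disks it has claimed from the ESXi shell; the pulled disk should drop out of this output and reappear when reinserted:

  esxcli vsan storage list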

We noticed during the tests that the activity lights on the SSDs were constant, but the activity lights on the HDDs were pretty much focused on one HDD at a time (at this point, it was the one in the 4th node of our vSAN cluster). We then decided to run the same test, but pulled the most active HDD in the cluster. This produced the same result: the server stayed online and active.

It was then time to set up the full-host tolerance test. We were not sure how this was going to go, as far as storage and compute tolerance went. We decided to go with the first node in the cluster for this test. We set up a file transfer from my VDI desktop (on the 10 gig network) to the test server: I had 3.66GB of data in my Downloads folder, and I pasted it into the C$ admin share of the test server. I did not realize that it would transfer faster than we could get to the back of the server. I moved 3.66GB of data in about 10-20 seconds, BUT that is for another blog post. Wow.

So we had one person at the console and one person at the back of the server. The second the transfer started, we pulled the plug on the host. The host went black. The transfer got to about 25% and held… about 5 seconds later, it picked right back up and completed. We were floored. We killed an entire compute node, with its storage, and the transfer picked right back up. Now, what we didn’t know was that our lab vCenter was on the node we killed. There are apparently some quirks regarding vSAN and HA that I need to read more about; we had to disable HA during the vSAN setup. More can be found here.
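
A quick way to watch the cluster react to a dead host is from the shell of a surviving node; the member count and node state in this output should reflect the host dropping out and rejoining:

  esxcli vsan cluster get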

It was at this point that we decided to break stuff and see what happened. We put everything back where it was and refreshed; we had all our space back on the vSAN volume, and all servers were online and responding. I started a new file copy, and we pulled the HDDs out of nodes 1 and 4. The servers… still pinging, BUT… the file copy died. The test server stopped responding at the console. It continued to ping, but the server was otherwise 100% dead; the file copy and the console were shot. We placed one drive back in the 1st node, refreshed, and got that space back on the volume, but the server didn’t come back. I powered it off and back on in vSphere. Error: the VM showed as inaccessible. It was gone. Once I placed the 2nd disk back in and gave it some time, I refreshed, and all of the space came back to the vSAN volume. Seconds later, I was able to start the VMs back up, no problem. This proved to us that we couldn’t stand to lose more than one disk. Unfortunately, we didn’t have the disks on hand to verify whether we could lose one disk from each host without faulting; that will be for future testing.
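
This behavior lines up with the default vSAN storage policy: NumberOfFailuresToTolerate=1 means each object gets two replica copies plus a witness, so pulling the disks behind both replicas at once leaves the object inaccessible until one of them returns. A minimal check of the default policy from the ESXi shell:

  esxcli vsan policy getdefault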

This concluded our fault tolerance testing. As I stated, this will be for a 3-host ROBO site at a customer location. Our testing next week will involve trying to offload the vSAN traffic onto a 10 gig ad-hoc network, without the use of a 10 gig switch. We want to use the 1 gig links (existing in place at this time) for management VMkernel and vMotion traffic (2 NICs), as well as VM traffic (2 NICs). We are not sure if we can put one 2-port HBA into each host, set up the networking as peer-to-peer, and let vSAN communicate over the 10 gig HBAs without a switch. Hopefully we can get some good results. I’ll post this to Twitter, and maybe I can get some info or open up some dialog on this. Thanks for reading.

 

-vTimD