vSAN Fault Tolerance Testing!

Ok, so… This is my first blog post. Welcome. Shout out to @vHipster for getting me into the social media game.

At the request of Twitter user @vKoolaid, I am blogging about my vSAN Fault Tolerance testing today. I have been meaning to set up a blog and go for it for a while, and this is a good reason to start. So let’s get some info out of the way:

I am a Sr. Virtualization Administrator for Dell Services, on the Cloud Services team for one of our biggest customers. There is a team of 9 of us dedicated to this customer. This vSAN testing was done in our vLab, as we are testing this as a POC for a customer ROBO site with no existing storage devices. We have 4 Dell R720s on a 10 gig Ethernet network (Intel HBAs with Force10 switches). This is the same infrastructure setup as our main production VM and VDI infrastructure. The vSphere and vCenter version is 5.5 Update 2. Each of the 4 nodes has 1 x 200GB SAS SSD and 1 x 600GB SAS HDD dedicated to vSAN.

With 800GB of raw SSD space and 2400GB of raw HDD space, we came up with 2.18TB of usable vSAN volume.
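If you’re wondering where that 2.18 number comes from, here’s the back-of-the-napkin math. My working assumption is that in this hybrid vSAN 5.5 setup only the HDDs count toward datastore capacity (the SSDs are the caching tier) and that the Web Client reports it in binary terabytes:

```python
# Rough vSAN 5.5 capacity check: only the magnetic disks count toward the
# datastore capacity; the SSDs serve as the caching tier (my assumption here).
hdd_gb_per_node = 600      # 1 x 600GB SAS HDD per node
nodes = 4

raw_hdd_gb = hdd_gb_per_node * nodes       # 2400 GB raw HDD space
raw_hdd_bytes = raw_hdd_gb * 10**9         # drives are labeled in decimal GB

usable_tb = raw_hdd_bytes / 1024**4        # Web Client shows binary TB
print(f"{raw_hdd_gb}GB raw HDD ~= {usable_tb:.2f}TB in the Web Client")
# -> 2400GB raw HDD ~= 2.18TB in the Web Client
```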

I placed 1 lab VDI connection server and 1 lab Workspace Linux appliance on the vSAN volume for testing. Our first test was a single-disk tolerance test. I set up a constant ping to both servers, and we pulled a disk from the first node. The servers stayed online, exactly as expected. What we didn’t expect was that we didn’t receive any form of notification from the vSphere Web Client. When we refreshed the Datastore Summary page, the capacity of the vSAN datastore had dropped from its original 2.18TB to 1.64TB; vSAN had automatically adjusted the capacity to account for the missing drive. Placing the drive back in, waiting a moment, and refreshing brought the capacity back up to the original 2.18TB.
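For what it’s worth, the “constant ping” was nothing fancy. Here’s a rough sketch of that kind of monitor in Python, just to timestamp any drop while a disk is out (the VM names are placeholders, and it assumes Linux-style ping flags):

```python
import subprocess
import time
from datetime import datetime

# Placeholder names for the two test VMs sitting on the vSAN datastore
targets = ["lab-connection01", "lab-workspace01"]

while True:
    for host in targets:
        # Single ping with a 2 second timeout; returncode 0 means it answered
        ok = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        ).returncode == 0
        stamp = datetime.now().strftime("%H:%M:%S")
        print(f"{stamp} {host}: {'up' if ok else 'NO RESPONSE'}")
    time.sleep(1)
```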

We noticed during these tests that the activity lights on the SSDs were constant, but the activity on the HDDs was pretty much focused on one HDD at a time (at that moment, it was the HDD in the 4th node of our vSAN cluster). We then ran the same test again, but pulled the most active HDD in the cluster. The result was the same: the server stayed online and active.

It was then time to set up the full-host tolerance test. We were not sure how this was going to go, as far as storage and compute tolerance went. We decided to go with the first node in the cluster for this test. We set up a file transfer from my VDI (on the 10 gig network) to the test server: I had 3.66GB of data in my Downloads folder, and I pasted that into the admin share C$ of the test server. I did not realize it would transfer faster than we could get to the back of the server. I moved 3.66GB of data in about 10-20 seconds, BUT that is for another blog post. Wow.
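Quick math on that copy, just to put a number on the “wow” (assuming the full 3.66GB really did move inside that 10-20 second window):

```python
# Effective throughput for 3.66GB moved in roughly 10-20 seconds
size_bits = 3.66 * 10**9 * 8            # ~29.3 gigabits of payload

for seconds in (10, 20):
    gbps = size_bits / seconds / 10**9
    print(f"{seconds}s -> ~{gbps:.1f} Gb/s effective")
# 10s -> ~2.9 Gb/s, 20s -> ~1.5 Gb/s: not line rate on the 10 gig network,
# but very healthy for a single SMB copy landing on the vSAN datastore.
```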

So we have 1 person set up at the console and 1 person at the back of the server. The second the transfer starts, we pull the plug on the host. The host goes black. The transfer gets to about 25% and holds… about 5 seconds later, it kicks right back in and completes. We’re floored. We killed an entire compute node, with storage, and it picked right back up. Now, what we didn’t know was that our lab vCenter was on this node, and we had just killed it. There are apparently some quirky things regarding vSAN and HA that I need to read more about; we had to disable HA during the vSAN setup. More can be found here.

It was at this point that we decided to break stuff and see what happened. We put everything back where it was and refreshed. We had all our space back on the vSAN volume, and all servers were online and responding. I started a new file copy, and we pulled the HDDs out of nodes 1 and 4. The servers… still pinging, BUT… the file copy died. The test server stopped responding at the console. It was dead. It continued to ping, but the server was 100% dead; the file copy and the console were shot. We placed 1 drive back in the 1st node, refreshed, and got that space back on the volume. The server didn’t come back. I powered it off and back on in vSphere. Error. The VM showed as inaccessible. It was gone. Once I placed the 2nd disk back in and gave it some time, I refreshed, and all of the space came back to the vSAN volume. Seconds later, I was able to start the VMs back up with no problem. This proved to us that we couldn’t stand to lose more than 1 disk. Unfortunately, we didn’t have the disks on hand to verify whether we could lose 1 disk from each host without faulting. That will be for future testing.
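In hindsight, that result lines up with how the default storage policy lays data out. With “number of failures to tolerate” set to 1, each object is mirrored across two hosts, with a witness on a third. Here’s a toy model of that idea (not real vSAN logic, and the placement is my guess based on which two disks killed our test VM):

```python
# Toy model of vSAN FTT=1 (mirrored) placement: an object stays accessible
# only while at least one full replica survives and more than half of its
# components (replicas + witness) are still reachable.
vm_object = {
    "replicas": {"node1-hdd", "node4-hdd"},   # guessed placement for the test VM
    "witness":  "node2-hdd",
}

def accessible(obj, failed_disks):
    replicas_left = obj["replicas"] - failed_disks
    witness_ok = obj["witness"] not in failed_disks
    components_left = len(replicas_left) + (1 if witness_ok else 0)
    total_components = len(obj["replicas"]) + 1
    return len(replicas_left) >= 1 and components_left * 2 > total_components

print(accessible(vm_object, {"node1-hdd"}))                # True: one HDD pulled
print(accessible(vm_object, {"node1-hdd", "node4-hdd"}))   # False: both replicas gone
```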

This concluded our fault tolerance testing. As I stated, this will be for a 3-host ROBO site at a customer location. Our testing next week will involve trying to offload the vSAN traffic onto a 10 gig ad-hoc network, without the use of a 10 gig switch. We want to use the 1 gig links (already in place) for VMkernel and vMotion traffic (2 NICs), as well as VM traffic (2 NICs). We are not sure if we can put 1 x 2-port HBA into each host, set up the networking as peer-to-peer, and let vSAN communicate over the 10 gig HBAs without the switch. Hopefully we can get some good results. I’ll post this to Twitter, and maybe I can get some info or open up some dialog on this. Thanks for reading.


-vTimD