Tag Archives: vSAN

vSAN ROBO – A Call to Arms! [Updated / Resolved]

Hey all,

We have a problem that has us stuck against a wall. I am going to lay it out here, and any info or input anybody can provide would be great! Here goes:

We have 3 Dell R710s for vSAN at a ROBO location. The servers have the following networking configuration:

 

Management/vMotion:

vSwitch     : vSphere Standard Switch

IP Segment  : 172.18.18.1/26 GW 172.18.18.1

VMNics      : Two on each server, connected to switches S1 and S2

Switch Port : ACCESS ports (VLAN 24)

 

vSAN:

vSwitch     : Tried both vSS and vDS

IP Segment  : 10.149.12.1/28 GW 10.149.x.1

VMNic       : Single port on each server connected to S1 only

Switch Port : ACCESS ports (VLAN 12)

 

Cisco Switch Details:

Two Cisco WS-C6509-E switches (S1 and S2) running IOS 12.2(33)SXH5.

 

IGMP snooping is ON (but the querier is disabled); switch info below.

 

Vlan12 is up, line protocol is up

Internet address is 10.149.12.3/28

IGMP is disabled on interface

Multicast routing is disabled on interface

Multicast TTL threshold is 0

No multicast groups joined by this system

IGMP snooping is globally enabled

IGMP snooping CGMP-AutoDetect is globally enabled

IGMP snooping is enabled on this interface

IGMP snooping fast-leave (for v2) is disabled and querier is disabled

IGMP snooping explicit-tracking is enabled

IGMP snooping last member query response interval is 1000 ms

IGMP snooping report-suppression is enabled

 

Note: VLAN 12 was newly created just for vSAN traffic.

 

Issue:

We can’t ping the vSAN segment gateway (10.149.12.1) from any of the ESXi hosts, and we can’t ping the vSAN vmk ports between ESXi hosts. The network guy says the switch isn’t learning the MACs on its end either. At the same time, a Windows server in the same segment as the ESXi management network can ping the vSAN gateway (10.149.12.1) just fine. We tried multicast testing using the command “tcpdump-uw -i vmk2 -n -s0 -t -c 20 udp port 12345 or udp port 23451”, and it only returns UDP packets from the same ESXi host it is run on, not from the other two hosts in the cluster.
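
For reference, this is roughly the checklist we ran from the ESXi shell on each host. Treat it as a hedged sketch: vmk2 is the vSAN vmkernel port in our setup, and the placeholder needs to be another host’s vSAN vmk address.

vmkping -I vmk2 10.149.12.1                # unicast ping to the vSAN gateway
vmkping -I vmk2 <other-host-vsan-vmk-IP>   # unicast ping host-to-host on the vSAN segment
esxcli vsan cluster get                    # cluster membership as this host sees it
tcpdump-uw -i vmk2 -n -s0 -t -c 20 udp port 12345 or udp port 23451   # watch for vSAN multicast from the other hosts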

 

Anyone who could possibly help out, or even if you know anyone, please share this around. Any and all input would be greatly appreciated, and I would love to buy you a beer for your troubles at VMworld this year!

-vTimD

 

 

UPDATE:

Thanks to @vkoolaid, @chuckhollis, @DuncanYB, and @leedilworth for all of the responses and info. After reading a troubleshooting document and then checking out the VMTN thread linked below, I see that a querier must be enabled if IGMP snooping is enabled. We are going to need a code upgrade for this, so that is our next path. More updates to come.

https://communities.vmware.com/message/2367348

 

 

Update 2 / Resolution:

Thank you so much to every single person who pinged me, read this, or provided some form of input. Cisco was engaged, and we were able to enable the querier without a code upgrade on the switch. The 3 hosts are now visible to each other, and vSAN is passing traffic as it should!
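
For anyone hitting the same wall: with snooping on but no querier in the VLAN, nothing solicits IGMP membership reports, so the switch ends up pruning the multicast groups vSAN uses for its cluster heartbeats. The change came down to roughly the following on the vSAN VLAN’s SVI (a hedged sketch of the IOS syntax, using the SVI address from the output above; verify the exact commands against your own platform and code version):

interface Vlan12
 ip address 10.149.12.3 255.255.255.240
 ip igmp snooping querier
! Alternatively, disabling snooping on just this VLAN sidesteps the querier
! requirement at the cost of flooding multicast within the VLAN:
! no ip igmp snooping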

vSAN Fault Tolerance Testing!

Ok, so… This is my first blog post. Welcome. Shout out to @vHipster for getting me into the social media game.

At the request of Twitter user @vKoolaid, I am blogging about my vSAN fault tolerance testing today. I have been meaning to set up a blog and go for it for a while, and this is a good reason to start. So let’s get some info out of the way:

I am a Sr. Virtualization Administrator for Dell Services, on the Cloud Services team for one of our biggest customers. There is a team of 9 of us dedicated to this customer. This vSAN testing was done in our vLab, as we are testing this as a POC for a customer ROBO site with no existing storage devices. We have 4 Dell R720s on a 10Gb Ethernet network (Intel HBAs with Force10 switches). This is the same infrastructure setup as our main production VM and VDI infrastructure. The vSphere and vCenter version is 5.5 Update 2. Each of the 4 nodes has 1 x 200GB SAS SSD and 1 x 600GB SAS HDD dedicated to vSAN.

With the 800GB of raw SSD space and 2400GB of raw HDD space, we came up with 2.18TB of usable vSAN volume. That number checks out: the SSDs serve only as the caching tier in vSAN and don’t count toward capacity, so usable space is just the 4 x 600GB HDDs, and 2400GB comes out to roughly 2.18TB as vSphere reports it.

I placed 1 lab VDI connection server and 1 lab Workspace Linux appliance on the vSAN volume for testing. Our first test was a single-disk tolerance test. I set up a constant ping to both servers, and we pulled a disk from the first node. The server stayed online, exactly as expected. What we didn’t expect was that we didn’t receive any form of notification from the vSphere Web Client. When we refreshed the Datastore Summary page, the capacity of the vSAN datastore had dropped from its original 2.18TB to 1.64TB (the pulled 600GB drive accounts for the missing ~0.55TB). We realized it automatically adjusts the reported capacity based on the disks present. Placing the drive back in, waiting a moment, then refreshing brought the capacity back up to the full 2.18TB.
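
If you want more detail than the Datastore Summary page gives you, the ESXi shell will show per-disk vSAN state. A hedged example from 5.5 of what we could have checked while the drive was out:

esxcli vsan storage list   # each SSD/HDD claimed by vSAN on this host, with its UUID, disk group, and CMMDS state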

We noticed during the tests that the activity lights on the SSDs were constant, but the activity lights on the HDDs were pretty much focused on 1 HDD at a time (at this point, it was the one in the 4th node of our vSAN cluster). We then decided to run the same test, but pulled the most active HDD in the cluster. This produced the same result: the server stayed online and active.

It was then time to set up the full-host tolerance test. We were not sure how this was going to go, as far as storage and compute tolerance went. We decided to go with the first node in the cluster for this test. We set up a file transfer from my VDI (on the 10Gb network) to the test server. I had 3.66GB of data in my Downloads folder, and I pasted it into the admin share C$ of the test server. I did not realize that it would transfer faster than we could get to the back of the server; I moved 3.66GB of data in about 10-20 seconds (roughly 2Gb/s), BUT that is for another blog post. Wow.

So we have 1 person set up at the console and 1 person set up at the back of the server. The second the transfer starts, we pull the plug on the host. The host goes black. The transfer gets to about 25% and holds… about 5 seconds later, it kicks right back off and completes. We’re floored. We killed an entire compute node, with storage, and it picked right back up. Now, what we didn’t know was that our lab vCenter was on this node, and we had killed it too. There are apparently some quirky things regarding vSAN and HA that I need to read more on; in 5.5, HA has to be turned off while you enable vSAN on the cluster and re-enabled afterward, which is why we had to disable it during the vSAN setup. More can be found here.

It was at this point that we decided to break stuff and see what happened. We put everything back where it was and refreshed. We had all our space back on the vSAN volume, and all servers were online and responding. I started a new file copy, and we pulled the HDDs out of nodes 1 and 4. The servers… still pinging, BUT the file copy died. The test server stopped responding on the console. It continued to ping, but the server was otherwise 100% dead; the file copy and the console were both shot. We placed 1 drive back in the 1st node, refreshed, and got that space back on the volume, but the server didn’t come back. I powered it off and back on in vSphere. Error: the VM showed as inaccessible. It was gone. Once I placed the 2nd disk back in and gave it some time, I refreshed, and all of the space came back to the vSAN volume. Seconds later, I was able to start the VMs back up, no problem. This proved to us that we couldn’t stand to lose more than 1 disk. Unfortunately, we didn’t have the disks on hand to verify whether we could lose 1 disk from each host without faulting. That will be for future testing.
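
In hindsight, this is exactly what the default storage policy promises. With “number of failures to tolerate” at its default of 1, each object gets two mirrored copies plus a witness, so losing components on two hosts at once takes the object inaccessible until a copy returns. A hedged way to confirm the default from the ESXi shell on 5.5:

esxcli vsan policy getdefault   # should show hostFailuresToTolerate = 1 for each policy class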

This concluded our fault tolerance testing. As I stated, this will be for a 3-host ROBO site at a customer location. Our testing next week will involve trying to offload the vSAN traffic onto a 10Gb ad-hoc network, without the use of a 10Gb switch. We want to use the 1Gb links (existing in place at this time) for VMkernel and vMotion traffic (2 NICs), as well as VM traffic (2 NICs). We are not sure if we can put one 2-port HBA into each host, set up the networking as peer-to-peer, and allow vSAN to communicate over the 10Gb HBAs without the switch. Hopefully we can get some good results. I’ll post this to Twitter, and maybe I can get some info or open up some dialog on this. Thanks for reading.

 

-vTimD