Does ESX lack storage resiliency?

Back in April, I posted about how our primary storage vendor disavowed us after one of our arrays failed, saying, “It’s not our problem”.  (You can read about it here).  Well, this same vendor had to do some “routine” maintenance on one of our arrays that was so “routine” the vendor claimed it would not have any impact on our servers.  The vendor technician came onsite to do the work and reaffirmed that it should have no visible impact.  The maintenance was just a reboot of one controller, a wait for it to come back online, and then a reboot of the other.  Over 50 servers went down, and it took us three hours to recover.

While I could go on and rant about the vendor, I really want to focus on something I noticed about the outage.  Almost all of our physical Windows servers tolerated the outage and suffered no major problems, but our ESX hosts were another story altogether.

All of our ESX hosts that were attached to the array in question basically “froze”.  It was really weird.  Virtual Center said all the virtual servers were up and running, but we couldn’t do anything with them.  Rebooted VC, no change.    I logged into the service consoles of the hosts to run various iterations of vmware-cmd to manipulate the virtual servers, but nothing worked.   I figured the only thing I could do at this point was to reboot the hosts.  Since my hosts are attached to multiple arrays, I tried to vMotion the known good virtual servers to a single host so I wouldn’t have to take them down.  No go.  Basically, I had lost all control of my hosts.
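
For anyone wondering what I mean by “various iterations of vmware-cmd”, it was roughly the following (the VM path below is just an example; substitute your own .vmx paths):

    # List the VMs registered on this host
    vmware-cmd -l

    # Check a VM's state and try a soft power operation (example path)
    vmware-cmd /vmfs/volumes/datastore1/vm01/vm01.vmx getstate
    vmware-cmd /vmfs/volumes/datastore1/vm01/vm01.vmx stop trysoft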

OK, time for a reboot.  Did that and I lost all access to my LUNs.  A quick looksie into UCSM showed all my connections were up, and Fabric Manager said the same.  I could see during the reboots that ESX was complaining about not being able to read a label for a volume or two.  Reviewing various host log files showed a number of weird entries that I had no idea how to interpret.  Many were obviously disk related; others weren’t.
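
If you are curious what those “weird entries” look like, the quickest way I know to surface the disk-related ones on ESX classic is to grep the vmkernel log from the service console (a rough sketch; the exact messages and your HBA driver name will vary):

    # Surface SCSI/path-related errors from the outage window
    grep -iE "scsi|path|reservation" /var/log/vmkernel | tail -50

    # HBA driver messages (substitute your driver name, e.g. fnic, qla, lpfc)
    grep -iE "fnic|qla|lpfc" /var/log/vmkernel | tail -20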

After multiple reboots, HBA rescans (initiated via VC and the service console), and such, we still couldn’t see the LUNs.  Keep in mind, we were three hours into a major outage.  That is the point where I have to get really creative in coming up with solutions.  I am not going to say that these solutions are ideal, but they will get us up and running.  In this case, I was thinking of repurposing our dev ESX hosts for our production environment.  All it would take would be to add them to the appropriate cluster, present the LUNs, manually register any really messed-up virtual servers, and power up the virtual servers.
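
For what it’s worth, the mechanics of that plan are straightforward; per host, it would have looked roughly like this from the service console (the datastore and VM names below are placeholders):

    # Rescan the HBAs for newly presented LUNs, then refresh VMFS volumes
    esxcfg-rescan vmhba1
    esxcfg-rescan vmhba2
    vmkfstools -V

    # Register a VM whose inventory entry is broken, then power it on
    vmware-cmd -s register /vmfs/volumes/prod_ds01/vm01/vm01.vmx
    vmware-cmd /vmfs/volumes/prod_ds01/vm01/vm01.vmx start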

Before I presented this idea to management, something (I don’t know what or why) triggered a memory of my first ESX host failure.  Way back in the ESX 2.x days, I had a problem where a patch took out access to my LUNs.  The fix was to run the command ‘esxcfg-boot -b’.  Ran it, problem fixed.
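
For anyone who hits the same wall, the fix itself is a one-liner followed by a restart (this is from memory, so treat it as a sketch):

    # Rebuild the boot configuration/initrd, then restart the host
    esxcfg-boot -b
    reboot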

I know that the esxcfg-boot command rejiggers the initrd and boot configuration, but I really don’t know why it fixed the problem.  Did something happen to my HBA drivers/config?

What really bothers me about this is that almost all of my Windows servers and clusters came back online by themselves.  If they can do it, why can’t VMware program a bit more resiliency into ESX?  I hate to say this, but incidents like this make me question my choice of hypervisor.  Since the latest version of Hyper-V relies on Windows Failover Clustering, would it have responded like my existing clusters and tolerated the outage appropriately?  Anyone know?

  1. JeffS
    October 27, 2010 at 6:47 pm

    I’m curious, were the ESX hosts booting from the same SAN that went down? Also, was this ESX 3.5 or below? I ask because my recollection is that “esxcfg-boot -b” is not required in ESX(i) 4 and above, as it’s now part of the shutdown process; i.e., if the hosts were ESX(i) 4, the problem should have resolved itself on a host reboot (assuming the LUNs were presented properly). Since you mention the command fixed the problem, would it have been avoided had you been on 4.0 or 4.1 of the hypervisor?

    I’m also curious whether you, as part of a “certify for production” process, test all the possible failure scenarios, e.g., simulated failures starting with simple cable faults and switch faults, on up to SAN controller loss or even complete power failure?

    I just went through this process with a new EMC CX4/Cisco UCS/vSphere 4.1 deployment, and I tested UCS/EMC firmware updates as well as every failure scenario I could think of, up to and including an array (as well as complete) power loss. This process was extremely helpful in understanding what would likely happen should the same issue(s) occur in production and, more importantly, assisted in developing recovery procedures for each scenario (minimize surprises). After all was said and done, I found the above configuration pretty darn invincible. Between EMC’s PowerPath/VE (ESX multipathing driver) and Cisco UCS’s redundant everything, only the SAN/UCS power loss resulted in a service loss (as expected), but everything recovered after a power-up of the SAN and a reboot of the ESXi 4.1 hosts.
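
    If it helps, the quick sanity check I ran between each failure test was just the path listing from the host shell (PowerPath/VE also has its own remote rpowermt tooling, but that’s a separate topic); a rough sketch:

        # List every storage path and its state; look for dead paths after each test
        esxcfg-mpath -l

        # Brief per-device summary of available paths
        esxcfg-mpath -b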

    • October 28, 2010 at 7:23 am

      Thanks for commenting and good questions. First, we do not boot from SAN, and this was on ESX 4.0 with UCS hardware. We did see similar behavior on ESX 3.5 and HP servers when we were doing some Exchange disaster scenario tests over a year ago. It’s funny, but I completely forgot about it. When I searched the VMware forums, I found a post of mine that described the situation; the response I got back was that this may be a “design feature” of ESX, the purpose being to prevent split-brain scenarios and disk corruption.

      My reboot scenario above was a hard power down, so the host never had a chance to run esxcfg-boot on its own. I tried various iterations of shutdown, reboot, and kill (lots of processes), but the hosts wouldn’t go down gracefully.
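
      For context, the “kill (lots of processes)” part was me hunting down what looked like hung VM worker processes by hand, roughly as below; this is a last-resort sketch from memory, not a recommendation:

          # Look for the vmware-vmx worker tied to a hung VM
          ps auxwww | grep vmware-vmx

          # Last resort: force-kill it by PID (risks that VM's in-flight I/O)
          kill -9 <pid>

          # ESX/ESXi 4.x also has an esxcli route (syntax from memory)
          esxcli vms vm list
          esxcli vms vm kill --type force --world-id <world-id>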

      We ran through all the certification tests that you mentioned, but on a smaller scale: fewer workloads, fewer LUNs, fewer hosts. I am wondering if the scale factors into things. It’s also possible that it has something to do with our arrays (almost 5 years old). We are currently in the process of purchasing new storage, so you can bet we will test for this.

      I am going to open a ticket with support and see if the answers come back differently this time (as compared to over a year ago).

  2. Jason Yarberry
    October 29, 2010 at 10:13 am

    This sounds very similar to the LUN/SCSI locking issue that was occurring with ESX 3.0.2 but was supposed to have been fixed; back then we would run the vmkfstools -L, -P, -B, and -R commands. We patched and upgraded over time to ESX 3.5, when the issue occurred again. We were performing a failback on a NetApp storage processor; the night before, the storage processors had panicked due to a separate issue. The next evening our VM environment was stable, so we attempted to fail back. The issue began with a single VM, then an ESX host, and within 30 minutes the entire cluster was crashing down. This was a Fibre Channel-attached ESX host cluster; our NFS cluster, while on the same NetApp filers, was stable the entire time. After only 30 minutes I had VMware support and NetApp on the line.

    After several hours of trial and error, we shut down the VMs and ESX hosts, unmapped/unzoned the storage, and took the storage offline within the NetApp storage processors. We then remounted a single LUN, mapped/zoned it back, brought up a single ESX host, scanned the LUN, and verified it was stable. Once it was verified stable, the remaining LUNs were brought up one at a time until all of them were restored. Finally, we brought up the VMs one at a time, verified stability, and then moved on to the other ESX hosts, one at a time.

    We could never get a clear cause other than SCSI LUN locking, but no one accepted responsibility. Due to this issue, we are now migrating to 10 Gig NFS storage. At the same time, ESXi 4.1 supports offloading SCSI LUN locking to the storage device.
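
    For reference, the lock-clearing attempts were along these lines, run against the device paths behind the affected VMFS volumes (the datastore and device names below are only examples):

        # Query the VMFS volume
        vmkfstools -P /vmfs/volumes/prod_ds01

        # Try to clear a stuck SCSI reservation on the underlying device
        vmkfstools -L lunreset /vmfs/devices/disks/vmhba1:0:12:0
        vmkfstools -L targetreset /vmfs/devices/disks/vmhba1:0:12:0
        vmkfstools -L busreset /vmfs/devices/disks/vmhba1:0:12:0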
