Does ESX lack storage resiliency?
Back in April, I posted about how our primary storage vendor disavowed us after one of our arrays failed by saying, “It’s not our problem”. (You can read about it here). Well, this same vendor had to do some “routine” maintenance on one of our arrays that was so “routine” that the vendor claimed it would not have any impact on our servers. The vendor technician came onsite to do the work and reaffirmed that it should have no visible impact. This routine maintenance was just a reboot of one controller, a wait for it to come back online, and then a reboot of the other. Over 50 servers went down and it took us three hours to recover.
While I could go on and rant about the vendor, I really want to focus on something I noticed about the outage. Almost all of our physical Windows servers tolerated the outage and suffered no major problems, but our ESX hosts are another story altogether.
All of our ESX hosts that were attached to the array in question basically “froze”. It was really weird. Virtual Center said all the virtual servers were up and running, but we couldn’t do anything with them. Rebooted VC, no change. I logged into the service consoles of the hosts to run various iterations of vmware-cmd to manipulate the virtual servers, but nothing worked. I figured the only thing I could do at this point was to reboot the hosts. Since my hosts are attached to multiple arrays, I tried to vMotion the known good virtual servers to a single host so I wouldn’t have to take them down. No go. Basically, I had lost all control of my hosts.
OK, time for a reboot. Did that and I lost all access to my LUNs. A quick looksie into UCSM showed all my connections were up. So did Fabric Manager. I could see during the reboots that ESX was complaining about not being able to read a label for a volume or two. Reviewing various host log files showed a number of weird entries that I have no idea how to interpret. Many were obviously disk related, others weren’t.
After multiple reboots, HBA rescans (initiated via VC and service console), and such, we still couldn’t see the LUNs. Keep in mind, we were three hours into a major outage. That is the point where I have to get real creative in coming up with solutions. I am not going to say that these solutions are ideal, but they will get us up and running. In this case, I was thinking of repurposing our dev ESX hosts for our production environment. All it would take would be to add them to the appropriate cluster, present LUNs, manually register any really messed up virtual servers, and power up the virtual servers.
Before I presented this idea to management, I don’t know what or why, but something triggered a memory of my first ESX host failure. Way back in the ESX 2.x days, I had a problem where a patch took out access to my LUNs. The fix was to run the command ‘esxcfg-boot -b’. Ran it, problem fixed.
I know that the esxcfg-boot command rejiggers inits and such, but I really don’t know why it fixed the problem. Did something happen to my HBA drivers/config?
What really bothers me about this is that almost all of my Windows servers and clusters came back online by themselves. If they can do it, why can’t VMware program a bit more resiliency into ESX? I hate to say this, but incidents like this make me question my choice of hypervisor. Since the latest version of Hyper-V relies on Windows Failover Clustering, would it have responded like my existing clusters and tolerated the outage appropriately? Anyone know?


