Back in April, I posted about how our primary storage vendor disavowed us after one of our arrays failed by saying, “It’s not our problem”. (You can read about it here). Well, this same vendor had to do some “routine” maintenance on one of our arrays that was so “routine” that the vendor claimed it would not have any impact on our servers. The vendor technician came onsite to do the work and reaffirmed that it should have no visible impact. This routine maintenance was just a reboot of one controller, a wait for it to come back online, and then a reboot of the other. Over 50 servers went down and it took us three hours to recover.
While I could go on and rant about the vendor, I really want to focus on something I noticed about the outage. Almost all of our physical Windows servers tolerated the outage and suffered no major problems, but our ESX hosts are another story altogether.
All of our ESX hosts that were attached to the array in question basically “froze”. It was really weird. Virtual Center said all the virtual servers were up and running, but we couldn’t do anything with them. Rebooted VC, no change. I logged into the service consoles of the hosts to run various iterations of vmware-cmd to manipulate the virtual servers, but nothing worked. I figured the only thing I could do at this point was to reboot the hosts. Since my hosts are attached to multiple arrays, I tried to vMotion the known good virtual servers to a single host so I wouldn’t have to take them down. No go. Basically, I had lost all control of my hosts.
OK, time for a reboot. Did that and I lost all access to my LUNs. A quick looksie into UCSM showed all my connections were up. So did Fabric Manager. I could see during the reboots that ESX was complaining about not being able to read a label for a volume or two. Reviewing various host log files showed a number of weird entries that I have no idea how to interpret. Many were obviously disk related, others weren’t.
After multiple reboots, HBA rescans (initiated via VC and the service console), and such, we still couldn’t see the LUNs. Keep in mind, we were three hours into a major outage. That is the point where I have to get really creative in coming up with solutions. I am not going to say that these solutions are ideal, but they will get us up and running. In this case, I was thinking of repurposing our dev ESX hosts for our production environment. All it would take would be to add them to the appropriate cluster, present LUNs, manually register any really messed up virtual servers, and power up the virtual servers.
Before I presented this idea to management, something (I don’t know what or why) triggered a memory of my first ESX host failure. Way back in the ESX 2.x days, I had a problem where a patch took out access to my LUNs. The fix was to run the command ‘esxcfg-boot -b’. Ran it, problem fixed.
I know that the esxcfg-boot command rejiggers inits and such, but I really don’t know why it fixed the problem. Did something happen to my HBA drivers/config?
What really bothers me about this is that almost all of my Windows servers and clusters came back online by themselves. If they can do it, why can’t VMware program a bit more resiliency into ESX? I hate to say this, but incidents like this make me question my choice of hypervisor. Since the latest version of Hyper-V relies on Windows Failover Clustering, would it have responded like my existing clusters and tolerated the outage appropriately? Anyone know?
For the past few days, I’ve been working on troubleshooting a problem in UCS that I, admittedly, caused. The problem in question has to do with an error code/msg that I received when trying to move a service profile from one blade to another. The error code is: F0327.
According to the UCS error code reference guide, it translates as:
Service profile [name] configuration failed due to [configQualifier]
The named configuration qualifier is not available. This fault typically occurs because Cisco UCS Manager cannot successfully deploy the service profile due to a lack of resources that meet the named qualifier. For example, the following issues can cause this fault to occur:
•The service profile is configured for a server adapter with vHBAs, and the adapter on the server does not support vHBAs.
•The local disk configuration policy in the service profile specifies the No Local Storage mode, but the server contains local disks.
If you see this fault, take the following actions:
Step 1 Check the state of the server and ensure that it is in either the discovered or unassociated state.
Step 2 If the server is associated or undiscovered, do one of the following:
–Discover the server.
–Disassociate the server from the current service profile.
–Select another server to associate with the service profile.
Step 3 Review each policy in the service profile and verify that the selected server meets the requirements in the policy.
Step 4 If the server does not meet the requirements of the service profile, do one of the following:
–Modify the service profile to match the server.
–Select another server that does meet the requirements to associate with the service profile.
Step 5 If you can verify that the server meets the requirements of the service profile, execute the show tech-support command and contact Cisco Technical Support.
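To make the guide's decision flow easier to follow, here is a minimal Python sketch of the five steps as a triage function. It is only an illustration and does not talk to UCS Manager. The names `ServerState` and `triage_f0327` are made up for this example, and the pairing of "discover" with an undiscovered server and "disassociate" with an associated one is my reading of Step 2, not something the guide spells out.

```python
from enum import Enum


class ServerState(Enum):
    """Server states named in the guide (hypothetical enum for this sketch)."""
    DISCOVERED = "discovered"
    UNASSOCIATED = "unassociated"
    ASSOCIATED = "associated"
    UNDISCOVERED = "undiscovered"


def triage_f0327(state, meets_policies):
    """Return the actions the guide suggests next, as plain-text strings."""
    # Steps 1-2: the server should be discovered or unassociated.
    if state is ServerState.UNDISCOVERED:
        return ["discover the server",
                "select another server to associate with the service profile"]
    if state is ServerState.ASSOCIATED:
        return ["disassociate the server from the current service profile",
                "select another server to associate with the service profile"]
    # Steps 3-4: the server must meet every policy in the service profile.
    if not meets_policies:
        return ["modify the service profile to match the server",
                "select another server that meets the requirements"]
    # Step 5: nothing left to check locally.
    return ["run show tech-support and contact Cisco Technical Support"]


if __name__ == "__main__":
    for action in triage_f0327(ServerState.UNASSOCIATED, True):
        print(action)
```

In my case the blade was in a valid state and met the policies, so this flow drops straight through to Step 5, which is why none of the guide's suggestions helped.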
While the guide gave me lots of things to try, none of them worked. It took me a while, but I figured out how to reproduce the error, a possible cause, and a workaround.
Here’s how to produce the error:
- Create a service profile without assigning any HBAs. Shut down the server when the association process has completed.
- After the profile is associated, assign an HBA or two.
- You should receive this dialog box:
You will then see this in the general tab of the service profile in question:
Now here is where the error can be induced:
- Don’t power on. Keep in mind that the previous dialog box said that changes wouldn’t be applied until the blade was rebooted (powered on).
- Now disassociate the profile and associate it with another blade. The “error” is carried over to the new blade and the config process (association process) does not run.
Powering up the newly associated blade does not correct the issue. What has happened is that the disassociation/association process that is supposed to occur above does not take place due to the service profile being in an error state. There are two ways to avoid or clear it:
- Reboot after adding the HBA. This will complete the re-configuration process, thus allowing disassociation/association processes to perform normally. This is also the proper procedure. Or
- Go to the Storage tab of the affected service profile and click on “Change World Wide Node Name”. This forces the re-configuration to take place.
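The sequence above can be sketched as a toy state model in Python. This is purely my mental model of the behavior I observed, not how UCS Manager is actually implemented, and it uses no UCS API. The class `ServiceProfile` and all of its attributes and methods are hypothetical names chosen for the illustration.

```python
class ServiceProfile:
    """Toy model of a service profile with a pending, reboot-gated change."""

    def __init__(self, name):
        self.name = name
        self.blade = None
        self.vhbas = []
        self.pending_reboot = False  # change waiting for a reboot to apply
        self.fault = None

    def associate(self, blade):
        self.blade = blade
        # Assumption: an unapplied change blocks the association config run.
        self.fault = "F0327" if self.pending_reboot else None

    def disassociate(self):
        self.blade = None

    def add_vhba(self, vhba):
        self.vhbas.append(vhba)
        if self.blade is not None:
            self.pending_reboot = True  # "will apply on next reboot"

    def reboot(self):
        # A reboot applies the change only if the profile is not in error.
        if self.fault is None:
            self.pending_reboot = False

    def change_wwnn(self):
        # Models the "Change World Wide Node Name" button forcing a reconfig.
        self.pending_reboot = False
        self.fault = None


def reproduce():
    sp = ServiceProfile("demo")
    sp.associate("blade-1")
    sp.add_vhba("vhba0")      # change is now pending
    sp.disassociate()         # moved without rebooting first
    sp.associate("blade-2")   # error state carries over
    return sp


if __name__ == "__main__":
    sp = reproduce()
    print(sp.fault)
    sp.reboot()
    print(sp.fault)
    sp.change_wwnn()
    print(sp.fault)
```

In this model, calling `reboot()` right after `add_vhba()` and before `disassociate()` clears the pending flag, so the later association completes cleanly, which matches the proper procedure.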
I’ve opened a ticket with TAC on this asking for a few documentation updates. The first update is to basically state the correct method for applying the HBAs and that if not followed, the error msg will appear.
The second update is for them to update the error code guide with a 6th option – Press “Change World Wide Node Name” button.
I am going to go out on a limb and say that they probably didn’t count on people like me doing things that they shouldn’t be doing or in an improper manner when they wrote the manuals. 🙂