Archive for October, 2010

Does ESX lack storage resiliency?

October 27, 2010

Back in April, I posted about how our primary storage vendor disavowed us after one of our arrays failed by saying, “It’s not our problem”.  (You can read about it here.)  Well, this same vendor had to do some “routine” maintenance on one of our arrays that was so “routine” that the vendor claimed it would not have any impact on our servers.  The vendor technician came onsite to do the work and reaffirmed that it should have no visible impact.  This routine maintenance was just a reboot of one controller, a wait for it to come back online, and then a reboot of the other.  Over 50 servers went down and it took us three hours to recover.

While I could go on and rant about the vendor, I really want to focus on something I noticed about the outage.  Almost all of our physical Windows servers tolerated the outage and suffered no major problems, but our ESX hosts are another story altogether.

All of our ESX hosts that were attached to the array in question basically “froze”.  It was really weird.  Virtual Center said all the virtual servers were up and running, but we couldn’t do anything with them.  Rebooted VC, no change.    I logged into the service consoles of the hosts to run various iterations of vmware-cmd to manipulate the virtual servers, but nothing worked.   I figured the only thing I could do at this point was to reboot the hosts.  Since my hosts are attached to multiple arrays, I tried to vMotion the known good virtual servers to a single host so I wouldn’t have to take them down.  No go.  Basically, I had lost all control of my hosts.
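
For the curious, these are the sorts of commands I was running from the service console.  A quick sketch; the .vmx paths are placeholders, so substitute your own datastore and VM names:

    # List the VMs registered on this host and check one VM's state
    vmware-cmd -l
    vmware-cmd /vmfs/volumes/datastore1/vm01/vm01.vmx getstate

    # Try a graceful, then a hard, power-off (neither worked for me)
    vmware-cmd /vmfs/volumes/datastore1/vm01/vm01.vmx stop trysoft
    vmware-cmd /vmfs/volumes/datastore1/vm01/vm01.vmx stop hard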

OK, time for a reboot.  Did that and I lost all access to my LUNs.  A quick looksie into UCSM showed all my connections were up.  So did Fabric Manager.  I could see during the reboots that ESX was complaining about not being able to read a label for a volume or two.  Reviewing various host log files showed a number of weird entries that I had no idea how to interpret.  Many were obviously disk related; others weren’t.
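
If you want to poke at the same logs, the disk-related noise lives in the VMkernel logs on the service console.  A rough sketch of what I was grepping (exact messages will vary):

    # Recent storage-related chatter from the VMkernel
    grep -i scsi /var/log/vmkernel | tail -50

    # Warnings, including path and failover complaints
    grep -iE "path|failover" /var/log/vmkwarning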

After multiple reboots, HBA rescans (initiated via VC and the service console), and such, we still couldn’t see the LUNs.  Keep in mind, we were three hours into a major outage.  That is the point where I have to get really creative in coming up with solutions.  I am not going to say that these solutions are ideal, but they will get us up and running.  In this case, I was thinking of repurposing our dev ESX hosts for our production environment.  All it would take would be to add them to the appropriate cluster, present the LUNs, manually register any really messed-up virtual servers, and power up the virtual servers.
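
For reference, the service console side of those rescans looked roughly like this.  The vmhba names are assumptions; check what your hosts actually call them:

    # Rescan each HBA for new or changed LUNs
    esxcfg-rescan vmhba1
    esxcfg-rescan vmhba2

    # What VMFS volumes and paths does the host see now?
    esxcfg-scsidevs -m
    esxcfg-mpath -l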

Before I presented this idea to management, I don’t know how or why, but something triggered a memory of my first ESX host failure.  Way back in the ESX 2.x days, I had a problem where a patch took out access to my LUNs.  The fix was to run the command ‘esxcfg-boot -b’.  Ran it, problem fixed.

I know that the esxcfg-boot command rebuilds the boot configuration and initrd images, but I really don’t know why that fixed the problem.  Did something happen to my HBA drivers/config?
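
For the record, here’s the command, plus my best guess at why it helped.  My working theory, and it is only a theory, is that the initrd the host boots from had gone stale or been damaged, so regenerating it from the current device and driver configuration put the HBA bits back in order:

    # Rebuild the ESX boot configuration and initrd images
    esxcfg-boot -b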

What really bothers me about this is that almost all of my Windows servers and clusters came back online by themselves.  If they can do it, why can’t VMware program a bit more resiliency into ESX?  I hate to say this, but incidents like this make me question my choice of hypervisor.  Since the latest version of Hyper-V relies on Windows Failover Clustering, would it have responded like my existing clusters and tolerated the outage appropriately?  Anyone know?

First Impressions of VMware CapacityIQ

October 25, 2010

I’ve always wondered how good a job I’m doing with my virtualization project.  Yes, I know that I have saved my organization a few hundred thousand dollars by NOT having to purchase over 100 new servers.  But could I do better?  Am I sizing my hosts and guests correctly?  To answer those questions, I downloaded an evaluation copy of VMware’s CapacityIQ and have been running it for a bit over a week now.

My overall impression is that CapacityIQ needs some work.  Visually, the product is fine.  The product is also easy to use.  I’m just a bit dubious of the results though.

Before I get into the results, here are some details about my virtual environment.

  • Hypervisor is vSphere 4.0 build 261974.
  • CapacityIQ version is CIQ-ovf-1.0.4.1091-276824.
  • Hosts are Cisco B250-M2 blades with 96GB RAM, dual Xeon X5670 CPUs, and Palo adapters.

So what results do I see after one week’s run?  All my virtual servers are oversized.   It’s not that I don’t believe it; it’s just that I don’t believe it.

I read, and then re-read, the documentation and noticed that using a 24-hour time setting was not considered a best practice, since all the evening idle time would be factored into the sizing calculations.  So I adjusted the time calculations to be based on a 6am-6pm, Mon-Thurs schedule, which covers our core business hours.  All other settings were left at the defaults.

The first thing I noticed is that by doing this, I miss the peak usage events for those individual servers that happen to be busy at night.  The “time” setting is global, so it can’t be set on a per-VM basis.  Minus 1 point for this limitation.

The second thing I noticed, from reading the documentation, a few whitepapers, and posts on the VMware Communities forums, is that CapacityIQ does not take peak usage into account (I’ll come back to this later).  The basic formula for sizing calculations is fairly simple.  No calculus used here.
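
To illustrate what I mean by “no calculus”, here’s the kind of averaging-based arithmetic I believe is going on under the hood.  This is my reconstruction from the docs and forum posts, not VMware’s published algorithm, and all the numbers are made up:

    # Hypothetical demand-based sizing: average observed demand plus a buffer
    avg_cpu_mhz=480    # average CPU demand over the analysis window
    buffer=1.25        # 25% headroom on top of the average
    core_mhz=2666      # one X5670 core
    echo "$avg_cpu_mhz $buffer $core_mhz" | \
      awk '{ printf "recommended vCPUs: %d\n", int(($1 * $2) / $3) + 1 }'

Run that against a server that averages 480MHz and you get a recommendation of 1 vCPU, no matter how hard the box spikes at month-end.  An average-based formula shaves off the peaks by construction, which is exactly my complaint.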

The third thing I noticed is that the tool isn’t application aware.  It’s telling me that my Exchange mailbox cluster servers are way overprovisioned when I am pretty sure this isn’t the case.  We sized our Exchange mailbox cluster servers by running multiple stress tests and fiddling with various configuration values to get to something that was stable.  If I lower any of the settings (RAM and/or vCPU), I see failover events, customers can’t access email, and other chaos ensues.  CapacityIQ is telling me that I can get by with 1 vCPU and 4GB of RAM for a server hosting a bit over 4500 mailboxes.  That’s a fair-sized reduction from my current setting of 4 vCPU and 20GB of RAM.

It’s not that CapacityIQ is completely wrong about my Exchange servers.  It’s just that the app occasionally wants all that memory and CPU, and if it doesn’t get it and has to swap, the nastiness begins.  This is where application awareness comes in handy.

Let’s get back to peak usage.  What is the overarching, ultimate litmus test of proper VM sizing?  In my book, the correct answer is “happy customers”.  If my customers are complaining, then something is not right.  Right or wrong, the biggest success factor for any virtualization initiative is customer satisfaction.  The metric used to determine customer satisfaction may change from organization to organization.  For some it may be dollars saved.  For my org, it’s a combination of dollars saved and customer experience.

Based on the whole customer experience imperative, I cannot noticeably degrade performance or I’ll end up with business units buying discrete servers again.  If peak usage is not taken into account, then it’s fairly obvious that CapacityIQ will recommend smaller-than-acceptable virtual server configurations.  It’s one thing to take an extra 5 seconds to run a report, quite another to add an hour or two, yet based on what I am seeing, that is exactly what CapacityIQ is telling me to do.

I realize that this is a new area for VMware so time will be needed for the product to mature.  In the meantime, I plan on taking a look at Hyper9.  I hear the sizing algorithms it uses are a bit more sophisticated so I may get more realistic results.

Anyone else have experience with CapacityIQ?  Let me know.  Am I off in what I am seeing?  I’ll tweak some of the threshold variables to see what effects they have on the results.  Maybe the defaults are just impractical.

Use Cases for Cisco UCS Network Isolation

October 4, 2010

Based on my last post, a couple of people have emailed me asking, “What is the value of keeping UCS links alive when the network has gone down?”  The answer is: It Depends.  It depends on your applications and environment.  In my case, I have a number of multi-tiered apps that are session-oriented, batch processors, etc.

The easiest use case to describe involves batch processing.  We have a few applications that do batch processing late at night.  It just so happens that “late at night” is also the window for performing network maintenance.  When the two collide (batch jobs and maintenance), we either reschedule something (batch or maintenance), take down the application, or move forward and hope nothing goes awry.  Having these applications in UCS and taking advantage of the configuration in my previous post means I can do network maintenance without having to reschedule batch jobs or take down the application.

I could probably achieve similar functionality outside of UCS with a complex setup that makes use of multiple switches and NIC teaming drivers at the OS level, as sketched below.  However, some of my servers are using all of their physical NICs for different purposes, with different IP addresses.  In these cases, teaming drivers may add unnecessary complexity.  Not to mention that the premise of this use case is the ability to do network maintenance; any way to avoid relying on the network is a plus in my book here.
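
On the Linux side, that teaming setup would look something like this.  A minimal sketch of RHEL-era active-backup bonding; the interface name, mode, and file location are assumptions:

    # /etc/modprobe.conf -- declare an active-backup bond across two NICs
    alias bond0 bonding
    options bond0 mode=active-backup miimon=100

Windows teaming is a different animal entirely since it depends on the NIC vendor’s driver package, which is part of the complexity I’d rather avoid.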

Now let’s consider session-oriented applications.  In our case, we have a multi-tiered app that requires that open sessions be maintained from one tier to the next.  If there is a hiccup in the connection, the session closes and the app has to be restarted.  Typically, this means rebooting.  Fabric failover prevents the session from closing, so the app keeps running.  In this particular case, UCS isolation would prevent this app from doing any work, since no clients would be able to get to it.  Where it helps us is in restoring service faster when the network comes back, since it removes the need for a reboot.
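
For reference, fabric failover is just a per-vNIC setting.  In the UCSM CLI it looks roughly like the following; the org, service profile, and vNIC names are placeholders, and I’m going from memory, so verify the syntax against your own system:

    # Enable fabric failover on a service profile's vNIC (UCSM CLI)
    scope org /
    scope service-profile app-tier-1
    scope vnic eth0
    set fabric a-b        # primary on fabric A, fail over to B
    commit-buffer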

I am going to guess that this can be done with other vendors’ blade systems, but with additional equipment.  What I mean is that with other blade systems, the unit of measure is the chassis.  You can probably configure the internal switches to pass traffic from one blade to another without having to go to another set of switches.  But if you need a blade in chassis A to talk to a blade in chassis B, you will probably need to involve an additional switch, or two, mounted either Top-of-Rack or End-of-Row.  In the case of UCS, the unit of measure is the fabric.  Any blade can communicate with any other blade, provided they are in the same VLAN and assuming EHV mode.  Switch mode may offer more features, but I am not versed in it.

I hope this post answers your questions.  I am still thinking over the possibilities that UCS isolation can bring to the table.  BTW, I made up the term “UCS isolation”.  If anyone has an official term, or one that better describes the situation, please let me know.

Categories: cisco, UCS