OK, so I wasn’t really on vacation. But it sure felt like it at times. By some quirk of fate, I was able to attend both Cisco Live and VMworld this summer. And I had a blast at both of them.
I was at Cisco Live solely to work in VCE’s booth. For four days, I spent open to close talking to folks as they wandered the expo near the VCE booth. I got to meet existing customers, potential customers, co-workers, past co-workers from other places where I’ve worked, and more.
Since I work on the Product Management team, I tried to get people to tell me their stories. I wanted to know what their daily IT life was like, how their infrastructure was working for them, and what their plans were for the near and long term. I heard some doozies with regard to plans, but I am not sure they are appropriate for a technology blog ;)
Anyway….I heard a lot of recurring themes: need to do more with less, need better management tools, need to learn about cloud, need to learn how to operate (or operate better) in a virtual world. Excuse me? The last one threw me a bit, but after a little more digging I found that some folks thought virtualization would solve their operational issues.
Folks, you’ve read this before on numerous other blogs but I am going to repeat it: if you have bad operational practices in the physical world and you don’t change them when you enter the virtual world, then you still have bad operational practices. Fix your bad practices before virtualizing. It will save you a lot of heartache and finger pointing. /soapbox off/
What I heard a lot of was, “Please help me”. There is just so much change going on in our industry now that it can be quite daunting to know what to do and where to go. Do I go cloud? Do I not go cloud? What is cloud? Can I have my own infrastructure? Can I just get my feet wet? All good questions, and all have the same answer: It depends. It’s usually at this point that I would bring in one of our vArchitects to help me. I can answer most of the questions, but when someone asks me how many switches they will need, or how much capacity needs to be reserved for sparing, it’s best to leave it to the more knowledgeable folks.
My highlight of Cisco Live was when a customer came to the VCE booth with a friend and then proceeded to try to sell his friend a Vblock. It got so far as whiteboarding, drawing designs, and then some. A few vArchitects were listening in to clarify statements when needed, but pretty much just left them alone. The customer was doing an amazing job and was so enthusiastic about his Vblock he just had to get his friend to buy one (or at least into the concept).
It’s one thing for an employee to sell and be enthusiastic about products; it’s something else when a customer does it.
VMworld was a different story. I got to go as a mighty ATTENDEE (cue angels singing). I spent most of my time either in sessions or on the expo floor checking out all the other products. There is a lot of interesting work going on out there. I was surprised a few companies were still around from last year given that VMware entered their niche with some of the new features in vSphere 5.0. But after talking to them, the surprise went away. Some of these niche products do one thing, but they do it very well compared to VMware’s implementation and that keeps the customers coming to them.
As for sessions, I focused on vCloud Director and storage. I hit about 10 sessions covering the two topics. A lot for me to learn there. I was decently versed in the storage side of vSphere, but wanted a primer on the new storage features of vSphere 5.1. When it came to vCloud Director, I was fairly ignorant. I’m still ignorant on this topic, just less so. It’s definitely an area I want to learn more about. Time to cozy up with a book or two….
While at VMworld, I decided to run an experiment and wear my official VCE logoed shirt during the sessions. I wanted to see if people would stop me to ask questions. You know what? They did. In almost all the sessions I attended, at least one person came up to me with questions about VCE and Vblocks. There was one session where I had four unrelated people stop me to answer questions.
So what did I come away with? 2 Kindle Fires, an Apple TV, and the VMworld plague. Been sick almost a week now. Awful stuff.
What else did I come away with? Some knowledge of vCD, some new friends, and a change in perspective on how VCE and Vblocks are viewed. Good times indeed.
A few days ago, both Cisco Press and VMware Press announced a few opportunities to win a few goodies via Facebook. Details below:
VMware Press Launches Sweepstakes!
VMware Press, the official publisher of VMware books and training materials, has launched a 60 day Facebook sweepstakes beginning May 1 and running through June 30th. Prize offerings include a $100 Amazon gift card and three VMware Press books of the winner’s choice; nine second prize winners will win an eBook of their choice.
Cisco Press Offers Free Trip to Cisco Live!
The official publisher of Cisco launched the annual Cisco Press Facebook sweepstakes today, offering a free trip to Cisco Live 2013 including travel ($1,000 American Express gift card) and registration and a choice of three Cisco Press print or eBooks! Nine second prize winners will also win three print or eBooks of their choice for a total of 10 winners in all. The Cisco Press Sweepstakes begins May 1 and runs through June 30th: http://ow.ly/aBv08.
People go to VMworld for many reasons. Some go because it’s their job to “man the booth”. Others go to party. And still others go “just because”. However, the most common reason why people go to VMworld is to learn about VMware’s products and ecosystem. If I were still in the position of IT Architect, that would have been my primary reason too. This year is different. I changed jobs at the beginning of 2011 and went from an IT position that held responsibility for the care and feeding of the virtual infrastructure platform to a Product Management position. As such, my VMworld focus has changed from learning about VMware products to learning about VMware’s customers.
One of the basic tenets of Product Management/Development is to build products that customers want/need to buy. So how does one go about finding out what customers want and/or need? Simple. Ask them. I’ll be roaming the Solutions Exchange talking to attendees about their jobs, roadmaps, challenges, and desires (within the context of the datacenter). I want to gather as much information as I can to help me excel in my new”ish” position. I want to collect contact info so that I can reach out to folks later and see how things change as time passes. I want to know if your efforts are successful or not. Basically, I want to “know” and “learn” about you.
So if you happen to see me, introduce yourself. Tell me about your company, your datacenter challenges, and more. Help me develop a better product.
If you can’t find me, send me a tweet (@ITVirtuality) and let’s schedule a time to meet.
Back in April, I posted about how our primary storage vendor disavowed us after one of our arrays failed by saying, “It’s not our problem”. (You can read about it here). Well, this same vendor had to do some “routine” maintenance on one of our arrays that was so “routine” that the vendor claimed it would not have any impact on our servers. The vendor technician came onsite to do the work and reaffirmed that it should have no visible impact. This routine maintenance was just a reboot of one controller, a wait for it to come back online, and then a reboot of the other. Over 50 servers went down and it took us three hours to recover.
While I could go on and rant about the vendor, I really want to focus on something I noticed about the outage. Almost all of our physical Windows servers tolerated the outage and suffered no major problems, but our ESX hosts are another story altogether.
All of our ESX hosts that were attached to the array in question basically “froze”. It was really weird. Virtual Center said all the virtual servers were up and running, but we couldn’t do anything with them. Rebooted VC, no change. I logged into the service consoles of the hosts to run various iterations of vmware-cmd to manipulate the virtual servers, but nothing worked. I figured the only thing I could do at this point was to reboot the hosts. Since my hosts are attached to multiple arrays, I tried to vMotion the known good virtual servers to a single host so I wouldn’t have to take them down. No go. Basically, I had lost all control of my hosts.
OK, time for a reboot. Did that and I lost all access to my LUNs. A quick looksie into UCSM showed all my connections were up. So did Fabric Manager. I could see during the reboots that ESX was complaining about not being able to read a label for a volume or two. Reviewing various host log files showed a number of weird entries that I have no idea how to interpret. Many were obviously disk related, others weren’t.
After multiple reboots, HBA rescans (initiated via VC and service console), and such, we still couldn’t see the LUNs. Keep in mind, we were three hours into a major outage. That is the point where I have to get real creative in coming up with solutions. I am not going to say that these solutions are ideal, but they will get us up and running. In this case, I was thinking of repurposing our dev ESX hosts for our production environment. All it would take would be to add them to the appropriate cluster, present LUNs, manually register any really messed up virtual servers, and power up the virtual servers.
Before I presented this idea to management, I don’t know what or why, but something triggered a memory of my first ESX host failure. Way back in the ESX 2.x days, I had a problem where a patch took out access to my LUNs. The fix was to run the command ‘esxcfg-boot -b’. Ran it, problem fixed.
I know that the esxcfg-boot command rejiggers inits and such, but I really don’t know why it fixed the problem. Did something happen to my HBA drivers/config?
What really bothers me about this is that almost all of my Windows servers and clusters came back online by themselves. If they can do it, why can’t VMware program a bit more resiliency into ESX? I hate to say this, but incidents like this make me question my choice of hypervisor. Since the latest version of Hyper-V relies on Windows Failover Clustering, would it have responded like my existing clusters and tolerated the outage appropriately? Anyone know?
I’ve always wondered how good a job I am doing with my virtualization project. Yes, I know that I have saved my organization a few hundred thousand dollars by NOT having to purchase over 100 new servers. But could I do better? Am I sizing my hosts and guests correctly? To answer that question, I downloaded an evaluation copy of VMware’s CapacityIQ and have been running it for a bit over a week now.
My overall impression is that CapacityIQ needs some work. Visually, the product is fine. The product is also easy to use. I’m just a bit dubious of the results though.
Before I get into the details, here are some details about my virtual environment.
- Hypervisor is vSphere 4.0 build 261974.
- CapacityIQ version is CIQ-ovf-220.127.116.111-276824
- Hosts are Cisco B250-M2 blades with 96GB RAM, dual Xeon X5670 CPU, and Palo
So what results do I see after one week’s run? All my virtual servers are oversized. It’s not that I don’t believe it; it’s just that I don’t believe it.
I read, and then re-read the documentation and noticed that using a 24hr time setting was not considered a best practice since all the evening idle time would be factored into the sizing calculations. So I adjusted the time calculations to be based on a 6am – 6pm Mon-Thurs schedule, which are our core business hours. All other settings were left at the defaults.
The first thing I noticed is that by doing this, I miss all peak usage events that occur at night for those individual servers that happen to be busy at night. The “time” setting is a global setting, so I can’t set it on a per-VM basis. Minus 1 point for this limitation.
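To make the limitation concrete, here is a rough Python sketch of what a single global time window does to a server that is only busy at night. This is not CapacityIQ's code or API; the function names, the 1am batch job, and the utilisation numbers are all made up for illustration. Only the 6am – 6pm Mon-Thurs window comes from my actual settings.

```python
from datetime import datetime, timedelta

def in_business_window(ts, start_hour=6, end_hour=18, weekdays=(0, 1, 2, 3)):
    """True if ts falls inside the global analysis window (Mon-Thu, 6am-6pm)."""
    return ts.weekday() in weekdays and start_hour <= ts.hour < end_hour

def make_samples():
    """One week of hourly CPU samples (percent) for a hypothetical server
    that idles during the day and runs a heavy batch job at 1am every night."""
    start = datetime(2010, 9, 6)  # a Monday
    samples = []
    for h in range(7 * 24):
        ts = start + timedelta(hours=h)
        cpu = 90.0 if ts.hour == 1 else 10.0  # invented utilisation figures
        samples.append((ts, cpu))
    return samples

samples = make_samples()

# Apply the one global window to every sample, as a global setting would.
windowed = [cpu for ts, cpu in samples if in_business_window(ts)]

peak_all = max(cpu for _, cpu in samples)
peak_windowed = max(windowed)

print(f"samples kept: {len(windowed)} of {len(samples)}")
print(f"peak over the full week: {peak_all}%")
print(f"peak inside the window:  {peak_windowed}%")
```

With these invented numbers the window keeps 48 of the 168 hourly samples, and the 90% nightly spike never makes it into the analysis. A per-VM window would let the batch server be judged on its own busy hours.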
The second item I noticed between reading the documentation, a few whitepapers, and posts on the VMware Communities forums is that CapacityIQ does not take peak usage into account (I’ll come back to this later). The basic formula for sizing calculations is fairly simple. No calculus used here.
The third thing I noticed is that the tool isn’t application aware. It’s telling me that my Exchange mailbox cluster servers are way over provisioned when I am pretty sure this isn’t the case. We sized our Exchange mailbox cluster servers by running multiple stress tests and fiddling with various configuration values to get to something that was stable. If I lower any of the settings (RAM and/or vCPU), I see failover events, customers can’t access email, and other chaos ensues. CapacityIQ is telling me that I can get by with 1 vCPU and 4GB of RAM for a server hosting a bit over 4500 mailboxes. That’s a fair-sized reduction from my current setting of 4 vCPU and 20GB of RAM.
It’s not that CapacityIQ is completely wrong in regards to my Exchange servers. It’s just that the app occasionally wants all that memory and CPU and if it doesn’t get it and has to swap, the nastiness begins. This is where application awareness comes in handy.
Let’s get back to peak usage. What is the overarching, ultimate litmus test of proper VM sizing? In my book, the correct answer is “happy customers”. If my customers are complaining, then something is not right. Right or wrong, the biggest success factor for any virtualization initiative is customer satisfaction. The metric used to determine customer satisfaction may change from organization to organization. For some it may be dollars saved. For my org, it’s a combination of dollars saved and customer experience.
Based on the whole customer experience imperative, I cannot noticeably degrade performance or I’ll end up with business units buying discrete servers again. If peak usage is not taken into account, then it’s fairly obvious that CapacityIQ will recommend smaller than acceptable virtual server configurations. It’s one thing to take an extra 5 seconds to run a report, quite another to add over an hour or two, yet based on what I am seeing, that is exactly what CapacityIQ is telling me to do.
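Here is a small Python sketch of why ignoring peaks pushes recommendations down. I don't know CapacityIQ's actual formula, so this is not it: the `recommend_vcpus` function, the 20% headroom factor, and the demand samples are all assumptions of mine. The only real number is the 2930 MHz per core, which is the nominal clock of the X5670.

```python
import math

def recommend_vcpus(cpu_mhz_samples, mhz_per_vcpu, stat="mean", headroom=1.2):
    """Recommend a vCPU count from CPU demand samples (in MHz).

    stat="mean" sizes to average demand; stat="peak" sizes to the maximum.
    headroom is a hypothetical safety multiplier, not a CapacityIQ setting.
    """
    if stat == "mean":
        demand = sum(cpu_mhz_samples) / len(cpu_mhz_samples)
    elif stat == "peak":
        demand = max(cpu_mhz_samples)
    else:
        raise ValueError(f"unknown stat: {stat!r}")
    # Round up to whole vCPUs and never recommend fewer than one.
    return max(1, math.ceil(demand * headroom / mhz_per_vcpu))

# Invented day of hourly samples: mostly quiet, with two heavy bursts.
samples = [500] * 22 + [9000] * 2

by_mean = recommend_vcpus(samples, mhz_per_vcpu=2930, stat="mean")
by_peak = recommend_vcpus(samples, mhz_per_vcpu=2930, stat="peak")

print(f"sized to average demand: {by_mean} vCPU")
print(f"sized to peak demand:    {by_peak} vCPU")
```

The two bursts barely move the average, so the mean-based answer stays at the minimum while the peak-based answer is several times larger. Whether the peak-based figure is the right one depends on how much the customer cares about those two hours, which is exactly the judgment call a tool can't make on its own.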
I realize that this is a new area for VMware so time will be needed for the product to mature. In the meantime, I plan on taking a look at Hyper9. I hear the sizing algorithms it uses are a bit more sophisticated so I may get more realistic results.
Anyone else have experience with CapacityIQ? Let me know. Am I off in what I am seeing? I’ll tweak some of the threshold variables to see what effects they have on the results I am seeing. Maybe the defaults are just impractical.