OK, so I wasn’t really on vacation. But it sure felt like it at times. By some quirk of fate, I was able to attend both Cisco Live and VMworld this summer. And I had a blast at both of them.
I was at Cisco Live solely to work in VCE’s booth. For four days, I spent open to close talking to folks as they wandered the expo near the VCE booth. I got to meet existing customers, potential customers, co-workers, past co-workers from other places where I’ve worked, and more.
Since I work on the Product Management team, I tried to get people to tell me their stories. I wanted to know what their daily IT life was like, how their infrastructure was working for them, and what their plans were for the near and long term. I heard some doozies in regards to plans, but I am not sure they are appropriate for a technology blog ;)
Anyway….I heard a lot of recurring themes: need to do more with less, need better management tools, need to learn about cloud, need to learn how to operate (or operate better) in a virtual world. Excuse me? The last one threw me a bit, but after a little more digging I found that some folks thought virtualization would solve their operational issues.
Folks, you’ve read this before on numerous other blogs but I am going to repeat it: if you have bad operational practices in the physical world and you don’t change them when you enter the virtual world, then you still have bad operational practices. Fix your bad practices before virtualizing. It will save you a lot of heartache and finger pointing. /soapbox off/
What I heard a lot of was, “Please help me”. There is just so much change going on in our industry now that it can be quite daunting to know what to do and where to go. Do I go cloud? Do I not go cloud? What is cloud? Can I have my own infrastructure? Can I just get my feet wet? All good questions, and all of them have the same answer: It depends. It’s usually at this point I would bring in one of our vArchitects to help me. I can answer most of the questions, but when someone asks me how many switches they will need, or how much capacity needs to be reserved for sparing, it’s best to leave it to the more knowledgeable folks.
My highlight of Cisco Live was when a customer came to the VCE booth with a friend and then proceeded to try to sell his friend a Vblock. It got so far as whiteboarding, drawing designs, and then some. A few vArchitects were listening in to clarify statements when needed, but pretty much just left them alone. The customer was doing an amazing job and was so enthusiastic about his Vblock he just had to get his friend to buy one (or at least into the concept).
It’s one thing for an employee to sell and be enthusiastic about products; it’s something else when a customer does it.
VMworld was a different story. I got to go as a mighty ATTENDEE (cue angels singing). I spent most of my time either in sessions or on the expo floor checking out all the other products. There is a lot of interesting work going on out there. I was surprised a few companies were still around from last year given that VMware entered their niche with some of the new features in vSphere 5.0. But after talking to them, the surprise went away. Some of these niche products do one thing, but they do it very well compared to VMware’s implementation and that keeps the customers coming to them.
As for sessions, I focused on vCloud Director and storage. I hit about 10 sessions covering the two topics. A lot for me to learn there. I was decently versed in the storage side of vSphere, but wanted a primer on the new storage features of vSphere 5.1. When it came to vCloud Director, I was fairly ignorant. I’m still ignorant on this topic, just less so. It’s definitely an area I want to learn more about. Time to cozy up with a book or two….
While at VMworld, I decided to run an experiment and wear my official VCE logoed shirt during the sessions. I wanted to see if people would stop me to ask questions. You know what? They did. In almost all the sessions I attended, at least one person came up to me with questions about VCE and Vblocks. There was one session where I had four people (non-related) stop me to answer questions.
So what did I come away with? 2 Kindle Fires, an Apple TV, and the VMworld plague. Been sick almost a week now. Awful stuff.
What else did I come away with? Some knowledge of vCD, some new friends, and a change in perspective on how VCE and Vblocks are viewed. Good times indeed.
A few days ago, both Cisco Press and VMware Press announced a few opportunities to win a few goodies via Facebook. Details below:
VMware Press Launches Sweepstakes!
VMware Press, the official publisher of VMware books and training materials, has launched a 60-day Facebook sweepstakes beginning May 1 and running through June 30th. Prize offerings include a $100 Amazon gift card and three VMware Press books of the winner’s choice; nine second prize winners will win an eBook of their choice.
Cisco Press Offers Free Trip to Cisco Live!
The official publisher of Cisco launched the annual Cisco Press Facebook sweepstakes today, offering a free trip to Cisco Live 2013 including travel ($1,000 American Express gift card) and registration and a choice of three Cisco Press print or eBooks! Nine second prize winners will also win three print or eBooks of their choice for a total of 10 winners in all. The Cisco Press Sweepstakes begins May 1 and runs through June 30th: http://ow.ly/aBv08.
People go to VMworld for many reasons. Some go because it’s their job to “man the booth”. Others go to party. And still others go “just because”. However, the most common reason why people go to VMworld is to learn about VMware products and its ecosystem. If I were still in the position of IT Architect, that would have been my primary reason too. This year is different. I changed jobs at the beginning of 2011 and went from an IT position that held responsibility for the care and feeding of the virtual infrastructure platform to a Product Management position. As such, my VMworld focus has changed from learning about VMware products to learning about VMware’s customers.
One of the basic tenets of Product Management/Development is to build products that customers want/need to buy. So how does one go about finding out what customers want and/or need? Simple. Ask them. I’ll be roaming the Solutions Exchange talking to attendees about their jobs, roadmaps, challenges, and desires (within the context of the datacenter). I want to gather as much information as I can to help me excel in my new”ish” position. I want to collect contact info so that I can reach out to folks later and see how things change as time passes. I want to know if your efforts are successful or not. Basically, I want to “know” and “learn” about you.
So if you happen to see me, introduce yourself. Tell me about your company, your datacenter challenges, and more. Help me develop a better product.
If you can’t find me, send me a tweet - @ITVirtuality - and let’s schedule a time to meet.
Back in April, I posted about how our primary storage vendor disavowed us after one of our arrays failed by saying, “It’s not our problem”. (You can read about it here). Well, this same vendor had to do some “routine” maintenance on one of our arrays that was so “routine” that the vendor claimed it would not have any impact on our servers. The vendor technician came onsite to do the work and reaffirmed that it should have no visible impact. This routine maintenance was just a reboot of one controller, a wait for it to come back online, and then a reboot of the other. Over 50 servers went down and it took us three hours to recover.
While I could go on and rant about the vendor, I really want to focus on something I noticed about the outage. Almost all of our physical Windows servers tolerated the outage and suffered no major problems, but our ESX hosts are another story altogether.
All of our ESX hosts that were attached to the array in question basically “froze”. It was really weird. Virtual Center said all the virtual servers were up and running, but we couldn’t do anything with them. Rebooted VC, no change. I logged into the service consoles of the hosts to run various iterations of vmware-cmd to manipulate the virtual servers, but nothing worked. I figured the only thing I could do at this point was to reboot the hosts. Since my hosts are attached to multiple arrays, I tried to vMotion the known good virtual servers to a single host so I wouldn’t have to take them down. No go. Basically, I had lost all control of my hosts.
OK, time for a reboot. Did that and I lost all access to my LUNs. A quick looksie into UCSM showed all my connections were up. So did Fabric Manager. I could see during the reboots that ESX was complaining about not being able to read a label for a volume or two. Reviewing various host log files showed a number of weird entries that I have no idea how to interpret. Many were obviously disk related, others weren’t.
After multiple reboots, HBA rescans (initiated via VC and service console), and such, we still couldn’t see the LUNs. Keep in mind, we were three hours into a major outage. That is the point where I have to get really creative in coming up with solutions. I am not going to say that these solutions are ideal, but they will get us up and running. In this case, I was thinking of repurposing our dev ESX hosts for our production environment. All it would take would be to add them to the appropriate cluster, present LUNs, manually register any really messed up virtual servers, and power up the virtual servers.
Before I presented this idea to management, I don’t know what or why, but something triggered a memory of my first ESX host failure. Way back in the ESX 2.x days, I had a problem where a patch took out access to my LUNs. The fix was to run the command ‘esxcfg-boot -b’. Ran it, problem fixed.
I know that the esxcfg-boot command rejiggers inits and such, but I really don’t know why it fixed the problem. Did something happen to my HBA drivers/config?
What really bothers me about this is that almost all of my Windows servers and clusters came back online by themselves. If they can do it, why can’t VMware program a bit more resiliency into ESX? I hate to say this, but incidents like this make me question my choice of hypervisor. Since the latest version of Hyper-V relies on Windows Failover Clustering, would it have responded like my existing clusters and tolerated the outage appropriately? Anyone know?
I’ve always wondered how good of a job I am doing with my virtualization project. Yes, I know that I have saved my organization a few hundred thousand dollars by NOT having to purchase over 100 new servers. But could I do better? Am I sizing my hosts and guests correctly? To answer that question, I downloaded an evaluation copy of VMware’s CapacityIQ and have been running it for a bit over a week now.
My overall impression is that CapacityIQ needs some work. Visually, the product is fine. The product is also easy to use. I’m just a bit dubious of the results though.
Before I get into the details, here are some details about my virtual environment.
- Hypervisor is vSphere 4.0 build 261974.
- CapacityIQ version is CIQ-ovf-184.108.40.2061-276824
- Hosts are Cisco B250-M2 blades with 96GB RAM, dual Xeon X5670 CPU, and Palo
So what results do I see after one week’s run? All my virtual servers are oversized. It’s not that I don’t believe it; it’s just that I don’t believe it.
I read, and then re-read the documentation and noticed that using a 24hr time setting was not considered a best practice since all the evening idle time would be factored into the sizing calculations. So I adjusted the time calculations to be based on a 6am – 6pm Mon-Thurs schedule, which are our core business hours. All other settings were left at the defaults.
The first thing I noticed is that by doing this, I miss all peak usage events that occur at night for those individual servers that happen to be busy at night. The “time” setting is a global setting, so I can’t set it on a per-vm basis. Minus 1 point for this limitation.
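To make the limitation concrete, here is a minimal sketch of what a global 6am - 6pm Mon-Thurs window does to a server that is busy at night. The samples are made up, and `in_business_hours` and `peak` are hypothetical helpers for illustration, not anything CapacityIQ exposes.

```python
from datetime import datetime

def in_business_hours(ts, start_hour=6, end_hour=18, weekdays=(0, 1, 2, 3)):
    """True if ts falls inside the 6am-6pm Mon-Thurs window (Monday is 0)."""
    return ts.weekday() in weekdays and start_hour <= ts.hour < end_hour

def peak(samples, only_business_hours):
    """Highest CPU percentage seen, optionally limited to the global window."""
    values = [pct for ts, pct in samples
              if not only_business_hours or in_business_hours(ts)]
    return max(values)

# Hypothetical hourly CPU samples (timestamp, percent busy) for one server
# that runs a nightly batch job. All values are invented for illustration.
samples = [
    (datetime(2010, 8, 2, 10), 15.0),  # Monday 10am
    (datetime(2010, 8, 2, 14), 20.0),  # Monday 2pm
    (datetime(2010, 8, 3, 2), 95.0),   # Tuesday 2am, nightly job
    (datetime(2010, 8, 3, 11), 18.0),  # Tuesday 11am
]

print(peak(samples, only_business_hours=True))   # 20.0
print(peak(samples, only_business_hours=False))  # 95.0
```

With the window applied, the nightly spike is simply never seen, which is why a per-vm setting would matter for servers like this one.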
The second item I noticed between reading the documentation, a few whitepapers, and posts on the VMware Communities forums is that CapacityIQ does not take peak usage into account (I’ll come back to this later). The basic formula for sizing calculations is fairly simple. No calculus used here.
The third thing I noticed is that the tool isn’t application aware. It’s telling me that my Exchange mailbox cluster servers are way over provisioned when I am pretty sure this isn’t the case. We sized our Exchange mailbox cluster servers by running multiple stress tests and fiddling with various configuration values to get to something that was stable. If I lower any of the settings (RAM and/or vCPU), I see failover events, customers can’t access email, and other chaos ensues. CapacityIQ is telling me that I can get by with 1 vCPU and 4GB of RAM for a server hosting a bit over 4500 mailboxes. That’s a fair-sized reduction from my current setting of 4 vCPU and 20GB of RAM.
It’s not that CapacityIQ is completely wrong in regards to my Exchange servers. It’s just that the app occasionally wants all that memory and CPU and if it doesn’t get it and has to swap, the nastiness begins. This is where application awareness comes in handy.
Let’s get back to peak usage. What is the overarching, ultimate litmus test of proper vm sizing? In my book, the correct answer is “happy customers”. If my customers are complaining, then something is not right. Right or wrong, the biggest success factor for any virtualization initiative is customer satisfaction. The metric used to determine customer satisfaction may change from organization to organization. For some it may be dollars saved. For my org, it’s a combination of dollars saved and customer experience.
Based on the whole customer experience imperative, I cannot noticeably degrade performance or I’ll end up with business units buying discrete servers again. If peak usage is not taken into account, then it’s fairly obvious that CapacityIQ will recommend smaller than acceptable virtual server configurations. It’s one thing to take an extra 5 seconds to run a report, quite another to add over an hour or two, yet based on what I am seeing, that is exactly what CapacityIQ is telling me to do.
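A rough sketch of why ignoring peaks leads to undersized recommendations is below. This is not CapacityIQ’s actual formula. It only contrasts sizing on average demand with sizing on peak demand, using made-up samples; `recommend_vcpus` is a hypothetical helper, and the 2930 MHz per vCPU figure is an assumption based on a 2.93 GHz core.

```python
import math

def recommend_vcpus(cpu_mhz_samples, mhz_per_vcpu, use_peak):
    """Return a vCPU count sized on either average or peak CPU demand."""
    if use_peak:
        demand = max(cpu_mhz_samples)
    else:
        demand = sum(cpu_mhz_samples) / len(cpu_mhz_samples)
    # Always recommend at least one vCPU, rounding demand up.
    return max(1, math.ceil(demand / mhz_per_vcpu))

# Hypothetical day of hourly samples: mostly idle, with two busy hours.
samples = [500] * 22 + [10000] * 2

print(recommend_vcpus(samples, mhz_per_vcpu=2930, use_peak=False))  # 1
print(recommend_vcpus(samples, mhz_per_vcpu=2930, use_peak=True))   # 4
```

The average-based number looks like a great saving right up until those two busy hours arrive.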
I realize that this is a new area for VMware so time will be needed for the product to mature. In the meantime, I plan on taking a look at Hyper9. I hear the sizing algorithms it uses are a bit more sophisticated so I may get more realistic results.
Anyone else have experience with CapacityIQ? Let me know. Am I off in what I am seeing? I’ll tweak some of the threshold variables to see what effects they have on the results I am seeing. Maybe the defaults are just impractical.
We’ve migrated most of our virtual servers over to UCS and vSphere. I’d say we are about 85% done, with this phase being completed by Aug 29. It’s not that it’s taking 10+ days to actually do the rest of the migrations. It’s more of a scheduling issue. From my perspective, I have three more downtimes to go. Not much at all.
The whole process of migrating from ESX to vSphere and updating all the virtual servers has been interesting to say the least. We haven’t encountered any major problems; just some small items related to the VMtools/VMhardware version (4 to 7) upgrades. For example, our basic VMTools upgrade process is to right-click on a guest in the VIC and click on the appropriate items to perform an automatic upgrade. When it works, the guest installs VMTools, reboots, and comes back up without admin intervention. For some reason, this would not work for our MS Terminal Servers unless we were logged into the target terminal server.
Here’s another example, this time involving Windows Server 2008: The automatic upgrade process wouldn’t work either. Instead, we had to login and launch VMTools from the System Tray and select upgrade. The only operating system that went perfectly was Windows Server 2003 with no fancy extras (terminal services, etc). Luckily, that’s the o/s most of our virtual workloads are running. I am going to hazard a guess and say that some of these oddities are related to our various security settings, GPOs, and the like.
All-in-all, the vm migration has gone very smoothly. I must say that I am happy with the quality of the VMware hypervisor, Virtual Center, and other basic components. There has been plenty of opportunity for something to go extremely wrong, but so far, nada. (knock on wood)
So what’s next? We are preparing to migrate our SQL servers onto bare metal blades. In reality, we are building new servers from scratch and installing SQL Server. The implementation of UCS has given us the opportunity to update our SQL servers to Windows Server 2008 and SQL Server 2008. Other planned moves include some Oracle app servers (on RedHat) as well as domain controllers, file share clusters, and maybe some tape backup servers. This should take us into September.
Once we finish with the blades, we’ll start deploying the Cisco C-series rackmount servers. We still have a number of instances where we have to go rackmount. Servers in this category typically need multiple NICs, telephony boards, or other specialized expansion boards.
It’s amazing how many misconfigured, or perceived misconfigured, items can show up when doing maintenance and/or upgrades. In the past three weeks, we have found at least four production items that fit this description that no one noticed because things appeared to be working. Here’s a sampling:
During our migration from our legacy vm host hardware to UCS, we broke a website that was hardware load-balanced across two different servers. Traffic should have been directed to Server A, then Server B, then Server C. After the migration traffic was only going to Server C, which just hosts a page that says the site is down. It’s a “maintenance” server, meaning that whenever we take a public facing page down, the traffic gets directed to Server C so that people can see a nice screen that says, “Sorry down for maintenance …..”
Everything looked right in the load balancer configuration. While delving deeper, we noticed that server A was configured to be the primary node for a few other websites. An application analyst whose app was affected chimed in and said that the configuration was incorrect. Website 1 traffic was to go first to Server A, then B. Website 2 traffic was supposed to go in the opposite order. All our application documentation agreed with the analyst. Of course, he wrote the documentation so it better agree with him :) Here is the disconnect: we track all our changes in a Change Management system and no one ever put the desired configuration change into the system. As far as our network team is concerned, the load balancer is configured properly. Now this isn’t really a folly since our production system/network matched what our change management and CMDB systems were telling us. This is actually GOODNESS. If we ever had to recover due to a disaster, we would reference our CMDB and change management systems so they had better be in agreement.
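For anyone who hasn’t worked with priority-ordered load balancing, here is a minimal sketch of the idea. The pool names and the `pick_server` helper are hypothetical, not our load balancer’s actual configuration syntax. The orderings are the ones the analyst described, with the maintenance server C last.

```python
def pick_server(priority_order, healthy):
    """Return the first server in priority order that passes health checks."""
    for server in priority_order:
        if server in healthy:
            return server
    return None  # nothing available at all

# Desired ordering per the application documentation (illustrative names).
pools = {
    "website1": ["A", "B", "C"],
    "website2": ["B", "A", "C"],
}

# Normal day: everything is healthy, each site lands on its own primary.
print(pick_server(pools["website1"], healthy={"A", "B", "C"}))  # A
print(pick_server(pools["website2"], healthy={"A", "B", "C"}))  # B

# If the balancer thinks A and B are both down, everyone gets the
# "Sorry, down for maintenance" page from C.
print(pick_server(pools["website1"], healthy={"C"}))  # C
```

The ordering is the whole configuration, so an undocumented change to it is invisible until a failover or migration exercises it.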
Here’s another example: We did a mail server upgrade about six months ago and everything worked as far as we could tell. What we didn’t know was that some things were not working but no one noticed because mail was getting through. When we did notice something not correct (a remote monitoring system) and fixed the cause, it led us to another item, and so on and so on. Now, not everything was broken at the same time. In a few cases, the fix of one item actually broke something else. What’s funny is that if we didn’t correct the monitoring issue, everything would have still worked. It was a fix that caused all the other problems. In other words, one misconfiguration proved to be a correct configuration for other misconfigured items. In this case, multiple wrongs did make a right. Go Figure.
My manager has a saying for this: “If you are going to miss, miss by enough”.
I’ve also noticed that I sometimes don’t understand concepts when I think I do. As part of our migration to UCS, we are also upgrading from ESX3.5 to vSphere. Since I am new to vSphere, I did pretty much what every SysAdmin does: click all the buttons/links. One of those buttons is the “Advanced Runtime Info” link that is part of the VMware HA portion of the main Virtual Center screen.
This link brings up info on slot sizes and usage. You would think that the numbers would add up, but clearly they don’t.
How does 268 -12 = 122? I’m either obviously math challenged or I really need to go back and re-read the concept of Slots.
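One possible explanation, sketched below: HA admission control holds back enough slots to cover the number of host failures the cluster is set to tolerate, so the available count is total minus reserved minus used. The sketch assumes a cluster of two identical hosts tolerating one host failure, which is only a guess that happens to fit these numbers. `available_slots` is a hypothetical helper, and real HA reserves based on the largest hosts rather than an even split.

```python
def available_slots(total_slots, used_slots, hosts, host_failures_tolerated):
    """Slots left for powering on VMs after failover capacity is set aside.

    Assumes every host contributes the same number of slots (a simplification).
    """
    slots_per_host = total_slots // hosts
    reserved_for_failover = slots_per_host * host_failures_tolerated
    return total_slots - reserved_for_failover - used_slots

# Assumed cluster shape: 2 equal hosts, 1 host failure tolerated.
print(available_slots(268, 12, hosts=2, host_failures_tolerated=1))  # 122
```

Under those assumptions, 134 slots are reserved for failover, and 268 - 134 - 12 = 122.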