I started this blog to write about the activities in the life of an IT admin. Other than posting about an odd event or two, I’ve mostly posted about our migration to Cisco UCS and vSphere. So without further ado, let’s take a look at another “Day in the Life” of an IT admin.
This past weekend was supposed to be fairly easy. All I had to do was migrate our FTP server from a rackmount running W2K3 to a virtual server running W2K8. The virtual server was built ahead of time, data was copied over to it, FTP sites were set up, etc. It looked like everything was good to go. I should have known better.
The first problem I ran into was name resolution. I wanted to keep the old server name as an alias so all my app folks wouldn’t have to change their scripts and applications. The W2K8 server would not respond to the alias. Turns out that W2K8 handles aliases differently. In W2K3, if you want to have the server respond to multiple names, you create a registry key called DisableStrictNameChecking and set it to a DWORD value of 1. I added the key, but the server wouldn’t respond as I expected. It seems that W2K8 sort of ignores that key. Instead, you must use the setspn command to register the additional name(s). OK, name resolution taken care of, or so I thought. Turns out that the spn is for TCP/IP type queries, meaning DNS is used for name resolution. What happens if your app uses WINS for lookups because it is making standard SMB/NetBIOS type calls? For that there is another registry key called OptionalNames. This key is of type Multi-String and it contains the aliases that you want the server to register in WINS.
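For reference, here is a rough sketch of the three changes described above, written as a small Python helper that only builds the command lines (it does not run anything). The registry path under LanmanServer\Parameters, the HOST/ service class, and the names FTPNEW, ftpold, and example.local are my assumptions for illustration, not values from that night.

```python
# Sketch only: builds the command lines, it does not run them.
# Assumptions (not from the post): the LanmanServer\Parameters registry path,
# the HOST/ service class, and the placeholder names used in the example.

LANMAN_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"


def alias_commands(computer, aliases, dns_suffix):
    """Return commands that let `computer` answer to each name in `aliases`."""
    # reg.exe separates Multi-String entries with a literal backslash-zero.
    optional_names = "\\0".join(aliases)
    cmds = [
        f'reg add "{LANMAN_KEY}" /v DisableStrictNameChecking /t REG_DWORD /d 1 /f',
        f'reg add "{LANMAN_KEY}" /v OptionalNames /t REG_MULTI_SZ /d "{optional_names}" /f',
    ]
    for alias in aliases:
        # One spn for the short name and one for the fully qualified name.
        cmds.append(f"setspn -A HOST/{alias} {computer}")
        cmds.append(f"setspn -A HOST/{alias}.{dns_suffix} {computer}")
    return cmds


for line in alias_commands("FTPNEW", ["ftpold"], "example.local"):
    print(line)
```

Generating the commands instead of executing them keeps the sketch safe to run anywhere and gives you something to paste into a change record.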
With all the name resolution issues taken care of, I must be good to go, right? Not so fast. My apps folks are complaining that all their scripts are failing on transferring data to the FTP server. I looked at a few of the scripts and saw that they were transferring files using the “put” command and wildcards. In W2K3, this worked fine. W2K8 doesn’t like it. It wants to be more standards-based: “put” is for single files, “mput” is for multiple files. Great…how much recoding are my apps folks going to have to do?
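One way to limit the recoding is to stop relying on server-side wildcard handling altogether and expand the wildcard on the client, emitting one “put” per file. The helper below is a minimal sketch of that idea, not what my apps folks actually run; the function name and the example pattern are made up.

```python
import glob
import os


def ftp_put_lines(directory, pattern):
    """Expand a local wildcard into explicit single-file `put` commands."""
    matches = sorted(glob.glob(os.path.join(directory, pattern)))
    names = [os.path.basename(p) for p in matches if os.path.isfile(p)]
    # Change the local directory first so each put can use a bare file name.
    return [f'lcd "{directory}"'] + [f'put "{name}"' for name in names]


# Example: print the lines for every .csv file in the current directory.
for line in ftp_put_lines(".", "*.csv"):
    print(line)
```

The resulting lines can be dropped into the command file that a scripted session feeds to the FTP client (for example via `ftp -s:script.txt` on Windows), after the usual open/user/password lines.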
I had one person change his script to see if it would solve all his problems. It got him past most items, but then we ran into the dreaded “505-Access Denied” error. The permissions between the two servers match up so W2K8 must need something slightly different.
My window for this conversion was from 10pm to 1am. As I approached 12:40am, I made the call to roll back. I had a simple plan: power off the virtual server, power on the rackmount. I thought it was pretty good, but I was wrong. I came back in after getting some sleep and went to power up the virtual server in the morning to work on it. That pesky spn and OptionalNames reg key still played into things. It was like I had two servers with the same name on the network, but no errors/alarms being thrown. The new server took over the WINS entries for the old server. Once I figured out that I had to remove the spn and OptionalNames reg key, all went back to normal.
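Lesson learned: a rollback plan needs to undo the alias registration too, not just swap which box is powered on. Here is a small sketch of the cleanup commands as I understand them. The registry path, the HOST/ service class, and the names are assumptions for illustration, and I have not verified whether a restart of the Server service is also needed.

```python
# Sketch only: builds the cleanup command lines, it does not run them.
# Assumptions: registry path, HOST/ service class, and all names shown.

LANMAN_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"


def alias_rollback_commands(computer, spn_names):
    """Return commands that remove the alias registration from `computer`."""
    # Drop the Multi-String value so the server stops claiming the alias in WINS.
    cmds = [f'reg delete "{LANMAN_KEY}" /v OptionalNames /f']
    # Remove each spn that was registered for the alias.
    cmds += [f"setspn -D HOST/{name} {computer}" for name in spn_names]
    return cmds


for line in alias_rollback_commands("FTPNEW", ["ftpold", "ftpold.example.local"]):
    print(line)
```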
Now I can work on those permissions and try again some other Sunday night.
DISCLAIMER: All technical info was gathered in the heat of the moment Sunday night. I may be wrong on the effects of the spn on name resolution methods (i.e., DNS vs WINS), but it seems to make sense to me. So unless someone has a better reason, I’m sticking with mine.
I’ve always wondered how good a job I am doing with my virtualization project. Yes, I know that I have saved my organization a few hundred thousand dollars by NOT having to purchase over 100 new servers. But could I do better? Am I sizing my hosts and guests correctly? To answer that question, I downloaded an evaluation copy of VMware’s CapacityIQ and have been running it for a bit over a week now.
My overall impression is that CapacityIQ needs some work. Visually, the product is fine. The product is also easy to use. I’m just a bit dubious of the results though.
Before I get into the details, here are some details about my virtual environment.
- Hypervisor is vSphere 4.0 build 261974.
- CapacityIQ version is CIQ-ovf-184.108.40.2061-276824
- Hosts are Cisco B250-M2 blades with 96GB RAM, dual Xeon X5670 CPUs, and a Palo adapter
So what results do I see after one week’s run? All my virtual servers are oversized. It’s not that I don’t believe it; it’s just that I don’t believe it.
I read, and then re-read the documentation and noticed that using a 24hr time setting was not considered a best practice since all the evening idle time would be factored into the sizing calculations. So I adjusted the time calculations to be based on a 6am – 6pm Mon-Thurs schedule, which are our core business hours. All other settings were left at the defaults.
The first thing I noticed is that by doing this, I miss all peak usage events that occur at night for those individual servers that happen to be busy at night. The “time” setting is a global setting so I can’t set it on a per-VM basis. Minus 1 point for this limitation.
The second item I noticed between reading the documentation, a few whitepapers, and posts on the VMware Communities forums is that CapacityIQ does not take peak usage into account (I’ll come back to this later). The basic formula for sizing calculations is fairly simple. No calculus used here.
The third thing I noticed is that the tool isn’t application aware. It’s telling me that my Exchange mailbox cluster servers are way over provisioned when I am pretty sure this isn’t the case. We sized our Exchange mailbox cluster servers by running multiple stress tests and fiddling with various configuration values to get to something that was stable. If I lower any of the settings (RAM and/or vCPU), I see failover events, customers can’t access email, and other chaos ensues. CapacityIQ is telling me that I can get by with 1 vCPU and 4GB of RAM for a server hosting a bit over 4500 mailboxes. That’s a fair-sized reduction from my current setting of 4 vCPU and 20GB of RAM.
It’s not that CapacityIQ is completely wrong in regards to my Exchange servers. It’s just that the app occasionally wants all that memory and CPU and if it doesn’t get it and has to swap, the nastiness begins. This is where application awareness comes in handy.
Let’s get back to peak usage. What is the overarching, ultimate litmus test of proper vm sizing? In my book, the correct answer is “happy customers”. If my customers are complaining, then something is not right. Right or wrong, the biggest success factor for any virtualization initiative is customer satisfaction. The metric used to determine customer satisfaction may change from organization to organization. For some it may be dollars saved. For my org, it’s a combination of dollars saved and customer experience.
Based on the whole customer experience imperative, I cannot noticeably degrade performance or I’ll end up with business units buying discrete servers again. If peak usage is not taken into account, then it’s fairly obvious that CapacityIQ will recommend smaller than acceptable virtual server configurations. It’s one thing to take an extra 5 seconds to run a report, quite another to add over an hour or two, yet based on what I am seeing, that is exactly what CapacityIQ is telling me to do.
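To show why I’m nervous about average-based sizing, here is a toy calculation. To be clear, this is NOT CapacityIQ’s formula; it is my own simplified sketch, and the sample data is invented. It compares a recommendation driven by average demand inside the analysis window with one driven by the peak inside that same window.

```python
import math
from statistics import mean


def in_window(weekday, hour, start=6, end=18, days=(0, 1, 2, 3)):
    """True if a sample falls inside the Mon-Thurs, 6am-6pm analysis window."""
    return weekday in days and start <= hour < end


def recommend_vcpus(samples, allocated, use_peak):
    """samples: (weekday, hour, cpu percent of current allocation) tuples."""
    loads = [pct for wd, hr, pct in samples if in_window(wd, hr)]
    demand = max(loads) if use_peak else mean(loads)
    return max(1, math.ceil(allocated * demand / 100))


def sample_week():
    """Invented data: mostly idle, one daytime burst, one nightly batch job."""
    out = []
    for wd in range(7):
        for hr in range(24):
            pct = 10
            if (wd, hr) == (0, 14):
                pct = 95    # Monday 2pm burst
            if hr == 1:
                pct = 100   # 1am job, outside the window either way
            out.append((wd, hr, pct))
    return out


week = sample_week()
print("average-based:", recommend_vcpus(week, allocated=4, use_peak=False))
print("peak-based:   ", recommend_vcpus(week, allocated=4, use_peak=True))
```

With this made-up week, the average-based number collapses to a single vCPU while the peak-based number stays at four, and the nightly job never shows up in either because the window excludes it.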
I realize that this is a new area for VMware so time will be needed for the product to mature. In the meantime, I plan on taking a look at Hyper9. I hear the sizing algorithms it uses are a bit more sophisticated so I may get more realistic results.
Anyone else have experience with CapacityIQ? Let me know. Am I off in what I am seeing? I’ll tweak some of the threshold variables to see what effects they have on the results I am seeing. Maybe the defaults are just impractical.
Part of our UCS implementation involved the use of Cisco Advanced Services (AS) to help with the initial configuration and testing. Due to our integration issues, time ran out and we never completed some items related to our implementation plan. AS was back out this week for a few days in order to complete their portion of the plan. Due to timing, we worked with a different AS engineer this time. He performed a health-check of our UCS environment and suggested a vSphere configuration change to help improve performance.
Before I get into what we changed, let me give a quick background on our vSphere configuration. We are using the B250-M2 blade with a single Palo adapter. We are not taking advantage of the advanced vNIC capabilities of the Palo adapter. What I mean by that is that we are not assigning a vNIC to each guest and using dVswitches. Instead, we are presenting two vNICs for the Service Console, two vNICs for the VMkernel, and two vNICs for virtual machines and using them as we would if we were on a standard rackmount server. Each vswitch is configured with one vNIC from fabric A, one vNIC from fabric B, and teamed together in an active/active configuration.
Recommended Change: Instead of active/active teaming, set the service console and VMkernel ports to active/standby. When doing this, ensure that the active NICs are all on the same fabric interconnect. This will keep service console/VMkernel traffic from having to hit our northbound switches and keep the traffic isolated to a single fabric interconnect.
Here is where it gets interesting.
Once this was done, possibilities came to mind and I asked the $64,000 question: “Is there a way to keep everything in UCS up and running properly in the event we lose all our northbound links?” It was more of a theoretical question, but we spent the next 6hrs working on it anyway. :)
Disclaimer: not all of what you are about to read is fully tested. This was a theoretical exercise that we didn’t finish testing due to time constraints. We did test this with two hosts on the same subnet and it worked as theorized.
Here’s what we came up with:
First of all, when UCS loses its northbound links it can behave in two ways. Via the Network Control Policy – see screen shot below – the ports can be marked either “link-down” or “warning”. When northbound ports are marked “link-down”, the various vNICs presented to the blades go down. This will kick in fabric failover as well if enabled at the vNIC level. If you are not using the Fabric Failover feature on a particular vNIC, you can achieve the same functionality by running the NIC Teaming drivers at the operating system level. We are using NIC Teaming at the vswitch level in vSphere and Fabric Failover for bare metal operating systems.
Setting the Network Control Policy to “warning” keeps the ports alive as far as the blades are concerned and no failovers take place. The beauty of this policy is that it can be applied on a per vNIC basis so you can cherry pick which vNIC is affected by which policy (Link-down or warning). Using a combination of the Network Control Policy settings and vswitch configurations, it’s possible to keep workloads on UCS up and running, with all servers (virtual or otherwise) communicating without having any external connectivity. This could be used to prevent massive outages, boot storms due to outages, etc. In our case, since the bulk of our data center will be on UCS, it basically prevents me from having to restart my datacenter in event of a massive network switch outage.
Here is a table detailing our vSphere switch configuration:
|Port Group||Service Console NIC1||Service Console NIC2||VMkernel NIC1||VMkernel NIC2||Virtual Machine NIC1||Virtual Machine NIC2|
|Network Control Policy (in UCS)||Link-Down||Warning||Link-Down||Warning||Link-Down||Warning|
|Network Failover Detection (at vSwitch level)||Link Status Only||Link Status Only||Link Status Only||Link Status Only||Link Status Only||Link Status Only|
As far as bare metal blades go:
|Teaming Config||Active||Active or Standby (depends on app)|
|Network Control Policy (in UCS)||Link-Down||Warning|
Digression: This looks like we are heavily loading up Fabric A, which is true from an overall placement point of view. However, most of our workloads are in vm, which is configured for active/active, thus providing some semblance of load balancing. We could go active/active for bare metal blades since the operative feature for them is the Network Control Policy. With vSphere, we are trying to keep the Service Console and VMkernel vNICs operating on the same fabric interconnects in order to reduce northbound traffic. Not so with bare metal systems.
Back on track: As previously stated (before tables), what all this does in effect is to force all my blade traffic onto a single fabric interconnect in case I lose ALL my northbound links. Since the ports on fabric B are not marked “link-down”, the blades do not see any network issues and continue communicating normally.
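To sanity-check the logic on paper, here is a small model of the behavior as I understand it. It is a sketch, not a test of real hardware. I am assuming NIC1 of each pair sits on fabric A and NIC2 on fabric B (that is how I read the tables), and the vNIC names are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vnic:
    name: str
    fabric: str   # "A" or "B"
    policy: str   # UCS Network Control Policy: "link-down" or "warning"


def link_up(vnic, uplinks_up):
    """A vNIC drops only if its fabric lost its uplinks AND its policy is link-down."""
    return uplinks_up[vnic.fabric] or vnic.policy == "warning"


def usable_vnics(vswitch, uplinks_up):
    """With 'Link Status Only' detection the vSwitch keeps any NIC whose link is up."""
    return [v.name for v in vswitch if link_up(v, uplinks_up)]


# Hypothetical layout mirroring the tables: NIC1 = link-down, NIC2 = warning.
vswitches = {
    "Service Console": [Vnic("sc-nic1", "A", "link-down"), Vnic("sc-nic2", "B", "warning")],
    "VMkernel": [Vnic("vmk-nic1", "A", "link-down"), Vnic("vmk-nic2", "B", "warning")],
    "Virtual Machine": [Vnic("vm-nic1", "A", "link-down"), Vnic("vm-nic2", "B", "warning")],
}

all_uplinks_lost = {"A": False, "B": False}
for name, nics in vswitches.items():
    print(name, "->", usable_vnics(nics, all_uplinks_lost))
```

With every northbound link gone, each vSwitch is left with only its NIC2, so all blade traffic ends up on the one fabric interconnect whose vNICs stayed up.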
And now the “BUT”: But this won’t work completely in my environment because I am connected to two disjointed L2 networks. See Brad Hedlund’s blog and The Unified Computing blog for more details. In order for this to completely work, I will need to put in a software router of some sort to span the two different networks (VLANs in this case).
So what do you think? Anyone out there with a lab that can fully test this? If so, I would be interested in seeing your results.
We did it, and we did it early. We completed the move of our existing VMware infrastructure onto the Cisco UCS platform. At the same time, we also moved from ESX 3.5 to vSphere. All-in-all, everything is pretty much working. The only outstanding issue we haven’t resolved yet involves Microsoft NLB and our Exchange CAS/HUB/OWA servers. NLB just doesn’t want to play nice and we don’t know if the issue is related more to vSphere, UCS, or something else entirely different.
Next up: SQL Server clusters, P2Vs, and other bare metal workloads.
SQL Server migrations have already started and are going well. We have a few more clusters to build and that should be that for SQL.
P2Vs present a small challenge. A minor annoyance that we will have to live with is an issue with VMware Converter. Specifically, we’ve run into a problem with resizing disks during the P2V process. The process fails about 2% into the conversion with an “Unknown Error”. It seems a number of people have also run into this problem and the workaround provided by VMware in KB1004588 (and others) is to P2V as-is and then run the guest through Converter again to resize the disks. This is going to cause us some scheduling headaches, but we’ll get through it. Without knowing the cause, I can’t narrow it down to being vSphere or UCS related. All I can say is that it does not happen when I P2V to my ESX 3.5 hosts. Alas, they are HP servers.
We’ve gone all-in with Cisco and purchased a number of the C-Series servers, recently deploying a few C-210 M2 servers to get our feet wet. Interesting design choices to say the least. I will say that they are not bad, but they are not great either. My gold standard is the HP DL380 server line and as compared to the DL380, the C-210 needs a bit more work. For starters, the default drive controller is SATA, not SAS. I’m sorry, but I have a hard time feeling comfortable with SATA drives deployed in servers. SAS drives typically come with a 3yr warranty; SATA drives typically have a 1yr warranty. For some drive manufacturers, this stems from the fact that their SAS drives are designed for 24/7/365 use, but their SATA drives are not.
Hot Plug fans? Nope. These guys are hard-wired, and big. Overall length of the server is a bit of a stretch too, literally. We use the extended width/depth HP server cabinets and these servers just fit. I think the length issue stems from the size of the fans (they are big and deep) and some dead space in the case. The cable arm also sticks out a bit more than I expected. With a few design modifications, the C-210 M2 could shrink three, maybe four inches in length.
I’ll post some updates as we get more experience with the C-Series.
We’ve migrated most of our virtual servers over to UCS and vSphere. I’d say we are about 85% done, with this phase being completed by Aug 29. It’s not that it’s taking 10+ days to actually do the rest of the migrations. It’s more of a scheduling issue. From my perspective, I have three more downtimes to go. Not much at all.
The whole process of migrating from ESX to vSphere and updating all the virtual servers has been interesting to say the least. We haven’t encountered any major problems; just some small items related to the VMtools/VMhardware version (4 to 7) upgrades. For example, our basic VMTools upgrade process is to right-click on a guest in the VIC and click on the appropriate items to perform an automatic upgrade. When it works, the guest installs VMTools, reboots, and comes back up without admin intervention. For some reason, this would not work for our MS Terminal Servers unless we were logged into the target terminal server.
Here’s another example, this time involving Windows Server 2008: The automatic upgrade process wouldn’t work either. Instead, we had to log in and launch VMTools from the System Tray and select upgrade. The only operating system that went perfectly was Windows Server 2003 with no fancy extras (terminal services, etc). Luckily, that’s the o/s most of our virtual workloads are running. I am going to hazard a guess and say that some of these oddities are related to our various security settings, GPOs, and the like.
All-in-all, the vm migration has gone very smoothly. I must say that I am happy with the quality of the VMware hypervisor, Virtual Center, and other basic components. There has been plenty of opportunity for something to go extremely wrong, but so far, nada. (knock on wood)
So what’s next? We are preparing to migrate our SQL servers onto bare metal blades. In reality, we are building new servers from scratch and installing SQL server. The implementation of UCS has given us the opportunity to update our SQL servers to Windows Server 2008 and SQL Server 2008. Other planned moves include some Oracle app servers (on RedHat) as well as domain controllers, file share clusters, and maybe some tape backup servers. This should take us into September.
Once we finish with the blades, we’ll start deploying the Cisco C-series rackmount servers. We still have a number of instances where we have to go rackmount. Servers in this category typically need multiple NICs, telephony boards, or other specialized expansion boards.
It’s been a few weeks since I last posted an update on our Cisco UCS implementation. We’ve mostly been in a holding pattern until now. Yes, we finally got the network integration component figured out. Unfortunately, we had to dedicate some additional L2 switches to accommodate our desired end-goal. If you look back a few posts, I covered the issues with connecting UCS to two disjointed L2 networks. We followed the recommended workaround and it seems to be working. It took us a bit to get here since my shop did not use VLANs, which turn out to be part of the workaround.
So now we have been in a test mode for a bit over a week with no additional problems found. Now it’s time for real workloads. We migrated a few development systems over Wednesday to test out our migration process. Up until then, it was a paper exercise. It worked, but required more time than we thought for VMtools and VM hardware version upgrades. The real fun starts today when we migrate a few production workloads. If all goes well, I’ll be very busy over the next 45 days as we move all our VMware and a number of bare metal installs to UCS.
Since we chose to migrate by moving one LUN at a time from the old hosts to the new hosts, and also upgrade to vSphere, our basic VM migration process goes like this:
1. Power off guests that are to be migrated. These guests should be on the same LUN.
2. Present the LUN to the new VM hosts and do an HBA rescan on the new hosts.
3. In Virtual Center, click on a guest to be migrated. Click on the migrate link and select Host. The migration should take seconds.
4. Repeat for all other guests on this LUN.
5. Unpresent the LUN from the old hosts.
6. Power up guests.
7. Upgrade VM tools (now that we are on vSphere hosts) and reboot.
8. Power the guests down.
9. Upgrade VM hardware.
10. Power up the guests and let them Plug-n-Play the new hardware and reboot when needed.
We chose to do steps 6 through 10 using no more than four guests at a time. It’s easier to keep track of things this way and the process seems to be working so far.
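If you want to keep track of the batches on something other than a notepad, the bookkeeping for steps 6 through 10 is easy to sketch. This only produces a checklist; it does not talk to Virtual Center, and the guest names are placeholders.

```python
# The post-move steps (6 through 10) that we run per batch of guests.
POST_MOVE_STEPS = [
    "power up",
    "upgrade VM tools and reboot",
    "power down",
    "upgrade VM hardware",
    "power up and let Plug-n-Play finish",
]


def batches(items, size=4):
    """Yield consecutive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upgrade_checklist(guests, size=4):
    """Return (step, guests-in-batch) pairs, finishing one batch before the next."""
    plan = []
    for group in batches(guests, size):
        for step in POST_MOVE_STEPS:
            plan.append((step, tuple(group)))
    return plan


guests = [f"guest{n:02d}" for n in range(1, 11)]  # placeholder names
for step, group in upgrade_checklist(guests):
    print(f"{step}: {', '.join(group)}")
```

Finishing all five steps for one batch before touching the next keeps the number of half-upgraded guests to four at most.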
We are lucky to be on ESX 3.5. If we started out on ESX4, the LUN migration method would require extra steps due to the process of LUN removal from the old hosts. To properly remove a LUN from ESX4, you will need to follow a number of convoluted steps as noted in this VMware KB. With ESX3.5, you can just unpresent and do an HBA rescan.
I don’t know the technical reason for all these extra steps to remove a LUN in vSphere, but it sure seems like a step backwards from a customer perspective. Maybe VMware will change it in the next version.
Progress has been made!!
The first few days of the week involved a number of calls back to TAC, the UCS business unit, and various other Cisco resources without much progress. Then on Thursday I pressed the magic button and all of a sudden our fabric interconnects came alive in Fabric Manager (MDS control software). What did I do? I turned on SNMP. No one noticed that it was turned off (default state). Pretty sad actually given the number of people involved in troubleshooting this.
This paragraph subject to change based on confirmation of accuracy from Cisco. So here’s the basic gist of what was going on. We are running an older version of MDS firmware and the version of Fabric Manager that comes with this firmware is not really “UCS aware”. It needs a method of communicating with the fabric interconnects to fully see all the WWNs. The workaround is to use SNMP. I created an SNMP user in UCS and our storage admin created the same username/password in Fabric Manager. Of course having the accounts created does nothing if the protocol they need to use is not active. Duh.
The screenshot below shows the button I am talking about. The reason no one noticed that SNMP was turned off was because I was able to add traps and users without any warnings about SNMP not being active. Also, take a look at the HTTP and HTTPS services listed above SNMP. They are enabled by default. Easy to miss.
With storage now presented, we were able to complete some basic testing. I must say that UCS is pretty resilient if you have cabled all your equipment wisely. We pulled power plugs, fibre to Ethernet, fibre to storage, etc and only a few times did we lose a ping (singular PING!). All our data transfers kept transferring, pings kept pinging, RDP sessions stayed RDP’ing.
We did learn something interesting in regards to the Palo card and VMware. If you are using the basic Menlo card (standard CNA), then failover works as expected. Palo is different. Suffice it to say that for every vNIC you think you need, add another one. In other words, you will need two vNICs per vSwitch. When creating vNICs, be sure to balance them across both fabrics and note down the MAC addresses. Then when you are creating your vSwitches (or DVS) in VMware, apply two vNICs to each switch using one from fabric A and one from fabric B. This provides the failover capabilities. I can’t provide all the details because I don’t know them, but it was explained to me by one of the UCS developers that this is a difference in UCS hardware (Menlo vs Palo).
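The “two vNICs per vSwitch, one per fabric” rule is simple enough to write down as a planning helper. This is just a sketch for laying out the design on paper; the naming scheme is made up and nothing here talks to UCS Manager or vCenter.

```python
def plan_vnics(vswitch_names):
    """Give every vSwitch one vNIC on fabric A and one on fabric B."""
    plan = {}
    for name in vswitch_names:
        slug = name.lower().replace(" ", "-")
        plan[name] = [(f"{slug}-a", "A"), (f"{slug}-b", "B")]
    return plan


def missing_fabrics(plan):
    """Return vSwitches that do not have a vNIC on both fabrics."""
    problems = {}
    for name, vnics in plan.items():
        missing = {"A", "B"} - {fabric for _, fabric in vnics}
        if missing:
            problems[name] = sorted(missing)
    return problems


plan = plan_vnics(["Service Console", "VMkernel", "Virtual Machine"])
for vswitch, vnics in plan.items():
    print(vswitch, vnics)
print("problems:", missing_fabrics(plan))
```

The check function is the useful half: feed it the layout you actually built (with the MAC addresses you noted down mapped to fabrics) and it flags any vSwitch that would have no path left if one fabric went away.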
Next up: testing, testing, and more testing with some VLANing thrown in to help us connect up to two disjointed L2 networks.