Archive

Posts Tagged ‘vmware’

Book Review: Administering VMware Site Recovery Manager 5.0 by Mike Laverick

February 23, 2012

I’ve decided that I am going to review at least one book per quarter.  Four book reviews per year is not much when you consider that I read about four books (on various subjects) per month.

So for my first book review of 2012, I am going to start with Administering VMware Site Recovery Manager 5.0 by Mike Laverick.  Many of you already know Mike (the guy is everywhere) or are at least familiar with his blog (RTFM-ed).

Let me start out by saying that I did not like this book after my first reading.  I felt that it was missing something, something that I couldn’t quite put my finger on.  Then it hit me: concepts are not necessarily discussed as concepts.  Yes, there are one- or two-page discussions of concepts, but most often they are treated as working knowledge.  This should not have been a surprise to me because Mike clearly states (multiple times) that he expects his readers to have read the various manuals for detailed concept and background info on vSphere, Site Recovery Manager (SRM), your storage array, etc.  He can’t teach everything needed to get SRM working, so you have to do some work on your own.  In other words, RTFM.  Once I came to grips with this, I re-evaluated the book in a new light and have decided that I like it.

As for the book itself, it has an interesting layout.  You get a little bit of history concerning vSphere’s DR and HA features and what SRM is, and is not.  Then comes a little detour into setting up a number of different storage arrays from Dell, EMC, HP, and NetApp.  This detour does serve a purpose in that it sets a baseline storage configuration for installing and configuring SRM, albeit the simplest configuration possible.  It’s actually a smart move on his part because he is able to show how he set up his lab.  It also prompts the reader to go check various items in order to ensure a successful install of SRM.

Then you get to the good stuff: installing, configuring, and using SRM. There are plenty of screenshots and step-by-step instructions for doing a lot of the configuration tasks.  In fact, you could think of this book along the lines of a cookbook.   Follow along and you should end up with a usable (in a lab) install of SRM.

One thing is clear after reading this book: Mike knows SRM.  Peppered throughout the chapters are the various problems and errors he encountered, as well as what he did to fix them.  In a few cases, he does a mea culpa for not following his own advice to RTFM.  If he had, a few problems would have been avoided.

Mike also hits home on a few simple truths.  For those involved with Active Directory in the early days, there was a truth that went something like this: “The question is irrelevant, the answer is DNS”.  In the case of SRM, substitute “Network and storage configuration” for “DNS”.  So many problems that may be encountered are the result of a network or storage configuration issue.  vSwitches need to be setup correctly, hosts need to see storage, vCenter needs to see hosts, etc.

I especially liked the bits of wisdom that he shares in regards to doing what I call “rookie maneuvers” (others call them stupid mistakes).  For example, once you have SRM up and running, it’s too easy to hit the button without realizing what it all really entails.  Mike warns you about this many times and prompts you to think about your actions ahead of time.

The later chapters of the book introduce customizations, scripting, more complex configurations, and how to reverse a failover.  There is a lot going on here, and it’s worth re-reading a few times.  A surprising amount of this info can be applied to basic disaster recovery principles regardless of whether or not SRM is in the picture.

Lastly, Mike walks you through upgrading from vSphere 4.1 to vSphere 5 and from SRM 4.1 to SRM 5.  Upgrading vSphere may sound a bit odd, but not when you take into account that it’s required in order to upgrade SRM.

All-in-all, this book is a worthy read and should be in your library if your shop uses (or plans to use) SRM.

Does ESX lack storage resiliency?

October 27, 2010

Back in April, I posted about how our primary storage vendor disavowed us after one of our arrays failed by saying, “It’s not our problem”.  (You can read about it here.)  Well, this same vendor had to do some “routine” maintenance on one of our arrays that was so “routine” that the vendor claimed it would not have any impact on our servers.  The vendor technician came onsite to do the work and reaffirmed that it should have no visible impact.  This routine maintenance was just a reboot of one controller, a wait for it to come back online, and then a reboot of the other.  Over 50 servers went down and it took us three hours to recover.

While I could go on and rant about the vendor, I really want to focus on something I noticed about the outage.  Almost all of our physical Windows servers tolerated the outage and suffered no major problems, but our ESX hosts are another story altogether.

All of our ESX hosts that were attached to the array in question basically “froze”.  It was really weird.  Virtual Center said all the virtual servers were up and running, but we couldn’t do anything with them.  Rebooted VC, no change.    I logged into the service consoles of the hosts to run various iterations of vmware-cmd to manipulate the virtual servers, but nothing worked.   I figured the only thing I could do at this point was to reboot the hosts.  Since my hosts are attached to multiple arrays, I tried to vMotion the known good virtual servers to a single host so I wouldn’t have to take them down.  No go.  Basically, I had lost all control of my hosts.
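For reference, the sort of thing I was trying from the service console looked roughly like this (the datastore and guest names are made up for illustration):

    # list the VMs registered on this host
    vmware-cmd -l
    # check, then try to manipulate, the power state of a stuck guest
    vmware-cmd /vmfs/volumes/datastore1/guest01/guest01.vmx getstate
    vmware-cmd /vmfs/volumes/datastore1/guest01/guest01.vmx stop trysoft

None of it accomplished anything.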

OK, time for a reboot.  Did that and I lost all access to my LUNs.  A quick looksie into UCSM showed all my connections were up.  So did Fabric Manager.  I could see during the reboots that ESX was complaining about not being able to read a label for a volume or two.  Reviewing various host log files showed a number of weird entries that I had no idea how to interpret.  Many were obviously disk related; others weren’t.

After multiple reboots, HBA rescans (initiated via VC and the service console), and such, we still couldn’t see the LUNs.  Keep in mind, we were three hours into a major outage.  That is the point where I have to get really creative in coming up with solutions.  I am not going to say that these solutions are ideal, but they will get us up and running.  In this case, I was thinking of repurposing our dev ESX hosts for our production environment.  All it would take would be to add them to the appropriate cluster, present the LUNs, manually register any really messed-up virtual servers, and power up the virtual servers.

Before I presented this idea to management, something (I don’t know what or why) triggered a memory of my first ESX host failure.  Way back in the ESX 2.x days, I had a problem where a patch took out access to my LUNs.  The fix was to run the command ‘esxcfg-boot -b’.  Ran it, problem fixed.

I know that the esxcfg-boot command rejiggers inits and such, but I really don’t know why it fixed the problem.  Did something happen to my HBA drivers/config?
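For the record, the recovery on each host boiled down to something like this (a sketch from memory; your HBA names will vary):

    # one more HBA rescan from the service console
    esxcfg-rescan vmhba1
    esxcfg-rescan vmhba2
    # rebuild the boot configuration (initrd and friends), then restart the host
    esxcfg-boot -b
    reboot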

What really bothers me about this is that almost all of my Windows servers and clusters came back online by themselves.  If they can do it, why can’t VMware program a bit more resiliency into ESX?  I hate to say this, but incidents like this make me question my choice of hypervisor.  Since the latest version of Hyper-V relies on Windows Failover Clustering, would it have responded like my existing clusters and tolerated the outage appropriately?  Anyone know?

First Impressions of VMware CapacityIQ

October 25, 2010

I’ve always wondered how good a job I am doing with my virtualization project.  Yes, I know that I have saved my organization a few hundred thousand dollars by NOT having to purchase over 100 new servers.  But could I do better?  Am I sizing my hosts and guests correctly?  To answer that question, I downloaded an evaluation copy of VMware’s CapacityIQ and have been running it for a bit over a week now.

My overall impression is that CapacityIQ needs some work.  Visually, the product is fine.  The product is also easy to use.  I’m just a bit dubious of the results though.

Before I get into the details, here are some specifics about my virtual environment.

  • Hypervisor is vSphere 4.0 build 261974.
  • CapacityIQ version is CIQ-ovf-1.0.4.1091-276824.
  • Hosts are Cisco B250-M2 blades with 96GB RAM, dual Xeon X5670 CPUs, and Palo adapters.


So what results do I see after one week’s run?  All my virtual servers are oversized.   It’s not that I don’t believe it; it’s just that I don’t believe it.

I read, and then re-read the documentation and noticed that using a 24hr time setting was not considered a best practice since all the evening idle time would be factored into the sizing calculations.  So I adjusted the time calculations to be based on a 6am – 6pm Mon-Thurs schedule, which are our core business hours.  All other settings were left at the defaults.

The first thing I noticed is that by doing this, I miss all the peak usage events that occur at night for those individual servers that happen to be busy at night.  The “time” setting is global, so it can’t be set on a per-VM basis.  Minus 1 point for this limitation.

The second item I noticed, from reading the documentation, a few whitepapers, and posts on the VMware Communities forums, is that CapacityIQ does not take peak usage into account (I’ll come back to this later).  The basic formula for the sizing calculations is fairly simple.  No calculus used here.
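To illustrate the difference that makes, here is a toy, average-based sizing calculation next to a peak-aware one.  This is my own back-of-the-napkin illustration of the general idea, not CapacityIQ’s actual formula, and the numbers are invented:

    # invented numbers for a single VM, in MHz
    avg_mhz=1400        # average CPU demand over the analysis window
    peak_mhz=8200       # peak CPU demand during the same window
    core_mhz=2930       # rough per-core capacity of an X5670

    # size to the average (roughly what a purely average-driven tool would suggest)
    echo $(( (avg_mhz + core_mhz - 1) / core_mhz ))    # -> 1 vCPU
    # size to the peak
    echo $(( (peak_mhz + core_mhz - 1) / core_mhz ))   # -> 3 vCPUs

Same VM, very different recommendations depending on which number you trust.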

The third thing I noticed is that the tool isn’t application aware.  It’s telling me that my Exchange mailbox cluster servers are way over provisioned when I am pretty sure this isn’t the case.  We sized our Exchange mailbox cluster servers by running multiple stress tests and fiddling with various configuration values to get to something that was stable.  If I lower any of the settings (RAM and/or vCPU), I see failover events, customers can’t access email, and other chaos ensues.   CapacityIQ is telling me that I can get by with 1 vCPU and 4GB of RAM for a server hosting a bit over 4500 mailboxes.  That’s a fair-sized reduction from my current setting of 4 vCPU and 20GB of RAM.

It’s not that CapacityIQ is completely wrong in regards to my Exchange servers.  It’s just that the app occasionally wants all that memory and CPU and if it doesn’t get it and has to swap, the nastiness begins.  This is where application awareness  comes in handy.

Let’s get back to peak usage.  What is the overarching, ultimate litmus test of proper VM sizing?  In my book, the correct answer is “happy customers”.  If my customers are complaining, then something is not right.  Right or wrong, the biggest success factor for any virtualization initiative is customer satisfaction.  The metric used to determine customer satisfaction may change from organization to organization.  For some it may be dollars saved.  For my org, it’s a combination of dollars saved and customer experience.

Based on the whole customer experience imperative, I cannot noticeably degrade performance or I’ll end up with business units buying discrete servers again.  If peak usage is not taken into account, then it’s fairly obvious that CapacityIQ will recommend smaller-than-acceptable virtual server configurations.  It’s one thing to take an extra 5 seconds to run a report, quite another to add an hour or two; yet based on what I am seeing, that is exactly what CapacityIQ is telling me to do.

I realize that this is a new area for VMware so time will be needed for the product to mature.  In the meantime, I plan on taking a look at Hyper9.  I hear the sizing algorithms it uses are a bit more sophisticated so I may get more realistic results.

Anyone else have experience with CapacityIQ?  Let me know.  Am I off in what I am seeing?  I’ll tweak some of the threshold variables to see what effects they have on the results I am seeing.  Maybe the defaults are just impractical.

Can UCS Survive a Network Outage?

September 29, 2010

Part of our UCS implementation involved the use of Cisco Advanced Services (AS) to help with the initial configuration and testing.  Due to our integration issues, time ran out and we never completed some items related to our implementation plan.  AS was back out this week for a few days in order to complete their portion of the plan.  Because of timing, we worked with a different AS engineer this time.  He performed a health check of our UCS environment and suggested a vSphere configuration change to help improve performance.

Before I get into what we changed, let me give a quick background on our vSphere configuration.  We are using the B250-M2 blade with a single Palo adapter.  We are not taking advantage of the advanced vNIC capabilities of the Palo adapter.  What I mean by that is that we are not assigning a vNIC to each guest and using distributed vSwitches.  Instead, we are presenting two vNICs for the Service Console, two vNICs for the VMkernel, and two vNICs for virtual machines, and using them as we would on a standard rackmount server.  Each vSwitch is configured with one vNIC from fabric A and one vNIC from fabric B, teamed together in an active/active configuration.

Recommended Change: Instead of active/active teaming, set the service console and VMkernel ports to active/standby.  When doing this, ensure that the active NICs are all on the same fabric interconnect.  This will keep service console/VMkernel traffic from having to hit our northbound switches and keep the traffic isolated to a single fabric interconnect.
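For context, the vSwitch and uplink layout described above can be stood up from the service console roughly as follows (a sketch with made-up vmnic numbers; the actual active/standby failover order for the Service Console and VMkernel port groups is then set per port group in the vSphere Client):

    # Service Console vSwitch: one vNIC from fabric A, one from fabric B
    esxcfg-vswitch -a vSwitch0
    esxcfg-vswitch -L vmnic0 vSwitch0        # fabric A vNIC (active)
    esxcfg-vswitch -L vmnic1 vSwitch0        # fabric B vNIC (standby after the change)
    esxcfg-vswitch -A "Service Console" vSwitch0
    # VMkernel vSwitch follows the same pattern
    esxcfg-vswitch -a vSwitch1
    esxcfg-vswitch -L vmnic2 vSwitch1
    esxcfg-vswitch -L vmnic3 vSwitch1
    esxcfg-vswitch -A "VMkernel" vSwitch1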


Here is where it gets interesting.

Once this was done, possibilities came to mind and I asked the $64,000 question: “Is there a way to keep everything in UCS up and running properly in the event we lose all our northbound links?”  It was more of a theoretical question, but we spent the next 6 hours working on it anyway.  🙂

Disclaimer: not all of what you are about to read is fully tested.  This was a theoretical exercise that we didn’t finish testing due to time constraints.  We did test this with two hosts on the same subnet and it worked as theorized.

Here’s what we came up with:

First of all, when UCS loses its northbound links, it can behave in two ways.  Via the Network Control Policy (see screenshot below), the ports can be marked either “link-down” or “warning”.  When northbound ports are marked “link-down”, the various vNICs presented to the blades go down.  This will kick in fabric failover as well, if it is enabled at the vNIC level.  If you are not using the Fabric Failover feature on a particular vNIC, you can achieve the same functionality by running NIC teaming drivers at the operating system level.  We are using NIC teaming at the vSwitch level in vSphere and Fabric Failover for bare metal operating systems.

Setting the Network Control Policy to “warning” keeps the ports alive as far as the blades are concerned, and no failovers take place.  The beauty of this policy is that it can be applied on a per-vNIC basis, so you can cherry-pick which vNIC is affected by which policy (link-down or warning).  Using a combination of Network Control Policy settings and vSwitch configurations, it’s possible to keep workloads on UCS up and running, with all servers (virtual or otherwise) communicating, without having any external connectivity.  This could be used to prevent massive outages, boot storms due to outages, etc.  In our case, since the bulk of our data center will be on UCS, it basically prevents me from having to restart my data center in the event of a massive network switch outage.

Here is a table detailing our vSphere switch configuration:

Port Group              Fabric   Teaming Config   Network Control Policy (in UCS)   Network Failover Detection (at vSwitch level)
Service Console NIC1    A        Active           Link-Down                         Link Status Only
Service Console NIC2    B        Standby          Warning                           Link Status Only
VMkernel NIC1           A        Active           Link-Down                         Link Status Only
VMkernel NIC2           B        Standby          Warning                           Link Status Only
Virtual Machine NIC1    A        Active           Link-Down                         Link Status Only
Virtual Machine NIC2    B        Active           Warning                           Link Status Only

As far as bare metal blades go:

                                  NIC1        NIC2
Fabric                            A           B
Teaming Config                    Active      Active or Standby (depends on app)
Network Control Policy (in UCS)   Link-Down   Warning

Digression: This looks like we are heavily loading up Fabric A, which is true from an overall placement point of view.  However, most of our workloads are virtual machines, and the virtual machine vNICs are configured active/active, thus providing some semblance of load balancing.  We could go active/active for the bare metal blades since the operative feature for them is the Network Control Policy.  With vSphere, we are trying to keep the Service Console and VMkernel vNICs operating on the same fabric interconnect in order to reduce northbound traffic.  Not so with bare metal systems.

Back on track: As previously stated (before the tables), what all this does, in effect, is force all my blade traffic onto a single fabric interconnect in case I lose ALL my northbound links.  Since the ports on fabric B are not marked “link-down”, the blades do not see any network issues and continue communicating normally.


And now the “BUT”: this won’t work completely in my environment because I am connected to two disjointed L2 networks.  See Brad Hedlund’s blog and The Unified Computing blog for more details.  In order for this to work completely, I will need to put in a software router of some sort to span the two different networks (VLANs in this case).


So what do you think?  Anyone out there with a lab that can fully test this?  If so, I would be interested in seeing your results.


Our Current UCS/vSphere Migration Status

August 17, 2010

We’ve migrated most of our virtual servers over to UCS and vSphere.  I’d say we are about 85% done, with this phase being completed by Aug 29.  It’s not that it’s taking 10+ days to actually do the rest of the migrations.  It’s more of a scheduling issue.  From my perspective, I have three more downtimes to go.  Not much at all.

The whole process of migrating from ESX to vSphere and updating all the virtual servers has been interesting, to say the least.  We haven’t encountered any major problems; just some small items related to the VMware Tools and VM hardware version (4 to 7) upgrades.  For example, our basic VMware Tools upgrade process is to right-click on a guest in the VIC and click on the appropriate items to perform an automatic upgrade.  When it works, the guest installs VMware Tools, reboots, and comes back up without admin intervention.  For some reason, this would not work for our MS Terminal Servers unless we were logged into the target terminal server.

Here’s another example, this time involving Windows Server 2008: the automatic upgrade process wouldn’t work there either.  Instead, we had to log in, launch VMware Tools from the System Tray, and select upgrade.  The only operating system that went perfectly was Windows Server 2003 with no fancy extras (terminal services, etc.).  Luckily, that’s the OS most of our virtual workloads are running.  I am going to hazard a guess and say that some of these oddities are related to our various security settings, GPOs, and the like.

All-in-all, the VM migration has gone very smoothly.  I must say that I am happy with the quality of the VMware hypervisor, Virtual Center, and the other basic components.  There has been plenty of opportunity for something to go extremely wrong, but so far, nada.  (Knock on wood.)

So what’s next?  We are preparing to migrate our SQL servers onto bare metal blades.  In reality, we are building new servers from scratch and installing SQL Server.  The implementation of UCS has given us the opportunity to update our SQL servers to Windows Server 2008 and SQL Server 2008.  Other planned moves include some Oracle app servers (on Red Hat) as well as domain controllers, file share clusters, and maybe some tape backup servers.  This should take us into September.

Once we finish with the blades, we’ll start deploying the Cisco C-series rackmount servers.  We still have a number of instances where we have to go rackmount.   Servers in this category typically need multiple NICs, telephony boards, or other specialized expansion boards.


Upgrade Follies

August 12, 2010

It’s amazing how many misconfigured, or seemingly misconfigured, items can show up when doing maintenance and/or upgrades.  In the past three weeks, we have found at least four production items that fit this description, which no one had noticed because things appeared to be working.  Here’s a sampling:

During our migration from our legacy VM host hardware to UCS, we broke a website that was hardware load-balanced across two different servers.  Traffic should have been directed to Server A, then Server B, then Server C.  After the migration, traffic was only going to Server C, which just hosts a page that says the site is down.  It’s a “maintenance” server, meaning that whenever we take a public-facing page down, the traffic gets directed to Server C so that people can see a nice screen that says, “Sorry, down for maintenance…”

Everything looked right in the load balancer configuration.  While delving deeper, we noticed that Server A was configured to be the primary node for a few other websites.  An application analyst whose app was affected chimed in and said that the configuration was incorrect.  Website 1 traffic was supposed to go first to Server A, then B.  Website 2 traffic was supposed to go in the opposite order.  All our application documentation agreed with the analyst.  Of course, he wrote the documentation, so it had better agree with him 🙂  Here is the disconnect: we track all our changes in a Change Management system, and no one ever put the desired configuration change into the system.  As far as our network team is concerned, the load balancer is configured properly.  Now this isn’t really a folly, since our production system/network matched what our change management and CMDB systems were telling us.  This is actually GOODNESS.  If we ever had to recover from a disaster, we would reference our CMDB and change management systems, so they had better be in agreement.

Here’s another example: We did a mail server upgrade about six months ago and everything worked, as far as we could tell.  What we didn’t know was that some things were not working, but no one noticed because mail was getting through.  When we did notice something that wasn’t correct (a remote monitoring issue) and fixed the cause, it led us to another item, and so on and so on.  Now, not everything was broken at the same time.  In a few cases, the fix of one item actually broke something else.  What’s funny is that if we hadn’t corrected the monitoring issue, everything would have still worked.  It was the fix that caused all the other problems.  In other words, one misconfiguration proved to be a correct configuration for other misconfigured items.  In this case, multiple wrongs did make a right.  Go figure.

My manager has a saying for this: “If you are going to miss, miss by enough”.


I’ve also noticed that I sometimes don’t understand concepts when I think I do.  As part of our migration to UCS, we are also upgrading from ESX 3.5 to vSphere.  Since I am new to vSphere, I did pretty much what every SysAdmin does: click all the buttons/links.  One of those buttons is the “Advanced Runtime Info” link that is part of the VMware HA section of the main Virtual Center screen.

This link brings up info on slot sizes and usage.  You would think that the numbers would add up, but clearly they don’t.

How does 268 - 12 = 122?  Either I’m obviously math-challenged or I really need to go back and re-read the concept of slots.
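My working theory (an assumption on my part; I haven’t confirmed it against the HA documentation yet) is that the “available” figure also subtracts the slots admission control holds back for failover capacity, so the math would be something like:

    # hypothetical reconstruction: hosts contribute 268 slots in total, 12 are in use,
    # and roughly one host's worth (134) is reserved for HA failover capacity
    total_slots=268
    used_slots=12
    failover_reserve=134
    echo $(( total_slots - used_slots - failover_reserve ))    # -> 122 reported as available

If that’s right, the numbers do add up; they just aren’t the two numbers shown most prominently.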


Let the Migrations Begin!!

August 7, 2010

It’s been a few weeks since I last posted an update on our Cisco UCS implementation.  We’ve mostly been in a holding pattern until now.  Yes, we finally got the network integration component figured out.  Unfortunately, we had to dedicate some additional L2 switches to accommodate our desired end goal.  If you look back a few posts, I covered the issues with connecting UCS to two disjointed L2 networks.  We followed the recommended workaround and it seems to be working.  It took us a bit to get here since my shop did not use VLANs, which turned out to be part of the workaround.

So now we have been in test mode for a bit over a week with no additional problems found.  Now it’s time for real workloads.  We migrated a few development systems over on Wednesday to test out our migration process.  Up until then, it was a paper exercise.  It worked, but required more time than we thought for the VMware Tools and VM hardware version upgrades.  The real fun starts today when we migrate a few production workloads.  If all goes well, I’ll be very busy over the next 45 days as we move all our VMware workloads and a number of bare metal installs to UCS.

Since we chose to migrate by moving one LUN at a time from the old hosts to the new hosts, and also to upgrade to vSphere, our basic VM migration process goes like this (a rough sketch of the host-side commands follows the list):

  1. Power off guests that are to be migrated.  These guests should be on the same LUN.
  2. Present the LUN to the new VM hosts and do an HBA rescan on the new hosts.
  3. In Virtual Center, click on a guest to be migrated.  Click on the migrate link and select Host.    The migration should take seconds.
  4. Repeat for all other guests on this LUN.
  5. Unpresent the LUN from the old hosts.
  6. Power up guests
  7. Upgrade VM tools (now that we are on vSphere hosts) and reboot.
  8. Power the guests down.
  9. Upgrade VM hardware.
  10. Power up the guests and let them Plug-n-Play the new hardware and reboot when needed.
  11. Test
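For the steps that touch the hosts directly, the service-console side looks roughly like this (a sketch; the datastore and guest names are made up):

    # step 1: power off a guest that lives on the LUN being moved (run on the old host)
    vmware-cmd /vmfs/volumes/old_lun01/guest01/guest01.vmx stop trysoft
    # step 2: after presenting the LUN to the new hosts, rescan their HBAs
    esxcfg-rescan vmhba1
    esxcfg-rescan vmhba2
    # steps 3 through 5 are handled in Virtual Center and on the array, then:
    # step 6: power the guest back up, now running on a new host
    vmware-cmd /vmfs/volumes/old_lun01/guest01/guest01.vmx start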

We chose to do steps 6 through 10 using no more than four guests at a time.  It’s easier to keep track of things this way and the process seems to be working so far.

We are lucky to be on ESX 3.5.  If we had started out on ESX 4, the LUN migration method would require extra steps due to the process of LUN removal from the old hosts.  To properly remove a LUN from ESX 4, you need to follow a number of convoluted steps, as noted in this VMware KB.  With ESX 3.5, you can just unpresent the LUN and do an HBA rescan.
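Concretely, the ESX 3.5 side of that is just a rescan after the array administrator pulls the LUN (sketch below); the ESX 4.x KB procedure, as I read it, expects you to mask the LUN from the host first (MASK_PATH claim rules via esxcli corestorage claimrule), rescan, and only then unpresent it on the array, hence the extra steps.

    # ESX 3.5: once the LUN is unpresented on the array, rescan each HBA and move on
    esxcfg-rescan vmhba1
    esxcfg-rescan vmhba2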

I don’t know the technical reason for all the extra steps to remove a LUN in vSphere, but it sure seems like a step backwards from a customer perspective.  Maybe VMware will change it in the next version.

Categories: UCS, VMware