Archive

Archive for the ‘UCS’ Category

Use Cases for Cisco UCS Network Isolation

October 4, 2010 Leave a comment

Based on my last post, a couple of people have emailed me asking, "What is the value of keeping UCS links alive when the network has gone down?"  The answer is: It Depends.  It depends on your applications and environment.  In my case, I have a number of multi-tiered apps that are session oriented, batch processors, and the like.

The easiest use case to describe involves batch processing.  We have a few applications that do batch processing late at night.  It just so happens that "late at night" is also the window for performing network maintenance.  When the two collide (batch jobs and maintenance), we either reschedule one of them, take down the application, or move forward and hope nothing goes awry.  Having these applications in UCS and taking advantage of the configuration in my previous post means I can do network maintenance without having to reschedule batch jobs or take down the application.

I could probably achieve similar functionality outside of UCS with a more complex setup that makes use of multiple switches and NIC teaming drivers at the OS level.  However, some of my servers use all of their physical NICs for different purposes, with different IP addresses.  In these cases, teaming drivers may add unnecessary complexity.  Not to mention that the whole premise of this use case is being able to do network maintenance; anything that avoids relying on the network during that window is a plus in my book.

Now let's consider session-oriented applications.  In our case, we have a multi-tiered app that requires open sessions to be maintained from one tier to the next.  If there is a hiccup in the connection, the session closes and the app has to be restarted.  Typically, this means rebooting.  Fabric failover prevents the session from closing, so the app keeps running.  In this particular case, UCS isolation would prevent the app from doing any work since no clients would be able to get to it.  Where it helps us is in restoring service faster when the network comes back, since there is no need for a reboot.

I am going to guess that this can be done with other vendors' blade systems, but with additional equipment.  What I mean is that with other blade systems, the unit of measure is the chassis.  You can probably configure the internal switches to pass traffic from one blade to another without having to go to another set of switches.  But if you need a blade in chassis A to talk to a blade in chassis B, you will probably need to involve an additional switch or two, mounted either Top-of-Rack or End-of-Row.  In the case of UCS, the unit of measure is the fabric.  Any blade can communicate with any other blade, provided they are in the same VLAN and assuming end-host (EHV) mode.  Switch mode may offer more features, but I am not versed in it.

I hope this post answers your questions.  I am still thinking over the possibilities that UCS isolation can bring to the table.  BTW, I made up the term “UCS isolation”.  If anyone has an official term, or one that better describes the situation, please let me know.

Categories: cisco, UCS

Can UCS Survive a Network Outage?

September 29, 2010 1 comment

Part of our UCS implementation involved the use of Cisco Advanced Services (AS) to help with the initial configuration and testing.  Due to our integration issues, time ran out and we never completed some items related to our implementation plan.  AS was back out this week for a few days in order to complete their portion of the plan.  Due to timing, we worked with a different AS engineer this time.  He performed a health check of our UCS environment and suggested a vSphere configuration change to help improve performance.

Before I get into what we changed, let me give a quick background on our vSphere configuration.  We are using the B250-M2 blade with a single Palo adapter.  We are not taking advantage of the advanced vNIC capabilities of the Palo adapter.  What I mean by that is that we are not assigning a vNIC to each guest and using dvSwitches.  Instead, we are presenting two vNICs for the Service Console, two vNICs for the VMkernel, and two vNICs for virtual machines, and using them as we would on a standard rackmount server.  Each vSwitch is configured with one vNIC from fabric A and one vNIC from fabric B, teamed together in an active/active configuration.

Recommended Change: Instead of active/active teaming, set the service console and VMkernel ports to active/standby.  When doing this, ensure that the active NICs are all on the same fabric interconnect.  This will keep service console/VMkernel traffic from having to hit our northbound switches and keep the traffic isolated to a single fabric interconnect.
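For anyone who would rather script this change than click through the VI Client, here is a minimal pyVmomi sketch of the same idea.  To be clear, this is an illustrative sketch only: the vCenter/host names, credentials, port group names, and vmnic numbers are placeholders, not values from our environment.

```python
# Minimal sketch (pyVmomi): set explicit active/standby uplinks on the Service
# Console and VMkernel port groups, keeping the active NICs on the same fabric.
# All names and credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator",
                  pwd="password", sslContext=ssl._create_unverified_context())
host = si.content.searchIndex.FindByDnsName(dnsName="esx01.example.com", vmSearch=False)
net_sys = host.configManager.networkSystem

def set_active_standby(pg_name, active_nic, standby_nic):
    """Rewrite a port group's teaming policy so one uplink is active, one standby."""
    for pg in net_sys.networkInfo.portgroup:
        if pg.spec.name != pg_name:
            continue
        spec = pg.spec
        teaming = spec.policy.nicTeaming or vim.host.NetworkPolicy.NicTeamingPolicy()
        teaming.nicOrder = vim.host.NetworkPolicy.NicOrderPolicy(
            activeNic=[active_nic], standbyNic=[standby_nic])
        spec.policy.nicTeaming = teaming
        net_sys.UpdatePortGroup(pgName=pg_name, portgrp=spec)

# Keep the active path for both port groups on the same fabric (fabric A here,
# assuming vmnic0/vmnic2 map to fabric A and vmnic1/vmnic3 to fabric B).
set_active_standby("Service Console", active_nic="vmnic0", standby_nic="vmnic1")
set_active_standby("VMkernel", active_nic="vmnic2", standby_nic="vmnic3")

Disconnect(si)
```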


Here is where it gets interesting.

Once this was done, possibilities came to mind and I asked the $64,000 question: "Is there a way to keep everything in UCS up and running properly in the event we lose all our northbound links?"  It was more of a theoretical question, but we spent the next six hours working on it anyway.  🙂

Disclaimer: not all of what you are about to read is fully tested.  This was a theoretical exercise that we didn’t finish testing due to time constraints.  We did test this with two hosts on the same subnet and it worked as theorized.

Here’s what we came up with:

First of all, when UCS loses its northbound links it can behave in two ways.  Via the Network Control Policy – see the screenshot below – the ports can be marked either "link-down" or "warning".  When northbound ports are marked "link-down", the various vNICs presented to the blades go down.  This also kicks in fabric failover if it is enabled at the vNIC level.  If you are not using the Fabric Failover feature on a particular vNIC, you can achieve the same functionality by running NIC teaming drivers at the operating system level.  We are using NIC teaming at the vSwitch level in vSphere and Fabric Failover for bare metal operating systems.

Setting the Network Control Policy to "warning" keeps the ports alive as far as the blades are concerned, and no failovers take place.  The beauty of this policy is that it can be applied on a per-vNIC basis, so you can cherry-pick which vNIC is affected by which policy (link-down or warning).  Using a combination of Network Control Policy settings and vSwitch configurations, it's possible to keep workloads on UCS up and running, with all servers (virtual or otherwise) communicating, without any external connectivity.  This could be used to prevent massive outages, boot storms due to outages, and so on.  In our case, since the bulk of our data center will be on UCS, it basically prevents me from having to restart my data center in the event of a massive network switch outage.
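For illustration only, the two Network Control Policies could also be defined programmatically.  Here is a minimal sketch using Cisco's UCS Manager Python SDK (ucsmsdk), which didn't exist when we did this work; the policy names, org, and credentials are assumptions on my part, and each vNIC would still need to reference whichever policy you want it to use.

```python
# Minimal sketch (ucsmsdk): create two Network Control Policies, one that marks
# vNICs link-down when the uplinks fail and one that only raises a warning.
# Policy names, the org, and credentials are illustrative assumptions.
from ucsmsdk.ucshandle import UcsHandle
from ucsmsdk.mometa.nwctrl.NwctrlDefinition import NwctrlDefinition

handle = UcsHandle("ucsm.example.com", "admin", "password")
handle.login()

# "link-down": an uplink failure takes the vNIC down, which triggers fabric
# failover (or OS-level NIC teaming) for the vNICs that use this policy.
link_down = NwctrlDefinition(parent_mo_or_dn="org-root",
                             name="ncp-link-down",
                             uplink_fail_action="link-down")

# "warning": an uplink failure only raises a fault, so the blade keeps the link
# up and east-west traffic stays alive on that fabric interconnect.
warn_only = NwctrlDefinition(parent_mo_or_dn="org-root",
                             name="ncp-warning",
                             uplink_fail_action="warning")

handle.add_mo(link_down, modify_present=True)
handle.add_mo(warn_only, modify_present=True)
handle.commit()
handle.logout()
```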

Here is a table detailing our vSphere switch configuration:

Port Group         vNIC   Fabric   Teaming Config   Network Control Policy (in UCS)   Network Failover Detection (at vSwitch level)
Service Console    NIC1   A        Active           Link-Down                         Link Status Only
Service Console    NIC2   B        Standby          Warning                           Link Status Only
VMkernel           NIC1   A        Active           Link-Down                         Link Status Only
VMkernel           NIC2   B        Standby          Warning                           Link Status Only
Virtual Machine    NIC1   A        Active           Link-Down                         Link Status Only
Virtual Machine    NIC2   B        Active           Warning                           Link Status Only

As far as bare metal blades go:

        Fabric   Teaming Config                       Network Control Policy (in UCS)
NIC1    A        Active                               Link-Down
NIC2    B        Active or Standby (depends on app)   Warning

Digression: This looks like we are heavily loading up fabric A, which is true from an overall placement point of view.  However, most of our workloads are VMs, which are configured active/active, thus providing some semblance of load balancing.  We could go active/active for bare metal blades since the operative feature for them is the Network Control Policy.  With vSphere, we are trying to keep the Service Console and VMkernel vNICs operating on the same fabric interconnect in order to reduce northbound traffic.  Not so with bare metal systems.

Back on track: As previously stated (before the tables), what all this does in effect is force all my blade traffic onto a single fabric interconnect if I lose ALL my northbound links.  Since the ports on fabric B are not marked "link-down", the blades do not see any network issues and continue communicating normally.


And now the "BUT": this won't work completely in my environment because I am connected to two disjointed L2 networks.  See Brad Hedlund's blog and The Unified Computing blog for more details.  In order for this to completely work, I will need to put in a software router of some sort to span the two different networks (VLANs in this case).


So what do you think?  Anyone out there with a lab that can fully test this?  If so, I would be interested in seeing your results.


Troubleshooting fault code F0327 in Cisco UCS

September 22, 2010 Leave a comment

For the past few days, I've been working on troubleshooting a problem in UCS that I, admittedly, caused.  The problem in question has to do with an error code/message that I received when trying to move a service profile from one blade to another.  The error code is F0327.

According to the UCS error code reference guide, it translates as:

fltLsServerConfigFailure

Fault Code: F0327

Message

Service profile [name] configuration failed due to [configQualifier]

Explanation

The named configuration qualifier is not available. This fault typically occurs because Cisco UCS Manager cannot successfully deploy the service profile due to a lack of resources that meet the named qualifier. For example, the following issues can cause this fault to occur:

•The service profile is configured for a server adapter with vHBAs, and the adapter on the server does not support vHBAs.

•The local disk configuration policy in the service profile specifies the No Local Storage mode, but the server contains local disks.

Recommended Action

If you see this fault, take the following actions:

Step 1  Check the state of the server and ensure that it is in either the discovered or unassociated state.

Step 2  If the server is associated or undiscovered, do one of the following:

–Discover the server.

–Disassociate the server from the current service profile.

–Select another server to associate with the service profile.

Step 3  Review each policy in the service profile and verify that the selected server meets the requirements in the policy.

Step 4  If the server does not meet the requirements of the service profile, do one of the following:

–Modify the service profile to match the server.

–Select another server that does meet the requirements to associate with the service profile.

Step 5  If you can verify that the server meets the requirements of the service profile, execute the show tech-support command and contact Cisco Technical Support.

——————–

While the guide was helpful in providing lots of things to try, none of them fixed the problem.  It took me a while, but I figured out how to reproduce the error, a possible cause, and a workaround.

Here's how to reproduce the error:

  1. Create a service profile without assigning any HBAs.  Shut down the server when the association process has completed.
  2. After the profile is associated, assign an HBA or two.
  3. You should receive this dialog box:

You will then see this in the General tab of the service profile in question:

Now here is where the error can be induced:

  1. Don’t power on.  Keep in mind that the previous dialog box said that changes wouldn’t be applied until the blade was rebooted (powered on).
  2. Now disassociate the profile and associate it with another blade.  The “error” is carried over to the new blade and the config process (association process) does not run.

Powering up the newly associated blade does not correct the issue.  What has happened is that the disassociation/association process that is supposed to occur above does not take place because the service profile is in an error state.

Workaround:

  1. Reboot after adding the HBA.  This will complete the re-configuration process, thus allowing disassociation/association processes to perform normally.  This is also the proper procedure.    Or
  2. Go to the Storage tab of the affected service profile and click on “Change World Wide Node Name”.  This forces the re-configuration to take place.


I've opened a ticket with TAC on this asking for a few documentation updates.  The first update is to basically state the correct method for applying the HBAs, and that if it is not followed, the error message will appear.

The second update is for them to update the error code guide with a sixth option: press the "Change World Wide Node Name" button.

I am going to go out on a limb and say that when they wrote the manuals, they probably didn't count on people like me doing things that they shouldn't be doing, or doing them in an improper manner.   🙂


Categories: UCS

Our Current UCS/vSphere Migration Status

August 17, 2010 Leave a comment

We’ve migrated most of our virtual servers over to UCS and vSphere.  I’d say we are about 85% done, with this phase being completed by Aug 29.  It’s not that it’s taking 10+ days to actually do the rest of the migrations.  It’s more of a scheduling issue.  From my perspective, I have three more downtimes to go.  Not much at all.

The whole process of migrating from ESX to vSphere and updating all the virtual servers has been interesting, to say the least.  We haven't encountered any major problems; just some small items related to the VMware Tools/VM hardware version (4 to 7) upgrades.  For example, our basic VMware Tools upgrade process is to right-click on a guest in the VIC and click on the appropriate items to perform an automatic upgrade.  When it works, the guest installs VMware Tools, reboots, and comes back up without admin intervention.  For some reason, this would not work for our MS Terminal Servers unless we were logged into the target terminal server.

Here's another example, this time involving Windows Server 2008: the automatic upgrade process wouldn't work there either.  Instead, we had to log in, launch VMware Tools from the system tray, and select upgrade.  The only operating system that went perfectly was Windows Server 2003 with no fancy extras (terminal services, etc.).  Luckily, that's the OS most of our virtual workloads are running.  I am going to hazard a guess and say that some of these oddities are related to our various security settings, GPOs, and the like.

All in all, the VM migration has gone very smoothly.  I must say that I am happy with the quality of the VMware hypervisor, Virtual Center, and the other basic components.  There has been plenty of opportunity for something to go extremely wrong, but so far, nada. (knock on wood)

So what's next?  We are preparing to migrate our SQL servers onto bare metal blades.  In reality, we are building new servers from scratch and installing SQL Server.  The implementation of UCS has given us the opportunity to update our SQL servers to Windows Server 2008 and SQL Server 2008.  Other planned moves include some Oracle app servers (on RedHat) as well as domain controllers, file share clusters, and maybe some tape backup servers.  This should take us into September.

Once we finish with the blades, we’ll start deploying the Cisco C-series rackmount servers.  We still have a number of instances where we have to go rackmount.   Servers in this category typically need multiple NICs, telephony boards, or other specialized expansion boards.


Let the Migrations Begin!!

August 7, 2010 2 comments

It's been a few weeks since I last posted an update on our Cisco UCS implementation.  We've mostly been in a holding pattern until now.  Yes, we finally got the network integration component figured out.  Unfortunately, we had to dedicate some additional L2 switches to accommodate our desired end goal.  If you look back a few posts, I covered the issues with connecting UCS to two disjointed L2 networks.  We followed the recommended workaround and it seems to be working.  It took us a bit to get here since my shop did not use VLANs, which turned out to be part of the workaround.

So now we have been in test mode for a bit over a week with no additional problems found.  Now it's time for real workloads.  We migrated a few development systems over Wednesday to test out our migration process.  Up until then, it was a paper exercise.  It worked, but required more time than we thought for the VMware Tools and VM hardware version upgrades.  The real fun starts today when we migrate a few production workloads.  If all goes well, I'll be very busy over the next 45 days as we move all our VMware and a number of bare metal installs to UCS.

Since we chose to migrate by moving one LUN at a time from the old hosts to the new hosts, and also to upgrade to vSphere, our basic VM migration process goes like this:

  1. Power off guests that are to be migrated.  These guests should be on the same LUN.
  2. Present the LUN to the new VM hosts and do an HBA rescan on the new hosts.
  3. In Virtual Center, click on a guest to be migrated.  Click on the migrate link and select Host.    The migration should take seconds.
  4. Repeat for all other guests on this LUN.
  5. Unpresent the LUN from the old hosts.
  6. Power up guests
  7. Upgrade VM tools (now that we are on vSphere hosts) and reboot.
  8. Power the guests down.
  9. Upgrade VM hardware.
  10. Power up the guests and let them Plug-n-Play the new hardware and reboot when needed.
  11. Test

We chose to do steps 6 through 10 using no more than four guests at a time.  It’s easier to keep track of things this way and the process seems to be working so far.
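For what it's worth, the per-guest parts of this process (the HBA rescan, the cold migration, and the VMware Tools and VM hardware upgrades) could be scripted.  Here is a rough pyVmomi sketch of that idea; we did everything by hand in the VIC, so the names, credentials, lookup method, and hardware version string below are all assumptions for illustration.

```python
# Rough sketch (pyVmomi): script steps 2-3 and 6-10 of the migration process for
# a small batch of powered-off guests. Names, credentials, and the "vmx-07"
# hardware version string are placeholders; error handling and timing checks
# are omitted.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator",
                  pwd="password", sslContext=ssl._create_unverified_context())
idx = si.content.searchIndex
new_host = idx.FindByDnsName(dnsName="ucs-esx01.example.com", vmSearch=False)

# Step 2: rescan the HBAs on the new host so it sees the freshly presented LUN.
new_host.configManager.storageSystem.RescanAllHba()

# Steps 3 and 6-10, done for a small batch of guests (we used four at a time).
for vm_name in ["guest01", "guest02", "guest03", "guest04"]:
    # Assumes each guest's DNS name matches its inventory name.
    vm = idx.FindByDnsName(dnsName=vm_name, vmSearch=True)

    # Step 3: cold-migrate the powered-off guest to the new host; the datastore
    # stays the same, so this is just a re-registration and takes seconds.
    WaitForTask(vm.MigrateVM_Task(
        host=new_host, priority=vim.VirtualMachine.MovePriority.defaultPriority))

    # Steps 6-7: power on and upgrade VMware Tools (the guest reboots itself).
    WaitForTask(vm.PowerOnVM_Task())
    WaitForTask(vm.UpgradeTools_Task())

    # Steps 8-9: power down and upgrade the virtual hardware (version 4 to 7).
    WaitForTask(vm.PowerOffVM_Task())
    WaitForTask(vm.UpgradeVM_Task(version="vmx-07"))

    # Step 10: power back up and let the OS plug-and-play the new hardware.
    WaitForTask(vm.PowerOnVM_Task())

Disconnect(si)
```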

We are lucky to be on ESX 3.5.  If we had started out on ESX 4, the LUN migration method would require extra steps due to the process of removing LUNs from the old hosts.  To properly remove a LUN from ESX 4, you need to follow a number of convoluted steps as noted in this VMware KB.  With ESX 3.5, you can just unpresent the LUN and do an HBA rescan.

I don't know the technical reason for all these extra steps to remove a LUN in vSphere, but it sure seems like a step backwards from a customer perspective.  Maybe VMware will change it in the next version.

Categories: UCS, VMware

Week Two of Cisco UCS Implementation Completed

Progress has been made!!

The first few days of the week involved a number of calls back to TAC, the UCS business unit, and various other Cisco resources without much progress.  Then on Thursday I pressed the magic button and all of a sudden our fabric interconnects came alive in Fabric Manager (the MDS control software).  What did I do?  I turned on SNMP.  No one noticed that it was turned off (the default state).  Pretty sad, actually, given the number of people involved in troubleshooting this.

This paragraph is subject to change pending confirmation of its accuracy from Cisco.  So here's the basic gist of what was going on: we are running an older version of MDS firmware, and the version of Fabric Manager that comes with this firmware is not really "UCS aware".  It needs a method of communicating with the fabric interconnects to fully see all the WWNs.  The workaround is to use SNMP.  I created an SNMP user in UCS and our storage admin created the same username/password in Fabric Manager.  Of course, having the accounts created does nothing if the protocol they need to use is not active.  Duh.
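For the record, the same SNMP setup can presumably be scripted.  Here is a minimal sketch using the UCS Manager Python SDK (ucsmsdk); the class paths, credentials, and user name are my assumptions rather than anything taken from our environment, so treat it as illustrative only.

```python
# Minimal sketch (ucsmsdk): enable the SNMP service on UCS Manager and add an
# SNMP user for Fabric Manager to use. All names and credentials are placeholders.
from ucsmsdk.ucshandle import UcsHandle
from ucsmsdk.mometa.comm.CommSnmp import CommSnmp
from ucsmsdk.mometa.comm.CommSnmpUser import CommSnmpUser

handle = UcsHandle("ucsm.example.com", "admin", "password")
handle.login()

# Flip the SNMP service from its default (disabled) state to enabled.
snmp_svc = CommSnmp(parent_mo_or_dn="sys/svc-ext", admin_state="enabled")
handle.add_mo(snmp_svc, modify_present=True)

# Create the SNMP user; the same username/password then gets defined on the
# Fabric Manager side so it can poll the fabric interconnects.
snmp_user = CommSnmpUser(parent_mo_or_dn="sys/svc-ext/snmp-svc",
                         name="fmuser", pwd="SnmpPassw0rd")
handle.add_mo(snmp_user, modify_present=True)

handle.commit()
handle.logout()
```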

The screenshot below shows the button I am talking about.  The reason no one noticed that SNMP was turned off was that I was able to add traps and users without any warnings about SNMP not being active.  Also, take a look at the HTTP and HTTPS services listed above SNMP.  They are enabled by default.  Easy to miss.


With storage now presented, we were able to complete some basic testing.  I must say that UCS is pretty resilient if you have cabled all your equipment wisely.  We pulled power plugs, fibre to Ethernet, fibre to storage, and so on, and only a few times did we lose a ping (a single ping!).  All our data transfers kept transferring, pings kept pinging, RDP sessions stayed RDP'ing.

We did learn something interesting regarding the Palo card and VMware.  If you are using the basic Menlo card (a standard CNA), then failover works as expected.  Palo is different.  Suffice it to say that for every vNIC you think you need, add another one.  In other words, you will need two vNICs per vSwitch.  When creating vNICs, be sure to balance them across both fabrics and note down the MAC addresses.  Then, when you are creating your vSwitches (or DVS) in VMware, apply two vNICs to each switch, using one from fabric A and one from fabric B.  This provides the failover capability.  I can't provide all the details because I don't know them, but it was explained to me by one of the UCS developers that this is a difference in UCS hardware (Menlo vs Palo).
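To make the pairing concrete, here is a minimal pyVmomi sketch that builds a standard vSwitch bonded to one vNIC from each fabric.  The vmnic-to-fabric mapping is an assumption (check the MAC addresses you noted in UCS Manager against what ESX reports), and `host` is a pyVmomi HostSystem object for the ESX host, retrieved via SmartConnect and the search index.

```python
# Minimal sketch (pyVmomi): create a vSwitch whose two uplinks are Palo vNICs
# from opposite fabrics, which is what provides the failover described above.
from pyVmomi import vim

def add_paired_vswitch(host, name, fabric_a_nic, fabric_b_nic):
    """Create a standard vSwitch bonded to one uplink per UCS fabric."""
    spec = vim.host.VirtualSwitch.Specification(
        numPorts=128,
        bridge=vim.host.VirtualSwitch.BondBridge(
            nicDevice=[fabric_a_nic, fabric_b_nic]))
    host.configManager.networkSystem.AddVirtualSwitch(vswitchName=name, spec=spec)

# Example (placeholder vmnic names): vmnic2 on fabric A, vmnic3 on fabric B.
# add_paired_vswitch(host, "vSwitch1", "vmnic2", "vmnic3")
```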

Next up: testing, testing, and more testing with some VLANing thrown in to help us connect up to two disjointed L2 networks.


Week One of Cisco UCS Implementation Complete

July 5, 2010 2 comments

The first week of Cisco UCS implementation has passed.  I wish I could say we were 100% successful, but I can’t.  We’ve encountered two sticking points which are requiring some rethinking on our part.

The first problem we have run into revolves around our SAN.  The firmware on our MDS switches is a bit out of date and we’ve encountered a display bug in the graphical SAN management tool (Fabric Manager).  This display bug won’t show our UCS components as “zoneable” addresses.  This means that all SAN configurations relating to UCS have to be done via command line.   Why don’t we update our SAN switch firmware?  That would also entail updating the firmware on our storage arrays and it is not something we are prepared to do right now.  It might end up occurring sooner rather than later if doing everything via command line is too cumbersome.

The second problem involves connecting to two separate L2 networks.  This has been discussed on various blogs such as BradHedlund.com and the Unified Computing Blog.  Suffice it to say that we have proven that UCS was not designed to directly connect to two different L2 networks at the same time.  While there is a forthcoming firmware update that will address this, it does not help us now.  I should clarify that this is not a bug and that UCS is working as designed.  I am going to guess that either Cisco engineers did not think that customers would want to connect to two L2 networks, or that it was just a future roadmap feature.  Either way, we are working on methods to get around the problem.

For those who didn't click the links to the other blogs, here's a short synopsis: UCS basically treats all uplink ports equally.  It doesn't know about the different networks, so it assumes any VLAN can be reached via any uplink port, which means ARPs, broadcasts, and regular traffic can end up being sent toward the wrong network.  If you want a better description, please go click the links in the previous paragraph.

But the entire week was not wasted and we managed to accomplish quite a bit.  Once we get past the two hurdles mentioned above, we should be able to commence our testing.  It's actually quite a bit of work to get this far.  Here's how it pans out:

  1. Completed setup of policies
  2. Completed setup of Service Profile Templates
  3. Successfully deployed a number of different server types based on Service Profiles and Server Pool Policy Qualifications
  4. Configured our VM infrastructure to support Palo
  5. Configured UCS to support our VM infrastructure
  6. Successfully integrated UCS into our Windows Deployment system

Just getting past numbers 1 and 2 was a feat.  There are a number of policies that you can set, so it is very easy to go overboard and create/modify way too many.  The more you create, the more you have to manage, and we are trying to follow the K.I.S.S. principle as much as possible.  We started out with too many policies, but eventually came to our senses and whittled the number down.

One odd item to note: when creating vNIC templates, a corresponding port profile is created under the VM tab of UCS Manager.  Deleting vNIC templates does not delete the corresponding port profiles so you will have to manually delete them.  Consistency would be nice here.

And finally, now that we have a complete rack of UCS, I can show you just how "clean" the system looks.

Before

The cabling on a typical rack

After

A full rack of UCS - notice the clean cabling


Let’s hope week number two gets us into testing mode…..
