Archive for September, 2010

Can UCS Survive a Network Outage?

September 29, 2010

Part of our UCS implementation involved the use of Cisco Advanced Services (AS) to help with the initial configuration and testing. Due to our integration issues, time ran out and we never completed some items on our implementation plan. AS was back out this week for a few days to complete their portion of the plan. Due to timing, we worked with a different AS engineer this time. He performed a health check of our UCS environment and suggested a vSphere configuration change to help improve performance.

Before I get into what we changed, let me give a quick background on our vSphere configuration. We are using the B250-M2 blade with a single Palo adapter. We are not taking advantage of the advanced vNIC capabilities of the Palo adapter; that is, we are not assigning a vNIC to each guest and using dvSwitches. Instead, we present two vNICs for the Service Console, two vNICs for the VMkernel, and two vNICs for virtual machines, and use them as we would on a standard rackmount server. Each vSwitch is configured with one vNIC from fabric A and one vNIC from fabric B, teamed together in an active/active configuration.

Recommended Change: Instead of active/active teaming, set the Service Console and VMkernel port groups to active/standby. When doing this, ensure that the active NICs are all on the same fabric interconnect. This keeps Service Console/VMkernel traffic from having to hit our northbound switches and keeps it isolated to a single fabric interconnect.
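For reference, this is roughly what that change looks like if you script it rather than clicking through the vSphere Client. It is only a sketch using the pyVmomi Python SDK (we made the change in the client); the vCenter address, port group names, and vmnic-to-fabric mapping below are placeholders for our environment, so treat them as assumptions and adjust to match yours:

    # Sketch: set the Service Console and VMkernel port groups to explicit
    # active/standby order, with the active uplink on fabric A and the
    # standby uplink on fabric B. pyVmomi-based; names are placeholders.
    from pyVim.connect import SmartConnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret")
    content = si.RetrieveContent()

    def set_active_standby(host, pg_name, active_nic, standby_nic):
        net_sys = host.configManager.networkSystem
        for pg in host.config.network.portgroup:
            if pg.spec.name != pg_name:
                continue
            spec = pg.spec
            if spec.policy is None:
                spec.policy = vim.host.NetworkPolicy()
            teaming = spec.policy.nicTeaming or vim.host.NetworkPolicy.NicTeamingPolicy()
            teaming.policy = "failover_explicit"      # use the explicit NIC order below
            teaming.nicOrder = vim.host.NetworkPolicy.NicOrderPolicy(
                activeNic=[active_nic], standbyNic=[standby_nic])
            spec.policy.nicTeaming = teaming
            net_sys.UpdatePortGroup(pgName=pg_name, portgrp=spec)

    # Assumption: vmnic0 is the vNIC on fabric A and vmnic1 is the vNIC on fabric B.
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True).view
    for esx_host in hosts:
        set_active_standby(esx_host, "Service Console", "vmnic0", "vmnic1")
        set_active_standby(esx_host, "VMkernel", "vmnic0", "vmnic1")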


Here is where it gets interesting.

Once this was done, possibilities came to mind and I asked the $64,000 question: “Is there a way to keep everything in UCS up and running properly in the event we lose all our northbound links?” It was more of a theoretical question, but we spent the next six hours working on it anyway. 🙂

Disclaimer: not all of what you are about to read is fully tested.  This was a theoretical exercise that we didn’t finish testing due to time constraints.  We did test this with two hosts on the same subnet and it worked as theorized.

Here’s what we came up with:

First of all, when UCS loses its northbound links, it can behave in one of two ways. Via the Network Control Policy (see the screenshot below), the ports can be marked either “link-down” or “warning”. When northbound ports are marked “link-down”, the various vNICs presented to the blades go down. This also kicks in fabric failover if it is enabled at the vNIC level. If you are not using the Fabric Failover feature on a particular vNIC, you can achieve the same functionality by running NIC teaming drivers at the operating system level. We are using NIC teaming at the vSwitch level in vSphere and Fabric Failover for bare metal operating systems.

Setting the Network Control Policy to “warning” keeps the ports alive as far as the blades are concerned, and no failover takes place. The beauty of this policy is that it can be applied on a per-vNIC basis, so you can cherry-pick which vNIC is affected by which policy (link-down or warning). Using a combination of Network Control Policy settings and vSwitch configurations, it is possible to keep workloads on UCS up and running, with all servers (virtual or otherwise) communicating, even without any external connectivity. This could be used to prevent massive outages, boot storms following outages, and so on. In our case, since the bulk of our data center will be on UCS, it essentially prevents me from having to restart my data center in the event of a massive network switch outage.
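For completeness, the two policies themselves can also be created programmatically. The sketch below uses Cisco’s ucsmsdk Python SDK; the NwctrlDefinition object and its uplink_fail_action attribute are my reading of the UCS XML API (we created the policies in UCS Manager), and the hostname, credentials, and policy names are placeholders:

    # Sketch: create one Network Control Policy per behavior so each vNIC can
    # reference the one it needs. ucsmsdk-based; verify the class/attribute
    # names against your UCSM version before relying on this.
    from ucsmsdk.ucshandle import UcsHandle
    from ucsmsdk.mometa.nwctrl.NwctrlDefinition import NwctrlDefinition

    handle = UcsHandle("ucsm.example.com", "admin", "password")
    handle.login()

    # Uplink failure takes the vNIC link down (triggers failover on the blade).
    handle.add_mo(NwctrlDefinition(parent_mo_or_dn="org-root",
                                   name="Link-Down",
                                   uplink_fail_action="link-down"))

    # Uplink failure only raises a fault; the vNIC stays up as far as the blade knows.
    handle.add_mo(NwctrlDefinition(parent_mo_or_dn="org-root",
                                   name="Warning",
                                   uplink_fail_action="warning"))

    handle.commit()
    handle.logout()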

Here is a table detailing our vSphere switch configuration:

Port Group              Fabric   Teaming Config   Network Control Policy (in UCS)   Network Failover Detection (at vSwitch level)
Service Console NIC1    A        Active           Link-Down                         Link Status Only
Service Console NIC2    B        Standby          Warning                           Link Status Only
VMkernel NIC1           A        Active           Link-Down                         Link Status Only
VMkernel NIC2           B        Standby          Warning                           Link Status Only
Virtual Machine NIC1    A        Active           Link-Down                         Link Status Only
Virtual Machine NIC2    B        Active           Warning                           Link Status Only

As far as bare metal blades go:

NIC     Fabric   Teaming Config                        Network Control Policy (in UCS)
NIC1    A        Active                                Link-Down
NIC2    B        Active or Standby (depends on app)    Warning

Digression: This looks like we are heavily loading up fabric A, which is true from an overall placement point of view. However, most of our workload is virtual machines, which are configured active/active, so there is some semblance of load balancing. We could go active/active for the bare metal blades as well, since the operative feature for them is the Network Control Policy. With vSphere, we are trying to keep the Service Console and VMkernel vNICs operating on the same fabric interconnect in order to reduce northbound traffic; that concern does not apply to the bare metal systems.

Back on track: As previously stated (before the tables), what all this does in effect is force all of my blade traffic onto a single fabric interconnect if I lose ALL of my northbound links. Since the ports on fabric B are not marked “link-down”, the blades do not see any network issue and continue communicating normally.
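To make that behavior concrete, here is a tiny standalone Python model of the tables above. It is not UCS or vSphere code, just the decision logic: when every northbound uplink is lost, the fabric A vNICs (Link-Down policy) drop, the fabric B vNICs (Warning policy) stay up, and every port group ends up carrying traffic through fabric interconnect B:

    # Toy model of the tables above: which vNIC each port group uses when
    # northbound uplinks are lost. Purely illustrative.
    VNICS = {
        # name: (fabric, network control policy)
        "SC-NIC1":  ("A", "link-down"), "SC-NIC2":  ("B", "warning"),
        "VMK-NIC1": ("A", "link-down"), "VMK-NIC2": ("B", "warning"),
        "VM-NIC1":  ("A", "link-down"), "VM-NIC2":  ("B", "warning"),
    }

    # Teaming order per port group (the VM port group is really active/active;
    # it is modeled here as preferred + failover for simplicity).
    PORT_GROUPS = {
        "Service Console":  ("SC-NIC1", "SC-NIC2"),
        "VMkernel":         ("VMK-NIC1", "VMK-NIC2"),
        "Virtual Machines": ("VM-NIC1", "VM-NIC2"),
    }

    def link_up(vnic, failed_fabrics):
        # A vNIC only reports link-down if its fabric lost its uplinks AND its
        # Network Control Policy is "link-down"; "warning" keeps the link up.
        fabric, policy = VNICS[vnic]
        return not (fabric in failed_fabrics and policy == "link-down")

    def active_paths(failed_fabrics):
        result = {}
        for pg, (primary, secondary) in PORT_GROUPS.items():
            nic = primary if link_up(primary, failed_fabrics) else secondary
            result[pg] = "%s (fabric %s)" % (nic, VNICS[nic][0])
        return result

    print(active_paths(set()))          # normal operation: primaries on fabric A
    print(active_paths({"A", "B"}))     # total uplink loss: everything on fabric B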


And now the “BUT”: This won’t work completely in my environment because I am connected to two disjoint L2 networks. See Brad Hedlund’s blog and The Unified Computing blog for more details. For this to work completely, I will need to put in a software router of some sort to span the two different networks (VLANs in this case).


So what do you think? Anyone out there with a lab that can fully test this? If so, I would be interested in seeing your results.


Troubleshooting fault code F0327 in Cisco UCS

September 22, 2010

For the past few days, I’ve been working on troubleshooting a problem in UCS that I, admittedly, caused. The problem in question has to do with an error code/message that I received when trying to move a service profile from one blade to another. The error code is F0327.

According to the UCS error code reference guide, it translates as:

fltLsServerConfigFailure

Fault Code: F0327

Message

Service profile [name] configuration failed due to [configQualifier]

Explanation

The named configuration qualifier is not available. This fault typically occurs because Cisco UCS Manager cannot successfully deploy the service profile due to a lack of resources that meet the named qualifier. For example, the following issues can cause this fault to occur:

• The service profile is configured for a server adapter with vHBAs, and the adapter on the server does not support vHBAs.

• The local disk configuration policy in the service profile specifies the No Local Storage mode, but the server contains local disks.

Recommended Action

If you see this fault, take the following actions:

Step 1  Check the state of the server and ensure that it is in either the discovered or unassociated state.

Step 2  If the server is associated or undiscovered, do one of the following:

– Discover the server.

– Disassociate the server from the current service profile.

– Select another server to associate with the service profile.

Step 3  Review each policy in the service profile and verify that the selected server meets the requirements in the policy.

Step 4  If the server does not meet the requirements of the service profile, do one of the following:

– Modify the service profile to match the server.

– Select another server that does meet the requirements to associate with the service profile.

Step 5  If you can verify that the server meets the requirements of the service profile, execute the show tech-support command and contact Cisco Technical Support.

——————–

While the guide was helpful in giving me lots of things to try, none of them fixed the problem. It took me a while, but I figured out how to reproduce the error, a possible cause, and a workaround.

Here’s how to produce the error:

  1. Create a service profile without assigning any HBAs.  Shut down the server when the association process has completed.
  2. After the profile is associated, assign an HBA or two.
  3. You should receive this dialog box:

You will then see this in the general tab of the service profile in question:

Now here is where the error can be induced:

  1. Don’t power on.  Keep in mind that the previous dialog box said that changes wouldn’t be applied until the blade was rebooted (powered on).
  2. Now disassociate the profile and associate it with another blade.  The “error” is carried over to the new blade and the config process (association process) does not run.

Powering up the newly associated blade does not correct the issue. What has happened is that the disassociation/association process that is supposed to occur above does not take place because the service profile is in an error state.

Workaround:

  1. Reboot after adding the HBA.  This completes the reconfiguration process, allowing the disassociation/association processes to run normally.  This is also the proper procedure.  Or:
  2. Go to the Storage tab of the affected service profile and click “Change World Wide Node Name”.  This forces the reconfiguration to take place.


I’ve opened a ticket with TAC on this, asking for a few documentation updates. The first update is to state the correct method for applying the HBAs and to note that, if it is not followed, this error message will appear.

The second update is to add a sixth option to the error code guide – press the “Change World Wide Node Name” button.

I am going to go out on a limb and say that they probably didn’t count on people like me doing things they shouldn’t be doing, or doing them in an improper manner, when they wrote the manuals. 🙂


Categories: UCS