
Posts Tagged ‘datacenter’

Use Cases for Cisco UCS Network Isolation

October 4, 2010

Based on my last post, a couple of people have emailed me asking, “What is the value of keeping UCS links alive when the network has gone down?”  The answer is: it depends.  It depends on your applications and environment.  In my case, I have a number of multi-tiered apps that are session-oriented, run batch processing, and so on.

The easiest use case to describe involves batch processing.  We have a few applications that do batch processing late at night.  It just so happens that “late at night” is also the window for performing network maintenance.  When the two collide (batch jobs and maintenance), we either reschedule something (batch or maintenance), take down the application, or move forward and hope nothing goes awry.  Having these applications in UCS and taking advantage of the configuration in my previous post means I can do network maintenance without having to reschedule batch jobs or take down the application.

I could probably achieve similar functionality outside of UCS with a complex setup that uses multiple switches and NIC teaming drivers at the OS level.  However, some of my servers use all of their physical NICs for different purposes, with different IP addresses.  In those cases, teaming drivers may add unnecessary complexity.  Not to mention that the premise of this use case is the ability to do network maintenance: any way to avoid relying on the network is a plus in my book here.

Now let’s consider session-oriented applications.  In our case, we have a multi-tiered app that requires that open sessions be maintained from one tier to the next.  If there is a hiccup in the connection, the session closes and the app has to be restarted, which typically means rebooting.  Fabric failover prevents the session from closing, so the app keeps running.  In this particular case, UCS isolation would prevent the app from doing any work, since no clients would be able to get to it.  Where it helps us is in restoring service faster when the network comes back, because no reboot is needed.

I am going to guess that this can be done with other vendors’ blade systems, but with additional equipment.  What I mean is that with other blade systems, the unit of measure is the chassis.  You can probably configure the internal switches to pass traffic from one blade to another without having to go to another set of switches.  But if you need a blade in chassis A to talk to a blade in chassis B, you will probably need to involve an additional switch, or two, mounted either Top-of-Rack or End-of-Row.  In the case of UCS, the unit of measure is the fabric.  Any blade can communicate with any other blade, provided they are in the same VLAN and assuming EHV mode.  Switch mode may offer more features, but I am not versed in it.

I hope this post answers your questions.  I am still thinking over the possibilities that UCS isolation can bring to the table.  BTW, I made up the term “UCS isolation”.  If anyone has an official term, or one that better describes the situation, please let me know.


Can UCS Survive a Network Outage?

September 29, 2010

Part of our UCS implementation involved the use of Cisco Advanced Services (AS) to help with the initial configuration and testing.  Due to our integration issues, time ran out and we never completed some items in our implementation plan.  AS was back out this week for a few days to complete their portion of the plan.  Because of scheduling, we worked with a different AS engineer this time.  He performed a health check of our UCS environment and suggested a vSphere configuration change to help improve performance.

Before I get into what we changed, let me give a quick background on our vSphere configuration.  We are using the B250-M2 blade with a single Palo adapter.  We are not taking advantage of the advanced vNIC capabilities of the Palo adapter; that is, we are not assigning a vNIC to each guest and using dvSwitches.  Instead, we are presenting two vNICs for the Service Console, two vNICs for the VMkernel, and two vNICs for virtual machines, and using them as we would on a standard rackmount server.  Each vSwitch is configured with one vNIC from fabric A and one vNIC from fabric B, teamed together in an active/active configuration.

Recommended Change: Instead of active/active teaming, set the service console and VMkernel ports to active/standby.  When doing this, ensure that the active NICs are all on the same fabric interconnect.  This will keep service console/VMkernel traffic from having to hit our northbound switches and keep the traffic isolated to a single fabric interconnect.
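To make the intent concrete, here is a minimal sketch (not our actual tooling) that models the change as data and checks that every active Service Console/VMkernel uplink lands on the same fabric interconnect.  The vmnic numbering and the vmnic-to-fabric mapping are assumptions for illustration only.

```python
# Sketch only: models the recommended teaming change and verifies that all
# active Service Console / VMkernel uplinks sit on one fabric (A here).
# The vmnic numbers and the vmnic-to-fabric mapping are assumptions.

VMNIC_FABRIC = {"vmnic0": "A", "vmnic1": "B",   # Service Console pair
                "vmnic2": "A", "vmnic3": "B",   # VMkernel pair
                "vmnic4": "A", "vmnic5": "B"}   # Virtual machine pair

TEAMING = {
    "Service Console":  {"active": ["vmnic0"], "standby": ["vmnic1"]},          # was active/active
    "VMkernel":         {"active": ["vmnic2"], "standby": ["vmnic3"]},          # was active/active
    "Virtual Machines": {"active": ["vmnic4", "vmnic5"], "standby": []},        # unchanged
}

def check_sc_vmk_on_one_fabric(teaming, fabric_map):
    """Return True if all active SC/VMkernel uplinks share a single fabric."""
    fabrics = {fabric_map[nic]
               for pg in ("Service Console", "VMkernel")
               for nic in teaming[pg]["active"]}
    return len(fabrics) == 1

if __name__ == "__main__":
    ok = check_sc_vmk_on_one_fabric(TEAMING, VMNIC_FABRIC)
    print("SC/VMkernel traffic pinned to one fabric interconnect:", ok)
```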


Here is where it gets interesting.

Once this was done, possibilities came to mind and I asked the $64,000 question: “Is there a way to keep everything in UCS up and running properly in the event we lose all our northbound links?”  It was more of a theoretical question, but we spent the next six hours working on it anyway.  🙂

Disclaimer: not all of what you are about to read is fully tested.  This was a theoretical exercise that we didn’t finish testing due to time constraints.  We did test this with two hosts on the same subnet and it worked as theorized.

Here’s what we came up with:

First of all, when UCS loses its northbound links, it can behave in two ways.  Via the Network Control Policy – see the screenshot below – the ports can be marked either “link-down” or “warning”.  When northbound ports are marked “link-down”, the various vNICs presented to the blades go down.  This will also kick in fabric failover if it is enabled at the vNIC level.  If you are not using the Fabric Failover feature on a particular vNIC, you can achieve the same functionality by running NIC Teaming drivers at the operating system level.  We are using NIC Teaming at the vSwitch level in vSphere and Fabric Failover for bare-metal operating systems.

Setting the Network Control Policy to “warning” keeps the ports alive as far as the blades are concerned, and no failovers take place.  The beauty of this policy is that it can be applied on a per-vNIC basis, so you can cherry-pick which vNIC gets which policy (link-down or warning).  Using a combination of Network Control Policy settings and vSwitch configurations, it’s possible to keep workloads on UCS up and running, with all servers (virtual or otherwise) communicating even without any external connectivity.  This could be used to prevent massive outages, boot storms following outages, and so on.  In our case, since the bulk of our data center will be on UCS, it basically prevents me from having to restart my data center in the event of a massive network switch outage.
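As a thought experiment, the mechanism can be sketched like this: when every northbound uplink is lost, a vNIC whose Network Control Policy is “link-down” reports its link as down to the blade (triggering fabric failover or NIC teaming), while a vNIC set to “warning” still reports link-up.  The vNIC names and fabric assignments below are placeholders, not our real service profiles.

```python
# Sketch only: how a blade-facing vNIC's perceived link state follows the
# UCS Network Control Policy when ALL northbound uplinks are lost.
# vNIC names and fabric assignments are placeholders.

UPLINKS_UP = {"A": False, "B": False}   # total northbound outage on both fabrics

VNICS = [
    {"name": "vnic-sc-a",  "fabric": "A", "policy": "link-down"},
    {"name": "vnic-sc-b",  "fabric": "B", "policy": "warning"},
    {"name": "vnic-vmk-a", "fabric": "A", "policy": "link-down"},
    {"name": "vnic-vmk-b", "fabric": "B", "policy": "warning"},
]

def perceived_link_state(vnic, uplinks_up):
    """What the blade sees on this vNIC when the uplinks go away."""
    if uplinks_up[vnic["fabric"]]:
        return "up"
    # Uplinks are down on this fabric: the policy decides what the blade sees.
    return "down" if vnic["policy"] == "link-down" else "up (warning raised in UCSM)"

for vnic in VNICS:
    print(f'{vnic["name"]} (fabric {vnic["fabric"]}, {vnic["policy"]}): '
          f'{perceived_link_state(vnic, UPLINKS_UP)}')
```

NIC teaming or fabric failover then shifts traffic onto the “warning” vNICs, which is what keeps blade-to-blade communication flowing through a single fabric interconnect, as the tables below detail.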

Here is a table detailing our vSphere switch configuration:

Port Group / vNIC       Fabric   Teaming Config   Network Control Policy (in UCS)   Network Failover Detection (at vSwitch level)
Service Console NIC1    A        Active           Link-Down                         Link Status Only
Service Console NIC2    B        Standby          Warning                           Link Status Only
VMkernel NIC1           A        Active           Link-Down                         Link Status Only
VMkernel NIC2           B        Standby          Warning                           Link Status Only
Virtual Machine NIC1    A        Active           Link-Down                         Link Status Only
Virtual Machine NIC2    B        Active           Warning                           Link Status Only

As far as bare-metal blades go:

                                  NIC1        NIC2
Fabric                            A           B
Teaming Config                    Active      Active or Standby (depends on app)
Network Control Policy (in UCS)   Link-Down   Warning

Digression: This looks like we are heavily loading up fabric A, which is true from an overall placement point of view.  However, most of our workloads are virtual machines, and that port group is configured active/active, which provides some semblance of load balancing.  We could go active/active for the bare-metal blades as well, since the operative feature for them is the Network Control Policy.  With vSphere, we are trying to keep the Service Console and VMkernel vNICs operating on the same fabric interconnect in order to reduce northbound traffic; that concern does not apply to the bare-metal systems.

Back on track: As previously stated (before the tables), what all this does in effect is force all my blade traffic onto a single fabric interconnect if I lose ALL my northbound links.  Since the ports on fabric B are not marked “link-down”, the blades do not see any network issues and continue communicating normally.


And now the “BUT”: this won’t work completely in my environment because I am connected to two disjoint L2 networks.  See Brad Hedlund’s blog and The Unified Computing blog for more details.  For this to work completely, I will need to put in a software router of some sort to span the two different networks (VLANs in this case).
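For the “software router” piece, I am picturing something as simple as a Linux VM with a leg in each VLAN and IP forwarding turned on.  A rough sketch of that bootstrap follows; the interface name, VLAN IDs, and addresses are made up for illustration and would need to match your own networks.

```python
# Sketch only: bootstrap a Linux VM as a stopgap router between the two
# disjoint L2 networks (modeled here as VLANs 10 and 20 trunked to eth0).
# Interface names, VLAN IDs, and addresses are placeholders. Run as root.
import subprocess

CMDS = [
    # One VLAN sub-interface per L2 network
    "ip link add link eth0 name eth0.10 type vlan id 10",
    "ip link add link eth0 name eth0.20 type vlan id 20",
    "ip addr add 10.10.0.1/24 dev eth0.10",
    "ip addr add 10.20.0.1/24 dev eth0.20",
    "ip link set eth0.10 up",
    "ip link set eth0.20 up",
    # Turn the box into a router
    "sysctl -w net.ipv4.ip_forward=1",
]

for cmd in CMDS:
    subprocess.run(cmd.split(), check=True)
# Hosts on each VLAN would then point routes for the other network at this VM.
```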


So what do you think?  Anyone out there with a lab that can fully test this?  If so, I would be interested in seeing your results.


Week One of Cisco UCS Implementation Complete

July 5, 2010

The first week of Cisco UCS implementation has passed.  I wish I could say we were 100% successful, but I can’t.  We’ve encountered two sticking points which are requiring some rethinking on our part.

The first problem we have run into revolves around our SAN.  The firmware on our MDS switches is a bit out of date, and we’ve encountered a display bug in the graphical SAN management tool (Fabric Manager).  The display bug won’t show our UCS components as “zoneable” addresses, which means that all SAN configuration relating to UCS has to be done via the command line.  Why don’t we just update our SAN switch firmware?  That would also entail updating the firmware on our storage arrays, and that is not something we are prepared to do right now.  It may end up happening sooner rather than later if doing everything from the command line proves too cumbersome.
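Doing the zoning from the CLI is at least scriptable.  Here is a rough sketch of the kind of SSH session I mean, using Python and paramiko; the switch name, credentials, VSAN, pWWNs, and zone/zoneset names are all placeholders, and it assumes basic (not enhanced) zoning mode.

```python
# Sketch only: pushing a UCS vHBA pWWN into an MDS zone over SSH, since our
# Fabric Manager version won't show the UCS ports as zoneable addresses.
# Host, credentials, VSAN, WWNs, and zone names below are placeholders.
import time
import paramiko

MDS_HOST = "mds-a.example.local"     # placeholder switch
VSAN = 100                           # placeholder VSAN
ZONE_CMDS = [
    "configure terminal",
    f"zone name ucs-blade1-vhba0 vsan {VSAN}",
    "member pwwn 20:00:00:25:b5:00:00:0a",   # blade vHBA (placeholder)
    "member pwwn 50:06:01:60:41:e0:00:11",   # array target port (placeholder)
    f"zoneset name fabric-a vsan {VSAN}",
    "member ucs-blade1-vhba0",
    f"zoneset activate name fabric-a vsan {VSAN}",
    "end",
    "copy running-config startup-config",
]

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(MDS_HOST, username="admin", password="changeme")
shell = client.invoke_shell()
for cmd in ZONE_CMDS:
    shell.send((cmd + "\n").encode())
    time.sleep(1)        # crude pacing; real tooling should wait for the prompt
print(shell.recv(65535).decode(errors="replace"))
client.close()
```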

The second problem involves connecting to two separate L2 networks.  This has been discussed on various blogs such as BradHedlund.com and the Unified Computing Blog.  Suffice it to say that we have proven that UCS was not designed to connect directly to two different L2 networks at the same time.  While there is a forthcoming firmware update that will address this, it does not help us now.  I should clarify that this is not a bug and that UCS is working as designed.  My guess is that either Cisco’s engineers did not think customers would want to connect to two L2 networks, or it was simply a future roadmap feature.  Either way, we are working on methods to get around the problem.

For those who didn’t click the links to the other blogs, here’s a short synopsis: UCS basically treats all uplink ports equally.  It doesn’t know about the different networks, so it assumes any VLAN can be reached on any uplink port.  The usual L2 behaviors (ARPs, broadcasts, flooding) all come into play here.  If you want a better description, please click the links in the previous paragraph.

But the entire week was not wasted, and we managed to accomplish quite a bit.  Once we get past the two hurdles mentioned above, we should be able to commence our testing.  It’s actually quite a bit of work to get this far.  Here’s how it pans out:

  1. Completed setup of policies
  2. Completed setup of Service Profile Templates
  3. Successfully deployed a number of different server types based on Service Profiles and Server Pool Policy Qualifications
  4. Configured our VM infrastructure to support Palo
  5. Configured UCS to support our VM infrastructure
  6. Successfully integrated UCS into our Windows Deployment system

Just getting past numbers 1 and 2 was a feat.  There are a number of policies you can set, so it is very easy to go overboard and create or modify far too many.  The more you create, the more you have to manage, and we are trying to follow the K.I.S.S. principle as much as possible.  We started out with too many policies, but eventually came to our senses and whittled the number down.

One odd item to note: when creating vNIC templates, a corresponding port profile is created under the VM tab of UCS Manager.  Deleting vNIC templates does not delete the corresponding port profiles so you will have to manually delete them.  Consistency would be nice here.
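If the manual cleanup gets tedious, it should be scriptable against the UCSM XML API.  Below is a hedged sketch using the ucsmsdk Python module; the class IDs I query (“vnicLanConnTempl” and “vnicProfile”) and the assumption that port profile names mirror template names are things you would want to verify against your own UCSM before deleting anything.

```python
# Sketch only: list (and optionally delete) port profiles left behind after
# their vNIC templates were removed. The class IDs and the name-matching
# logic are assumptions -- verify against your UCSM version first.
from ucsmsdk.ucshandle import UcsHandle

handle = UcsHandle("ucsm.example.local", "admin", "changeme")   # placeholders
handle.login()
try:
    templates = {t.name for t in handle.query_classid("vnicLanConnTempl")}
    profiles = handle.query_classid("vnicProfile")
    orphans = [p for p in profiles if p.name not in templates]
    for p in orphans:
        print("orphaned port profile:", p.dn)
        # handle.remove_mo(p); handle.commit()   # uncomment once verified
finally:
    handle.logout()
```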

And finally, now that we have a complete rack of UCS, I can show you just how “clean” the system looks.

Before

The cabling on a typical rack

After

A full rack of UCS - notice the clean cabling


Let’s hope week number two gets us into testing mode…..


DataCenter Prep Complete for Cisco UCS

We completed our data center preparations a few weeks ago.  I knew we were getting some higher-end power distribution units (PDUs) for the server cabinets, but I did not realize how big these items would be.  We went with APC7866 PDUs for the following reasons:

  1. Two of them provide the requisite number of outlets that we need for an entire rack of UCS.
  2. The PDUs can be monitored via SNMP.  This cabinet is going to be very dense in terms of compute power, so we’ll need to stay on top of environmental and power conditions (see the polling sketch after this list).
  3. Our Cisco SE recommended it.  Nothing like having equipment recommended by your vendors.
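Since I mentioned SNMP in item 2, here is a minimal polling sketch using pysnmp.  The hostname, community string, and the APC PowerNet-MIB load OID are assumptions from memory, so check them against your PDU’s MIB before trusting the numbers.

```python
# Sketch only: poll one APC rack PDU for its per-phase load over SNMP.
# Host, community string, and the PowerNet-MIB load OID are assumptions --
# verify against your PDU before relying on the values.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, nextCmd)

PDU_HOST = "pdu-rack1-a.example.local"               # placeholder
LOAD_TABLE_OID = "1.3.6.1.4.1.318.1.1.12.2.3.1.1.2"  # rPDULoadStatusLoad (assumed)

for (err_ind, err_stat, err_idx, var_binds) in nextCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),          # SNMP v2c, placeholder community
        UdpTransportTarget((PDU_HOST, 161)),
        ContextData(),
        ObjectType(ObjectIdentity(LOAD_TABLE_OID)),
        lexicographicMode=False):
    if err_ind or err_stat:
        print("SNMP error:", err_ind or err_stat.prettyPrint())
        break
    for oid, value in var_binds:
        # PowerNet-MIB reports load in tenths of amps.
        print(f"{oid} = {int(value) / 10.0} A")
```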

Below is a photo of the power outlet from our data center PDU:

As you can see, that goes from my colleague’s wrist to his elbow.   Now here’s a photo of the two PDUs (data center and server cabinet) connected:

Together, the two connectors are almost as long as a grown man’s arm!  Just for grins, here is what the server cabinet PDU plug looks like:

Those are some beefy-looking prongs.  They need to be, since we are looking at 3-phase, 240-volt, 60-amp power.

What are you using for your UCS implementation?


Prepping for our Cisco UCS Implementation

The purchase order has finally been sent in.  This means our implementation is really going to happen.  We’ve been told there is a three-week lead time to get the product, but Cisco is looking to reduce it to two weeks.  A lot has to happen before the first package arrives.  Two logistical items of note are:

  • Stockroom prep
  • Datacenter prep

What do I mean by “stockroom prep”?  A lot, actually.  While not a large UCS implementation by many standards, we are purchasing a fair amount of equipment.  We’ve contacted Cisco for various pieces of logistical information, such as box dimensions and the number of boxes we can expect to receive.  Once it gets here, we have to store it.

Our stockroom is maybe 30×40 and houses all our non-deployed IT equipment.  It also houses all our physical-layer products (think cabling).  A quick look at the area dedicated to servers reveals parts for servers going back almost ten years.  Yes, I have running servers that are nearly ten years old <sigh>.  Throw in generic equipment such as KVMs, rackmount monitors, rackmount keyboards, etc., and it adds up.  Our plan is to review our existing inventory of deployed equipment and their service histories.  We’ll then compare that information against our stockroom inventory to see what can be sent to disposal.  Since we don’t have a lot of room, we’ll be cutting right down to the bone, which introduces an element of risk.  If we plan correctly, we’ll have the minimum number of parts in our stockroom to get us through the migration.  If we are wrong and something fails, I guess we’ll be buying some really old parts off eBay…

As for prepping the data center, it’s a bit less labor but a lot more complex.  Our data center PDUs are almost full, so we’ll be doing some re-wiring.  As a side note, the rack PDU recommended by our Cisco SE has an interesting connector, to say the least.  These puppies run about $250 each.  The PDUs run over $1,200 each.  Since we’ll be running two 42U racks of equipment, that equals four of each component.  That’s almost $6K in power equipment!

As another data center prep task, we will need to do some server shuffling.  Servers in rack A will need to move to a different rack.  No biggie, but it takes some effort to pre-cable, schedule the downtime, and then execute the move.

All-in-all, a fair amount of work to do in a short time-frame.