Can UCS Survive a Network Outage?
Part of our UCS implementation involved the use of Cisco Advanced Services (AS) to help with the initial configuration and testing. Due to our integration issues, time ran out and we never completed some items in our implementation plan. AS was back out this week for a few days to complete their portion of the plan. Due to timing, we worked with a different AS engineer this time. He performed a health check of our UCS environment and suggested a vSphere configuration change to help improve performance.
Before I get into what we changed, let me give a quick background on our vSphere configuration. We are using the B250-M2 blade with a single Palo adapter. We are not taking advantage of the advanced vNIC capabilities of the Palo adapter. That is, we are not assigning a vNIC to each guest and using dvSwitches. Instead, we are presenting two vNICs for the Service Console, two vNICs for the VMkernel, and two vNICs for virtual machines, and using them as we would on a standard rackmount server. Each vSwitch is configured with one vNIC from fabric A and one vNIC from fabric B, teamed together in an active/active configuration.
Recommended Change: Instead of active/active teaming, set the service console and VMkernel ports to active/standby. When doing this, ensure that the active NICs are all on the same fabric interconnect. This will keep service console/VMkernel traffic from having to hit our northbound switches and keep the traffic isolated to a single fabric interconnect.
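To make the recommendation concrete, here is a minimal sketch that models the teaming layout and checks that every active uplink sits on the same fabric interconnect. It is only a model, not a vSphere or UCS API call. The `vmnic` names and the choice of fabric A for the active side are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uplink:
    name: str    # hypothetical vmnic label, not taken from our hosts
    fabric: str  # "A" or "B"
    role: str    # "active" or "standby"


# Assumed layout after the recommended change: active on A, standby on B.
teaming = {
    "Service Console": [Uplink("vmnic0", "A", "active"),
                        Uplink("vmnic1", "B", "standby")],
    "VMkernel": [Uplink("vmnic2", "A", "active"),
                 Uplink("vmnic3", "B", "standby")],
}


def active_fabrics(teaming):
    """Return the set of fabrics that carry an active uplink."""
    return {u.fabric
            for uplinks in teaming.values()
            for u in uplinks
            if u.role == "active"}


def stays_local(teaming):
    """True when all active uplinks share one fabric interconnect."""
    return len(active_fabrics(teaming)) == 1


print(stays_local(teaming))  # True
```

If one port group had its active uplink on fabric B instead, `stays_local` would return `False`, which is the case where Service Console/VMkernel traffic between hosts has to cross the northbound switches.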
Here is where it gets interesting.
Once this was done, possibilities came to mind and I asked the $64,000 question: “Is there a way to keep everything in UCS up and running properly in the event we lose all our northbound links?” It was more of a theoretical question, but we spent the next six hours working on it anyway. :)
Disclaimer: not all of what you are about to read is fully tested. This was a theoretical exercise that we didn’t finish testing due to time constraints. We did test this with two hosts on the same subnet and it worked as theorized.
Here’s what we came up with:
First of all, when UCS loses its northbound links it can behave in two ways. Via the Network Control Policy (see screenshot below), the ports can be marked either “link-down” or “warning”. When northbound ports are marked “link-down”, the vNICs presented to the blades go down. This also triggers fabric failover if it is enabled at the vNIC level. If you are not using the Fabric Failover feature on a particular vNIC, you can achieve the same functionality by running NIC teaming drivers at the operating system level. We are using NIC teaming at the vSwitch level in vSphere and Fabric Failover for bare metal operating systems.
Setting the Network Control Policy to “warning” keeps the ports alive as far as the blades are concerned, and no failovers take place. The beauty of this policy is that it can be applied on a per-vNIC basis, so you can cherry-pick which vNIC is affected by which policy (“link-down” or “warning”). Using a combination of the Network Control Policy settings and vSwitch configurations, it’s possible to keep workloads on UCS up and running, with all servers (virtual or otherwise) communicating without any external connectivity. This could be used to prevent massive outages, boot storms due to outages, etc. In our case, since the bulk of our data center will be on UCS, it basically prevents me from having to restart my data center in the event of a massive network switch outage.
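The two policy behaviors can be summarized in a few lines of Python. This is a toy model of what the blade sees, under the assumption that "all northbound links lost" is the only failure being considered. The function name and arguments are made up for illustration and are not part of any UCS tooling.

```python
def vnic_link_state(policy, northbound_up):
    """Return the link state a blade sees for one vNIC.

    policy: "link-down" or "warning" (the Network Control Policy action)
    northbound_up: True while the fabric interconnect still has uplinks
    """
    if policy not in ("link-down", "warning"):
        raise ValueError(f"unknown policy: {policy!r}")
    if northbound_up:
        return "up"
    # Uplinks are gone: "link-down" drops the vNIC, "warning" keeps it alive.
    return "down" if policy == "link-down" else "up"


for policy in ("link-down", "warning"):
    print(policy, "->", vnic_link_state(policy, northbound_up=False))
```

With the uplinks gone, only the “link-down” vNIC reports a failure, so only that vNIC triggers teaming or fabric failover.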
Here is a table detailing our vSphere switch configuration:
|Port Group|Service Console NIC1|Service Console NIC2|VMkernel NIC1|VMkernel NIC2|Virtual Machine NIC1|Virtual Machine NIC2|
|Network Control Policy (in UCS)|Link-Down|Warning|Link-Down|Warning|Link-Down|Warning|
|Network Failover Detection (at vSwitch level)|Link Status Only|Link Status Only|Link Status Only|Link Status Only|Link Status Only|Link Status Only|
As far as bare metal blades go:
|Teaming Config|Active|Active or Standby (depends on app)|
|Network Control Policy (in UCS)|Link-Down|Warning|
Digression: This looks like we are heavily loading up Fabric A, which is true from an overall placement point of view. However, most of our workloads are virtual machines, and the virtual machine vSwitch is configured for active/active, which provides some semblance of load balancing. We could go active/active for bare metal blades, since the operative feature for them is the Network Control Policy. With vSphere, we are trying to keep the Service Console and VMkernel vNICs operating on the same fabric interconnect in order to reduce northbound traffic. That is not a concern with bare metal systems.
Back on track: As previously stated (before the tables), what all this does in effect is force all my blade traffic onto a single fabric interconnect if I lose ALL my northbound links. Since the ports on fabric B are not marked “link-down”, the blades do not see any network issues on those vNICs and continue communicating normally.
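Here is a small sketch of that outcome using the vSphere table above. It assumes, based on the description in this post, that NIC1 of each pair is on fabric A and NIC2 is on fabric B. The data structure and function are hypothetical and only model which links the host still sees as up.

```python
# Assumption: NIC1 = fabric A with "link-down", NIC2 = fabric B with "warning".
VNICS = {
    "Service Console": [("NIC1", "A", "link-down"), ("NIC2", "B", "warning")],
    "VMkernel":        [("NIC1", "A", "link-down"), ("NIC2", "B", "warning")],
    "Virtual Machine": [("NIC1", "A", "link-down"), ("NIC2", "B", "warning")],
}


def surviving_paths(vnics, northbound_up):
    """Map each port group to the (vNIC, fabric) pairs the host sees as up.

    With link-status-only failover detection, the vSwitch keeps using
    whichever uplinks still report link.
    """
    return {
        group: [(name, fabric)
                for name, fabric, policy in nics
                if northbound_up or policy == "warning"]
        for group, nics in vnics.items()
    }


outage = surviving_paths(VNICS, northbound_up=False)
fabrics = {fabric for paths in outage.values() for _, fabric in paths}
print(outage)
print(fabrics)  # {'B'}
```

During the simulated outage every port group is left with exactly one usable uplink, and all of them land on fabric B, so host-to-host traffic can be switched locally by that one fabric interconnect.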
And now the “BUT”: This won’t work completely in my environment because I am connected to two disjoint L2 networks. See Brad Hedlund’s blog and The Unified Computing blog for more details. For this to work completely, I will need to put in a software router of some sort to span the two different networks (VLANs in this case).
So what do you think? Anyone out there with a lab that can fully test this? If so, I would be interested in seeing your results.