Archive

Posts Tagged ‘Strategy’

Book Review: Enterprise Network Testing

May 6, 2011

For my next trick, I am going to review another Cisco Press book titled “Enterprise Network Testing”.  I think I can sum this up in two sentences:  “Holy Crap!” and  “This book has PLENTY of cowbell!”.

Now, I am not currently a network guy by profession, but if I were, this book would be on my desk, with copies on my teammates’ desks too. It is literally THE blueprint for how to test your network.

The journey into network testing begins with a discussion on why you need to test your network.  Most people only think of one or two reasons.  This book provides a few more to help you make your business case.  BTW, the authors make it very clear that testing your network is not a one-time event.  Testing should be done whenever changes are made, for compliance, introduction of new technologies, etc.  In other words, plan on testing regularly.

One area where this book and I completely agree is where testing should first take place: in the lab. There is a whole chapter devoted to lab strategy. Topics covered include staffing, facilities planning, test methodologies, power, and more. I must say that I was surprised at how good this chapter turned out to be. Most books give basic guidance on lab setup, but like I said at the beginning of this review, this book has plenty of cowbell.

So now you have your lab set up. What are you going to do? Simple: read this book, because it provides guidance for “crafting the test approach” (the actual chapter title). Briefly, this chapter discusses several reasons/objectives for testing and how to craft your strategies to set yourself up for success. This includes setting your test targets, choosing the tools you are going to use, writing a test plan, allocating resources, etc. It’s a very well-thought-out approach.

Business case approved? Check.  Lab resources allocated? Check.  Test plan created? Check.  Great, now go execute your plan.  Need help?  No problem, this book will walk you through a sample lab setup, finding the appropriate tools, and a few different methodologies for measuring different network characteristics.  This is the point in the book where the authors stress the need to understand what you are testing, the tools you are using, and how to interpret the results.  In other words, if you don’t know what you are doing you will not be successful.

Speaking of knowing your tools, this book does a credible job discussing network toolsets that are available for free and for purchase. Even non-Cisco products are covered, which is something I am not used to seeing in a Cisco Press book. Usually, these books are oblivious to other companies’ products. Kudos to the authors for being thorough.

The next six chapters are where you will find plenty of test case examples.  There are individual chapters devoted to six types of testing.  They are: Proof of concept testing, network readiness testing, design verification testing, migration plan testing, new platform and code certification testing, and network ready for use testing.  They are written in a case study format and are quite readable.

Nerdgasm time. This is where the book gets hairy. Are you too lazy to develop your own plans from scratch? Want to cheat? Just borrow the DETAILED test plans in the next seven chapters. There is enough meat here that Cisco Press could copy and paste it into a shorter book to sell. We are talking over 200 pages of test plans covering seven areas. That’s a lot of cowbell!

The book ends on a high note. Since you went through the trouble of setting up a lab, why not use it for training/learning purposes? Step-by-step instructions are provided to set up a lab. This chapter may not be useful to a large number of folks, since the equipment covered is pure Cisco, including UCS. In fact, many of the directions provided center around setting up a UCS environment. I happen to like this chapter because one of my last major implementations before joining VCE was installing UCS for the organization at which I worked. Sort of brings back memories.

To sum this review up:  If you are in the network field, you need this book.

Use Cases for Cisco UCS Network Isolation

October 4, 2010

Based on my last post, a couple of people have emailed me asking, “what is the value of keeping UCS links alive when the network has gone down?”  The answer is: It Depends.  It depends on your applications and environment.  In my case, I have a number of multi-tiered apps that are session oriented, batch processors, etc.

The easiest use case to describe involves batch processing. We have a few applications that do batch processing late at night. It just so happens that “late at night” is also the window for performing network maintenance. When the two collide, we either reschedule something (batch or maintenance), take down the application, or move forward and hope nothing goes awry. Having these applications in UCS, and taking advantage of the configuration in my previous post, means I can do network maintenance without having to reschedule batch jobs or take down the application.

I could probably achieve similar functionality outside of UCS with a complex setup that makes use of multiple switches and NIC teaming drivers at the OS level. However, some of my servers use all of their physical NICs for different purposes, with different IP addresses. In these cases, teaming drivers may add unnecessary complexity. Not to mention that the premise of this use case is the ability to do network maintenance. Any way to avoid relying on the network is a plus in my book with regard to this use case.

Now let’s consider session-oriented applications. In our case, we have a multi-tiered app that requires open sessions to be maintained from one tier to the next. If there is a hiccup in the connection, the session closes and the app has to be restarted. Typically, this means rebooting. Fabric failover prevents the session from closing, so the app keeps running. In this particular case, UCS isolation would not let this app do any useful work, since no clients will be able to get to it. Where it helps us is in restoring service faster when the network comes back, because no reboot is needed.

I am going to guess that this can be done with other vendors’ blade systems, but with additional equipment. What I mean is that with other blade systems, the unit of measure is the chassis. You can probably configure the internal switches to pass traffic from one blade to another without having to go to another set of switches. But if you need a blade in chassis A to talk to a blade in chassis B, you will probably need to involve an additional switch or two, mounted either top-of-rack or end-of-row. In the case of UCS, the unit of measure is the fabric. Any blade can communicate with any other blade, provided they are in the same VLAN, assuming End-Host Virtualization (EHV) mode. Switch mode may offer more features, but I am not versed in it.

I hope this post answers your questions.  I am still thinking over the possibilities that UCS isolation can bring to the table.  BTW, I made up the term “UCS isolation”.  If anyone has an official term, or one that better describes the situation, please let me know.

Categories: cisco, UCS

Can UCS Survive a Network Outage?

September 29, 2010

Part of our UCS implementation involved the use of Cisco Advanced Services (AS) to help with the initial configuration and testing. Due to our integration issues, time ran out and we never completed some items in our implementation plan. AS was back out this week for a few days to complete their portion of the plan. Due to timing, we worked with a different AS engineer this time. He performed a health check of our UCS environment and suggested a vSphere configuration change to help improve performance.

Before I get into what we changed, let me give a quick background on our vSphere configuration. We are using the B250-M2 blade with a single Palo adapter. We are not taking advantage of the advanced vNIC capabilities of the Palo adapter. By that I mean we are not assigning a vNIC to each guest and using dvSwitches. Instead, we are presenting two vNICs for the Service Console, two vNICs for the VMkernel, and two vNICs for virtual machines, and using them as we would on a standard rackmount server. Each vSwitch is configured with one vNIC from fabric A and one vNIC from fabric B, teamed together in an active/active configuration.

Recommended Change: Instead of active/active teaming, set the service console and VMkernel ports to active/standby.  When doing this, ensure that the active NICs are all on the same fabric interconnect.  This will keep service console/VMkernel traffic from having to hit our northbound switches and keep the traffic isolated to a single fabric interconnect.
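Why does pinning the active NICs to one fabric keep traffic off the northbound switches? Here is a minimal Python sketch of that reasoning. It assumes, as the recommendation implies, that two hosts' traffic stays inside one fabric interconnect only when both transmit on the same fabric, and that with active/active teaming either uplink might be chosen. The vmnic names are hypothetical, and the model ignores load-balancing hashes and failover.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Uplink:
    name: str    # hypothetical vmnic name as seen by the host
    fabric: str  # "A" or "B": the fabric interconnect this vNIC is pinned to


@dataclass(frozen=True)
class Teaming:
    active: tuple[Uplink, ...]
    standby: tuple[Uplink, ...] = ()


def possible_fabrics(teaming: Teaming) -> set[str]:
    """Fabrics a port group may transmit on while every link is healthy.

    Only active uplinks carry traffic; standby uplinks sit idle until failover.
    """
    return {uplink.fabric for uplink in teaming.active}


def may_cross_fabrics(host1: Teaming, host2: Teaming) -> bool:
    """True if some uplink choice puts the two hosts on different fabric
    interconnects, forcing their traffic through the northbound switches."""
    return any(
        f1 != f2
        for f1, f2 in product(possible_fabrics(host1), possible_fabrics(host2))
    )


nic_a = Uplink("vmnic2", "A")
nic_b = Uplink("vmnic3", "B")

active_active = Teaming(active=(nic_a, nic_b))
active_standby = Teaming(active=(nic_a,), standby=(nic_b,))

print(may_cross_fabrics(active_active, active_active))    # True
print(may_cross_fabrics(active_standby, active_standby))  # False
```

The active/standby case comes out False only because both hosts have their active vNIC on fabric A. Flip one host's active NIC to fabric B and the traffic can leave the fabric again, which is why the recommendation insists the active NICs share a fabric interconnect.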


Here is where it gets interesting.

Once this was done, possibilities came to mind and I asked the $64,000 question: “Is there a way to keep everything in UCS up and running properly in the event we lose all our northbound links?” It was more of a theoretical question, but we spent the next six hours working on it anyway. :)

Disclaimer: not all of what you are about to read is fully tested.  This was a theoretical exercise that we didn’t finish testing due to time constraints.  We did test this with two hosts on the same subnet and it worked as theorized.

Here’s what we came up with:

First of all, when UCS loses its northbound links, it can behave in one of two ways. Via the Network Control Policy (see screenshot below), the ports can be marked either “link-down” or “warning”. When northbound ports are marked “link-down”, the various vNICs presented to the blades go down. This will also trigger fabric failover, if it is enabled at the vNIC level. If you are not using the Fabric Failover feature on a particular vNIC, you can achieve the same functionality by running NIC teaming drivers at the operating system level. We are using NIC teaming at the vSwitch level in vSphere and Fabric Failover for bare metal operating systems.

Setting the Network Control Policy to “warning” keeps the ports alive as far as the blades are concerned, and no failovers take place. The beauty of this policy is that it can be applied on a per-vNIC basis, so you can cherry-pick which vNIC is affected by which policy (link-down or warning). Using a combination of Network Control Policy settings and vSwitch configurations, it’s possible to keep workloads on UCS up and running, with all servers (virtual or otherwise) communicating without any external connectivity. This could be used to prevent massive outages, boot storms due to outages, etc. In our case, since the bulk of our data center will be on UCS, it basically saves me from having to restart my data center in the event of a massive network switch outage.

Here is a table detailing our vSphere switch configuration:

| Port Group | Service Console NIC1 | Service Console NIC2 | VMkernel NIC1 | VMkernel NIC2 | Virtual Machine NIC1 | Virtual Machine NIC2 |
|---|---|---|---|---|---|---|
| Fabric | A | B | A | B | A | B |
| Teaming Config | Active | Standby | Active | Standby | Active | Active |
| Network Control Policy (in UCS) | Link-Down | Warning | Link-Down | Warning | Link-Down | Warning |
| Network Failover Detection (at vSwitch level) | Link Status Only | Link Status Only | Link Status Only | Link Status Only | Link Status Only | Link Status Only |
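To check that the table really drains everything onto fabric B when the uplinks die, here is a small Python model of it. The sketch assumes two behaviors taken from the post: a “warning” vNIC stays up even after its fabric loses its northbound links, while a “link-down” vNIC goes down. It also assumes a simplified “Link Status Only” failover: use the active vNICs that are up, otherwise fall back to standby. It does not model failback, beacon probing, or actual frame forwarding, and like the post's own test it is not a substitute for a lab run.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VNic:
    name: str
    fabric: str  # "A" or "B"
    policy: str  # Network Control Policy action: "link-down" or "warning"


def link_up(vnic: VNic, northbound_up: dict[str, bool]) -> bool:
    """Link state the blade sees for this vNIC.

    If the vNIC's fabric still has northbound connectivity, the link is up.
    Otherwise "link-down" drops the vNIC, while "warning" keeps it up.
    """
    return northbound_up[vnic.fabric] or vnic.policy == "warning"


def carrying_uplinks(active, standby, northbound_up):
    """Simplified 'Link Status Only' failover: use every active vNIC whose
    link is up; only if none are up, fall back to standby vNICs that are up."""
    up_active = [v for v in active if link_up(v, northbound_up)]
    if up_active:
        return up_active
    return [v for v in standby if link_up(v, northbound_up)]


# The vSphere table above, as (active vNICs, standby vNICs) per port group.
PORT_GROUPS = {
    "Service Console": (
        [VNic("SC NIC1", "A", "link-down")],
        [VNic("SC NIC2", "B", "warning")],
    ),
    "VMkernel": (
        [VNic("VMkernel NIC1", "A", "link-down")],
        [VNic("VMkernel NIC2", "B", "warning")],
    ),
    "Virtual Machine": (
        [VNic("VM NIC1", "A", "link-down"), VNic("VM NIC2", "B", "warning")],
        [],
    ),
}


def fabrics_in_use(northbound_up: dict[str, bool]) -> dict[str, list[str]]:
    """Which fabric(s) each port group transmits on for a given uplink state."""
    return {
        group: sorted({v.fabric for v in carrying_uplinks(active, standby, northbound_up)})
        for group, (active, standby) in PORT_GROUPS.items()
    }


print(fabrics_in_use({"A": True, "B": True}))
# {'Service Console': ['A'], 'VMkernel': ['A'], 'Virtual Machine': ['A', 'B']}
print(fabrics_in_use({"A": False, "B": False}))
# {'Service Console': ['B'], 'VMkernel': ['B'], 'Virtual Machine': ['B']}
```

With all northbound links lost, every port group ends up on fabric B. The mix of policies, not the teaming alone, produces that result: if every vNIC were set to “link-down”, `carrying_uplinks` would return an empty list for each port group.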

As for bare metal blades:

| Setting | NIC1 | NIC2 |
|---|---|---|
| Fabric | A | B |
| Teaming Config | Active | Active or Standby (depends on app) |
| Network Control Policy (in UCS) | Link-Down | Warning |

Digression: This looks like we are heavily loading up Fabric A, which is true from an overall placement point of view. However, most of our workloads are in VMs, whose port group is configured for active/active, thus providing some semblance of load balancing. We could go active/active for bare metal blades, since the operative feature for them is the Network Control Policy. With vSphere, we are trying to keep the Service Console and VMkernel vNICs operating on the same fabric interconnect in order to reduce northbound traffic. Not so with bare metal systems.

Back on track: As previously stated (before the tables), what all this does, in effect, is force all my blade traffic onto a single fabric interconnect in case I lose ALL my northbound links. Since the ports on fabric B are not marked “link-down”, the blades do not see any network issues and continue communicating normally.


And now the “BUT”: this won’t work completely in my environment, because I am connected to two disjoint L2 networks. See Brad Hedlund’s blog and The Unified Computing blog for more details. For this to work completely, I will need to put in a software router of some sort to span the two different networks (VLANs in this case).
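The reason a router is needed can be shown with a simple reachability check. This Python sketch assumes, as the paragraph implies, that once the northbound switches are gone nothing inside UCS routes between VLANs. In that case two blades can talk only if they share a VLAN or if some router joins their VLANs. The blade names and VLAN IDs are made up for illustration.

```python
from collections import deque


def reachable(src, dst, blade_vlans, routed_vlan_groups=()):
    """Return True if blade `src` can reach blade `dst` inside an isolated fabric.

    blade_vlans maps each blade to the set of VLAN IDs it has a vNIC on.
    routed_vlan_groups lists, per router, the VLANs it has an interface in.
    Without a router, a blade only reaches blades that share one of its VLANs.
    """
    # VLAN-to-VLAN adjacency: a router joins every VLAN it is attached to.
    adjacency = {}
    for group in routed_vlan_groups:
        for vlan in group:
            adjacency.setdefault(vlan, set()).update(set(group) - {vlan})

    # Breadth-first search over VLANs, starting from every VLAN the source is on.
    seen = set(blade_vlans[src])
    queue = deque(seen)
    while queue:
        vlan = queue.popleft()
        for neighbor in adjacency.get(vlan, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return not seen.isdisjoint(blade_vlans[dst])


# Hypothetical blades on two disjoint L2 networks.
blades = {"web01": {100}, "app01": {100}, "db01": {200}}

print(reachable("web01", "app01", blades))                          # True
print(reachable("web01", "db01", blades))                           # False
print(reachable("web01", "db01", blades, [frozenset({100, 200})]))  # True
```

The third call stands in for the software router: one device with an interface in each disjoint network is enough to reconnect them while the rest of the network is down.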


So what do you think? Is anyone out there with a lab able to fully test this? If so, I would be interested in seeing your results.


VMworld voting has begun

The folks who run the VMworld tradeshow have taken a new path to deciding session content this year.  Instead of a panel making all the choices, the powers-that-be have decided to open it up to the public to vote on those topics that interest them.  I’ve submitted a proposal to speak about our soon-to-be UCS implementation and how it is going to transform our business.  If interested, please go here:  http://vmworld.com/community/conferences/2010/cfpvote/tarchitecture and vote for my session.  My session is entitled:

Session ID: TA7081, Session Title: Case Study: vSphere on Cisco UCS – How the City of Mesa changed strategic direction



Prepping for our Cisco UCS Implementation

The purchase order has finally been sent in. This means our implementation is really going to happen. We’ve been told there is a three-week lead time to get the product, but Cisco is looking to reduce it to two weeks. A lot has to happen before the first package arrives. Two logistical items of note are:

  • Stockroom prep
  • Datacenter prep

What do I mean by “Stockroom prep?”  A lot actually.  While not a large UCS implementation by many standards, we are purchasing a fair amount of equipment.  We’ve contacted Cisco for various pieces of logistical information such as box dimensions and the  number of boxes we can expect to receive.   Once it gets here, we have to store it.

Our stockroom is maybe 30×40 and houses all our non-deployed IT equipment. It also houses all our physical layer products (think cabling). A quick look at the area dedicated to servers reveals parts for servers going back almost ten years. Yes, I have running servers that are nearly ten years old <sigh>. Throw in generic equipment such as KVMs, rackmount monitors, rackmount keyboards, etc., and it adds up. Our plan is to review our existing inventory of deployed equipment and their service histories. We’ll then compare that info against our stockroom inventory to see what can be sent to disposal. Since we don’t have a lot of room, we’ll be cutting down to the bone, which introduces an element of risk. If we plan correctly, we’ll have a minimum number of parts in our stockroom to get us through our migration. If we are wrong and something fails, I guess we’ll be buying some really old parts off eBay…

As for prepping the data-center, it’s a bit less labor but a lot more complex.  Our data-center PDUs are almost full so we’ll be doing some re-wiring.  As a side note, the rack PDU recommended by our Cisco SE has an interesting connector to say the least.  These puppies run about $250 each.  The PDUs run over $1200 each.   Since we’ll be running two 42U racks of equipment, that equals four of each component.  That’s almost $6K in power equipment!!
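A quick sanity check of the “almost $6K” figure, using only the numbers in the paragraph above. The per-rack count of two is inferred from “two 42U racks … four of each component”. Since the PDUs run “over” $1200, the result is a floor, not an exact total.

```python
CONNECTOR_COST = 250  # "about $250 each"
PDU_COST = 1200       # "over $1200 each", so this estimate is a floor
RACKS = 2
PDUS_PER_RACK = 2     # inferred: two racks -> "four of each component"

units = RACKS * PDUS_PER_RACK
total = units * (CONNECTOR_COST + PDU_COST)
print(f"{units} PDUs + {units} connectors = ${total:,}")  # 4 PDUs + 4 connectors = $5,800
```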

As another data-center prep task, we will need to do some server shuffling.  Servers in rack A will need to move to a different rack.  No biggie, but it takes some effort to pre-cable, schedule the downtime, and then execute the move.

All-in-all, a fair amount of work to do in a short time-frame.