Archive

Posts Tagged ‘Cisco UCS’

Guest Post on “Define the Cloud”

December 8, 2010 Leave a comment

I haven’t posted anything in a while, but not because I have been lazy…well, maybe a little bit. The main reason is that I have been working on a guest post over at DefinetheCloud.net. It’s a good site with a lot of great technical info. A direct link to my post is here. Enjoy the read.

 

Categories: cisco

Use Cases for Cisco UCS Network Isolation

October 4, 2010 Leave a comment

Based on my last post, a couple of people have emailed me asking, “What is the value of keeping UCS links alive when the network has gone down?”  The answer is: it depends.  It depends on your applications and environment.  In my case, I have a number of multi-tiered apps that are session-oriented, batch processors, etc.

The easiest use case to describe involves batch processing.  We have a few applications that do batch processing late at night.  It just so happens that “late at night” is also the window for performing network maintenance.  When the two bump (batch jobs and maintenance), we either reschedule something (batch or maintenance), take down the application, or move forward and hope nothing goes awry.  Having these applications in UCS and taking advantage of the configuration in my previous post  means I can do network maintenance without having to reschedule batch jobs, or take down the application.

I could probably achieve similar functionality outside of UCS with a complex setup that uses multiple switches and NIC teaming drivers at the o/s level. However, some of my servers use all of their physical NICs for different purposes, with different IP addresses.  In these cases, teaming drivers may add unnecessary complexity.  And since the premise of this use case is being able to do network maintenance, any way to avoid relying on the network is a plus in my book.

Now let’s consider session-oriented applications.  In our case, we have a multi-tiered app that requires that open sessions be maintained from one tier to the next.  If there is a hiccup in the connection, the session closes and the app has to be restarted.  Typically, this means rebooting.  Fabric failover prevents the session from closing, so the app keeps running.  In this particular case, UCS isolation would prevent this app from doing any work since no clients would be able to get to it.  Where it helps us is in restoring service faster when the network comes back, since no reboot is needed.

I am going to guess that this can be done with other vendors’ blade systems, but with additional equipment.  What I mean is that with other blade systems, the unit of measure is the chassis.  You can probably configure the internal switches to pass traffic from one blade to another without having to go to another set of switches.  But if you need a blade in chassis A to talk to a blade in chassis B, you will probably need to involve an additional switch, or two, mounted either Top-of-Rack or End-of-Row.  In the case of UCS, the unit of measure is the fabric.  Any blade can communicate with any other blade, provided they are in the same VLAN and assuming EHV mode.  Switch mode may offer more features, but I am not versed in it.

I hope this post answers your questions.  I am still thinking over the possibilities that UCS isolation can bring to the table.  BTW, I made up the term “UCS isolation”.  If anyone has an official term, or one that better describes the situation, please let me know.

Categories: cisco, UCS

Can UCS Survive a Network Outage?

September 29, 2010 1 comment

Part of our UCS implementation involved the use of Cisco Advanced Services (AS) to help with the initial configuration and testing.  Due to our integration issues, time ran out and we never completed some items in our implementation plan.  AS was back out this week for a few days to complete their portion of the plan.  Due to timing, we worked with a different AS engineer this time.  He performed a health check of our UCS environment and suggested a vSphere configuration change to help improve performance.

Before I get into what we changed, let me give a quick background on our vSphere configuration.  We are using the B250-M2 blade with a single Palo adapter.  We are not taking advantage of the advanced vNIC capabilities of the Palo adapter.  What I mean is that we are not assigning a vNIC to each guest and using dvSwitches.  Instead, we are presenting two vNICs for the Service Console, two vNICs for the VMkernel, and two vNICs for virtual machines, and using them as we would on a standard rackmount server.  Each vSwitch is configured with one vNIC from Fabric A and one vNIC from Fabric B, teamed together in an active/active configuration.

Recommended Change: Instead of active/active teaming, set the service console and VMkernel ports to active/standby.  When doing this, ensure that the active NICs are all on the same fabric interconnect.  This will keep service console/VMkernel traffic from having to hit our northbound switches and keep the traffic isolated to a single fabric interconnect.
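If you want to script this change instead of clicking through the client, newer ESXi builds expose the teaming policy through esxcli. This is just a sketch: the vSwitch and vmnic names are placeholders for your environment, and the esxcli network namespace only exists in ESXi 5.0 and later (on the classic ESX 4 of this era you would use esxcfg-vswitch or the vSphere client instead).

```shell
# Hypothetical names: vSwitch0 carries the Service Console/VMkernel port groups,
# vmnic0 maps to Fabric A and vmnic1 to Fabric B.
# Make vmnic0 active and vmnic1 standby so management traffic stays on one fabric.
esxcli network vswitch standard policy failover set \
    --vswitch-name=vSwitch0 \
    --active-uplinks=vmnic0 \
    --standby-uplinks=vmnic1

# Verify the resulting policy.
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0
```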


Here is where it gets interesting.

Once this was done, possibilities came to mind and I asked the $64,000 question: “Is there a way to keep everything in UCS up and running properly in the event we lose all our northbound links?”  It was more of a theoretical question, but we spent the next six hours working on it anyway.  🙂

Disclaimer: not all of what you are about to read is fully tested.  This was a theoretical exercise that we didn’t finish testing due to time constraints.  We did test this with two hosts on the same subnet and it worked as theorized.

Here’s what we came up with:

First of all, when UCS loses its northbound links it can behave in two ways.  Via the Network Control Policy – see screen shot below – the ports can be marked either “link-down” or “warning”.  When northbound ports are marked “link-down”, the various vNICs presented to the blades go down.  This will also kick in fabric failover if it is enabled at the vNIC level.  If you are not using the Fabric Failover feature on a particular vNIC, you can achieve the same functionality by running the NIC Teaming drivers at the operating system level.  We are using NIC Teaming at the vSwitch level in vSphere and Fabric Failover for bare metal operating systems.

Setting the Network Control Policy to “warning” keeps the ports alive as far as the blades are concerned, and no failovers take place.  The beauty of this policy is that it can be applied on a per-vNIC basis, so you can cherry-pick which vNIC is affected by which policy (link-down or warning).  Using a combination of Network Control Policy settings and vSwitch configurations, it’s possible to keep workloads on UCS up and running, with all servers (virtual or otherwise) communicating, without any external connectivity.  This could be used to prevent massive outages, boot storms due to outages, etc.  In our case, since the bulk of our data center will be on UCS, it basically prevents me from having to restart my data center in the event of a massive network switch outage.
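If you prefer the UCSM CLI to the GUI for this, the policy can be created there as well. This is a rough sketch from memory, so verify the syntax against your UCSM version; the policy name “keep-alive” is made up for illustration:

```shell
# On the fabric interconnect CLI (UCS Manager).
# Create a network control policy whose uplink-fail-action is "warning",
# so vNICs that use it stay up when all northbound uplinks are lost.
UCS-A# scope org /
UCS-A /org # create nwctrl-policy keep-alive
UCS-A /org/nwctrl-policy # set uplink-fail-action warning
UCS-A /org/nwctrl-policy* # commit-buffer
```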

Here is a table detailing our vSphere switch configuration:

Port Group             Fabric   Teaming Config   Network Control Policy (in UCS)   Network Failover Detection (at vSwitch level)
Service Console NIC1   A        Active           Link-Down                         Link Status Only
Service Console NIC2   B        Standby          Warning                           Link Status Only
VMkernel NIC1          A        Active           Link-Down                         Link Status Only
VMkernel NIC2          B        Standby          Warning                           Link Status Only
Virtual Machine NIC1   A        Active           Link-Down                         Link Status Only
Virtual Machine NIC2   B        Active           Warning                           Link Status Only
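The vSphere side of that table can be applied per port group with esxcli on newer ESXi builds. Again just a sketch with placeholder names (vmnic0 on Fabric A, vmnic1 on Fabric B; port group names will differ in your environment):

```shell
# Service Console and VMkernel: active on Fabric A, standby on Fabric B.
esxcli network vswitch standard portgroup policy failover set \
    -p "Service Console" --active-uplinks=vmnic0 --standby-uplinks=vmnic1
esxcli network vswitch standard portgroup policy failover set \
    -p "VMkernel" --active-uplinks=vmnic0 --standby-uplinks=vmnic1

# Virtual machines: active/active across both fabrics.
esxcli network vswitch standard portgroup policy failover set \
    -p "VM Network" --active-uplinks=vmnic0,vmnic1
```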

As far as bare metal blades go:

       Fabric   Teaming Config                       Network Control Policy (in UCS)
NIC1   A        Active                               Link-Down
NIC2   B        Active or Standby (depends on app)   Warning

Digression: This looks like we are heavily loading up Fabric A, which is true from an overall placement point of view.  However, most of our workloads are VMs, which are configured for active/active, thus providing some semblance of load balancing.  We could go active/active for bare metal blades since the operative feature for them is the Network Control Policy.  With vSphere, we are trying to keep the Service Console and VMkernel vNICs operating on the same fabric interconnect in order to reduce northbound traffic.  Not so with bare metal systems.

Back on track: As previously stated (before the tables), what all this does, in effect, is force all my blade traffic onto a single fabric interconnect in case I lose ALL my northbound links.  Since the ports on Fabric B are not marked “link-down”, the blades do not see any network issues and continue communicating normally.


And now the “BUT”: But this won’t work completely in my environment because I am connected to two disjoint L2 networks.  See Brad Hedlund’s blog and The Unified Computing blog for more details.  In order for this to work completely, I will need to put in a software router of some sort to span the two different networks (VLANs in this case).


So what do you think?  Anyone out there with a lab that can fully test this?  If so, I would be interested in seeing your results.


Troubleshooting fault code F0327 in Cisco UCS

September 22, 2010 Leave a comment

For the past few days, I’ve been working on troubleshooting a problem in UCS that I, admittedly, caused.  The problem in question has to do with an error code/message that I received when trying to move a service profile from one blade to another.  The error code is F0327.

According to the UCS error code reference guide, it translates as:

fltLsServerConfigFailure

Fault Code: F0327

Message

Service profile [name] configuration failed due to [configQualifier]

Explanation

The named configuration qualifier is not available. This fault typically occurs because Cisco UCS Manager cannot successfully deploy the service profile due to a lack of resources that meet the named qualifier. For example, the following issues can cause this fault to occur:

• The service profile is configured for a server adapter with vHBAs, and the adapter on the server does not support vHBAs.

• The local disk configuration policy in the service profile specifies the No Local Storage mode, but the server contains local disks.

Recommended Action

If you see this fault, take the following actions:

Step 1  Check the state of the server and ensure that it is in either the discovered or unassociated state.

Step 2  If the server is associated or undiscovered, do one of the following:

–Discover the server.

–Disassociate the server from the current service profile.

–Select another server to associate with the service profile.

Step 3  Review each policy in the service profile and verify that the selected server meets the requirements in the policy.

Step 4  If the server does not meet the requirements of the service profile, do one of the following:

–Modify the service profile to match the server.

–Select another server that does meet the requirements to associate with the service profile.

Step 5  If you can verify that the server meets the requirements of the service profile, execute the show tech-support command and contact Cisco Technical Support.

——————–

While the guide helpfully provided lots of things to try, none of them fixed the problem.  It took me a while, but I figured out how to reproduce the error, a possible cause, and a workaround.

Here’s how to produce the error:

  1. Create a service profile without assigning any HBAs.  Shut down the server when the association process has completed.
  2. After the profile is associated, assign an HBA or two.
  3. You should receive this dialog box:

You will then see this in the general tab of the service profile in question:

Now here is where the error can be induced:

  1. Don’t power on.  Keep in mind that the previous dialog box said that changes wouldn’t be applied until the blade was rebooted (powered on).
  2. Now disassociate the profile and associate it with another blade.  The “error” is carried over to the new blade and the config process (association process) does not run.

Powering up the newly associated blade does not correct the issue.  What has happened is that the disassociation/association process that is supposed to occur above does not take place due to the service profile being in an error state.

Workaround:

  1. Reboot after adding the HBA.  This will complete the re-configuration process, thus allowing the disassociation/association processes to run normally.  (This is also the proper procedure.)  Or:
  2. Go to the Storage tab of the affected service profile and click on “Change World Wide Node Name”.  This forces the re-configuration to take place.


I’ve opened a ticket with TAC on this asking for a few documentation updates.  The first update is to basically state the correct method for applying the HBAs, and that if it’s not followed, the error message will appear.

The second update is for them to add a sixth step to the error code guide: press the “Change World Wide Node Name” button.

I am going to go out on a limb and say that they probably didn’t count on people like me doing things that they shouldn’t be doing or in an improper manner when they wrote the manuals.   🙂


Categories: UCS

My Thoughts on Our Cisco UCS Sales Experience

August 31, 2010 2 comments

This is a topic that when I think about it, I jump around in my head from subtopic to subtopic.  To make things easier on myself, I am going to write a bunch of disjointed paragraphs and tie them together in the end.

Disjoint #1

I’ve never worked on Cisco gear in the past.  Everywhere I worked where I had access to network/server equipment, Cisco was not a technology provider.  I don’t know why, other than I’ve heard Cisco had the priciest gear on the market.  I’ve also heard/read that while Cisco is #1 in the networking gear market, their products are not necessarily #1 in performance, capacity, etc.  Throw in the perception of the 800lb gorilla and you get a lot of negative commentary out there.

Disjoint #2

When I was 19, I started my career in the technology field as a bench tech for a local consumer electronics store.  The owner (Ralph) was a man wise beyond his years. He saw something in me and decided to take me under his wing, but because I was 19, I did not understand/appreciate the opportunity that he was bestowing upon me.

While I learned some of the various technical aspects of running a small business, I did not do so well on the human side of it.  I was a brash, cocky 19-year-old who thought he could take over the world.  However, there is one thing Ralph said that I remember very well: “If no one has any problems, how will they ever find out what wonderful customer service we have?”

It’s not that he wanted people to have problems with the equipment they purchased.  He knew that by selling thousands of answering machines, telephones, TVs, computers, etc., there would be some issues at some point, and he felt that he should do his best to make amends for them.

Ralph truly believed in customer service and would go out of his way to ensure that all customers left feeling like they had been taken care of extremely well.  If there was a poster child for exemplary customer service, it would be Ralph.

Disjoint #3

A number of vendors with broad product lines have somehow decided that the SMB market does not need robust, highly available (maybe even fault-tolerant) equipment.  Somehow, company size and revenue have become equated with technical needs.  Perceptions of affordability also play into this: if you can’t afford it, then you don’t need it.

Why do I bring this up?  Way back in one of my earlier posts, I mentioned that we had a major piece of equipment fail and received poor customer service from the vendor.  The vendor sales rep kept saying that we bought the wrong equipment.  We didn’t buy the wrong equipment; we bought what we could afford.  In hindsight, it wasn’t the equipment that failed us, but the company behind it.


Tying all this together…

When we first started looking at UCS, some folks here had trepidations about doing business with Cisco.  There were preconceived notions of pricing and support.  Cisco was also perceived to have a reputation of abandoning a market where they could not be number one in sales.

I must also admit that there are technical zealots in my organization that only believe in technical specifications.  These folks try to avoid products that don’t “read” the best on paper or have the best results in every performance test.

However, my team diligently worked to overcome these objections one by one and we couldn’t have done it without the exceptional Cisco sales team assigned to us.

In the early part of the sales process, we pretty much only dealt with the Product Sales Specialist (PSS) and her System Engineer (SE).  The rest of the account team entered the picture a month or so later.

These two (PSS and SE) had the patience of Job.   The sales team took copious amounts of time meeting with us to explain how UCS was different from the other blade systems out there and how it could fit into our environment and enable us to achieve our strategic goals.  All questions were answered thoroughly in a timely manner.  Not once did I ever get the feeling that they (Cisco) felt they were wasting their time.

When the infamous HP-sponsored Tolly report (and other competing vendor FUD) came out, Cisco sales took the time to allay our concerns.   As we read and talked about other competing products, not once did they engage in any negative marketing.  Cisco took the high road and stuck to it.

We had phone calls with multiple reference accounts.  We had phone calls with product managers.  We had phone calls with the Unified Computing business unit leaders.  We had phone calls with…you get the idea.  Cisco put in a great amount of effort to show us their commitment to being in the server business.

On top of all this, there was no overt pressure to close the sale.  Yes, the sales team asked if they could have the sale.  That’s what they are supposed to do.  But they didn’t act like car salesmen by offering a limited-duration, once-in-a-lifetime deal.  Instead, they offered a competitive price with no strings attached. (Disjoint #1)

Needless to say, we bought into UCS and have transitioned to the post-sales team.  This means we now interact more with our overall account rep and a generic SE rather than the PSS and her SE.  I call our new SE generic because he is not tied to a particular product but represents the entire Cisco product line.  He is quite knowledgeable and very helpful in teaching us the ways of navigating Cisco sales and support.

So has everything gone perfectly?  No.  We’ve had a few defective parts.  If you have read my other posts, you know that we have had some integration issues.  We’ve also found a few areas of the management system that could use a bit more polish.  So in light of all this, do I regret going with UCS?  Not at all.  I still think it is the best blade system out there, and I truly think the UCS architecture is the right way to go.

But with defective parts, integration issues, etc., “Why do I still like Cisco?” you ask.  For starters, I don’t expect everything to be perfect.  That’s just life in the IT field.

Second, go re-read Disjoint #2.  Cisco must have hired Ralph at some point, because their support has been phenomenal.  Not only do the pre- and post-sales teams check in to see how we are doing, but any time we run into an issue they ask what Cisco can do to help.  And they don’t just ask; they actually follow through if we say “yes”.  They are treating us as if we are their most important customer.

Finally, to tie in Disjoint #3: any time we run into something where other vendors would say we purchased the wrong equipment, Cisco owns the issue and asks how they can improve what we have already purchased.  It’s not about “buy this” or “buy that”.  It’s “How can we make it right?”, “What can we do to improve the product/process/experience?”, and “What could we have done differently?”  These are all questions a quality organization asks itself and its customers.

I don’t know what else I can write about my Cisco sales experience other than to say that it has become my gold standard.  If other vendors read this post, they now know what standard they have to live up to.

To other UCS customers: What was your sales experience like?


Our Current UCS/vSphere Migration Status

August 17, 2010 Leave a comment

We’ve migrated most of our virtual servers over to UCS and vSphere.  I’d say we are about 85% done, with this phase being completed by Aug 29.  It’s not that it’s taking 10+ days to actually do the rest of the migrations.  It’s more of a scheduling issue.  From my perspective, I have three more downtimes to go.  Not much at all.

The whole process of migrating from ESX to vSphere and updating all the virtual servers has been interesting, to say the least.  We haven’t encountered any major problems; just some small items related to the VMTools/VM hardware version (4 to 7) upgrades.  For example, our basic VMTools upgrade process is to right-click on a guest in the VIC and click on the appropriate items to perform an automatic upgrade.  When it works, the guest installs VMTools, reboots, and comes back up without admin intervention.  For some reason, this would not work for our MS Terminal Servers unless we were logged into the target terminal server.

Here’s another example, this time involving Windows Server 2008: the automatic upgrade process wouldn’t work there either.  Instead, we had to log in, launch VMTools from the System Tray, and select upgrade.  The only operating system that went perfectly was Windows Server 2003 with no fancy extras (terminal services, etc.).  Luckily, that’s the o/s most of our virtual workloads are running.  I am going to hazard a guess and say that some of these oddities are related to our various security settings, GPOs, and the like.

All in all, the VM migration has gone very smoothly.  I must say that I am happy with the quality of the VMware hypervisor, Virtual Center, and other basic components.  There has been plenty of opportunity for something to go extremely wrong, but so far, nada. (knock on wood)

So what’s next?  We are preparing to migrate our SQL servers onto bare metal blades.  In reality, we are building new servers from scratch and installing SQL Server.  The implementation of UCS has given us the opportunity to update our SQL servers to Windows Server 2008 and SQL Server 2008.  Other planned moves include some Oracle app servers (on RedHat) as well as domain controllers, file share clusters, and maybe some tape backup servers.  This should take us into September.

Once we finish with the blades, we’ll start deploying the Cisco C-series rackmount servers.  We still have a number of instances where we have to go rackmount.   Servers in this category typically need multiple NICs, telephony boards, or other specialized expansion boards.


Upgrade Follies

August 12, 2010 Leave a comment

It’s amazing how many misconfigured, or perceived misconfigured, items can show up when doing maintenance and/or upgrades.  In the past three weeks, we have found at least four production items that fit this description that no one noticed because things appeared to be working.  Here’s a sampling:

During our migration from our legacy VM host hardware to UCS, we broke a website that was hardware load-balanced across two servers, with a third acting as a maintenance fallback.  Traffic should have been directed to Server A, then Server B, then Server C.  After the migration, traffic was only going to Server C, which just hosts a page that says the site is down.  It’s a “maintenance” server, meaning that whenever we take a public-facing page down, the traffic gets directed to Server C so that people can see a nice screen that says, “Sorry, down for maintenance…”

Everything looked right in the load balancer configuration.  Delving deeper, we noticed that Server A was configured to be the primary node for a few other websites.  An application analyst whose app was affected chimed in and said the configuration was incorrect: Website 1 traffic was supposed to go first to Server A, then B; Website 2 traffic was supposed to go in the opposite order.  All our application documentation agreed with the analyst.  Of course, he wrote the documentation, so it had better agree with him 🙂

Here is the disconnect: we track all our changes in a Change Management system, and no one ever put the desired configuration change into the system.  As far as our network team is concerned, the load balancer is configured properly.  Now, this isn’t really a folly, since our production system/network matched what our change management and CMDB systems were telling us.  This is actually GOODNESS.  If we ever had to recover from a disaster, we would reference our CMDB and change management systems, so they had better be in agreement.

Here’s another example: We did a mail server upgrade about six months ago and everything worked as far as we could tell.  What we didn’t know was that some things were not working, but no one noticed because mail was getting through.  When we did notice something incorrect (a remote monitoring issue) and fixed the cause, it led us to another item, and so on and so on.  Now, not everything was broken at the same time.  In a few cases, the fix of one item actually broke something else.  What’s funny is that if we hadn’t corrected the monitoring issue, everything would have kept working.  It was a fix that caused all the other problems.  In other words, one misconfiguration proved to be a correct configuration for other misconfigured items.  In this case, multiple wrongs did make a right.  Go figure.

My manager has a saying for this: “If you are going to miss, miss by enough.”


I’ve also noticed that I sometimes don’t understand concepts when I think I do.  As part of our migration to UCS, we are also upgrading from ESX 3.5 to vSphere.  Since I am new to vSphere, I did pretty much what every SysAdmin does: click all the buttons/links.  One of those buttons is the “Advanced Runtime Info” link in the VMware HA section of the main Virtual Center screen.

This link brings up info on slot sizes and usage.  You would think the numbers would add up, but clearly they don’t.

How does 268 - 12 = 122?  I’m either obviously math challenged or I really need to go back and re-read the concept of slots.
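One possible explanation: the “available slots” figure subtracts not just the used slots but also the slots HA reserves for host failures.  These numbers are purely hypothetical (I’m assuming a two-host cluster with 134 slots per host and one tolerated host failure), but they would make the screenshot math work out:

```
total slots in cluster      = 268   (e.g. 2 hosts x 134 slots each, hypothetical)
reserved for 1 host failure = 134   (one host's worth of slots)
used slots                  =  12
available slots             = 268 - 134 - 12 = 122
```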
