Archive for April, 2010

Battle of the Blades – Part III

April 27, 2010

So earlier on I posted that I would list our strategic initiatives and how they led us down the path of choosing Cisco UCS as our new server platform over HP or IBM blades.  Before I begin, let me state that all three vendors have good, reliable equipment and that all three will meet our needs to some degree.  Another item of note is that some of our strategies/strategic direction may really be tactical in nature.  We just lumped both together as one item in our decision matrix (really a fancy spreadsheet).  The last item to note is that all facts and figures are based on our proposed configurations (vendor provided), so don’t get hung up on the specifics.  With that out of the way, let’s begin…

If we go way back, our initial plan was just to purchase more HP rack-mount servers.  I have to say that the DL380 server is amazing.  Rock solid.  But given our change in strategic direction, which was to move from rack-mounts to blades, we were given the option of going “pie-in-the-sky” and developing a wish list.  It’s this wish list, plus some specific initiatives, that started us down the path of looking at Cisco UCS (hereafter referred to as UCS).

Item 1:  Cabling.  Now all blade systems have the potential to reduce the number of cables needed compared to rack-mount systems.  Overall, UCS requires the fewest cables outside the equipment rack because the only cables to leave the rack are the uplinks from the fabric interconnects.  With HP and IBM, each chassis is cabled back to your switch of choice.  That’s roughly 16 cables per chassis leaving the rack.  With UCS, we have a TOTAL of 16 cables leaving the rack.  Now you might say that a difference of 32 cables per rack (assume 3 HP or IBM chassis in a rack) might not be much, but for us it is.  Cable management is a nightmare for us.  Not because we are bad at it; we just don’t like doing it, so less cabling is a plus for us.  We could mitigate the cable issue by adding top-of-rack switches (which is sort of what a fabric interconnect is), but we would need a lot more of them and they would add more management points, which leads us to item two.
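The cable arithmetic above is quick to sketch; the chassis and uplink counts below come from our proposed configurations, so treat them as illustrative rather than exact vendor specs:

```python
# Cables leaving the rack, assuming 3 chassis per rack and 16 uplinks
# per chassis (figures from our proposed configs; illustrative only).
chassis_per_rack = 3
cables_per_chassis = 16

# HP/IBM: every chassis uplinks out of the rack to your switch of choice.
traditional = chassis_per_rack * cables_per_chassis  # 48 cables

# UCS: chassis cable only to the in-rack fabric interconnects;
# only the interconnect uplinks leave the rack.
ucs = 16

print(traditional, ucs, traditional - ucs)  # 48 16 32
```

Double the chassis density and the gap doubles with it, which is why this mattered so much to us.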

Item 2:  Number of management points.  Unless you have some really stringent, bizarre, outlandish requirements, the chances are that you will spend your time managing the UCS system at the fabric interconnect level.  I know we will.  If we went with HP or IBM, we would have to manage down to the chassis level and then some.  Not only is each chassis managed separately, think of all the networking/storage gear installed into each chassis.  Each of those is a separate item to manage.  Great, let’s just add in X more network switches and X more SAN switches that need to be managed, updated, secured, audited, etc.  Not the best way to make friends with other operational teams.

Item 3:  Complexity.  This was a major item for us.  Our goal is to simplify where possible.  We had a lot of back-and-forth getting a VALID configuration for the HP and IBM blade systems.  This was primarily the fault of the very large VAR representing both HP and IBM.  We would receive a config, question it, get a white paper from the VAR in rebuttal, point the VAR to the same white paper showing that we were correct, and then finally get a corrected config.  If the “experts” were having trouble configuring the systems, what could we look forward to as “non-experts”?

Talking specifically about HP, let’s add in HP SIM as the management tool.  As our HP rep is fond of stating, he has thousands of references that use SIM.  Of course he does; it’s free!  We use it too because we can’t afford OpenView or Tivoli.  And for basic monitoring functions it works fine, albeit with a few quirks.  Add BladeSystem Matrix on top of it, and you have a fairly complex management tool set.  We spent a few hours in a demo of the Matrix in which the demoer, who does this every day, had trouble showing certain basic tasks.  The demoer had to fall back on the old tech standby: click around until you find what you are looking for.

Item 4: Multi-tenancy.  We plan on becoming a service provider, of sorts.  If you read my brief bio, you would remember that I work for a municipal government.  We want to enter into various relationships with other municipalities and school districts in which we will host their hardware, apps, DR, etc., and vice versa.  So we need a system that easily handles multiple organizations in the management tool set.  Since we are an HP shop, we took a very strong looksy at how HP SIM would handle this.  It’s not pretty.  Add in the Matrix software and it’s even uglier.  Now don’t get me wrong.  HP’s product offerings can do what they claim, but it’s not drag-and-drop to set up for multi-tenancy.

Item 5: Converged architecture.  When we made our initial decision to go with UCS, it was the only converged architecture in town.  I know we are not going to be totally converged end-to-end for a few years, but UCS gets us moving in the right direction, starting with item 1: cabling.  All the other vendors seemed to think convergence was the wrong way to go, but once they saw the interest out there, they changed direction and moved toward it too.

Item 6: Abstraction.  You could also call this identity, configuration, or, in UCS parlance, service profiles.  We really like the idea of a blade being just a compute node, with all the properties that give it an identity (MAC, WWN, etc.) abstracted and portable.  It’s virtualization taken to the next level.  Yes, HP and IBM have this capability too, but it’s more elegant with UCS.  It’s this abstraction that will open up a number of possibilities in the high-availability and DR realms for us further down the road.  We have plans…

So there you have it.  Nothing earth-shattering as far as tactics and strategy go.  UCS happened to come out ahead because Cisco got to start with a clean slate when developing the product.  They also didn’t design it for today, but for tomorrow.

Questions, comments?


Do you defrag your virtual servers?

April 20, 2010

There have been some recent comments on Scott Drumond’s site (and others)  regarding defragging of virtual servers.  What do you think?  Do you defrag your virtual servers?  I’m personally torn.  I can see the value, but I am not sure if the costs are justified,  and I am not just talking about the monetary costs either.

We try to have a “what we do for one, we do for a thousand” mentality when it comes to our standard server builds.  Simply, this means that if we declare a piece of software part of our standard base image, all servers get it.  So far, that means all servers get anti-virus protection, log monitoring, and a few others by default.  In terms of dollars, it adds up to quite a pretty penny.  Virtualization makes it more expensive because I have more server instances in my environment than I would if every server were physical; since a project doesn’t have to pay for hardware, it’s easier to ask for a server.  Some people call this sprawl, but I wouldn’t.  Sprawl connotes a lack of control, and we have well-defined controls in place.  No servers get deployed without adequate licensing and other resources.

Another cost is resource utilization.  If a server is busy defragging, it’s using CPU and disk resources.  Does this impact other virtual servers?  I would say yes, but I can’t say how much.  Your mileage will vary.  Yes, I can quantify direct resource utilization, but if my customers don’t notice the difference, does it really matter?  A 5% increase in CPU may have no impact on customer experience.  Fine then.  But what if they do notice a difference?  What if transaction times go up?  All of a sudden that $xxx license may have just tripled in cost due to lost productivity.
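To make the lost-productivity point concrete, here’s a back-of-envelope sketch.  Every number in it is a made-up placeholder, not a figure from our environment:

```python
# Back-of-envelope: defrag-induced slowdown vs. license cost.
# All numbers are hypothetical placeholders for illustration.
license_cost = 1_000            # per-server license (placeholder)
users = 50
loaded_hourly_rate = 40         # fully loaded cost per user-hour
extra_seconds_per_txn = 2       # added transaction time under load
txns_per_user_per_day = 60
workdays_per_year = 250

lost_hours = (users * txns_per_user_per_day * workdays_per_year
              * extra_seconds_per_txn) / 3600
productivity_cost = lost_hours * loaded_hourly_rate
print(round(productivity_cost), license_cost)  # 16667 1000
```

With these made-up inputs, two extra seconds per transaction costs over sixteen times the license; shrink the slowdown to a fraction of a second and the math flips.  The point is only that the calculation has to be done, not that either answer is typical.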

Don’t forget to throw in the costs of environmentals.  If the host is busy, it’s generating heat.  If the host is busy, it’s using more electricity than it would be at idle (definitely true on Nehalem CPUs with power mgmt active).

Long story short, it’s not as simple as saying “defragging will improve VM performance by x”.  You need to figure out all the other ramifications.  My personal belief is to defrag those systems that clearly need it.  You’ll know which ones because you will either have already seen a significant performance degradation in them or, if you actively monitor your systems, you’re watching them begin to degrade.


  • As an aside, most of the performance results being posted these days are based on running newer CPUs, newer storage, etc.  Older equipment will not fare the same.  For example, just because Virtual Center shows 50% CPU available doesn’t mean it’s really available.  An additional 5% load can be noticed by other virtual servers.  We’ve experienced it in my organization on 3-year-old servers.  It’s not a problem on new equipment, but it’s something we have to take into consideration when deploying guests onto older hosts.
Categories: Philosophy

Patching VMware

April 19, 2010

So here is my first post regarding an actual “day in the life…”.  Last night was VMware patching.  I wish I could say it was a pleasant experience, but I can’t.  The unpleasantness has nothing to do with VMware and everything to do with my organization’s maintenance policies.

Our policy is to work on e-commerce systems on Sunday nights, from 10:00pm to 1:00am.   In my case, since the vm hosts in question host some e-commerce systems, I am stuck in the same maintenance window.  It doesn’t matter that I am running in a DRS/HA clustered configuration and that no downtime is experienced.  It’s the potential for downtime that causes worry in the ranks.  I can’t say that I disagree.  Nothing is 100%.  Problems do occur which sometimes result in an outage.

On-site at 9:45pm. Maintenance mode activated at 10:00pm, patches applied.  Repeat for other host.    I was home by 1am.    What makes it really unpleasant is that I am a terrible sleeper.  I’ll wake up at the slightest noise and/or light.  So home by 1am, in bed by 1:30am, up at 6:00am.  I am going to be dragging until Wednesday.  The older I get, the longer it takes to recover.

Categories: Life

How did your vendor respond in a disaster situation?

April 14, 2010

Have you ever had a major systems failure that could be classified as a disaster or near-disaster?  How did your vendor(s) of the failed systems respond?  Did they own up to it?  Obfuscate?  Lay blame elsewhere?  Help in the recovery?

Back in the fall of 2007, we had an “event” with our primary storage array.  I remember it as though it occurred yesterday.  I was coming home from vacation and had just disembarked from the airplane when I received a call from one of our engineers.  The engineer was very low-key and said that I might want to check in with our storage guys because of some problem they were having.  “Some problem” turned out to be a complete array failure.

I went home, took a shower, and then went to the office.  First thing I saw was the vendor field engineer standing near the array looking very bored.  A quick conversation ensued in which he told me he was having trouble getting support from his own people.  Uh-oh.

A few minutes later I found our storage folks in an office talking about various next steps.  I was given some background info.  The array had been down for over five hours, no one knew the cause of the failure, no one knew the extent of the failure, and no one had filled in our CIO on the details.  As far as she knew, the vendor was fixing the problem and things were going to be peachy again.

At this point, alarm bells should have been going off in everyone’s head.  I tracked down the vendor engineer and gave him a hard deadline to get the array fixed.  I also started prepping, with the help of our storage team,  for massive recovery efforts.  The deadline came and the vendor was no further along so I woke up my manager and told her to wake up the CIO to declare a disaster.

Along comes daylight and we still haven’t made any progress on fixing the downed array, but we have started tape restoration to a different array.   A disaster is declared.  Teams are put together to determine the scope of impact, options for recovering, customer communications, etc..  We also called in our array sales rep, his support folks, 2nd/3rd level vendor tech support, and more.

So here we are, all in a room, trying to figure out what happened and what to do next.  Third-level vendor support is in another part of the country.  He doesn’t know what’s been discussed, so he tells us what happened.  Unfortunately, this was not the party line.  The vendor wanted to blame the problem on something different: something that was supposedly fixed in a firmware update we hadn’t yet applied (thus the finger-pointing begins).  Not a bright idea, since we had the white paper on that particular error and we were nowhere close to hitting the trigger point.  Months later, this so-called fixed problem was corrected, again, in another firmware release.

To make matters worse, while discussing recovery options one of the vendor’s local managers said, and I quote, “It’s not our problem”.  Wow!!!  Our primary storage provider just told us that his product’s failure was not his problem.  Yes, we bought mid-range equipment, so we knew we weren’t buying 5 nines or better.  Still, to say that it was our fault and that we should have bought the high-end, seven-figure system was a bit much.

We recovered about 70%  of the data to another array within 36 hours and then ran into a bad tape problem.  The remaining 30% took about two weeks to get.  Needless to say, we learned a lot.  Our DR processes weren’t up to snuff, our backup processes weren’t up to snuff, and our choice of vendor wasn’t up to snuff.  We are in the process of correcting all three deficiencies.

Back to my opening paragraph, how have your vendors treated you in a disaster?

Categories: Uncategorized

Patches! We don’t need no stinkin’ patches!

April 13, 2010

It’s time again for our monthly visitor.  No, not that one.  I’m talking about Microsoft patches.  We’re pretty aggressive here in applying them.  How aggressive?  We like to play Russian roulette by patching all servers within 48 hours of patch release.  Strangely enough, we’ve been really lucky.  I think we’ve had only five outages in the seven or so years of patching.  I would prefer more time to test, but that’s not my call.  However, if something breaks, my team has to pick up the pieces.

How about you?  When do you patch?

Categories: Philosophy

Battle of the Blades – Part II

April 12, 2010

Before reading this post, please read the following two posts first:

So far I’ve talked a bit about methods for determining when to replace equipment, what equipment is to be used as the replacements,  and what factors may go into the overall decision.  I also mentioned that this post would cover some of our strategic initiatives and how they factored into the overall product choice.  I lied.    I missed an important piece of history in the background.

Let’s recap:  I was prepared to order traditional rack-mounts in June 2009.  Our management team asked for a capacity analysis.  The process of getting this analysis took so long that we decided to look at blades.  Since we are an HP shop, we definitely had HP on our short list.  However, since we were taking the opportunity to rethink our architecture, we decided to step out a bit and look at other unique products in the blade market.  The only real requirement was that we had to believe the ‘unique’ vendor was viable from our perspective.  Who came to mind?  Cisco.

Cisco?  Yes, Cisco.  When we started looking at blades, Cisco had been shipping their UCS product for a few months already.  Press was good, reviews were good, etc.  Not only were we seeing positive news, Cisco offered a very unique architecture.  Look at all the differences between UCS and a traditional blade system.  I am not going to list them here because it’s already been listed a number of times out there in the blogosphere.  Go check out Scott Lowe’s blog or the By the Bell blog.  Both have excellent articles on UCS characteristics.

Moving along…We couldn’t just go and say, “Let’s buy UCS.”   I don’t work that way.  I am very happy with HP rack-mount servers and would not hesitate to recommend them to anyone if asked.  If I am going to choose a different vendor, I need to have good reasons that I can objectively point to.

Thus began the epic saga that culminated in many months of research into HP and Cisco blade offerings.  I can’t say it was enjoyable.  Part of the problem stems from the vendor HP brought in.  The sales rep, who represents all the tier 1 companies, didn’t/doesn’t believe in UCS.  Every time we asked for something, we got massive amounts of FUD in response.  Now, to be fair to HP, the sales rep knows someone in our upper management.  I am speculating that this sales rep approached management about performing a capacity analysis and that, since we already use HP equipment, they brought in HP to work with them.

So forward we go and develop a list of criteria that was as objective as we could get it.  Items on the list: complexity, number of cables, number of management points, RAM capacity, etc.  Some were just technical check-box items; others related to our strategic initiatives.  When all was said and done, a few of the criteria really stood out.  One was complexity and the other was support for our strategic initiatives.  I don’t mean to bag on HP, but their blade system is complex.  We went back and forth with the vendor on developing a workable configuration for quite some time.  It wasn’t the big items that tripped us up, but rather the little things.  Unfortunately, it was these little things that would make or break the entire solution.  I am guessing that a lot of the complexity in developing a configuration comes from the sheer breadth of accessories that HP offers: which switches in the chassis are needed, which HBA, which this, which that…

The more we looked at Cisco, the more we liked their solution.  Imagine being able to have 20/20 hindsight when developing a new product.  That’s what Cisco had in this case.  Cisco was able to look at all the other blade systems out there, see what was good and bad about them, and design a new solution.  Think of all the bad that comes with most blade systems.  I mentioned in a previous post that cable management was a pain point for us.  Well, you can’t get much cleaner than UCS.  How about complexity?  I am not saying Cisco is perfect, but their solution is pretty easy to work with.  Some of that has to do with the fact that there is no legacy equipment for them to be compatible with.  Some of it has to do with the fact that UCS is managed at the fabric interconnect level versus the chassis level.

Seems like a done deal then, doesn’t it?  Cisco has a solution that meets our needs better than HP.  Simple.  Not really.  Management wanted us to consider other vendors, notably IBM.  Why IBM?  They support multiple processor families (Intel, PowerPC, Sparc), have a good track record, and have a fair amount of market share.  So in come the IBM folks to discuss their offerings.  Personally, I wasn’t impressed.  While there was some interesting technology there, it just seemed ‘old’.  Judging by some other blog posts I have read, IBM agrees and will be coming out with some new offerings over the next few months…

Are we there yet, are we there yet, are we there yet?   Nope.  HP/vendor had one more trick up their sleeve.  They managed to get some info on our criteria and then stated that they proposed the wrong product.  Instead of just a blade system, they felt that they should have proposed the BladeSystem Matrix.  Well if they weren’t complex before, they sure were then.  We went through a demo of the Matrix software and all I can say is the complexity score shot through the roof (in a bad way).   I don’t think bolting on software to SIM was the right way to go.  Even then, it was obvious that some components were not tightly integrated and were just being launched by SIM.  However, some of the new functionality did support our strategic initiatives more so than just the plain blade system as originally proposed.

In the end, we chose Cisco.  Is it a done deal?  No.  There is still some jockeying going on.  All I can say is that Cisco has stepped up to the plate and taken on the challenge of proving to us that they offer the best solution to meet our needs.

And for those strategic initiatives…next post, maybe 🙂

Categories: Hardware refresh

Battle of the Blades – Part I

April 5, 2010

Please read first.

So how does one go about choosing a blade system?   I don’t have an answer for you.  I can tell you what we did, but it may not be the proper methodology for your organization.

When all is said and done, most blade systems appear nearly identical.  After all, what is a blade server?  It’s a rack mount turned vertical.  Yes, there is more to them but that is the essence of it.   If you are 100% satisfied with your incumbent provider and they meet your needs,  then stick with them.    Or you can do what we did and take the opportunity to envision what your ideal data center environment would look like and try to find the provider that comes closest to that vision.

Now I am not going to pull one over on you and say that we knew what our ideal data center would look like from day one.  We didn’t.  Our vision evolved as we reviewed the various offerings from the tier 1 vendors.  Our vision evolved as we learned the strategic plans of our various business units.  Our vision evolved as…  Yes, our vision is ever-changing.  My team and I will never 100% know where our organization is going, since we work in an ever-changing world.  So what did we do?  We courted multiple providers.

Yes, you read correctly.  We came up with a basic configuration based on the Capacity Planner analysis (and other factors) and then asked multiple vendors to provide a Bill of Materials (BOM).  Then we sent the BOMs to various resellers for pricing.  The main reason we did this was to have all our paperwork ready for our purchasing department to take to the City Council for approval once we made a decision on product.  Getting all our ducks in a row takes time, so even though we didn’t have a final choice of product, we could at least keep the process moving along.  Besides, getting the BOMs and pricing helped us develop a five-year cost model.  Unless you are funded extremely well, price has to play a factor in the decision-making.  The best product in the world is not going to find a spot in my data center if it is ten times more expensive than everyone else and does not make me ten times more productive, reduce other costs by 10x, etc.
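The five-year cost model boils down to something like the following sketch; the acquisition, support, and power figures are hypothetical placeholders, not our actual BOM pricing:

```python
def five_year_cost(acquisition, annual_support, annual_power, years=5):
    """Simple TCO: up-front price plus recurring support and power."""
    return acquisition + years * (annual_support + annual_power)

# Two hypothetical vendor BOMs priced by resellers (placeholder dollars).
vendor_a = five_year_cost(acquisition=250_000, annual_support=40_000,
                          annual_power=15_000)   # 525,000
vendor_b = five_year_cost(acquisition=300_000, annual_support=25_000,
                          annual_power=9_000)    # 470,000

# Vendor A has the cheaper sticker price, but Vendor B wins over 5 years.
print(vendor_a, vendor_b)  # 525000 470000
```

The design point is that recurring costs get multiplied by the horizon while acquisition doesn’t, which is exactly why a cheaper up-front quote can lose over five years.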

While all this was taking place, we finally reached a point where we could articulate our data center vision and defend it.  It’s one thing to say we want to reduce power consumption, reduce our data center footprint, blah, blah, blah.  Everyone says that and all providers can meet those requirements.   These  bullet points were not going to help us decide on a product.   Besides addressing the various strategic initiatives, we needed to address what causes my team the most pain:  cabling and ease of hardware management.

Just for giggles, go do a Google image search on the phrase “a network cable is unplugged”.  While our data center was nowhere near as bad as that one, we do have some cable management nightmares.  When a rack is put in, everything is nice and neat.   In a dynamic data center, the cabling  becomes a nightmare one cable at a time.  If I had to come up with a movie title for it it would probably be: “Adds, moves, and changes:  The bane of the data center.”

Ease of hardware management was our second-greatest pain.  We currently use HP servers, so our primary server management tool is Systems Insight Manager (SIM).  SIM isn’t bad, but it isn’t great either.  It offers a fair amount of functionality for the price (free with server purchase).  However, it has some niggling quirks which drive us crazy.  For starters, it uses reverse DNS lookups to determine system names.  What happens if a server has fourteen aliases?  Thirteen are marked as down.  Instead of querying the host’s primary name, it picks up whatever DNS spits out at the time of the reverse lookup.  That sort of makes alerting/alarming harder than it has to be.
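The reverse-DNS quirk is easy to see with a toy PTR table standing in for DNS.  The hostnames and address below are hypothetical, and this only approximates SIM’s discovery behavior:

```python
# One PTR record per address, but many forward aliases for the same box.
# (All names and addresses are hypothetical.)
ptr_records = {"10.0.0.5": "web01.corp.example"}
aliases = {"web01.corp.example", "intranet.example", "portal.example"}

def discovered_name(ip):
    # SIM-style discovery: take whatever the reverse lookup returns,
    # rather than asking the host for its configured primary name.
    return ptr_records[ip]

name = discovered_name("10.0.0.5")
# Every alias except the one PTR name looks like a separate, down server.
marked_down = sorted(aliases - {name})
print(name, marked_down)  # web01.corp.example ['intranet.example', 'portal.example']
```

One box, three known names, two phantom “down” systems: exactly the alerting noise described above.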

Of course, all this assumes it can discover every server.  We’ve had times when a server had to be manually entered into SIM and still couldn’t be seen.

The final issue we have with SIM is its interface.  It’s just not as friendly as it could be.  To give you an idea what I am talking about… There are some blogs out there that seem to think that the HP BladeSystem Matrix management console can only manage 250 servers.  The real answer is that it can manage over 1300 servers.  The 250 number comes from the HP system engineers due to SIM’s interface.  SIM just doesn’t visually handle large numbers of objects very well.

That’s it for this entry.  My next post will cover some strategic initiatives and how they factored into our product choice.

Categories: Hardware refresh