Archive for April, 2010

Battle of the Blades – Part III

April 27, 2010

So earlier on I posted that I would list our strategic initiatives and how they led us down the path of choosing Cisco UCS as our new server platform over HP or IBM blades.  Before I begin, let me state that all three vendors have good, reliable equipment and that all three will meet our needs to some degree.  Another item of note is that some of our strategies/strategic direction may really be tactical in nature.  We just lumped both together as one item in our decision matrix (really a fancy spreadsheet).  The last item to note is that all facts and figures are based on our proposed configurations (vendor provided), so don’t get hung up on the specifics.  With that out of the way, let’s begin…

If we go way back, our initial plan was just to purchase more HP rack-mount servers.  I have to say that the DL380 server is amazing.  Rock solid.  But given our change in strategic direction, which was to move from rack-mounts to blades, we were given the option of going “pie-in-the-sky” and developing a wish list.  It’s this wish list, plus some specific initiatives, that started us down the path of looking at Cisco UCS (hereafter referred to as UCS).

Item 1:  Cabling.  Now all blade systems have the potential to reduce the number of cables needed compared to rack-mount systems.  Overall, UCS requires the fewest cables outside the equipment rack because the only cables to leave the rack are the uplinks from the fabric interconnects.  With HP and IBM, each chassis is cabled back to your switch of choice.  That’s roughly 16 cables per chassis leaving the rack.  With UCS, we have a TOTAL of 16 cables leaving the rack.  Now you might say that a difference of 32 cables per rack (assume 3 HP or IBM chassis in a rack) might not be much, but for us it is.  Cable management is a nightmare for us.  Not because we are bad at it; we just don’t like doing it, so less cabling is a plus for us.  We could mitigate the cable issue by adding top-of-rack switches (which is sort of what a fabric interconnect is), but we would need a lot more of them and they would add more management points, which leads us to item two.
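The cable arithmetic above is quick to sketch; the chassis and uplink counts below come from our proposed configurations, so treat them as illustrative rather than exact vendor specs:

```python
# Cables leaving the rack, assuming 3 chassis per rack and 16 uplinks
# per chassis (figures from our proposed configs; illustrative only).
chassis_per_rack = 3
cables_per_chassis = 16

# HP/IBM: every chassis uplinks out of the rack to your switch of choice.
traditional = chassis_per_rack * cables_per_chassis  # 48 cables

# UCS: chassis cable only to the in-rack fabric interconnects;
# only the interconnect uplinks leave the rack.
ucs = 16

print(traditional, ucs, traditional - ucs)  # 48 16 32
```

Double the chassis density and the gap doubles with it, which is why this mattered so much to us.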

Item 2:  Number of management points.  Unless you have some really stringent, bizarre, outlandish requirements, the chances are that you will spend your time managing the UCS system at the fabric interconnect level.  I know we will.  If we went with HP or IBM, we would have to manage down to the chassis level and then some.  Not only is each chassis managed separately, think of all the networking/storage gear installed into each chassis.  Each of those is a separate item to manage.  Great, let’s just add in X more network switches and X more SAN switches that need to be managed, updated, secured, audited, etc.  Not the best way to make friends with other operational teams.

Item 3:  Complexity.  This was a major item for us.  Our goal is to simplify where possible.  We had a lot of back-and-forth getting a VALID configuration for the HP and IBM blade systems.  This was primarily the fault of the very large VAR representing both HP and IBM.  We would receive a config, question it, get a white paper from the VAR in rebuttal, point the VAR to the same white paper showing that we were correct, and then finally get a corrected config.  If the “experts” were having trouble configuring the systems, what could we look forward to as “non-experts”?

Talking specifically about HP, let’s add in HP SIM as the management tool.  As our HP rep is fond of stating, he has thousands of references that use SIM.  Of course he does; it’s free!  We use it too because we can’t afford OpenView or Tivoli.  And for basic monitoring functions it works fine, albeit with a few quirks.  Add BladeSystem Matrix on top of it, and you have a fairly complex management tool set.  We spent a few hours in a demo of the Matrix in which the demoer, who does this every day, had trouble showing certain basic tasks.  The demoer had to fall back on the old tech standby: click around until you find what you are looking for.

Item 4: Multi-tenancy.  We plan on becoming a service provider, of sorts.  If you read my brief bio, you would remember that I work for a municipal government.  We want to enter into various relationships with other municipalities and school districts in which we will host their hardware, apps, DR, etc., and vice versa.  So we need a system that easily handles multiple organizations in the management tool set.  Since we are an HP shop, we took a very strong looksy at how HP SIM would handle this.  It’s not pretty.  Add in the Matrix software and it’s even uglier.  Now don’t get me wrong.  HP’s product offerings can do what they claim, but it’s not drag-and-drop to set up for multi-tenancy.

Item 5: Converged architecture.  When we made our initial decision to go with UCS, it was the only converged architecture in town.  I know we are not going to be totally converged end-to-end for a few years, but UCS gets us moving in the right direction, starting with item 1: cabling.  All the other vendors seemed to think convergence was the wrong way to go, but once they saw the interest out there, they changed direction and moved toward it too.

Item 6: Abstraction.  You could also call this identity, configuration, or, in UCS parlance, service profiles.  We really like the idea of a blade being just a compute node, with all the properties that give it an identity (MAC, WWN, etc.) abstracted and portable.  It’s virtualization taken to the next level.  Yes, HP and IBM have this capability too, but it’s more elegant with UCS.  It’s this abstraction that will open up a number of possibilities in the high-availability and DR realms for us further down the road.  We have plans…

So there you have it.  Nothing earth-shattering as far as tactics and strategy go.  UCS happened to come out ahead because Cisco got to start with a clean slate when developing the product.  They also didn’t design it for today, but for tomorrow.

Questions, comments?


Do you defrag your virtual servers?

April 20, 2010

There have been some recent comments on Scott Drumond’s site (and others)  regarding defragging of virtual servers.  What do you think?  Do you defrag your virtual servers?  I’m personally torn.  I can see the value, but I am not sure if the costs are justified,  and I am not just talking about the monetary costs either.

We try to have a “what we do for one, we do for a thousand” mentality when it comes to our standard server builds.  Simply, this means that if we declare a piece of software part of our standard base image, all servers get it.  So far, that means all servers get anti-virus protection, log monitoring, and a few others by default.  In terms of dollars, it adds up to quite a pretty penny.  Virtualization makes it more expensive because I have more server instances in my environment than I would if every server were physical; since a project doesn’t have to pay for hardware, it’s easier to ask for a server.  Some people call this sprawl, but I wouldn’t.  Sprawl connotes a lack of control, and we have well-defined controls in place.  No servers get deployed without adequate licensing and other resources.

Another cost is resource utilization.  If a server is busy defragging, it’s using CPU and disk resources.  Does this impact other virtual servers?  I would say yes, but I can’t say how much.  Your mileage will vary.  Yes, I can quantify direct resource utilization, but if my customers don’t notice the difference, does it really matter?  A 5% increase in CPU may have no impact on customer experience.  Fine then.  But what if they do notice a difference?  What if transaction times go up?  All of a sudden that $xxx license may have just tripled in cost due to lost productivity.
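To make the lost-productivity point concrete, here’s a back-of-envelope sketch.  Every number in it is a made-up placeholder, not a figure from our environment:

```python
# Back-of-envelope: defrag-induced slowdown vs. license cost.
# All numbers are hypothetical placeholders for illustration.
license_cost = 1_000            # per-server license (placeholder)
users = 50
loaded_hourly_rate = 40         # fully loaded cost per user-hour
extra_seconds_per_txn = 2       # added transaction time under load
txns_per_user_per_day = 60
workdays_per_year = 250

lost_hours = (users * txns_per_user_per_day * workdays_per_year
              * extra_seconds_per_txn) / 3600
productivity_cost = lost_hours * loaded_hourly_rate
print(round(productivity_cost), license_cost)  # 16667 1000
```

With these made-up inputs, two extra seconds per transaction costs over sixteen times the license; shrink the slowdown to a fraction of a second and the math flips.  The point is only that the calculation has to be done, not that either answer is typical.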

Don’t forget to throw in the costs of environmentals.  If the host is busy, it’s generating heat.  If the host is busy, it’s using more electricity than it would be at idle (definitely true on Nehalem CPUs with power mgmt active).

Long story short, it’s not as simple as saying “defragging will improve VM performance by x”.  You need to figure out all the other ramifications.  My personal belief is to defrag those systems that clearly need it.  You’ll know which ones because you will either have already seen a significant performance degradation in them or, if you actively monitor your systems, you’re watching them begin to degrade.


  • As an aside, most of the performance results being posted these days are based on running newer CPUs, newer storage, etc.  Older equipment will not fare the same.  For example, just because Virtual Center shows 50% CPU available doesn’t mean it’s really available.  An additional 5% load can be noticed by other virtual servers.  We’ve experienced it in my organization on 3-year-old servers.  It’s not a problem on new equipment, but it’s something we have to take into consideration when deploying guests onto older hosts.
Categories: Philosophy

Patching VMware

April 19, 2010

So here is my first post regarding an actual “day in the life…”.  Last night was VMware patching.  I wish I could say it was a pleasant experience, but I can’t.  The unpleasantness has nothing to do with VMware and everything to do with my organization’s maintenance policies.

Our policy is to work on e-commerce systems on Sunday nights, from 10:00pm to 1:00am.   In my case, since the vm hosts in question host some e-commerce systems, I am stuck in the same maintenance window.  It doesn’t matter that I am running in a DRS/HA clustered configuration and that no downtime is experienced.  It’s the potential for downtime that causes worry in the ranks.  I can’t say that I disagree.  Nothing is 100%.  Problems do occur which sometimes result in an outage.

On-site at 9:45pm. Maintenance mode activated at 10:00pm, patches applied.  Repeat for other host.    I was home by 1am.    What makes it really unpleasant is that I am a terrible sleeper.  I’ll wake up at the slightest noise and/or light.  So home by 1am, in bed by 1:30am, up at 6:00am.  I am going to be dragging until Wednesday.  The older I get, the longer it takes to recover.

Categories: Life

How did your vendor respond in a disaster situation?

April 14, 2010

Have you ever had a major systems failure that could be classified as a disaster or near-disaster?  How did your vendor(s) of the failed systems respond?  Did they own up to it?  Obfuscate?  Lay blame elsewhere?  Help in the recovery?

Back in the fall of 2007, we had an “event” with our primary storage array.  I remember it as though it occurred yesterday.  I was coming home from vacation and had just disembarked from the airplane when I received a call from one of our engineers.  The engineer was very low-key and said that I might want to check in with our storage guys because of some problem they were having.  “Some problem” turned out to be a complete array failure.

I went home, took a shower, and then went to the office.  First thing I saw was the vendor field engineer standing near the array looking very bored.  A quick conversation ensued in which he told me he was having trouble getting support from his own people.  Uh-oh.

A few minutes later I found our storage folks in an office talking about various next steps.  I was given some background info.  The array had been down for over five hours, no one knew the cause of the failure, no one knew the extent of the failure, and no one had filled in our CIO on the details.  As far as she knew, the vendor was fixing the problem and things were going to be peachy again.

At this point, alarm bells should have been going off in everyone’s head.  I tracked down the vendor engineer and gave him a hard deadline to get the array fixed.  I also started prepping, with the help of our storage team,  for massive recovery efforts.  The deadline came and the vendor was no further along so I woke up my manager and told her to wake up the CIO to declare a disaster.

Along comes daylight and we still haven’t made any progress on fixing the downed array, but we have started tape restoration to a different array.   A disaster is declared.  Teams are put together to determine the scope of impact, options for recovering, customer communications, etc..  We also called in our array sales rep, his support folks, 2nd/3rd level vendor tech support, and more.

So here we are, all in a room, trying to figure out what happened and what to do next.  Third-level vendor support is in another part of the country.  He doesn’t know what’s been discussed, so he tells us what happened.  Unfortunately, this was not the party line.  The vendor wanted to blame the problem on something different: something that was supposedly fixed in a firmware update we hadn’t yet applied (thus the finger-pointing begins).  Not a bright idea, since we had the white paper on that particular error and we were nowhere close to hitting the trigger point.  Months later, this so-called fixed problem was corrected, again, in another firmware release.

To make matters worse, while discussing recovery options one of the vendor’s local managers said, and I quote, “It’s not our problem”.  Wow!!!  Our primary storage provider just told us that his product’s failure was not his problem.  Yes, we bought mid-range equipment, so we knew we weren’t buying 5 nines or better.  Still, to say that it was our fault and that we should have bought the high-end, seven-figure system was a bit much.

We recovered about 70%  of the data to another array within 36 hours and then ran into a bad tape problem.  The remaining 30% took about two weeks to get.  Needless to say, we learned a lot.  Our DR processes weren’t up to snuff, our backup processes weren’t up to snuff, and our choice of vendor wasn’t up to snuff.  We are in the process of correcting all three deficiencies.

Back to my opening paragraph, how have your vendors treated you in a disaster?

Categories: Uncategorized

Patches! We don’t need no stinkin’ patches!

April 13, 2010

It’s time again for our monthly visitor.  No, not that one.  I’m talking about Microsoft patches.  We’re pretty aggressive here in applying them.  How aggressive?  We like to play Russian roulette by patching all servers within 48 hours of patch release.  Strangely enough, we’ve been really lucky.  I think we’ve had only five outages in the seven or so years of patching.  I would prefer more time to test, but that’s not my call.  However, if something breaks, my team has to pick up the pieces.

How about you?  When do you patch?

Categories: Philosophy

Battle of the Blades – Part II

April 12, 2010

Before reading this post, please read the following two posts first:

So far I’ve talked a bit about methods for determining when to replace equipment, what equipment is to be used as the replacements,  and what factors may go into the overall decision.  I also mentioned that this post would cover some of our strategic initiatives and how they factored into the overall product choice.  I lied.    I missed an important piece of history in the background.

Let’s recap:  I was prepared to order traditional rack-mounts in June 2009.  Our management team asked for a capacity analysis.  The process of getting this analysis took so long that we decided to look at blades.  Since we are an HP shop, we definitely had HP on our short list.  However, since we were taking the opportunity to rethink our architecture, we decided to step out a bit and look at other unique products in the blade market.  The only real requirement was that we had to believe the ‘unique’ vendor was viable from our perspective.  Who came to mind?  Cisco.

Cisco?  Yes, Cisco.  When we started looking at blades, Cisco had been shipping their UCS product for a few months already.  Press was good, reviews were good, etc.  Not only were we seeing positive news, Cisco offered a very unique architecture.  Look at all the differences between UCS and a traditional blade system.  I am not going to list them here because it’s already been listed a number of times out there in the blogosphere.  Go check out Scott Lowe’s blog or the By the Bell blog.  Both have excellent articles on UCS characteristics.

Moving along…We couldn’t just go and say, “Let’s buy UCS.”   I don’t work that way.  I am very happy with HP rack-mount servers and would not hesitate to recommend them to anyone if asked.  If I am going to choose a different vendor, I need to have good reasons that I can objectively point to.

Thus began the epic saga that culminated in many months of research into HP and Cisco blade offerings.  I can’t say it was enjoyable.  Part of the problem stems from the vendor HP brought in.  The sales rep, who represents all the tier 1 companies, didn’t/doesn’t believe in UCS.  Every time we asked for something, we got massive amounts of FUD in response.  Now, to be fair to HP, the sales rep knows someone in our upper management.  I am speculating that this sales rep approached management about performing a capacity analysis and that, since we already use HP equipment, they brought in HP to work with them.

So forward we go and develop a list of criteria that was as objective as we could get it.  Items on the list: complexity, number of cables, number of management points, RAM capacity, etc.  Some were just technical check-box items; others related to our strategic initiatives.  When all was said and done, a few of the criteria really stood out.  One was complexity and the other was support for our strategic initiatives.  I don’t mean to bag on HP, but their blade system is complex.  We went back and forth with the vendor on developing a workable configuration for quite some time.  It wasn’t the big items that tripped us up, but rather the little things.  Unfortunately, it was these little things that would make or break the entire solution.  I am guessing that a lot of the complexity in developing a configuration comes from the sheer breadth of accessories that HP offers: which switches in the chassis are needed, which HBA, which this, which that…

The more we looked at Cisco, the more we liked their solution.  Imagine being able to have 20/20 hindsight when developing a new product.  That’s what Cisco had in this case.  Cisco was able to look at all the other blade systems out there, see what was good and bad about them, and design a new solution.  Think of all the bad that comes with most blade systems.  I mentioned in a previous post that cable management was a pain point for us.  Well, you can’t get much cleaner than UCS.  How about complexity?  I am not saying Cisco is perfect, but their solution is pretty easy to work with.  Some of that has to do with the fact that there is no legacy equipment for them to be compatible with.  Some of it has to do with the fact that UCS is managed at the fabric interconnect level versus the chassis level.

Seems like a done deal then, doesn’t it?  Cisco has a solution that meets our needs better than HP.  Simple.  Not really.  Management wanted us to consider other vendors, notably IBM.  Why IBM?  They support multiple processor families (Intel, PowerPC, Sparc), have a good track record, and have a fair amount of market share.  So in come the IBM folks to discuss their offerings.  Personally, I wasn’t impressed.  While there was some interesting technology there, it just seemed ‘old’.  Judging by some other blog posts I have read, IBM agrees and will be coming out with some new offerings over the next few months…

Are we there yet, are we there yet, are we there yet?   Nope.  HP/vendor had one more trick up their sleeve.  They managed to get some info on our criteria and then stated that they proposed the wrong product.  Instead of just a blade system, they felt that they should have proposed the BladeSystem Matrix.  Well if they weren’t complex before, they sure were then.  We went through a demo of the Matrix software and all I can say is the complexity score shot through the roof (in a bad way).   I don’t think bolting on software to SIM was the right way to go.  Even then, it was obvious that some components were not tightly integrated and were just being launched by SIM.  However, some of the new functionality did support our strategic initiatives more so than just the plain blade system as originally proposed.

In the end, we chose Cisco.  Is it a done deal?  No.  There is still some jockeying going on.  All I can say is that Cisco has stepped up to the plate and taken on the challenge of proving to us that they offer the best solution to meet our needs.

And for those strategic initiatives…next post, maybe 🙂

Categories: Hardware refresh

Battle of the Blades – Part I

April 5, 2010

Please read first.

So how does one go about choosing a blade system?   I don’t have an answer for you.  I can tell you what we did, but it may not be the proper methodology for your organization.

When all is said and done, most blade systems appear nearly identical.  After all, what is a blade server?  It’s a rack mount turned vertical.  Yes, there is more to them but that is the essence of it.   If you are 100% satisfied with your incumbent provider and they meet your needs,  then stick with them.    Or you can do what we did and take the opportunity to envision what your ideal data center environment would look like and try to find the provider that comes closest to that vision.

Now I am not going to pull one over on you and say that we knew what our ideal data center would look like from day one.  We didn’t.  Our vision evolved as we reviewed the various offerings from the tier 1 vendors.  Our vision evolved as we learned the strategic plans of our various business units.  Our vision evolved as…  Yes, our vision is ever-changing.  My team and I will never 100% know where our organization is going, since we work in an ever-changing world.  So what did we do?  We courted multiple providers.

Yes, you read correctly.  We came up with a basic configuration based on the Capacity Planner analysis (and other factors) and then asked multiple vendors to provide a Bill of Materials (BOM).  Then we sent the BOMs to various resellers for pricing.  The main reason we did this was to have all our paperwork ready for our purchasing department to take to the City Council for approval once we made a decision on product.  Getting all our ducks in a row takes time, so even though we didn’t have a final choice of product, we could at least keep the process moving along.  Besides, getting the BOMs and pricing helped us develop a five-year cost model.  Unless you are funded extremely well, price has to play a factor in the decision-making.  The best product in the world is not going to find a spot in my data center if it is ten times more expensive than everyone else and does not make me ten times more productive, reduce other costs by 10x, etc.
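The five-year cost model boils down to something like the following sketch; the acquisition, support, and power figures are hypothetical placeholders, not our actual BOM pricing:

```python
def five_year_cost(acquisition, annual_support, annual_power, years=5):
    """Simple TCO: up-front price plus recurring support and power."""
    return acquisition + years * (annual_support + annual_power)

# Two hypothetical vendor BOMs priced by resellers (placeholder dollars).
vendor_a = five_year_cost(acquisition=250_000, annual_support=40_000,
                          annual_power=15_000)   # 525,000
vendor_b = five_year_cost(acquisition=300_000, annual_support=25_000,
                          annual_power=9_000)    # 470,000

# Vendor A has the cheaper sticker price, but Vendor B wins over 5 years.
print(vendor_a, vendor_b)  # 525000 470000
```

The design point is that recurring costs get multiplied by the horizon while acquisition doesn’t, which is exactly why a cheaper up-front quote can lose over five years.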

While all this was taking place, we finally reached a point where we could articulate our data center vision and defend it.  It’s one thing to say we want to reduce power consumption, reduce our data center footprint, blah, blah, blah.  Everyone says that and all providers can meet those requirements.   These  bullet points were not going to help us decide on a product.   Besides addressing the various strategic initiatives, we needed to address what causes my team the most pain:  cabling and ease of hardware management.

Just for giggles, go do a Google image search on the phrase “a network cable is unplugged”.  While our data center was nowhere near as bad as that one, we do have some cable management nightmares.  When a rack is put in, everything is nice and neat.   In a dynamic data center, the cabling  becomes a nightmare one cable at a time.  If I had to come up with a movie title for it it would probably be: “Adds, moves, and changes:  The bane of the data center.”

Ease of hardware management was our second-greatest pain.  We currently use HP servers, so our primary server management tool is Systems Insight Manager (SIM).  SIM isn’t bad, but it isn’t great either.  It offers a fair amount of functionality for the price (free with server purchase).  However, it has some niggling quirks which drive us crazy.  For starters, it uses reverse DNS lookups to determine system names.  What happens if a server has fourteen aliases?  Thirteen are marked as down.  Instead of querying the host’s primary name, it picks up whatever DNS spits out at the time of the reverse lookup.  That sort of makes alerting/alarming harder than it has to be.
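The reverse-DNS quirk is easy to see with a toy PTR table standing in for DNS.  The hostnames and address below are hypothetical, and this only approximates SIM’s discovery behavior:

```python
# One PTR record per address, but many forward aliases for the same box.
# (All names and addresses are hypothetical.)
ptr_records = {"10.0.0.5": "web01.corp.example"}
aliases = {"web01.corp.example", "intranet.example", "portal.example"}

def discovered_name(ip):
    # SIM-style discovery: take whatever the reverse lookup returns,
    # rather than asking the host for its configured primary name.
    return ptr_records[ip]

name = discovered_name("10.0.0.5")
# Every alias except the one PTR name looks like a separate, down server.
marked_down = sorted(aliases - {name})
print(name, marked_down)  # web01.corp.example ['intranet.example', 'portal.example']
```

One box, three known names, two phantom “down” systems: exactly the alerting noise described above.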

Of course, all this assumes it can discover every server.  We’ve had times when a server had to be manually entered into SIM and still couldn’t be seen.

The final issue we have with SIM is its interface.  It’s just not as friendly as it could be.  To give you an idea what I am talking about… There are some blogs out there that seem to think that the HP BladeSystem Matrix management console can only manage 250 servers.  The real answer is that it can manage over 1300 servers.  The 250 number comes from the HP system engineers due to SIM’s interface.  SIM just doesn’t visually handle large numbers of objects very well.

That’s it for this entry.  My next post will cover some strategic initiatives and how they factored into our product choice.

Categories: Hardware refresh