Archive for the ‘Hardware refresh’ Category

My Friend Needs a Product From VCE

Disclosure: I work for VCE so make of it what you will.

I had lunch this weekend with a friend of mine who manages the server infrastructure for a local government entity. During the usual banter about business, he mentioned that a recent storage project (1+ year and still going) had suffered a series of setbacks due to outages, product compatibility issues, and other minor odds & ends.

The entity he works at released an RFP that detailed quite a bit of what they were looking to accomplish, which in retrospect may have been too ambitious given the budget available. After going through all the proposals, the list was narrowed down to two. On paper, both were excellent. All vendors were either Tier 1 or Tier 2 (as defined by sales). Both proposals were heavily vetted with on-site visits to existing customers, reference phone calls, etc. Both proposals consisted of a storage back-end with copious amounts of software for replication, snapshotting, provisioning, heterogeneous storage virtualization, and more.

While each component vendor of the winning proposal was highly respected, the two together did not have a large installed base running the proposed configuration (the combination of hardware & software). In hindsight, this should have been a big RED flag. What looked good on paper did not pan out in reality.

Procurement went smoothly and the equipment arrived on time. Installation also went smoothly. It was during integration with the existing environment that things started to break down. The heterogeneous virtualization layer wasn’t quite as transparent as advertised on paper. It turns out all the servers needed to have their storage unpresented and presented back. (Hmmm…first outage, albeit not a crash.)

Then servers started having BSODs. A review by the vendors determined that new HBAs were needed. This was quite the surprise to my friend, since he had provided all the tech info on the server & storage environment as part of the proposal and was told that all existing equipment was compatible. (2nd, 3rd, 4th+ outages…crashes this time.)

So HBAs were updated, drivers installed, and hopefully goodness would ensue. (planned outages galore).

Unfortunately, goodness would not last.  Performance issues, random outages, and more.  This is where it started to get nasty and the finger pointing began.  My friend runs servers from Cisco (UCS) and HP.  The storage software vendor started pointing at the server vendors.  Then problems were attributed to both VMware and Microsoft. (more unplanned outages).

Then the two component vendors started pointing fingers at each other.  Talk about partnership breakdown.

So what is my friend’s shop doing?  They are buying some EMC and NetApp storage for their Tier 1 apps.  Tier 2 and Tier 3 apps will remain on the problematic storage.  They are also scaling back their ambitious goals since they can’t afford all the bells & whistles from either vendor.  The reason they didn’t purchase EMC or NetApp in the first place was due to fiscal constraints.  Those constraints still exist.

As I listened to the details, I realized that he really needed one of the Vblock™ Infrastructure Platforms from VCE.

First, his shop already has the constituent products that make up the majority of a Vblock – Cisco UCS, EMC storage (a small amount), and vSphere. This makes transitioning to a Vblock™ easier, since less training is needed and a comfort level already exists for those products.

Second, the hardware and software components in the winning proposal were not widely deployed in the proposed configuration.  At the time my friend’s shop put out the RFP, there were more Vblocks in production use in the United States than there were of the winning proposal’s configuration world-wide.

Third, at VCE, all the hardware/software components in a Vblock™ are thoroughly tested together. It’s VCE’s job to find any problems so a customer doesn’t. People think that if you take three items listed on an HCL somewhere and put them together, everything will work fine. It just isn’t true. In fact, a number of patches put out by the parent companies are the result of testing by VCE’s engineering and quality assurance teams.

And finally, probably the biggest reason my friend needs a product from VCE is to get rid of all the finger pointing. When VCE sells a product, every component is supported by VCE. There is no saying “It’s not our problem, call the server vendor”, or “call the storage vendor”, or “call the software vendor”. I’m not saying that all vendors finger-point, and your mileage with any particular set of vendors will vary, but if you are a CIO/Manager/whatever, you have to admit that “one call, that’s all” is quite compelling. You can either have your staff spend time managing vendors or you can have your staff spend time moving your business forward.

I’ve put a bug in my friend’s ear about going Vblock in the future. It won’t happen any time soon, since his procurement cycle isn’t conducive to purchasing all his infrastructure components in one fiscal year. It usually takes five years to get through all three major components. But who knows? Maybe his recent experiences will accelerate the cycle.


Does the Storage Technology Really Matter?

November 15, 2010

This article is really more of a rant.  Take it for what it is.  I’m just a frustrated infrastructure admin trying to choose a storage product.

I am not a storage admin (and I don’t play one on TV), but I am on our storage replacement project representing the server infrastructure area. In preparation for this project, I started reading a number of the more popular blogs and following some of the storage Tweeters. One thing I noticed is that all the banter seems to be about speeds and feeds rather than solving business problems. In the case of Twitter, I am guessing it’s due to the 140-character limit, but I would expect more from the blogs. Some of the back & forth reminds me of old elementary-school bravado along the lines of “My dad can beat up your dad”.

I must admit that I am learning a lot, but does it really matter if it supports iSCSI, FC, FCoE, NFS, or other acronyms? As long as it fits into my existing infrastructure with minimal disruption and provides me options for growth (capacity and features), should I care? If so, why should I care? We recently moved the bulk of our data center to Cisco UCS, so you would think that FCoE would be a highly valued feature of our new solution. But it’s not. We don’t run Cisco networking gear and our current gear provider has no short-term plans for FCoE. Given that we just finished a network upgrade project, I don’t foresee FCoE in our environment for at least three years unless funding magically appears. It doesn’t mean that it isn’t on our radar; it just means that it won’t be here for at least three years. So stop trying to sell me on FCoE support.

So who has the better solution?  I am going to use EMC and NetApp in my example just because they blog/tweet a lot.

I think if one were to put a feature chart together, either EMC or NetApp could sit at the head of any column. Their products look the same to me. Both have replication software, both support snapshots, both support multiple protocols, and so on and so on. The feature list is pages long, and each vendor seems to match the other.

There are technical differences in how these features are implemented and in how the back-end arrays work, but should I care? Tell me how these features will help my business grow. Tell me how these features will protect my business. Tell me how these features will save my business money. Tell me how they can integrate into my existing infrastructure without having to change my infrastructure. And when I say “tell me”, don’t just say “it can do this”, or “it can do that”. Give me case studies more than six pages long, give me processes and procedures, and give me REAL metrics that I can replicate/validate (assuming I had the equipment and time) in a real-world scenario, with information telling me how they affect my apps and customers.

This is an area where companies need to do a better job of marketing. EMC started down this path with the Vblock. Techies aren’t really interested because the specs are blasé. C-level folks love it because it is marketed toward them and the marketing focuses on the solution from a business perspective. NetApp is starting to do the same with their recently announced FlexPod. The main downside to these new initiatives is that they seem to forget about the SMB. I think it’s great from a techie POV that a FlexPod can handle 50,000 VDI sessions. But as an IT Architect for my organization, so what? We only have 4,200 employees or so.

Right now, I’m sort of in between on what type of information I need: technical vs. business. I am technical at heart, but I have been looking at things from a business perspective the last few years. I am in the process of trying to map what our management team wants to accomplish over the next few years to the storage feature sets out there in the market. This is where both types come together. Now if I can just get past the FUD.

A Major Milestone Has Been Reached!!

August 24, 2010

We did it, and we did it early. We completed the move of our existing VMware infrastructure onto the Cisco UCS platform. At the same time, we also moved from ESX 3.5 to vSphere. All in all, everything is pretty much working. The only outstanding issue we haven’t resolved involves Microsoft NLB and our Exchange CAS/HUB/OWA servers. NLB just doesn’t want to play nice, and we don’t know if the issue is related to vSphere, UCS, or something else entirely.

Next up: SQL Server clusters, P2Vs, and other bare metal workloads.

SQL Server migrations have already started and are going well.  We have a few more clusters to build and that should be that for SQL.

P2Vs present a small challenge. A minor annoyance that we will have to live with is an issue with VMware Converter. Specifically, we’ve run into a problem with resizing disks during the P2V process. The process fails about 2% into the conversion with an “Unknown Error”. It seems a number of people have run into this problem, and the workaround provided by VMware in KB1004588 (and others) is to P2V as-is and then run the guest through Converter again to resize the disks. This is going to cause us some scheduling headaches, but we’ll get through it. Without knowing the cause, I can’t narrow it down to being vSphere or UCS related. All I can say is that it does not happen when I P2V to my ESX 3.5 hosts. Alas, those are HP servers.
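
For the resize step itself, one alternative to a second Converter pass would be to grow the VMDK through the vSphere API after the as-is P2V and then extend the partition inside the guest. This is just a sketch of that idea using the pyVmomi bindings, not something we’ve actually put into our process; it assumes you already have the converted guest’s VirtualMachine object from a normal API session:

    # Sketch: grow the converted guest's first virtual disk via the vSphere
    # API. Assumes `vm` is the vim.VirtualMachine object for the freshly
    # converted guest, obtained from a normal SmartConnect session.
    from pyVmomi import vim

    def grow_first_disk(vm, new_size_gb):
        """Extend the first virtual disk; the partition still has to be
        grown inside the guest (diskpart on Windows, for example)."""
        disk = next(d for d in vm.config.hardware.device
                    if isinstance(d, vim.vm.device.VirtualDisk))
        spec = vim.vm.device.VirtualDeviceSpec()
        spec.operation = vim.vm.device.VirtualDeviceSpec.Operation.edit
        spec.device = disk
        spec.device.capacityInKB = new_size_gb * 1024 * 1024
        return vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[spec]))

    # grow_first_disk(vm, 80)  # grow to 80 GB, then extend in-guest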


We’ve gone all-in with Cisco and purchased a number of C-Series servers, recently deploying a few C-210 M2 servers to get our feet wet. Interesting design choices, to say the least. I will say that they are not bad, but they are not great either. My gold standard is the HP DL380 server line, and compared to the DL380, the C-210 needs a bit more work. For starters, the default drive controller is SATA, not SAS. I’m sorry, but I have a hard time feeling comfortable with SATA drives deployed in servers. SAS drives typically come with a 3-year warranty; SATA drives typically have a 1-year warranty. For some drive manufacturers, this stems from the fact that their SAS drives are designed for 24/7/365 use, but their SATA drives are not.

Hot-plug fans? Nope. These guys are hard-wired, and big. The overall length of the server is a bit of a stretch too, literally. We use the extended width/depth HP server cabinets, and these servers just fit. I think the length issue stems from the size of the fans (they are big and deep) and some dead space in the case. The cable arm also sticks out a bit more than I expected. With a few design modifications, the C-210 M2 could shrink three, maybe four inches in length.

I’ll post some updates as we get more experience with the C-Series.

Our Current UCS/vSphere Migration Status

August 17, 2010

We’ve migrated most of our virtual servers over to UCS and vSphere.  I’d say we are about 85% done, with this phase being completed by Aug 29.  It’s not that it’s taking 10+ days to actually do the rest of the migrations.  It’s more of a scheduling issue.  From my perspective, I have three more downtimes to go.  Not much at all.

The whole process of migrating from ESX to vSphere and updating all the virtual servers has been interesting, to say the least. We haven’t encountered any major problems; just some small items related to the VMTools/VM hardware version (4 to 7) upgrades. For example, our basic VMTools upgrade process is to right-click a guest in the VIC and click the appropriate items to perform an automatic upgrade. When it works, the guest installs VMTools, reboots, and comes back up without admin intervention. For some reason, this would not work for our MS Terminal Servers unless we were logged into the target terminal server.

Here’s another example, this time involving Windows Server 2008: the automatic upgrade process wouldn’t work there either. Instead, we had to log in, launch VMTools from the System Tray, and select upgrade. The only operating system that went perfectly was Windows Server 2003 with no fancy extras (terminal services, etc.). Luckily, that’s the OS most of our virtual workloads are running. I am going to hazard a guess and say that some of these oddities are related to our various security settings, GPOs, and the like.
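
For the guests where the automatic path does work, the same action can be kicked off in bulk through the vSphere API instead of right-clicking each VM in the VIC. A minimal sketch using the pyVmomi bindings (the vCenter address and credentials are placeholders; this is just the scripted equivalent of the automatic upgrade, so presumably the same Terminal Server and 2008 quirks would still apply):

    # Sketch: start an automatic VMware Tools upgrade on every powered-on
    # guest whose Tools are flagged as out of date. Connection details are
    # placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab shortcut; use real certs
    si = SmartConnect(host='vcenter.example.local', user='admin',
                      pwd='password', sslContext=ctx)
    content = si.RetrieveContent()

    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if (vm.runtime.powerState == 'poweredOn'
                and vm.guest.toolsVersionStatus == 'guestToolsNeedUpgrade'):
            print('Upgrading Tools on', vm.name)
            vm.UpgradeTools_Task()  # installs silently and reboots the guest

    Disconnect(si)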

All in all, the VM migration has gone very smoothly. I must say that I am happy with the quality of the VMware hypervisor, Virtual Center, and the other basic components. There has been plenty of opportunity for something to go extremely wrong, but so far, nada. (Knock on wood.)

So what’s next? We are preparing to migrate our SQL servers onto bare-metal blades. In reality, we are building new servers from scratch and installing SQL Server. The implementation of UCS has given us the opportunity to update our SQL servers to Windows Server 2008 and SQL Server 2008. Other planned moves include some Oracle app servers (on RedHat) as well as domain controllers, file share clusters, and maybe some tape backup servers. This should take us into September.

Once we finish with the blades, we’ll start deploying the Cisco C-series rackmount servers.  We still have a number of instances where we have to go rackmount.   Servers in this category typically need multiple NICs, telephony boards, or other specialized expansion boards.


Upgrade Follies

August 12, 2010

It’s amazing how many misconfigured (or seemingly misconfigured) items can show up when doing maintenance and/or upgrades. In the past three weeks, we have found at least four production items that fit this description, none of which anyone noticed because things appeared to be working. Here’s a sampling:

During our migration from our legacy VM host hardware to UCS, we broke a website that was hardware load-balanced across two different servers. Traffic should have been directed to Server A, then Server B, then Server C. After the migration, traffic was only going to Server C, which just hosts a page that says the site is down. It’s a “maintenance” server, meaning that whenever we take a public-facing page down, the traffic gets directed to Server C so that people see a nice screen that says, “Sorry, down for maintenance…”

Everything looked right in the load balancer configuration. While delving deeper, we noticed that Server A was configured to be the primary node for a few other websites. An application analyst whose app was affected chimed in and said that the configuration was incorrect. Website 1 traffic was to go first to Server A, then B. Website 2 traffic was supposed to go in the opposite order. All our application documentation agreed with the analyst. Of course, he wrote the documentation, so it had better agree with him 🙂 Here is the disconnect: we track all our changes in a Change Management system, and no one ever put the desired configuration change into the system. As far as our network team is concerned, the load balancer is configured properly. Now this isn’t really a folly, since our production system/network matched what our change management and CMDB systems were telling us. This is actually GOODNESS. If we ever had to recover from a disaster, we would reference our CMDB and change management systems, so they had better be in agreement.

Here’s another example: we did a mail server upgrade about six months ago and everything worked, as far as we could tell. What we didn’t know was that some things were not working, but no one noticed because mail was getting through. When we did notice something incorrect (a remote monitoring system) and fixed the cause, it led us to another item, and so on and so on. Now, not everything was broken at the same time. In a few cases, the fix of one item actually broke something else. What’s funny is that if we hadn’t corrected the monitoring issue, everything would have still worked. It was a fix that caused all the other problems. In other words, one misconfiguration proved to be a correct configuration for other misconfigured items. In this case, multiple wrongs did make a right. Go figure.

My manager has a saying for this: “If you are going to miss, miss by enough”.


I’ve also noticed that I sometimes don’t understand concepts when I think I do. As part of our migration to UCS, we are also upgrading from ESX 3.5 to vSphere. Since I am new to vSphere, I did pretty much what every SysAdmin does: click all the buttons/links. One of those buttons is the “Advanced Runtime Info” link in the VMware HA portion of the main Virtual Center screen.

This link brings up info on slot sizes and usage. You would think the numbers would add up, but clearly they don’t.

How does 268 - 12 = 122? I’m either obviously math challenged or I really need to go back and re-read the concept of slots.
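
Some digging suggests the missing piece is that “available slots” is not simply total minus used: HA also subtracts the slots it reserves to cover the configured number of host failures, and per-host slot counts are computed against a worst-case slot size. Here’s a back-of-the-napkin sketch; the host sizes and failover setting are made-up numbers chosen only because they happen to reproduce the figures on my screen, not our actual cluster specs:

    # Back-of-the-napkin HA slot math (illustrative numbers, not our cluster).
    # A slot is the worst-case CPU/RAM reservation of any powered-on VM.
    # Available = total - used - slots reserved for host failover.
    slot_mhz, slot_mb = 256, 300              # worst-case slot size
    hosts = [(17200, 24576)] * 4              # (CPU MHz, RAM MB) per host
    host_failures_to_tolerate = 2             # HA admission control setting
    used_slots = 12

    def slots_on(cpu_mhz, ram_mb):
        # A host's slot count is limited by its scarcer resource.
        return min(cpu_mhz // slot_mhz, ram_mb // slot_mb)

    per_host = [slots_on(c, r) for c, r in hosts]
    total = sum(per_host)                     # 268
    # HA sets aside the N largest hosts' worth of slots for failover.
    reserved = sum(sorted(per_host, reverse=True)[:host_failures_to_tolerate])
    available = total - used_slots - reserved
    print(total, used_slots, reserved, available)   # 268 12 134 122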


Week Two of Cisco UCS Implementation Completed

Progress has been made!!

The first few days of the week involved a number of calls back to TAC, the UCS business unit, and various other Cisco resources without much progress. Then on Thursday I pressed the magic button and all of a sudden our fabric interconnects came alive in Fabric Manager (MDS control software). What did I do? I turned on SNMP. No one noticed that it was turned off (the default state). Pretty sad, actually, given the number of people involved in troubleshooting this.

(This paragraph is subject to change based on confirmation of accuracy from Cisco.) So here’s the basic gist of what was going on: we are running an older version of MDS firmware, and the version of Fabric Manager that ships with it is not really “UCS aware”. It needs a method of communicating with the fabric interconnects to fully see all the WWNs. The workaround is to use SNMP. I created an SNMP user in UCS and our storage admin created the same username/password in Fabric Manager. Of course, having the accounts created does nothing if the protocol they need to use is not active. Duh.

The screenshot below shows the button I am talking about. The reason no one noticed that SNMP was turned off was that I was able to add traps and users without any warnings about SNMP not being active. Also, take a look at the HTTP and HTTPS services listed above SNMP. They are enabled by default. Easy to miss.

[Screenshot: UCS Manager Communication Services page, showing the SNMP Admin State control with the HTTP and HTTPS services enabled above it]
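
A lesson learned here: verify the service is actually answering, not just that the accounts exist. A quick sanity check is an SNMPv3 get against the fabric interconnect’s management address. A minimal sketch with the pysnmp library (the address, user, and keys are placeholders, and the auth/priv settings must match whatever you configured in UCS):

    # Sanity check: is SNMP actually answering on the fabric interconnect?
    # Host, user, and keys are placeholders; match them to your UCS config.
    from pysnmp.hlapi import (SnmpEngine, UsmUserData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    error_indication, error_status, error_index, var_binds = next(getCmd(
        SnmpEngine(),
        UsmUserData('ucs-snmp-user', authKey='auth-pass', privKey='priv-pass'),
        UdpTransportTarget(('fi-a.example.local', 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity('SNMPv2-MIB', 'sysDescr', 0))))

    if error_indication:
        print('SNMP not responding:', error_indication)  # e.g. service off
    else:
        for name, value in var_binds:
            print(name, '=', value)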

With storage now presented, we were able to complete some basic testing. I must say that UCS is pretty resilient if you have cabled all your equipment wisely. We pulled power plugs, fibre to Ethernet, fibre to storage, etc., and only a few times did we lose a ping (a single PING!). All our data transfers kept transferring, pings kept pinging, RDP sessions stayed RDP’ing.
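
If you want to repeat this kind of pull-the-plug testing, about the only tooling needed is a timestamped ping loop against a guest on the blades, so you can see exactly when, and for how long, traffic drops. A throwaway sketch (the target address is a placeholder, and the ping flags are Linux-style):

    # Throwaway failover-test helper: timestamped ping loop so you can see
    # exactly when (and for how long) a path failure drops traffic.
    import subprocess
    import time
    from datetime import datetime

    TARGET = '10.0.0.50'  # a guest running on the UCS blades (placeholder)

    while True:
        # '-c 1 -W 1': one ping with a one-second timeout (Linux syntax).
        result = subprocess.run(['ping', '-c', '1', '-W', '1', TARGET],
                                stdout=subprocess.DEVNULL)
        stamp = datetime.now().strftime('%H:%M:%S')
        if result.returncode != 0:
            print(stamp, 'LOST ping to', TARGET)
        time.sleep(1)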

We did learn something interesting in regards to the Palo card and VMware. If you are using the basic Menlo card (a standard CNA), then failover works as expected. Palo is different. Suffice it to say that for every vNIC you think you need, add another one. In other words, you will need two vNICs per vSwitch. When creating vNICs, be sure to balance them across both fabrics and note down the MAC addresses. Then, when you are creating your vSwitches (or DVS) in VMware, apply two vNICs to each switch, using one from fabric A and one from fabric B. This provides the failover capability. I can’t provide all the details because I don’t know them, but it was explained to me by one of the UCS developers that this is a difference in the UCS hardware (Menlo vs. Palo).
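
To make the vSwitch half of that concrete, here is roughly what the step looks like if scripted against the host networking API with the pyVmomi bindings. The vmnic names are placeholders; the point is simply that each vSwitch gets two uplinks, one backed by a fabric-A vNIC and one by a fabric-B vNIC, matched up using the MAC addresses noted in UCS Manager:

    # Sketch: create a vSwitch with two uplinks, one vNIC from fabric A and
    # one from fabric B, which is what gives Palo-based profiles failover.
    # Assumes `host` is a vim.HostSystem from a normal pyVmomi session;
    # vmnic names are placeholders matched by MAC to the UCSM vNICs.
    from pyVmomi import vim

    FABRIC_A_VMNIC = 'vmnic2'   # MAC recorded from the fabric-A vNIC in UCSM
    FABRIC_B_VMNIC = 'vmnic3'   # MAC recorded from the fabric-B vNIC in UCSM

    net_sys = host.configManager.networkSystem
    spec = vim.host.VirtualSwitch.Specification()
    spec.numPorts = 128
    # Bond both uplinks onto the switch so either fabric can carry traffic.
    spec.bridge = vim.host.VirtualSwitch.BondBridge(
        nicDevice=[FABRIC_A_VMNIC, FABRIC_B_VMNIC])
    net_sys.AddVirtualSwitch(vswitchName='vSwitch1', spec=spec)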

Next up: testing, testing, and more testing, with some VLANing thrown in to help us connect to two disjoint L2 networks.


Week One of Cisco UCS Implementation Complete

July 5, 2010

The first week of Cisco UCS implementation has passed.  I wish I could say we were 100% successful, but I can’t.  We’ve encountered two sticking points which are requiring some rethinking on our part.

The first problem we have run into revolves around our SAN. The firmware on our MDS switches is a bit out of date, and we’ve encountered a display bug in the graphical SAN management tool (Fabric Manager). The bug prevents our UCS components from showing up as “zoneable” addresses. This means that all SAN configuration relating to UCS has to be done via the command line. Why don’t we update our SAN switch firmware? That would also entail updating the firmware on our storage arrays, and that is not something we are prepared to do right now. It might end up happening sooner rather than later if doing everything via the command line proves too cumbersome.
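
For what it’s worth, the command-line zoning isn’t terrible once scripted. Here’s a hedged sketch using the netmiko library to push a zone for one UCS vHBA; the switch address, VSAN number, zone/zoneset names, and WWPNs are all placeholders, so treat it as the shape of the task rather than our actual config:

    # Sketch: zone a UCS vHBA to a storage port from the MDS CLI, scripted
    # with netmiko. Switch address, VSAN, names, and WWPNs are placeholders.
    from netmiko import ConnectHandler

    mds = ConnectHandler(device_type='cisco_nxos', host='mds-a.example.local',
                         username='admin', password='password')

    commands = [
        'zone name ucs-esx01-vhba0 vsan 10',
        'member pwwn 20:00:00:25:b5:aa:00:01',   # UCS vHBA WWPN from UCSM
        'member pwwn 50:06:01:60:41:e0:12:34',   # array front-end port
        'zoneset name fabric-a vsan 10',
        'member ucs-esx01-vhba0',
    ]
    print(mds.send_config_set(commands))
    # Activating the zoneset commits the change to the fabric.
    print(mds.send_config_set(['zoneset activate name fabric-a vsan 10']))
    mds.disconnect()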

The second problem involves connecting to two separate L2 networks. This has been discussed on various blogs, such as BradHedlund.com and the Unified Computing Blog. Suffice it to say that we have proven that UCS was not designed to directly connect to two different L2 networks at the same time. While there is a forthcoming firmware update that will address this, it does not help us now. I should clarify that this is not a bug and that UCS is working as designed. I am going to guess that either Cisco’s engineers did not think customers would want to connect to two L2 networks or that it was simply a future roadmap feature. Either way, we are working on methods to get around the problem.

For those who didn’t click the links to the other blogs, here’s a short synopsis: UCS treats all uplink ports equally. It doesn’t know about the different networks, so it assumes any VLAN can be reached on any uplink port. The usual L2 behaviors (ARP, broadcast flooding, and so on) then work against you. If you want a better description, please click the links in the previous paragraph.

But the entire week was not wasted, and we managed to accomplish quite a bit. Once we get past the two hurdles mentioned above, we should be able to commence our testing. It’s actually quite a bit of work to get this far. Here’s how it pans out:

  1. Completed setup of policies
  2. Completed setup of Service Profile Templates
  3. Successfully deployed a number of different server types based on Service Profiles and Server Pool Policy Qualifications
  4. Configured our VM infrastructure to support Palo
  5. Configured UCS to support our VM infrastructure
  6. Successfully integrated UCS into our Windows Deployment system

Just getting past numbers 1 and 2 was a feat. There are a number of policies that you can set, so it is very easy to go overboard and create/modify way too many. The more you create, the more you have to manage, and we are trying to follow the K.I.S.S. principle as much as possible. We started out with too many policies, but eventually came to our senses and whittled the number down.

One odd item to note: when you create a vNIC template, a corresponding port profile is created under the VM tab of UCS Manager. Deleting the vNIC template does not delete the corresponding port profile, so you will have to delete it manually. Consistency would be nice here.
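
If the orphaned port profiles pile up, they can at least be found programmatically. A rough sketch with the Python UCS Manager SDK (ucsmsdk); fair warning that the class IDs here — VnicLanConnTempl for vNIC templates and VnicProfile for the VM-tab port profiles — are my best reading of the object model, and the address and credentials are placeholders:

    # Sketch: list port profiles that no longer have a matching vNIC
    # template, so the orphans can be cleaned up by hand. Class IDs are my
    # best reading of the UCSM object model; connection details are
    # placeholders.
    from ucsmsdk.ucshandle import UcsHandle

    handle = UcsHandle('ucsm.example.local', 'admin', 'password')
    handle.login()

    templates = {t.name for t in handle.query_classid('VnicLanConnTempl')}
    for profile in handle.query_classid('VnicProfile'):
        if profile.name not in templates:
            print('Orphaned port profile:', profile.dn)

    handle.logout()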

And finally, now that we have a complete rack of UCS, I can show you just how “clean” the system looks.

Before

The cabling on a typical rack

After

A full rack of UCS - notice the clean cabling


Let’s hope week number two gets us into testing mode…..
