Archive

Posts Tagged ‘storage’

My Friend Needs a Product From VCE

Disclosure: I work for VCE so make of it what you will.

I had lunch this weekend with a friend of mine who manages the server infrastructure for a local government entity.  During the usual banter regarding business, he mentioned that a recent storage project (1+ year and still going) had suffered a series of setbacks due to outages, product compatibility issues, and other minor odds & ends.

The entity he worked at released an RFP that detailed quite a bit of what they were looking to accomplish, which in retrospect may have been too ambitious given the budget available.  After going through all the proposals, the list was narrowed down to two.  On paper, both were excellent.  All vendors were either Tier 1 or Tier 2 (as defined by sales).  Both proposals were heavily vetted with on-site visits to existing customers, reference phone calls, etc.  Both proposals consisted of a storage back-end with copious amounts of software for replication, snapshotting, provisioning, heterogeneous storage virtualization, and more.

While each component vendor of the winning proposal was highly respected, the two together did not have a large installed base that mimicked the proposed configuration (combo of hardware & software).  In hindsight, this should have been a big RED flag.  What looked good on paper did not pan out in reality.

Procurement went smoothly and the equipment arrived on time.  Installation went well too.  It was during integration with the existing environment that things started to break down.  The heterogeneous virtualization layer wasn’t quite as transparent as advertised on paper.  Turns out all the servers needed to have their storage unpresented and then presented back.  (Hmmm…first outage, albeit not a crash.)

Then servers started having BSODs.  A review by the vendors determined that new HBAs were needed.  This was quite the surprise to my friend since he had provided all the tech info on the server & storage environment as part of the proposal process and was told that all existing equipment was compatible with the proposal.  (2nd, 3rd, 4th+ outages…crashes this time.)

So HBAs were updated, drivers installed, and hopefully goodness would ensue. (planned outages galore).

Unfortunately, goodness would not last.  Performance issues, random outages, and more.  This is where it started to get nasty and the finger pointing began.  My friend runs servers from Cisco (UCS) and HP.  The storage software vendor started pointing at the server vendors.  Then problems were attributed to both VMware and Microsoft. (more unplanned outages).

Then the two component vendors started pointing fingers at each other.  Talk about partnership breakdown.

So what is my friend’s shop doing?  They are buying some EMC and NetApp storage for their Tier 1 apps.  Tier 2 and Tier 3 apps will remain on the problematic storage.  They are also scaling back their ambitious goals since they can’t afford all the bells & whistles from either vendor.  The reason they didn’t purchase EMC or NetApp in the first place was due to fiscal constraints.  Those constraints still exist.

As I listened to the details, I realized that he really needed one of the Vblock™ Infrastructure Platforms from VCE.

First, his shop already has the constituent products that make up the majority of a Vblock – Cisco UCS, EMC Storage (small amount), and vSphere. This makes transitioning to a Vblock™ easier since less training is needed and a comfort level already exists for those products.

Second, the hardware and software components in the winning proposal were not widely deployed in the proposed configuration.  At the time my friend’s shop put out the RFP, there were more Vblocks in production use in the United States than there were of the winning proposal’s configuration world-wide.

Third, at VCE, all the hardware/software components in a Vblock™ are thoroughly tested together.  It’s VCE’s job to find any problems so a customer doesn’t.  People think that if you take three items listed on an HCL somewhere and put them together, then everything will work fine.  It just isn’t true.   In fact, a number of patches put out by the parent companies are the result of testing by VCE’s engineering and quality assurance teams.

And finally, probably the biggest reason why my friend needs a product from VCE is to get rid of all the finger pointing.  When VCE sells a product, every component is supported by VCE.  There is no saying “It’s not our problem, call the server vendor”, or “call the storage vendor”, or “call the software vendor”.  I’m not saying that all vendors finger point, and your mileage with any particular set of vendors will vary, but if you are a CIO/manager/whatever, you have to admit that “one call, that’s all” is quite compelling.  You can either have your staff spend time managing vendors or you can have them spend time moving your business forward.

I’ve put a bug in my friend’s ear about going Vblock in the future.  It won’t happen any time soon since his procurement cycle isn’t conducive to purchasing all his infrastructure components in one fiscal year.  It usually takes five years to get through all three major components.  But who knows?  Maybe his recent experiences will accelerate the cycle.

What about Tintri?

August 2, 2011

I attended the Phoenix VMUG meeting this week.  The two main sessions were about vSphere 5 and Tintri’s VMstore.  While vSphere 5 is interesting, I have been working with it for over five months now so it wasn’t a “must see” presentation for me.  I was actually at the event to see Tintri and I have to say that the Tintri VMstore product intrigues me quite a bit.  For those who haven’t heard of this product, think of it as a purpose-built storage appliance for your VMware environment.   This “appliance” is roughly 8.5TB (usable) and is only accessed via NFS.  The entire device presents itself as one large datastore to your hosts.  If you think about it, this really does simplify things quite a bit.  There is no zoning, no LUN creation, no disk grouping, etc.  Basically, all of your standard storage creation tasks have been removed.  Time to add capacity? Just add another appliance and add it to your vCenter as another datastore.  It’s that simple.
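To give a sense of just how little provisioning is involved, here is roughly what presenting an appliance like this to a host looks like from the ESX service console.  This is only a sketch; the IP address, export path, and datastore name are made up for illustration.

  # mount the appliance's NFS export as a datastore, then confirm it
  esxcfg-nas -a -o 192.168.50.10 -s /tintri VMstore01
  esxcfg-nas -l

Run that on each host (or script it) and the new capacity simply shows up in vCenter as another datastore.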

Management of the appliance is performed through a web interface and via a vCenter Plug-in.  The bulk of what you expect in a management tool is there with a few notable exceptions (discussed later in this post).

One of the VMstore design goals is performance.  To that end, Tintri has equipped the VMstore with 1TB of SSD storage.  Through their own internally developed magic, the bulk of “hot” data is kept in SSD.  The rest is stored on SATA disks.  You can imagine the kind of IOPS possible given the heavy use of SSD.  BTW, the SSD is deduped so you get more bang for your buck.

The folks at Tintri gave the standard “who we are” and “why we are different” presentation that we all expect at open events like this.  After talking about the product and walking us through the management interface, the Tintri folks took questions from the audience.  All-in-all, a good showing.

There were no hard questions asked at the VMUG, but the after-meeting was completely different.  I am also a member of the Advanced Technology Networking Group (ATNG) and we met up with the Tintri folks a few hours later.  ATNG consists of hardcore techies, and since many of our members are responsible for acquisitions and managing data centers, our meetings with vendors tend to be “no holds barred”, but in a friendly way.  Our goal is to get to know the product (warts and all) as much as we can during our meetings.

We questioned a lot of the design choices and where the product is going.  One area of particular interest to me was the use of SATA drives.  Yes, the appliance uses RAID6 and has hot spares, but that did not alleviate my concern.  Drive quality continues to improve, so only time will tell if this was a good design choice or not.

Another area questioned was the use of a single controller.  The general rule of enterprise storage is to have two controllers.  VMstore currently has one.  Notice that I say “currently”.  A future version of the product will have two controllers.

There were a few questions and suggestions regarding the management interface.  One suggestion was to rename the VMstore snapshot function.  It is not the same snapshot feature as in vCenter.  vCenter has no visibility into VMstore native snapshots and vice-versa.  This can be a source of confusion if you consider that the target audience for this product is VM admins.

The lack of some enterprise features also came up in our discussions.  Notably, the lack of SNMP support and the lack of replication support.  The only way to get notified of something going wrong with the appliance is to either receive an email alert or see something in vCenter.    As for replication, the only option available is to perform a standard vm backup and restore the data to another appliance or storage device of your choice.

However, all is not doom and gloom.  Tintri is working on updates and improvements.  SNMP support, replication capabilities, and more are coming soon.   Keep in mind that Tintri recently came out of stealth mode and is on 1.0 of their product.   For a 1.0 product, it’s pretty good.  Just to give an idea of the performance and quality of VMstore, Tintri has a reference customer that will attest that they have been running a beta version since November 2010 without any issues.  In fact, that customer is still on the beta code and has not upgraded.  That’s a pretty good reference if you ask me.

So what do I think of VMstore?  I think Tintri is on the right track.  Purpose-built storage for VMware is a great concept.  It shows a laser-like focus on a particular market and it lets the company concentrate on capabilities and features that are specific to that market.  Generic storage has to cater to many masters and sometimes gets lost in the process.

I am going to predict that Tintri will either be copied by other storage vendors or be acquired by one of them.  The product/concept is just too unique and spot-on to be ignored.

Does the Storage Technology Really Matter?

November 15, 2010

This article is really more of a rant.  Take it for what it is.  I’m just a frustrated infrastructure admin trying to choose a storage product.

I am not a storage admin (and I don’t play one on TV), but I am on our storage replacement project representing the server infrastructure area.  In preparation for this project, I started reading a number of the more popular blogs and following some of the storage Tweeters.  One thing I noticed is that all the banter seems to be about speeds and feeds as opposed to solving business problems.  In the case of Twitter, I am guessing it’s due to the 140 character limit, but I would expect to see more in the blogs.  Some of the back & forth reminds me of the old elementary school bravado along the lines of “My dad can beat up your dad”.

I must admit that I am learning a lot, but does it really matter if it supports iSCSI, FC, FCoE, NFS, or other acronyms?  As long as it fits into my existing infrastructure with minimal disruption and provides me options for growth (capacity and features), should I care?   If so, why should I care? We recently moved the bulk of our data center to Cisco UCS, so you would think that FCoE would be a highly valued feature of our new solution.  But it’s not.  We don’t run Cisco networking gear and our current gear provider has no short-term plans for FCoE.  Given that we just finished a network upgrade project, I don’t foresee FCoE in our environment for at least three years unless funding magically appears.  It doesn’t mean that it isn’t on our radar; it just means that it won’t be here for at least three years.  So stop trying to sell me on FCoE support.

So who has the better solution?  I am going to use EMC and NetApp in my example just because they blog/tweet a lot.

I think if one were to put a feature comparison chart together, either EMC or NetApp could sit at the head of any column.  Their products look the same to me.  Both have replication software, both support snapshots, both support multiple protocols, and so on and so on and so on.  The features list is pages long and each vendor seems to match the other.

There are technical differences in how these features are implemented and in how the back-end arrays work, but should I care?  Tell me how these features will help my business grow.  Tell me how these features will protect my business.  Tell me how these features will save my business money. Tell me how they can integrate into my existing infrastructure without having to change my infrastructure.  And when I say “tell me”, don’t just say “it can do this” or “it can do that”.  Give me case studies more than six pages long, give me processes and procedures, and give me REAL metrics that I can replicate and validate (assuming I had the equipment and time) in a real-world scenario, along with information telling me how they affect my apps and customers.

This is an area where companies need to do a better job of marketing.  EMC started down this path with the Vblock.  Techies aren’t really interested because the specs are blasé.  C-level folks love it because it is marketed towards them and the marketing focuses on the solution from a business perspective.   NetApp is starting to do the same with their recently announced FlexPod.  The main downside to these new initiatives is that they seem to forget about the SMB.  I think it’s great from a techie POV that a FlexPod can handle 50,000 VDI sessions.  But as an IT Architect for my organization, so what?  We only have 4,200 employees or so.

Right now, I’m sort of in-between in the type of information I need: technical vs. business.  I am technical at heart, but have been looking at things from a business perspective for the last few years.  I am in the process of trying to map what our management team wants to accomplish over the next few years to the storage feature sets out there in the market.  This is where both types come together.  Now if I can just get past the FUD.

Does ESX lack storage resiliency?

October 27, 2010

Back in April, I posted about how our primary storage vendor disavowed us after one of our arrays failed by saying, “It’s not our problem”.  (You can read about it here.)  Well, this same vendor had to do some “routine” maintenance on one of our arrays that was so “routine” that the vendor claimed it would not have any impact on our servers.  The vendor technician came onsite to do the work and reaffirmed that it should have no visible impact.   This routine maintenance was just a reboot of one controller, a wait for it to come back online, and then a reboot of the other.  Over 50 servers went down and it took us three hours to recover.

While I could go on and rant about the vendor, I really want to focus on something I noticed about the outage.  Almost all of our physical Windows servers tolerated the outage and suffered no major problems, but our ESX hosts are another story altogether.

All of our ESX hosts that were attached to the array in question basically “froze”.  It was really weird.  Virtual Center said all the virtual servers were up and running, but we couldn’t do anything with them.  Rebooted VC, no change.    I logged into the service consoles of the hosts to run various iterations of vmware-cmd to manipulate the virtual servers, but nothing worked.   I figured the only thing I could do at this point was to reboot the hosts.  Since my hosts are attached to multiple arrays, I tried to vMotion the known good virtual servers to a single host so I wouldn’t have to take them down.  No go.  Basically, I had lost all control of my hosts.
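For reference, these are the kinds of vmware-cmd iterations I mean, none of which got a useful response while the storage was wedged.  The config file path below is just a placeholder, not one of our real virtual servers.

  vmware-cmd -l                                                   # list the VMs registered on this host
  vmware-cmd /vmfs/volumes/datastore1/vm01/vm01.vmx getstate      # query the power state of one VM
  vmware-cmd /vmfs/volumes/datastore1/vm01/vm01.vmx stop trysoft  # attempt a graceful shutdown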

OK, time for a reboot.  Did that and I lost all access to my LUNs.  A quick looksie into UCSM showed all my connections were up.  So did Fabric Manager.   I could see during the reboots that ESX was complaining about not being able to read a label for a volume or two.  Reviewing various host log files showed a number of weird entries that I have no idea how to interpret.  Many were obviously disk related, others weren’t.
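For anyone who ends up in a similar spot, these are the sorts of checks I was running from the service console to see whether the LUNs and paths had come back.  I’m showing the ESX 4.x classic commands here; your vmhba numbers will obviously differ.

  esxcfg-rescan vmhba1          # rescan a single HBA
  esxcfg-mpath -l               # list the storage paths the host can see
  esxcfg-scsidevs -m            # map VMFS volumes to their backing devices
  tail -f /var/log/vmkernel     # watch for SCSI errors during the rescan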

After multiple reboots, HBA rescans (initiated via VC and the service console), and such, we still couldn’t see the LUNs.  Keep in mind, we were three hours into a major outage.  That is the point where I have to get real creative in coming up with solutions.  I am not going to say that these solutions are ideal, but they will get us up and running.  In this case, I was thinking of repurposing our dev ESX hosts for our production environment.  All it would take would be to add them to the appropriate cluster, present the LUNs, manually register any really messed-up virtual servers, and power up the virtual servers.

Before I presented this idea to management, I don’t know how or why, but something triggered a memory of my first ESX host failure.   Way back in the ESX 2.x days, I had a problem where a patch took out access to my LUNs.  The fix was to run the command ‘esxcfg-boot -b’.   Ran it, problem fixed.

I know that the esxcfg-boot command rebuilds the initrd and such, but I really don’t know why it fixed the problem.  Did something happen to my HBA drivers/config?

What really bothers me about this is that almost all of my Windows servers and clusters came back online by themselves.  If they can do it, why can’t VMware program a bit more resiliency into ESX?  I hate to say this, but incidents like this make me question my choice of hypervisor.  Since the latest version of Hyper-V relies on Windows Failover Clustering, would it have responded like my existing clusters and tolerated the outage appropriately?  Anyone know?

Week One of Cisco UCS Implementation Complete

July 5, 2010

The first week of Cisco UCS implementation has passed.  I wish I could say we were 100% successful, but I can’t.  We’ve encountered two sticking points which are requiring some rethinking on our part.

The first problem we have run into revolves around our SAN.  The firmware on our MDS switches is a bit out of date and we’ve encountered a display bug in the graphical SAN management tool (Fabric Manager).  This display bug won’t show our UCS components as “zoneable” addresses.  This means that all SAN configurations relating to UCS have to be done via command line.   Why don’t we update our SAN switch firmware?  That would also entail updating the firmware on our storage arrays and it is not something we are prepared to do right now.  It might end up occurring sooner rather than later if doing everything via command line is too cumbersome.
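For the curious, zoning a UCS vHBA to an array port from the MDS command line looks roughly like the following.  This is only a sketch: the VSAN number, zone and zoneset names, and pWWNs are all made up, with the first pWWN standing in for a UCS service profile vHBA and the second for an array target port.

  conf t
  zone name UCS_ESX01_fabA vsan 10
    member pwwn 20:00:00:25:b5:01:0a:01
    member pwwn 50:06:01:60:3b:20:19:99
  zoneset name FABRIC_A vsan 10
    member UCS_ESX01_fabA
  zoneset activate name FABRIC_A vsan 10
  end
  copy running-config startup-config

Not hard, but it is one more thing to type correctly for every blade, which is exactly why we would rather have Fabric Manager show the addresses properly.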

The second problem involves connecting to two separate L2 networks.  This has been discussed on various blogs such as BradHedlund.com and the Unified Computing Blog.  Suffice it to say that we have proven that UCS was not designed to directly connect to two different L2 networks at the same time.  While there is a forthcoming firmware update that will address this, it does not help us now.  I should clarify that this is not a bug and that UCS is working as designed.  I am going to guess that either Cisco engineers did not think customers would want to connect to two L2 networks or that it was just a future roadmap feature.  Either way, we are working on methods to get around the problem.

For those who didn’t click the links to the other blogs, here’s a short synopsis:  UCS basically treats all uplink ports equally.  It doesn’t know about the different networks, so it will assume any VLAN can be on any uplink port.  The usual L2 behaviors (ARPs, broadcasts, and so on) then apply.  If you want a better description, please go click the links in the previous paragraph.

But the entire week was not wasted and we managed to accomplish quite a bit.  Once we get past the two hurdles mentioned above, we should be able to commence our testing.  It’s actually quite a bit of work to get this far.  Here’s how it pans out:

  1. Completed setup of policies
  2. Completed setup of Service Profile Templates
  3. Successfully deployed a number of different server types based on Service Profiles and Server Pool Policy Qualifications
  4. Configured our VM infrastructure to support Palo
  5. Configured UCS to support our VM infrastructure
  6. Successfully integrated UCS into our Windows Deployment system

Just getting past numbers 1 and 2 was a feat.  There are a number of policies that you can set, so it is very easy to go overboard and create/modify way too many.   The more you create, the more you have to manage, and we are trying to follow the K.I.S.S. principle as much as possible.   We started out by having too many policies, but eventually came to our senses and whittled the number down.

One odd item to note: when creating vNIC templates, a corresponding port profile is created under the VM tab of UCS Manager.  Deleting vNIC templates does not delete the corresponding port profiles so you will have to manually delete them.  Consistency would be nice here.

And finally, now that we have a complete rack of UCS, I can show you just how “clean” the system looks.

Before (photo): the cabling on a typical rack

After (photo): a full rack of UCS - notice the clean cabling

Let’s hope week number two gets us into testing mode…..

Prepping for our Cisco UCS Implementation

The purchase order has finally been sent in.  This means our implementation is really going to happen.  We’ve been told there is a three-week lead time to get the product, but Cisco is looking to reduce it to two weeks.  A lot has to happen before the first package arrives.  Two logistical items of note are:

  • Stockroom prep
  • Datacenter prep

What do I mean by “Stockroom prep”?  A lot, actually.  While not a large UCS implementation by many standards, we are purchasing a fair amount of equipment.  We’ve contacted Cisco for various pieces of logistical information such as box dimensions and the number of boxes we can expect to receive.   Once it gets here, we have to store it.

Our stockroom is maybe 30×40 and houses all our non-deployed IT equipment.  It also houses all our physical layer products (think cabling) too.    A quick look at the area dedicated to servers reveals parts for servers going back almost ten years.  Yes, I have running servers that are nearly ten years old <sigh>.    Throw in generic equipment such as KVMs, rackmount monitors, rackmount keyboards, etc., and it adds up.   Our plan is to review our existing inventory of deployed equipment and their service histories.  We’ll then compare that info against our stockroom inventory to see what can be sent to disposal.   Since we don’t have a lot of room, we’ll be really cutting down to the bone, which introduces an element of risk.  If we plan correctly, we’ll have the minimum number of parts in our stockroom to get us through our migration.  If we are wrong and something fails, I guess we’ll be buying some really old parts off eBay…

As for prepping the data-center, it’s a bit less labor but a lot more complex.  Our data-center PDUs are almost full so we’ll be doing some re-wiring.  As a side note, the rack PDU recommended by our Cisco SE has an interesting connector to say the least.  These puppies run about $250 each.  The PDUs run over $1200 each.   Since we’ll be running two 42U racks of equipment, that equals four of each component.  That’s almost $6K in power equipment!!

As another data-center prep task, we will need to do some server shuffling.  Servers in rack A will need to move to a different rack.  No biggie, but it takes some effort to pre-cable, schedule the downtime, and then execute the move.

All-in-all, a fair amount of work to do in a short time-frame.

A Trend in Technology Provider Marketing Techniques

The recession has brought about a few major changes in sales/marketing techniques in the technology industry.  There was a time when only executive management was wined and dined and the common man was left out in the cold.  Well my friends, that time is no more.

Over the last 18 months or so, I have been invited to more lunches and activity-based events than I have in my 20+ years in the IT industry.  The two (lunches and activity-based events) can be broken down into two categories of providers: those selling really expensive products and those with not-so-expensive products.

Those in the really expensive product category are usually storage providers.  Since these systems can easily reach into the hundreds of thousands of dollars, the sales/marketing experience has to be equally impressive.  As such, the event most often chosen is lunch at an upscale steak restaurant such as Ruth’s Chris or Fleming’s.    The typical event consists of a short presentation (usually under 30 minutes) followed by a lunch from a scaled-down menu.  Even though the menu is scaled down, the quality of the food is not; the reputation of the restaurant is still on display.

In the not-so-expensive category, we typically find VARs and small product vendors.  The event of choice in this category is entrance to something with mass appeal, such as a blockbuster movie’s opening day.   As with the lunches, the event begins with a 30-minute presentation and then the movie begins.   This type of event has become so pervasive that I recently had three invitations to see Iron Man 2 at the same theater on the same day (all at different times).

I don’t go to the lunches very often because I feel it is disingenuous to take advantage of something so expensive for no return.   I only attend when I have a budgeted project.  I’m also careful to keep track of the “promoter”.  Some promoters are very good at setting up the presentations so that real information is imparted.  Others are there just to get butts in the seats and the presentations tend to suffer for it.  While I enjoy a good meal, I don’t want to waste my time.  However, I do partake in some of the movies since they usually take place on a Friday (my day off) and I use them to network with the VAR and other IT professionals.

Other events in the expensive category:

  • Tickets to major golf tournaments
  • Tickets to basketball games
  • Tickets to concerts

Other events in the not-so-expensive category:

  • Tickets to baseball games (many can be bought in volume for under $10 each)
  • Kart racing (fast go-karts)
  • Lunch and games at a large entertainment venue such as Dave & Busters

What else have you seen?   Anything outrageous?
