Upgrade Follies

It’s amazing how many misconfigured, or seemingly misconfigured, items show up during maintenance and upgrades. In the past three weeks, we have found at least four such items in production that no one had noticed because things appeared to be working. Here’s a sampling:

During our migration from our legacy VM host hardware to UCS, we broke a website that was hardware load-balanced across two servers. Traffic should have been directed to Server A, then Server B, and only then to Server C. After the migration, traffic was going only to Server C, which just hosts a page that says the site is down. It’s a “maintenance” server: whenever we take a public-facing page down, the traffic gets directed to Server C so that people see a nice screen that says, “Sorry, down for maintenance …”
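The priority order described above can be sketched in a few lines of Python. This is only an illustration of priority-based failover, not the configuration language of any real load balancer; the pool layout, the `priority` field, and the `pick_server` function are all hypothetical names.

```python
# Hypothetical pool: a lower priority number means "try this server first".
POOL = [
    {"name": "Server A", "priority": 1},
    {"name": "Server B", "priority": 2},
    {"name": "Server C", "priority": 3},  # maintenance page only
]


def pick_server(pool, healthy):
    """Return the name of the highest-priority server that is in `healthy`.

    `healthy` is a set of server names that currently pass health checks.
    Returns None when no server in the pool is healthy.
    """
    for server in sorted(pool, key=lambda s: s["priority"]):
        if server["name"] in healthy:
            return server["name"]
    return None


# Normal operation: A wins. If A and B both look unhealthy to the
# load balancer, everything lands on the maintenance server.
print(pick_server(POOL, {"Server A", "Server B", "Server C"}))
print(pick_server(POOL, {"Server C"}))
```

Seen this way, traffic landing only on Server C means the load balancer considered both A and B unavailable, which is a useful first question to ask after a migration.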

Everything looked right in the load balancer configuration. While delving deeper, we noticed that Server A was configured as the primary node for a few other websites. An application analyst whose app was affected chimed in and said that the configuration was incorrect: Website 1 traffic was to go first to Server A, then B, and Website 2 traffic was supposed to go in the opposite order. All our application documentation agreed with the analyst. Of course, he wrote the documentation, so it had better agree with him 🙂 Here is the disconnect: we track all our changes in a Change Management system, and no one ever put the desired configuration change into the system. As far as our network team is concerned, the load balancer is configured properly. Now, this isn’t really a folly, since our production system and network matched what our change management and CMDB systems were telling us. This is actually GOODNESS. If we ever had to recover from a disaster, we would reference our CMDB and change management systems, so they had better be in agreement with production.
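The comparison at the heart of this story (what is recorded versus what is live) is easy to sketch. The Python below is a minimal illustration under assumptions: the dictionaries stand in for whatever a CMDB export, the application documentation, and the load balancer would really contain, and `find_drift` is a hypothetical helper, not part of any tool named in this post.

```python
def find_drift(recorded, live):
    """Return {site: (recorded_order, live_order)} for every site that differs.

    Both arguments map a site name to its ordered list of servers.
    A site missing from one side shows up with None for that side.
    """
    drift = {}
    for site in sorted(set(recorded) | set(live)):
        if recorded.get(site) != live.get(site):
            drift[site] = (recorded.get(site), live.get(site))
    return drift


# Illustrative data only, modelled on the situation described above.
cmdb = {"website1": ["A", "B"], "website2": ["A", "B"]}
live = {"website1": ["A", "B"], "website2": ["A", "B"]}
app_docs = {"website1": ["A", "B"], "website2": ["B", "A"]}

print(find_drift(cmdb, live))      # CMDB and production agree
print(find_drift(app_docs, live))  # the analyst's documentation does not
```

Running both comparisons makes the disconnect visible: production matches the system of record, and the outlier is the documentation that never went through change management.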

Here’s another example: We did a mail server upgrade about six months ago, and everything worked as far as we could tell. What we didn’t know was that some things were not working, but no one noticed because mail was getting through. When we did notice something wrong (a remote monitoring system) and fixed the cause, it led us to another item, and so on and so on. Now, not everything was broken at the same time. In a few cases, fixing one item actually broke something else. What’s funny is that if we hadn’t corrected the monitoring issue, everything would have kept working. It was a fix that caused all the other problems. In other words, one misconfiguration turned out to be the correct configuration for other misconfigured items. In this case, multiple wrongs did make a right. Go figure.

My manager has a saying for this: “If you are going to miss, miss by enough.”

I’ve also noticed that I sometimes don’t understand concepts when I think I do. As part of our migration to UCS, we are also upgrading from ESX 3.5 to vSphere. Since I am new to vSphere, I did pretty much what every SysAdmin does: click all the buttons and links. One of those is the “Advanced Runtime Info” link in the VMware HA portion of the main Virtual Center screen.

This link brings up info on slot sizes and usage. You would think that the numbers would add up, but clearly they don’t.

How does 268 - 12 = 122? Either I’m obviously math-challenged or I really need to go back and re-read the concept of slots.
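One possible explanation, offered as an assumption rather than something confirmed on this cluster: under the “Host failures cluster tolerates” admission control policy, VMware HA sets slots aside for failover, and the available-slot count excludes them. Reading 268 as total slots, 12 as used, and 122 as available (an interpretation of the numbers above, not labels quoted from the screen), the gap would be 134 reserved slots, exactly half the total. The sketch below only does that arithmetic; the function and parameter names are made up for illustration.

```python
def available_slots(total, used, reserved_for_failover):
    """Slots left for powering on new VMs once the failover reserve is held back."""
    return total - reserved_for_failover - used


def implied_reserve(total, used, available):
    """Work backwards: how many slots must be reserved for the numbers to add up?"""
    return total - used - available


# Assumed reading of the figures: 268 total, 12 used, 122 available.
reserve = implied_reserve(total=268, used=12, available=122)
print(reserve)                            # 134
print(reserve * 2 == 268)                 # True: the reserve is half the cluster
print(available_slots(268, 12, reserve))  # 122
```

A reserve of exactly half the slots would be consistent with, for example, a two-host cluster set to tolerate one host failure, but that is a guess; the cluster size and policy are not stated here.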
