
Posts Tagged ‘disaster’

Practice What You Preach

October 29, 2011

There’s a saying in the medical profession that goes something like, “Doctors make the worst patients”. It’s due to them thinking they know what’s wrong with them, or thinking that nothing is wrong with them at all. It really should say, “Medical professionals make the worst patients”. Case in point: my mother. She’s a retired nurse who is DOWN to a pack of cigarettes per day. She has a cough so bad I swear she’s going to hack up a lung one of these days. She says she’s fine and refuses to seek treatment.

So how does this relate to IT? Well, back in the 90’s I worked for an IT consultancy firm. You wouldn’t believe how bad the internal systems were. You would think that with all the fancy certifications and brain power my local branch had, we would have had a working network and such. Not so. It came down to a simple choice: fix our own infrastructure or be out in the field generating revenue. Revenue won.


The same thing can sometimes happen in one’s own house. How? Let me regale you with a tale of woe.

Sometime around VMworld (I can’t remember if it was before or after), I noticed my house lights flickering. My UPS/surge protector made some funny noises for a few moments and then went back to normal. Things were good, or so I thought.

A few hours later I noticed that the lower level of my house was quite warm even though the A/C was running. I turned off the A/C and called the repair company. The next morning, when the automatic schedule kicked in, the A/C ran fine. The repairman thought that some of my attic insulation had clogged the A/C unit’s drip pan/pipe and that the water level in the drip pan had risen to the point where it triggered the auto shutoff. Simple enough. I have a split system: the compressor is outside, but the air handler is in the attic. What I thought was a functioning A/C system was really just the air handler circulating air.

Over the next week or two I experienced my first blue screen in two years. Then my UPS would randomly start beeping. Nothing like a 1am BEEP! BEEP! BEEP! to scare the crap out of you. Other oddities popped up every now and then until finally, I went to wake my computer from sleep mode and it wouldn’t wake. I did the turn-off/on trick and got no video, no beeps, no nothing. After a lot of manual reading, troubleshooting, and the occasional sacrifice to the gods, I finally determined it was the CPU that had died.

I’ve never had a CPU die. I’ve had them arrive DOA, but I’ve never had one just go bad on me. Thankfully, my Intel CPU carried a 3yr warranty. I played the 20 question game with Intel and got it replaced. Guess what? System still wouldn’t come up. So I took it to a local computer shop and asked them to run diagnostics on everything. They got my system up and running, but in the process they reset the BIOS back to factory defaults. That really sucked.

I run an ASUS motherboard that has built-in RAID. Resetting the BIOS set the drive controller back to standard IDE mode. Since this entire process of troubleshooting, a short vacation, and replacing parts took over 30 days, new Windows patches had been released. I run with “automatic updates” turned on, so it had downloaded a few patches and installed them. Upon reboot, I got the dreaded “No boot device detected” message. It seems the combination of losing the RAID setting and patching screwed up the boot loader. “No problem”, says I, “I have my Win7 DVD so I’ll just boot to it and do a repair”.

DUMB! DUMB! DUMB! Windows warned me that the repair process could take over an hour, so I walked away and let it run. I checked it the next morning and it said it was done. I rebooted to find that I no longer had anything installed on my hard drive except Windows. Everything was gone…iTunes: gone. Other apps: gone. All my data: gone.
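
In hindsight, the narrower fix was probably to flip the drive controller back to RAID mode in the BIOS first, and only then, if the boot loader really was damaged, run a targeted repair from the Win7 DVD’s recovery command prompt instead of a full repair install. Something along these lines (this is the generic bootrec routine, not a transcript of what I actually did):

    rem From the Win7 DVD: choose "Repair your computer", then open the Command Prompt
    bootrec /FixMbr
    bootrec /FixBoot
    bootrec /RebuildBcd
    rem /RebuildBcd scans the disks for Windows installations and rebuilds the boot configuration data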

Sigh.

Sigh, again.


OK, I lost everything. Thankfully, I really didn’t have a lot that I couldn’t replace or rebuild (virtual machines). The largest loss was my photographs. I can recover about 10% of them from the various web sites I’ve shared them on. The rest are lost. My iTunes library consists of rips of about 3000 CDs. I own them all on physical CD, so I can re-rip them. The other major loss was years of personal email.

To prevent this from happening again, I went out and bought another drive and a copy of Ghost. I also turned on the backup feature of my Synology DS211. Yes, I’ve had a backup system at hand for over six months and never used it. I bought the DS211 for its iSCSI and NFS storage capabilities for my home lab. Now I back up to the DS211 every night and run Ghost once a week to the new drive.
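
For anyone curious, the nightly run to the DS211 doesn’t need to be anything fancy. Here’s a minimal sketch of the idea, a mirrored copy to the NAS share plus a scheduled task; the paths, share name, and times below are made up for the example, not my actual setup:

    :: backup.cmd - nightly mirror of my data folders to the Synology share (example paths)
    robocopy "C:\Users\Me\Documents" "\\DS211\backup\Documents" /MIR /FFT /R:2 /W:5 /LOG:C:\Logs\docs.log
    robocopy "C:\Users\Me\Pictures"  "\\DS211\backup\Pictures"  /MIR /FFT /R:2 /W:5 /LOG:C:\Logs\pics.log

    :: run it every night at 2am
    schtasks /Create /SC DAILY /ST 02:00 /TN "NightlyNASBackup" /TR "C:\Scripts\backup.cmd"

Fair warning: /MIR deletes files on the destination that no longer exist on the source, so it’s a mirror, not an archive. The weekly Ghost image is there to cover that gap.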

As an IT Pro, I should have known better. How many times have we expressed to our employers, clients, and whomever else will listen the importance of backups? If we make claims to our customers regarding best practices, shouldn’t we follow them ourselves? Are we “doctors” when it comes to diagnosing our own IT issues?


By the way, I had another A/C failure a week ago and a different technician was sent to fix it. He found that the electrical connection on my A/C compressor had partially melted. Hmm…flickering lights, an A/C outage, UPS issues, a dying CPU…I’m betting that I took a massive power hit and my UPS didn’t do its job of protecting my equipment. Or it did, but it took some damage and eventually passed it on. Maybe the beeping was a hint.

So I bought another UPS. Like the extra drive and Ghost, it’s cheap protection in the grand scheme of things.

I’m also still experiencing random weirdness. I’m going to hazard a guess and say that whatever took out my UPS and CPU may also have damaged either my RAM or my motherboard. Looks like I may be making my way back to the parts store in the next week or two for some replacements.

Sigh.

VMDK UUID Recovery

We had an incident a few weeks ago in which a technician restored a virtual server using vRanger Pro to a different name and datastore for the purpose of copying the restored server to our test lab. No big deal, but one bad thing happened. Unbeknownst to the technician, the vmdk header (descriptor) files (the ones that point to the _flat.vmdk files) and the .vmx file were still pointing to the production system. So when he went to clean up and delete the restored virtual server, Virtual Center dutifully removed the .vmx and .vmdk header files from the REAL server. It wasn’t able to delete the _flat.vmdk files (thankfully) because they were in use. It also turns out we had a few disks marked Independent, Persistent, so they weren’t being backed up.

We recovered what we could but still had to manually recreate the vmdk header files. Since I had no reference file, I took an existing one from a working virtual server, changed the various pointers, and powered the guest up. I got an error regarding the vmdk UUID. Whenever I have a disk problem, one of my “go-to” tools is the VMware-provided vmkfstools utility. A quick man-page review showed that it can create UUIDs. Running “vmkfstools -J setuuid filename.vmdk” solved the problem. The system now boots fine, but since I didn’t know all the details to put in the file, the drive shows up with incorrect information in Virtual Center and backups are still a problem.
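
For reference, a basic ESX vmdk header (descriptor) file isn’t much more than the following. The values here are purely illustrative, a hypothetical 20GB disk on an LSI Logic adapter rather than our real server; the extent line, the geometry, and the UUID are exactly the bits I had to guess at:

    # Disk DescriptorFile
    version=1
    CID=fffffffe
    parentCID=ffffffff
    createType="vmfs"

    # Extent description (sector count and file name must match the real _flat.vmdk)
    RW 41943040 VMFS "servername-flat.vmdk"

    # The Disk Data Base
    #DDB
    ddb.virtualHWVersion = "4"
    ddb.geometry.cylinders = "2610"
    ddb.geometry.heads = "255"
    ddb.geometry.sectors = "63"
    ddb.adapterType = "lsilogic"
    # ddb.uuid gets (re)generated by "vmkfstools -J setuuid"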

A soon-to-be-implemented solution is to create another drive with the correct parameters (via Virtual Center), copy everything from the bad drive to the good drive, do some drive-letter changing, and voilà. Another solution is a process change on our part: no more deletions of this sort via Virtual Center. We’re just going to remove the server from inventory and manually delete the files via the service console.
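
The “manually delete the files via the service console” part of that process change is nothing more exotic than the following (the datastore and file names here are made up for the example):

    # On the ESX service console, after the restored VM has been removed from inventory
    cd /vmfs/volumes/lab-datastore/restored-server/
    ls -lh     # sanity check: make sure these are the restored copies, not production files
    rm restored-server.vmx restored-server.vmdk restored-server-flat.vmdk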

UPDATE: So what do I find in my VMware support RSS feed this morning? An article detailing how to create a new vmdk header file to replace a missing one. (Link VM KB) I like VMware’s solution better because it not only generates a new UUID (in the header file), it also adds the correct disk geometry.
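
If I’m reading the article right, the gist is to let vmkfstools generate a brand-new, correctly sized descriptor and then point it at the orphaned _flat file. Roughly like this (sizes and paths are examples, and trust the KB over my paraphrase):

    # Create a temporary disk matching the size and adapter type of the orphaned _flat.vmdk
    vmkfstools -c 20480m -a lsilogic /vmfs/volumes/datastore1/myvm/temp.vmdk
    # Throw away the new flat file, keep the freshly generated descriptor
    rm /vmfs/volumes/datastore1/myvm/temp-flat.vmdk
    # Rename the descriptor and edit its extent line to reference the original myvm-flat.vmdk
    mv /vmfs/volumes/datastore1/myvm/temp.vmdk /vmfs/volumes/datastore1/myvm/myvm.vmdk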

How did your vendor respond in a disaster situation?

April 14, 2010

Have you ever had a major systems failure that could be classified as a disaster or near-disaster?  How did your vendor(s) of the failed systems respond?  Did they own up to it?  Obfuscate?  Lay blame elsewhere?  Help in the recovery?

Back in the fall of 2007, we had an “event” with our primary storage array.  I remember it as though it occurred yesterday.  I was coming home from vacation and had just disembarked from the airplane when I received a call from one of our engineers.  The engineer was very low-key and said that I might want to check in with our storage guys because of some problem they were having.  “Some problem” turned out to be a complete array failure.

I went home, took a shower, and then went to the office.  First thing I saw was the vendor field engineer standing near the array looking very bored.  A quick conversation ensued in which he told me he was having trouble getting support from his own people.  Uh-oh.

A few minutes later I found our storage folks in an office talking about various next steps.  I was given some background info.  The array had been down for over five hours, no one knew the cause of the failure, no one knew the extent of the failure, and no one had filled in our CIO on the details.  As far as she knew, the vendor was fixing the problem and things were going to be peachy again.

At this point, alarm bells should have been going off in everyone’s head.  I tracked down the vendor engineer and gave him a hard deadline to get the array fixed.  I also started prepping, with the help of our storage team,  for massive recovery efforts.  The deadline came and the vendor was no further along so I woke up my manager and told her to wake up the CIO to declare a disaster.

Along comes daylight and we still haven’t made any progress on fixing the downed array, but we have started tape restoration to a different array. A disaster is declared. Teams are put together to determine the scope of impact, options for recovery, customer communications, etc. We also called in our array sales rep, his support folks, 2nd/3rd level vendor tech support, and more.

So here we all are in a room trying to figure out what happened and what to do next. The 3rd-level vendor support engineer is in another part of the country. He doesn’t know what has already been discussed, so he tells us what happened. Unfortunately, this was not the party line. The vendor would rather blame the problem on something different; something that was supposedly fixed in a firmware update we hadn’t yet applied (and thus the finger-pointing begins). Not a bright idea, since we had the white paper on that particular error and we were nowhere close to hitting the trigger point. Months later this so-called fixed problem was corrected, again, in another firmware release.

To make matters worse, while discussing recovery options one of the vendor’s local managers said…and I quote…”It’s not our problem”. Wow!!! Our primary storage provider had just told us that his product’s failure was not his problem. Yes, we bought mid-range equipment, so we knew we weren’t buying five nines or better. Still, to say that it was our fault and that we should have bought the high-end, seven-figure system was a bit much.

We recovered about 70% of the data to another array within 36 hours and then ran into a bad-tape problem. The remaining 30% took about two weeks to recover. Needless to say, we learned a lot. Our DR processes weren’t up to snuff, our backup processes weren’t up to snuff, and our choice of vendor wasn’t up to snuff. We are in the process of correcting all three deficiencies.

Back to my opening paragraph, how have your vendors treated you in a disaster?
