Friday, February 11, 2011

VMware ESXi back to life

So after my computer (which I use as a VMware ESXi virtualization host) came back from the dead and started booting up, I got the following messages:

Failed to load tpm
Failed to load lvmdriver

TPM apparently stands for Trusted Platform Module, and since my new motherboard coincidentally had a header for a TPM module, for a moment I considered buying one. It turned out not to be necessary, because the TPM is not important for ESXi.

But lvmdriver is; the lvmdriver failure means ESXi could not find a compatible network card, and that stops the boot process completely. ESX and ESXi are targeted at high-end servers, so VMware has not added support for most desktop NICs.

Solutions
There are two solutions to this problem. The easiest, and probably the best, is to get a compatible Intel card. Apparently Intel makes the best cards, and most are fully compatible with ESXi. Mine is on the way.

The long way home
The other solution is to use a third-party driver for your card. I tried this solution first.
My new motherboard came with an onboard Realtek 8111C NIC, so I had to find a custom driver for it and install it into my unbootable ESXi installation without destroying it in the process. Bear in mind that my Unix/Linux skills are not that great.

Google is your friend: with it you can find practically anything on the Internet, and if there is something you don't understand then Wikipedia is your other friend. With those friends I found this forum, http://www.vm-help.com/forum/, where they specialize in this sort of thing. It turns out you can take a Linux driver compatible with your NIC, make some changes to adapt it to ESXi, compile it, copy some files here and there, and you're good to go. Easy.

Since I cannot do that, I used some precompiled and prepackaged solutions posted in the same forum. It basically involved booting up the computer (with Puppy Linux from a pen drive, in my case) and replacing a file called oem.tgz in one of the partitions, roughly as sketched below.
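For reference, the general shape of the procedure from Puppy Linux was something like this (the device name and paths are assumptions; the oem.tgz lives on one of the small FAT partitions of the ESXi disk, so check with fdisk -l first):

fdisk -l                                       # find the small FAT (Hypervisor) partition
mkdir /mnt/esxi
mount -t vfat /dev/sda5 /mnt/esxi              # assumed device; yours may differ
cp /mnt/esxi/oem.tgz /mnt/esxi/oem.tgz.bak     # back up the original first
cp /path/to/downloaded/oem.tgz /mnt/esxi/oem.tgz
umount /mnt/esxi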
The first time I tried it I got a Pink Screen of Death, which was something new to me (unlike with the BSOD, in this one you can still interact with a debugger).
So I just got a different oem.tgz (thank you, geppi) and this time, success!

vSphere Client
At least partial success: ESXi booted up and worked OK, but since there had been a change in hardware the VMs would not start right away. For each VM, ESXi needed me to answer a pending question: did you move or did you copy the VM? Until I answered that question the VMs would not start.
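The vSphere Client is the normal place to answer it but, for the record, it can also be done from the unsupported console with vim-cmd (the IDs below are hypothetical; read the real ones from the output of the first two commands):

vim-cmd vmsvc/getallvms
vim-cmd vmsvc/message 16
vim-cmd vmsvc/message 16 _vmx1 2

The first command lists the VMs with their numeric IDs, the second shows the pending question and its numbered choices for a given VM, and the third answers it with one of those choices.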

It had been a while since I had used the vSphere Client, and by then it was broken. Apparently a .NET update had made the previous version inoperative. I was getting this error message:

Error parsing the server "IP" "clients.xml" file

The solution was to download version 4.0U1 of the vSphere Client. For some reason getting a direct link to this new version is not possible. The way the VMware website is constructed makes it very difficult to find the link; when you seem to be getting closer, it keeps eluding you.

Finally I realized you have to log in with your user account and get a license key to reach the download pages, and there you can find a link to the client. Why a much-needed update is hidden so deep in web bureaucracy is a mystery to me.

Changing the virtual NIC
Before starting the VMs, and following advice found on the vm-help forum, I changed the virtual network card in each of the Windows VMs from type Flexible to VMXNET3. Without this change, Remote Desktop Connection to the VM was completely unusable, and I was also experiencing great instability with NeoRouter (great tool, BTW). The vSphere Client worked well, though.
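The change can be made in the vSphere Client (remove the old adapter and add a VMXNET3 one) or, with the VM powered off, by editing the VM's .vmx file directly. A minimal sketch, assuming the first adapter is the one being changed:

ethernet0.virtualDev = "vmxnet3"

Note that Windows gets the VMXNET3 driver from VMware Tools, so make sure Tools is installed in the guest.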

Unfortunately, even after those changes, RDC was still quite unstable (much better than before, but still not good, especially when browsing the web), which is why I eventually decided to go for solution one (see above).

UPDATE: Afterwards I disabled the UDP & TCP Checksum Offload (IPv4) options in the Advanced properties of the vmxnet3 Ethernet Adapter in the Windows VMs; this seemed to fix all the stability issues.

Upgrading to ESXi 4.1
Since I was doing all this tinkering with ESXi, I decided to also upgrade from 4.0 to 4.1.

- First I downloaded the upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release.zip package.
- Then I downloaded and installed the latest version of vSphere Host Update Utility for ESXi 4.0.
- I clicked on the button marked Upgrade Host and boom! I got this fine message:

Failed to read the upgrade package metadata: Could not find file 'metadata.xml'

Back to Google; it turns out the Update Utility doesn't work for this upgrade.
I had to download the vSphere CLI instead, which is Perl-based and provides a command prompt (on Windows).
According to the upgrade guide, I had to issue the following commands:

vihostupdate --server <hostname or IP> -i -b "location of upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release.zip"

vihostupdate --server <hostname or IP> -i -b "location of upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release.zip" -B ESXi410-GA-esxupdate

Except that doesn't work either, for two reasons. First, the command prompt does not open in the correct directory, so we need to go there first:

cd bin

Second, on Windows the commands need to have .pl appended:

vihostupdate.pl --server <hostname or IP> --username <username> --password <password> -i -b "location of upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release.zip"

vihostupdate.pl --server <hostname or IP> --username <username> --password <password> -i -b "location of upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release.zip" -B ESXi410-GA-esxupdate
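For example, with hypothetical values filled in (host at 192.168.1.50, user root, the zip saved to C:\esxi), the first command would look like this:

vihostupdate.pl --server 192.168.1.50 --username root --password mypassword -i -b "C:\esxi\upgrade-from-ESXi4.0-to-4.1.0-0.0.260247-release.zip"

If I remember correctly, the host also has to be in maintenance mode before the install will proceed. You can check what ended up installed on the host with the --query switch:

vihostupdate.pl --server 192.168.1.50 --username root --password mypassword --query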

After that, reboot the ESXi host and install the new vSphere Client.

Console
If you had the console activated in 4.0, upgrading to 4.1 will deactivate it. In 4.0 it was a hidden feature; in 4.1 it's called Tech Support Mode, and you can activate it for local use and for remote connections through SSH. Fortunately it's very easy to turn on.
Using the vSphere Client go to: Configuration > Security Profile > Properties,
click on Local Tech Support and/or Remote Tech Support,
click on Options, click on Start,
and select Start and stop with host (Start automatically probably works too).
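Once Remote Tech Support is running, you can get a shell on the host from any machine with an SSH client, for example (the IP is hypothetical):

ssh root@192.168.1.50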

Passthrough
One piece of good news with the new motherboard is that it supports VT-d, which the older motherboard (or its BIOS) didn't. So I decided to test it. Sure enough, in Configuration > Advanced Settings, where before I would get a:

 Warning: Host does not support passthrough configuration 

it now offered to configure some devices for VMDirectPath passthrough. I was this close to selecting every USB device for passthrough and clicking OK, but this message gave me pause:

Warning: configuring host hardware without special virtualization features for virtual machine passthrough will make it unavailable for use except dedicating it to a single virtual machine. In particular, configuring a device needed for normal host boot or operation can make normal host boot impossible and may require significant effort to undo. See the online help for more information.

Long message, isn't it? Even after reading it I was still tempted to go ahead, but considering for a moment how difficult normal things are with VMware, I decided to google it first, and sure enough there were horror stories of people who tried just that and ended up having to reinstall everything. I just couldn't afford that; maybe some other day.


Thinning disks

I wanted to convert some thick disks to thin ones, and for that I followed the steps on Kent's blog. This was the key step for me:

vmkfstools -i <source file> -d thin <target file>

It clones a virtual disk using the specified provisioning mode (thin, in this case).
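For example, to clone a thick disk into a thin copy (the paths are hypothetical; run it from the console or via the vSphere CLI, with the VM powered off):

vmkfstools -i /vmfs/volumes/datastore1/WinXP/WinXP.vmdk -d thin /vmfs/volumes/datastore1/WinXP/WinXP-thin.vmdk

Then point the VM at the new .vmdk and, once you've verified it boots, delete the thick original.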

Moving things (disks) around

Something else I wanted to do was to make things more efficient in terms of speed and space, but mostly speed. I have a modest installation with 3 Windows VMs constantly on (though not necessarily constantly in use) and some other Windows and Linux (Ubuntu and Puppy Linux) VMs, mostly for experimental use. The fact is, whenever two of the VMs were doing some work, performance suffered terribly. So I decided to move the disks around to test better configurations. I'll update as I get results.
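As far as I know the free ESXi doesn't do Storage vMotion, so my plan is the same vmkfstools trick as above, just with the target on a different datastore (the names are hypothetical, and the VM must be powered off):

mkdir /vmfs/volumes/datastore2/VM1
vmkfstools -i /vmfs/volumes/datastore1/VM1/VM1.vmdk -d thin /vmfs/volumes/datastore2/VM1/VM1.vmdk

Then edit the VM's settings to point at the moved disk and delete the original.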

In conclusion, things are never easy with VMware, but after much trial and error and a lot of googling you eventually get there.

Thursday, February 10, 2011

First entry: Dealing with a dead computer

Dead Computer
So after a blackout last week I found my 18-month-old HP Pavilion m9650f in a comatose state. The computer was in an infinite cycle of 4 seconds on, 4 seconds off. All fans (power supply, CPU, graphics card), drives, and LEDs seemed to work, but inexplicably the computer would turn itself off after 4 seconds, only to restart 4 seconds later. No video signal. No beeping sounds.

Troubleshooting
So I started the typical procedure: open the computer case, fumble with the graphics card, hard drives, and any and all components; clear the BIOS CMOS memory; replace the depleted CMOS battery (which was supposed to last 7 years). Nothing.

Is it the BIOS?
Only when removing the memory DIMMs did I notice a change. With fewer memory DIMMs the cycle was faster, and with no memory DIMMs I could hear some beeping, which told me that the BIOS initial test was running and was either purposely turning the computer off or was defective.

So I removed a jumper (yellow arrow) on the board marked rom_recovery, which is on an SPI programming header for the BIOS EEPROM memory. Partial success! This stopped the on/off cycle, but nothing else happened.
So I suspected the BIOS was corrupted. I could download the BIOS firmware from the HP website, but programming it into a non-bootable computer would require specialized equipment that I don't have. Researching the subject, I found everything from expensive commercial equipment to DIY projects. The prospect seemed too daunting for me, and I wasn't even sure it was the BIOS.

Or the capacitor?
During all that fumbling I had noticed that a capacitor (red arrow) had a little bulge, but with my limited electronics knowledge I didn't think it was important. Out of frustration I finally googled the subject, and it turns out bulging capacitors have been a well-known cause of motherboard failure for years. My "Truckee" motherboard in particular (I had version 1.01, bad sign!) is well known to have all kinds of issues, although mine had been working flawlessly for 18 months.

Calling support
Replacing a capacitor on a motherboard is completely out of my depth, so I finally decided to call HP support and have it repaired. After verifying that my one-year warranty had expired and that I hadn't purchased the extended warranty (which I never, ever do), the support rep wanted me to buy a one-incident phone-support contract for $50 or a one-year phone-support contract for $100. At this point I was sure nothing could solve this problem over the phone, so I declined.

I asked if they had a service center in my city where I could drop off the computer and pick it up later, but they don't offer that possibility. I was quoted an estimate of between $250 and $350, and they were going to send a box to ship the computer via UPS. So I agreed, but the process of getting this started over the phone took an unusually long time, during which I mostly had to wait on the line and occasionally give some information. In the final steps they transferred me to a supervisor, but by then I was already late for a meeting, so I told him I would call again later. On the second call I had to go over the whole process again with another rep; this time I made sure to ask them to replace the motherboard, not just the capacitor, but then the call was dropped. At this point I realized that if all that was required was to replace the motherboard, maybe I could do it myself.

Replacing the motherboard
I had never done it before, but I had always been tempted by the idea of building my own computer. After all, how hard could it be? By now I had practically disassembled my PC and I already had all the components; the only thing I needed was a new motherboard. After some research I chose the X58M from MSI, which had the closest specs to my old board (I almost went for the ASUS Sabertooth until I realized that my old board was MicroATX format).

Four days later (including a weekend) I got the new board and went to Fry's to get some thermal paste for the CPU and possibly the northbridge heatsink (several reviewers complained the northbridge runs hot, and some just reapplied thermal paste to correct it). I dutifully disconnected and labeled every cable from the old board. I was surprised how easy it was to move the Intel Core i7 920 CPU from the old board to the new one. From a previous unpleasant experience with an old Pentium 4, I was expecting this to be very difficult, with countless pins to align with their holes. It turns out there are no pins, just contact plates that are kept in place with a pressure latch. Nice.

Heatsink snag
Next I had to install the old heatsink/cooler fan assembly on top of the CPU. Here I hit my first snag. Although the holes to mount the heatsink were in the same place, the new motherboard expected a heatsink with hooks that clip into the holes, while the heatsink I had was screwed to the CPU socket assembly. My solution? Take the socket assembly from the old motherboard to the new one. Easy (or so I thought).

After this I screwed the new motherboard into the case and reconnected all the cables (except the front panel). I should have done some tests with a partially assembled setup instead of connecting everything, but ever the optimist (the impatient, really), I said let's go for it.

Shorted motherboard
So I turn on the computer, and I see a blue flash and everything goes dead. No beeps, no fans, no nothing. Uh oh! What could it be? Try again, same thing. Start disconnecting, keep trying, same result. Disconnect everything except the motherboard and CPU, try again, blue flash, dead.

Go to the MSI troubleshooting page. They recommend testing "...the motherboard outside of the case to verify that the motherboard is not shorting to the case". Aha! That sounds like a likely cause; I knew I should have done some testing before. Take out the motherboard, try again, same result. Take out the memory, disconnect the CPU fan, try again, same.

At this point I got to thinking: the only thing that was not "kosher" was the CPU socket assembly I had taken from the old motherboard. So I unscrewed the heatsink, removed the CPU, and removed the socket assembly, and sure enough, the back plate of the socket assembly was bigger and was making contact with the pins of three capacitors.

Going rustic
So my options were: order a new heatsink/cooler, find some compatible screws (unlikely), or buy a drill. It was 1 am, so I went to Walmart and got some safety goggles, a new drill, some drill bits, and a Dremel accessory. Two hours later (those back plates are made of solid metal) I had a rustic-looking back plate with a 1-inch by 1/4-inch section removed. This time I tested thoroughly, and everything was working fine. An hour later I had a fully functioning computer again (or so I thought; that story is in the next post).