ESX Troubleshooting – The PSOD (Purple Screen of Death)…

Unlike the BSOD of Windows fame, there is actually hope with a PSOD on ESX.  As I learned at VMworld 2008, it indicates a specific hardware problem in the majority of cases, and examining the screen dump can point you in the right direction for resolving it.

As I was building my junk server cluster (in a lab, not for production use, so a great way to learn safely), I was swapping NICs to plus-up on Gigabit Ethernet connectivity to the Cisco 6509 I am using.  One of my servers (the big one) was already largely configured in VIM, right down to the NFS mountpoint it was using.  Without thinking it through, I grabbed a couple of gig NICs to install, since the box still had room, removed two unsupported NICs in the process, and slid the cards over into the blank PCI slots to group all the NICs together.  Upon rebooting, it threw up a red log entry proclaiming a pCPU0 warning about something.  Shortly thereafter, the console stopped responding.  Checking further, I saw that the host had a PSOD.  I rebooted, got the same log message on the initial ESX console screen, and another PSOD within minutes.

This time, I dug into the PSOD and noticed that the dump was referencing the network drivers for the cards I had just installed.  Aha!  I realized that the vmnic numbering had changed – and the server was trying to do all kinds of things using the old vmnic PCI references, including mounting the NFS share.  No wonder it vomited!

The solution was to first shut down and pull the new NICs, reboot, and see if the PSOD went away – it did.  Next, I removed the NFS share and updated the vmnic assignments to vswitches to account for any changes.  I rebooted again to make sure all was well.  When that proved to be the case, I shut down and added in the two NICs I wanted to use, rebooted, and everything worked.  I was then able to update my configs with the new vmnics, reboot to make sure there was no PSOD event, and reenable the NFS share.  I rebooted again, one last time, just to tempt fate, but still no PSOD.
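
In case it helps anyone hitting the same thing, the recovery boils down to a handful of service console commands.  This is only a sketch – the vmnic numbers, vswitch name, datastore label, and addresses are placeholders, not my actual config:

    # See how the physical NICs enumerated after the hardware change
    esxcfg-nics -l
    # See which (now stale) vmnic uplinks each vswitch still references
    esxcfg-vswitch -l
    # Swap the stale uplink for the vmnic that took its place (example numbers)
    esxcfg-vswitch -U vmnic3 vSwitch1
    esxcfg-vswitch -L vmnic5 vSwitch1
    # Once the VMkernel networking is sane again, re-add the NFS datastore
    esxcfg-nas -d lab-nfs
    esxcfg-nas -a -o 10.0.30.50 -s /vol/labnfs lab-nfs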

Been stable ever since.

So don’t give up on the PSOD – it’s natural to want to do that with Windows, but this sure ain’t Windows, is it?  You CAN troubleshoot and resolve these cases, even if you have to open a support call.  The dump can help you zero in on the bad memory module, failing CPU, or even the occasional misplaced network card, and help you get your server back up on its feet.

Of course, I would never be this reckless in a production environment – which is why everyone should have a lab to play with.  If you can afford the time, effort, and junk servers, it is a great way to learn in safety.

ESX 4.0…

VMware ESX 4.0 is in limited, restricted beta testing right now (beta 2?).  I found some links discussing some of the features expected, although this is always subject to change.  YMMV.

Possibilities Within ESX…

As I learn more about VMware ESX, I am starting to see the flexibility and possibilities available.  You have five major sets of pieces to play with – vswifs, vmknics, portgroups, vswitches, and vmnics (there is a rough command-line sketch after the list below).

  • You can tag or untag your portgroups, and can assign multiple portgroups to a vmnic.
  • You can have multiple vswifs on multiple vswitches.
  • You can have multiple vmnics assigned to a portgroup.
  • You can have vswitches with no uplinks (no vmnics assigned).
  • You can have portgroups with no uplinks (no vmnics assigned).
  • You can have vswifs assigned to non-service console portgroups for different traffic cases.
  • You can have up to 100 vswifs (0 to 99).
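
To make a few of those bullets concrete, here is roughly what building some of these pieces looks like from the service console.  Treat it as a sketch – every vswitch, portgroup name, VLAN ID, and address below is invented for illustration:

    # A vswitch with no uplinks at all (internal-only traffic)
    esxcfg-vswitch -a vSwitchInternal
    esxcfg-vswitch -A "Internal-Only" vSwitchInternal

    # A tagged portgroup on an uplinked vswitch
    esxcfg-vswitch -A "VLAN105-Servers" vSwitch1
    esxcfg-vswitch -v 105 -p "VLAN105-Servers" vSwitch1
    esxcfg-vswitch -L vmnic2 vSwitch1

    # A second vswif on its own portgroup
    esxcfg-vswitch -A "Mgmt-Backup" vSwitch1
    esxcfg-vswif -a vswif1 -p "Mgmt-Backup" -i 192.168.50.11 -n 255.255.255.0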

Things I have yet to determine on my own (a few listing commands that might help are sketched after this list):

  • How many vmknics can you have?  I assume 100 also – you do not name them like you do with vswifs; you create and assign them to portgroups and they are automatically named and numbered.
  • Can a portgroup span multiple vswitches?  I don’t see why not.
  • Can a vmnic be assigned to multiple vswitches?  I think so…
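
Until I pin down real answers, the plain listing commands at least show how a given host is actually laid out:

    esxcfg-vmknic -l      # current vmknics and the portgroups they sit on
    esxcfg-vswitch -l     # vswitches, their portgroups, VLAN IDs, and uplink vmnics
    esxcfg-vswif -l       # service console interfaces and their portgroups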

I am sure that I will come up with plenty more questions.

Then throw in the firewall configs and appliance VMs (like firewall/IDS/IPS/proxy devices).  I saw demonstrations of an entire DMZ within a physical server, using such appliances spanning multiple vswitches (some with uplinks, some without).  Talk about amazing – I had not even considered thinking in that direction.  Just imagine how you can move all these pieces around to create new network functionality within an ESX host server.  The more complex it gets, though, the more you [A.] need to know the ESX command line, and [B.] need a kickstart script on a floppy to autoconfigure your stroke of genius onto new ESX servers you deploy.  (Because hand-jamming sucks.)
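
For what it is worth, the script I have in mind would just be a batch of the same esxcfg commands – something along these lines, whether it ends up in a kickstart %post section or in a first-boot script that %post drops into place.  Every name, VLAN ID, and address here is a placeholder:

    #!/bin/sh
    # Hypothetical network build for a freshly installed ESX host
    esxcfg-vswitch -a vSwitch1                           # vswitch for VM and VMkernel traffic
    esxcfg-vswitch -L vmnic1 vSwitch1                    # give it an uplink
    esxcfg-vswitch -A "VM-VLAN10" vSwitch1               # tagged VM portgroup
    esxcfg-vswitch -v 10 -p "VM-VLAN10" vSwitch1
    esxcfg-vswitch -A "VMkernel" vSwitch1                # portgroup for the vmknic
    esxcfg-vmknic -a -i 10.0.20.31 -n 255.255.255.0 "VMkernel"
    esxcfg-route 10.0.20.1                               # VMkernel default gateway
    esxcfg-nas -a -o 10.0.30.50 -s /vol/vmstore vmstore  # NFS datastore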

And finally – this is just the ESX side.  VIM comes along and adds in clusters, resource pools, the concept of shares, VMotion, HA, and DRS, just to name a few.  All configurable, and with a new set of caveats, such as:

  • DRS, VMotion, and HA need shared storage (SAN, iSCSI, or NFS) available before they are enabled.
  • DRS needs to be set to Manual when importing VMs from images or machines – deploying from templates does not (I think).
  • DRS and HA are available only for hosts within a cluster (I think).
  • HA, I believe, requires identical network configs on each ESX host in the cluster to work – so if you build your cluster out of dissimilar junk machines like I have (it’s all I have to work with for now), with different NIC quantities, portgroup assignments, and so on, then HA probably won’t work.  At least, it doesn’t for me, and the differing network configs are the first thing I would suspect.  And if you think it through, it sorta makes sense that it won’t work.

When VMware and Cisco come out with the virtual switch concept they discussed at VMworld 2008, this HA limitation should change.  As I understand it, the network configs are essentially shadowed across each clustered host, and the Cisco switch interconnecting them is reconfigured when an HA event happens so that the relocated networking keeps working.  I think this is basically how it is supposed to work.  Too cool, eh?

NFS Fixed…

Didn’t have a default gateway defined for the two problem servers’ VMkernels.  Once I defined one, they mounted the NFS share just fine.  Oddly, mounting from within ESX at the command line had worked all along – presumably because the service console does its own routing, separate from the VMkernel.
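
However you set it, the service console equivalent of the fix is tiny (gateway, server, and share below are placeholders):

    esxcfg-route 10.0.20.1                                # give the VMkernel a default gateway
    esxcfg-route                                          # show it, just to confirm
    esxcfg-nas -a -o 10.0.30.50 -s /vol/vmstore vmstore   # re-add the NFS datastore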

I see beers in my future…..

Need to Fix NFS…

It occurs to me that I had better fix that NFS issue I am having.  Why?  Well, if I have five servers clustered and only three can mount the NFS datastore with VMs on it, is there a chance of DRS moving a VM to a server that cannot talk to the VM’s NFS origin?  I do not think so, but if it can, things would fail.  If it cannot, then my cluster is only as good as three servers, not five.

My strategy:  Mount at the command line on one of the problem ESX servers first to test (see the sketch just below).  If that fails, unmount the same NFS share from one of the working servers and try to remount it from within VIM.  This should tell me quite a lot about what is going on, I hope.  The vmknics on four of the servers (the two that can mount and the two that cannot) are on the same subnet, which differs from the subnet the NFS export is on.  So why can two mount, but the other two not?  They fail instantly, so it is not a timeout.  The firewalls are all off for now, so that is not part of the issue.
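
The command-line test itself is quick – something like this, with made-up addresses and share names:

    vmkping 10.0.30.50                                    # can the VMkernel stack even reach the NFS server?
    esxcfg-route                                          # what default gateway does the VMkernel have?
    esxcfg-vmknic -l                                      # which portgroup and subnet is the vmknic on?
    esxcfg-nas -a -o 10.0.30.50 -s /vol/vmstore vmstore   # attempt the mount
    esxcfg-nas -l                                         # did it stick?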

And of course, dig through the logs on each of the servers – /var/log/messages, /var/log/vmkernel, /var/log/vmksummary, and /var/log/vmkwarning at a minimum.

My task list has otherwise been eradicated in the past week (YES!) – outside of NFS, all that really remains is for me to build a golden master of Windows 2003 Server, and maybe fork some application templates (DHCP, DNS, print, AD, web, SQL, FTP, etc.) off of it.  Cake, right?

More Work, More ESX…

Figured out my issue from yesterday – the Service Console NICs were on the wrong portgroup.  They had the right IPs but were assigned to a portgroup sitting on a different subnet, so they were never talking to their gateway.  Fixed.  I knew I was being a chowderhead.
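
In case anyone else pulls the same stunt, straightening it out from the command line is basically a list, a delete, and a re-add.  Do it from the physical console or KVM, not over the interface you are about to delete; the interface name, portgroup, and addresses below are examples only:

    esxcfg-vswif -l                         # see which portgroup each vswif landed on
    esxcfg-vswif -d vswif1                  # remove the misplaced interface
    esxcfg-vswif -a vswif1 -p "SC-VLAN20" -i 192.168.20.11 -n 255.255.255.0   # re-add it on the right portgroup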

Another thing I learned about ESX and VirtualCenter – importing is cool, but be careful to import machines onto ESX hosts that have at least as many CPUs as the machine you are importing.  Otherwise it will come over but fail to start up, and the logs will declare failure.  Just migrate it to a more suitable ESX host and start it up.

Now I have fixed almost every issue I am having (still can’t get my two newest servers to mount one particular NFS share, even though they can ping the IP – the logs still say, “no route to host”).  I’ll get to it later.  Feeling pretty good right now – why spoil it?

Work…

Today I worked.  No breaks, almost no emails (like, seven maybe), no phone calls, no meetings, no chit-chat watercooler stuff, and I barely had lunch (a sandwich from home).  From 8:30 AM straight through to 6:30 PM, and I am tired.  I got a LOT done.

  • I located two possible rack shelves to use, since I need to install one more shelf in the roll-around rack I am stuffing with hardware.  Neither was a good fit, but I then found the exact type I needed in another rack.  It was supposed to have been removed anyway, so I did that, adjusted the mounting brackets, and searched for missing screws before finally mounting it where I wanted it.
  • I updated the VLAN configs on my Cisco 6509 to account for some new changes I had come up with.  I had to do some minor repatching of my existing ESX servers afterwards.
  • I helped another team diagnose a Layer-2 loop problem (didn’t take long).
  • Next came two legacy 2U servers (old HP DL380 G3s).  I had to pull them apart and remove the three 100BaseT NICs in each of them.  Then I had to scavenge six Intel 1000BaseT NICs from three old IBM 1U servers that are going away (we have a stack of them, so I will probably be making another trip for more NICs later).  After installing the cards, I stacked them on the new shelf (waaayyy up high), connected the KVM and power, and popped an ESX install CD into each.
  • I loaded ESX on each server using my standard configuration, cabled up all the network cards, and rebooted.
  • I imported a 287 GB server image onto a new 81 GB VM in my ESX cluster.
  • I helped yet another team get into their HP blade server chassis switches (didn’t take long).
  • I KVM’d into each new server and hand-configured everything from the command line, making new vswif interfaces, vswitches, portgroups, and vmknics.  This took forever – as soon as I get more time, I am making my own kickstart script to do this stuff for me.  I have three vswitches, four vswif interfaces, and four vmknics.  Two vswitches have four portgroups each, and the other has, uh, <doing math in head> 22 portgroups.  Most are not used, but they are there for uniformity and flexibility.
  • I updated all my documentation and posted the newest updates on the rack doors (front and back).
  • I worked with our LAN admin to set some routes up for some new networks I will be using, and updated the static routes in my Foundry switch tying my networks to his.
  • I spent the rest of my time troubleshooting why these two new servers cannot talk to most networks, but can talk to a couple.  I am so tired from typing in commands, VLAN IDs, etc., I can’t think straight.  I bet I dorked up some VLAN tags on the vswitches or mispatched something.  I checked all the cables meticulously to ensure I had good link on everything, and sure enough, there were loose cables.  I confirmed everything was good at the physical layer with esxcfg-nics -l.  Default routes look good.  Just so much stuff to keep track of… (the checks I keep cycling through are sketched just after this list).
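
For completeness, these are the checks I keep cycling through on the two problem servers – the gateway address shown is an example, and the VLAN IDs from the vswitch listing are what I compare against the 6509 trunk config:

    esxcfg-nics -l        # link state and speed on every vmnic
    esxcfg-vswitch -l     # portgroup-to-VLAN-ID mapping, plus which vmnics uplink each vswitch
    esxcfg-route          # the VMkernel default gateway
    vmkping 10.0.20.1     # can the VMkernel reach its gateway?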

I am still not done with one server, not quite.  But tomorrow I’ll finish it, tackle these problems on both, and maybe grab a few more NICs for later.  Kinda sucks having no help, but oh well.  It’ll still get done.

I also need to make templates within ESX, so I have to start copying ISOs to install from.  That can happen tomorrow too.  Just a fairly typical workday.  Oh, and my boss put me in charge of a portion of a big project I am building all this stuff for.  Bonus.

I need a beer….