Some Tips on VMware ESX…

Well, it has been slow posting recently.  For a while.  OK, a long time now.  But I have been working in a lab, building a virtualization environment using VMware ESX Servers and Virtual Center, and lemme tell ya, there are a LOT of moving parts.  I thought it would be useful to jot down some of the tips I have picked up along the way.  This applies to ESX 3.5.0 update 2 and Virtual Center 2.5.0 update 1.

So here goes.  From memory, so there may be *minor* inaccuracies.

  1. Hardware:  Sure, you want as many CPU cores as you can get (for licensing purposes, VMware counts up to six cores per physical CPU as one).  Sure, you want as much RAM as the machine will hold.  Of course you want terabytes of disk space (well, as much as you can get anyway).  Guess what?  You should also make sure you have plenty of network cards handy.  Whatever space isn’t taken up with Fibre Channel HBAs, iSCSI initiators, etc., throw a NIC in there.  A Gigabit Ethernet NIC, fiber or copper.  10 gig if you can use it.  Just make sure the cards are supported by VMware, or you may be swapping cards a lot learning the hard way…
  2. Network:  Cisco is good – CDP (Cisco Discovery Protocol) and EtherChannel are both great complements to ESX networking.
  3. Storage:  NFS instead of iSCSI/Fibre Channel.  Huh?  Are you nuts?  Seriously, my mind was blown at VMworld 2008 at the sessions covering NFS access to shared storage.  On a NAS.  NetApp appears to be a natural choice, but any NFS server will do in a pinch.  The VMware ESX kernel currently supports version 3 of NFS.  Some apps are better fits for FC/iSCSI SANs.  But most should work just great on a NAS over NFS, and it is *way* cheaper, easier to manage, and more flexible.  There are tradeoffs to everything, of course, so investigate closely.
  4. Which NIC is which?  Two ways to find out – being in the ESX command line is useful now.  Assuming you are plugged into a Cisco, you can use CDP.
    • ESX – Set CDP to both listen and advertise on your virtual switch(es) – the default is listen – with this command: ” esxcfg-vswitch -B both vSwitch0 “.  Replace vSwitch0 with your vswitch name.  Check with the same command, using -b instead of -B.
    • Cisco – Turn on CDP.
    • Cisco – ” show cdp neighbor ” will show you vmnic0, vmnic1, etc. and the Cisco port connecting them.

    Or you can do it all from ESX by plugging in the NIC to the network, and typing in at the ESX command line, ” esxcfg-nics -l “.  Plug in the NICs one at a time and rerun that command each time.  You’ll see.  Be sure to document everything.
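The plug-one-cable-and-rerun routine above is easy to script. Here is a minimal sketch, assuming you save the output of ” esxcfg-nics -l “ to a file before and after plugging in a cable. The function name changed_nics and the /tmp file names are my own inventions for illustration, not anything ESX provides. It only compares lines of text, so it does not depend on the exact column layout.

```shell
#!/bin/sh
# Print the names of NICs whose line in "esxcfg-nics -l" output changed
# between two saved snapshots (for example, link went from Down to Up).
# changed_nics is a hypothetical helper name, not an ESX command.
# Usage: changed_nics BEFORE_FILE AFTER_FILE
changed_nics() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file:
    # keep only the lines of AFTER that do not appear verbatim in BEFORE,
    # then print the first field (the vmnic name).
    grep -Fvxf "$1" "$2" | awk '{ print $1 }'
}

# On the service console the workflow might look like this
# (file names are arbitrary):
#   esxcfg-nics -l > /tmp/nics.before
#   ...plug in exactly one cable and wait for link...
#   esxcfg-nics -l > /tmp/nics.after
#   changed_nics /tmp/nics.before /tmp/nics.after
```

Whatever name it prints is the vmnic you just cabled up. Write it down next to the switch port before moving to the next one.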

  5. Routing:  Can’t ping a service console NIC?  Can’t get to a vmkernel NIC?  Virtual machines not talking to the rest of the network?  Make sure your default routes are set properly with ” esxcfg-route ” (for vmknics), and ” netstat -r ” for your vswif (service console) NICs.  The ” /etc/sysconfig/network ” file also has the service console default route in it for startup – make sure it is correct, change as needed.
  6. VLANs, portgroups, and vmnics:  This is tricky, and something I had to learn on my own.  The ” esxcfg-vswitch ” command lets you create and delete virtual switches, set CDP, add and remove vmnics (the physical network cards ESX detects), and add and remove portgroups (VLANs and their tags, or IDs).  The -L option links the vmnic to the vswitch, on all portgroups.  The -U unlinks.  But then there is also a -M option, which adds a vmnic to a portgroup on the switch, while -N removes vmnics from portgroups.  That is the tricky part – suppose you wanna add a vmnic to one portgroup only (say your switch has three portgroups) – first, you need to be in the command line, because the Virtual Center GUI does not seem to provide this granularity of configuration. If you add it with the -M command to that portgroup, it does not fail, and looks right, but the vmnic does not talk to anything.  You must link it (-L) to the switch first.  THEN add and remove from portgroups using the -M and -N options, one vmnic/portgroup at a time, after which your vmnic will work as you expected.  This is not documented well, and the man page does not clearly explain this, so be aware.
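The link-first ordering in tip 6 can be sketched as a small dry-run script. This is one reading of the sequence described there, under the assumption that -L puts the vmnic on every portgroup and you then strip it from the ones you do not want with -N. The function name link_nic_selectively and the RUN variable are hypothetical, made up for this example. By default it only prints the commands so you can check them before running anything.

```shell
#!/bin/sh
# Dry-run sketch: print the esxcfg-vswitch commands that link a vmnic to a
# vswitch and then remove it from the portgroups that should not use it.
# RUN is a hypothetical switch: leave it as "echo" to preview, or set
# RUN="" to really execute the commands on the service console.
RUN=${RUN-echo}

# Usage: link_nic_selectively VMNIC VSWITCH UNWANTED_PORTGROUP...
link_nic_selectively() {
    nic=$1; vswitch=$2; shift 2
    # Step 1: link the vmnic to the vswitch (this covers all portgroups).
    $RUN esxcfg-vswitch -L "$nic" "$vswitch"
    # Step 2: remove it from each portgroup that should not use it,
    # one vmnic/portgroup pair at a time.
    for pg in "$@"; do
        $RUN esxcfg-vswitch -N "$nic" -p "$pg" "$vswitch"
    done
}
```

For example, ” link_nic_selectively vmnic2 vSwitch1 VMotion NFS “ previews one -L command followed by two -N commands. Note that the echo preview drops the quotes around portgroup names, so names with spaces will look unquoted in the preview even though they are passed correctly when executed. If you strip one portgroup too many, -M puts the vmnic back.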
  7. NFS on ESX:  This is not recommended, so do not do it.  Now that you have chosen to ignore my advice, you will need to recompile the kernel with the _NFS_TCPD option set to y – you need this link.  You will need to modprobe two modules, nfs and nfsd.  You will need to start the portmap and NFS services.  You will need to edit the /etc/exports file (using the no_root_squash option) and export it.  Verify with ” showmount -e ” and ” rpcinfo -p “.  Be really careful – I have done this several times in a closed lab environment, just to learn.  But you can seriously gank up your OS trying this – do not miss a step.  I used the /usr/src/linux-2.4.xx source, and did not need to modify the makefile. One more tip – if you do this, and have separated your traffic properly (service consoles, vmotion, NFS, virtual machine nets all on separate IP networks and VLANs), you will need to add in a service console vswif that other machines can access NFS on to keep the traffic away from the service console networks – so if your NFS network traffic flows on the 10.10.11.x network and your service consoles are on the 10.10.10.x network, add a vswif to the .11 NFS network and point other servers to it.  NFS won’t see a vmkernel NIC – it needs to be something that shows up in ifconfig – a vswif.  This allows you to add it to the NFS portgroup (if you are separating the traffic via portgroups/tagged vlans on a single vswitch at layer 2 instead of using multiple vswitches – I do both).  I have had no problems doing this.  YMMV.  Not for production use – get a NetApp (with the NFS license) or build one (FreeNAS, etc.).
  8. Clusters, resource pools, and virtual machines:  So you made a cluster, added your hosts, and created some resource pools.  Ready to import a VM?  READY, SET, FAIL!  I found that importing a Virtual Server 1.x VM from a local disk copy (trying to make it as simple as possible) failed with typically helpful Virtual Center log entries.  You know, the “unknown error” type.  Off to Google.  Turns out that DRS is getting in the way of the import.  Right click on the cluster, edit the properties, set DRS to manual, and then import it directly to the ESX host in the cluster that you want it to go to.  Should work fine after that (knock, knock).  Then you can set DRS back to what it was, and drag the VM to the resource pool desired.  Some posts say to further remove the ESX host from the cluster, but setting DRS to manual was all I needed to do.
  9. ESX host settings:  When you first set up a host (the ESX server itself) on Virtual Center, make sure to set the time properly – use an NTP server source if you can.  You may also want to increase your service console memory  – I max mine out at 800 MB.  This requires a reboot of the ESX host to take effect.  Also, when making partitions during the ESX install (if you do that kind of thing – I always do it manually), make sure you set the vmkcore partition to be larger than 100 MB.  It needs a minimum of 100 MB, so set a number of 104 MB to be sure, as 100 MB may actually format to less than 100 MB, causing your install to fail.
  10. fdisk, vmkcore, and vmfs3:  If you need this, you are really in the deep water.  So, you are not happy and decided to take it out on your partition table.  Using fdisk.  At the ESX command line.  ALLRIGHTY THEN.  I assume you know exactly what the hell you are doing, cuz if you don’t, you sure will.  The partition type for vmfs3 is fb.  The partition type for vmkcore is fc.  You do not need to (and cannot) format the vmkcore partition, but will need to format the vmfs3 partition after rebooting.  You may very well be booted into a maintenance shell (not safe mode, not even that far).  If you change partitions, you change those UUIDs referenced in ” /etc/fstab “ – I got around this by mounting via /dev/ mountpoints instead of UUID within /etc/fstab.  (Hope you like vi.)  Here is a link for the lost and desperate (foolhardy and just-plain-nuts in my case).
  11. Using service consoles creatively:  This can be done, as I mentioned above with NFS on ESX.  I have a situation where I need to get to a time server on our LAN, but the ESX interfaces I want to use for NTP are on private nets – which our LAN administrator absolutely refuses to route on the LAN (good for him).  So, I added a tagged VLAN and interface on my Cisco for an unused production network, adjusted the routing on the uplink switch, and created that VLAN portgroup on all my ESX hosts’ service console vswitches.  I then added a vswif interface on that IP network to the new (NTP) portgroup, and added the service console vmnics (using the -M option) to the new portgroup.  I also had to set the Cisco up as an NTP server, using the upstream NTP server as a peer, and voila!  Accurate time on all my ESX hosts is now a reality.  May not be recommended (I don’t actually know), but you *can* use vswif interfaces for special purpose traffic needs, and still hold true to best practice guidelines (I try to keep all the backend ESX traffic on non-routing private nets).
  12. Troubleshooting extras:  This may cover a few gotchas…
    • Disable the firewall (see the examples below).  It might be in the way, so get rid of it as a variable.  Don’t forget to turn it back on and configure it later!
    • Before removing a portgroup from a switch at the ESX command line, make sure you have removed all vswif, vmknic, and vmnic interfaces from the portgroup first. It has to be empty before you remove it.
    • Not sure which NIC has a driver for it?  I loaded up lots of scavenged gigabit NICs and couldn’t tell which was loading (I do things the hard way).  Match the PCI IDs from the ” esxcfg-nics -l ” command with the output of ” lspci ” to be sure.
    • Wanna change a portgroup VLAN ID? Just reissue the esxcfg-vswitch command with the new VLAN ID, like this: ” esxcfg-vswitch -p "Current Portgroup With Wrong ID" -v 74 vSwitch2 “. Now the VLAN tag is changed from whatever it was to 74. No need to remove the portgroup and do it all over.  You can also move a vswif to another portgroup in a similar manner, so you do not need to delete it and recreate it (see below for an example).
  13. Document and plan, plan and document:  This all starts with a plan.  The goal is as robust and flexible a virtual environment as you can afford to make.  You do not want to build this on a poor foundation – you do not want to rip everything up later and do it a better way.  PLAN IT OUT.  Mine took over a month of me researching and dry-running, and I am still not sure I got it all right, but it is far more sophisticated and robust than it was when I started with my original design.  Document all phases of your work.  There are a LOT of moving parts here – you just cannot do this without ruthlessly precise documentation.  No time to be lazy or cut corners – your environment did not come cheap, and it may very well become a critical part of your network.  It has to be your best effort.  Use the best practices available from VMware, Cisco, NetApp, etc.
  14. ESX command line examples:  Here are a few of the really useful ones I have gotten comfortable with… See the man pages for more.
    • Add a vswitch – ” esxcfg-vswitch -a vSwitch4 “
    • Add a portgroup – ” esxcfg-vswitch -A "My New Portgroup" vSwitch4 “
    • Add a VLAN tag – ” esxcfg-vswitch -p "My New Portgroup" -v 33 vSwitch4 “
    • Add a NIC to a vswitch – ” esxcfg-vswitch -L vmnic2 vSwitch4 “
    • Add a NIC to a portgroup – ” esxcfg-vswitch -M vmnic2 -p "My New Portgroup" vSwitch4 “
    • Remove a NIC from a portgroup – ” esxcfg-vswitch -N vmnic2 -p "My Old Portgroup" vSwitch4 “
    • Remove a portgroup – ” esxcfg-vswitch -D "My Old Portgroup" vSwitch4 “
    • List your vswitches – ” esxcfg-vswitch -l “
    • List your vswif NICs – ” esxcfg-vswif -l “
    • List your vmkernel NICs – ” esxcfg-vmknic -l “
    • List your physical NICs – ” esxcfg-nics -l “
    • Add a vswif to a portgroup – ” esxcfg-vswif -a vswif7 -i <IP address> -n <netmask> -p "My New Portgroup" “
    • Move a vswif to a different portgroup – ” esxcfg-vswif -p "new portgroup" vswif3 “
    • Add a vmkernel NIC – ” esxcfg-vmknic -a -i <IP address> -n <netmask> "My Other New Portgroup" “
    • Temporarily disable the firewall for troubleshooting purposes – ” esxcfg-firewall --allowIncoming --allowOutgoing “
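
Several of the commands above chain together naturally when you build a new vswitch from scratch. Here is a rough sketch that strings them together in dry-run form. The function name build_vswitch and the RUN variable are hypothetical, made up for this example, and the order (vswitch, portgroup, VLAN tag, uplink) is simply the order of the list above, not an official procedure.

```shell
#!/bin/sh
# Dry-run sketch: emit the commands that build a vswitch with one
# VLAN-tagged portgroup and one uplink, in the order listed above.
# RUN is a hypothetical switch: "echo" previews the commands,
# RUN="" executes them on the service console.
RUN=${RUN-echo}

# Usage: build_vswitch VSWITCH PORTGROUP VLAN_ID VMNIC
build_vswitch() {
    $RUN esxcfg-vswitch -a "$1"                 # create the vswitch
    $RUN esxcfg-vswitch -A "$2" "$1"            # add the portgroup
    $RUN esxcfg-vswitch -p "$2" -v "$3" "$1"    # tag the portgroup with a VLAN ID
    $RUN esxcfg-vswitch -L "$4" "$1"            # link the physical NIC
    $RUN esxcfg-vswitch -l                      # list the result to verify
}
```

Running ” build_vswitch vSwitch4 Lab 33 vmnic2 “ with the default RUN prints the five commands without touching anything, which also gives you a ready-made entry for your documentation.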

Well, this ends a pretty long post.  More to come as I progress through this.

2 Responses

  1. enable vmotion
    vimsh -n -e "/hostsvc/vmotion/vnic_set vmk0"
