This four-part series focuses on problems with Microsoft Hyper-V virtual machine (VM) clusters. The following Hyper-V virtual machine problems and fixes include tips from Microsoft and hardware vendors, as well as personal workarounds that have helped the overall stability of my virtual environment.
Many of these pointers are not exclusive to Hyper-V problems, and they may also apply to VMware and Citrix XenServer. Part one covers hardware, drivers, patches and configurations that may cause virtual environment instability.
All these virtual machine problems have plagued me at one time or another and reduced the reliability of my Hyper-V clustered environment. My goal is to expose these problems so that you may address them before they become an issue.
Firmware updates
Upgrading firmware is crucial for an environment's stability. In a clustered arrangement, this involves more than a BIOS update. A cluster is more complicated than a standalone host because you need to consider the entire data path: one firmware update can affect the BIOS, host bus adapter (HBA), Fibre Channel switches and storage area network (SAN) storage controller. After I moved most of my Hyper-V hosts to blade servers, even more factors could affect the stability of my virtualisation environment, because a blade chassis requires more component firmware updates than a rackmount setup. Because of this, I can rarely update a component's firmware without
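One way to keep the whole data path in view is to record each component's firmware revision per host and flag any component whose revision differs between cluster nodes. The following is only a minimal sketch of that comparison: the host names, component names and version strings are invented for illustration, and how you actually collect the inventory (vendor tools, scripts, a spreadsheet) is left open. The same comparison applies to any other per-host revision you track.

```python
from collections import defaultdict

# Placeholder used when a host does not report a component at all.
MISSING = "<missing>"

# Hypothetical inventory: host -> {component: firmware revision}.
# Every name and version string here is made up for illustration.
INVENTORY = {
    "hv-node-01": {"bios": "I15", "hba": "2.72", "chassis": "3.10"},
    "hv-node-02": {"bios": "I15", "hba": "2.72", "chassis": "3.10"},
    "hv-node-03": {"bios": "I15", "hba": "2.80", "chassis": "3.21"},
}


def find_mismatches(inventory):
    """Return {component: {revision: [hosts]}} for every component
    whose revision is not identical on all hosts."""
    all_components = sorted(
        {name for components in inventory.values() for name in components}
    )
    mismatches = {}
    for component in all_components:
        hosts_by_revision = defaultdict(list)
        for host in sorted(inventory):
            # A host that lacks the component counts as a mismatch too.
            revision = inventory[host].get(component, MISSING)
            hosts_by_revision[revision].append(host)
        if len(hosts_by_revision) > 1:
            mismatches[component] = dict(hosts_by_revision)
    return mismatches


if __name__ == "__main__":
    for component, revisions in find_mismatches(INVENTORY).items():
        print(f"{component}: revisions differ across nodes")
        for revision, hosts in revisions.items():
            print(f"  {revision}: {', '.join(hosts)}")
```

With the invented inventory above, the script reports the `hba` and `chassis` entries, because the third node differs from the other two, and stays silent about `bios`.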
New drivers are released all the time for existing hardware. While I don't upgrade drivers just because they are new, some circumstances require an update. Often when firmware is updated, various drivers must be updated to match the new firmware revision. As with firmware upgrades, driver updates affect numerous interactions on clustered hosts. Remember: Driver consistency across hosts is imperative in a clustered arrangement.
Take, for example, Fibre Channel HBA or iSCSI drivers. Most likely, each connects to the multipath I/O (MPIO) framework. When using EMC PowerPath or the HP MPIO framework, it is important that every cluster node runs a driver version that matches the MPIO level. In some cases, mixing and matching drivers with MPIO levels can cause failover of clustered resources to malfunction. This problem is not limited to HBA drivers; other cluster problems may occur when the network or power management drivers are inconsistent across cluster nodes.
I have experienced these problems when adding new cluster nodes. At the time, the latest MPIO, HBA and network drivers were installed on the new nodes. The mismatch between older and newer nodes resulted in more instability and unpredictability within my clustered virtual environment.
What is my recommendation? Use the same driver level on every clustered host, and make sure that level is compatible with your current firmware. The most recent firmware upgrade is not always the best, and I tend to stick with stable configurations. That said, if there is a reason to install new drivers, try to get the new revision out to every host as soon as possible.
Patches
Server virtualisation is still maturing. Despite vendors' push to bring new offerings to virtualised environments, these new features and capabilities have shortcomings that create problems. Patches are released frequently to fix issues, but they can be hard to find at times. In my Hyper-V clustered environment, there have been only a few instances when I've had lengthy support calls to fix a problem. In most cases, I've found a patch before a problem arose, or an issue was solved after a short call with Microsoft support. Below are three sites I use to find new patches.
Automatic Server Recovery (ASR) is a server reset mechanism that aids in restarting a server "gracefully" when an installed agent senses a problem with the system (e.g., a thermal event or an OS lockup). If you don't use HP hardware, most vendors have a similar feature. My exposure to ASR comes from HP hardware, and numerous false positives have resulted in my host clusters hard-powering down (other administrators have reported the same problems on HP hardware). For this reason, I disable ASR. The technology's reliability has been suspect, and I've lost confidence in the feature because it automatically brings down servers without consideration for the VMs running on the host.
In my experience, the HP ProLiant BL460c virtual hosts have been solid. A memory chip may occasionally go down, and drives may fail intermittently; otherwise, their performance has been good. The accompanying HP software, however, is a different story. I recommend disabling the ASR BIOS setting and the agents that trigger the reboots to improve virtual host cluster reliability.
Ultimately, matching the firmware and drivers, updating patches and disabling ASR reboots will provide a more stable foundation for your virtual clustered hosts. In the remaining three parts of this series, I will address other Hyper-V cluster problems. While some of these issues are product deficiencies, others are administrative errors and oversights. In any case, I will provide a few tips to avoid these problems and VM downtime. Until then, send along any experience or issues you might have seen with your clustered virtual hosts.
About the expert
Rob McShinsky is a senior systems engineer at Dartmouth Hitchcock Medical Center in Lebanon, N.H., and has more than 12 years of experience in the industry, including a focus on server virtualisation since 2004. He has been closely involved with Microsoft as an early adopter of Hyper-V and System Center Virtual Machine Manager 2008, as well as a customer reference. In addition, he blogs at VirtuallyAware.com, writing tips and documenting experiences with various virtualisation products.
This was first published in December 2009