XenMotion with HP ProCurve Switches
13 Mar
What good is a high availability solution without testing?
Not worth waiting three minutes for, as we found out this morning in a test for a client’s solution implementation.
How Does It Work?
Our design consists of two HP ProLiant servers running Citrix XenServer 5, a Compellent SAN, a bunch of Windows hosts (both clustered and non-clustered) and a couple of HP ProCurve switches connecting everything together with redundant paths.
XenServer 5 (enterprise and higher editions) has the capability to move active virtual machines between hosts. Live. On-the-fly. No downtime. No interruptions. As with any solution, this capability is dependent on a proper foundational architecture. This includes shared storage (SAN) accessible by all Xen hosts in the resource pool and an enterprise-class Ethernet infrastructure. One popular line of networking products for the SME segment is, of course, HP ProCurve. Cisco is always a good choice as well.
XenMotion VM Migration Normally Quick — Not Today
Xen’s Marathon Technologies-based VM high availability feature restarted the protected VMs quickly after our simulated host failure, without any manual remediation. When the original host was restored, I moved the protected VMs back to their home server. After the move, the VMs had no network connectivity to any IP outside the host on which they now resided, and that lasted three whole minutes. That is neither acceptable nor typical, given our own experience with XenMotion VM migration. We’ve moved VMs running Citrix XenApp with 20+ active clients between hosts, and it’s normally so seamless that users don’t notice.

I immediately suspected MAC problems, since the migrated VMs could still reach the other workloads on the same host and they magically opened up to the rest of the world after three minutes. Obviously, it wasn’t an ARP issue, since the virtual MACs assigned to a VM’s network adapters are permanent, even between hosts. So it appeared to be a MAC table issue on the switch. This was puzzling to me, as I’d never encountered such a problem before.
A Lesson on HP ProCurve Switch MAC Tables and MAC Age
A switch’s MAC table is built from packets leaving a node and entering the switch. The switch grabs the source MAC address from a packet entering a port and adds it to the table, along with the port it came in on. It then uses that table to direct other inbound packets with that destination MAC to the proper port, avoiding the need to broadcast the packet to every port. Since MAC tables are built passively from inbound packets (as opposed to ARP caches on a node, which are built actively), they tend to converge very, very quickly – especially since switches update the table with every passing packet. All switches I’ve encountered exhibit that same update behavior – all of them except this HP ProCurve 2510G.

MAC Table - Courtesy cisconinja.wordpress.com
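The learning behavior described above can be sketched in a few lines of Python. This is a toy model only, not any vendor's implementation: the class name `LearningSwitch`, the port numbers, and the MAC addresses are all made up for illustration.

```python
class LearningSwitch:
    """Toy model of a switch MAC table (illustrative, not vendor code)."""

    def __init__(self):
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, src_mac, dst_mac, in_port):
        """Learn the source MAC, then decide where the frame should go."""
        # Learning step: record (or overwrite) the port for this source MAC.
        self.mac_table[src_mac] = in_port
        # Forwarding step: use the known port, or flood if the MAC is unknown.
        return self.mac_table.get(dst_mac, "flood")


# Hypothetical addresses and ports, chosen only for this example.
VM_MAC = "00:16:3e:00:00:01"
CLIENT_MAC = "00:16:3e:00:00:02"

switch = LearningSwitch()
first = switch.receive(CLIENT_MAC, VM_MAC, in_port=3)   # VM not learned yet
switch.receive(VM_MAC, CLIENT_MAC, in_port=1)           # VM talks from host A (port 1)
before = switch.receive(CLIENT_MAC, VM_MAC, in_port=3)  # client traffic goes to port 1
switch.receive(VM_MAC, CLIENT_MAC, in_port=2)           # after migration: host B (port 2)
after = switch.receive(CLIENT_MAC, VM_MAC, in_port=3)   # table follows the VM
print(first, before, after)
```

In this model the table follows the VM as soon as it sends a single frame from its new port, which is why a live migration is normally invisible to clients.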
The Solution
I confirmed my suspicions by lowering the age limit on entries in the MAC table: the length of the VM’s connectivity outage tracked the table’s aging time. Then, I came upon a fellow who had the exact same problem with the exact same switch. It turns out that the switch was not accepting updates for entries already in the MAC table. The entry had to expire before the MAC could be added back to the table with its new port.
His story can be found on the Citrix forums here. HP has corrected the bug as of version Y.11.08 as indicated in their release notes.
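For the curious, here is a rough simulation of that failure mode. It is a sketch under stated assumptions, not a reconstruction of HP's firmware: the `AgingSwitch` class, the `accept_moves` flag, the one-frame-per-second traffic pattern, and the age limits are all hypothetical, and the model assumes that frames from the already-known port still refresh an entry.

```python
OLD_PORT, NEW_PORT = 1, 2        # hypothetical ports for host A and host B
VM_MAC = "00:16:3e:00:00:01"     # made-up address for the example


class AgingSwitch:
    """Toy MAC table with aging. accept_moves=False models the reported bug."""

    def __init__(self, age_limit, accept_moves=True):
        self.age_limit = age_limit
        self.accept_moves = accept_moves
        self.table = {}  # MAC -> (port, time the entry was last refreshed)

    def _expire(self, now):
        # Drop entries that have not been refreshed within age_limit seconds.
        self.table = {m: (p, t) for m, (p, t) in self.table.items()
                      if now - t < self.age_limit}

    def learn(self, mac, port, now):
        self._expire(now)
        entry = self.table.get(mac)
        if entry is not None and entry[0] != port and not self.accept_moves:
            return  # modeled bug: a known MAC seen on a new port is ignored
        self.table[mac] = (port, now)

    def lookup(self, mac, now):
        self._expire(now)
        entry = self.table.get(mac)
        return entry[0] if entry else None


def seconds_until_reachable(switch, migrate_at=10, horizon=1000):
    """VM sends one frame per second; it moves to NEW_PORT after migrate_at."""
    for now in range(horizon):
        port = OLD_PORT if now <= migrate_at else NEW_PORT
        switch.learn(VM_MAC, port, now)
        if now > migrate_at and switch.lookup(VM_MAC, now) == NEW_PORT:
            return now - migrate_at
    return None


healthy = seconds_until_reachable(AgingSwitch(age_limit=180))
buggy_180 = seconds_until_reachable(AgingSwitch(age_limit=180, accept_moves=False))
buggy_30 = seconds_until_reachable(AgingSwitch(age_limit=30, accept_moves=False))
print(healthy, buggy_180, buggy_30)
```

In this model the well-behaved switch recovers on the VM's first frame from the new port, while the buggy one stays stale until the old entry ages out. Lowering the age limit shortens the outage to match, which is the same correlation I saw on the real switch.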
Until next time, happy VM migrations!