We finally found the problem that has been plaguing us for years. The setup is a three-node Windows Server 2008 R2 Hyper-V cluster. Live migrating a virtual machine from one node to another fails, and trying to log on to the affected host fails with a message that there are no logon servers available. The first two times this happened we ended up rebooting the host, taking an outage, and never figuring out why. This third time we dug a bit deeper before rebooting and found that the host had run out of TCP ports (port exhaustion).
Here is how we found the problem: http://www.virtualizationhowto.com/2014/10/viewing-killing-tcp-ip-connections-windows/
There were thousands of ports in the TIME_WAIT state. After digging some more we found the hotfix: https://support.microsoft.com/en-us/kb/2553549
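If you want a quick tally rather than scrolling through the raw list, something like the following might help. It is only a rough sketch, not what we actually ran: it assumes the Windows `netstat -an` column layout (Proto, Local Address, Foreign Address, State), the function name `count_tcp_states` is made up for illustration, and the sample addresses are invented.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state from `netstat -an` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed Windows layout for TCP rows: Proto, Local, Foreign, State.
        # UDP rows have no state column and headers do not start with "TCP".
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            states[fields[3]] += 1
    return states


# Invented sample output, only to show the expected input shape.
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
  TCP    10.0.0.5:49152         10.0.0.9:445           TIME_WAIT
  TCP    10.0.0.5:49153         10.0.0.9:445           TIME_WAIT
  TCP    10.0.0.5:49154         10.0.0.10:3260         ESTABLISHED
  UDP    0.0.0.0:500            *:*
"""

if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(state, count)
```

On a live host you could feed it real data with `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout` in place of `SAMPLE`. A TIME_WAIT count in the thousands would match what we saw.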
We did find out that quick migration still worked, and each virtual machine only took about a minute to move, so that limited our downtime significantly.
We have had some strange, seemingly random things happening in our environment lately, and have had a hard time tracking down what was going on. We run a Hyper-V environment consisting of 3 host servers in a failover cluster on Server 2008 R2 with an iSCSI SAN. The other day I noticed we were running out of room on our PRTG server (virtualized). When I initially created that server and set up PRTG I didn’t put a whole lot of thought into the configuration and just installed everything using defaults. The default setting puts all the data on the C drive. To fix this I created a new volume on the SAN, added two new network adapters to the PRTG server, configured everything and moved the data to the new volume. The next day we noticed a bunch of errors with iSCSI connections dropping on a different server.