Virtualizing Exchange 2013 – the right way
by Tony Redmond

How to migrate virtual Exchange Servers with no risk

Migration

Exchange 2013 supports virtual machine migration with some limitations. For example, you can use Hyper-V’s Live Migration or VMware’s vMotion to move virtual Exchange servers between hosts, but you cannot use Hyper-V’s Quick Migration facility. The essential requirement is that a virtual machine running Exchange must remain online during the migration.

Performing a point-in-time save-to-disk and move is unsupported. The reason is simple: to maintain the best possible performance, Exchange manipulates a lot of data in memory. If the Exchange server is a member of a DAG, that memory includes a view of the current state of the underlying Windows Failover Cluster.

Save-to-disk and move might bring an Exchange server back online in a state where the in-memory data causes inconsistency for the moved Exchange server or for another server within the organization. For example, when you bring a DAG member back online, that server might believe that it is a fully functioning member of the Windows Failover Cluster and therefore will attempt to function as such.

But during the time that the migration was happening, the other members of the DAG might have discovered that the server had gone offline and will therefore have adjusted cluster membership by removing the offline server. The result is a synchronization clash where one server has a certain view of the cluster that is not shared by the other members. Restoring the DAG and cluster to full operational health will require manual administrator intervention.
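The clash can be pictured with a toy model. The sketch below is only an illustration of the idea of a stale in-memory view; the server names (EX1, EX2, EX3) and the helper function are hypothetical, and it makes no claim about how Windows Failover Clustering stores membership internally.

```python
# Toy model of the membership clash described above. All names are
# illustrative; this is not how Windows Failover Clustering is implemented.

def save_to_disk(view):
    """Freeze a node's in-memory membership view (a point-in-time copy)."""
    return frozenset(view)

# Three DAG members that agree on the same cluster membership.
cluster_view = {"EX1", "EX2", "EX3"}

# EX3 is saved to disk; its memory still says all three nodes are members.
saved_view_ex3 = save_to_disk(cluster_view)

# While EX3 is offline, the surviving members notice and remove it.
cluster_view.discard("EX3")

# EX3 is restored on another host with its old in-memory view.
restored_view_ex3 = set(saved_view_ex3)

# The restored node believes it is a member; the cluster disagrees.
ex3_thinks_member = "EX3" in restored_view_ex3
cluster_thinks_member = "EX3" in cluster_view
views_clash = ex3_thinks_member and not cluster_thinks_member
print(views_clash)
```

The two views can only be reconciled by someone deciding which one is right, which is why the text says manual intervention is needed.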

Keeping the virtual machine online during Exchange migrations avoids the issue because the other Exchange servers never register that the server has failed, so they have no need to take action (such as activating database copies on other servers within a DAG or initiating the replay of in-transit messages from Safety Net).

The biggest issue to face with Exchange migration

The biggest issue that you are likely to face with Exchange migration is ensuring that DAG member nodes continue to communicate during the move. Failure to achieve this will cause the cluster heartbeat to time out, and the node being moved will be evicted from the Windows Failover Cluster that underpins the DAG. When a migration happens, a point-in-time copy of the virtual machine’s memory is transferred from the source host to the target host.

At the same time, pages that change are tracked, and these pages are also copied to the target as the Exchange server migration progresses. Eventually only a small set of changed pages remains and the “brownout period” occurs, during which the virtual machine is unavailable because it is being transferred from the source host to the target.
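To get a feel for how memory size, workload, and bandwidth interact, the copy process can be approximated in a few lines. This is a rough sketch under stated assumptions, not a model of any particular hypervisor: the function name, the 64 MB stop threshold, the round limit, and the sample figures are all invented for illustration, and real products use their own stopping rules.

```python
def simulate_precopy(memory_mb, dirty_mb_per_s, bandwidth_mb_per_s,
                     stop_threshold_mb=64.0, max_rounds=30):
    """Return (rounds, brownout_seconds) for a simplified pre-copy model.

    Each round copies the outstanding pages while the VM keeps running;
    pages changed during that round become the next round's work. When the
    outstanding set is small enough (or the rounds run out), the VM is
    paused and the remainder is copied. That pause is the brownout.
    """
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    remaining = float(memory_mb)
    rounds = 0
    while remaining > stop_threshold_mb and rounds < max_rounds:
        copy_seconds = remaining / bandwidth_mb_per_s
        # Pages changed while this round was in flight must be sent again,
        # but never more than the whole of the VM's memory.
        remaining = min(float(memory_mb), dirty_mb_per_s * copy_seconds)
        rounds += 1
    brownout_seconds = remaining / bandwidth_mb_per_s
    return rounds, brownout_seconds


# Hypothetical 16 GB server changing 50 MB/s over a 1000 MB/s link.
rounds, brownout = simulate_precopy(16384, 50, 1000)
print(rounds, round(brownout, 3))
```

In this model the brownout stays short as long as the link is much faster than the rate at which Exchange changes memory. When pages change faster than they can be copied, the loop never converges and the final pause grows toward the time needed to copy all of the memory.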

If the brownout period is less than the cluster heartbeat timeout (typically five seconds), the Exchange server can continue working from the point at which the brownout started and normal operations will continue. But if the brownout lasts longer than the cluster timeout, Windows Failover Clustering will consider that the node has gone offline and will evict the node from the cluster.
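The rule above amounts to a single comparison. A minimal sketch follows, in which the function name and the sample measurements are hypothetical; a brownout exactly equal to the timeout is treated as an eviction, which is a cautious assumption rather than something the text specifies.

```python
DEFAULT_HEARTBEAT_TIMEOUT_S = 5.0  # "typically five seconds" in the text


def migration_outcome(brownout_s, timeout_s=DEFAULT_HEARTBEAT_TIMEOUT_S):
    """Classify a migration by comparing its brownout with the timeout."""
    if brownout_s < 0 or timeout_s <= 0:
        raise ValueError("durations must be non-negative and timeout positive")
    if brownout_s < timeout_s:
        return "resumed"   # node carries on as a cluster member
    return "evicted"       # cluster assumes the node has gone offline


# Hypothetical brownout measurements (seconds) from test migrations.
measured = [0.4, 2.5, 4.9, 6.2]
outcomes = {b: migration_outcome(b) for b in measured}
print(outcomes)
```

Running test migrations and feeding the measured brownouts through a check like this shows how much headroom exists before a node is at risk of eviction.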

In turn, this will cause the Active Manager process running within the DAG to initiate a server failover for the now-evicted node and will activate its databases on other DAG members. In effect, the migration failed because service was not maintained and normal operations did not continue when the virtual machine moved to the new host.

The now-moved server will eventually come back online and rejoin the cluster, but a separate, manual administrative intervention will be necessary to reactivate the database copies on the server to rebalance workload across the DAG.

Steps to mitigate the problem

Two steps can be used to mitigate the problem. The first is to ensure that sufficient network bandwidth is available to transfer virtual machines without running the risk that the brownout period exceeds the cluster heartbeat timeout.

The exact amount of bandwidth required depends on the size of the virtual machine, the workload that it is under at the time, and the version of the hypervisor that is used, so some testing will be necessary to establish exactly how quickly virtual machines can be moved.

The second step is to adjust the cluster heartbeat timeout to reflect the expected brownout period. Adjusting the cluster heartbeat timeout is not usually recommended, but it can be an effective solution to the problem. If you do decide to adjust the timeout, the highest value recommended by the Exchange development group is ten seconds.
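As a worked illustration of staying inside the ten-second ceiling, the sketch below assumes that the timeout is the product of a heartbeat delay and a missed-heartbeat threshold, which is how the Windows cluster properties SameSubnetDelay (milliseconds) and SameSubnetThreshold are commonly described, with defaults of 1000 and 5. Those property names and defaults do not come from the text above, and the function name and the one-second safety margin are hypothetical choices.

```python
import math

MAX_RECOMMENDED_TIMEOUT_S = 10.0  # ceiling quoted in the text above


def suggest_threshold(expected_brownout_s, delay_ms=1000,
                      default_threshold=5, safety_margin_s=1.0):
    """Smallest missed-heartbeat threshold whose timeout covers the brownout.

    Returns None when no value within the ten-second ceiling is enough;
    in that case more bandwidth, not a longer timeout, is the remedy.
    """
    needed_s = expected_brownout_s + safety_margin_s
    threshold = math.ceil(needed_s * 1000 / delay_ms)
    # Never suggest less than the assumed default threshold.
    threshold = max(threshold, default_threshold)
    if threshold * delay_ms / 1000 > MAX_RECOMMENDED_TIMEOUT_S:
        return None
    return threshold


# A hypothetical six-second brownout needs a threshold of 7 (7 s timeout).
print(suggest_threshold(6.0))
# A hypothetical 9.5-second brownout cannot be covered within ten seconds.
print(suggest_threshold(9.5))
```

Any value produced this way is only a starting point for testing; the article linked below is the place to check how the cluster settings are actually changed.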

See http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx for more information about how to tune the heartbeat interval for Windows Failover clusters.

About the author
Tony Redmond is the owner of Tony Redmond & Associates, an Irish consulting company focused on Microsoft technologies. With experience at Vice-President level at HP and Compaq plus recognition as a Microsoft MVP, Tony is considered by many around the world an expert in Microsoft Collaboration Technology. Tony has authored 13 books, filed a patent and more. He is a senior contributing editor to WindowsITPro.com where he writes the “Exchange Unwashed” blog.