Wednesday, August 20, 2008

MSDTC: Cluster Failure

We experienced a problem with one node of our Windows Server 2003 cluster, which serves a dual purpose: file sharing and Active Directory services. In fact, one of the nodes of the cluster holds our FSMO roles. The cluster is normally set up so that the PDC is the passive node. Unfortunately (and to this day, I'm not exactly sure why), the primary node (let's call it DC1) would no longer start the MSDTC service, which caused the cluster to fail over to the passive node (DC2, the PDC). This caused quite a problem, for two reasons:
  1. No redundancy for the clustered resources, as one of the nodes was unusable.
  2. The active node has several IP address resources that, when hosted on the PDC, cause numerous errors because the PDC is now multihomed.
It took quite some time for me to figure out what the problem was. There were many different (seemingly unrelated) errors in the System Log, Application Log, and the DFS Replication Log. I tried to stay focused on the original problem at hand: the MSDTC resource would not operate on DC1. While developing my course of action, I found that something wasn't configured according to the Microsoft documentation on setting up the MSDTC resource: "How to configure Microsoft Distributed Transaction Coordinator on a Windows Server 2003 Cluster" - http://support.microsoft.com/kb/301600. The article is adamant that the Network DTC service must be installed and configured prior to starting the MSDTC resource.

Accomplishing this requires installing a Windows component through Add/Remove Programs and then configuring it. The instructions for "How to enable network DTC access in Windows Server 2003" are located here: http://support.microsoft.com/kb/817064/. Since completing all of these steps, the MSDTC service has been running smoothly. I have been able to fail it over several times, and there have been no related errors for the Clussvc.
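To confirm that the service really is running on a node after a failover test, a check along the following lines could be scripted. This is only a rough sketch, not something I ran as part of the fix: it assumes Python 3.7 or later is available on an admin machine, that the service is registered under the name "msdtc", and that the standard Windows `sc query` command is reachable. The function names are made up for illustration.

```python
import re
import subprocess

# Matches the "STATE : <code> <NAME>" line printed by `sc query`.
_STATE_LINE = re.compile(r"^\s*STATE\s*:\s*(\d+)\s+(\w+)", re.MULTILINE)


def parse_service_state(sc_output):
    """Return the state name (for example 'RUNNING') or None if not found."""
    match = _STATE_LINE.search(sc_output)
    return match.group(2) if match else None


def msdtc_state(server=None, service_name="msdtc"):
    """Run `sc query` (optionally against a remote node) and parse the state.

    `server` is a plain host name such as "DC1"; the service name "msdtc"
    is an assumption about how the service is registered.
    """
    command = ["sc"]
    if server:
        command.append("\\\\" + server)  # sc expects the form \\ServerName
    command += ["query", service_name]
    result = subprocess.run(command, capture_output=True, text=True)
    return parse_service_state(result.stdout)
```

Calling `msdtc_state("DC1")` and `msdtc_state("DC2")` after each failover would show on which node the service is reported as running. The parsing is kept separate from the command call so it can be checked against saved output without touching a live node.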

A number of errors were logged while this problem was occurring. An explanation of what I found is below:
  • System Log
    • 1137: Clussvc - Event Logger - the event log was filling up with events due to a larger, as yet unidentified problem
    • 5775: Netlogon - due to multihomed PDC
  • DFS Replication Log
    • 1202: DFSR - failed to contact DC to access configuration information. Replication is stopped.
Currently, I am still receiving 1058 and 1030 errors in the Application Log, but I should be able to get this worked out now that the cluster is stable.
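Since the errors were spread across several logs, it can help to separate the events that are already explained from the ones that still need investigation. The snippet below is a minimal sketch of that idea; the table only restates the causes noted above, and the names `KNOWN_EVENTS` and `triage` are hypothetical.

```python
# (log name, event ID) -> (source, cause as noted in this post)
KNOWN_EVENTS = {
    ("System", 1137): ("Clussvc", "event log filling up due to a larger problem"),
    ("System", 5775): ("Netlogon", "multihomed PDC"),
    ("DFS Replication", 1202): ("DFSR", "could not contact a DC; replication stopped"),
}


def triage(events):
    """Split (log, event_id) pairs into explained and unexplained lists."""
    explained, unexplained = [], []
    for event in events:
        if event in KNOWN_EVENTS:
            source, cause = KNOWN_EVENTS[event]
            explained.append((event, source, cause))
        else:
            unexplained.append(event)
    return explained, unexplained


if __name__ == "__main__":
    seen = [("System", 5775), ("Application", 1058), ("Application", 1030)]
    explained, unexplained = triage(seen)
    print("Explained:", explained)
    print("Still to investigate:", unexplained)
```

With the events from this post, the 1058 and 1030 entries from the Application Log end up in the "still to investigate" list, which matches where things stand now.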
