Introduction and Database Activation Coordination (DAC) Support
Exchange 2010 introduced a vast number of changes to the High Availability model with the addition of the Database Availability Group (DAG). Some features of the DAG include support for up to 16 members, automatic database failover (*over) to another site as long as you still have quorum, and much more. Exchange 2010 also introduced Database Activation Coordination (DAC) mode as an optional addition to the new High Availability model. Its purpose is to prevent split brain syndrome from occurring during a site switchover when utilizing a multi-site DAG configuration with at least 3 DAG members and more than one Active Directory Site. DAC is off by default, and in Exchange 2010 RTM it should not be enabled for:
- 2 member DAGs
- Non-Multisite DAGs
- Multi-site DAGs that are in the same stretched Active Directory Site
In Exchange 2010 SP1, the following changes are introduced and supported for DAC:
- DAGs that contain 2 or more members
- DAGs that are stretched across a single AD Site
Majority Node Set
Before we understand how DAC works, we really have to understand the cluster model that DAGs utilize. Both Exchange 2007 and Exchange 2010 clusters use Majority Node Set (MNS) clustering. This means that a majority of your votes (server votes and/or 1 file share witness) need to be up and running. The formula for this is (n / 2) + 1, where n is the number of DAG nodes within the DAG and the division is rounded down. With DAGs, if you have an odd number of DAG nodes in the same DAG (cluster), you already have an odd number of votes, so you don't use a witness. If you have an even number of DAG nodes, you will have a file share witness, so that if half of your nodes go down, the witness acts as that extra +1 vote.
So let's go through an example. Let's say we have 3 servers. We need (3 / 2) + 1 votes, which equals 2, since you round down (you can't have half a server or witness). This means that at any given time we need 2 of our nodes to be online, so we can sustain only 1 server failure in our DAG. Now let's say we have 4 servers, plus a file share witness for 5 votes total. We need (4 / 2) + 1 votes, which equals 3. This means at any given time we need 3 of our servers/witness to be online, so we can sustain 2 server failures, or 1 server failure and 1 witness failure.
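To make the vote math concrete, here is a minimal sketch in Python (purely illustrative and not part of any Exchange tooling; the function name is made up) that applies the (n / 2) + 1 rule from above and reports how many failures a DAG can survive:

```python
# Illustrative quorum math for Node Majority (with or without a File Share Witness).
# The DAG needs floor(n / 2) + 1 votes online, where n is the number of DAG members.

def quorum_math(dag_members: int) -> dict:
    """Return total votes, votes required online, and failures tolerated."""
    witness = 1 if dag_members % 2 == 0 else 0      # FSW vote only used with an even member count
    total_votes = dag_members + witness
    required = dag_members // 2 + 1                 # (n / 2) + 1, rounding the division down
    return {
        "total_votes": total_votes,
        "votes_required_online": required,
        "failures_tolerated": total_votes - required,
    }

print(quorum_math(3))   # 3 votes, 2 required online, 1 failure tolerated
print(quorum_math(4))   # 5 votes (4 servers + FSW), 3 required online, 2 failures tolerated
```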
Note: Exchange 2010 DAGs do not use the term Majority Node Set anymore. That term is deprecated; the quorum model is now called Node Majority or Node Majority with File Share Witness.
Database Activation Coordination (DAC)
In short, DAC mode is enabled when you have at least 3 members to prevent split brain syndrome. It’s as simple as that. Let’s take a look at an example and see how DAC can help. The longer explanation below talks about this specific model.
Prevention of Split Brain Syndrome
Short Explanation
When the Primary Site goes offline (or we lose too many servers; refer to Majority Node Set above), the Secondary Site will need to be manually activated, should you decide that a secondary site activation is required based on the magnitude of the failure and how long you anticipate the primary site or its servers will be down. But when the Primary Site comes back online, the WAN link may still be offline. Because the Primary Site's Exchange Servers don't necessarily know about the manual site switchover, they will come up thinking they have quorum, since the Primary Site has the majority of the servers and they are still connected to the old FSW. Because of this, they will begin to mount databases; as far as they can tell, they still have quorum.
DAC mode enables the use of a new protocol, the Database Activation Coordination Protocol (DACP). With DACP, DAG members start up with a special memory bit set to 0. Before they can mount databases, they need to contact another DAG member whose bit is set to 1. That bit will be set to 1 on one of the DAG members in the Secondary Site, since that site is hosting the active databases. Because the WAN link is down, the Primary Site's DAG members that just came online can't contact the DAG member with the bit set to 1, so they won't be able to mount databases. Once the WAN link comes back online, the Primary Site's DAG members will be able to contact the DAG member whose bit is set to 1, which puts them back in a state where they are allowed to mount databases.
Longer Explanation
In this example, there are 5 DAG nodes and no FSW, since we have an odd number of DAG nodes. Our entire Primary Datacenter fails (or we lose too many servers; in our case, (5 / 2) + 1 means 3 of our nodes need to remain operational for the DAG to stay up). The Secondary Site will need to be manually activated, should you decide that a secondary site activation is required based on the magnitude of the failure and how long you anticipate the primary site or its servers will be down.
Part of the switchover process has us shrink the DAG by removing the Primary Site's DAG nodes from the cluster, so that all that remains is the existing 2 DAG nodes in the Secondary Site. Instructions for shrinking the DAG and doing a manual site activation are located here. Should we decide to proceed with a manual site switchover, we will provision a FSW in the secondary datacenter as part of the site activation. But what happens when the Primary Site's Exchange Servers come back online? They will think they have majority, because the Primary Site has the majority of the servers and the FSW is located there. Because of this, when they start up, they will begin mounting databases.
Now this is where DAC comes in. Without DAC enabled, the Primary Site's Exchange Servers would indeed come online, think they have majority, and begin mounting databases, and you would run into a split-brain scenario. This is because when power is restored to the datacenter, the servers will usually come up before WAN connectivity is fully restored. The servers cannot communicate across the sites to see that the active databases are already mounted, and because of that, the Primary Site's Exchange Servers will see that they have majority (since the majority of your servers and your FSW should be in the Primary Site) and mount the databases.
If those servers were allowed to mount databases and you ran into a split-brain scenario, something called database divergence would occur. Database divergence is where the database copies in the primary site become different from the copies in the secondary site. Recovering requires a reseed from the authoritative database, which means losing any data that was written to the diverged copy while the split-brain condition existed.
The way DAC works is that all servers speak a new protocol known as the Database Activation Coordination Protocol (DACP). As long as the DAG is up, at least one of the DAG nodes will have a special memory bit set to 1. With DAC on, any time a starting DAG member wants to mount a database, it applies the following rules when communicating with the other DAG members:
- If the starting DAG member can communicate with all other members, its DACP bit switches to 1
- If the starting DAG member can communicate with another member whose DACP bit is set to 1, the starting DAG member's DACP bit switches to 1
- If the starting DAG member can only communicate with members whose DACP bits are set to 0, its DACP bit remains at 0
Because of this, when the Primary Site's DAG servers come back online, they need either to contact all other DAG members or to contact a DAG member whose DACP bit is set to 1 before they can be in a state where they can begin mounting databases. Because the WAN is down, the Primary Datacenter DAG servers that are just coming back online won't be able to mount databases, because none of them will have that special memory bit set to 1; that bit will be set on one of the DAG servers in the Secondary Site. Once WAN connectivity is restored, the Primary Datacenter DAG servers will be able to communicate with the DAG server that has the bit set to 1, and they will then be allowed to mount databases.
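The DACP startup rules above can be modeled with a short, purely illustrative Python sketch (this is not Exchange code; the member names and the can_reach helper are hypothetical stand-ins for whatever connectivity checks Active Manager actually performs):

```python
# Hedged model of the three DACP startup rules: a starting member may only flip
# its bit to 1 if it can reach every other member, or can reach a member whose
# bit is already 1. Otherwise it stays at 0 and databases stay dismounted.

def dacp_bit_on_startup(starting, members, can_reach, bits):
    others = [m for m in members if m != starting]
    reachable = [m for m in others if can_reach(starting, m)]

    if len(reachable) == len(others):               # Rule 1: sees every other member
        return 1
    if any(bits[m] == 1 for m in reachable):        # Rule 2: sees a member with bit = 1
        return 1
    return 0                                        # Rule 3: only sees members with bit = 0

# Primary site comes back while the WAN is down: its members only see each other
# (all bit 0), so they stay at 0 and will not mount databases.
bits = {"PRI-MBX1": 0, "PRI-MBX2": 0, "PRI-MBX3": 0, "SEC-MBX1": 1, "SEC-MBX2": 1}
wan_up = False
reach = lambda a, b: wan_up or a.split("-")[0] == b.split("-")[0]   # same site, or WAN restored
print(dacp_bit_on_startup("PRI-MBX1", list(bits), reach, bits))     # 0 -> databases stay dismounted

wan_up = True   # WAN restored: the primary members can now see the secondary site's bit = 1
print(dacp_bit_on_startup("PRI-MBX1", list(bits), reach, bits))     # 1 -> mounting allowed again
```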
Thankfully, in SP1, DAC will work with 2-node DAGs and with multi-site DAGs that use a stretched AD Site.
DAC and ForceQuorum
If you do not know what forcequorum is, have a quick look at my blog post here. Essentially, forcequorum allows you to forcefully start a cluster when that cluster has lost quorum; you're forcing it to bypass the Majority Node Set requirement to become operational. With CCR, forcequorum was used in a geographically dispersed CCR cluster: when the Primary Site went offline, you had to run forcequorum on the node in the Secondary Site and then set a new File Share Witness. This is similar to what happens with Exchange 2010 DAGs when the Primary Site goes offline.
The article here, entitled Datacenter Switchovers, is the one to use when planning Site Resiliency with Exchange 2010. You can see that in the procedure for terminating a failed site, there are two methods:
- When the DAG is in DAC mode
- When the DAG isn't in DAC mode
When looking at the procedure for when DAC is not enabled, there are more steps to perform, including running cluster service (clussvc) related commands. When looking at the procedure for when DAC is enabled, there are no steps that involve running those commands. This is because when DAC mode is on, Exchange's site resilience tasks perform the cluster work for you in the background. As you can see, it is well worth ensuring you have at least 3 DAG nodes in a DAG just to be able to utilize DAC. But again, in Exchange 2010 SP1, DAC can be utilized with DAGs that contain two nodes.
If I move all the active databases from the primary datacenter to the standby datacenter first and then perform the steps below, do you think the mailbox servers in the primary datacenter will mount the databases, assuming that DAC is enabled?
(1) Stop the DAG in the primary site.
(2) Activate the DAG member servers in the standby site. This step will evict all the nodes in the primary site. It will also stop and disable the cluster service on the mailbox servers in the primary site.
(3) Transfer the file share witness from the primary datacenter to the standby datacenter.
(4) Suspend database copy replication between the primary datacenter and the standby datacenter.
(5) Sever the WAN link between the primary site and the standby site.
Primary Datacenter: 4 mailbox-role servers and one witness server
Standby Datacenter: 2 mailbox-role servers and one witness server
One DAG spans two datacenters: primary and standby
We are trying to simulate the site failure recovery procedure as much as we can, without any data loss.
I am not able to find any information about this type of simulation on the Internet. I would really appreciate it if you could provide me some insights.
If you perform step 5 and DAC is enabled on the DAG before you simulate the failure, then when the Primary Site comes back up, those servers will not mount databases, since DAC is enabled and they cannot contact the rest of the DAG servers.
Hello Elan,
In my environment I have implemented two Mailbox servers in the Primary site and two in the DR site. The FSW is hosted by the CASHT01 server in the primary site and the alternate FSW is hosted on the CASHT03 server in the DR site. In my case, what will happen if the primary site goes down? Will all DBs mount in DR automatically? Will the alternate FSW act as a voter node if the primary FSW fails? How will the majority and FSW be counted if my primary site goes down?
Ashif.
Hi Elan,
Let me first tell you that I'm the biggest fan of your articles, because the way you describe each scenario and its solution is simply commendable!
I have some questions for your kind attention:
Scenario: I'm designing an Exchange 2010 SP1-based messaging solution, where we want a DR setup to be available to us using a DAG.
All users will connect to the Primary Datacenter, and only in the event that a disaster happens will all users move to the DR site.
We are sharing the same namespace, HO.ABC.COM, across both datacenters.
Primary Data Center will contain:
2 Nodes for CAS / HUB Exchange 2010 SP1 on Windows Server 2008 R2
2 Nodes for Mailbox Servers Exchange 2010 SP1 on Windows Server 2008 R2
The witness server would be one of the HUB servers in the Primary Site
Questions:
1) Can we have 1 witness server and two Mailbox servers in the primary site, and only one Mailbox/CAS/HUB server in the DR site?
2) In this scenario, what namespace structure do you recommend: a shared namespace or different namespaces?
3) In the event of a complete Primary Datacenter failure, where I have only a single Exchange 2010 SP1 Mailbox/CAS/HUB server in the DR site, do I also need to create a standby witness server to keep the required votes in DR for mounting databases?
Looking forward to hearing from you soon;
Zahir Hussain Shah
Dear Friends,
I have a two-node DAG; each Exchange 2010 Mailbox server is located in a different Active Directory Site. Each Mailbox server has its own mailbox database, and each one serves local users. (In short, I have active users in both sites.) The FSW is located in Site 1.
Any time I have a WAN outage, the Exchange server located in Site 2 loses contact with the FSW and dismounts its database… so users located in Site 2 can't work.
I would like to know what I should do to keep service running in Site 2.
I really hope you can help me with this one..!
Jorge Salinas
The guidance to get around this is to have 2 DAGs: Site1DAG and Site2DAG. For Site1DAG, you'll have the majority of its servers in Site1, including its FSW, so if the WAN goes down, Site1 users can still work, since the majority of that DAG's servers are in Site1 and only its failover servers are in Site2. For Site2DAG, you'll have an entirely new DAG with the majority of its servers in Site2, including its own FSW (since each DAG has its own FSW). The failover servers for this DAG would live in Site1. If the WAN link goes down, because the majority of this DAG's servers are in Site2, Site2 users will still be able to work.
So yes, you need 2 DAGs if you want to prevent the WAN link from bringing down 1 site.
Thanks a lot Alan for your quick response,
I will be making a transition to exchange server 2010 with site resilience in October. I will share my experiences with you. Please pray for smooth transition.
Anwar A.Siddiqui
Dear Shudnow,
I have two questions
1. If I have two servers in the Primary site and only one server in the secondary site, which is on an extended LAN, should I create a FSW in the secondary site as well?
2. Is it possible to have an automatic secondary site failover, without using any commands, if a common file share witness is used on another site?
Anwar A.Siddiqui
1. You should have the witness in the location with the most servers (which should be the Primary Site). As I explained with the Majority Node Set formula, you want most of your servers, and your witness, to be in the Primary Site. This is because if, or I should say when, your WAN goes down, the Primary Site's servers will stay online, since they will have quorum as long as at least (number of nodes / 2) + 1 votes are still online.
2. It doesn't matter where the witness is. Again, look at the formula. If the Primary Site loses connectivity to the DR Site and the DR Site's servers don't see (number of nodes / 2) + 1 votes online, quorum is not maintained in the secondary site, and the cluster service goes offline there, rendering Exchange inoperable until quorum in that site is re-established. If the Primary Site's servers still see (number of nodes / 2) + 1 votes, quorum is maintained in the Primary Site and the cluster service remains operational. Now let's say a single server goes offline and each location still sees (number of nodes / 2) + 1 votes online: quorum is maintained everywhere, and any server, whether in the Primary Site or the Secondary Site, can still mount databases or fail databases over to another member.
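As a purely illustrative worked example of that arithmetic (not Exchange tooling; the numbers simply match the two-servers-plus-one layout from the question), the same formula shows which side of a WAN partition keeps quorum:

```python
# A partition stays up only if it can still see floor(total_votes / 2) + 1 votes.

def partition_has_quorum(votes_visible: int, total_votes: int) -> bool:
    return votes_visible >= total_votes // 2 + 1

# Hypothetical layout: 2 members in the Primary site, 1 member in the secondary
# site (3 members is odd, so the file share witness vote is not used).
total_votes = 3
print(partition_has_quorum(2, total_votes))  # True  -> Primary side keeps quorum and stays online
print(partition_has_quorum(1, total_votes))  # False -> secondary member loses quorum; cluster service stops
```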
How do you activate one of the servers in the primary site that has its bit set to 0 if the WAN is not available?
(And we know that the remote server has not activated the databases)
You don't want to. That would cause split brain syndrome, and it is the entire reason you have DAC enabled using the 0 bit. You need to get your WAN back up so the databases can start synchronizing; the source servers would then have their bit set to 1, which would allow the PAM to mount databases on those servers. You can then schedule maintenance and switch everything back to the source datacenter (DNS, for example).
So when we shut down the primary data centre for semi-annual building power maintenance, if the WAN equipment doesn't come up we cannot use Exchange, even though the remote server is administratively stopped and will never mount the databases automatically?
There is no way to activate the server that the DAC is keeping down?
Well, in this case, you should be able to just start the DAG with the Primary Site to force those servers to mount, just like you would do in a recovery scenario. As long as the secondary datacenter never mounted, I would think it would be fine, since you won't be in a split brain scenario and end up diverged (two different sets of data in your separated datacenters). As with all situations, test it and don't just take my word for how I think it would work.
Isn't there a mistake here:
Exchange 2010 RTM it should not be enabled for:
•2 member DAGs
•Non-Multisite DAGs
•Multi-site DAGs that are in the same stretched Active Directory Site
In Exchange 2010 SP1, the following changes are introduced and supported for DAC:
•DAGs that contain 2 or more members
•DAGs that are stretched across a single AD Site
I thought in RTM you need 3 members minimum and in SP1 it's 2 members?
What I say is correct; I think you need to re-read it. I say that in Exchange 2010 RTM it should not be enabled for 2-node DAGs (which is correct, since DAC in RTM should only be used with 3+ servers), non-multi-site DAGs (meaning you have Exchange in 1 AD Site, and DAC in RTM doesn't support a single AD Site), and multi-site DAGs with a stretched AD Site (DAC has to be used where the Primary and Secondary Datacenters are each in a separate AD Site).
Then I say that in SP1 you can now use DAGs that contain 2 or more members (which again is correct, since in SP1 it's 2+ members) and DAGs that are in 1 AD Site, which is also correct.
Elan,
Though people like us are grateful to you for this and other articles, sometimes you state things in a confusing way. Why say "DAC should not be enabled for 2 member DAGs or multi-site DAGs in Ex 2010 RTM" since there is no such thing as "2 member DAGs or multi-site DAGs in Ex 2010 RTM"? There was absolutely no need for that advisory!
I meant 'Non-Multisite DAGs'…
There's no such thing as 2-node DAGs? Sure there is: create a DAG and add 2 Mailbox Servers to this DAG. You now have a 2-node DAG. And in Exchange 2010 RTM, if you only had 2 nodes (2 Mailbox Servers) in a DAG, you could not enable DAC, since DAC required you to have 3+ Mailbox Servers within the DAG.
And there's no such thing as non-multi-site DAGs? I already explained that with Exchange 2010 RTM you can enable DAC when you have 3 Mailbox Servers in a DAG, as long as one of those DAG members is in a different site. You cannot enable DAC if your DAG members are all in the same AD Site. Hence why you can't enable DAC in a non-multi-site DAG.
Nice Job!