Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering – Part 1

August 5, 2011 by Elan Shudnow 31 Comments

I’ve talked about this topic in some of my other articles but wanted to create an article that talks specifically about this model and shows several different examples of a Database Availability Group (DAG)’s tolerance for node and File Share Witness (FSW) failure. Many people don’t properly understand how the Majority Node Set Clustering Model works. In my article here, I talk about Database Activation Coordination Mode and have a section on Majority Node Set. In this article, I want to show some real world examples of how the Majority Node Set Clustering Model works. This will be a multi-part article and each Part will have its own example.

Part 1

Part 2

Part 3

Majority Node Set

Majority Node Set is a Windows Clustering Model, like the Shared Quorum Model, but it works differently. Both Exchange 2007 and Exchange 2010 Clusters use Majority Node Set Clustering (MNS). This means that more than 50% of your votes (server votes and/or 1 file share witness) need to be up and running. The proper formula for this is (n / 2) + 1, rounded down, where n is the number of DAG nodes within the DAG. With DAGs, if you have an odd number of DAG nodes in the same DAG (Cluster), you have an odd number of votes so you don’t have a witness. If you have an even number of DAG nodes, you will have a file share witness, so that if half of your nodes go down, the witness acts as that extra +1 vote.

So let’s go through an example. Let’s say we have 3 servers. This means that we need (3 nodes / 2) + 1, which equals 2 as you round down since you can’t have half a server/witness. This means that at any given time, we need 2 of our nodes to be online, which means we can sustain only 1 server failure in our DAG (with an odd number of nodes there is no file share witness). Now let’s say we have 4 servers. This means that we need (4 nodes / 2) + 1, which equals 3. With 4 servers plus the file share witness we have 5 votes, so at any given time we need 3 of our servers/witness to be online, which means we can sustain 2 server failures or 1 server failure and 1 witness failure.
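The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula as described in this article; the function names are my own and are not part of any Exchange or Windows tooling.

```python
def votes_required(nodes: int) -> int:
    """Votes that must stay online: (n / 2) + 1, rounding the division down."""
    return nodes // 2 + 1


def total_votes(nodes: int) -> int:
    """An even number of DAG nodes adds one File Share Witness vote."""
    return nodes + 1 if nodes % 2 == 0 else nodes


def tolerable_failures(nodes: int) -> int:
    """How many voters (servers and/or witness) can fail before quorum is lost."""
    return total_votes(nodes) - votes_required(nodes)


for n in (2, 3, 4):
    print(
        f"{n} nodes: {total_votes(n)} votes, "
        f"{votes_required(n)} required, "
        f"{tolerable_failures(n)} failure(s) tolerated"
    )
```

For 3 nodes this gives 2 required votes and 1 tolerated failure; for 4 nodes it gives 3 required votes and 2 tolerated failures, matching the worked example.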

Real World Examples

Each of these examples will show DAG Models with a Primary Site and a Failover Site.

2 Node DAG (One in Primary and One in Failover)

In the following screenshot, we have 3 Servers. Two are Exchange 2010 Multi-Role Servers; one in the Primary Site and one in the Failover Site. The Cluster Service is running only on the two Exchange Multi-Role Servers. More specifically, it would run on the Exchange 2010 Servers that have the Mailbox Server Role. When Exchange 2010 utilizes an even number of Nodes, it utilizes Node Majority with File Share Witness. If you have dedicated HUB and/or HUB/CAS Servers, you can place the File Share Witness on those Servers. However, the File Share Witness cannot be placed on a server that holds the Mailbox Server Role.

So now we have our three Servers; two of them being Exchange. This means we have two voters and a File Share Witness. The two Mailbox Servers that are running the cluster service are voters and the File Share Witness is just a witness that the voters use to determine cluster majority. So the question is, how many voters/servers can I lose? Well, if you read the section on Majority Node Set (which you have to understand), you know the formula is (number of nodes / 2) + 1. This means we have (2 Exchange Servers / 2) + 1 = 2. This means that 2 cluster objects must always be online for your Exchange Cluster to remain operational.

But now let’s say one of your Exchange Servers goes offline. Well, you still have at least two cluster objects online. This means your cluster will still be operational. If all users/services were utilizing the Primary Site, then everything continues to remain completely operational. If you were sending SMTP to the Failover Site or users were for some reason connecting to the Failover Site, they will need to be pointed to the Exchange Server in the Primary Site.

But what happens if you lose a second node? Well, based on the formula above we need to ensure we have 2 cluster objects operational at all times. At this time, the entire cluster goes offline. You need to go through the steps provided in the site switchover process but in this case, you would be activating the Primary Site and specifying a new Alternate File Share Witness Server that exists in the Primary Site so you can activate the Exchange 2010 Server in the Primary Site. The DAG won’t actively use the File Share Witness but you should specify it anyway because part of the Failback process is re-adding the failed Servers back to the DAG once they become operational.

But what happens if you lose two nodes in the Primary Site? Well, based on the formula above we need to ensure we have 2 cluster objects operational at all times. At this time, the entire cluster goes offline. You need to go through the steps provided in the site switchover process but in this case, you would be activating the Failover Site and specifying a new Alternate File Share Witness Server that exists (or will exist) in the Failover Site so you can activate the Exchange 2010 Server in the Failover Site. The DAG won’t actively use the File Share Witness but you should specify it anyway because part of the Failback process is re-adding the Primary Site Servers back to the DAG once they become operational.
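The failure scenarios for this 2 node DAG can be checked with a small simulation. This is a minimal sketch, assuming the FSW lives in the Primary Site; the voter names are hypothetical labels and the code only counts votes, it does not model the actual Cluster Service.

```python
# Hypothetical labels for the three cluster objects in the 2 node DAG.
VOTERS = {"primary_mbx", "failover_mbx", "fsw"}

# (2 nodes / 2) + 1 = 2 cluster objects must stay online.
REQUIRED = 2 // 2 + 1


def has_quorum(failed: set) -> bool:
    """True when enough cluster objects remain online to keep the DAG up."""
    online = VOTERS - failed
    return len(online) >= REQUIRED


scenarios = {
    "Failover Site server down": {"failover_mbx"},
    "Failover Site server and FSW down": {"failover_mbx", "fsw"},
    "Primary Site down (server and FSW)": {"primary_mbx", "fsw"},
}

for label, failed in scenarios.items():
    if has_quorum(failed):
        outcome = "quorum kept"
    else:
        outcome = "quorum lost, manual site switchover needed"
    print(f"{label}: {outcome}")
```

Only the first scenario keeps quorum. In the other two, a single cluster object is left online, which is why the manual switchover steps are required.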

Once the Datacenter Switchover has occurred, you will be in a state that looks as such. An Alternate File Share Witness is not for redundancy for your 2010 FSW that was in your Primary Site. It’s used only during a Datacenter Switchover which is a manual process.

Once your Primary Site becomes operational, you will re-add the Primary DAG Server to the existing DAG which will still be using the 2010 Alternate FSW Server in the Failover Site and you will now be switched into a Node Majority with File Share Witness Cluster instead of just Node Majority. Remember I said with an odd number of DAG Servers, you will be in Node Majority and with an even number, the Cluster will automatically switch itself to Node Majority with File Share Witness? You will now be in a state that looks as such.
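The odd/even rule can be written down as a tiny helper. This is an illustrative sketch of the rule as stated in this article, with a hypothetical function name; it is not how the cluster itself is queried.

```python
def quorum_model(dag_members: int) -> str:
    """Return the quorum model this article describes for a member count."""
    if dag_members < 1:
        raise ValueError("a DAG needs at least one member")
    if dag_members % 2 == 1:
        return "Node Majority"
    return "Node Majority with File Share Witness"


# Member counts for the 2 node DAG walked through above.
timeline = [
    ("normal operation", 2),
    ("after the Datacenter Switchover", 1),
    ("after re-adding the Primary Site server", 2),
]

for stage, members in timeline:
    print(f"{stage}: {members} member(s) -> {quorum_model(members)}")
```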

Part of the Failback Process would be to switch back to the old FSW Server in the Primary Site. Once done, you will be back into your original operational state.

As you can see with how this works, the question that may arise is where to put your FSW. Well, it should be in the Primary Site with the most users or the site that has the most important users. With that in mind, I bet another question arises: why the site with the most users or the most important users? Because some environments may want to use the above with an Active/Active Model instead of an Active/Passive one. Some databases may be activated in both sites. But, with that, if the WAN link goes down, the Exchange 2010 Server in the Failover Site loses quorum since it can’t contact at least 1 other voter. Again, you must have two voters online. This also means that each voter must be able to see one other voter. Because of that, the Exchange 2010 Server in the Failover Site will go completely offline.

To survive this, you really must use 2 different DAGs. One DAG where the FSW is in the First Site and a second DAG where its FSW is in the Second Site. Users that live in the First Active Site would primarily be using the Exchange 2010 DAG Members in the First Active Site. Users that live in the Second Active Site would primarily be using the Exchange 2010 DAG Members in the Second Active Site. This way, if anything happens with the WAN link, users in the First Active Site would still be operational as the FSW for their DAG is in the First Active Site and DAG 1 would maintain Quorum. Users in the Second Active Site would still be operational as the FSW for their DAG is in the Second Active Site and DAG 2 would maintain Quorum.
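To see why the FSW placement decides which site survives a WAN outage, here is a rough sketch that counts the votes each site can still see. It assumes each DAG has one member per site, as in the 2 node example; the `Dag` class and site names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Dag:
    name: str
    members_per_site: dict  # site name -> number of DAG members in that site
    fsw_site: str           # site that hosts the File Share Witness

    def votes_required(self) -> int:
        """(n / 2) + 1, rounded down, where n is the number of DAG nodes."""
        nodes = sum(self.members_per_site.values())
        return nodes // 2 + 1

    def votes_visible_from(self, site: str) -> int:
        """Votes a site can still reach while the WAN link is down."""
        votes = self.members_per_site.get(site, 0)
        if self.fsw_site == site:
            votes += 1
        return votes

    def site_keeps_quorum(self, site: str) -> bool:
        return self.votes_visible_from(site) >= self.votes_required()


dag1 = Dag("DAG 1", {"Site 1": 1, "Site 2": 1}, fsw_site="Site 1")
dag2 = Dag("DAG 2", {"Site 1": 1, "Site 2": 1}, fsw_site="Site 2")

for dag in (dag1, dag2):
    for site in ("Site 1", "Site 2"):
        state = "keeps quorum" if dag.site_keeps_quorum(site) else "loses quorum"
        print(f"{dag.name} in {site}: {state}")
```

With the WAN link down, DAG 1 stays up only in Site 1 and DAG 2 stays up only in Site 2, so each site's users remain on the DAG whose FSW is local to them.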

Note: This would require twice the number of servers since a DAG Member cannot be a part of more than one DAG. Each visual representation below of a 2010 HUB/CAS/MBX is a separate server.

The Multi-DAG Model would look like this.

Comments

Ahmed Al-Haffar says
January 22, 2014 at 12:30 am
Thanks Elan for this informative, very well structured article.
i have a doubt currently we have 2 DAG Members and 1 FSW, what will happen if i restart the FSW, as per the formula nothing will be happened but i want to double check with you.
regards.
Sam says
October 24, 2013 at 6:22 pm
Hi Elan,
Part 2 actually answered my question. :-)
Thanks
Sam says
October 24, 2013 at 5:58 pm
Hi Elan,
This has been very informative. A scenario I'm currently facing with a client is that they have two datacentres. DC1 contains MBX 1 & 2 and a FSW. DC2 contains MBX 3 only. All databases are active only on the 2 MBX servers in DC1 whilst MBX 3 holds passive copies. My intention is to recommend putting in a MBX 4 at DC2 so in case DC1 goes down, MBX 3 doesn't have to deal with the load by itself.
Currently, there is just one DAG however. I know you recommended above to set up a MBX 4 in a separate DAG and set up a FSW in DC2. However, in the even the client does not want to invest in additional servers, what options do I have if DC1 goes down? We are also looking to implement DAC mode for the client.
said selfani says
April 27, 2013 at 3:41 am
hi
i have immplement exchange 2010 cluster
PRIMARY
MBX1
HUB /CAS/FSW
and i have DR
MBX2
HUB2/CAS2
i have 1 fsw but when the primary site goes down i think dr site will not work because the fsw is down
so please help me how to make dr site work in case the primary site is down
if i need to creat alternate fsw so how i can do that ?
Shea Werner says
March 23, 2013 at 10:25 pm
Rephrase of above question; Can I use DNS Failover (i.e. offered by dnsmadeeasy.com)
So if the primary hub/cas/mbx failed but the fsw did not, the dns failover would see site one hub/cas/mbx is not working and would redirect clients to site 2 allowing for automatic failover at least in this case.
Shea Werner says
March 20, 2013 at 6:15 pm
correction to my question. I meant "DNS failover"
Shea Werner says
March 20, 2013 at 4:22 pm
can I use load balanced dns for this setup.
So if the primary hub/cas/mbx but the fsw did not, the dns lb would see site one hub/cas/mbx is not working and would redirect clients to site 2 allowing for automatic failover at least in this case?
Antonio says
February 8, 2013 at 4:39 pm
What about the bandwidth consumption between site1 and site2
I should not worry about that?
- Elan Shudnow says
  February 10, 2013 at 8:39 am
  Depends on several factors. Are users active in both sites? If so, are databases active in both sites? If so, do users in the second site have their MX records pointing at the other site? And don't forget about replication traffic. For users active in the second site, are they using a centralized webmail.domain.com which goes to the primary site and proxies traffic to the site they are in?
  As you can see, there are definitely bandwidth consumption considerations that need to be accounted for. There are two calculators which can help here:
  1. Client Network Bandwidth Calculator: http://blogs.technet.com/b/exchange/archive/2012/…
  2. Exchange 2010 Mailbox Server Role Requirements Calculator: http://blogs.technet.com/b/exchange/archive/2009/…
Sunita says
August 1, 2012 at 8:57 am
Great Article. I had a question,
I have 2 sites. Site 1 is my primary and Site 2 I would like to setup as my DR. I am planning on move to a new building for Site 1 so I will need to power off all of the servers in Site 1. My question is with this design will Site 2 be able to work?
My configuration at Site 1 is:
CAS1FSW
MBX1
MBX2
Site 2:
CAS2FSW
MBX3
Thanks for your help
S
- Elan Shudnow says
  August 1, 2012 at 5:48 pm
  You will not have quorum in Site 2. You would need to go through the manual DR site procedures in Site 2 in order to have quorum in Site 2 and using CAS2FSW as the alternate FSW. Then when Site 1 is back up and running, you would re-add Site 1 MBX Servers back into the DAG.
  What you could do here is add MBX4 to Site 2 and before you take down Site 1, move the active FSW to CAS2FSW in Site 2 so you have quorum there and wouldn't have to run through manual DR procedures. Obviously you'd still need to ensure that HTTP traffic will be pointed to Site 2 and SMTP mail is delivered to Site 2.
Nowin says
March 21, 2012 at 1:17 pm
Hi Elan,
Please can you help me with my problem. I have 2 nodes active and passive exchange 2007 roles installed. Whenever i switch off the node that has the quorum, the cluster fails. I understad this is because of the fomula and that quorum is not maintained. So how to i maintain quorum. When i switch off the serve that has the quorum, is the quorum suppose to failover to the passive node so that quorum could be maintained? i notice that my quorum disk is not moving over to the passive node once the server its connected on is shutdown. My SCC disks are connected on a SAN,
John Panicci says
January 19, 2012 at 9:37 am
Elan, Great website, awesome articles. I have 2 site (Prod/DR) active passive dag. 1 mbx in each site. and 2 hub/cas servers in each array in each site. I have FSW on one of the hub/cas servers in primary site. I have alternate fsw setup on one of the hub/cas servers in DR site. you mention the following: "Part of the Failback Process would be to switch back to the old FSW Server in the Primary Site. Once done, you will be back into your original operational state." I agree that this is what you need to do, but how do you actually do the switch back to fsw in primary site. Im basically worrying about site link being down between prod and DR..
- Elan Shudnow says
  January 25, 2012 at 4:05 pm
  The official documentation is here: http://technet.microsoft.com/en-us/library/dd3510…
  There is a section entitled, "Restoring Service to the Primary Datacenter"
  It discusses on how to go back to your original FSW.
Craig says
December 1, 2011 at 1:45 pm
Elan
What if just the primary hub/cas/mbx failed and the fsw and the server in the other datacenter is still active? Does everything switchover fine or is there manual intervention to get the clients working on the datacenter hub/cas/mbx
- Elan Shudnow says
  December 1, 2011 at 2:08 pm
  If Quorum is maintained then the Mailbox role stays operational. But, clients may not be able to access it since the FQDNs will most likely be pointing to the servers in the primary site. So while Mailbox Role may be operational, the CAS Role may not.
Jim says
December 1, 2011 at 11:03 am
Great. thanks for the fast response. Any idea on the actual issue though of when I switchover and shutdown the primary that they won't connect up to the secondary? Prompts credentials and after authenticating doesn't work bring up outlook either. It seems it is something between the CAS/MBX.
I've tried somethings I found on google but nothing has solved it
- Elan Shudnow says
  December 1, 2011 at 2:07 pm
  Well it's most likely because they can't authenticate because the Outlook Anywhere FQDN is pointed to the primary site. That server is no longer up. This is why part of the DR plan includes downtime and switching over the FQDNs to point to the DR Site. That way clients can contact the DR Server(s) and authenticate since the FQDNs are now pointed there. This includes moving over all CAS Namespaces.
Jim says
December 1, 2011 at 10:21 am
Thanks for the reply Elan, one more issue I'm having. Basically whenever i switchover to the other server then shutdown the primary server all outlook clients get prompted for credentials and even if they type in the credentials it doesn't work. I found a fix here disabling outlook anywhere for that. http://port25.wordpress.com/2011/01/26/users-rece… but now the it just sits at trying to connect and can't establish a connection.
If i fire up the primary node again it works fine. I can then switchover everything to the primary node and shutoff the secondary node and everything is happy. It's only when the primary node is down does it not work. Switchover works fine, but once the primary node is shutoff clients can't connect. Any ideas? I'm stuck.
thanks!
Jim
- Elan Shudnow says
  December 1, 2011 at 10:41 am
  I wouldn't do what that article said. The basic idea during a site failover is that you have a lower TTL value for your DNS records. For example, 5 minutes. When your primary server goes down, you cut over all your DNS records to point to the second server.
  Because your second datacenter is strictly DR only and you're not in an active/active scenario, you can set the Outlook Anywhere FQDN on your DR Servers to have the same FQDN as Outlook Anywhere in the Primary Site. Then when you switch over DNS to the secondary Datacenter, your Outlook Anywhere FQDN will be the same. Obviously you'll want to make sure that the certificate in the secondary site has the Outlook Anywhere FQDN and the Common Name on the certificate is the same. This is because clients older than Vista SP1 don't have the capability to have the Certificate Principal Name (MSSTD value in Outlook Anywhere) be a SAN name on the certificate.
  - ponzekap2 says
    October 25, 2012 at 7:52 pm
    Hey Elan. Huge fan of your blog, and Im actually the author of the article Jim references. I was wondering your thoughts on what I had wrote. In a situation where a DB fails over and the client is using MAPI to connect to the CAS array, and OA is in basic authentication mode, you'll have a situation where the Outlook client tries to connect using HTTPS, particularly the public folder connection point. Was just wondering besides having NTLM enabled (saying that isnt an option), I'm interested in what you would recommend in that situation? Huge fan of the material you put up, and would be interested to hear what you think. You can email me at my posting name AT gmail.com if you want. Would love to hear your thoughts.
Jim says
November 30, 2011 at 10:21 am
I have the same setup here I'm implementing. One mbx/cas/hub onsite with the FSW. The other mbx/cas/hub located at our datacenter(DR location). Should I also turn on DAC for this now? Exchange 2010 sp1
thanks
Jim
- Elan Shudnow says
  November 30, 2011 at 1:48 pm
  Yes, you should turn DAC on. It's designed for every environment where you have 1 DAG in more than one AD Site or Datacenter with stretched AD Sites. Read more on DAC here: https://www.shudnow.io/2010/06/30/exchange-2010-d…
Vincent says
August 31, 2011 at 11:32 am
I tought that each mailbox server could only be member of 1 DAG.
Considering this, how can you have 2 DAG^
- Elan Shudnow says
  August 31, 2011 at 12:48 pm
  That's correct. You have 2 DAGs with twice the amount of servers that you would with 1 DAG.
  - Vincent says
    August 31, 2011 at 6:09 pm
    Ok. That's what I was thinking… The diagrams were not that clear so I was wondering if there were a trick to do it anyways…
    Thanks fir the reply.
    Vincent