I’ve talked about this topic in some of my other articles, but I wanted to create an article that talks specifically about this model and shows several examples of a Database Availability Group (DAG)’s tolerance for node and File Share Witness (FSW) failure. Many people don’t properly understand how the Majority Node Set Clustering Model works. In my article here, I talk about Database Activation Coordination Mode and have a section on Majority Node Set. In this article, I want to visibly show some real world examples of how the Majority Node Set Clustering Model works. This will be a multi-part article and each Part will have its own example.
Part 1
Majority Node Set
Majority Node Set is a Windows Clustering Model like the Shared Quorum Model, but different. Both Exchange 2007 and Exchange 2010 Clusters use Majority Node Set Clustering (MNS). This means that more than 50% of your votes (server votes and/or 1 file share witness) need to be up and running. The proper formula for this is (n / 2) + 1, rounded down, where n is the number of DAG nodes within the DAG. With DAGs, if you have an odd number of DAG nodes in the same DAG (Cluster), you have an odd number of votes so you don’t need a witness. If you have an even number of DAG nodes, you will have a file share witness so that, if half of your nodes go down, the witness acts as that extra +1 vote.
So let’s go through an example. Let’s say we have 3 servers. This means that we need (number of nodes which is 3 / 2) + 1, which equals 2 as you round down since you can’t have half a server/witness. This means that at any given time, we need 2 of our nodes to be online, which means we can sustain only 1 server failure in our DAG (with an odd number of nodes, the file share witness does not vote). Now let’s say we have 4 servers. This means that we need (number of nodes which is 4 / 2) + 1, which equals 3. This means at any given time, we need 3 of our servers/witness to be online, which means we can sustain 2 server failures, or 1 server failure and 1 witness failure.
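The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula, not anything Exchange or Windows Clustering exposes; the function name `quorum_summary` and the fields it returns are made up for this example.

```python
def quorum_summary(dag_nodes: int) -> dict:
    """Apply the (n / 2) + 1 rule from the text to a DAG with `dag_nodes` members.

    Hypothetical helper for illustration only.
    """
    if dag_nodes < 1:
        raise ValueError("a DAG needs at least one node")
    # An even number of nodes gets a File Share Witness as the extra vote.
    uses_witness = dag_nodes % 2 == 0
    total_votes = dag_nodes + (1 if uses_witness else 0)
    # Integer division rounds down, matching "you can't have half a server".
    votes_required = dag_nodes // 2 + 1
    return {
        "uses_witness": uses_witness,
        "total_votes": total_votes,
        "votes_required": votes_required,
        "failures_tolerated": total_votes - votes_required,
    }


for n in (2, 3, 4):
    print(n, quorum_summary(n))
```

For 3 nodes this reports 2 required votes and 1 tolerated failure; for 4 nodes it reports 3 required votes and 2 tolerated failures, matching the two examples above.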
Real World Examples
Each of these examples will show DAG Models with a Primary Site and a Failover Site.
2 Node DAG (One in Primary and One in Failover)
In the following screenshot, we have 3 Servers. Two are Exchange 2010 Multi-Role Servers; one in the Primary Site and one on the Failover Site. The Cluster Service is running only on the two Exchange Multi-Role Servers. More specifically, it would run on the Exchange 2010 Servers that have the Mailbox Server Role. When Exchange 2010 utilizes an even number of Nodes, it utilizes Node Majority with File Share Witness. If you have dedicated HUB and/or HUB/CAS Servers, you can place the File Share Witness on those Servers. However, the File Share Witness cannot be placed on the Mailbox Server Role.
So now we have our three Servers, two of them being Exchange. This means we have two voters and a File Share Witness. The two Mailbox Servers that are running the cluster service are voters, and the File Share Witness is just a witness that the voters use to determine cluster majority. So the question is, how many voters/servers can I lose? Well, if you read the section on Majority Node Set (which you have to understand), you know the formula is (number of nodes / 2) + 1. This means we have (2 Exchange Servers / 2) + 1 = 1 + 1 = 2. This means that 2 cluster objects must always be online for your Exchange Cluster to remain operational.
But now let’s say one of your Exchange Servers goes offline. Well, you still have at least two cluster objects online. This means your cluster will still be operational. If all users/services were utilizing the Primary Site, then everything continues to remain completely operational. If you were sending SMTP to the Failover Site or users were for some reason connecting to the Failover Site, they will need to be pointed to the Exchange Server in the Primary Site.
But what happens if you lose a second node? Well, based on the formula above we need to ensure we have 2 cluster objects operational at all times. At this time, the entire cluster goes offline. You need to go through the steps provided in the site switchover process, but in this case you would be activating the Primary Site and specifying a new Alternate File Share Witness Server that exists in the Primary Site so you can activate the Exchange 2010 Server in the Primary Site. The DAG won’t actively use the File Share Witness, but you should specify it anyway because part of the Failback process is re-adding the Primary Site Servers back to the DAG once they become operational.
But what happens if you lose two nodes in the Primary Site? Well, based on the formula above we need to ensure we have 2 cluster objects operational at all times. At this time, the entire cluster goes offline. You need to go through the steps provided in the site switchover process, but in this case you would be activating the Failover Site and specifying a new Alternate File Share Witness Server that exists (or will exist) in the Failover Site so you can activate the Exchange 2010 Server in the Failover Site. The DAG won’t actively use the File Share Witness, but you should specify it anyway because part of the Failback process is re-adding the Primary Site Servers back to the DAG once they become operational.
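To make these failure cases concrete, the following sketch enumerates every combination of failures for the 2-node DAG and reports whether two cluster objects are still online. The object names (`MBX-Primary`, `MBX-Failover`, `FSW`) are placeholders, and the code models only the vote count, not real cluster behavior.

```python
from itertools import combinations

# Placeholder names for the three cluster objects in the 2-node DAG example.
CLUSTER_OBJECTS = ("MBX-Primary", "MBX-Failover", "FSW")
VOTES_REQUIRED = 2  # (2 nodes / 2) + 1


def cluster_online(failed: set) -> bool:
    """Return True if enough cluster objects remain online to keep quorum."""
    online = [obj for obj in CLUSTER_OBJECTS if obj not in failed]
    return len(online) >= VOTES_REQUIRED


def failure_table() -> list:
    """List (failed objects, cluster still online?) for every failure combination."""
    rows = []
    for count in range(len(CLUSTER_OBJECTS) + 1):
        for failed in combinations(CLUSTER_OBJECTS, count):
            rows.append((failed, cluster_online(set(failed))))
    return rows


for failed, online in failure_table():
    label = ", ".join(failed) or "nothing"
    print(f"failed: {label:<35} cluster online: {online}")
```

Any single failure leaves the cluster online; any two failures take it offline, which is the point where the manual switchover steps come in.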
Once the Datacenter Switchover has occurred, you will be in a state that looks as such. An Alternate File Share Witness does not provide redundancy for the 2010 FSW that was in your Primary Site. It’s used only during a Datacenter Switchover, which is a manual process.
Once your Primary Site becomes operational, you will re-add the Primary DAG Server to the existing DAG which will still be using the 2010 Alternate FSW Server in the Failover Site and you will now be switched into a Node Majority with File Share Witness Cluster instead of just Node Majority. Remember I said with an odd number of DAG Servers, you will be in Node Majority and with an even number, the Cluster will automatically switch itself to Node Majority with File Share Witness? You will now be in a state that looks as such.
Part of the Failback Process would be to switch back to the old FSW Server in the Primary Site. Once done, you will be back into your original operational state.
As you can see from how this works, the question that may arise is where to put your FSW. Well, it should be in the Primary Site with the most users or the site that has the most important users. With that in mind, I bet another question arises: why the site with the most users or the most important users? Because some environments may want to use the above with an Active/Active Model instead of an Active/Passive one. Some databases may be activated in both sites. But with that, if the WAN link goes down, the Exchange 2010 Server in the Failover Site loses quorum since it can’t contact at least 1 other voter. Again, you must have two voters online. This also means that each voter must be able to see one other voter. Because of that, the Exchange 2010 Server in the Failover Site will go completely offline.
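A WAN outage can be modeled the same way: each site counts only the voters it can still see. The sketch below assumes the 2-node DAG from this example with the FSW in the Primary Site; the function name and the site and server names are placeholders.

```python
def surviving_sites(votes_by_site: dict, votes_required: int) -> dict:
    """For a full WAN partition, report which sites still hold enough votes.

    votes_by_site maps a site name to the cluster objects (DAG members and,
    if present, the File Share Witness) located in that site.
    """
    return {
        site: len(objects) >= votes_required
        for site, objects in votes_by_site.items()
    }


# Placeholder layout: one DAG member per site, FSW in the Primary Site.
layout = {
    "Primary": ["MBX-Primary", "FSW"],
    "Failover": ["MBX-Failover"],
}

print(surviving_sites(layout, votes_required=2))
# {'Primary': True, 'Failover': False}
```

Moving the FSW entry to the other site flips the result, which is why the witness belongs in the site whose users you most want to keep online.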
To survive this, you really must use 2 different DAGs: one DAG where the FSW is in the First Site and a second DAG where its FSW is in the Second Site. Users that live in the First Active Site would primarily be using the Exchange 2010 DAG Members in the First Active Site. Users that live in the Second Active Site would primarily be using the Exchange 2010 DAG Members in the Second Active Site. This way, if anything happens with the WAN link, users in the First Active Site would still be operational as the FSW for their DAG is in the First Active Site and DAG 1 would maintain Quorum. Users in the Second Active Site would still be operational as the FSW for their DAG is in the Second Active Site and DAG 2 would maintain Quorum.
Note: This would require twice the number of servers since a DAG Member cannot be a part of more than one DAG. As shown below, each visual representation of a 2010 HUB/CAS/MBX is a separate server.
The Multi-DAG Model would look like this.
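As a rough check of the Multi-DAG Model, the sketch below places two DAGs across two sites, each with its FSW in a different site, and asks where each DAG keeps quorum when the WAN link is down. The `Dag` class, the DAG names, and the server names are all invented for illustration, and the model assumes an even number of members per DAG so that the FSW vote is in use.

```python
from dataclasses import dataclass


@dataclass
class Dag:
    name: str
    members_by_site: dict  # site name -> list of DAG member servers
    witness_site: str      # site that hosts this DAG's File Share Witness

    def votes_required(self) -> int:
        total_members = sum(len(m) for m in self.members_by_site.values())
        return total_members // 2 + 1

    def site_with_quorum(self):
        """Return the site that keeps this DAG online during a WAN outage, if any."""
        for site, members in self.members_by_site.items():
            # The witness adds one vote to whichever site hosts it.
            votes = len(members) + (1 if site == self.witness_site else 0)
            if votes >= self.votes_required():
                return site
        return None


# Four separate servers: a DAG member cannot belong to more than one DAG.
dag1 = Dag("DAG1", {"Site1": ["MBX1"], "Site2": ["MBX2"]}, witness_site="Site1")
dag2 = Dag("DAG2", {"Site1": ["MBX3"], "Site2": ["MBX4"]}, witness_site="Site2")

for dag in (dag1, dag2):
    print(dag.name, "stays online in", dag.site_with_quorum())
```

With this layout, DAG1 keeps quorum in Site1 and DAG2 keeps quorum in Site2, so users in each site continue to be served by the DAG whose FSW is local to them.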
Ahmed Al-Haffar says
Thanks Elan for this informative, very well structured article.
i have a doubt currently we have 2 DAG Members and 1 FSW, what will happen if i restart the FSW, as per the formula nothing will be happened but i want to double check with you.
regards.
Sam says
Hi Elan,
Part 2 actually answered my question. :-)
Thanks
Sam says
Hi Elan,
This has been very informative. A scenario I'm currently facing with a client is that they have two datacentres. DC1 contains MBX 1 & 2 and a FSW. DC2 contains MBX 3 only. All databases are active only on the 2 MBX servers in DC1 whilst MBX 3 holds passive copies. My intention is to recommend putting in a MBX 4 at DC2 so in case DC1 goes down, MBX 3 doesn't have to deal with the load by itself.
Currently, there is just one DAG however. I know you recommended above to set up a MBX 4 in a separate DAG and set up a FSW in DC2. However, in the even the client does not want to invest in additional servers, what options do I have if DC1 goes down? We are also looking to implement DAC mode for the client.
said selfani says
hi
i have immplement exchange 2010 cluster
PRIMARY
MBX1
HUB /CAS/FSW
and i have DR
MBX2
HUB2/CAS2
i have 1 fsw but when the primary site goes down i think dr site will not work because the fsw is down
so please help me how to make dr site work in case the primary site is down
if i need to creat alternate fsw so how i can do that ?
Shea Werner says
Rephrase of above question; Can I use DNS Failover (i.e. offered by dnsmadeeasy.com)
So if the primary hub/cas/mbx failed but the fsw did not, the dns failover would see site one hub/cas/mbx is not working and would redirect clients to site 2 allowing for automatic failover at least in this case.
Shea Werner says
correction to my question. I meant "DNS failover"
Shea Werner says
can I use load balanced dns for this setup.
So if the primary hub/cas/mbx but the fsw did not, the dns lb would see site one hub/cas/mbx is not working and would redirect clients to site 2 allowing for automatic failover at least in this case?
Antonio says
What about the bandwidth consumption between site1 and site2
I should not worry about that?
Elan Shudnow says
Depends on several factors. Are users active in both sites? If so, are databases active in both sites? If so, do users in the second site have their MX records pointing to the other site? And don't forget about replication traffic. For users active in the second site, are they using a centralized webmail.domain.com which goes to the primary site and proxies traffic to the site they are in?
As you can see, there are definitely bandwidth consumption considerations that need to be accounted for. There are two calculators which can help here:
1. Client Network Bandwidth Calculator: http://blogs.technet.com/b/exchange/archive/2012/…
2. Exchange 2010 Mailbox Server Role Requirements Calculator: http://blogs.technet.com/b/exchange/archive/2009/…
Sunita says
Great Article. I had a question,
I have 2 sites. Site 1 is my primary and Site 2 I would like to setup as my DR. I am planning on move to a new building for Site 1 so I will need to power off all of the servers in Site 1. My question is with this design will Site 2 be able to work?
My configuration at Site 1 is:
CAS1FSW
MBX1
MBX2
Site 2:
CAS2FSW
MBX3
Thanks for your help
S
Elan Shudnow says
You will not have quorum in Site 2. You would need to go through the manual DR site procedures in Site 2 in order to have quorum in Site 2 and using CAS2FSW as the alternate FSW. Then when Site 1 is back up and running, you would re-add Site 1 MBX Servers back into the DAG.
What you could do here is add MBX4 to Site 2 and before you take down Site 1, move the active FSW to CAS2FSW in Site 2 so you have quorum there and wouldn't have to run through manual DR procedures. Obviously you'd still need to ensure that HTTP traffic will be pointed to Site 2 and SMTP mail is delivered to Site 2.
Nowin says
Hi Elan,
Please can you help me with my problem. I have 2 nodes active and passive exchange 2007 roles installed. Whenever i switch off the node that has the quorum, the cluster fails. I understad this is because of the fomula and that quorum is not maintained. So how to i maintain quorum. When i switch off the serve that has the quorum, is the quorum suppose to failover to the passive node so that quorum could be maintained? i notice that my quorum disk is not moving over to the passive node once the server its connected on is shutdown. My SCC disks are connected on a SAN,
John Panicci says
Elan, Great website, awesome articles. I have 2 site (Prod/DR) active passive dag. 1 mbx in each site. and 2 hub/cas servers in each array in each site. I have FSW on one of the hub/cas servers in primary site. I have alternate fsw setup on one of the hub/cas servers in DR site. you mention the following: "Part of the Failback Process would be to switch back to the old FSW Server in the Primary Site. Once done, you will be back into your original operational state." I agree that this is what you need to do, but how do you actually do the switch back to fsw in primary site. Im basically worrying about site link being down between prod and DR..
Elan Shudnow says
The official documentation is here: http://technet.microsoft.com/en-us/library/dd3510…
There is a section entitled, "Restoring Service to the Primary Datacenter"
It discusses on how to go back to your original FSW.
Craig says
Elan
What if just the primary hub/cas/mbx failed and the fsw and the server in the other datacenter is still active? Does everything switchover fine or is there manual intervention to get the clients working on the datacenter hub/cas/mbx
Elan Shudnow says
If Quorum is maintained then the Mailbox role stays operational. But, clients may not be able to access it since the FQDNs will most likely be pointing to the servers in the primary site. So while Mailbox Role may be operational, the CAS Role may not.
Jim says
Great. thanks for the fast response. Any idea on the actual issue though of when I switchover and shutdown the primary that they won't connect up to the secondary? Prompts credentials and after authenticating doesn't work bring up outlook either. It seems it is something between the CAS/MBX.
I've tried somethings I found on google but nothing has solved it
Elan Shudnow says
Well it's most likely because they can't authenticate because the Outlook Anywhere FQDN is pointed to the primary site. That server is no longer up. This is why part of the DR plan includes downtime and switching over the FQDNs to point to the DR Site. That way clients can contact the DR Server(s) and authenticate since the FQDNs are now pointed there. This includes moving over all CAS Namespaces.
Jim says
Thanks for the reply Elan, one more issue I'm having. Basically whenever i switchover to the other server then shutdown the primary server all outlook clients get prompted for credentials and even if they type in the credentials it doesn't work. I found a fix here disabling outlook anywhere for that. http://port25.wordpress.com/2011/01/26/users-rece… but now the it just sits at trying to connect and can't establish a connection.
If i fire up the primary node again it works fine. I can then switchover everything to the primary node and shutoff the secondary node and everything is happy. It's only when the primary node is down does it not work. Switchover works fine, but once the primary node is shutoff clients can't connect. Any ideas? I'm stuck.
thanks!
Jim
Elan Shudnow says
I wouldn't do what that article said. The basic idea during a site failover is that you have a lower TTL value for your DNS records. For example, 5 minutes. When your primary server goes down, you cut over all your DNS records to point to the second server.
Because your second datacenter is strictly DR only and you're not in an active/active scenario, you can set the Outlook Anywhere FQDN on your DR Servers to have the same FQDN as Outlook Anywhere in the Primary Site. Then when you switch over DNS to the secondary Datacenter, your Outlook Anywhere FQDN will be the same. Obviously you'll want to make sure that the certificate in the secondary site has the Outlook Anywhere FQDN and that the Common Name on the certificate is the same. This is because clients older than Vista SP1 cannot use a SAN name on the certificate as the Certificate Principal Name (MSSTD value in Outlook Anywhere).
ponzekap2 says
Hey Elan. Huge fan of your blog, and Im actually the author of the article Jim references. I was wondering your thoughts on what I had wrote. In a situation where a DB fails over and the client is using MAPI to connect to the CAS array, and OA is in basic authentication mode, you'll have a situation where the Outlook client tries to connect using HTTPS, particularly the public folder connection point. Was just wondering besides having NTLM enabled (saying that isnt an option), I'm interested in what you would recommend in that situation? Huge fan of the material you put up, and would be interested to hear what you think. You can email me at my posting name AT gmail.com if you want. Would love to hear your thoughts.
Jim says
I have the same setup here I'm implementing. One mbx/cas/hub onsite with the FSW. The other mbx/cas/hub located at our datacenter(DR location). Should I also turn on DAC for this now? Exchange 2010 sp1
thanks
Jim
Elan Shudnow says
Yes, you should turn DAC on. It's designed for every environment where you have 1 DAG in more than one AD Site or Datacenter with stretched AD Sites. Read more on DAC here: https://www.shudnow.io/2010/06/30/exchange-2010-d…
Vincent says
I tought that each mailbox server could only be member of 1 DAG.
Considering this, how can you have 2 DAG^
Elan Shudnow says
That's correct. You have 2 DAGs with twice the amount of servers that you would with 1 DAG.
Vincent says
Ok. That's what I was thinking… The diagrams were not that clear so I was wondering if there were a trick to do it anyways…
Thanks fir the reply.
Vincent