Introduction
Welcome to Part 3 of this article series. In Part 1, we started off by discussing the goal of this lab: to pull together the information out there on how to use Central Site Resilience, including failovers, failbacks, how redirects function, and how SRV records fit in. We first discussed the lab setup using Hyper-V, and then took a look at the base topology and configuration. In Part 2 of this article series, I went through the sign-in process for each user. Because the SRV record for sign-in pointed to A-L14FE1.shudlab.net, the sign-in process was different for each user logging in.
In this part, we’ll do a failover test and a failback test with only one SRV record in place. We’ll take a look at what happens to ClientUser1 using A-L14FE1 Pool and what happens to ClientUser2 using A-L14FE2 Pool when we take down one of the Pools. We will then take a look at what happens when the Pool that came down comes back online. And finally, we will end our tests by seeing what happens when a second SRV record is in place.
Part 3
The Failover with only one SRV
As shown in Part 1, our SRV record points to A-L14FE1.shudlab.net. If you recall from Part 2, when ClientUser1 signed in and connected to A-L14FE1.shudlab.net, it received no 301 redirect and therefore was not informed of its Primary and Backup Registrar. We also saw ClientUser2 connect to A-L14FE1.shudlab.net and receive a 301 redirect with its Primary and Backup Registrar; ClientUser2 then connected and registered to A-L14FE2.shudlab.net.
ClientUser1
Let’s start by disabling the NIC on A-L14FE1 to see the behavior of ClientUser1. Don’t forget, ClientUser1 initially connected to A-L14FE1.shudlab.net with no 301 redirection. Because of this, it has no idea what its Backup Registrar is, and there are no additional SRV records other than the one that points to A-L14FE1.
After approximately 30 seconds, ClientUser1 gets disconnected.
Remember in the Topology, we had a failover detection time of 30 seconds. I let this sit here for about 5-8 minutes and it stayed disconnected.
ClientUser2
The first thing I did was re-enable the NIC on A-L14FE1 and let ClientUser1 sign back in, so that we are back in a normal operational state. Now let’s disable the NIC on A-L14FE2.shudlab.net and see what happens with ClientUser2. What we should see is that it fails over to A-L14FE1.shudlab.net. The reason is that it signed in using the SRV record, received the 301 redirect, and was informed of both its Primary Registrar and its Backup Registrar. While ClientUser2 should be able to fail over, don’t forget about the Endpointconfiguration.cache file. If this client were to sign out and sign back in, it would not use the SRV record and would connect directly to A-L14FE2.shudlab.net. Because of that, it would no longer know about its Backup Registrar and would have no idea where to reconnect.
But let’s take a look at both scenarios. Let’s first see whether it fails over properly, since it received a 301 redirect during its last sign-in.
We’ll go ahead and disable the NIC on A-L14FE2.shudlab.net.
After around 30 or so seconds, ClientUser2 signs out. What I would expect now is ClientUser2 connects to A-L14FE1.shudlab.net since again, when ClientUser2 initially signed in, it received a 301 redirect which informed ClientUser2 of both the Primary and Backup Registrar.
And just as I thought, ClientUser2 connects to A-L14FE1.shudlab.net
After re-enabling the NIC on A-L14FE2.shudlab.net, within 40 seconds (which is the failback detection time), ClientUser2 reconnects.
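The difference between the two clients in this test can be sketched in a few lines of Python. This is only a toy model of the behavior observed above, not how the Lync client is actually implemented: `RegistrarInfo` and `choose_registrar` are hypothetical names, and a backup of `None` stands in for a client that never received a 301 redirect.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class RegistrarInfo:
    """What a client knows about its registrars (hypothetical model)."""
    primary: str
    backup: Optional[str] = None  # None: the client never received a 301 redirect


def choose_registrar(info: RegistrarInfo, reachable: Set[str]) -> Optional[str]:
    """Prefer the primary; fall back to the backup only if the client knows one."""
    if info.primary in reachable:
        return info.primary
    if info.backup is not None and info.backup in reachable:
        return info.backup
    return None  # nowhere to go: the client stays signed out


FE1 = "A-L14FE1.shudlab.net"
FE2 = "A-L14FE2.shudlab.net"

# ClientUser2 was redirected at sign-in, so it knows both registrars.
client_user2 = RegistrarInfo(primary=FE2, backup=FE1)
# ClientUser1 connected straight to its home pool and never saw a 301.
client_user1 = RegistrarInfo(primary=FE1)
```

With FE2 down, `choose_registrar(client_user2, {FE1})` returns FE1 (failover), and once FE2 is reachable again it returns FE2 (failback). With FE1 down, `choose_registrar(client_user1, {FE2})` returns `None`, which matches ClientUser1 sitting disconnected in the first test.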
The Failover with a second SRV pointing to the secondary pool
So we’ve seen ClientUser1 fail to connect when A-L14FE1.shudlab.net goes down because ClientUser1 never received a 301 redirect message and because there is no 2nd SRV record in the environment. Let’s go ahead and add our second SRV record with a priority of 10.
And just to verify A-Client1 sees the change, let’s do a new nslookup.
OK, now let’s run the same test we did initially. I’m disabling the NIC on the A-L14FE1.shudlab.net server. What we saw earlier in our tests is that ClientUser1 would just sit signed out with nowhere to go. What should happen now is that the Lync client signs out, finds the second SRV record, and is then able to connect to the second pool, A-L14FE2.shudlab.net.
After around 30 or so seconds, ClientUser1 signs out. Let’s see if it picks up the 2nd SRV record and then signs into A-L14FE2.shudlab.net
After a little bit of waiting, sure enough, ClientUser1 can now successfully sign into A-L14FE2.shudlab.net
Now let’s take a look at a Netmon Trace and see what exactly ClientUser1 did for DNS lookups.
When the server is down, we see the client query for _sipinternaltls._tcp.shudlab.net. In the red highlights at the bottom, we can see that both a-l14fe1.shudlab.net and a-l14fe2.shudlab.net are returned, and part of the returned data is the priority information. What we see below is ClientUser1 trying to connect to a-l14fe2.shudlab.net because it knows it is having problems connecting to a-l14fe1.shudlab.net. Because that 2nd SRV record is in place, ClientUser1 finds it, does another query for a-l14fe2.shudlab.net to find its IP address, and then makes a connection to this server. Voila, we now have a failed over client.
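The selection the trace shows can be sketched in Python as follows. This is a simplified illustration, not the client's real code: `SrvRecord` and `pick_target` are hypothetical names, the priority of 0 for the first record and port 5061 are assumptions (only the second record's priority of 10 is stated above), and ties on priority are broken by simply preferring the higher weight rather than by the weighted random choice that RFC 2782 describes.

```python
from typing import Iterable, List, NamedTuple, Optional, Set


class SrvRecord(NamedTuple):
    priority: int  # lower value is tried first
    weight: int
    port: int
    target: str


def order_by_priority(records: Iterable[SrvRecord]) -> List[SrvRecord]:
    """Sort SRV records: lowest priority first, then higher weight first."""
    return sorted(records, key=lambda r: (r.priority, -r.weight))


def pick_target(records: Iterable[SrvRecord], reachable: Set[str]) -> Optional[str]:
    """Return the first target, in priority order, that the client can reach."""
    for record in order_by_priority(records):
        if record.target in reachable:
            return record.target
    return None


# Answer to the _sipinternaltls._tcp.shudlab.net query (values partly assumed).
records = [
    SrvRecord(priority=10, weight=0, port=5061, target="a-l14fe2.shudlab.net"),
    SrvRecord(priority=0, weight=0, port=5061, target="a-l14fe1.shudlab.net"),
]
```

With both servers up, `pick_target(records, {"a-l14fe1.shudlab.net", "a-l14fe2.shudlab.net"})` gives a-l14fe1.shudlab.net. With a-l14fe1 down it gives a-l14fe2.shudlab.net, and with only the first record published it gives `None`, which is the stuck ClientUser1 from the earlier test.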
Reviewing some key points
- If a client gets redirected to a server, it is a 301 redirect that informs the client of its Primary and Backup Registrar. If the Primary happens to be down (for example, if you connected to a Director), the client will automatically be able to connect to its Backup Registrar. If the Primary is operational, the user connects, and the Primary then goes down, that user will fail over to the Backup Registrar.
- If a client has signed in at least once, its Primary Server has been cached in a file called Endpointconfiguration.cache. That client will always connect directly to that server instead of potentially getting a 301 redirect. Because of this, it is very important to have multiple SRV records in the environment, so that regardless of whether a server is cached in the Endpointconfiguration.cache file, the client has another means of finding a registrar in the environment. If that registrar happens to be in another pool that is not the user’s primary, the user will get a 301 redirect with the Primary and Backup Registrar Pool.
- A Director does help, as it redirects clients to their correct pool with a 301 redirect, thus letting the client know what its Primary and Backup Registrar are. But as you have seen, do not rely on this completely, because the client caches server information in the Endpointconfiguration.cache file. You absolutely should have at least 2 SRV records with two different priorities to ensure a client will fail over to another registrar regardless of whether you have a Director in your environment.
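The key points above can be put together in a rough Python model of the sign-in path. Treat it as a sketch of the behavior seen in this lab rather than a description of the real client: `sign_in` and its parameters are hypothetical, the SRV targets are assumed to be already sorted by priority, and ClientUser1's backup being A-L14FE2 is an assumption based on the two pools backing each other up.

```python
from typing import List, Optional, Set, Tuple


def sign_in(
    cached_server: Optional[str],
    srv_targets: List[str],
    reachable: Set[str],
    primary: str,
    backup: str,
) -> Tuple[Optional[str], bool]:
    """Return (server the client ends up on, whether a 301 told it its backup)."""
    # 1. A server cached in Endpointconfiguration.cache is tried directly: no 301.
    if cached_server is not None and cached_server in reachable:
        return cached_server, False
    # 2. Otherwise walk the SRV targets in priority order.
    for target in srv_targets:
        if target not in reachable:
            continue
        if target == primary:
            return target, False  # landed on the home pool: no redirect needed
        # Any other registrar answers with a 301 naming primary and backup.
        for candidate in (primary, backup):
            if candidate in reachable:
                return candidate, True
    return None, False  # no cached server and no usable SRV target


FE1 = "A-L14FE1.shudlab.net"
FE2 = "A-L14FE2.shudlab.net"
```

For ClientUser1 with FE1 cached and FE1 down, `sign_in(FE1, [FE1], {FE2}, FE1, FE2)` returns `(None, False)` with a single SRV record, while `sign_in(FE1, [FE1, FE2], {FE2}, FE1, FE2)` returns `(FE2, True)` once the second record exists.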
Conclusion
Well folks, that is all, not just for Part 3 but for the entire article series. In this part, we performed a failover test and a failback test with only one SRV record in place. We took a look at what happens to ClientUser1 using A-L14FE1 Pool and what happens to ClientUser2 using A-L14FE2 Pool when we take down one of the Pools. We then took a look at what happens when the Pool that came down comes back online. And finally, we ended our tests by seeing what happens when a second SRV record is in place.
Hopefully these articles have helped you understand more on how the deployment of Lync 2010 Central Site Resilience works. Feel free to ask questions in the comments below and I will do my best to answer questions.
1. While the client is working under "limited functionality", what are the things that the user can't use?
2. If a Mediation server is in place, what should the client behaviour be when working at "limited functionality"?
Great post.
Thank you very much.
Hi Elan,
I have a question regarding the Edge failover. We have a topology very similar to your test environment, but at present we only have an Edge in the first site with our central store. So even if we have user failover (we do have 2 SRV records), users lose federation and response groups, which our main support line routes through, and that is far from ideal. Our second site has an association with the first site's Edge for media etc., but we have had issues with site-to-site resiliency. Can I retrospectively add a second Edge server to site 2? If so, what do I need to consider, i.e. certificates, DNS records etc.? Thanks for the post by the way, it's most informative.
Hi Elan,
Could you please create a step by step article about the installation of Lync SBS ? It would be very helpful as I have searched the internet for something like that but didn't find any !
thanks
Elan, looks like your screenshots of the second SRV pointing to the secondary pool have the weight/port transposed. not a big thing, and most people will figure it out but just a headsup.
Good catch. I just updated both the DNS screenshot and the nslookup screenshot. Thanks Matt!
Elan:
In the case of a hardware load balanced Lync Enterprise pool (for this example let's assume no failover, no Director, just a single HW load balanced EE pool with >1 FE servers), do users get any 301 redirect message when connecting to this pool, or isn't there really any preferred server to home the user on in HW load balanced pools?
Yes, in a HLB environment, users are still assigned a Primary and Secondary Server in the Pool. Just ran the following command a few minutes ago on a Hardware load Balanced Pool (everything Load Balanced) to verify which shows you what the Primary/Secondary Home Servers are for a user in a Pool:
Get-CsUserPoolInfo -Identity "user" | Select-Object -ExpandProperty PrimaryPoolMachinesInPreferredOrder
So if we take a look at the F5 Deployment Guide, I see that the Front End Pool has SNAT enabled for everything which means user always talks to HLB and never the Pool Servers directly. To me, it would seem there would still be a 301 redirect but since the HLB is using SNAT which means the Source IP is being re-written with the F5's IP, that the 301 redirect would go to the F5 and the F5 would start talking to the correct server and then start doing Source IP Affinity to that specific Front End that it was redirected to.
Hi, As primary and backup registrar pools have separate backend databases, please advise how you would make sure that contacts and scheduled conference details are restored during failover/fallback scenarios. Thanks.
Reddsan, you don't in the traditional sense though there are some ways to cheat. In the documented methods, you would have to deploy Metrosite which is one pool stretched across 2 sites. You must have no more than 20ms across the datacenters to ensure there's no RPC latency problems from Front End to Active SQL Cluster node.
You may want to check out Justin Morris' post here on achieving Site Resilience without breaking the bank as he shows you how to back up some stuff in the primary site and how to restore and forcefully move users to a second pool in the remaining site in case the primary site goes offline. http://www.justin-morris.net/achieving-lync-serve…
Kinda off topic but figured If I posted to an old blog entry I wouldn't get a reply
I've been wondering is it possible for a Lync client to talk to a OCS server?
Just wondering as if its possible then it would make transitioning at some unknown point in the future easier.
Also if so what is required to get it working?
Hi, great series!!
"You absolutely should have at least 2 SRV records with two different weights"
This should be two different priorities
Hi, Nice posts! Very interesting!
I don't understand your last bullet in the Reviewing section, particularly the last sentence: "regardless if you have a Director in your environment or not."
If you have a Director, the SRV record points to it. Even if the user has an Endpointconfiguration.cache file, if the primary registrar fails, the Lync client will perform a DNS request for the SRV record, then contact the Director server, then get the primary and backup registrars, and connect to the backup registrar. Did I miss something?
Thank you!
Nope. Your logic is correct. The point is that you don't want to rely entirely on a director and not have any secondary SRV records.