Welcome to Part 3 of this article series. In Part 1, we started off by discussing the goal of this lab. That goal is to pull together all the information out there on how to utilize Central Site Resilience with regard to failovers, failbacks, how redirects function, how SRV records fit in, and so on. We first discussed the lab setup using Hyper-V, and then took a look at the base topology and configuration. In Part 2 of this article series, I went through the sign-in process for each user. Because the SRV record for sign-in was pointed to A-L14FE1.shudlab.net, the sign-in process was different for each user logging in.
In this Part, we’ll do a failover test and a failback test with only one SRV record in place. We’ll take a look at what happens to ClientUser1 using A-L14FE1 Pool and what happens to ClientUser2 using A-L14FE2 Pool when we take down one of the Pools. We will then take a look at what happens when the Pool that came down comes back online. And finally, we will end our tests by seeing what happens when a second SRV record is in place.
The Failover with only one SRV
As shown in Part 1, our SRV record is pointing to A-L14FE1.shudlab.net. If you recall from Part 2, when ClientUser1 signed in and connected to A-L14FE1.shudlab.net, it received no 301 redirect and therefore was not informed of its Primary and Backup Registrar. We also saw ClientUser2 connect to A-L14FE1.shudlab.net and receive a 301 redirect with its Primary and Backup Registrar. ClientUser2 then connected and registered to A-L14FE2.shudlab.net.
Let’s start with disabling the NIC on A-L14FE1. I wanted to see the behavior of ClientUser1. Don’t forget, ClientUser1 initially connected to A-L14FE1.shudlab.net with no 301 redirection. Because of this, it has no idea what its Backup Registrar is, and there are no additional SRV records other than the one that points to A-L14FE1.
After approximately 30 seconds, ClientUser1 gets disconnected.
Remember in the Topology, we had a failover detection time of 30 seconds. I let this sit here for about 5-8 minutes and it stayed disconnected.
The first thing I did was re-enable the NIC on A-L14FE1 and let ClientUser1 sign back in, because I want to be in a normal operational state. Now let’s disable the NIC on A-L14FE2.shudlab.net and see what happens with ClientUser2. What we should see is that it fails over to A-L14FE1.shudlab.net. The reason is that it signed in using the SRV record, received the 301 redirect, and was informed of both its Primary Registrar and its Backup Registrar. While ClientUser2 should be able to fail over, don’t forget about the Endpointconfiguration.cache file. If this client were to sign out and sign back in, it would not use the SRV record and would instead connect directly to A-L14FE2.shudlab.net. Because of that, it would no longer know about its Backup Registrar and would have no idea where to reconnect.
But let’s take a look at both scenarios. Let’s first see whether it fails over properly, given that it received a 301 redirect during its last sign-in.
We’ll go ahead and disable the NIC on A-L14FE2.shudlab.net.
After around 30 or so seconds, ClientUser2 signs out. What I would expect now is ClientUser2 connects to A-L14FE1.shudlab.net since again, when ClientUser2 initially signed in, it received a 301 redirect which informed ClientUser2 of both the Primary and Backup Registrar.
And just as I thought, ClientUser2 connects to A-L14FE1.shudlab.net
After re-enabling the NIC on A-L14FE2.shudlab.net, within 40 seconds (which is the failback detection time), ClientUser2 reconnects.
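The difference between the two clients in this test can be captured in a small model. This is only a sketch of the behavior observed in the lab, not the Lync client’s actual logic. The `USERS` table, the `Session` class, and the `sign_in` and `fail_over` functions are hypothetical names, and the assumption that A-L14FE2 is the backup for ClientUser1 is mine (the tests above only confirm that A-L14FE1 is the backup for ClientUser2).

```python
from dataclasses import dataclass
from typing import Optional

# Primary and backup pool per user. The ClientUser1 backup is an assumption
# of this sketch; the ClientUser2 pairing matches what the tests showed.
USERS = {
    "ClientUser1": {"primary": "A-L14FE1.shudlab.net", "backup": "A-L14FE2.shudlab.net"},
    "ClientUser2": {"primary": "A-L14FE2.shudlab.net", "backup": "A-L14FE1.shudlab.net"},
}


@dataclass
class Session:
    registrar: str                # server the client is registered to
    known_backup: Optional[str]   # only learned from a 301 redirect


def sign_in(user: str, entry_server: str) -> Session:
    """Sign in through the first server contacted (SRV target or cached server)."""
    home = USERS[user]
    if entry_server == home["primary"]:
        # Landed on the home pool directly: no 301, so no backup is learned.
        return Session(registrar=entry_server, known_backup=None)
    # Landed elsewhere: the 301 redirect names both Primary and Backup Registrar.
    return Session(registrar=home["primary"], known_backup=home["backup"])


def fail_over(session: Session) -> Optional[str]:
    """Server the client moves to when its registrar goes down.

    None means the client has nowhere to go and stays signed out.
    """
    return session.known_backup


if __name__ == "__main__":
    srv_target = "A-L14FE1.shudlab.net"  # the single SRV record in this test
    for user in USERS:
        print(user, "->", fail_over(sign_in(user, srv_target)))
```

Signing ClientUser2 in directly against A-L14FE2.shudlab.net (the cached-server case described earlier) returns a session with no known backup, which is the reason a single SRV record is not enough.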
The Failover with a second SRV pointing to the secondary pool
So we’ve seen ClientUser1 fail to connect when A-L14FE1.shudlab.net goes down because ClientUser1 never received a 301 redirect message and because there is no 2nd SRV record in the environment. Let’s go ahead and add our second SRV record with a priority of 10.
And just to verify A-Client1 sees the change, let’s do a new nslookup.
Ok, now let’s run the same test we initially did. I’m shutting down A-L14FE1.shudlab.net server’s NIC. What we saw earlier on in our tests is that ClientUser1 would just sit signed out with nowhere to go. What should happen now is the Lync client signs out, ends up finding the second SRV record, and now is able to connect to the second pool, A-L14FE2.shudlab.net.
After around 30 or so seconds, ClientUser1 signs out. Let’s see if it picks up the 2nd SRV record and then signs into A-L14FE2.shudlab.net
After a little bit of waiting, sure enough, ClientUser1 can now successfully sign into A-L14FE2.shudlab.net
Now let’s take a look at a Netmon Trace and see what exactly ClientUser1 did for DNS lookups.
When the server is down, we see the client query for _sipinternaltls._tcp.shudlab.net. In the red highlights at the bottom, we can see a-l14fe1.shudlab.net and a-l14fe2.shudlab.net returned. Part of the data returned is, of course, the priority information. What we see below is that ClientUser1 tries to connect to a-l14fe2.shudlab.net because it knows it is having problems connecting to a-l14fe1.shudlab.net. Because that 2nd SRV record is in place, ClientUser1 finds it, does another query for a-l14fe2.shudlab.net to find its IP address, and then makes a connection to this server. Voila, we now have a failed-over client.
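The selection seen in the trace follows the usual SRV rule that targets with a lower priority value are tried first. Below is a minimal sketch of that ordering, not a reproduction of the Lync client’s code. `SrvRecord` and `pick_registrar` are hypothetical names, reachability is simulated with a plain set instead of real connections, and the priority of 0 on the original record is an assumption (only the priority of 10 on the second record comes from this lab).

```python
from typing import Iterable, NamedTuple, Optional, Set


class SrvRecord(NamedTuple):
    priority: int
    target: str


def pick_registrar(records: Iterable[SrvRecord], reachable: Set[str]) -> Optional[str]:
    """Return the first reachable target, trying lower priority values first."""
    for record in sorted(records, key=lambda r: r.priority):
        if record.target in reachable:
            return record.target
    return None  # nothing answered: the client stays signed out


# Answers for _sipinternaltls._tcp.shudlab.net after adding the second record.
SRV = [
    SrvRecord(priority=0, target="a-l14fe1.shudlab.net"),   # assumed priority
    SrvRecord(priority=10, target="a-l14fe2.shudlab.net"),  # priority from this lab
]

if __name__ == "__main__":
    # A-L14FE1 has its NIC disabled, so only the second pool is reachable.
    print(pick_registrar(SRV, reachable={"a-l14fe2.shudlab.net"}))
```

Passing only the first record with the same reachable set returns `None`, which matches the earlier test where ClientUser1 sat signed out.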
Reviewing some key points
- If a client gets redirected to a server, it is a 301 redirect that informs the client of its Primary and Backup Registrar. If the Primary happens to be down (for example, if you connected to a Director), the client will automatically be able to connect to its Backup Registrar. If the Primary happens to be operational, the user connects, and the Primary then goes down, that user will fail over to the Backup Registrar.
- If a client has signed in at least once, its Primary Server has been cached in a file called Endpointconfiguration.cache. That client will always connect directly to that server instead of potentially getting a 301 redirect. Because of this, it is very important to have multiple SRV records in the environment, so that regardless of whether a server is cached in the Endpointconfiguration.cache file, the client has another means of finding a registrar in the environment. If that registrar happens to be in another pool that is not the user’s primary, the user will get a 301 redirect to their Primary and Backup Registrar Pool.
- A Director does help, as it redirects clients to their correct pool with a 301 redirect, thus letting the client know what its Primary and Backup Registrar are. But as you have seen, do not rely on this completely, because the client caches server information in the Endpointconfiguration.cache file. You absolutely should have at least 2 SRV records with two different priorities to ensure a client will fail over to another registrar, regardless of whether you have a Director in your environment.
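The last recommendation is easy to check mechanically. Here is a small sketch, assuming you have already collected the SRV answers as (priority, target) pairs by whatever means you prefer. The function name `srv_set_is_resilient` is hypothetical, and the priority of 0 in the example is an assumed value for the original record.

```python
from typing import List, Tuple


def srv_set_is_resilient(records: List[Tuple[int, str]]) -> bool:
    """True if there are at least two distinct targets at two different priorities."""
    priorities = {priority for priority, _ in records}
    targets = {target.lower() for _, target in records}
    return len(priorities) >= 2 and len(targets) >= 2


if __name__ == "__main__":
    one_record = [(0, "a-l14fe1.shudlab.net")]
    two_records = [(0, "a-l14fe1.shudlab.net"), (10, "a-l14fe2.shudlab.net")]
    print(srv_set_is_resilient(one_record))
    print(srv_set_is_resilient(two_records))
```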
Well folks, that is all for not just Part 3, but the entire article series. In this part, we performed a failover test and a failback test with only one SRV record in place. We took a look at what happens to ClientUser1 using A-L14FE1 Pool and what happens to ClientUser2 using A-L14FE2 Pool when we take down one of the Pools. We then took a look at what happens when the Pool that came down comes back online. And we finally ended our tests by seeing what happens when a second SRV record is in place.
Hopefully these articles have helped you understand more about how the deployment of Lync 2010 Central Site Resilience works. Feel free to ask questions in the comments below and I will do my best to answer them.