Business Continuity and Disaster Recovery (BCDR) is an important topic to understand if your organization is to design workloads that provide reliable services. The purpose of this article isn’t to discuss BCDR itself, but rather to walk through different Azure Geo-Redundant Storage (GRS) patterns and considerations when designing workloads with BCDR in mind.
For more information on BCDR, please see this article: Business continuity and disaster recovery – Cloud Adoption Framework | Microsoft Learn
Prerequisite Knowledge for the remainder of this article:
- What is an Azure Region
- What are Availability Zones within an Azure Region
- What are Paired Regions
Storage Replication Options
Geo-Redundant Storage is one of several platform-provided replication types for Azure Storage. Let’s outline the different replication types:
| Replication type | Description |
| --- | --- |
| Locally redundant storage (LRS) | LRS copies your data synchronously three times within a single physical location in the primary region. |
| Zone-redundant storage (ZRS) | ZRS copies your data synchronously across three Azure availability zones in the primary region. |
| Geo-redundant storage (GRS) | GRS copies your data synchronously three times within a single physical location in the primary region using LRS. It then copies your data asynchronously to a single physical location in the secondary region. Within the secondary region, your data is copied synchronously three times using LRS. |
| Geo-zone-redundant storage (GZRS) | GZRS copies your data synchronously across three Azure availability zones in the primary region using ZRS. It then copies your data asynchronously to a single physical location in the secondary region. Within the secondary region, your data is copied synchronously three times using LRS. |
This article focuses on GRS, which is a common consideration when customers are architecting geo-redundant solutions in Azure.
Storage Account Endpoints using GRS
A storage account provides a unique namespace in Azure for your data. Every object that you store in Azure Storage has a URL address that includes your unique storage account name. The combination of the storage account name and the service endpoint forms the endpoint URLs for your storage account.
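To make the naming rule concrete, here is a minimal Python sketch of how an endpoint URL is assembled from the account name and the service. The helper name `service_endpoint` and the `SERVICE_SUFFIXES` table are illustrative, not part of any Azure SDK, and the suffixes shown assume the Azure public cloud.

```python
# Illustrative helper, not an Azure SDK function.
# Suffixes assume the Azure public cloud (core.windows.net).
SERVICE_SUFFIXES = {
    "blob": "blob.core.windows.net",
    "file": "file.core.windows.net",
    "queue": "queue.core.windows.net",
    "table": "table.core.windows.net",
    "dfs": "dfs.core.windows.net",  # Azure Data Lake Storage Gen2
}


def service_endpoint(account_name: str, service: str = "blob") -> str:
    """Combine the storage account name with a service suffix."""
    if service not in SERVICE_SUFFIXES:
        raise ValueError(f"unknown service: {service}")
    return f"https://{account_name}.{SERVICE_SUFFIXES[service]}/"


print(service_endpoint("elanbloggrs"))
```

Because the account name is the first label of the hostname, two accounts with the same name would collide in DNS.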
Let’s take a deeper look at Endpoints. I created a GRS Storage Account in Central US (paired with East US 2) and opened the Endpoints blade for this Storage Account. We will focus on Blob Endpoints.
In the following image, we can see our unique namespace/endpoint, https://elanbloggrs.blob.core.windows.net/, with elanbloggrs being our Storage Account name. This is the reason Storage Account names must be globally unique. We can also see that for the blob service, we only have a single endpoint.
Because we have enabled this Storage Account to leverage GRS for data redundancy, if we click on the Redundancy blade, we can see that the status of our data shows as Available in both Central US and East US 2.
I then changed the configuration of the Storage Account to leverage RA-GRS (read-access geo-redundant storage). With RA-GRS, a secondary read-only endpoint is created in the paired region, in our case East US 2, allowing our applications to read the data from that region.
Taking another look at our Storage Account Endpoints, we now see we have a Primary Endpoint and Secondary Endpoint, rather than just a single endpoint as we had with GRS.
GRS & RA-GRS Failover
As we previously saw, the difference between GRS and RA-GRS is that with GRS, you have a single endpoint for blob. With RA-GRS, you have a primary read/write endpoint for blob and a secondary read-only endpoint for blob.
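As a rough illustration of how an application might use the two RA-GRS endpoints, the sketch below derives both URLs and tries the secondary when a read from the primary fails. Azure documents the secondary endpoint as the account name with a `-secondary` suffix; everything else here (the function names and the injected `fetch` callable) is hypothetical. Because geo-replication is asynchronous, a read from the secondary may return slightly older data.

```python
from typing import Callable, Sequence


def ra_grs_blob_endpoints(account_name: str) -> tuple[str, str]:
    """Return (primary, secondary) blob endpoints for an RA-GRS account."""
    primary = f"https://{account_name}.blob.core.windows.net/"
    secondary = f"https://{account_name}-secondary.blob.core.windows.net/"
    return primary, secondary


def read_with_fallback(
    blob_path: str,
    endpoints: Sequence[str],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Try each endpoint in order and return the first successful read.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that downloads a URL and raises OSError on connection problems.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    last_error = None
    for base_url in endpoints:
        try:
            return fetch(base_url + blob_path)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise last_error


primary, secondary = ra_grs_blob_endpoints("elanbloggrs")
print(primary)
print(secondary)
```

Only reads can fall back this way; the secondary endpoint is read-only, so writes still depend on the primary endpoint.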
Let’s take a look at what the failover process entails using our Storage Account, which is configured to leverage RA-GRS. First, let’s see how DNS resolves for both the Primary Endpoint and Secondary Endpoint:
Primary Endpoint (Read/Write in Primary Region, Central US)
Secondary Endpoint (Read-Only in Secondary/Paired Region, East US 2)
Our Primary Endpoint resolves to an IP Address belonging to Central US. Our Secondary Endpoint resolves to an IP Address belonging to East US 2.
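The same lookup can be scripted. This is a minimal sketch using Python's standard `socket.gethostbyname`; the function name `resolve_endpoints` and the injectable `resolver` parameter are illustrative choices so the logic can be exercised without network access.

```python
import socket
from urllib.parse import urlparse


def resolve_endpoints(urls, resolver=socket.gethostbyname):
    """Map each endpoint's hostname to the IPv4 address it resolves to."""
    results = {}
    for url in urls:
        host = urlparse(url).hostname
        results[host] = resolver(host)
    return results


# Example call (performs real DNS lookups, so it is left commented out):
# resolve_endpoints([
#     "https://elanbloggrs.blob.core.windows.net/",
#     "https://elanbloggrs-secondary.blob.core.windows.net/",
# ])
```

Saving the result before and after a failover gives a simple record of where each endpoint pointed at each stage.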
Let’s go ahead and initiate a Storage Account Failover. In the Redundancy Blade of our Storage Account, we will click the button, “Prepare for Failover.”
A popup will appear asking us to choose between Unplanned Failover and Planned Failover (preview). Currently, we’re limited to Unplanned Failover. This means that our Storage Account Endpoints will fail over to be hosted in East US 2, and the Storage Account will be converted to LRS. We are failing over this Storage Account because some issue has occurred in our Primary Region, so we need a read/write copy in our secondary/paired region, East US 2.
Our Failover is now in progress.
After failover has completed, we can see the Azure Storage Account is now configured to be LRS and the Endpoint only exists within East US 2.
Next, let’s take a look at what our endpoints look like. Prior to failover, we saw our two endpoints, primary (read/write) and secondary (read-only).
After failover, our LRS Storage Account now only has a single Blob Endpoint. As the Endpoint is made up using the Storage Account Name, our endpoint continues to be https://elanbloggrs.blob.core.windows.net
Let’s do another DNS lookup:
Previously, our secondary endpoint was pointing to 52.239.174.19. Now, our single endpoint is pointing to this IP. This means that during the failover process, a DNS update was made to repoint our Primary Endpoint to the secondary copy, and the Storage Account was then converted to LRS. It also means your failover documentation should state that applications need to point to the Primary Storage Account Endpoint URL, which is now our single Endpoint URL, to write into the Storage Account that now exists in the Secondary/Failover Region, in our case East US 2.
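The observation above can be expressed as a small check over two DNS snapshots. This is only a sketch: the function name is hypothetical, and the pre-failover primary address `203.0.113.10` is a placeholder from the documentation IP range, since only the secondary address is quoted in this article.

```python
def failover_repointed(before, after, primary_host, secondary_host):
    """True when the primary host now resolves to the old secondary address."""
    old_secondary_ip = before.get(secondary_host)
    return old_secondary_ip is not None and after.get(primary_host) == old_secondary_ip


PRIMARY = "elanbloggrs.blob.core.windows.net"
SECONDARY = "elanbloggrs-secondary.blob.core.windows.net"

# Snapshot before failover; the primary IP is a placeholder, not a real value.
before = {PRIMARY: "203.0.113.10", SECONDARY: "52.239.174.19"}
# Snapshot after failover; only the single endpoint remains.
after = {PRIMARY: "52.239.174.19"}

print(failover_repointed(before, after, PRIMARY, SECONDARY))  # True
```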
GRS Storage Account Fail Back to Primary Region
As can be seen above, the Azure Storage Account has now been converted to an LRS Storage Account. The fail-back process requires three steps: re-configuring the Storage Account to be GRS again, initiating another failover, and then re-configuring the Storage Account to be GRS once more. This puts the Storage Account back into a state where the Primary Endpoint exists in the Primary Region (Central US in our example) and the Secondary Endpoint exists in the Secondary/Paired Region (East US 2 in our example).
This process will be improved in the future once Customer-Managed Planned Failover becomes available. Customer-Managed Planned Failover requires the Primary Endpoint to be available during the failover process; when it is, the Endpoints are transitioned from the Primary Region to the Secondary Region while maintaining GRS.
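The fail-back sequence is easier to follow as a series of state transitions. The sketch below is a toy model only: `AccountState`, `enable_grs`, and `fail_over` are hypothetical names, no Azure API is called, and the model ignores the time needed for geo-replication to catch up after GRS is re-enabled.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Region pairing used in this article's example.
PAIRED = {"Central US": "East US 2", "East US 2": "Central US"}


@dataclass(frozen=True)
class AccountState:
    primary: str
    secondary: Optional[str]
    sku: str


def enable_grs(state: AccountState) -> AccountState:
    """Re-configure the account as GRS; the secondary is the paired region."""
    return replace(state, secondary=PAIRED[state.primary], sku="GRS")


def fail_over(state: AccountState) -> AccountState:
    """Unplanned failover: the secondary becomes primary and the account becomes LRS."""
    if state.secondary is None:
        raise ValueError("no secondary region to fail over to")
    return AccountState(primary=state.secondary, secondary=None, sku="LRS")


# Initial failover described earlier in the article.
after_failover = fail_over(AccountState("Central US", "East US 2", "RA-GRS"))

# Fail back: enable GRS, fail over again, enable GRS once more.
step1 = enable_grs(after_failover)  # primary East US 2, secondary Central US
step2 = fail_over(step1)            # primary Central US, LRS
restored = enable_grs(step2)        # primary Central US, secondary East US 2

print(restored)
```

In a real fail back, the second failover should wait until the data has finished replicating to the original region; otherwise recent writes could be lost.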
Region-Wide Failures
It is also important to understand the documented approach for failing over your Storage Account Endpoints during a region-wide or scale-unit event. A process known as Microsoft-Managed Failover is documented here.
It is very important to understand what is documented in this section linked above. Specifically, I want to call out two points:
- In extreme circumstances where the original primary region is deemed unrecoverable within a reasonable amount of time due to a major disaster, Microsoft may initiate a regional failover
- Your disaster recovery plan should be based on customer-managed failover. Do not rely on Microsoft-managed failover, which might only be used in extreme circumstances.
You should design your BCDR strategies only around Customer-Managed Failovers. With that said, let’s talk about the GRS limitations that can currently inhibit you from leveraging the Customer-Managed Failover options, both Unplanned and Planned Failover.
Limitations
There are several limitations to consider when leveraging GRS in your BCDR strategy:
- As can be seen above, Planned Failover is in Preview. Therefore, ensure your failover documentation includes the process for failing back to the Primary Region: re-enabling GRS, failing over again, and then enabling GRS once more to ensure there’s a copy of the data in your Secondary/Failover Region.
- Storage Accounts that have a hierarchical namespace (Azure Data Lake Storage Gen2) do not yet have generally available support for either Unplanned Failover or Planned Failover; both features are in preview for these accounts. See Azure Data Lake Storage Gen 2.
- Other unsupported features and services, as outlined here.
BCDR and Hierarchical Namespaces
As seen above, there are some challenges with region failover when a Storage Account has a hierarchical namespace enabled (Azure Data Lake Storage Gen2). At the time of this writing (6/18/2024), we are unable to leverage the Customer-Managed Failover options when our Storage Accounts have a hierarchical namespace enabled, because these features are in Preview.
As stated earlier, your disaster recovery plan should not rely on Microsoft-managed failover. If we cannot leverage Customer-managed failover, and we shouldn’t rely on Microsoft-managed failover options, what options are there?
There are two primary strategies:
- Strategy #1 – Storage Account Replication – When data comes into the Storage Account in the Primary Region, leverage a tool to replicate the data between the Storage Accounts in both Regions. It is recommended to leverage ZRS for these Storage Accounts for additional redundancy within the Azure Region. Unfortunately, as of today, you cannot leverage the Azure Blob Object Replication feature for this, as it is unsupported for Storage Accounts with a hierarchical namespace enabled (Azure Data Lake Storage Gen2).
- Strategy #2 – Dual-Writing – When data comes in, the process that pulls the data into the Storage Accounts should write it into both Storage Accounts in both Regions. Just as with Strategy #1, it is recommended to leverage ZRS for these Storage Accounts for additional redundancy within the Azure Region.
Strategy #1 – Storage Account Replication
In Strategy #1, a tool such as AzCopy (which supports Azure Data Lake Storage Gen2 as a source and destination) or a third-party tool that supports Azure Storage Account hierarchical namespaces would need to be deployed to ensure the data replicates between the Storage Accounts.
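To illustrate the idea behind such a replication job, here is a minimal in-memory sketch. It is not how AzCopy is implemented; the two Storage Accounts are stood in for by plain dictionaries mapping paths to bytes, and the function names are hypothetical. Comparing content hashes makes the job safe to re-run, so repeated runs do not copy the same data twice.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content hash used to decide whether an object already matches."""
    return hashlib.sha256(data).hexdigest()


def replicate(source: dict[str, bytes], target: dict[str, bytes]) -> list[str]:
    """Copy new or changed objects from source to target.

    Returns the paths that were copied. Objects whose content already
    matches are skipped, which makes repeated runs idempotent.
    """
    copied = []
    for path, data in source.items():
        if path in target and digest(target[path]) == digest(data):
            continue  # already replicated with identical content
        target[path] = data
        copied.append(path)
    return copied


primary_account = {"raw/2024/06/a.csv": b"1,2,3", "raw/2024/06/b.csv": b"4,5,6"}
secondary_account = {}

print(replicate(primary_account, secondary_account))  # both paths copied
print(replicate(primary_account, secondary_account))  # nothing left to copy
```

This sketch is one-directional and never deletes anything from the target; a bi-directional design would also need a rule for resolving conflicting writes.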
During Failover of this solution, there would need to be some decisions made:
- Should we have Data Factory deployed in the Secondary Region to continue pulling data into the Failover Storage Account and continue operations? If yes, how should we handle data consistency when the Primary Region comes back online to make sure our data does not diverge? For example, do we enable bi-directional data synchronization so the Failover Storage Account will replicate new data into the Primary Region Storage Account when it comes back online? These considerations would need to be addressed.
- How do we prevent duplicate data entries, so that after a failover/failback, backend systems that may be geo-replicating do not take in the same data twice?
- How do we handle data loss that may occur during a failure event?
Let’s take a look at a visual of both of these strategies:
Strategy #2 – Dual-Writing
In Strategy #2, a tool such as Azure Data Factory would write incoming data into the Storage Accounts in both Regions.
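A dual-write step can be sketched as follows. This is an illustration under simple assumptions, not an Azure Data Factory pipeline: each regional Storage Account is represented by a dictionary-like store, and `dual_write` is a hypothetical name. The function reports which regions failed so the caller can retry or queue those writes.

```python
def dual_write(path, data, stores):
    """Write the same object to every regional store.

    `stores` maps a region name to a dictionary-like store. Returns the
    list of regions where the write failed, so the caller can retry them.
    """
    failed_regions = []
    for region, store in stores.items():
        try:
            store[path] = data
        except OSError:
            failed_regions.append(region)
    return failed_regions


central_us = {}
east_us_2 = {}
stores = {"Central US": central_us, "East US 2": east_us_2}

failed = dual_write("raw/2024/06/a.csv", b"1,2,3", stores)
print(failed)  # [] when both writes succeed
```

Tracking failed regions matters because a write that lands in only one region leaves the two accounts out of sync until it is retried.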
During Failover of this solution, there would need to be some decisions made:
- Should we have Data Factory deployed in the Secondary Region to continue pulling data into the Failover Storage Account and continue operations? If yes, how should we handle data consistency when the Primary Region comes back online to make sure our data does not diverge? These considerations would need to be addressed.
- How do we prevent duplicate data entries, so that after a failover/failback, backend systems that may be geo-replicating do not take in the same data twice?
- How do we handle data loss that may occur during a failure event?
Handy Links
- Azure storage disaster recovery planning and failover – Azure Storage | Microsoft Learn
- How Azure Storage account customer-managed failover works – Azure Storage | Microsoft Learn
- azcopy copy | Microsoft Learn
- Public Preview: Customer Managed Failover for ADLS Gen2 | Azure updates | Microsoft Azure
- Use geo-redundancy to design highly available applications – Azure Storage | Microsoft Learn
PS: Thanks for some of the links and the idea for the article, Trey!