Introduction
What is disaster recovery?
- Disaster recovery is the process of preparing for and recovering from a disruptive event.
What is a disaster?
In the context of a company’s IT environment, a disaster is an event that partially or completely disrupts the operations of one or more applications. A disaster normally requires human intervention to fail over to secondary copies of applications in order to maintain their functionality.
The four main categories of a disaster:
Human errors – Unintentional actions leading to a security breach such as inadvertent misconfiguration of the software or a database
Malicious attacks – Unauthorized actions that affect a victim’s system such as a denial-of-service (DoS) or ransomware attack
Natural disasters – Environmental factors that cause a system failure such as earthquakes or floods
Technical failures – A malfunction of software, hardware, or a facility such as a power failure or a network connectivity failure
There are several factors that need to be considered when planning your response to a specific disaster:
Expected duration of the disaster – How soon will the application recover and how likely is the disaster to resolve on its own?
Size of impact (also known as blast radius) – Which applications are affected and to what extent is their functionality impaired?
Geographic impact – May be regional, national, continental, or global.
Tolerance of downtime – How significant is the impact of the application not functioning?
Why disaster recovery?
A properly planned and implemented disaster recovery solution helps mitigate the following issues that can be caused by a disaster:
Direct and indirect financial loss – Direct financial loss is mostly relevant for applications that are critical to revenue-generating processes, for example, external-facing IT systems that are provided to customers for a fee or internal IT systems that process data relevant for revenue generation. Indirect financial loss includes, for example, customers switching to a competing product and the cost of the work needed to resume normal operation after the disaster is over.
Reputational damage – In addition to financial loss as described previously, downtime caused by unexpected incidents can significantly harm a company’s reputation. A short recovery period aided by a disaster recovery solution can help avoid irreversible damage to the corporate image.
Failure to abide by compliance standards – Multiple compliance standards, including System and Organization Controls (SOC), the Payment Card Industry (PCI) Data Security Standard, and the Health Insurance Portability and Accountability Act (HIPAA), require a disaster recovery plan. Some standards even add very specific requirements, such as minimal physical distance between the source site and the disaster recovery site.
How does disaster recovery help business continuity?
Disaster recovery is a component of the overall business continuity strategy of an organization. Business continuity is the ability of the organization and all the supporting applications to run critical business functions at all times, including during emergency events.
To achieve business continuity, you must implement various types of resilience mechanisms. Resilience is the ability of an application to recover from an outage, either automatically or with human intervention.
Disaster recovery (sometimes called business continuity/disaster recovery, BC/DR, or DR) is an important part of your resilience strategy and determines how you respond when a disaster strikes. This response varies between applications and should be based on your organization's business objectives for each application. These objectives should specify (among other things) the strategy for minimizing loss of data and reducing downtime when your applications are not available for use. This approach helps your organization maintain operations as part of business continuity planning (BCP).
Recovery objectives
As part of disaster recovery planning, you need to define a recovery time objective (RTO) and recovery point objective (RPO) for each application based on impact analysis and risk assessment.
Recovery time objective (RTO) is the maximum acceptable delay between the interruption of an application and the restoration of its service. This objective determines what is considered an acceptable time window for an application to be unavailable.
Recovery point objective (RPO) is the maximum acceptable gap between the data in the disaster recovery site and the latest data stored in the application when the disaster strikes. This objective determines the maximum amount of data loss, measured in time, that is considered acceptable when a disaster occurs.
RTO and RPO for each application depend on many factors (such as service level agreements (SLAs) and external compliance requirements), but some figures are commonly used. For mission-critical applications (tier-1 applications), common figures include an RTO of 15 minutes and a near-zero RPO. For important applications that are not mission critical (tier-2 applications), the RTO is typically 4 hours and the RPO is 2 hours. For all other applications (tier-3 applications), a typical RTO is 8 to 24 hours and a typical RPO is 4 hours.
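To make the tiers concrete, here is a minimal Python sketch that stores the typical figures quoted above and checks a measured drill against them. The names `RecoveryObjectives`, `TIERS`, and `meets_objectives` are made up for illustration, the tier-3 RTO uses the upper bound of 24 hours, and "near-zero" RPO is simplified to zero.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, measured in time


# Typical figures from the text. Assumptions: tier-3 RTO uses the 24-hour
# upper bound, and "near-zero" RPO is modeled as exactly zero.
TIERS = {
    1: RecoveryObjectives(rto=timedelta(minutes=15), rpo=timedelta(0)),
    2: RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=2)),
    3: RecoveryObjectives(rto=timedelta(hours=24), rpo=timedelta(hours=4)),
}


def meets_objectives(tier, downtime, data_loss_window):
    """Return (rto_met, rpo_met) for a measured drill or incident."""
    target = TIERS[tier]
    return downtime <= target.rto, data_loss_window <= target.rpo


# A tier-2 drill: 3 hours of downtime, newest recoverable data is 2.5 hours old.
result = meets_objectives(2, timedelta(hours=3), timedelta(hours=2.5))
print(result)  # (True, False): RTO met, RPO missed
```

In practice, the objectives for each application would come from your own impact analysis rather than from a fixed table like this one.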
Solutions and methodologies
- The following section provides an overview of common resilience solutions and methodologies and explains the difference between these solutions and disaster recovery.
Various resilience solutions compared
Backup – Backup protects against data loss by storing historical data so that if any data is lost, it can be recovered from the backup. Backup solutions can store historical data locally, in a remote location, or in both. Local backups offer faster recovery, while remote backups offer greater resilience. Backup solutions often have a relatively low total cost of ownership (TCO) because the only infrastructure needed is storage, and the performance requirements for that storage are low (for example, some companies still use tape-based backup because of the low cost of tapes).
Archiving – A subcategory of backup solutions is archiving. Archives provide unchanged historical copies of data to meet legal and compliance requirements. Archives are normally kept for a longer term than standard backups. Unlike backup, which may provide quicker file restoration (normally measured in hours or days), archives are not utilized by routine business operations and can be stored in low-cost, off-site locations.
High availability (HA) – High availability enables an application to continue operating uninterrupted if a component of that application malfunctions. Detecting the malfunction and ensuring that the application continues to work as normal is almost always an automated process. Ideally, a user of the application would not experience anything unusual in case of such a failure. A typical example is a multi-node database. Most modern multi-node databases continue operation uninterrupted if a single component fails. High availability is normally introduced as part of the design and implementation of the system as it is much harder to add to an existing application that was not designed with high availability in mind. High availability solutions ensure minimal impact on users (ideally, no impact) in case of issues. However, they are only meant to deal with a small localized event (for example, failure of a single server or subnet). A high availability solution will not be able to handle a wider disaster, such as the failure of an entire data center or a corrupted software update.
Disaster recovery – Disaster recovery helps ensure business continuity for applications in case of an issue that prevents the application from recovering automatically or that requires a significant amount of time until recovery is achieved. Disaster recovery includes the ability to use a secondary application in a secondary location that will serve the application’s users until the original instance is fixed or recovered. Switching users to the secondary location is not an automatic process, but is instead performed on the basis of an explicit decision by an authorized person or group of people in the organization, because there are costs associated with it. For instance, there is some downtime while the failover is in progress, and there is the cost of labor for the people participating in the switch. These costs need to be weighed against the chances of the source site returning to normal operation in a timely manner. Secondary location solutions usually have a higher TCO than backup, because the secondary site needs to be maintained at all times (during normal operation as well) and needs to be capable enough to support the functionality of the application in case of a disaster.
These three resilience solutions (backup, high availability, and disaster recovery) complement one another. Business requirements may dictate that workloads use a combination of these solutions, depending on the resilience requirements of each application.
Disaster recovery compared to backup
There’s an important distinction between backup and disaster recovery. Backup is the process of making an extra copy (or multiple copies) of data. You back up data to be able to restore it in case it is lost or corrupted. You might need to restore backup data if you encounter an accidental deletion, database corruption, or problem with a software upgrade.
It is important to have a backup solution in place. Backup protects your data in case of theft of equipment storing data, employee accidents (deletion of an important file), a technical issue (crashed hard drive), or malicious tampering (ransomware). With this protection, you can access a copy of your data and restore it easily.
Disaster recovery, on the other hand, refers to the plan and processes for quickly reestablishing access to applications, data, and IT resources after an outage. This plan might involve switching over to a redundant set of servers and storage systems until your source data center is functional again.
For example, a disaster can lead to a disruption of your entire network, resulting in your employees not being able to work for the entire day (or even longer). However, a proper disaster recovery solution would allow your employees to continue to work using the mirrored system, while your IT team fixes the problem in the original network.
Some organizations mistake backup for disaster recovery. But as they may discover after a serious outage, simply having copies of data doesn’t mean you can keep your business running. To ensure business continuity, you need a robust, tested disaster recovery solution that enables maintaining normal operation until the disaster is resolved.
In terms of similarities, both backup and disaster recovery solutions maintain copies of historical data that may have changed in the source storage (often referred to as snapshots or point-in-time copies). In the case of backup solutions, this is a core part of the solution’s value: to be able to restore a previous version of data in case it was incorrectly modified or corrupted.
In the case of disaster recovery solutions, this is done to enable successful recovery if the latest state of the data prevents normal operation. Database corruptions, ransomware data encryption, and incorrect software configuration all fall under this category and would require the disaster recovery site to be based on a previous version of the data.
However, there are several distinct differences between backup and disaster recovery:
Purpose - Backups work best when you need to gain access to a lost or damaged file or object, such as an email, PowerPoint presentation, or database. Backups are also used for long-term data archival, or for purposes such as data retention. However, if you want your business to quickly restore its functions after an unforeseen event, you should opt for disaster recovery. With both the disaster recovery site and solution in place, you can perform a failover to transfer applications to the disaster recovery site, and your business can continue to function as normal even if the production site is unavailable. On the other hand, restoring a single piece of data (such as a file) is much easier to do using a backup of that data, rather than recovering an entire server where that data was stored.
RTO and RPO - Setting RTO and RPO is crucial for any business. Because restoring data from backups often does not help with business continuity, the concepts of RTO and RPO generally do not apply to backups. Disaster recovery, on the other hand, implies replicating your critical applications with the aim of quickly performing a failover if necessary to ensure the business continuity of the affected applications.
Resource allocation - Backups are usually stored in a compressed state and do not need to be restored quickly. Therefore, backups normally use low-cost and low-performance storage (frequently off site). Disaster recovery, on the other hand, requires a separate site with operational IT infrastructure that should always be ready for a possible failover at any time.
In recent years, the term disaster recovery solution has become very popular, with different meanings in different cases. Therefore, it’s important to analyze each product to make sure it fulfills the business continuity needs of the organization, including RPO, RTO, and the ability to quickly continue running the application from the disaster recovery site in case the source site loses functionality.
Disaster recovery compared to high availability
High availability (HA) and disaster recovery rely on some of the same best practices, such as monitoring for failures, deploying to multiple locations, and failing over. However, high availability focuses on a single component failure, whereas disaster recovery focuses on continuity in case of a wider failure of the entire application or significant parts of the application.
Disaster recovery has different objectives from high availability.
Your disaster recovery strategy requires different approaches than those for high availability. It focuses on deploying discrete systems (usually to multiple locations to minimize the impact of a local issue) so that you can fail over the entire application if necessary.
For example, an application that runs on a single virtual machine (VM) in a data center is not highly available. If a local flooding issue affects that data center, this scenario requires failover to another location to meet recovery objectives.
Compare this scenario to a highly available application that is deployed across multiple active Availability Zones in the same AWS Region, with all Availability Zones serving production traffic.
In this case, even in the localized event of one Availability Zone failing, the high availability strategy is accomplished by automatically routing all traffic to the remaining functional Availability Zones.
How you approach data resilience is also different between high availability and disaster recovery. Consider a storage solution that synchronously replicates to a nearby storage appliance to achieve the high availability of persistent data. If a file or files are mistakenly deleted or corrupted, those destructive changes will be replicated to the secondary storage device.
In this scenario, despite the high availability of the storage itself, there is no way to recover the data after a deletion or corruption. A disaster recovery solution, by contrast, normally includes a point-in-time recovery capability that can be used to restore the data to a state from before the deletion or corruption.
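The difference can be sketched with a toy model in Python. The `MirroredStore` class below is purely illustrative (it is not any vendor's API): every change, including a deletion, is copied to the replica synchronously, while snapshots keep point-in-time copies that can be restored later.

```python
import copy


class MirroredStore:
    """Toy storage with a synchronous replica and optional point-in-time snapshots."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.snapshots = []  # list of (label, copy of primary at that moment)

    def write(self, key, value):
        self.primary[key] = value
        self.replica[key] = value  # synchronous replication

    def delete(self, key):
        self.primary.pop(key, None)
        self.replica.pop(key, None)  # the destructive change is replicated too

    def snapshot(self, label):
        self.snapshots.append((label, copy.deepcopy(self.primary)))

    def restore(self, label):
        for name, data in self.snapshots:
            if name == label:
                self.primary = copy.deepcopy(data)
                self.replica = copy.deepcopy(data)
                return
        raise KeyError(label)


store = MirroredStore()
store.write("report.docx", "v1")
store.snapshot("before-incident")
store.delete("report.docx")             # accidental deletion
print("report.docx" in store.replica)   # False: the replica cannot help
store.restore("before-incident")
print(store.primary["report.docx"])     # v1: recovered from the point-in-time copy
```

The replica protects against the loss of the primary device, but only the snapshot protects against a bad change to the data itself.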
Another difference between high availability and disaster recovery is how a failover is initiated. In high availability solutions, a failover is initiated automatically when needed (normally within seconds), which results in little to no impact on the end user.
In disaster recovery solutions, failing over often incurs additional financial or non-financial impact (for example, the need to fail back all the new data after the disaster is over or the need to provision more resources in the disaster recovery site). Therefore, human intervention is required to initiate a failover event. Also, failing over is normally not instantaneous, and the application remains down until the failover is complete.
A well-designed disaster recovery plan should define who is authorized to initiate a failover, how to reach these people, and what they need to consider when making the decision to fail over.
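As a rough illustration of such a plan, the Python sketch below separates an advisory check from the explicit human approval. Everything in it is hypothetical: the approver addresses, the `Incident` fields, and the decision rule (fail over when waiting for the source site would miss the RTO and failing over is expected to be faster). A real plan would weigh more factors, such as failback cost.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical list of people the DR plan authorizes to approve a failover.
AUTHORIZED_APPROVERS = {"dr-lead@example.com", "it-director@example.com"}


@dataclass
class Incident:
    elapsed: timedelta                    # downtime so far
    estimated_source_recovery: timedelta  # estimate for the source site to recover
    failover_duration: timedelta          # time a failover is expected to take
    rto: timedelta                        # recovery time objective of the application


def failover_recommended(incident: Incident) -> bool:
    """Advisory only: True when waiting would miss the RTO and failing over is faster."""
    downtime_if_waiting = incident.elapsed + incident.estimated_source_recovery
    return (
        downtime_if_waiting > incident.rto
        and incident.failover_duration < incident.estimated_source_recovery
    )


def approve_failover(incident: Incident, approver: str) -> str:
    """Record an explicit human decision; unauthorized callers are rejected."""
    if approver not in AUTHORIZED_APPROVERS:
        raise PermissionError(f"{approver} is not authorized to initiate a failover")
    advice = "recommended" if failover_recommended(incident) else "not recommended"
    return f"failover approved by {approver} (analysis: {advice})"


incident = Incident(
    elapsed=timedelta(minutes=30),
    estimated_source_recovery=timedelta(hours=6),
    failover_duration=timedelta(minutes=45),
    rto=timedelta(hours=4),
)
print(failover_recommended(incident))
print(approve_failover(incident, "dr-lead@example.com"))
```

The recommendation never triggers a failover by itself; the decision stays with an authorized person, as described above.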
Lastly, in most cases, high availability solutions need to be selected at the time an application is designed (or refactored), as they are an integral part of the application. Disaster recovery solutions may be added to an existing application without significant re-architecture or modification work in the application itself.
Which applications require disaster recovery?
A malfunction of almost any application has a negative impact on the organization. This holds regardless of the size or role of the application, and even for non-production applications; the more critical the application, the greater the impact. Therefore, all applications can benefit from a disaster recovery solution that helps mitigate a malfunction quickly and easily.
To determine whether to implement a disaster recovery solution, you need to consider the return on investment (ROI). On the one hand, each disaster recovery solution has direct and indirect costs such as software licenses and hardware, infrastructure, maintenance, and drills. On the other hand, every time a disaster strikes, it incurs costs of its own.
To determine your maximum TCO for a disaster recovery solution for each of your applications, you’ll need to perform a disaster risk analysis: what is the probability of a disaster happening, and what are the direct and indirect financial consequences of the disaster?
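One simple way to frame this analysis is as an expected-loss calculation: for each disaster scenario, multiply its yearly probability by the loss that a disaster recovery solution would avoid, then sum the results. The Python sketch below does this with entirely made-up numbers; the `Scenario` fields and the formula are an illustrative assumption, not a standard method.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    annual_probability: float  # chance of the disaster occurring in a given year
    loss_without_dr: float     # direct + indirect loss with no DR solution
    loss_with_dr: float        # remaining loss when a DR solution is in place


def max_annual_dr_spend(scenarios):
    """Expected yearly loss avoided by DR, an upper bound on a worthwhile yearly TCO."""
    return sum(
        s.annual_probability * (s.loss_without_dr - s.loss_with_dr)
        for s in scenarios
    )


# Hypothetical figures for a single application.
scenarios = [
    Scenario("data center outage", 0.05, 400_000, 50_000),
    Scenario("ransomware attack", 0.10, 250_000, 50_000),
]
budget = max_annual_dr_spend(scenarios)
print(f"Maximum worthwhile DR spend per year: {budget:,.0f}")
```

With these example inputs, the avoided loss comes to roughly 37,500 per year, so a disaster recovery solution costing more than that per year would be hard to justify for this application on financial grounds alone.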
I hope this guide gives you a useful introduction to disaster recovery.
{% user aditmodi %}