- Karan Pratap Singh
Disaster recovery (DR) is the process of regaining access to, and restoring the functionality of, infrastructure after events such as a natural disaster, a cyber attack, or other business disruptions.
Disaster recovery relies upon the replication of data and computer processing in an off-premises location not affected by the disaster. When servers go down because of a disaster, a business needs to recover lost data from a second location where the data is backed up. Ideally, an organization can transfer its computer processing to that remote location as well in order to continue operations.
Disaster recovery is often not actively discussed during system design interviews, but it's important to have a basic understanding of the topic. You can learn more about disaster recovery from the AWS Well-Architected Framework.
Why is disaster recovery important?
Disaster recovery can have the following benefits:
- Minimize interruption and downtime
- Limit damages
- Fast restoration
- Better customer retention
Let's discuss some important terms relevant to disaster recovery:
Recovery Time Objective (RTO) is the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable.
Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
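To make the two objectives concrete, here is a minimal sketch that checks a single incident against an RTO and an RPO. The function name `meets_objectives`, the timestamps, and the one-hour targets are all hypothetical values chosen for illustration; real targets come from business requirements.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recovery_point, outage_start, service_restored, rto, rpo):
    """Return (rto_met, rpo_met) for a single incident."""
    # RTO compares against how long the service was unavailable.
    downtime = service_restored - outage_start
    # RPO compares against how much recent data was lost: everything
    # written after the last recovery point and before the outage.
    data_loss_window = outage_start - last_recovery_point
    return downtime <= rto, data_loss_window <= rpo


# Hypothetical incident: last recovery point at 02:00, outage at 05:30,
# service restored at 06:15.
last_recovery_point = datetime(2024, 1, 1, 2, 0)
outage_start = datetime(2024, 1, 1, 5, 30)
service_restored = datetime(2024, 1, 1, 6, 15)

rto_met, rpo_met = meets_objectives(
    last_recovery_point,
    outage_start,
    service_restored,
    rto=timedelta(hours=1),  # assumed target
    rpo=timedelta(hours=1),  # assumed target
)
print(rto_met, rpo_met)
```

In this example the 45 minutes of downtime fit within the assumed RTO, but the 3.5 hours since the last recovery point exceed the assumed RPO, so more frequent recovery points would be needed.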
A variety of disaster recovery (DR) strategies can be part of a disaster recovery plan.
Back-up is the simplest type of disaster recovery and involves storing data off-site or on a removable drive.
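As a rough illustration of this approach, the sketch below archives a directory into a separate backup location with a timestamp in the file name. The function name `back_up` and the directory names are hypothetical, and the temporary directories only stand in for the primary and off-site locations; a real setup would write to remote or removable storage.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def back_up(source_dir: Path, backup_root: Path) -> Path:
    """Archive source_dir into backup_root as a timestamped .tar.gz file."""
    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive_base = backup_root / f"{source_dir.name}-{stamp}"
    # make_archive appends the ".tar.gz" extension and returns the full path.
    archive = shutil.make_archive(str(archive_base), "gztar", root_dir=source_dir)
    return Path(archive)


# Demo with throwaway directories standing in for the two locations.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "app-data"
    source.mkdir()
    (source / "orders.txt").write_text("order-1\n")
    archive = back_up(source, Path(tmp) / "offsite")
    print(archive.name)
```

The timestamp of the most recent archive is the last recovery point, so how often this runs bounds the achievable RPO.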
With a cold site, an organization sets up basic infrastructure at a second site.
A hot site maintains up-to-date copies of data at all times. Hot sites are time-consuming to set up and more expensive than cold sites, but they dramatically reduce downtime.