Disaster Recovery (DR) and Business Continuity Planning (BCP) ensure that an organization can recover from disruptions and keep critical operations running. Cloud platforms provide the tooling to implement DR strategies against two explicit targets: the Recovery Time Objective (RTO), the maximum acceptable time to restore service, and the Recovery Point Objective (RPO), the maximum acceptable window of data loss.
DR Strategy Tiers
- Backup & Restore: Lowest cost, highest RTO (hours)
- Pilot Light: Core systems running, scale on demand
- Warm Standby: Scaled-down version always running
- Multi-Site Active/Active: Full redundancy, lowest RTO
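In Terraform, the practical difference between the tiers often comes down to how much application capacity is kept running in the standby region. The following is a minimal sketch under stated assumptions: the `dr_tier` variable, the instance counts, and the production size of 6 are all hypothetical and only illustrate the shape of the trade-off.

```hcl
# Hypothetical variable: which DR tier the standby region implements.
variable "dr_tier" {
  type    = string
  default = "pilot_light"

  validation {
    condition     = contains(["backup_restore", "pilot_light", "warm_standby", "active_active"], var.dr_tier)
    error_message = "The dr_tier value must be backup_restore, pilot_light, warm_standby, or active_active."
  }
}

locals {
  # Illustrative numbers only, assuming production runs 6 app instances.
  standby_instances = {
    backup_restore = 0 # nothing running; rebuild from backups
    pilot_light    = 0 # data tier replicated, app tier started on failover
    warm_standby   = 2 # scaled-down copy always running
    active_active  = 6 # full-size copy serving live traffic
  }[var.dr_tier]
}
```

A value such as `local.standby_instances` could then feed the desired capacity of an Auto Scaling group in the DR region, so moving between tiers is a one-variable change.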
AWS Backup Configuration
# Terraform - AWS Backup plan
resource "aws_backup_vault" "main" {
  name        = "production-vault"
  kms_key_arn = aws_kms_key.backup.arn
}

resource "aws_backup_plan" "production" {
  name = "production-backup-plan"

  rule {
    rule_name         = "daily-backup"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 5 ? * * *)" # Daily at 5 AM UTC

    lifecycle {
      cold_storage_after = 30
      delete_after       = 365
    }

    # Copy each daily recovery point to the vault in the DR region
    copy_action {
      destination_vault_arn = aws_backup_vault.dr_region.arn

      lifecycle {
        delete_after = 365
      }
    }
  }

  rule {
    rule_name         = "hourly-backup"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 * ? * * *)" # Every hour

    lifecycle {
      delete_after = 7
    }
  }
}

# Back up every resource tagged Backup = "true"
resource "aws_backup_selection" "production" {
  name         = "production-resources"
  plan_id      = aws_backup_plan.production.id
  iam_role_arn = aws_iam_role.backup.arn

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "true"
  }
}

Cross-Region RDS Replication
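The replica in this section is created through a provider alias, `aws.dr_region`, and encrypted with `aws_kms_key.rds_dr`, neither of which is defined in the article. A minimal sketch of those prerequisites might look like the following. The region `us-west-2`, the vault name, and the key settings are assumptions; the vault is included because the backup plan's `copy_action` refers to `aws_backup_vault.dr_region`.

```hcl
# Assumption: the DR region is us-west-2; substitute your own.
provider "aws" {
  alias  = "dr_region"
  region = "us-west-2"
}

# KMS keys are regional, so the replica needs its own key in the DR region.
resource "aws_kms_key" "rds_dr" {
  provider                = aws.dr_region
  description             = "Encrypts the cross-region RDS replica"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

# Destination vault for the backup plan's copy_action (name is hypothetical).
resource "aws_backup_vault" "dr_region" {
  provider = aws.dr_region
  name     = "production-vault-dr"
}
```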
# Terraform - RDS with read replica in DR region
resource "aws_db_instance" "primary" {
  identifier              = "production-db"
  engine                  = "postgres"
  engine_version          = "15.4"
  instance_class          = "db.r6g.xlarge"
  allocated_storage       = 100
  storage_encrypted       = true
  kms_key_id              = aws_kms_key.rds.arn
  backup_retention_period = 7 # Automated backups must be enabled to create replicas
  backup_window           = "03:00-04:00"
  multi_az                = true
}

resource "aws_db_instance" "replica" {
  provider            = aws.dr_region
  identifier          = "production-db-replica"
  replicate_source_db = aws_db_instance.primary.arn # Cross-region replicas reference the source by ARN
  instance_class      = "db.r6g.large"
  kms_key_id          = aws_kms_key.rds_dr.arn # Key must live in the DR region
}

S3 Cross-Region Replication
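S3 replication requires versioning on both the source and destination buckets, which the configuration in this section does not show. A minimal sketch, assuming the bucket resources are named `aws_s3_bucket.source` and `aws_s3_bucket.destination` as in the article, and that the destination bucket is managed through the `aws.dr_region` provider alias:

```hcl
resource "aws_s3_bucket_versioning" "source" {
  bucket = aws_s3_bucket.source.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_versioning" "destination" {
  provider = aws.dr_region
  bucket   = aws_s3_bucket.destination.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

Adding `depends_on = [aws_s3_bucket_versioning.source]` to the replication configuration is a common way to make sure versioning is in place before the replication rule is applied.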
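# Terraform - S3 replication
# Versioning must be enabled on both the source and destination buckets.
resource "aws_s3_bucket_replication_configuration" "replication" {
  bucket = aws_s3_bucket.source.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"

    # An empty prefix matches every object in the bucket
    filter {
      prefix = ""
    }

    # Opt SSE-KMS encrypted objects into replication; this pairs with
    # encryption_configuration in the destination block.
    source_selection_criteria {
      sse_kms_encrypted_objects {
        status = "Enabled"
      }
    }

    destination {
      bucket        = aws_s3_bucket.destination.arn
      storage_class = "STANDARD_IA"

      encryption_configuration {
        replica_kms_key_id = aws_kms_key.dr.arn
      }
    }

    delete_marker_replication {
      status = "Enabled"
    }
  }
}

Route 53 Failover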
# Terraform - DNS failover
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
}

DR Runbook Automation
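Since the rest of the infrastructure here is managed with Terraform, the runbook in this section could be registered the same way. The following is a sketch only: the document name and the file path `runbooks/dr-failover.yaml` are hypothetical, and it assumes the runbook should live in the DR region (via the `aws.dr_region` provider alias) so that it stays available when the primary region is not.

```hcl
resource "aws_ssm_document" "dr_failover" {
  provider        = aws.dr_region
  name            = "DR-Failover" # hypothetical name
  document_type   = "Automation"
  document_format = "YAML"

  # Hypothetical path: the runbook YAML saved next to the Terraform module.
  content = file("${path.module}/runbooks/dr-failover.yaml")
}
```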
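# AWS Systems Manager Automation
# Run this document in the DR region, where the replica lives.
schemaVersion: '0.3'
description: 'DR Failover Runbook'
parameters:
  HostedZoneId:
    type: String
    description: Route 53 hosted zone that contains app.example.com
  DRLoadBalancerDNS:
    type: String
    description: DNS name of the DR region load balancer
  DRLoadBalancerZoneId:
    type: String
    description: Hosted zone ID of the DR region load balancer
mainSteps:
  - name: PromoteRDSReplica
    action: aws:executeAwsApi
    inputs:
      Service: rds
      Api: PromoteReadReplica
      DBInstanceIdentifier: production-db-replica
  # If app.example.com already uses the failover records from the previous
  # section, this record would also need matching SetIdentifier and Failover fields.
  - name: UpdateDNS
    action: aws:executeAwsApi
    inputs:
      Service: route53
      Api: ChangeResourceRecordSets
      HostedZoneId: '{{ HostedZoneId }}'
      ChangeBatch:
        Changes:
          - Action: UPSERT
            ResourceRecordSet:
              Name: app.example.com
              Type: A
              AliasTarget:
                DNSName: '{{ DRLoadBalancerDNS }}'
                HostedZoneId: '{{ DRLoadBalancerZoneId }}'
                EvaluateTargetHealth: true

Testing DR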
- Schedule regular DR drills (quarterly minimum)
- Document and measure actual RTO/RPO
- Test data restoration procedures
- Validate runbook automation
- Update procedures based on lessons learned
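Drills measure RPO at a single point in time. Between drills, the lag of the cross-region replica gives a rough, continuous proxy for how much data a failover would lose. A minimal sketch, assuming the replica from the RDS section and a hypothetical RPO target of 300 seconds; notification actions are left out because no SNS topic is defined in this article.

```hcl
resource "aws_cloudwatch_metric_alarm" "replica_lag" {
  provider            = aws.dr_region # the replica's metrics are published in the DR region
  alarm_name          = "production-db-replica-lag"
  namespace           = "AWS/RDS"
  metric_name         = "ReplicaLag"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 5
  comparison_operator = "GreaterThanThreshold"
  threshold           = 300 # seconds; hypothetical RPO target
  treat_missing_data  = "breaching" # no data may mean replication has stopped

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.replica.identifier
  }
}
```

An alarm like this does not replace a drill, but it flags drift from the RPO target long before the next scheduled test.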
Conclusion
Effective DR/BCP requires careful planning, automation, and regular testing. Cloud services provide the building blocks for robust disaster recovery, but success depends on proper implementation and continuous validation of recovery procedures.

