Disaster Recovery and Business Continuity Planning (DR/BCP) for Cloud

Disaster Recovery (DR) and Business Continuity Planning (BCP) ensure that an organization can recover from disruptions and keep critical operations running. Cloud platforms provide the building blocks for DR strategies designed around two targets: the Recovery Time Objective (RTO), the maximum acceptable time to restore service, and the Recovery Point Objective (RPO), the maximum acceptable window of data loss, measured in time.

DR Strategy Tiers

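  • Backup & Restore: Lowest cost, highest RTO (typically hours); data is restored from backups into newly provisioned infrastructure
  • Pilot Light: Data is replicated continuously and core infrastructure stays provisioned; application capacity is started and scaled only during a failover
  • Warm Standby: A scaled-down but fully functional copy of the workload is always running and is scaled up during a failover
  • Multi-Site Active/Active: Full capacity serves traffic from more than one site; lowest RTO, highest cost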

AWS Backup Configuration

# Terraform - AWS Backup plan
resource "aws_backup_vault" "main" {
  name        = "production-vault"
  kms_key_arn = aws_kms_key.backup.arn
}

resource "aws_backup_plan" "production" {
  name = "production-backup-plan"

  rule {
    rule_name         = "daily-backup"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 5 ? * * *)"  # Daily at 5 AM UTC

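    lifecycle {
      cold_storage_after = 30  # days; applies only to resource types that support cold storage
      delete_after       = 365 # days; must be at least 90 days later than cold_storage_after
    }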

    copy_action {
      destination_vault_arn = aws_backup_vault.dr_region.arn
      lifecycle {
        delete_after = 365
      }
    }
  }

  rule {
    rule_name         = "hourly-backup"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 * ? * * *)"  # Every hour

    lifecycle {
      delete_after = 7
    }
  }
}

resource "aws_backup_selection" "production" {
  name         = "production-resources"
  plan_id      = aws_backup_plan.production.id
  iam_role_arn = aws_iam_role.backup.arn

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "true"
  }
}
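
The backup plan above references aws_backup_vault.dr_region and aws_iam_role.backup without defining them. The following is a minimal sketch of how they might be declared, not a definitive setup. It assumes a provider alias named dr_region (the alias the RDS replica uses later in this article), an illustrative DR region of us-west-2, and hypothetical names for the KMS key, the DR vault, and the IAM role.

```hcl
# Assumption: us-west-2 is the DR region; substitute your own.
provider "aws" {
  alias  = "dr_region"
  region = "us-west-2"
}

# Hypothetical KMS key for the DR vault. KMS keys are regional, so the
# primary-region key cannot encrypt a vault in another region.
resource "aws_kms_key" "backup_dr" {
  provider                = aws.dr_region
  description             = "Encrypts recovery points in the DR backup vault"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

# Destination vault for the copy_action in the daily rule.
resource "aws_backup_vault" "dr_region" {
  provider    = aws.dr_region
  name        = "production-vault-dr"
  kms_key_arn = aws_kms_key.backup_dr.arn
}

# Trust policy that lets the AWS Backup service assume the role.
data "aws_iam_policy_document" "backup_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["backup.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "backup" {
  name               = "aws-backup-service-role"
  assume_role_policy = data.aws_iam_policy_document.backup_assume.json
}

# AWS managed policy that grants the permissions needed to take backups.
resource "aws_iam_role_policy_attachment" "backup" {
  role       = aws_iam_role.backup.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
}
```

If the same role is also used for restores, the AWS managed policy AWSBackupServiceRolePolicyForRestores would need to be attached as well.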

Cross-Region RDS Replication

# Terraform - RDS with read replica in DR region
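resource "aws_db_instance" "primary" {
  identifier        = "production-db"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 100
  storage_encrypted = true
  kms_key_id        = aws_kms_key.rds.arn

  # Master credentials are required for a new instance.
  # Both values are placeholders; var.db_password is a hypothetical variable.
  username = "dbadmin"
  password = var.db_password

  # Automated backups must be enabled (retention > 0) before a read replica can be created
  backup_retention_period = 7
  backup_window           = "03:00-04:00"

  multi_az = true
}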

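resource "aws_db_instance" "replica" {
  provider = aws.dr_region

  identifier          = "production-db-replica"
  replicate_source_db = aws_db_instance.primary.arn # cross-region replicas must reference the source by ARN
  instance_class      = "db.r6g.large"              # smaller than the primary; consider resizing after promotion

  # KMS keys are regional, so the replica is encrypted with a key from the DR region
  kms_key_id = aws_kms_key.rds_dr.arn
}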
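
With a read replica, the achievable RPO is bounded by replication lag, so it can help to alarm on it. The sketch below assumes an RPO target of five minutes and a hypothetical SNS topic named dr_alerts. ReplicaLag is the CloudWatch metric RDS publishes for read replicas, in seconds.

```hcl
# Hypothetical notification topic in the DR region.
resource "aws_sns_topic" "dr_alerts" {
  provider = aws.dr_region
  name     = "dr-alerts"
}

# Fires when the replica falls more than 300 seconds behind the primary
# for five consecutive one-minute periods.
resource "aws_cloudwatch_metric_alarm" "replica_lag" {
  provider = aws.dr_region

  alarm_name          = "production-db-replica-lag"
  namespace           = "AWS/RDS"
  metric_name         = "ReplicaLag"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 5
  comparison_operator = "GreaterThanThreshold"
  threshold           = 300 # seconds; assumed RPO target, adjust to yours

  # Assumption: missing data points are treated as a problem, because
  # they may mean replication has stopped reporting.
  treat_missing_data = "breaching"

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.replica.identifier
  }

  alarm_actions = [aws_sns_topic.dr_alerts.arn]
}
```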

S3 Cross-Region Replication

# Terraform - S3 replication
resource "aws_s3_bucket_replication_configuration" "replication" {
  bucket = aws_s3_bucket.source.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"

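    # An empty filter applies the rule to every object in the bucket
    filter {}

    # Needed because the destination sets encryption_configuration:
    # opts SSE-KMS encrypted source objects into replication
    source_selection_criteria {
      sse_kms_encrypted_objects {
        status = "Enabled"
      }
    }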

    destination {
      bucket        = aws_s3_bucket.destination.arn
      storage_class = "STANDARD_IA"
      
      encryption_configuration {
        replica_kms_key_id = aws_kms_key.dr.arn
      }
    }

    delete_marker_replication {
      status = "Enabled"
    }
  }
}
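
S3 replication requires versioning to be enabled on both the source and the destination bucket. A minimal sketch follows, assuming the destination bucket is managed through a provider alias named dr_region and that both buckets are declared elsewhere as aws_s3_bucket.source and aws_s3_bucket.destination.

```hcl
resource "aws_s3_bucket_versioning" "source" {
  bucket = aws_s3_bucket.source.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Assumption: the destination bucket lives in the DR region.
resource "aws_s3_bucket_versioning" "destination" {
  provider = aws.dr_region
  bucket   = aws_s3_bucket.destination.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

Adding depends_on = [aws_s3_bucket_versioning.source, aws_s3_bucket_versioning.destination] to the replication configuration is one way to make sure versioning is in place before the rule is created.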

Route 53 Failover

# Terraform - DNS failover
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"
}
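
Automatic failover is easy to miss if nobody is notified when it happens. The sketch below raises a CloudWatch alarm when the primary health check fails. Route 53 publishes health check metrics only in us-east-1, so the alarm is created there; the provider alias us_east_1 and the alarm name are assumptions made for this example.

```hcl
# Route 53 health check metrics are available only in us-east-1.
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"
}

resource "aws_cloudwatch_metric_alarm" "primary_unhealthy" {
  provider = aws.us_east_1

  alarm_name          = "primary-endpoint-unhealthy"
  namespace           = "AWS/Route53"
  metric_name         = "HealthCheckStatus" # 1 = healthy, 0 = unhealthy
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "LessThanThreshold"
  threshold           = 1

  dimensions = {
    HealthCheckId = aws_route53_health_check.primary.id
  }
}
```

An alarm_actions list pointing at an SNS topic in us-east-1 can be added to route the notification to an on-call channel.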

DR Runbook Automation

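# AWS Systems Manager Automation
# Run this document in the DR region, where production-db-replica lives.
schemaVersion: '0.3'
description: 'DR Failover Runbook'
parameters:
  HostedZoneId:
    type: String
    description: 'Route 53 hosted zone that contains app.example.com'
  DRLoadBalancerDNS:
    type: String
    description: 'DNS name of the load balancer in the DR region'
  DRLoadBalancerZoneId:
    type: String
    description: 'Hosted zone ID of the DR load balancer'
mainSteps:
  # Promotion is asynchronous: the API call returns before the replica
  # has finished becoming a standalone, writable instance.
  - name: PromoteRDSReplica
    action: aws:executeAwsApi
    inputs:
      Service: rds
      Api: PromoteReadReplica
      DBInstanceIdentifier: production-db-replica

  # Repoints app.example.com at the DR load balancer. This UPSERT writes a
  # simple (non-failover) record, so it applies only when the zone does not
  # already hold failover records with the same name and type.
  - name: UpdateDNS
    action: aws:executeAwsApi
    inputs:
      Service: route53
      Api: ChangeResourceRecordSets
      HostedZoneId: '{{ HostedZoneId }}'
      ChangeBatch:
        Changes:
          - Action: UPSERT
            ResourceRecordSet:
              Name: app.example.com
              Type: A
              AliasTarget:
                DNSName: '{{ DRLoadBalancerDNS }}'
                HostedZoneId: '{{ DRLoadBalancerZoneId }}'
                EvaluateTargetHealth: true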
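
To keep the runbook under version control alongside the rest of the infrastructure, it can be registered with Systems Manager from Terraform. This is a sketch under two assumptions: the YAML above is saved at the hypothetical path runbooks/dr-failover.yaml, and a provider alias named dr_region points at the region that hosts the replica.

```hcl
resource "aws_ssm_document" "dr_failover" {
  provider = aws.dr_region

  name            = "DR-Failover-Runbook"
  document_type   = "Automation"
  document_format = "YAML"

  # Hypothetical path; adjust to wherever the runbook file is stored.
  content = file("${path.module}/runbooks/dr-failover.yaml")
}
```

Registering the document in the DR region means it remains available to run even when the primary region is impaired.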

Testing DR

  • Schedule regular DR drills (at least quarterly)
  • Measure and document the RTO and RPO actually achieved, and compare them against targets
  • Test data restoration procedures, not just backup creation
  • Validate runbook automation end to end
  • Update procedures based on lessons learned
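
Part of restore testing can be scheduled rather than left to manual drills. The sketch below uses AWS Backup restore testing and assumes an AWS provider version recent enough to include the aws_backup_restore_testing_plan resource. The plan name, the quarterly schedule, and the seven-day selection window are illustrative choices, not recommendations.

```hcl
resource "aws_backup_restore_testing_plan" "quarterly" {
  name = "quarterly_restore_test" # letters, numbers, and underscores only

  # 06:00 UTC on the first day of January, April, July, and October
  schedule_expression = "cron(0 6 1 1,4,7,10 ? *)"

  recovery_point_selection {
    algorithm             = "LATEST_WITHIN_WINDOW"
    include_vaults        = [aws_backup_vault.main.arn]
    recovery_point_types  = ["SNAPSHOT"]
    selection_window_days = 7
  }
}
```

A plan on its own restores nothing. Each resource type to be tested also needs an aws_backup_restore_testing_selection that names the plan, the protected resource type, and an IAM role with restore permissions.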

Conclusion

Effective DR/BCP requires careful planning, automation, and regular testing. Cloud services provide the building blocks for robust disaster recovery, but success depends on proper implementation and continuous validation of recovery procedures.