
4.2 Reliability & Resilience

Recommended
Related cloud guidance: AWS Reliability, Azure Reliability, GCP Reliability

Reliability & Resilience focuses on the solution’s ability to withstand failures, recover from disruptions, and scale to meet demand. This quality attribute is closely tied to the Physical View (3.3) and Data View (3.4) for infrastructure and data backup details. Evaluate this quality attribute across all architectural views documented in Section 3.

4.2.1 Geographic Footprint & Disaster Recovery

Recommended
Question | Response
Is the application deployed across multiple hosting venues for continuity? | Yes / No - [details]
What is the DR strategy? | Active-Active / Active-Passive / Pilot Light / Backup & Restore
Are there data sovereignty requirements affecting geographic choices? | Yes / No - [details]

[Insert geographic deployment diagram if applicable]

Recommended
Attribute | Response
Scaling capability | No dynamic scaling (pre-sized) / Manual scaling / Partial auto-scaling / Full auto-scaling
Scaling details | [describe how scaling works, triggers, limits]

Attribute | Response
Dependencies adequately sized? | Yes (confirmed) / Unconfirmed / Known insufficient
Dependency details | [describe scaling posture of dependencies]
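
Where "triggers and limits" need spelling out, a short illustration can make the scaling details row concrete. The sketch below is a hypothetical target-tracking rule on CPU utilisation with hard replica limits, written in Python purely for illustration; the target, minimum, and maximum values are assumptions, not recommendations, and real platforms (cloud auto-scaling groups, Kubernetes HPA, etc.) implement the equivalent logic for you.

```python
# Illustrative scaling rule: target-tracking on CPU with hard min/max limits.
# All numbers here are assumptions for the example, not recommended values.
def desired_replicas(current_replicas, cpu_utilisation, target=0.60,
                     min_replicas=2, max_replicas=20):
    """Scale proportionally to observed/target CPU, clamped to the limits."""
    if cpu_utilisation <= 0:
        return min_replicas
    proposed = round(current_replicas * cpu_utilisation / target)
    return max(min_replicas, min(max_replicas, proposed))

print(desired_replicas(current_replicas=4, cpu_utilisation=0.90))  # scales out to 6
print(desired_replicas(current_replicas=4, cpu_utilisation=0.20))  # clamps at the minimum of 2
```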
Recommended

Has the application been designed to tolerate unexpected disruptions such as failure or degradation of internal components or external dependencies?

  • Yes - [describe fault tolerance design, including:]
    • How the application handles component failures
    • Graceful degradation strategies
    • Circuit breaker or retry patterns (a minimal sketch follows this list)
    • Health check and self-healing mechanisms
    • Testing practices (chaos engineering, game days)
  • No - [describe why not]
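
If the design uses circuit breaker or retry patterns, a brief illustration can anchor the description. The following is a minimal, illustrative Python sketch of a circuit breaker wrapped around a hypothetical `fetch_recommendations` dependency; the thresholds, reset timeout, and empty-list fallback are assumptions chosen to show graceful degradation, not part of this template.

```python
import random
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive errors; allow a
    single trial call through once `reset_timeout` seconds have passed."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback          # fail fast: degrade gracefully
            self.opened_at = None        # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback
        self.failures = 0
        return result


def fetch_recommendations(user_id):
    """Stand-in for a flaky downstream dependency."""
    if random.random() < 0.5:
        raise TimeoutError("recommendation service timed out")
    return ["item-1", "item-2"]


breaker = CircuitBreaker()
for _ in range(5):
    items = breaker.call(fetch_recommendations, "user-42", fallback=[])
    print(items)   # an empty list is a degraded response, not a full outage
```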
Recommended

Document how the solution behaves when individual components or dependencies fail:

Component / Dependency | Failure Mode | Detection Method | Recovery Behaviour | User Impact
[component] | [how it fails] | [how detected: health check, alert, timeout] | [auto-restart, failover, graceful degradation, manual intervention] | [full outage, degraded service, transparent]

Guidance

For each critical component, consider:

  • What happens when it becomes unavailable? Does the solution fail entirely, degrade gracefully, or continue with reduced functionality?
  • How is the failure detected? Health checks, heartbeats, error thresholds, timeouts?
  • How is it recovered? Automatic restart, failover to secondary, circuit breaker, manual intervention?
  • What do users experience? Full outage, degraded experience, increased latency, or transparent failover?

This section is frequently missing from architecture documents but is one of the most valuable for operations teams and SREs.
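
As one concrete illustration of the "how is the failure detected" question, the sketch below shows a minimal TCP-level dependency probe in Python that rolls individual results up into healthy / degraded / unhealthy. The hostnames, ports, and timeout are placeholders; production health checks would typically call each dependency's own health endpoint rather than just opening a socket.

```python
import socket

# Hypothetical dependency endpoints; real checks would probe the databases,
# queues, and third-party APIs the solution actually relies on.
DEPENDENCIES = {
    "database": ("db.internal.example", 5432),
    "cache": ("cache.internal.example", 6379),
}

def check_dependency(host, port, timeout=1.0):
    """Return True if a TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def health_report():
    results = {name: check_dependency(*addr) for name, addr in DEPENDENCIES.items()}
    if all(results.values()):
        status = "healthy"
    elif any(results.values()):
        status = "degraded"       # partial failure: reduced functionality
    else:
        status = "unhealthy"      # all dependencies unreachable
    return {"status": status, "dependencies": results}

print(health_report())
```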

Recommended
Attribute | Detail
Backup strategy | [what is backed up and how]
Backup product/service | [tool used]
Backup type | Full / Incremental / Differential
Backup frequency | [schedule]
Backup retention | [period]

Control | Detail
Immutability | [how backups are protected against modification/deletion]
Encryption | [how backup data is encrypted]
Access control | [who can access backups]
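
One lightweight way to keep the frequency and retention rows above honest is to monitor backup freshness. The sketch below is a minimal, illustrative check that flags a backup older than one scheduled cycle plus a margin; the 24-hour cycle, 1-hour margin, and the assumption that the backup tool exposes a completion timestamp are all placeholders for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy values standing in for the backup table above; the real
# values come from whatever tool and schedule the solution actually uses.
BACKUP_FREQUENCY = timedelta(hours=24)   # e.g. nightly full backup
ALERT_MARGIN = timedelta(hours=1)        # tolerate a late-running job

def backup_is_fresh(last_completed_at, now=None):
    """Flag backups older than one scheduled cycle plus a small margin."""
    now = now or datetime.now(timezone.utc)
    return now - last_completed_at <= BACKUP_FREQUENCY + ALERT_MARGIN

last_backup = datetime.now(timezone.utc) - timedelta(hours=30)
if not backup_is_fresh(last_backup):
    print("ALERT: latest backup is stale; backup-based recovery may miss its RPO")
```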
Recommended

Document how the solution recovers under different failure scenarios:

# | Scenario | Recovery Approach | RTO | RPO
1 | Primary hosting venue / AZ / region failure | [approach] | [time] | [time]
2 | Critical software component failure | [approach] | [time] | [time]
3 | Key infrastructure failure (hardware, storage, network) | [approach] | [time] | [time]
4 | Network connectivity failure between venues | [approach] | [time] | [time]
5 | External connectivity failure (customer-facing) | [approach] | [time] | [time]
6 | Ransomware / cyber-attack | [approach] | [time] | [time]
7 | Accidental or malicious data corruption / deletion | [approach] | [time] | [time]

Guidance

For each scenario, describe:

  • How the failure is detected
  • Automatic vs. manual recovery steps
  • Expected Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
  • Any dependencies on other teams or systems for recovery
  • Whether recovery has been tested and when

Scoring Guidance

Score | What This Looks Like
1 | DR strategy identified but RTO/RPO not defined
3 | DR strategy documented with RTO/RPO targets, backup configured, scalability approach defined
5 | All of the above plus fault tolerance designed, chaos testing practised, backup immutability and encryption confirmed, DR tested