MTBF & MTTR

MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) are industry-standard reliability metrics that Saturn tracks automatically.

MTBF: Mean Time Between Failures

Definition: Average uptime between consecutive incidents

Formula:

MTBF = Total Uptime / Number of Incidents

Example:

Period: 30 days (720 hours)
Incidents: 3
Downtime: 2 hours total

MTBF = (720 - 2) / 3 = 239.3 hours

Interpretation: On average, this monitor runs for 239 hours (≈10 days) before an incident occurs.
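The calculation above can be sketched in a few lines of Python. This is an illustrative helper, not Saturn's implementation: the function name `mtbf_hours` and its parameters are hypothetical, and raising an error when there are no incidents is an assumption about how to handle that edge case.

```python
def mtbf_hours(period_hours: float, downtime_hours: float, incidents: int) -> float:
    """Return MTBF = total uptime / number of incidents, in hours."""
    if incidents <= 0:
        # Assumption: with no incidents there is nothing to average over.
        raise ValueError("MTBF is undefined without at least one incident")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return (period_hours - downtime_hours) / incidents


# Values from the example: 30-day period, 2 hours of downtime, 3 incidents.
mtbf = mtbf_hours(period_hours=720, downtime_hours=2, incidents=3)
print(f"MTBF: {mtbf:.1f} hours ({mtbf / 24:.1f} days)")  # MTBF: 239.3 hours (10.0 days)
```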

MTBF Targets

Target	Quality Level
> 720 hours (30 days)	Excellent
360-720 hours (15-30 days)	Good
168-360 hours (7-15 days)	Acceptable
72-168 hours (3-7 days)	Needs improvement
< 72 hours (3 days)	Poor

Improving MTBF

Fix root causes: Don't just resolve incidents, prevent recurrence
Add redundancy: Fallback mechanisms for dependencies
Better error handling: Graceful degradation instead of failures
Proactive maintenance: Fix issues before they cause incidents

MTTR: Mean Time To Repair

Definition: Average time from incident creation to resolution

Formula:

MTTR = Σ(Incident Duration) / Number of Incidents

Example:

Incidents:
1. Oct 5: 3h 20m (200 minutes)
2. Oct 15: 45m
3. Oct 22: 2h 10m (130 minutes)

MTTR = (200 + 45 + 130) / 3 = 125 minutes ≈ 2.08 hours

Interpretation: On average, it takes about 2 hours to resolve an incident for this monitor.
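As a rough sketch, the same average can be computed from incident durations with Python's standard `datetime.timedelta`. The function name `mttr` is hypothetical, and the durations are the three from the example.

```python
from datetime import timedelta


def mttr(durations: list[timedelta]) -> timedelta:
    """Return the mean incident duration (sum of durations / number of incidents)."""
    if not durations:
        raise ValueError("MTTR is undefined without at least one incident")
    return sum(durations, timedelta()) / len(durations)


incident_durations = [
    timedelta(hours=3, minutes=20),  # Oct 5
    timedelta(minutes=45),           # Oct 15
    timedelta(hours=2, minutes=10),  # Oct 22
]

result = mttr(incident_durations)
print(result.total_seconds() / 60)  # 125.0 (minutes)
```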

MTTR Targets

Target	Response Quality
< 15 minutes	Excellent (automated recovery)
15-60 minutes	Good (quick manual fix)
1-4 hours	Acceptable
4-24 hours	Slow
> 24 hours	Critical issue

Improving MTTR

Faster detection: Reduce time from failure to alert
Better alerts: Include context, logs, likely causes
Runbooks: Document common fixes
Automation: Auto-remediation for known issues
On-call rotation: Ensure someone can respond 24/7

MTTR Components

Break down MTTR into phases:

MTTD: Mean Time To Detect

Time from failure to alert creation.

Improve by:

Shorter grace periods
Start pings for long-running jobs
Proactive health checks

MTTA: Mean Time To Acknowledge

Time from alert to team acknowledgment.

Improve by:

24/7 on-call rotation
Multi-channel alerting
Escalation policies

MTTI: Mean Time To Investigate

Time from acknowledgment to identification of the root cause.

Improve by:

Better logging and output capture
Historical context in alerts
Dashboards for quick diagnosis

MTTF: Mean Time To Fix

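Time to implement the fix once the root cause is known. Note that MTTF more commonly stands for Mean Time To Failure; here it means Mean Time To Fix.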

Improve by:

Automation (runbooks, scripts)
Pre-approved changes
Rollback mechanisms

MTTV: Mean Time To Verify

Time to confirm that the fix worked.

Improve by:

Automated verification
Quick feedback loops
Monitoring fix impact
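One way to picture the five phases is as the gaps between six consecutive timestamps in an incident's timeline. The sketch below assumes such timestamps are available. The `IncidentTimeline` class and its field names are hypothetical, and the sample timestamps are made up for illustration (chosen to add up to a 45-minute incident).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentTimeline:
    # Hypothetical field names; one timestamp per phase boundary.
    failed_at: datetime
    alerted_at: datetime
    acknowledged_at: datetime
    diagnosed_at: datetime
    fixed_at: datetime
    verified_at: datetime


def phase_minutes(t: IncidentTimeline) -> dict[str, float]:
    """Split one incident into detect/acknowledge/investigate/fix/verify minutes."""
    marks = [t.failed_at, t.alerted_at, t.acknowledged_at,
             t.diagnosed_at, t.fixed_at, t.verified_at]
    names = ["mttd", "mtta", "mtti", "mttf", "mttv"]
    return {
        name: (end - start).total_seconds() / 60
        for name, start, end in zip(names, marks, marks[1:])
    }


def mean_phases(timelines: list[IncidentTimeline]) -> dict[str, float]:
    """Average each phase across incidents."""
    if not timelines:
        raise ValueError("need at least one incident")
    per_incident = [phase_minutes(t) for t in timelines]
    return {
        name: sum(p[name] for p in per_incident) / len(per_incident)
        for name in per_incident[0]
    }


# Made-up timeline for a 45-minute incident.
example = IncidentTimeline(
    failed_at=datetime(2025, 10, 15, 10, 0),
    alerted_at=datetime(2025, 10, 15, 10, 1),
    acknowledged_at=datetime(2025, 10, 15, 10, 4),
    diagnosed_at=datetime(2025, 10, 15, 10, 20),
    fixed_at=datetime(2025, 10, 15, 10, 40),
    verified_at=datetime(2025, 10, 15, 10, 45),
)
print(phase_minutes(example))
```

Because the phases are consecutive, their durations add up to the total time from failure to verified fix, which makes it easy to see which phase dominates.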

Dashboard Visualization

Monitor: Database Backup

┌─────────────────────────────────────────┐
│ MTBF: 10.0 days  (↗ +3.1 vs last month)│
│ MTTR: 2.1 hours  (↘ -0.4 vs last month)│
└─────────────────────────────────────────┘

Recent Incidents:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Oct 5  │ MISSED │ 3h 20m │ MTTD: 5m, MTTA: 12m
Oct 15 │ FAIL   │ 45m    │ MTTD: 1m, MTTA: 3m
Oct 22 │ MISSED │ 2h 10m │ MTTD: 5m, MTTA: 8m

Industry Benchmarks

By Industry

Industry	Avg MTBF	Avg MTTR
SaaS	20-30 days	1-2 hours
E-commerce	15-25 days	30-90 min
Financial	30-45 days	15-60 min
Healthcare	25-40 days	1-3 hours

By Criticality

Priority	Target MTBF	Target MTTR
Critical	> 30 days	< 30 min
High	> 15 days	< 2 hours
Medium	> 7 days	< 8 hours
Low	> 3 days	< 24 hours
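The criticality targets can be expressed as a small lookup for checking a monitor against its priority. This is a sketch only: the `TARGETS` mapping and the `meets_targets` function are hypothetical names, with thresholds copied from the table above and converted to days and minutes.

```python
# Targets from the "By Criticality" table: (minimum MTBF in days, maximum MTTR in minutes).
TARGETS = {
    "critical": (30, 30),
    "high": (15, 2 * 60),
    "medium": (7, 8 * 60),
    "low": (3, 24 * 60),
}


def meets_targets(priority: str, mtbf_days: float, mttr_minutes: float) -> dict[str, bool]:
    """Check measured MTBF and MTTR against the targets for a priority level."""
    min_mtbf_days, max_mttr_minutes = TARGETS[priority.lower()]
    return {
        "mtbf_ok": mtbf_days > min_mtbf_days,
        "mttr_ok": mttr_minutes < max_mttr_minutes,
    }


# A monitor with MTBF of 10 days and MTTR of 125 minutes:
print(meets_targets("Medium", mtbf_days=10.0, mttr_minutes=125))
# {'mtbf_ok': True, 'mttr_ok': True}
print(meets_targets("High", mtbf_days=10.0, mttr_minutes=125))
# {'mtbf_ok': False, 'mttr_ok': False}
```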

Availability Calculation

MTBF and MTTR combine to determine availability:

Availability = MTBF / (MTBF + MTTR)

Example:

MTBF: 240 hours (10 days)
MTTR: 2 hours

Availability = 240 / (240 + 2) = 99.17%
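The availability formula translates directly into code. The function name `availability` is a hypothetical helper. Both inputs must use the same unit (hours here).

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


print(f"{availability(240, 2):.2%}")  # 99.17%
```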

Target Availability

Availability	Approx. MTBF Needed (for 2h MTTR)
99.9% ("3 nines")	2,000 hours (83 days)
99.5%	400 hours (17 days)
99%	200 hours (8 days)
95%	40 hours (1.7 days)
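The MTBF column follows from solving the availability formula for MTBF: MTBF = Availability × MTTR / (1 − Availability). A minimal sketch, with `required_mtbf_hours` as a hypothetical helper name:

```python
def required_mtbf_hours(target_availability: float, mttr_hours: float) -> float:
    """Solve Availability = MTBF / (MTBF + MTTR) for MTBF."""
    if not 0 < target_availability < 1:
        raise ValueError("target availability must be a fraction between 0 and 1")
    return target_availability * mttr_hours / (1 - target_availability)


for target in (0.999, 0.995, 0.99, 0.95):
    hours = required_mtbf_hours(target, mttr_hours=2)
    print(f"{target:.1%}: {hours:.0f} hours ({hours / 24:.1f} days)")
```

For a 2-hour MTTR the exact values are 1,998, 398, 198, and 38 hours. The table rounds these up to convenient figures.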

API Access

# Get MTBF/MTTR for a monitor
GET /api/monitors/YOUR_MONITOR_ID/reliability

Response:
{
  "monitorId": "mon_abc123",
  "period": "30d",
  "mtbf": {
    "hours": 239.3,
    "days": 9.97,
    "trend": "improving"
  },
  "mttr": {
    "minutes": 125,
    "hours": 2.08,
    "trend": "stable",
    "breakdown": {
      "mttd": 3.7,
      "mtta": 8.2,
      "mtti": 45.3,
      "mttf": 52.1,
      "mttv": 15.7
    }
  },
  "availability": 99.14
}
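A client might consume this response roughly as follows. This sketch rests on several assumptions that the page does not confirm: that the endpoint lives on the same host as the Reports example (`https://api.saturn.example.com`), that it accepts the same Bearer token, and that the `breakdown` values are in minutes (they add up to the `mttr.minutes` value in the sample). The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api.saturn.example.com"  # assumed: same host as the Reports example


def fetch_reliability(monitor_id: str, token: str) -> str:
    """Fetch the raw JSON body for one monitor (auth scheme is an assumption)."""
    request = urllib.request.Request(
        f"{BASE_URL}/api/monitors/{monitor_id}/reliability",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def summarize_reliability(payload: str) -> dict:
    """Pull out the headline numbers and find the slowest MTTR phase."""
    data = json.loads(payload)
    breakdown = data["mttr"]["breakdown"]
    phase_total = sum(breakdown.values())  # assumed to be minutes
    return {
        "mtbf_hours": data["mtbf"]["hours"],
        "mttr_minutes": data["mttr"]["minutes"],
        "slowest_phase": max(breakdown, key=breakdown.get),
        "phases_match_mttr": abs(phase_total - data["mttr"]["minutes"]) < 0.5,
    }


# Trimmed copy of the sample response, so the parsing can be tried offline.
SAMPLE = """
{
  "monitorId": "mon_abc123",
  "mtbf": {"hours": 239.3},
  "mttr": {
    "minutes": 125,
    "breakdown": {"mttd": 3.7, "mtta": 8.2, "mtti": 45.3, "mttf": 52.1, "mttv": 15.7}
  }
}
"""
print(summarize_reliability(SAMPLE))
```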

Reports

Export MTBF/MTTR reports:

curl -X POST https://api.saturn.example.com/api/reports/reliability \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "monitorIds": ["mon_abc123"],
    "startDate": "2025-10-01",
    "endDate": "2025-10-31"
  }' > reliability-report.pdf
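The same request can be made from Python using only the standard library. This is a sketch under assumptions: the function names are hypothetical, the JSON `Content-Type` header is a guess at what the API expects, and the response body is assumed to be the PDF itself, as the curl redirect suggests.

```python
import json
import urllib.request

REPORTS_URL = "https://api.saturn.example.com/api/reports/reliability"


def build_report_request(token: str, monitor_ids: list[str],
                         start_date: str, end_date: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({
        "monitorIds": monitor_ids,
        "startDate": start_date,
        "endDate": end_date,
    }).encode("utf-8")
    return urllib.request.Request(
        REPORTS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumption about the API
        },
    )


def export_report(request: urllib.request.Request,
                  path: str = "reliability-report.pdf") -> None:
    """Send the request and write the response body to a file."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


request = build_report_request("YOUR_TOKEN", ["mon_abc123"], "2025-10-01", "2025-10-31")
# export_report(request)  # uncomment to send the request
```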

Next Steps

Health Scores — Overall reliability grading
Uptime & SLA — Track uptime percentages
Incident Lifecycle — Managing incidents efficiently
