Incident Lifecycle

Incidents move through three states, OPEN, ACKNOWLEDGED, and RESOLVED, from detection to resolution. Understanding this lifecycle helps teams coordinate their response.

Incident States

OPEN

When: Incident first detected

Characteristics:

Alerts sent immediately
Dashboard shows red badge
Incident appears at top of list
Timer starts for MTTR calculation

Actions:

Review incident details
Check captured output
Investigate root cause
Acknowledge to signal ownership

ACKNOWLEDGED

When: Team member clicks "Acknowledge"

Characteristics:

No new alerts sent for this incident
Dashboard shows yellow badge
Timestamp of acknowledgment recorded
Acknowledging user tracked

Purpose:

Signal "someone is working on it"
Prevent duplicate work
Stop alert spam
Track response time

Example:

# Via API
curl -X POST https://api.saturn.example.com/api/incidents/YOUR_INCIDENT_ID/acknowledge \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"userId": "user_abc123", "note": "Investigating database connection"}'
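
For teams that script this call, the same request can be sketched in Python using only the standard library. The helper name `build_acknowledge_request` is hypothetical. The endpoint and payload fields come from the curl example above, and the `Content-Type` header is an assumption about what the API expects for a JSON body. The request is built but not sent.

```python
import json
import urllib.request

API_BASE = "https://api.saturn.example.com/api"  # base URL from the curl example


def build_acknowledge_request(incident_id, token, user_id, note):
    """Build (but do not send) the POST request that acknowledges an incident."""
    body = json.dumps({"userId": user_id, "note": note}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/incidents/{incident_id}/acknowledge",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the API expects JSON bodies to be declared as such.
            "Content-Type": "application/json",
        },
    )


req = build_acknowledge_request(
    "YOUR_INCIDENT_ID",
    "YOUR_TOKEN",
    "user_abc123",
    "Investigating database connection",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the request easy to inspect or test first.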

RESOLVED

When: Issue fixed and incident manually resolved, or auto-resolved

Characteristics:

Dashboard shows green badge
Incident moves to history
MTTR calculated and recorded
Resolution note stored

Resolution Triggers:

Next successful ping received
Manual resolution by team member
Maintenance window ends (optional)

Example:

# Via API
curl -X POST https://api.saturn.example.com/api/incidents/YOUR_INCIDENT_ID/resolve \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"userId": "user_abc123", "note": "Database credentials rotated, backup completed successfully"}'
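
The three states can be read as a small state machine. The sketch below is an interpretation of the descriptions above, not a documented Saturn rule set: it assumes OPEN can move to ACKNOWLEDGED or straight to RESOLVED (for example on auto-resolve), that ACKNOWLEDGED can only move to RESOLVED, and that RESOLVED is final. The names `ALLOWED_TRANSITIONS` and `transition` are hypothetical.

```python
# Allowed transitions, inferred from the state descriptions (an assumption).
ALLOWED_TRANSITIONS = {
    "OPEN": {"ACKNOWLEDGED", "RESOLVED"},
    "ACKNOWLEDGED": {"RESOLVED"},
    "RESOLVED": set(),  # assumed terminal: reopening is not described
}


def transition(current, target):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current} to {target}")
    return target


state = "OPEN"
state = transition(state, "ACKNOWLEDGED")  # a team member takes ownership
state = transition(state, "RESOLVED")      # the fix is confirmed
print(state)
```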

Incident Timeline

Every incident maintains a detailed event log:

{
  "id": "inc_abc123",
  "monitorId": "mon_xyz789",
  "type": "FAIL",
  "status": "RESOLVED",
  "timeline": [
    {
      "timestamp": "2025-10-14T10:15:00Z",
      "event": "CREATED",
      "details": {
        "exitCode": 1,
        "output": "Database connection failed"
      }
    },
    {
      "timestamp": "2025-10-14T10:15:02Z",
      "event": "ALERTS_SENT",
      "details": {
        "channels": ["slack", "email"],
        "recipients": 3
      }
    },
    {
      "timestamp": "2025-10-14T10:17:30Z",
      "event": "ACKNOWLEDGED",
      "details": {
        "userId": "user_abc123",
        "userName": "Alice",
        "note": "Checking database logs"
      }
    },
    {
      "timestamp": "2025-10-14T10:22:00Z",
      "event": "NOTE_ADDED",
      "details": {
        "userId": "user_abc123",
        "note": "Found issue: connection pool exhausted"
      }
    },
    {
      "timestamp": "2025-10-14T10:28:00Z",
      "event": "RESOLVED",
      "details": {
        "userId": "user_abc123",
        "note": "Increased connection pool size, backup rerun successful"
      }
    }
  ],
  "mttr": 780  // 13 minutes (780 seconds)
}
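
As a worked check of the `mttr` value, the durations can be derived from the timeline timestamps. This is a minimal sketch that assumes MTTR is measured from the CREATED event to the RESOLVED event, and acknowledgment time from CREATED to ACKNOWLEDGED. The helper names are hypothetical and the timeline is trimmed to the three events that matter.

```python
from datetime import datetime

timeline = [
    {"timestamp": "2025-10-14T10:15:00Z", "event": "CREATED"},
    {"timestamp": "2025-10-14T10:17:30Z", "event": "ACKNOWLEDGED"},
    {"timestamp": "2025-10-14T10:28:00Z", "event": "RESOLVED"},
]


def parse_ts(value):
    """Parse an ISO 8601 UTC timestamp that ends in 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def seconds_between(events, start_event, end_event):
    """Seconds between the first occurrences of two event types."""
    first_seen = {}
    for entry in events:
        first_seen.setdefault(entry["event"], parse_ts(entry["timestamp"]))
    return int((first_seen[end_event] - first_seen[start_event]).total_seconds())


mttr = seconds_between(timeline, "CREATED", "RESOLVED")
ack_time = seconds_between(timeline, "CREATED", "ACKNOWLEDGED")
print(mttr, ack_time)  # 780 150
```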

Deduplication

Saturn prevents alert spam by deduplicating incidents.

Same Incident Type

If a monitor already has an OPEN incident of the same type:

Option 1: Update Existing (default)

Adds new occurrence to timeline
Updates incident count
No new alerts sent

Option 2: Create New

Creates separate incident
Alerts sent for each
Useful for tracking multiple failures

Configure per monitor:

{
  "name": "Critical Job",
  "deduplication": {
    "strategy": "update_existing",  // or "create_new"
    "windowSeconds": 3600  // Dedupe within 1 hour
  }
}

Example: Multiple Fails

Monitor: Database Backup

Timeline:
10:00 - FAIL (Exit code 1) → Incident #1 OPEN, alerts sent
10:05 - FAIL (Exit code 1) → Update Incident #1, no new alerts
10:10 - FAIL (Exit code 1) → Update Incident #1, no new alerts
10:15 - SUCCESS → Auto-resolve Incident #1

Result: 1 incident with 3 failed runs, 1 alert sent
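
The example above can be reproduced with a small simulation. It is a sketch of the described behavior, not Saturn's implementation: it assumes the dedupe window is measured from the time the incident was opened, and that a SUCCESS ping resolves every OPEN incident. The `Incident` class and `process_pings` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    type: str
    opened_at: int        # seconds since midnight
    status: str = "OPEN"
    occurrences: int = 1


def hhmm(text):
    """Convert 'HH:MM' to seconds since midnight."""
    hours, minutes = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60


def process_pings(pings, strategy="update_existing", window_seconds=3600):
    """Return (incidents, alerts_sent) after applying the dedupe strategy."""
    incidents, alerts_sent = [], 0
    for at, result in pings:
        open_incidents = [i for i in incidents if i.status == "OPEN"]
        if result == "SUCCESS":
            for incident in open_incidents:
                incident.status = "RESOLVED"  # auto-resolve on success
            continue
        match = next(
            (i for i in open_incidents
             if i.type == result and at - i.opened_at <= window_seconds),
            None,
        )
        if strategy == "update_existing" and match is not None:
            match.occurrences += 1  # update the incident, no new alert
        else:
            incidents.append(Incident(type=result, opened_at=at))
            alerts_sent += 1
    return incidents, alerts_sent


pings = [(hhmm("10:00"), "FAIL"), (hhmm("10:05"), "FAIL"),
         (hhmm("10:10"), "FAIL"), (hhmm("10:15"), "SUCCESS")]
incidents, alerts = process_pings(pings)
print(len(incidents), incidents[0].occurrences, alerts)  # 1 3 1
```

Running the same pings with `strategy="create_new"` yields three incidents and three alerts, which is the trade-off between the two options.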

Suppression

Temporarily silence alerts without disabling the monitor.

Use Cases

Known Issues: Issue identified, fix in progress
Maintenance Windows: Planned downtime (see Maintenance Windows)
Flaky Jobs: Investigating intermittent failures
External Dependencies: Third-party service down

Manual Suppression

In dashboard:

Open incident
Click Suppress Alerts
Set duration (15 min, 1 hour, 4 hours, 24 hours, custom)
Confirm

Effects:

Incidents still created
No alerts sent
Dashboard still shows incidents
Suppression expires automatically

Suppression Rules

Create rules for recurring suppression:

{
  "monitorId": "mon_xyz789",
  "suppress": {
    "days": ["Saturday", "Sunday"],
    "hours": [0, 1, 2, 3, 4, 5],  // Midnight to 6 AM
    "types": ["LATE"]  // Only suppress LATE incidents
  }
}
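
To make the rule semantics concrete, here is one way such a rule could be evaluated. This is a sketch under stated assumptions: the day, hour, and type conditions must all match, and times are compared in UTC. The document does not specify either point, and `is_suppressed` is a hypothetical helper.

```python
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]  # index matches datetime.weekday()

rule = {
    "monitorId": "mon_xyz789",
    "suppress": {
        "days": ["Saturday", "Sunday"],
        "hours": [0, 1, 2, 3, 4, 5],
        "types": ["LATE"],
    },
}


def is_suppressed(rule, monitor_id, incident_type, when):
    """True if the rule silences alerts for this incident at this time."""
    if rule["monitorId"] != monitor_id:
        return False
    suppress = rule["suppress"]
    return (
        DAY_NAMES[when.weekday()] in suppress["days"]
        and when.hour in suppress["hours"]
        and incident_type in suppress["types"]
    )


saturday_3am = datetime(2025, 10, 18, 3, 0, tzinfo=timezone.utc)
monday_3am = datetime(2025, 10, 20, 3, 0, tzinfo=timezone.utc)
print(is_suppressed(rule, "mon_xyz789", "LATE", saturday_3am))  # True
print(is_suppressed(rule, "mon_xyz789", "FAIL", saturday_3am))  # False
print(is_suppressed(rule, "mon_xyz789", "LATE", monday_3am))    # False
```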

API Example

# Suppress for 2 hours
curl -X POST https://api.saturn.example.com/api/incidents/YOUR_INCIDENT_ID/suppress \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"durationSeconds": 7200, "reason": "Database maintenance in progress"}'

# Remove suppression early
curl -X DELETE https://api.saturn.example.com/api/incidents/YOUR_INCIDENT_ID/suppress \
  -H "Authorization: Bearer YOUR_TOKEN"
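
Because suppression expires automatically, it can help to compute when alerts will resume before choosing a duration. The sketch below assumes the window starts when the suppress call is made and ends exactly `durationSeconds` later. Both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def suppression_expiry(started_at, duration_seconds):
    """Time at which alerts resume."""
    return started_at + timedelta(seconds=duration_seconds)


def alerts_suppressed(now, started_at, duration_seconds):
    """True while the suppression window is still active."""
    return started_at <= now < suppression_expiry(started_at, duration_seconds)


started = datetime(2025, 10, 14, 10, 0, tzinfo=timezone.utc)
expiry = suppression_expiry(started, 7200)  # the 2-hour example above
print(expiry.isoformat())  # 2025-10-14T12:00:00+00:00
```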

Alert Routing

Control who gets alerted for which incidents.

Default Routing

All alert channels receive all incident types.

Advanced Routing Rules

{
  "rules": [
    {
      "name": "Critical to PagerDuty",
      "condition": {
        "severity": ["HIGH"],
        "types": ["MISSED", "FAIL"]
      },
      "channels": ["pagerduty"],
      "escalation": {
        "delayMinutes": 15,
        "fallback": ["slack:oncall"]
      }
    },
    {
      "name": "Anomalies to Slack",
      "condition": {
        "types": ["ANOMALY"]
      },
      "channels": ["slack:monitoring"]
    },
    {
      "name": "Non-urgent to Email",
      "condition": {
        "severity": ["LOW", "MEDIUM"],
        "types": ["LATE"]
      },
      "channels": ["email"]
    }
  ]
}
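
The following sketch shows how rules like these might be matched against an incident. It rests on several assumptions the document does not confirm: incidents carry a `severity` field, a condition key that is absent matches any value, and every matching rule contributes its channels. Escalation is left out, and `matches` and `route` are hypothetical names.

```python
RULES = [
    {"name": "Critical to PagerDuty",
     "condition": {"severity": ["HIGH"], "types": ["MISSED", "FAIL"]},
     "channels": ["pagerduty"]},
    {"name": "Anomalies to Slack",
     "condition": {"types": ["ANOMALY"]},
     "channels": ["slack:monitoring"]},
    {"name": "Non-urgent to Email",
     "condition": {"severity": ["LOW", "MEDIUM"], "types": ["LATE"]},
     "channels": ["email"]},
]


def matches(condition, incident):
    """A missing condition key is treated as 'match anything' (assumption)."""
    severity_ok = ("severity" not in condition
                   or incident["severity"] in condition["severity"])
    type_ok = "types" not in condition or incident["type"] in condition["types"]
    return severity_ok and type_ok


def route(incident, rules):
    """Collect channels from every matching rule, preserving rule order."""
    channels = []
    for rule in rules:
        if matches(rule["condition"], incident):
            for channel in rule["channels"]:
                if channel not in channels:
                    channels.append(channel)
    return channels


print(route({"severity": "HIGH", "type": "FAIL"}, RULES))       # ['pagerduty']
print(route({"severity": "MEDIUM", "type": "ANOMALY"}, RULES))  # ['slack:monitoring']
print(route({"severity": "HIGH", "type": "LATE"}, RULES))       # []
```

The last call returns no channels, which shows that a HIGH severity LATE incident falls through all three rules. Whether such an incident would then use default routing is not stated here, so it is worth testing a rule set for gaps like this.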

Incident Notes

Add investigation notes to incidents:

Via Dashboard

Open incident
Scroll to "Notes" section
Type note and click Add

Via API

curl -X POST https://api.saturn.example.com/api/incidents/YOUR_INCIDENT_ID/notes \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"note": "Restarted service, monitoring for next run"}'

Markdown Support

Notes support markdown:

## Investigation

- [x] Checked server logs
- [x] Verified disk space
- [ ] Review with database team

**Root cause**: Connection pool exhausted during peak load

**Fix**: Increased pool size from 10 to 20

Bulk Actions

Manage multiple incidents at once:

Acknowledge All

# Acknowledge all OPEN incidents for a monitor
curl -X POST https://api.saturn.example.com/api/monitors/YOUR_MONITOR_ID/incidents/acknowledge-all \
  -H "Authorization: Bearer YOUR_TOKEN"

Auto-Resolve on Success

Default behavior: Next successful ping auto-resolves OPEN incidents.

Disable per monitor:

{
  "name": "My Monitor",
  "autoResolve": false  // Require manual resolution
}

Incident Reports

Export incident data for analysis:

Via Dashboard

Go to Incidents page
Apply filters (date range, monitor, type)
Click Export CSV or Export JSON

Via API

curl -X GET "https://api.saturn.example.com/api/incidents?from=2025-10-01&to=2025-10-14&format=csv" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  > incidents.csv
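
When exports are scheduled from a script, building the query string programmatically avoids quoting mistakes. This sketch uses the `from`, `to`, and `format` parameters from the curl example. The helper name `export_url` is hypothetical, and `format=json` is an assumption based on the dashboard's Export JSON option.

```python
from datetime import date
from urllib.parse import urlencode

API_BASE = "https://api.saturn.example.com/api"  # base URL from the curl example


def export_url(date_from, date_to, fmt="csv"):
    """Build the incident export URL for an inclusive date range."""
    query = urlencode({
        "from": date_from.isoformat(),
        "to": date_to.isoformat(),
        "format": fmt,
    })
    return f"{API_BASE}/incidents?{query}"


url = export_url(date(2025, 10, 1), date(2025, 10, 14))
print(url)
```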

Best Practices

✅ Do

Acknowledge promptly to prevent duplicate work
Add notes during investigation for future reference
Review patterns weekly to identify recurring issues
Set up routing to avoid alert fatigue
Use suppression for known issues during fixes

❌ Don't

Auto-acknowledge everything — defeats the purpose
Ignore low-severity incidents — they can indicate larger problems
Delete incidents — historical data is valuable
Over-suppress — might miss real issues

Incident Metrics

Track incident management effectiveness:

Metric	Formula	Target
MTTR	Time from OPEN to RESOLVED	< 30 min
Acknowledgment Time	Time from OPEN to ACKNOWLEDGED	< 5 min
Resolution Rate	Resolved incidents / Total incidents	> 95%
Recurring Incidents	Share of incidents repeating the same issue within 7 days	< 10%

View in Analytics → Incidents dashboard.
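
For teams that want to check these targets against exported data, the first three metrics can be computed as follows. The records are made-up sample data, and the field names `mttr`, `ackSeconds`, and `status` are assumptions about the export shape, with durations in seconds.

```python
from statistics import mean

# Hypothetical sample records; None means the event never happened.
incidents = [
    {"status": "RESOLVED", "mttr": 780, "ackSeconds": 150},
    {"status": "RESOLVED", "mttr": 1200, "ackSeconds": 240},
    {"status": "ACKNOWLEDGED", "mttr": None, "ackSeconds": 90},
    {"status": "RESOLVED", "mttr": 420, "ackSeconds": None},  # auto-resolved
]

mean_mttr = mean(i["mttr"] for i in incidents if i["mttr"] is not None)
mean_ack = mean(i["ackSeconds"] for i in incidents if i["ackSeconds"] is not None)
resolution_rate = sum(i["status"] == "RESOLVED" for i in incidents) / len(incidents)

# Compare against the targets in the table above.
report = {
    "MTTR < 30 min": mean_mttr < 30 * 60,
    "Acknowledgment Time < 5 min": mean_ack < 5 * 60,
    "Resolution Rate > 95%": resolution_rate > 0.95,
}
print(mean_mttr, mean_ack, resolution_rate)  # 800 160 0.75
print(report)
```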

Next Steps

Maintenance Windows — Schedule planned downtime
Alert Channels — Configure notification channels
Analytics — Track MTTR and MTBF trends
