# Incident Types

Saturn creates incidents automatically when jobs deviate from expected behavior. Understanding each type helps you respond appropriately.

## Incident Types
### MISSED

**Trigger:** No ping received within the expected schedule plus grace period

**Meaning:** The job didn't run at all, or couldn't reach Saturn

**Common Causes:**

- Cron daemon not running
- Job disabled or commented out
- Server down or unreachable
- Network connectivity issues
- DNS resolution failure
**Example:**

```
Incident: MISSED
Monitor: Daily Backup
Expected: 2025-10-14 03:00:00 UTC
Grace Period: 30 minutes
Status: No ping received by 03:30:00 UTC
```
**Response Actions:**

- Check if the server/service is running
- Verify cron configuration (`crontab -l`)
- Check system logs for errors
- Test network connectivity to the Saturn API
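To make the trigger concrete, here is a minimal sketch of the check behind a MISSED incident: compare the expected run time plus grace period against the most recent ping. The function and field names (`is_missed`, `expected_at`, `last_ping_at`) are assumptions for illustration, not Saturn's internals.

```python
from datetime import datetime, timedelta, timezone

def is_missed(expected_at: datetime, grace: timedelta,
              last_ping_at: datetime | None) -> bool:
    """True if no ping arrived by expected time + grace period (illustrative)."""
    deadline = expected_at + grace
    if datetime.now(timezone.utc) < deadline:
        return False  # still inside the grace period, nothing to report yet
    return last_ping_at is None or last_ping_at < expected_at

# Example matching the Daily Backup scenario above: expected 03:00 UTC,
# 30-minute grace period, and no ping received at all.
expected = datetime(2025, 10, 14, 3, 0, tzinfo=timezone.utc)
print(is_missed(expected, timedelta(minutes=30), last_ping_at=None))
```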
### LATE

**Trigger:** Ping received after the grace period expired

**Meaning:** The job ran, but started or finished late

**Common Causes:**

- Server under heavy load
- Resource constraints (CPU/memory)
- Dependency delays (database, API, etc.)
- Grace period too short
- Clock skew between systems
**Example:**

```
Incident: LATE
Monitor: Hourly Sync
Expected: 2025-10-14 10:00:00 UTC
Grace Period: 5 minutes
Actual Ping: 10:07:23 UTC (7 minutes late)
```
**Response Actions:**

- Review job duration trends in analytics
- Check server resources during the incident window
- Consider increasing the grace period if the job is consistently late (see the sketch below)
- Investigate dependencies that might be slow
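If a job is consistently a few minutes late, the fix is often a more realistic grace period rather than a faster job. Below is a rough, illustrative heuristic for sizing the grace period from historical start delays; the `suggested_grace_minutes` helper and the 1.5x headroom factor are assumptions, not a Saturn feature.

```python
import statistics

def suggested_grace_minutes(start_delays_min: list[float],
                            headroom: float = 1.5) -> float:
    """Suggest a grace period from historical start delays (rough heuristic).

    Uses the ~95th-percentile delay plus headroom so routine jitter stops
    triggering LATE incidents while genuine problems still do.
    """
    p95 = statistics.quantiles(start_delays_min, n=20)[-1]  # ~95th percentile
    return round(p95 * headroom, 1)

# Example: a job that usually starts 1-4 minutes late, spiking to ~7 under load
delays = [1.2, 2.0, 1.8, 3.5, 2.2, 7.1, 1.9, 2.4, 3.0, 6.8]
print(suggested_grace_minutes(delays))  # ~10.9 minutes
```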
### FAIL

**Trigger:** Explicit fail ping received (exit code ≠ 0)

**Meaning:** The job ran but reported failure

**Common Causes:**

- Application logic error
- External service unavailable
- Invalid input data
- Permission issues
- Disk space full
- Database connection failed
**Example:**

```
Incident: FAIL
Monitor: Database Backup
Exit Code: 1
Duration: 3.2 seconds
Output: ERROR: Connection to database failed after 3 retries
Host: db.example.com:5432
Reason: Connection timeout
```
**Response Actions:**

- Review captured output for error details
- Check external dependencies (databases, APIs)
- Verify permissions and credentials
- Review application logs
- Run the job manually to reproduce the failure
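A common pattern is to wrap the job so its exit code and output are reported explicitly. The sketch below is illustrative only: the ping URLs and the `/fail` path are placeholders, so substitute the real ping endpoints shown for your monitor in the Saturn dashboard.

```python
import subprocess
import urllib.request

# Placeholder endpoints -- the URL format is an assumption for illustration;
# use the ping URLs Saturn shows for your monitor.
SUCCESS_URL = "https://saturn.example.com/ping/mon_xyz789"
FAIL_URL = "https://saturn.example.com/ping/mon_xyz789/fail"

def run_and_report(cmd: list[str]) -> int:
    """Run a job, then report success or failure along with captured output."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    url = SUCCESS_URL if proc.returncode == 0 else FAIL_URL
    body = (proc.stdout + proc.stderr)[-2000:].encode()  # last ~2 KB of output
    req = urllib.request.Request(url, data=body, method="POST")
    urllib.request.urlopen(req, timeout=10)
    return proc.returncode

if __name__ == "__main__":
    raise SystemExit(run_and_report(["pg_dump", "-h", "db.example.com", "mydb"]))
```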
### ANOMALY

**Trigger:** Job succeeded but behaved abnormally

**Meaning:** Statistical analysis detected a deviation from the monitor's baseline

**Common Causes:**

- Performance degradation
- Increased data volume
- Resource contention
- Network latency spike
- Configuration change
- Code regression
**Anomaly Sub-Types:**

**Duration Anomaly (Z-Score)**

```
Incident: ANOMALY (Duration)
Monitor: ETL Pipeline
Duration: 47 minutes (typical: 12 minutes)
Z-Score: 4.2 (threshold: 3.0)

Analysis:
  Mean: 12.3 minutes
  Std Dev: 8.2 minutes
  This run: 4.2 standard deviations above mean
```

**Duration Anomaly (Median Multiplier)**

```
Incident: ANOMALY (Duration)
Monitor: Report Generation
Duration: 89 seconds (median: 15 seconds)
Multiplier: 5.9x (threshold: 5.0x)

Analysis:
  Median: 15 seconds
  This run: 5.9 times the median
```

**Output Size Drop**

```
Incident: ANOMALY (Output Size)
Monitor: Data Export
Output: 234 bytes (median: 45 KB)
Drop: 99.5% (threshold: 50%)

Analysis:
  Median Output: 45,231 bytes
  This run: 234 bytes
  Possible empty export or early termination
```
**Response Actions:**

- Review analytics for performance trends
- Compare with recent successful runs
- Check for infrastructure changes
- Investigate data volume changes
- Review recent code deployments
- Use Anomaly Tuning to adjust sensitivity
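To see how the three sub-types relate, here is a minimal sketch of the underlying checks, assuming the thresholds shown in the examples above (z-score 3.0, 5.0x median multiplier, 50% output drop). The actual defaults are configurable in Anomaly Tuning, and the function name and history format here are illustrative, not Saturn's implementation.

```python
import statistics

# Thresholds matching the examples above; treat these values as assumptions.
Z_THRESHOLD = 3.0
MEDIAN_MULTIPLIER = 5.0
OUTPUT_DROP_FRACTION = 0.5

def anomaly_reasons(durations_s: list[float], run_s: float,
                    output_sizes: list[int], run_bytes: int) -> list[str]:
    """Flag a successful run that deviates from its historical baseline."""
    reasons = []
    mean, stdev = statistics.mean(durations_s), statistics.stdev(durations_s)
    if stdev > 0 and (run_s - mean) / stdev > Z_THRESHOLD:
        reasons.append("duration z-score")
    median = statistics.median(durations_s)
    if median > 0 and run_s / median > MEDIAN_MULTIPLIER:
        reasons.append("duration median multiplier")
    median_out = statistics.median(output_sizes)
    if median_out > 0 and run_bytes < median_out * (1 - OUTPUT_DROP_FRACTION):
        reasons.append("output size drop")
    return reasons

# Example: a report job that normally takes ~15 s and writes ~45 KB,
# then one run takes 89 s and writes only 234 bytes.
print(anomaly_reasons([14, 15, 16, 15, 13], 89, [45000, 46000, 44500], 234))
```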
## Incident Severity
| Type | Default Severity | Description |
|---|---|---|
| MISSED | High | Job didn't run — immediate action needed |
| LATE | Medium | Job ran late — monitor for patterns |
| FAIL | High | Job failed — investigate and fix |
| ANOMALY | Low-Medium | Job succeeded but unusual — investigate if recurring |
Severity affects:
- Alert priority
- Escalation rules
- Dashboard sorting
## Incident Metadata

Every incident includes:
```json
{
  "id": "inc_abc123",
  "type": "ANOMALY",
  "monitorId": "mon_xyz789",
  "severity": "MEDIUM",
  "status": "OPEN",
  "createdAt": "2025-10-14T10:15:00Z",
  "acknowledgedAt": null,
  "resolvedAt": null,
  "details": {
    "zScore": 3.9,
    "meanMs": 9800,
    "stddevMs": 1100,
    "durationMs": 14050,
    "rule": "zscore>3"
  },
  "runId": "run_def456",
  "alertsSent": ["slack", "email"]
}
```
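If you consume incidents programmatically, the payload is plain JSON and can be routed however you like. The sketch below is purely illustrative; the routing rules and the `triage` helper are assumptions, not Saturn behavior.

```python
import json

HIGH_SEVERITY_TYPES = {"MISSED", "FAIL"}

def triage(payload: str) -> str:
    """Decide what to do with an incoming incident payload (illustrative routing)."""
    incident = json.loads(payload)
    if incident["type"] in HIGH_SEVERITY_TYPES or incident["severity"] == "HIGH":
        return f"page on-call for {incident['monitorId']} ({incident['id']})"
    if incident["type"] == "ANOMALY":
        details = incident.get("details", {})
        return f"file ticket: z-score {details.get('zScore')} on {incident['monitorId']}"
    return "log and review during business hours"

# Using a trimmed version of the example payload above
example = ('{"id": "inc_abc123", "type": "ANOMALY", "monitorId": "mon_xyz789",'
           ' "severity": "MEDIUM", "details": {"zScore": 3.9}}')
print(triage(example))
```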
## Incident Timeline
Each incident maintains an event log:
- Created: Incident detected
- Alerts Sent: Notification channels triggered
- Acknowledged: Team member acknowledged
- Notes Added: Investigation notes
- Resolved: Issue fixed and incident closed
View the full timeline in the dashboard.
## Multiple Incidents

The same monitor can have multiple incidents, and more than one can be open at a time:

```
Monitor: Nightly Backup
Incident 1: LATE (Oct 13) - RESOLVED
Incident 2: ANOMALY (Oct 14) - OPEN
Incident 3: FAIL (Oct 14) - OPEN
```
Each is tracked independently.
## Deduplication
Saturn prevents alert spam:
- MISSED: One alert per grace period expiration
- LATE: One alert per late ping
- FAIL: One alert per failed run
- ANOMALY: One alert per anomalous run
If a monitor is already in incident state, new pings update the existing incident rather than creating duplicates (configurable).
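Conceptually, deduplication behaves like keeping one open incident per monitor and type, and attaching new evidence to it instead of raising a new alert. Below is a minimal sketch, assuming an in-memory store keyed by (monitor, type); the real logic runs server-side in Saturn and is configurable as noted above.

```python
# Open incidents keyed by (monitor_id, incident_type) -- illustrative only.
open_incidents: dict[tuple[str, str], dict] = {}

def record_event(monitor_id: str, incident_type: str, run_id: str) -> dict:
    """Update an existing open incident instead of creating a duplicate."""
    key = (monitor_id, incident_type)
    if key in open_incidents:
        incident = open_incidents[key]
        incident["runIds"].append(run_id)  # attach the new run, no new alert
        return incident
    incident = {"monitorId": monitor_id, "type": incident_type,
                "status": "OPEN", "runIds": [run_id]}
    open_incidents[key] = incident         # first occurrence: create and alert
    return incident

record_event("mon_xyz789", "FAIL", "run_001")
print(record_event("mon_xyz789", "FAIL", "run_002"))  # same incident, two runs attached
```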
## Next Steps
- Incident Lifecycle — Managing incidents from open to resolved
- Maintenance Windows — Suppressing alerts during maintenance
- Anomalies — Deep dive on anomaly detection