Anomaly Detection Overview
Anomaly detection is what makes Saturn different. While other monitors only tell you when jobs fail, Saturn alerts you when jobs succeed but behave abnormally — often catching problems before they escalate to failures.
The Problem with Binary Monitoring
Traditional monitoring is binary: ✅ Success or ❌ Failure.
This misses critical signals:
Day 1: Backup completes in 10 minutes ✅
Day 2: Backup completes in 12 minutes ✅
Day 3: Backup completes in 45 minutes ✅ ← Problem brewing
Day 4: Backup times out after 60 minutes ❌ ← Too late!
With anomaly detection, you're alerted on Day 3 when the job is still succeeding but taking 3.5x longer than normal.
How It Works
Step 1: Baseline Collection
For the first 10 successful runs, Saturn collects data without triggering anomaly alerts:
- Duration (milliseconds)
- Output size (bytes)
- Exit code
- Timestamp
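A minimal sketch of what one of these per-run samples might look like; the type and field names below are illustrative, not Saturn's actual schema:

```typescript
// Illustrative shape of a single baseline sample (field names are
// assumptions, not Saturn's actual schema).
interface RunSample {
  durationMs: number;   // duration in milliseconds
  outputBytes: number;  // output size in bytes
  exitCode: number;     // exit code reported by the job
  timestamp: string;    // ISO-8601 completion time
}

const sample: RunSample = {
  durationMs: 12 * 60 * 1000, // a 12-minute run
  outputBytes: 51_200,
  exitCode: 0,
  timestamp: "2024-01-15T02:00:00Z",
};
```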
Step 2: Statistical Calculation
Using Welford's online algorithm (see Welford's Algorithm), Saturn calculates:
- Mean (μ): Average duration
- Standard Deviation (σ): Measure of variability
- Median: Middle value
- Min/Max: Performance bounds
These are updated incrementally with each run — no need to store all historical data.
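A minimal sketch of a Welford-style update for the mean and standard deviation (median and min/max tracking omitted); it illustrates the technique, not Saturn's internal code:

```typescript
// Welford's online algorithm: each new duration updates the running
// statistics in O(1), with no need to store the full history.
class RunningStats {
  private n = 0;
  private mean = 0;
  private m2 = 0; // running sum of squared deviations from the mean

  add(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  get average(): number {
    return this.mean;
  }

  get stdDev(): number {
    return this.n > 1 ? Math.sqrt(this.m2 / (this.n - 1)) : 0;
  }
}
```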
Step 3: Anomaly Detection
Each new run is compared against the baseline using multiple rules (see Anomaly Rules):
- Z-Score Rule: Duration > mean + 3σ
- Median Multiplier: Duration > 5× median duration
- Output Size Drop: Output size < 50% of median output size
If any rule triggers, an ANOMALY incident is created.
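A sketch of how these rules could be evaluated against a stored baseline; the thresholds match the defaults in the configuration shown later, but the types and function are illustrative:

```typescript
// Illustrative rule check: returns the names of any rules the new run trips.
interface Baseline {
  meanMs: number;
  stdDevMs: number;
  medianMs: number;
  medianOutputBytes: number;
}

function detectAnomalies(
  run: { durationMs: number; outputBytes: number },
  b: Baseline,
): string[] {
  const triggered: string[] = [];

  // Z-Score Rule: duration > mean + 3σ
  if (run.durationMs > b.meanMs + 3 * b.stdDevMs) triggered.push("zScore");

  // Median Multiplier: duration > 5× median duration
  if (run.durationMs > 5 * b.medianMs) triggered.push("medianMultiplier");

  // Output Size Drop: output < 50% of median output size
  if (run.outputBytes < 0.5 * b.medianOutputBytes) triggered.push("outputSizeDrop");

  return triggered; // any non-empty result opens an ANOMALY incident
}
```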
Why Statistics Matter
Example: Database Backup Job
Historical Performance (50 runs):
- Mean: 12.3 minutes
- Std Dev: 1.8 minutes
- Median: 12.1 minutes
Today's Run:
- Duration: 24.7 minutes
- Exit code: 0 (success)
- Output: "Backup completed"
Analysis:
Z-Score = (24.7 - 12.3) / 1.8 = 6.9
Threshold: 3.0
Result: ANOMALY (6.9 > 3.0)
Incident Message:
ANOMALY: Database Backup ran 6.9 standard deviations slower than normal
Duration: 24.7 minutes (typical: 12.3 ± 1.8 minutes)
Possible causes:
- Increased data volume
- Database performance degradation
- Resource contention
- Network latency
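The same arithmetic as a standalone check, using the numbers from the analysis above:

```typescript
// z = (observed - mean) / stdDev, compared against the 3.0 threshold.
const observedMin = 24.7;
const meanMin = 12.3;
const stdDevMin = 1.8;

const zScore = (observedMin - meanMin) / stdDevMin; // ≈ 6.9
console.log(zScore > 3.0 ? "ANOMALY" : "OK");       // "ANOMALY"
```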
Baseline Requirements
| Requirement | Value | Reason |
|---|---|---|
| Minimum runs | 10 | Statistical validity |
| Successful runs only | Required | Failed runs skew statistics |
| Recent data priority | Last 100 runs | Adapts to changing patterns |
View baseline status in monitor settings. You'll see:
- "Collecting baseline (5/10 runs)"
- "Baseline established ✓"
What Gets Detected
Performance Degradation
Normal: 10-15 minutes
Anomaly: 45 minutes (still succeeds)
Cause: Database query optimization lost after schema change
Data Volume Changes
Normal output: 50 KB
Anomaly: 2 KB
Cause: Only 10 records exported instead of 5,000 (silent data loss)
Resource Contention
Normal: Consistent 12 minutes
Anomaly: Varies 8-35 minutes with high variance
Cause: Shared CPU with other services
Early Warning Signs
Week 1: Mean 10 min, Z-Score 0.5 ✓
Week 2: Mean 12 min, Z-Score 1.2 ✓
Week 3: Mean 15 min, Z-Score 2.1 ✓
Week 4: Mean 20 min, Z-Score 3.4 🚨 ANOMALY
Real-World Examples
Case 1: ETL Pipeline
Scenario: Data pipeline runs nightly, pulling from external API
Normal: 8-10 minutes, processing 50,000 records
Anomaly Detected: Duration 45 minutes, Z-Score 12.4
Investigation: API rate limits silently reduced, job spent 40 minutes in retries
Outcome: Caught before customer-facing reports were affected
Case 2: WordPress Site Backups
Scenario: Backup plugin runs hourly
Normal: 15-20 seconds
Anomaly Detected: Duration 8 minutes, median multiplier 24x
Investigation: Backup location disk full, job retrying writes
Outcome: Fixed before backup failed completely
Case 3: Kubernetes CronJob
Scenario: Log aggregation job
Normal: Output 2-3 MB compressed logs
Anomaly Detected: Output 45 bytes
Investigation: LogStash container not starting, job exited with success but no logs
Outcome: Silent failure detected, would have been missed by traditional monitoring
Benefits
1. Early Detection
Catch issues before they become failures:
- Performance regressions
- Capacity problems
- Configuration drift
- Silent data loss
2. Reduced Downtime
Traditional monitoring: Alert when job fails
Saturn: Alert when job slows down → fix before failure
Average 40% reduction in downtime with anomaly detection.
3. Context-Aware Alerts
Instead of "Backup failed", you get:
ANOMALY: Backup took 47 min (typical: 12 min, +29x stddev)
Last 5 runs: 11, 12, 13, 12, 47 minutes
Likely cause: Data volume spike or resource constraint
4. Self-Adjusting
Baselines automatically adapt to changing conditions:
- More data over time → higher durations expected
- Optimization applied → baseline shifts down
- No manual threshold tuning required
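Reusing the RunningStats sketch from Step 2, feeding each successful run back into the baseline nudges the statistics toward current behavior with no manual tuning:

```typescript
// Each successful run updates the baseline, so the statistics drift
// with the job's current behavior (assumes the RunningStats class above).
const baseline = new RunningStats();
[10, 11, 12, 12, 13].forEach((minutes) => baseline.add(minutes));
console.log(baseline.average.toFixed(1)); // "11.6"

baseline.add(14); // data volume grows → the baseline shifts up on its own
console.log(baseline.average.toFixed(1)); // "12.0"
```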
Limitations
Not a Silver Bullet
Anomaly detection works best for:
- ✅ Consistent, predictable jobs
- ✅ Jobs with stable patterns
- ✅ Long-running jobs (> 10 seconds)
Less effective for:
- ❌ Highly variable jobs (use median multiplier instead)
- ❌ Very short jobs (< 1 second)
- ❌ Brand new jobs (need baseline first)
False Positives
Legitimate changes can trigger anomalies:
- Data volume doubled (expected)
- New feature added more processing
- Infrastructure upgraded (faster)
Use Anomaly Tuning to reduce false positives.
Configuration
Enable/disable anomaly detection per monitor:
```json
{
  "name": "My Monitor",
  "anomalyDetection": {
    "enabled": true,
    "rules": {
      "zScore": {
        "enabled": true,
        "threshold": 3.0        // Standard deviations
      },
      "medianMultiplier": {
        "enabled": true,
        "threshold": 5.0        // Times median
      },
      "outputSizeDrop": {
        "enabled": true,
        "threshold": 0.5        // 50% drop
      }
    }
  }
}
```
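If you build this configuration in code, the JSON above maps onto a structure like the following; this is a type sketch derived from the example, not an official SDK type:

```typescript
// Type sketch mirroring the JSON configuration above (not an official SDK type).
interface AnomalyRuleConfig {
  enabled: boolean;
  threshold: number;
}

interface MonitorConfig {
  name: string;
  anomalyDetection: {
    enabled: boolean;
    rules: {
      zScore: AnomalyRuleConfig;           // threshold in standard deviations
      medianMultiplier: AnomalyRuleConfig; // threshold as a multiple of the median
      outputSizeDrop: AnomalyRuleConfig;   // threshold as a fraction of median output size
    };
  };
}

const config: MonitorConfig = {
  name: "My Monitor",
  anomalyDetection: {
    enabled: true,
    rules: {
      zScore: { enabled: true, threshold: 3.0 },
      medianMultiplier: { enabled: true, threshold: 5.0 },
      outputSizeDrop: { enabled: true, threshold: 0.5 },
    },
  },
};
```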
Analytics Dashboard
View anomaly trends:
- Go to monitor detail page
- Click Analytics tab
- See:
  - Duration distribution (histogram)
  - Z-Score over time
  - Anomaly frequency
  - Statistical summary
Next Steps
- Anomaly Rules — Deep dive on detection rules
- Welford's Algorithm — How statistics are calculated
- Anomaly Tuning — Reduce false positives and optimize detection