Anomaly Detection Rules
Saturn uses multiple statistical rules to detect different types of anomalies. Each rule targets specific failure patterns.
Rule 1: Z-Score (Duration)
Detects: Unusually slow execution times
Formula:
Z-Score = (duration - mean) / stddev
Trigger if: Z-Score > 3.0
Best for: Jobs with consistent, predictable durations
Example
Baseline (50 runs):
- Mean (μ): 12.5 minutes
- Std Dev (σ): 2.1 minutes
Current run:
- Duration: 19.8 minutes
Calculation:
Z-Score = (19.8 - 12.5) / 2.1
= 7.3 / 2.1
= 3.48
3.48 > 3.0 → ANOMALY
Incident Message:
ANOMALY: Duration outlier detected
Duration: 19.8 minutes
Mean: 12.5 minutes
Std Dev: 2.1 minutes
Z-Score: 3.48 (threshold: 3.0)
This run was 3.5 standard deviations slower than average
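The calculation above can be reproduced with a few lines of Python. This is a minimal sketch, not Saturn's implementation: the function names and the zero-spread guard are assumptions made for illustration.

```python
def z_score(duration, mean, stddev):
    """Return how many standard deviations `duration` lies above the mean."""
    if stddev <= 0:
        # Assumption: with no spread in the baseline the score is undefined,
        # so this sketch reports 0.0 (no signal) instead of dividing by zero.
        return 0.0
    return (duration - mean) / stddev


def is_duration_anomaly(duration, mean, stddev, threshold=3.0):
    """Apply the documented trigger: Z-Score > threshold."""
    return z_score(duration, mean, stddev) > threshold


score = z_score(19.8, 12.5, 2.1)
print(f"Z-Score: {score:.2f}")                # Z-Score: 3.48
print(is_duration_anomaly(19.8, 12.5, 2.1))   # True
```

Because the trigger is one-sided, a run that is unusually fast produces a negative score and never fires this rule.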
Z-Score Interpretation
| Z-Score | Share of runs at or above (normal distribution) | Interpretation |
|---|---|---|
| 1.0 | ~16% of runs | Normal variance |
| 2.0 | ~2.3% of runs | Noteworthy |
| 3.0 | ~0.13% of runs | Anomalous |
| 4.0 | ~0.003% of runs | Highly anomalous |
| 5.0+ | < 0.0001% of runs | Extreme outlier |
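The rounded figures in this table correspond to the upper tail of a standard normal distribution. The sketch below shows one way to compute them with Python's standard library, assuming durations are roughly normally distributed; `upper_tail_probability` is a name chosen for this example.

```python
import math


def upper_tail_probability(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


for z in (1.0, 2.0, 3.0, 4.0, 5.0):
    share = upper_tail_probability(z) * 100
    print(f"z = {z:.1f}: about {share:.5f}% of runs exceed this score")
```

Real job durations are often right-skewed, so these percentages are a guide to threshold choice and not a guarantee of the alert rate.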
Configuration
{
  "anomalyDetection": {
    "rules": {
      "zScore": {
        "enabled": true,
        "threshold": 3.0, // Adjust for sensitivity
        "applyTo": "duration"
      }
    }
  }
}
Tuning:
- threshold: 2.5 → More sensitive (more alerts)
- threshold: 4.0 → Less sensitive (fewer alerts)
Rule 2: Median Multiplier (Duration)
Detects: Extreme duration spikes, even for variable jobs
Formula:
Multiplier = duration / median
Trigger if: Multiplier > 5.0
Best for: Jobs with high variance or skewed distributions
Example
Baseline (50 runs):
- Median: 8.2 minutes
- Sample of 10 durations: 5, 6, 7, 8, 8, 9, 10, 12, 15, 45 minutes (one outlier)
Current run:
- Duration: 42.0 minutes
Calculation:
Multiplier = 42.0 / 8.2
= 5.12
5.12 > 5.0 → ANOMALY
Incident Message:
ANOMALY: Duration 5.1x higher than median
Duration: 42.0 minutes
Median: 8.2 minutes
This run took over 5 times longer than typical runs
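The same check can be written in Python. This is an illustrative sketch under the documented formula and default threshold; the function names are hypothetical and the positive-median guard is an assumption.

```python
def median_multiplier(duration, median):
    """Return duration / median, the ratio Rule 2 compares to its threshold."""
    if median <= 0:
        # Assumption: a non-positive median means there is no usable baseline.
        raise ValueError("median must be positive")
    return duration / median


def is_duration_spike(duration, median, threshold=5.0):
    """Apply the documented trigger: Multiplier > threshold."""
    return median_multiplier(duration, median) > threshold


ratio = median_multiplier(42.0, 8.2)
print(f"Multiplier: {ratio:.2f}")            # Multiplier: 5.12
print(is_duration_spike(42.0, 8.2))          # True
print(is_duration_spike(42.0, 8.2, 10.0))    # False
```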
Why Median vs Mean?
Mean is affected by outliers:
Runs: 8, 9, 10, 11, 12, 120 minutes
Mean: 28.3 minutes (skewed by outlier)
Median: 10.5 minutes (represents typical runs)
Median is robust:
- Not affected by extreme values
- Better for variable jobs
- More intuitive ("middle value")
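The difference is easy to verify with Python's `statistics` module. The snippet below reuses the six runs from the example and, as an added illustration, shows both statistics with the outlier removed.

```python
import statistics

runs = [8, 9, 10, 11, 12, 120]   # minutes, from the example above
typical = runs[:-1]              # the same runs without the 120-minute outlier

mean_with = statistics.mean(runs)
median_with = statistics.median(runs)
mean_without = statistics.mean(typical)
median_without = statistics.median(typical)

print(f"With outlier:    mean={mean_with:.1f}, median={median_with}")   # mean=28.3, median=10.5
print(f"Without outlier: mean={mean_without}, median={median_without}")  # mean=10, median=10
```

One extreme run moves the mean by more than 18 minutes and the median by only half a minute.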
Configuration
{
  "anomalyDetection": {
    "rules": {
      "medianMultiplier": {
        "enabled": true,
        "threshold": 5.0, // Times the median
        "applyTo": "duration"
      }
    }
  }
}
Tuning:
- threshold: 3.0 → Catch 3x slowdowns
- threshold: 10.0 → Only catch extreme spikes
Rule 3: Output Size Drop
Detects: Silent failures with significantly reduced output
Formula:
Drop Percentage = 1 - (current_size / median_size)
Trigger if: Drop > 50%
Best for: Jobs that export data or generate reports
Example
Baseline (50 runs):
- Median output: 45,231 bytes (45 KB)
Current run:
- Output: 234 bytes
- Exit code: 0 (success)
Calculation:
Drop = 1 - (234 / 45231)
= 1 - 0.0052
= 0.9948 (99.48%)
99.48% > 50% → ANOMALY
Incident Message:
ANOMALY: Output size dropped 99.5%
Current output: 234 bytes
Median output: 45,231 bytes
Job may have terminated early or produced incomplete results
Why This Matters
Silent data loss scenarios:
Example 1: Database Export
Normal: 50,000 records → 5 MB JSON
Anomaly: 3 records → 500 bytes
Cause: Database query filtered by wrong date, returned almost nothing
Example 2: Report Generation
Normal: 200-page PDF → 2 MB
Anomaly: Empty PDF → 1 KB
Cause: Data source unavailable, generated header-only PDF
Example 3: Log Aggregation
Normal: 1000 log files → 50 MB compressed
Anomaly: 0 logs → 100 bytes (empty archive)
Cause: Log rotation broke symlinks, no logs collected
Configuration
{
  "anomalyDetection": {
    "rules": {
      "outputSizeDrop": {
        "enabled": true,
        "threshold": 0.5, // 50% drop
        "applyTo": "output",
        "minMedianBytes": 1024 // Ignore for small outputs
      }
    }
  }
}
Tuning:
- threshold: 0.3 → Alert on 30% drop
- threshold: 0.8 → Only alert on 80%+ drop
- minMedianBytes: 10240 → Ignore if median < 10 KB
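To see how `threshold` and `minMedianBytes` interact, here is a rough Python sketch of the rule. It follows the documented formula, but the function names are hypothetical, and the guard reflects an assumed reading of `minMedianBytes`: the rule is skipped when the median output is below that size.

```python
def output_size_drop(current_size, median_size):
    """Drop fraction: 1 - (current_size / median_size)."""
    return 1 - (current_size / median_size)


def is_output_anomaly(current_size, median_size, threshold=0.5, min_median_bytes=1024):
    """Apply the documented trigger, skipping monitors with tiny baselines."""
    # Assumed semantics: no alert when the median is below min_median_bytes.
    # The `<= 0` check also avoids dividing by zero.
    if median_size <= 0 or median_size < min_median_bytes:
        return False
    return output_size_drop(current_size, median_size) > threshold


drop = output_size_drop(234, 45231)
print(f"Drop: {drop:.2%}")                             # Drop: 99.48%
print(is_output_anomaly(234, 45231))                   # True
print(is_output_anomaly(10, 500))                      # False (median below 1024 bytes)
print(is_output_anomaly(30000, 45231, threshold=0.3))  # True (about a 34% drop)
```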
Rule Combinations
Multiple rules can trigger on the same run:
Monitor: Data Export Job
Baseline: 12 minutes, 50 KB output
Current Run: 45 minutes, 500 bytes
Anomalies Detected:
1. Z-Score: 8.2 (duration outlier)
2. Median Multiplier: 3.8 (3.8x slower; exceeds a tuned threshold of 3.0, but not the default 5.0)
3. Output Size Drop: 99% (data loss)
Severity: HIGH (3 rules triggered)
Severity Calculation
| Rules Triggered | Severity |
|---|---|
| 1 rule | LOW |
| 2 rules | MEDIUM |
| 3 rules | HIGH |
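The table maps directly to a lookup. The sketch below is illustrative only: the rule names are taken from the configuration keys, while `severity` and the handling of zero triggered rules are assumptions.

```python
from typing import Optional

# Severity levels as listed in the table above.
SEVERITY_BY_RULE_COUNT = {1: "LOW", 2: "MEDIUM", 3: "HIGH"}


def severity(triggered_rules) -> Optional[str]:
    """Return the severity for a collection of triggered rule names."""
    count = len(set(triggered_rules))  # count each rule once
    if count == 0:
        return None  # assumption: no triggered rules means no incident
    # Assumption: counts above three, if they ever occur, stay at HIGH.
    return SEVERITY_BY_RULE_COUNT[min(count, 3)]


print(severity(["zScore"]))                                        # LOW
print(severity(["zScore", "outputSizeDrop"]))                      # MEDIUM
print(severity(["zScore", "medianMultiplier", "outputSizeDrop"]))  # HIGH
```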
Custom Rules (Future)
Planned custom rule types:
Memory Usage:
Trigger if: memory > mean + 3σ
CPU Usage:
Trigger if: cpu_percent > 80% AND duration > 2x median
Error Rate:
Trigger if: error_count / total_operations > 0.05
These custom rules are currently in beta. Contact us for early access.
Rule Priorities
When multiple rules apply, they're evaluated in order:
1. Output Size Drop (most critical)
2. Z-Score (standard deviation)
3. Median Multiplier (extreme outliers)
All triggered rules are included in the incident.
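A rough sketch of this evaluation order follows. It assumes the default thresholds from the sections above. The `Run` and `Baseline` types and the sample numbers (borrowed from the earlier examples) are invented for illustration and do not describe Saturn's internals.

```python
from typing import NamedTuple


class Run(NamedTuple):
    duration: float      # minutes
    output_bytes: int


class Baseline(NamedTuple):
    mean: float
    stddev: float
    median: float
    median_output_bytes: float


def output_size_drop(run, b):
    return 1 - run.output_bytes / b.median_output_bytes > 0.5


def z_score(run, b):
    return (run.duration - b.mean) / b.stddev > 3.0


def median_multiplier(run, b):
    return run.duration / b.median > 5.0


# Rules listed in the documented evaluation order.
RULES = [
    ("outputSizeDrop", output_size_drop),
    ("zScore", z_score),
    ("medianMultiplier", median_multiplier),
]


def evaluate(run, baseline):
    """Check every rule in order and return the names of all that triggered."""
    return [name for name, check in RULES if check(run, baseline)]


baseline = Baseline(mean=12.5, stddev=2.1, median=8.2, median_output_bytes=45231)
print(evaluate(Run(duration=42.0, output_bytes=234), baseline))
# ['outputSizeDrop', 'zScore', 'medianMultiplier']
print(evaluate(Run(duration=19.8, output_bytes=45000), baseline))
# ['zScore']
```

Evaluation does not stop at the first match, which is how every triggered rule ends up in the incident.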
Disabling Rules
Disable rules that don't apply:
{
  "anomalyDetection": {
    "rules": {
      "zScore": {"enabled": true},
      "medianMultiplier": {"enabled": true},
      "outputSizeDrop": {"enabled": false} // No output capture
    }
  }
}
Testing Rules
Test rule sensitivity before deploying:
1. Go to monitor analytics
2. View duration distribution
3. Click Test Anomaly Rules
4. Adjust thresholds
5. See which historical runs would trigger
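The same idea can be tried offline on exported durations. The sketch below is not the Test Anomaly Rules feature itself: it assumes the whole history, including the run being tested, serves as the baseline, and `flagged_runs` is a hypothetical helper.

```python
import statistics


def flagged_runs(durations, threshold):
    """Return the durations whose Z-Score against the full history exceeds threshold."""
    mean = statistics.mean(durations)
    stddev = statistics.stdev(durations)  # sample standard deviation
    if stddev == 0:
        return []  # identical durations: nothing can be an outlier
    return [d for d in durations if (d - mean) / stddev > threshold]


history = [8, 9, 10, 11, 12, 120]  # minutes
for threshold in (2.0, 3.0):
    print(threshold, flagged_runs(history, threshold))
# 2.0 [120]
# 3.0 []
```

With only six runs, the 120-minute outlier inflates the standard deviation enough to stay under a 3.0 threshold. Short or highly variable histories are a case where the Median Multiplier rule is the better fit.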
Rule Performance
| Rule | Overhead | Memory | Best For |
|---|---|---|---|
| Z-Score | O(1) | ~40 bytes | Consistent jobs |
| Median Multiplier | O(1) | ~400 bytes | Variable jobs |
| Output Size Drop | O(1) | ~8 bytes | Data exports |
All rules use constant memory per monitor. Welford's algorithm maintains the running mean and standard deviation for Z-Score without storing individual runs.
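For reference, a minimal Python sketch of Welford's algorithm is shown below. It illustrates the general technique and is not Saturn's code. The class name and the choice of the sample standard deviation (dividing by n - 1) are assumptions.

```python
import math


class RunningStats:
    """Track mean and standard deviation in O(1) memory using Welford's update."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Uses the old and the new mean, which keeps the update numerically stable.
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))  # sample standard deviation


stats = RunningStats()
for duration in [2, 4, 4, 4, 5, 5, 7, 9]:  # illustrative durations in minutes
    stats.update(duration)

print(f"mean={stats.mean:.2f}, stddev={stats.stddev:.2f}")  # mean=5.00, stddev=2.14
```

Only three numbers are kept per monitor, regardless of how many runs have been observed.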
Next Steps
- Welford's Algorithm — How statistics are calculated efficiently
- Anomaly Tuning — Reduce false positives
- Analytics — Visualize anomaly trends