
Anomaly Detection Rules

Saturn uses multiple statistical rules to detect different types of anomalies. Each rule targets specific failure patterns.

Rule 1: Z-Score (Duration)

Detects: Unusually slow execution times

Formula:

Z-Score = (duration - mean) / stddev

Trigger if: Z-Score > 3.0

Best for: Jobs with consistent, predictable durations
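
As a minimal sketch of the check above, assuming the baseline mean and standard deviation are already known. The function names and the zero-variance handling are illustrative assumptions, not Saturn's API:

```python
def z_score(duration: float, mean: float, stddev: float) -> float:
    """Return how many standard deviations `duration` lies above the mean."""
    if stddev <= 0:
        # Assumption: with no variance in the baseline the score is
        # undefined, so this sketch reports 0 rather than dividing by zero.
        return 0.0
    return (duration - mean) / stddev


def is_duration_outlier(duration: float, mean: float, stddev: float,
                        threshold: float = 3.0) -> bool:
    """Apply the trigger condition Z-Score > threshold (slow runs only)."""
    return z_score(duration, mean, stddev) > threshold
```

Because the condition is one-sided, an unusually fast run produces a negative score and never triggers this rule.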

Example

Baseline (50 runs):

  • Mean (μ): 12.5 minutes
  • Std Dev (σ): 2.1 minutes

Current run:

  • Duration: 19.8 minutes

Calculation:

Z-Score = (19.8 - 12.5) / 2.1
= 7.3 / 2.1
= 3.48

3.48 > 3.0 → ANOMALY

Incident Message:

ANOMALY: Duration outlier detected
Duration: 19.8 minutes
Mean: 12.5 minutes
Std Dev: 2.1 minutes
Z-Score: 3.48 (threshold: 3.0)
This run was 3.5 standard deviations slower than average

Z-Score Interpretation

Z-Score   Probability      Interpretation
1.0       16% of runs      Normal variance
2.0       2.5% of runs     Noteworthy
3.0       0.15% of runs    Anomalous
4.0       0.003% of runs   Highly anomalous
5.0+      < 0.0001%        Extreme outlier
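
The percentages in this table correspond to the one-sided tail of a normal distribution, rounded. The sketch below reproduces them under the assumption that durations are normally distributed; real job durations are often right-skewed, so treat the figures as a guide. The function name is illustrative:

```python
import math


def upper_tail_probability(z: float) -> float:
    """P(Z > z) for a standard normal variable, via the complementary
    error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


for z in (1.0, 2.0, 3.0, 4.0, 5.0):
    share = upper_tail_probability(z) * 100
    print(f"z = {z:.1f}: about {share:.5f}% of runs exceed this score")
```

For z = 3.0 this gives roughly 0.135%, which the table rounds to 0.15%.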

Configuration

{
  "anomalyDetection": {
    "rules": {
      "zScore": {
        "enabled": true,
        "threshold": 3.0, // Adjust for sensitivity
        "applyTo": "duration"
      }
    }
  }
}

Tuning:

  • threshold: 2.5 → More sensitive (more alerts)
  • threshold: 4.0 → Less sensitive (fewer alerts)

Rule 2: Median Multiplier (Duration)

Detects: Extreme duration spikes, even for variable jobs

Formula:

Multiplier = duration / median

Trigger if: Multiplier > 5.0

Best for: Jobs with high variance or skewed distributions
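
A minimal sketch of this rule, assuming the baseline durations are available as a list. The function names are illustrative, not Saturn's API:

```python
from statistics import median
from typing import Sequence


def median_multiplier(duration: float, baseline: Sequence[float]) -> float:
    """Return the current duration as a multiple of the baseline median."""
    typical = median(baseline)
    if typical <= 0:
        # Assumption: a non-positive median makes the ratio meaningless.
        raise ValueError("baseline median must be positive")
    return duration / typical


def is_duration_spike(duration: float, baseline: Sequence[float],
                      threshold: float = 5.0) -> bool:
    """Apply the trigger condition Multiplier > threshold."""
    return median_multiplier(duration, baseline) > threshold
```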

Example

Baseline (50 runs):

  • Median: 8.2 minutes
  • Sample durations: 5, 6, 7, 8, 8, 9, 10, 12, 15, 45 minutes (one outlier)

Current run:

  • Duration: 42.0 minutes

Calculation:

Multiplier = 42.0 / 8.2
= 5.12

5.12 > 5.0 → ANOMALY

Incident Message:

ANOMALY: Duration 5.1x higher than median
Duration: 42.0 minutes
Median: 8.2 minutes
This run took over 5 times longer than typical runs

Why Median vs Mean?

Mean is affected by outliers:

Runs: 8, 9, 10, 11, 12, 120 minutes
Mean: 28.3 minutes (skewed by outlier)
Median: 10.5 minutes (represents typical runs)

Median is robust:

  • Not affected by extreme values
  • Better for variable jobs
  • More intuitive ("middle value")
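
The comparison above can be checked with Python's standard `statistics` module. This sketch uses the same six runs and then drops the outlier to show how far each statistic moves:

```python
from statistics import mean, median

runs = [8, 9, 10, 11, 12, 120]   # minutes; the last run is the outlier
typical = runs[:-1]              # the same runs without the outlier

print(f"with outlier:    mean={mean(runs):.1f}  median={median(runs):.1f}")
print(f"without outlier: mean={mean(typical):.1f}  median={median(typical):.1f}")
```

The single 120-minute run moves the mean from 10.0 to 28.3 minutes, while the median only moves from 10.0 to 10.5.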

Configuration

{
  "anomalyDetection": {
    "rules": {
      "medianMultiplier": {
        "enabled": true,
        "threshold": 5.0, // Times the median
        "applyTo": "duration"
      }
    }
  }
}

Tuning:

  • threshold: 3.0 → Catch 3x slowdowns
  • threshold: 10.0 → Only catch extreme spikes

Rule 3: Output Size Drop

Detects: Silent failures with significantly reduced output

Formula:

Drop Percentage = 1 - (current_size / median_size)

Trigger if: Drop > 50%

Best for: Jobs that export data or generate reports

Example

Baseline (50 runs):

  • Median output: 45,231 bytes (45 KB)

Current run:

  • Output: 234 bytes
  • Exit code: 0 (success)

Calculation:

Drop = 1 - (234 / 45231)
= 1 - 0.0052
= 0.9948 (99.48%)

99.48% > 50% → ANOMALY

Incident Message:

ANOMALY: Output size dropped 99.5%
Current output: 234 bytes
Median output: 45,231 bytes
Job may have terminated early or produced incomplete results

Why This Matters

Silent data loss scenarios:

Example 1: Database Export

Normal: 50,000 records → 5 MB JSON
Anomaly: 3 records → 500 bytes
Cause: Database query filtered by wrong date, returned almost nothing

Example 2: Report Generation

Normal: 200-page PDF → 2 MB
Anomaly: Empty PDF → 1 KB
Cause: Data source unavailable, generated header-only PDF

Example 3: Log Aggregation

Normal: 1000 log files → 50 MB compressed
Anomaly: 0 logs → 100 bytes (empty archive)
Cause: Log rotation broke symlinks, no logs collected

Configuration

{
  "anomalyDetection": {
    "rules": {
      "outputSizeDrop": {
        "enabled": true,
        "threshold": 0.5, // 50% drop
        "applyTo": "output",
        "minMedianBytes": 1024 // Ignore for small outputs
      }
    }
  }
}

Tuning:

  • threshold: 0.3 → Alert on 30% drop
  • threshold: 0.8 → Only alert on 80%+ drop
  • minMedianBytes: 10240 → Ignore if median < 10 KB
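
A minimal sketch of how the drop formula and the `minMedianBytes` guard might fit together. The function names are illustrative, and the exact way Saturn applies the guard is an assumption:

```python
def output_size_drop(current_size: int, median_size: float) -> float:
    """Return the drop as a fraction: 1 - (current_size / median_size)."""
    return 1 - (current_size / median_size)


def is_output_drop(current_size: int, median_size: float,
                   threshold: float = 0.5,
                   min_median_bytes: int = 1024) -> bool:
    """Apply Drop > threshold, skipping monitors with small outputs."""
    if median_size <= 0 or median_size < min_median_bytes:
        # Assumption: small baselines are skipped entirely, since a few
        # bytes of change would otherwise register as a large percentage.
        return False
    return output_size_drop(current_size, median_size) > threshold
```

With the example values above, `is_output_drop(234, 45231)` evaluates to `True`, while a job whose median output is only 500 bytes is never flagged under the default guard.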

Rule Combinations

Multiple rules can trigger on the same run:

Monitor: Data Export Job
Baseline: 12 minutes, 50 KB output

Current Run: 45 minutes, 500 bytes

Anomalies Detected:
1. Z-Score: 8.2 (duration outlier)
2. Median Multiplier: 3.8 (3.8x slower; triggers only with the threshold lowered to 3.0, since the default is 5.0)
3. Output Size Drop: 99% (data loss)

Severity: HIGH (3 rules triggered)

Severity Calculation

Rules Triggered   Severity
1 rule            LOW
2 rules           MEDIUM
3 rules           HIGH
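
The mapping in this table can be sketched as a small lookup. The function name is illustrative, and returning `None` when no rule triggers is an assumption, since the table does not cover that case:

```python
from typing import Optional

SEVERITY_BY_RULE_COUNT = {1: "LOW", 2: "MEDIUM", 3: "HIGH"}


def severity(rules_triggered: int) -> Optional[str]:
    """Map the number of triggered rules to a severity label."""
    if rules_triggered <= 0:
        return None  # Assumption: no triggered rules means no incident.
    # Assumption: counts above 3 are capped at the highest listed level.
    return SEVERITY_BY_RULE_COUNT[min(rules_triggered, 3)]
```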

Custom Rules (Future)

Planned custom rule types:

Memory Usage:

Trigger if: memory > mean + 3σ

CPU Usage:

Trigger if: cpu_percent > 80% AND duration > 2x median

Error Rate:

Trigger if: error_count / total_operations > 0.05

Custom rules are currently in beta. Contact us for early access.
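
Purely as an illustration of the three conditions above, each can be written as a one-line predicate. Custom rules are not generally available, so every name and parameter here is hypothetical, as is the handling of zero operations:

```python
def memory_rule(memory: float, mean: float, stddev: float) -> bool:
    """memory > mean + 3σ"""
    return memory > mean + 3 * stddev


def cpu_rule(cpu_percent: float, duration: float,
             median_duration: float) -> bool:
    """cpu_percent > 80% AND duration > 2x median"""
    return cpu_percent > 80 and duration > 2 * median_duration


def error_rate_rule(error_count: int, total_operations: int) -> bool:
    """error_count / total_operations > 0.05"""
    if total_operations == 0:
        return False  # Assumption: no operations means no error rate.
    return error_count / total_operations > 0.05
```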

Rule Priorities

When multiple rules apply, they're evaluated in order:

  1. Output Size Drop (most critical)
  2. Z-Score (standard deviation)
  3. Median Multiplier (extreme outliers)

All triggered rules are included in the incident.
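
One way to picture this evaluation order is a list of checks that runs top to bottom and collects every rule that fires. The `Run` and `Baseline` types, the default thresholds inlined in the checks, and the sample values are all assumptions made for this sketch:

```python
from typing import Callable, List, NamedTuple, Tuple


class Run(NamedTuple):
    duration: float          # minutes
    output_bytes: int


class Baseline(NamedTuple):
    mean: float              # minutes
    stddev: float            # minutes
    median: float            # minutes
    median_output_bytes: float


# Checks are listed in the documented priority order.
RULES: List[Tuple[str, Callable[[Run, Baseline], bool]]] = [
    ("outputSizeDrop",
     lambda r, b: 1 - r.output_bytes / b.median_output_bytes > 0.5),
    ("zScore", lambda r, b: (r.duration - b.mean) / b.stddev > 3.0),
    ("medianMultiplier", lambda r, b: r.duration / b.median > 5.0),
]


def evaluate(run: Run, baseline: Baseline) -> List[str]:
    """Return the names of all triggered rules, in evaluation order."""
    return [name for name, check in RULES if check(run, baseline)]


baseline = Baseline(mean=12.0, stddev=4.0, median=12.0,
                    median_output_bytes=51200)
print(evaluate(Run(duration=70.0, output_bytes=500), baseline))
```

No rule short-circuits the others: a run that fails all three checks yields all three names, with Output Size Drop first.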

Disabling Rules

Disable rules that don't apply:

{
  "anomalyDetection": {
    "rules": {
      "zScore": {"enabled": true},
      "medianMultiplier": {"enabled": true},
      "outputSizeDrop": {"enabled": false} // No output capture
    }
  }
}

Testing Rules

Test rule sensitivity before deploying:

  1. Go to monitor analytics
  2. View duration distribution
  3. Click Test Anomaly Rules
  4. Adjust thresholds
  5. See which historical runs would trigger
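
The same idea can be tried offline against exported run durations. This is a simplified sketch, not how the Test Anomaly Rules screen works internally: it scores every run against the whole history, and the function name and sample data are assumptions:

```python
from statistics import mean, stdev
from typing import List, Sequence


def runs_that_would_trigger(durations: Sequence[float],
                            threshold: float) -> List[float]:
    """Return the historical durations whose Z-Score exceeds `threshold`."""
    mu = mean(durations)
    sigma = stdev(durations)  # sample standard deviation
    if sigma == 0:
        return []
    return [d for d in durations if (d - mu) / sigma > threshold]


history = [12, 13, 11, 12, 14, 12, 13, 11, 12, 30]  # minutes
for threshold in (2.5, 3.0):
    flagged = runs_that_would_trigger(history, threshold)
    print(f"threshold {threshold}: {flagged}")
```

Here the 30-minute run is flagged at a threshold of 2.5 but not at 3.0, because the outlier inflates the standard deviation it is measured against. With short histories, a lower threshold or a baseline that excludes the run under test may be more informative.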

Rule Performance

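Rule                Overhead   Memory       Best For
Z-Score             O(1)       ~40 bytes    Consistent jobs
Median Multiplier   O(1)       ~400 bytes   Variable jobs
Output Size Drop    O(1)       ~8 bytes     Data exports

All rules use constant memory. Welford's algorithm, which updates the mean and variance incrementally without storing past runs, provides the statistics for the Z-Score rule.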
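
For reference, Welford's algorithm keeps only three numbers (a count, the running mean, and a running sum of squared deviations) and updates them once per run. The class below is a generic sketch of the algorithm, not Saturn's implementation; the class name and the use of the sample (n - 1) variance are assumptions:

```python
import math


class RunningStats:
    """Running mean and standard deviation in constant memory."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Uses the deviation from the old mean and from the updated mean.
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self) -> float:
        """Sample standard deviation; 0.0 until two values are seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))
```

Calling `update()` after each run keeps the baseline current without retaining the individual durations.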

Next Steps