Anomaly Tuning
Anomaly detection requires tuning to match the characteristics of your jobs. Thresholds that are too sensitive cause alert fatigue; thresholds that are too lenient let real issues slip through.
The Tuning Process
Step 1: Baseline Collection
For the first two weeks, don't tune anything; just collect data.
- Let baseline stabilize (≥50 runs)
- Review duration distribution
- Check for natural patterns
Step 2: Evaluate Current State
Go to Monitor → Analytics → Anomalies and review:
Key Metrics
| Metric | Formula | Target |
|---|---|---|
| False Positive Rate | False anomalies / Total anomalies | < 10% |
| Anomaly Rate | Anomalies / Total runs | 1-5% |
| Coefficient of Variation | Std Dev / Mean | < 0.3 for consistent jobs |
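As a rough illustration, these metrics can be computed directly from a monitor's run history. The field names below (duration_seconds, is_anomaly, false_positive) are assumptions made for this sketch, not a Saturn API schema:
# Illustrative only: compute the key tuning metrics from a list of runs.
# The run fields used here are assumed for this sketch, not a Saturn schema.
from statistics import mean, stdev

runs = [
    {"duration_seconds": 610, "is_anomaly": False, "false_positive": False},
    {"duration_seconds": 640, "is_anomaly": False, "false_positive": False},
    {"duration_seconds": 1580, "is_anomaly": True, "false_positive": True},
    # ... remaining runs
]

durations = [r["duration_seconds"] for r in runs]
anomalies = [r for r in runs if r["is_anomaly"]]
false_positives = [r for r in anomalies if r["false_positive"]]

cv = stdev(durations) / mean(durations)                  # Coefficient of Variation
anomaly_rate = len(anomalies) / len(runs)                # target: 1-5%
fp_rate = len(false_positives) / max(len(anomalies), 1)  # target: < 10%

print(f"CV={cv:.2f}  anomaly_rate={anomaly_rate:.1%}  false_positive_rate={fp_rate:.1%}")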
Example Review
Monitor: Daily Backup
Total runs: 100
Anomalies: 15
False positives: 12 (80%)
Analysis: far too sensitive (target false positive rate is < 10%)
Step 3: Adjust Thresholds
Z-Score Threshold
Default: 3.0 standard deviations
For consistent jobs (CV < 0.2):
{
"zScore": {
"threshold": 2.5 // More sensitive
}
}
For variable jobs (CV > 0.5):
{
"zScore": {
"threshold": 4.0 // Less sensitive
}
}
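To make the rule concrete, here is a minimal sketch of how a Z-Score check behaves at different thresholds. This is illustrative logic, not Saturn's internal implementation:
# Sketch of the Z-Score rule: flag a run whose duration is more than
# `threshold` standard deviations away from the baseline mean.
from statistics import mean, stdev

def z_score_anomaly(duration, baseline_durations, threshold=3.0):
    mu = mean(baseline_durations)
    sigma = stdev(baseline_durations)
    if sigma == 0:
        return False, 0.0          # perfectly constant baseline: nothing to flag
    z = (duration - mu) / sigma
    return abs(z) > threshold, z

baseline = [590, 600, 605, 610, 595, 600, 615, 600]    # seconds; a consistent ~10-minute job
print(z_score_anomaly(630, baseline, threshold=2.5))   # (True, ~3.5): flagged by a tight threshold
print(z_score_anomaly(630, baseline, threshold=4.0))   # (False, ~3.5): ignored by a lenient one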
Median Multiplier Threshold
Default: 5.0× median
For stable jobs:
{
"medianMultiplier": {
"threshold": 3.0 // Catch 3x slowdowns
}
}
For highly variable jobs:
{
"medianMultiplier": {
"threshold": 10.0 // Only extreme spikes
}
}
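The median multiplier is useful precisely because a few past outliers inflate the mean and standard deviation but barely move the median. A small illustrative comparison (not Saturn's implementation):
# Why median-based thresholds tolerate a noisy history: one past spike
# inflates the mean and standard deviation, but the median barely moves.
from statistics import mean, median, stdev

history = [300, 310, 305, 295, 300, 2400, 305, 300]   # seconds; one 40-minute spike

print(f"mean={mean(history):.0f}s  stdev={stdev(history):.0f}s  median={median(history):.0f}s")
# mean ~564 s vs. median ~302 s

def median_multiplier_anomaly(duration, history, threshold=5.0):
    return duration > threshold * median(history)

print(median_multiplier_anomaly(1000, history, threshold=3.0))   # True: 1000 s > 3 x ~302 s
print(median_multiplier_anomaly(1000, history, threshold=5.0))   # False: 1000 s < 5 x ~302 s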
Output Size Drop Threshold
Default: 50% drop
For critical data exports:
{
"outputSizeDrop": {
"threshold": 0.3, // Alert on 30% drop
"minMedianBytes": 10240 // Ignore if median < 10 KB
}
}
For less critical:
{
"outputSizeDrop": {
"threshold": 0.8 // Only alert on 80%+ drop
}
}
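For reference, the rule can be sketched as follows; the minMedianBytes floor is what prevents noisy alerts on runs whose output is normally tiny. Illustrative only:
# Sketch of the output-size-drop rule: alert when a run's output shrinks by
# more than `threshold` relative to the median, but only once the median
# output is large enough to be meaningful (minMedianBytes).
from statistics import median

def output_size_drop_anomaly(output_bytes, history_bytes,
                             threshold=0.5, min_median_bytes=10240):
    med = median(history_bytes)
    if med < min_median_bytes:
        return False                      # output too small to judge reliably
    drop = 1 - (output_bytes / med)
    return drop >= threshold

history = [250_000, 248_000, 252_000, 249_000, 251_000]             # ~250 KB exports
print(output_size_drop_anomaly(160_000, history, threshold=0.3))    # True: ~36% drop
print(output_size_drop_anomaly(160_000, history, threshold=0.8))    # False: needs an 80%+ drop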
Common Scenarios
Scenario 1: Too Many False Positives
Symptoms:
- Anomaly alerts multiple times per week
- Team ignoring alerts (alert fatigue)
- Most anomalies are "actually fine"
Example:
Monitor: Report Generation
Mean: 10 minutes
Std Dev: 5 minutes (high variance)
Current Z-Score threshold: 3.0
Recent anomalies:
- 26 min (Z=3.2) - Actually fine, just more data
- 27 min (Z=3.4) - Fine
- 29 min (Z=3.8) - Fine
Solution: Increase threshold or use median multiplier
{
"zScore": {
"enabled": false // Disable for high-variance jobs
},
"medianMultiplier": {
"enabled": true,
"threshold": 5.0 // Only alert if 5x slower than median
}
}
Scenario 2: Missing Real Issues
Symptoms:
- Job performance degrading but no alerts
- Issues only caught when job fails completely
- Threshold too lenient
Example:
Monitor: Database Backup
Typical: 8-10 minutes
Recent trend: 12 → 15 → 18 → 22 minutes
Median Multiplier threshold: 5.0x (50 minutes)
Problem: Degradation not caught until failure
Solution: Lower threshold or enable Z-Score
{
"zScore": {
"enabled": true,
"threshold": 2.5 // More sensitive
},
"medianMultiplier": {
"enabled": true,
"threshold": 2.0 // Alert at 2x slower
}
}
Scenario 3: Legitimate Variability
Symptoms:
- Job legitimately varies (time of day, data volume)
- Cannot use strict thresholds
Example:
Monitor: ETL Pipeline
Monday: 20 min (lots of weekend data)
Tuesday-Friday: 5-8 min (normal)
Saturday-Sunday: 3 min (low traffic)
Solution 1: Split into separate monitors
[
{
"name": "ETL - Weekdays",
"schedule": {"type": "cron", "expression": "0 3 * * 1-5"},
"zScore": {"threshold": 3.0}
},
{
"name": "ETL - Weekends",
"schedule": {"type": "cron", "expression": "0 3 * * 0,6"},
"zScore": {"threshold": 3.0}
}
]
Solution 2: Use maintenance windows
{
"name": "ETL Pipeline",
"maintenanceWindows": [
{
"name": "Monday Peak",
"rrule": "FREQ=WEEKLY;BYDAY=MO",
"duration": 3600 // Suppress Monday anomalies
}
]
}
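To reason about what a window like this suppresses, you can expand the RRULE and check whether a run falls inside an occurrence. The sketch below uses python-dateutil, assumes the duration is in seconds, and picks an arbitrary dtstart; it approximates, rather than reproduces, Saturn's evaluation:
# Rough sketch of maintenance-window suppression using python-dateutil.
# Assumes `duration` is in seconds; Saturn's own evaluation may differ.
from datetime import datetime, timedelta
from dateutil.rrule import rrulestr

def in_maintenance_window(run_start, rrule_text, duration_seconds, dtstart):
    rule = rrulestr(rrule_text, dtstart=dtstart)
    occurrence = rule.before(run_start, inc=True)   # latest occurrence at or before the run
    if occurrence is None:
        return False
    return run_start < occurrence + timedelta(seconds=duration_seconds)

# A Monday 03:10 run falls inside a one-hour Monday window starting at 03:00
print(in_maintenance_window(
    datetime(2024, 1, 8, 3, 10),                    # a Monday
    "FREQ=WEEKLY;BYDAY=MO",
    3600,
    dtstart=datetime(2024, 1, 1, 3, 0),             # also a Monday, 03:00
))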
Scenario 4: Bimodal Distribution
Symptoms:
- Two distinct performance profiles
- "Fast path" and "slow path" in same job
Example:
Monitor: Image Processing
Fast path (cache hit): 2-5 seconds (90% of runs)
Slow path (cache miss): 30-60 seconds (10% of runs)
Solution: Disable Z-Score, use median multiplier
{
"zScore": {
"enabled": false // Mean is misleading in bimodal distribution
},
"medianMultiplier": {
"enabled": true,
"threshold": 10.0 // Only catch extreme outliers
}
}
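The numbers below show why the mean misleads here: a 90/10 mix of cache hits and misses produces a mean and standard deviation that describe neither mode. Illustrative values only:
# Why Z-Score breaks down on a bimodal job: the mean and standard deviation
# describe neither the fast path nor the slow path, but the median stays put.
from statistics import mean, median, stdev

fast = [3.0] * 90          # cache hits: ~3 s, 90% of runs
slow = [45.0] * 10         # cache misses: ~45 s, 10% of runs
durations = fast + slow

print(f"mean={mean(durations):.1f}s  stdev={stdev(durations):.1f}s  median={median(durations):.1f}s")
# mean ~7.2 s, stdev ~12.7 s, median 3.0 s: an ordinary cache miss (~45 s) lands at z ~3
# and would be flagged, even though it is routine behavior for this job.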
Alternatively, refactor the job so the fast and slow paths report to separate monitors.
Scenario 5: Growing Job
Symptoms:
- Job handles more data over time
- Baseline keeps shifting
- Constant anomalies as job grows
Example:
Month 1: Mean 10 min
Month 2: Mean 12 min
Month 3: Mean 15 min
Each month: "Anomaly" because growing
Solution: Rolling baseline window
{
"anomalyDetection": {
"baselineWindow": 50 // Only consider last 50 runs
}
}
Saturn automatically adapts the baseline as the job grows.
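Conceptually, a rolling window limits the baseline statistics to the most recent runs, so steady growth is absorbed instead of flagged. A minimal sketch of the idea (not Saturn's implementation):
# Sketch of a rolling baseline: only the most recent `window` runs feed the
# statistics, so steady month-over-month growth shifts the baseline with it.
from statistics import mean

def rolling_baseline(durations, window=50):
    return mean(durations[-window:])

month1 = [600] * 30        # ~10 min runs
month2 = [720] * 30        # ~12 min runs
month3 = [900] * 30        # ~15 min runs

history = month1 + month2 + month3
print(rolling_baseline(history, window=50) / 60)   # ~14 min: dominated by recent months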
Interpreting Z-Scores
Z-Score Guidelines
| Z-Score Range | Interpretation | Action |
|---|---|---|
| < 1.0 | Normal | None |
| 1.0 - 2.0 | Slightly elevated | Monitor |
| 2.0 - 3.0 | Noteworthy | Review if recurring |
| 3.0 - 4.0 | Anomalous | Investigate |
| 4.0 - 6.0 | Highly anomalous | Immediate attention |
| > 6.0 | Extreme outlier | Critical issue |
Example Incident Messages
Z-Score: 3.2
ANOMALY: Duration slightly outside normal range
Duration: 19.2 minutes
Mean: 12.5 minutes (±2.1)
Z-Score: 3.2
This is mildly anomalous. Check for:
- Slightly increased data volume
- Minor resource contention
- Small configuration changes
Z-Score: 8.5
ANOMALY: Duration EXTREMELY anomalous
Duration: 30.4 minutes
Mean: 12.5 minutes (±2.1)
Z-Score: 8.5
This is a critical outlier. Likely causes:
- Major infrastructure issue
- Database performance problem
- Significant code regression
Testing Thresholds
Before applying changes, test against historical data:
Via Dashboard
- Go to Monitor → Analytics
- Click Test Anomaly Rules
- Adjust thresholds
- See which past runs would trigger
- Review false positive rate
- Apply if satisfied
Via API
curl -X POST https://api.saturn.example.com/api/monitors/YOUR_MONITOR_ID/test-anomaly-rules \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"zScore": {"threshold": 2.5},
"medianMultiplier": {"threshold": 4.0},
"outputSizeDrop": {"threshold": 0.5}
}'
Response shows which historical runs would trigger.
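To sweep several candidate thresholds instead of testing one at a time, the same endpoint can be scripted. The sketch below assumes the endpoint and payload from the curl example plus the Python requests library, and prints each raw response rather than assuming specific response fields:
# Sweep candidate Z-Score thresholds against the test endpoint shown above.
# Assumes the endpoint/payload from the curl example; response handling is
# left generic because the exact response fields are not documented here.
import requests

API = "https://api.saturn.example.com/api/monitors/YOUR_MONITOR_ID/test-anomaly-rules"
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

for threshold in (2.0, 2.5, 3.0, 3.5, 4.0):
    resp = requests.post(API, headers=HEADERS, json={"zScore": {"threshold": threshold}})
    resp.raise_for_status()
    print(threshold, resp.json())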
Advanced Tuning
Per-Environment Thresholds
[
{
"name": "API Sync - Production",
"zScore": {"threshold": 2.5}, // Strict
"tags": ["env:prod"]
},
{
"name": "API Sync - Staging",
"zScore": {"threshold": 4.0}, // Lenient
"tags": ["env:staging"]
}
]
Time-Based Sensitivity
Future feature: different thresholds by time of day or day of week.
{
"anomalyDetection": {
"rules": {
"zScore": {
"threshold": 3.0,
"overrides": [
{
"days": ["Monday"],
"threshold": 4.0 // More lenient on Mondays
}
]
}
}
}
}
Combining Rules
Use multiple rules with AND/OR logic:
{
"anomalyDetection": {
"logic": "OR", // Trigger if ANY rule matches (default)
"rules": {
"zScore": {"enabled": true, "threshold": 3.0},
"medianMultiplier": {"enabled": true, "threshold": 5.0}
}
}
}
Or require multiple rules:
{
"anomalyDetection": {
"logic": "AND", // Trigger only if ALL rules match
"rules": {
"zScore": {"enabled": true, "threshold": 2.0},
"outputSizeDrop": {"enabled": true, "threshold": 0.5}
}
}
}
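The two modes reduce to any() versus all() over the enabled rules' individual verdicts, as in this illustrative sketch (not Saturn's evaluator):
# Illustrative OR/AND combination of per-rule verdicts (not Saturn's evaluator).
def combine_rules(verdicts, logic="OR"):
    """verdicts: dict of rule name -> bool for each enabled rule."""
    results = verdicts.values()
    return any(results) if logic == "OR" else all(results)

verdicts = {"zScore": True, "medianMultiplier": False}
print(combine_rules(verdicts, logic="OR"))    # True: any single rule is enough
print(combine_rules(verdicts, logic="AND"))   # False: requires every rule to agree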
Best Practices
✅ Do
- Start conservative (high thresholds), tighten over time
- Review weekly for first month after enabling
- Document legitimate anomalies (e.g., "Monday peak expected")
- Use tags to organize monitors with similar characteristics
- Test changes before applying to production
❌ Don't
- Over-tune — some noise is acceptable
- Ignore patterns — recurring anomalies indicate real issues
- Use same thresholds everywhere — jobs differ
- Disable alerts — tune instead
- Forget to re-evaluate — job characteristics change
Monitoring Tuning Effectiveness
Track these metrics over time:
False Positive Rate = (Acknowledged as false / Total anomalies) × 100%
Target: < 10%
Time to Detection = (First anomaly alert) - (Performance degradation start)
Target: < 1 day
Alert Fatigue = (Ignored alerts / Total alerts) × 100%
Target: < 5%
Next Steps
- Analytics — Visualize performance trends
- Incident Lifecycle — Managing anomaly incidents
- Health Scores — Understanding overall health