Grace Periods
A grace period is the extra time Saturn waits before marking a job as LATE or MISSED. It's your buffer against false positives from network delays, server load, or natural variability.
How Grace Periods Work
Default Grace Periods
| Schedule Type | Default Grace |
|---|---|
| Interval (< 5 min) | 1 minute |
| Interval (5-30 min) | 2 minutes |
| Interval (30-60 min) | 5 minutes |
| Interval (> 1 hour) | 10 minutes |
| Cron | 10 minutes |
Calculating Grace Periods
Formula
Grace Period = max(job_duration_p95 * 0.2, minimum_grace)
- job_duration_p95: 95th percentile job duration
- minimum_grace: At least 1-2 minutes for network tolerance
Examples
Fast job (30 seconds P95):
Grace = max(30s * 0.2, 120s) = 120s (2 minutes)
Medium job (5 minutes P95):
Grace = max(300s * 0.2, 120s) = 120s (2 minutes)
Slow job (30 minutes P95):
Grace = max(1800s * 0.2, 120s) = 360s (6 minutes)
Long job (2 hours P95):
Grace = max(7200s * 0.2, 120s) = 1440s (24 minutes)
Tuning Strategies
Too Many False Positives
Symptom: Jobs marked LATE/MISSED but actually running fine
Solutions:
- Increase grace period by 20-50%
- Check for variability: Review duration histogram in analytics
- Network issues: Add extra time for network latency
- Server load: Account for high-load periods (end of month, etc.)
Example:
{
"name": "Nightly Backup",
"schedule": {"type": "cron", "expression": "0 3 * * *"},
"gracePeriod": 1800 // 30 minutes (was 10)
}
Missing Real Issues
Symptom: Jobs fail but you're not notified quickly enough
Solutions:
- Decrease grace period if jobs are very consistent
- Use start pings to detect stuck jobs faster
- Add timeout in the job itself to fail fast
Example:
{
"name": "Health Check",
"schedule": {"type": "interval", "seconds": 300},
"gracePeriod": 60 // 1 minute (was 2)
}
Special Scenarios
Variable Duration Jobs
Jobs that sometimes take 5 minutes, sometimes 50 minutes:
Option 1: Generous Grace
{
"gracePeriod": 3600 // 1 hour to cover worst case
}
Option 2: Split into Separate Monitors
[
{
"name": "Fast Path Processor",
"gracePeriod": 300 // 5 minutes
},
{
"name": "Slow Path Processor",
"gracePeriod": 3600 // 1 hour
}
]
First-Run Jobs
Jobs that take longer on first run (cache warming, etc.):
- Set generous grace period initially
- After 10-20 runs, review P95 duration in analytics
- Adjust grace period based on actual performance
Time-of-Day Variations
Jobs that run faster at night, slower during business hours:
Option 1: Set grace period to cover peak hours
Option 2: Use maintenance windows during peak hours (see Maintenance Windows)
Grace Period Anti-Patterns
❌ Setting Grace = Job Duration
// DON'T DO THIS
{
"schedule": {"type": "interval", "seconds": 3600},
"gracePeriod": 3600 // Same as interval!
}
Problem: You won't detect issues until the next expected run, delaying alerts by a full interval.
Better:
{
"schedule": {"type": "interval", "seconds": 3600},
"gracePeriod": 300 // 5 minutes - catches issues within 5 min
}
❌ Extremely Short Grace for Variable Jobs
// DON'T DO THIS for variable jobs
{
"schedule": {"type": "cron", "expression": "0 3 * * *"},
"gracePeriod": 60 // 1 minute for a 10-30 min job
}
Problem: Constant false positives when job takes 11 minutes instead of 10.
Better: Use 20-30% of maximum expected duration.
❌ No Grace Period
// DON'T DO THIS
{
"schedule": {"type": "interval", "seconds": 300},
"gracePeriod": 0 // No tolerance
}
Problem: Even sub-second network delays cause false positives.
Better: Always allow at least 30-60 seconds.
Monitoring Grace Period Effectiveness
Check your monitor analytics to see:
- Incident rate: Too high? Increase grace period.
- Duration distribution: Wide spread? Need more grace.
- P99 vs P95: Large gap? Consider upper bound.
Set a reminder to review grace periods every 3 months. Job characteristics change over time (more data, more users, more load).
Grace Period by Job Type
| Job Type | Typical Duration | Recommended Grace |
|---|---|---|
| Health check | < 10s | 1-2 min |
| API sync | 1-5 min | 2-5 min |
| Report generation | 5-20 min | 5-10 min |
| Database backup | 20-60 min | 10-20 min |
| ETL pipeline | 1-6 hours | 30-60 min |
| ML training | 6-24 hours | 1-4 hours |
Alerts and Grace Periods
Alerts are sent after the grace period expires:
If a ping arrives during the grace period, no incident is created and no alerts are sent.
Integration-Specific Guidance
Kubernetes CronJobs
Account for:
- Image pull time (first run)
- Pod scheduling delay
- Node resource availability
Recommendation: Add 2-5 minutes to expected job duration.
WordPress wp-cron
WordPress cron is visitor-triggered, so:
Low-traffic sites: Set grace to 15-30 minutes
High-traffic sites: Set grace to 5-10 minutes
With real cron: Set grace to 2-5 minutes
CI/CD Pipelines
Build times can vary significantly:
Cache hit: 2 minutes
Cache miss: 10 minutes
Dependency update: 20 minutes
Recommendation: Set grace to cover cache-miss scenarios (10-15 minutes).
Configuration Examples
Conservative (Fewer False Positives)
{
"name": "Nightly Backup",
"schedule": {"type": "cron", "expression": "0 3 * * *"},
"gracePeriod": 3600 // 1 hour
}
Good for: New monitors, variable jobs, less critical services
Balanced (Recommended)
{
"name": "API Sync",
"schedule": {"type": "interval", "seconds": 900},
"gracePeriod": 300 // 5 minutes
}
Good for: Most production jobs with consistent performance
Aggressive (Fast Alerts)
{
"name": "Payment Processing",
"schedule": {"type": "interval", "seconds": 60},
"gracePeriod": 60 // 1 minute
}
Good for: Critical jobs with very consistent duration, high-priority alerts
Next Steps
- Incidents — What happens when grace period expires
- Anomalies — Detect issues even within grace period
- Maintenance Windows — Suspend alerts temporarily