Grace Periods

A grace period is the extra time Saturn waits before marking a job as LATE or MISSED. It's your buffer against false positives from network delays, server load, or natural variability.

How Grace Periods Work

Default Grace Periods

Schedule Type          Default Grace
Interval (< 5 min)     1 minute
Interval (5-30 min)    2 minutes
Interval (30-60 min)   5 minutes
Interval (> 1 hour)    10 minutes
Cron                   10 minutes
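
As a rough illustration, the defaults above can be written as a lookup function. This is a sketch, not Saturn's implementation: the function name is hypothetical, and because the table does not say which row owns the exact 5, 30, and 60 minute boundaries, the choices below are assumptions.

```python
from typing import Optional


def default_grace_seconds(schedule_type: str,
                          interval_seconds: Optional[int] = None) -> int:
    """Return the default grace period in seconds for a schedule.

    Boundary handling is an assumption: exactly 5 minutes falls in the
    5-30 row, exactly 30 and exactly 60 minutes fall in the 30-60 row.
    """
    if schedule_type == "cron":
        return 600  # 10 minutes
    if schedule_type != "interval" or interval_seconds is None:
        raise ValueError("expected 'cron', or 'interval' with interval_seconds")

    minutes = interval_seconds / 60
    if minutes < 5:
        return 60    # 1 minute
    if minutes < 30:
        return 120   # 2 minutes
    if minutes <= 60:
        return 300   # 5 minutes
    return 600       # 10 minutes


print(default_grace_seconds("interval", 900))  # 15-minute interval
print(default_grace_seconds("cron"))
```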

Calculating Grace Periods

Formula

Grace Period = max(job_duration_p95 * 0.2, minimum_grace)
  • job_duration_p95: 95th percentile job duration
  • minimum_grace: At least 1-2 minutes for network tolerance

Examples

Fast job (30 seconds P95):

Grace = max(30s * 0.2, 120s) = 120s (2 minutes)

Medium job (5 minutes P95):

Grace = max(300s * 0.2, 120s) = 120s (2 minutes)

Slow job (30 minutes P95):

Grace = max(1800s * 0.2, 120s) = 360s (6 minutes)

Long job (2 hours P95):

Grace = max(7200s * 0.2, 120s) = 1440s (24 minutes)
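
The formula and the four worked examples above can be checked with a few lines of Python. The function name is illustrative, and the 120-second default for minimum_grace simply mirrors the value used in the examples.

```python
def grace_period_seconds(job_duration_p95: float,
                         minimum_grace: float = 120.0) -> float:
    """Grace Period = max(job_duration_p95 * 0.2, minimum_grace)."""
    return max(job_duration_p95 * 0.2, minimum_grace)


# P95 durations (in seconds) taken from the examples above
for label, p95 in [("fast", 30), ("medium", 300), ("slow", 1800), ("long", 7200)]:
    grace = grace_period_seconds(p95)
    print(f"{label}: {grace:.0f}s ({grace / 60:.0f} min)")
```

The 20% term only starts to matter once the P95 duration exceeds 10 minutes; below that, the minimum grace wins.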

Tuning Strategies

Too Many False Positives

Symptom: Jobs are marked LATE or MISSED even though they are running fine

Solutions:

  1. Increase grace period by 20-50%
  2. Check for variability: Review duration histogram in analytics
  3. Network issues: Add extra time for network latency
  4. Server load: Account for high-load periods (end of month, etc.)

Example:

{
"name": "Nightly Backup",
"schedule": {"type": "cron", "expression": "0 3 * * *"},
"gracePeriod": 1800 // 30 minutes (was 10)
}

Missing Real Issues

Symptom: Jobs fail but you're not notified quickly enough

Solutions:

  1. Decrease grace period if jobs are very consistent
  2. Use start pings to detect stuck jobs faster
  3. Add timeout in the job itself to fail fast

Example:

{
"name": "Health Check",
"schedule": {"type": "interval", "seconds": 300},
"gracePeriod": 60 // 1 minute (was 2)
}

Special Scenarios

Variable Duration Jobs

Jobs that sometimes take 5 minutes, sometimes 50 minutes:

Option 1: Generous Grace

{
"gracePeriod": 3600 // 1 hour to cover worst case
}

Option 2: Split into Separate Monitors

[
{
"name": "Fast Path Processor",
"gracePeriod": 300 // 5 minutes
},
{
"name": "Slow Path Processor",
"gracePeriod": 3600 // 1 hour
}
]

First-Run Jobs

Jobs that take longer on first run (cache warming, etc.):

  1. Set generous grace period initially
  2. After 10-20 runs, review P95 duration in analytics
  3. Adjust grace period based on actual performance
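
The review step above might look like the following sketch if you have the run durations available as a list (how you export them is not covered here, and the sample data is made up). It uses the nearest-rank method for P95, which is one of several common percentile definitions, and then applies the formula from Calculating Grace Periods.

```python
from typing import Sequence


def p95_nearest_rank(durations: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not durations:
        raise ValueError("need at least one recorded duration")
    ordered = sorted(durations)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def suggested_grace(durations: Sequence[float],
                    minimum_grace: float = 120.0) -> float:
    """Apply max(p95 * 0.2, minimum_grace) to observed durations."""
    return max(p95_nearest_rank(durations) * 0.2, minimum_grace)


# Hypothetical data: a slow first run (cache warming), then 19 normal runs
durations = [2400] + [900 + 10 * i for i in range(19)]
print(p95_nearest_rank(durations), suggested_grace(durations))
```

With 20 samples, the single slow first run sits above the 95th percentile, so it no longer inflates the suggested grace period.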

Time-of-Day Variations

Jobs that run faster at night, slower during business hours:

Option 1: Set grace period to cover peak hours

Option 2: Use maintenance windows during peak hours (see Maintenance Windows)

Grace Period Anti-Patterns

❌ Setting Grace = Schedule Interval

// DON'T DO THIS
{
"schedule": {"type": "interval", "seconds": 3600},
"gracePeriod": 3600 // Same as interval!
}

Problem: You won't detect issues until the next expected run, delaying alerts by a full interval.

Better:

{
"schedule": {"type": "interval", "seconds": 3600},
"gracePeriod": 300 // 5 minutes - catches issues within 5 min
}

❌ Extremely Short Grace for Variable Jobs

// DON'T DO THIS for variable jobs
{
"schedule": {"type": "cron", "expression": "0 3 * * *"},
"gracePeriod": 60 // 1 minute for a 10-30 min job
}

Problem: Constant false positives when job takes 11 minutes instead of 10.

Better: Use 20-30% of the maximum expected duration (6-9 minutes for a job that can take up to 30 minutes).

❌ No Grace Period

// DON'T DO THIS
{
"schedule": {"type": "interval", "seconds": 300},
"gracePeriod": 0 // No tolerance
}

Problem: Even sub-second network delays cause false positives.

Better: Always allow at least 30-60 seconds.
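
The anti-patterns that are visible from the config alone can be caught with a small check before a monitor is created. This is a heuristic sketch: the function is hypothetical, it reads only the gracePeriod and schedule fields shown in the examples, and it cannot detect the variable-duration case because that depends on how long the job actually runs.

```python
from typing import Any, Dict, List


def grace_period_warnings(monitor: Dict[str, Any]) -> List[str]:
    """Return warnings for grace-period anti-patterns in a monitor config."""
    warnings: List[str] = []
    grace = monitor.get("gracePeriod")
    if grace is None:
        return warnings  # assumption: the default grace applies when unset

    if grace <= 0:
        warnings.append("no grace period: any delay becomes a false positive")
    elif grace < 30:
        warnings.append("grace under 30 seconds: allow at least 30-60 seconds")

    schedule = monitor.get("schedule", {})
    interval = schedule.get("seconds", float("inf"))
    if schedule.get("type") == "interval" and grace >= interval:
        warnings.append("grace >= interval: alerts are delayed by a full run")
    return warnings


hourly = {"schedule": {"type": "interval", "seconds": 3600}, "gracePeriod": 3600}
print(grace_period_warnings(hourly))
```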

Monitoring Grace Period Effectiveness

Check your monitor analytics to see:

  1. Incident rate: Too high? Increase grace period.
  2. Duration distribution: Wide spread? Need more grace.
  3. P99 vs P95: Large gap? Consider sizing the grace period from the upper bound (P99).

Review quarterly

Set a reminder to review grace periods every 3 months. Job characteristics change over time (more data, more users, more load).
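
Before changing a grace period, it can help to replay past runs against a few candidate values. The sketch below assumes you have, for each run, the number of seconds by which the ping arrived after its expected time; that input format and the sample numbers are hypothetical.

```python
from typing import Sequence


def flagged_fraction(lateness_seconds: Sequence[float],
                     grace_period: float) -> float:
    """Fraction of runs whose ping arrived after the grace period ended."""
    if not lateness_seconds:
        raise ValueError("need at least one run")
    flagged = sum(1 for late in lateness_seconds if late > grace_period)
    return flagged / len(lateness_seconds)


# Hypothetical lateness of the last 10 runs, in seconds
lateness = [20, 45, 90, 130, 15, 200, 60, 310, 75, 40]
for grace in (60, 120, 300):
    print(f"grace={grace}s -> {flagged_fraction(lateness, grace):.0%} of runs flagged")
```

A steep drop between two candidate values suggests the shorter one is producing mostly false positives rather than catching real failures.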

Grace Period by Job Type

Job Type            Typical Duration   Recommended Grace
Health check        < 10s              1-2 min
API sync            1-5 min            2-5 min
Report generation   5-20 min           5-10 min
Database backup     20-60 min          10-20 min
ETL pipeline        1-6 hours          30-60 min
ML training         6-24 hours         1-4 hours

Alerts and Grace Periods

Alerts are sent only after the grace period expires.

If a ping arrives during the grace period, no incident is created and no alerts are sent.
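
The rule above can be modeled as a deadline check. This is a simplified sketch rather than Saturn's actual logic: the function name is hypothetical, and treating a ping that lands exactly on the deadline as on time is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def creates_incident(expected_at: datetime,
                     grace_period_seconds: int,
                     ping_at: Optional[datetime]) -> bool:
    """True if an incident would be opened once the grace period has expired.

    ping_at is None when no ping has arrived by the time the deadline passes.
    """
    deadline = expected_at + timedelta(seconds=grace_period_seconds)
    return ping_at is None or ping_at > deadline


expected = datetime(2024, 1, 1, 3, 0)  # job expected at 03:00
print(creates_incident(expected, 600, datetime(2024, 1, 1, 3, 7)))   # ping inside grace
print(creates_incident(expected, 600, datetime(2024, 1, 1, 3, 12)))  # ping after grace
print(creates_incident(expected, 600, None))                         # no ping at all
```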

Integration-Specific Guidance

Kubernetes CronJobs

Account for:

  • Image pull time (first run)
  • Pod scheduling delay
  • Node resource availability

Recommendation: Add 2-5 minutes to expected job duration.

WordPress wp-cron

WordPress cron is visitor-triggered, so:

  • Low-traffic sites: Set grace to 15-30 minutes
  • High-traffic sites: Set grace to 5-10 minutes
  • With real cron: Set grace to 2-5 minutes

CI/CD Pipelines

Build times can vary significantly:

  • Cache hit: 2 minutes
  • Cache miss: 10 minutes
  • Dependency update: 20 minutes

Recommendation: Set grace to cover cache-miss scenarios (10-15 minutes).

Configuration Examples

Conservative (Fewer False Positives)

{
"name": "Nightly Backup",
"schedule": {"type": "cron", "expression": "0 3 * * *"},
"gracePeriod": 3600 // 1 hour
}

Good for: New monitors, variable jobs, less critical services

Balanced

{
"name": "API Sync",
"schedule": {"type": "interval", "seconds": 900},
"gracePeriod": 300 // 5 minutes
}

Good for: Most production jobs with consistent performance

Aggressive (Fast Alerts)

{
"name": "Payment Processing",
"schedule": {"type": "interval", "seconds": 60},
"gracePeriod": 60 // 1 minute
}

Good for: Critical jobs with very consistent duration, high-priority alerts
