Grace Periods

A grace period is the extra time Saturn waits before marking a job as LATE or MISSED. It is your buffer against false positives caused by network delays, server load, or natural variation in job duration.

How Grace Periods Work

Default Grace Periods

Schedule Type	Default Grace
Interval (< 5 min)	1 minute
Interval (5-30 min)	2 minutes
Interval (30-60 min)	5 minutes
Interval (> 1 hour)	10 minutes
Cron	10 minutes
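
The table can be read as a simple lookup. The sketch below is one illustrative reading, not Saturn's implementation: `default_grace_seconds` is a hypothetical helper, and because the table leaves the bucket boundaries open, it assumes that exactly 5 minutes falls in the 5-30 minute row and that exactly 30 and 60 minutes fall in the 30-60 minute row.

```python
from typing import Optional


def default_grace_seconds(schedule_type: str,
                          interval_seconds: Optional[int] = None) -> int:
    """Hypothetical lookup of the default grace period, in seconds."""
    if schedule_type == "cron":
        return 600  # Cron: 10 minutes
    if schedule_type != "interval" or interval_seconds is None:
        raise ValueError("expected 'cron', or 'interval' with interval_seconds")

    minutes = interval_seconds / 60
    if minutes < 5:
        return 60   # Interval (< 5 min): 1 minute
    if minutes < 30:
        return 120  # Interval (5-30 min): 2 minutes
    if minutes <= 60:
        return 300  # Interval (30-60 min): 5 minutes (boundary handling assumed)
    return 600      # Interval (> 1 hour): 10 minutes
```

For instance, a 15-minute interval (`interval_seconds=900`) maps to the 2-minute default under these assumptions.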

Calculating Grace Periods

Formula

Grace Period = max(job_duration_p95 * 0.2, minimum_grace)

job_duration_p95: 95th percentile job duration
minimum_grace: Floor of at least 1-2 minutes for network tolerance (the examples below use 120 seconds)

Examples

Fast job (30 seconds P95):

Grace = max(30s * 0.2, 120s) = 120s (2 minutes)

Medium job (5 minutes P95):

Grace = max(300s * 0.2, 120s) = 120s (2 minutes)

Slow job (30 minutes P95):

Grace = max(1800s * 0.2, 120s) = 360s (6 minutes)

Long job (2 hours P95):

Grace = max(7200s * 0.2, 120s) = 1440s (24 minutes)
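
The formula and the four examples above can be reproduced in a few lines. This is a minimal sketch: the function name and the defaults (a factor of 0.2 and a 120-second floor, as used in the examples) are illustrative choices, not part of Saturn's API.

```python
def grace_period_seconds(p95_seconds: float,
                         minimum_grace: float = 120.0,
                         factor: float = 0.2) -> float:
    """Grace Period = max(job_duration_p95 * factor, minimum_grace)."""
    return max(p95_seconds * factor, minimum_grace)


# The four worked examples: fast, medium, slow, and long jobs.
for label, p95 in [("fast", 30), ("medium", 300), ("slow", 1800), ("long", 7200)]:
    grace = grace_period_seconds(p95)
    print(f"{label}: P95 = {p95}s, grace = {round(grace)}s")
```

Note that the floor dominates for any job with a P95 under 10 minutes, which is why the fast and medium jobs both end up at 2 minutes.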

Tuning Strategies

Too Many False Positives

Symptom: Jobs are marked LATE or MISSED even though they are running fine

Solutions:

Increase grace period by 20-50%
Check for variability: Review duration histogram in analytics
Network issues: Add extra time for network latency
Server load: Account for high-load periods (end of month, etc.)

Example:

{
  "name": "Nightly Backup",
  "schedule": {"type": "cron", "expression": "0 3 * * *"},
  "gracePeriod": 1800  // 30 minutes (was 10)
}

Missing Real Issues

Symptom: Jobs fail but you're not notified quickly enough

Solutions:

Decrease grace period if jobs are very consistent
Use start pings to detect stuck jobs faster
Add timeout in the job itself to fail fast

Example:

{
  "name": "Health Check",
  "schedule": {"type": "interval", "seconds": 300},
  "gracePeriod": 60  // 1 minute (was 2)
}
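
The start-ping and fail-fast suggestions above can be combined in a small job wrapper. The following is a sketch under stated assumptions: this section does not describe how pings are sent to Saturn, so sending is left to a caller-supplied `ping` callback, and the event names "start", "success", and "fail" are placeholders rather than documented values.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_with_timeout(command: Sequence[str],
                     timeout_seconds: float,
                     ping: Callable[[str], None]) -> bool:
    """Run a job, pinging at start and finish, and fail fast on timeout."""
    ping("start")  # placeholder event name; lets a monitor notice stuck jobs
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        ping("fail")
        return False
    succeeded = result.returncode == 0
    ping("success" if succeeded else "fail")
    return succeeded


if __name__ == "__main__":
    events = []
    # events.append stands in for a real ping sender (hypothetical).
    run_with_timeout([sys.executable, "-c", "pass"], 30, events.append)
    print(events)
```

With the timeout enforced inside the job, a hung run reports failure on its own instead of waiting for the grace period to expire.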

Special Scenarios

Variable Duration Jobs

Jobs that sometimes take 5 minutes, sometimes 50 minutes:

Option 1: Generous Grace

{
  "gracePeriod": 3600  // 1 hour to cover worst case
}

Option 2: Split into Separate Monitors

[
  {
    "name": "Fast Path Processor",
    "gracePeriod": 300  // 5 minutes
  },
  {
    "name": "Slow Path Processor",
    "gracePeriod": 3600  // 1 hour
  }
]

First-Run Jobs

Jobs that take longer on first run (cache warming, etc.):

Set generous grace period initially
After 10-20 runs, review P95 duration in analytics
Adjust grace period based on actual performance
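
If you export run durations yourself, the review step can be sketched as follows. This assumes you have a plain list of durations in seconds; the nearest-rank percentile method and both function names are illustrative choices, and Saturn's analytics may compute P95 differently.

```python
from typing import Sequence


def p95_nearest_rank(durations_seconds: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not durations_seconds:
        raise ValueError("need at least one recorded duration")
    ordered = sorted(durations_seconds)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def suggest_grace_seconds(durations_seconds: Sequence[float],
                          minimum_grace: float = 120.0) -> float:
    """Apply max(P95 * 0.2, minimum_grace) to observed durations."""
    return max(p95_nearest_rank(durations_seconds) * 0.2, minimum_grace)


if __name__ == "__main__":
    # Hypothetical sample: 20 runs taking 1 to 20 minutes.
    sample = [60 * minutes for minutes in range(1, 21)]
    print(round(suggest_grace_seconds(sample)))
```

Because the first runs are the slow ones, consider whether to leave them out of the sample once the cache is warm.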

Time-of-Day Variations

Jobs that run faster at night, slower during business hours:

Option 1: Set grace period to cover peak hours

Option 2: Use maintenance windows during peak hours (see Maintenance Windows)

Grace Period Anti-Patterns

❌ Setting Grace = Schedule Interval

// DON'T DO THIS
{
  "schedule": {"type": "interval", "seconds": 3600},
  "gracePeriod": 3600  // Same as interval!
}

Problem: You won't detect issues until the next expected run, delaying alerts by a full interval.

Better:

{
  "schedule": {"type": "interval", "seconds": 3600},
  "gracePeriod": 300  // 5 minutes - catches issues within 5 min
}

❌ Extremely Short Grace for Variable Jobs

// DON'T DO THIS for variable jobs
{
  "schedule": {"type": "cron", "expression": "0 3 * * *"},
  "gracePeriod": 60  // 1 minute for a 10-30 min job
}

Problem: Constant false positives whenever the job takes 11 minutes instead of 10.

Better: Use 20-30% of the maximum expected duration.

❌ No Grace Period

// DON'T DO THIS
{
  "schedule": {"type": "interval", "seconds": 300},
  "gracePeriod": 0  // No tolerance
}

Problem: Even sub-second network delays cause false positives.

Better: Always allow at least 30-60 seconds.
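
The three anti-patterns above can be caught with a small pre-flight check before a monitor is created. This is only a sketch: `grace_warnings` is a hypothetical helper, and its thresholds (30 seconds, 20% of maximum duration) are taken from the "Better" guidance in this section rather than from any Saturn validation rule.

```python
from typing import List, Optional


def grace_warnings(grace_seconds: int,
                   interval_seconds: Optional[int] = None,
                   max_duration_seconds: Optional[int] = None) -> List[str]:
    """Return a warning for each anti-pattern the settings match."""
    warnings = []
    if grace_seconds <= 0:
        warnings.append("no grace period: any network delay causes a false positive")
    elif grace_seconds < 30:
        warnings.append("grace is under 30 seconds")
    if interval_seconds is not None and grace_seconds >= interval_seconds:
        warnings.append("grace is at least as long as the schedule interval")
    if (max_duration_seconds is not None
            and grace_seconds < 0.2 * max_duration_seconds):
        warnings.append("grace is under 20% of the maximum expected duration")
    return warnings


if __name__ == "__main__":
    # Hourly job with a one-hour grace period: the first anti-pattern.
    print(grace_warnings(3600, interval_seconds=3600))
```

An empty list means none of the three patterns matched; it does not mean the grace period is well tuned.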

Monitoring Grace Period Effectiveness

Check your monitor analytics to see:

Incident rate: Too high? Increase grace period.
Duration distribution: Wide spread? Need more grace.
P99 vs P95: Large gap? Consider sizing the grace period from the upper bound (P99) rather than P95.
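
One way to put a number on the P99 vs P95 check is to compare the two percentiles directly. The sketch below assumes you have exported run durations in seconds; the nearest-rank method, the function names, and the 1.5x threshold for a "large" gap are all illustrative assumptions.

```python
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], percent: int) -> float:
    """Nearest-rank percentile for an integer percent between 1 and 100."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # ceil(percent * n / 100)
    return ordered[rank - 1]


def has_long_tail(durations_seconds: Sequence[float],
                  threshold: float = 1.5) -> bool:
    """True when P99 exceeds P95 by more than the (assumed) threshold ratio."""
    p95 = percentile_nearest_rank(durations_seconds, 95)
    p99 = percentile_nearest_rank(durations_seconds, 99)
    return p99 > threshold * p95


if __name__ == "__main__":
    # Hypothetical sample: 95 runs of 100s and 5 slow runs of 300s.
    print(has_long_tail([100] * 95 + [300] * 5))
```

A long tail suggests that a grace period derived from P95 alone will still produce occasional false positives.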

Review Quarterly

Set a reminder to review grace periods every 3 months. Job characteristics change over time (more data, more users, more load).

Grace Period by Job Type

Job Type	Typical Duration	Recommended Grace
Health check	< 10s	1-2 min
API sync	1-5 min	2-5 min
Report generation	5-20 min	5-10 min
Database backup	20-60 min	10-20 min
ETL pipeline	1-6 hours	30-60 min
ML training	6-24 hours	1-4 hours

Alerts and Grace Periods

Alerts are sent only after the grace period expires. If a ping arrives during the grace period, no incident is created and no alerts are sent.
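
The timing rule can be sketched as a single comparison against a deadline. This is an illustration of the rule as described here, not Saturn's internal logic: `alert_due` and its parameters are hypothetical, and `ping_at` stands for the ping belonging to the run in question, if one has arrived.

```python
from datetime import datetime, timedelta
from typing import Optional


def alert_due(expected_at: datetime,
              grace_seconds: int,
              now: datetime,
              ping_at: Optional[datetime] = None) -> bool:
    """True once the grace period has expired without a ping for this run."""
    deadline = expected_at + timedelta(seconds=grace_seconds)
    if now <= deadline:
        return False  # still inside the grace period: no incident yet
    # The deadline has passed: alert unless the ping arrived in time.
    return ping_at is None or ping_at > deadline


if __name__ == "__main__":
    expected = datetime(2024, 1, 1, 3, 0)  # job expected at 03:00
    checked = datetime(2024, 1, 1, 3, 11)  # checked at 03:11, 10-minute grace
    print(alert_due(expected, 600, checked))
```

With a 10-minute grace period, a ping at 03:07 suppresses the alert, while a missing ping produces one as soon as 03:10 has passed.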

Integration-Specific Guidance

Kubernetes CronJobs

Account for:

Image pull time (first run)
Pod scheduling delay
Node resource availability

Recommendation: Add 2-5 minutes to expected job duration.

WordPress wp-cron

WordPress cron is visitor-triggered, so:

Low-traffic sites: Set grace to 15-30 minutes
High-traffic sites: Set grace to 5-10 minutes
With real cron: Set grace to 2-5 minutes

CI/CD Pipelines

Build times can vary significantly:

Cache hit: 2 minutes
Cache miss: 10 minutes
Dependency update: 20 minutes

Recommendation: Set grace to cover cache-miss scenarios (10-15 minutes).

Configuration Examples

Conservative (Fewer False Positives)

{
  "name": "Nightly Backup",
  "schedule": {"type": "cron", "expression": "0 3 * * *"},
  "gracePeriod": 3600  // 1 hour
}

Good for: New monitors, variable jobs, less critical services

Balanced (Recommended)

{
  "name": "API Sync",
  "schedule": {"type": "interval", "seconds": 900},
  "gracePeriod": 300  // 5 minutes
}

Good for: Most production jobs with consistent performance

Aggressive (Fast Alerts)

{
  "name": "Payment Processing",
  "schedule": {"type": "interval", "seconds": 60},
  "gracePeriod": 60  // 1 minute
}

Good for: Critical jobs with very consistent duration, high-priority alerts

Next Steps

Incidents — What happens when grace period expires
Anomalies — Detect issues even within grace period
Maintenance Windows — Suspend alerts temporarily
