Welford's Online Algorithm
Saturn uses Welford's online algorithm to calculate mean, variance, and standard deviation incrementally without storing all historical data.
The Problem
Traditional statistics calculation:
# Naive approach - requires storing all values
values = [10.2, 12.5, 11.8, ..., 13.1] # All historical durations
mean = sum(values) / len(values)
variance = sum((x - mean)**2 for x in values) / len(values)
stddev = sqrt(variance)
Problems:
- 📦 Memory grows with every run (50 runs = 50 stored values)
- 🐌 Recalculation time grows linearly
- 💾 Database/storage costs increase
For 1 million monitors × 1000 runs each = 1 billion stored values!
The Solution: Welford's Algorithm
Welford's algorithm calculates statistics incrementally using only:
- Running count
- Running mean
- Running sum of squared differences (M2)
Memory: O(1) — constant, regardless of run count
Time: O(1) — constant per update
How It Works
State Variables
interface WelfordState {
count: number; // Number of values seen
mean: number; // Current mean
m2: number; // Sum of squared differences from mean
min: number; // Minimum value
max: number; // Maximum value
}
Only 20-40 bytes per monitor!
Update Formula
When a new value arrives:
function updateWelford(state: WelfordState, newValue: number): WelfordState {
const count = state.count + 1;
const delta = newValue - state.mean;
const mean = state.mean + delta / count;
const delta2 = newValue - mean;
const m2 = state.m2 + delta * delta2;
return {
count,
mean,
m2,
min: Math.min(state.min, newValue),
max: Math.max(state.max, newValue),
};
}
Derive Statistics
From the state, calculate:
function getStatistics(state: WelfordState) {
if (state.count < 2) {
return { mean: state.mean, variance: 0, stddev: 0 };
}
const variance = state.m2 / (state.count - 1);
const stddev = Math.sqrt(variance);
return {
mean: state.mean,
variance,
stddev,
min: state.min,
max: state.max,
};
}
Example Walkthrough
Track durations: [10, 12, 11, 15, 13]
Step 1: Initialize
count = 0
mean = 0
m2 = 0
min = +∞
max = -∞
Step 2: Add 10
count = 1
delta = 10 - 0 = 10
mean = 0 + 10/1 = 10
delta2 = 10 - 10 = 0
m2 = 0 + (10 * 0) = 0
min = 10, max = 10
Step 3: Add 12
count = 2
delta = 12 - 10 = 2
mean = 10 + 2/2 = 11
delta2 = 12 - 11 = 1
m2 = 0 + (2 * 1) = 2
min = 10, max = 12
Step 4: Add 11
count = 3
delta = 11 - 11 = 0
mean = 11 + 0/3 = 11
delta2 = 11 - 11 = 0
m2 = 2 + (0 * 0) = 2
min = 10, max = 12
Step 5: Add 15
count = 4
delta = 15 - 11 = 4
mean = 11 + 4/4 = 12
delta2 = 15 - 12 = 3
m2 = 2 + (4 * 3) = 14
min = 10, max = 15
Step 6: Add 13
count = 5
delta = 13 - 12 = 1
mean = 12 + 1/5 = 12.2
delta2 = 13 - 12.2 = 0.8
m2 = 14 + (1 * 0.8) = 14.8
min = 10, max = 15
Final Statistics
Mean = 12.2
Variance = 14.8 / (5 - 1) = 3.7
Std Dev = √3.7 = 1.92
Min = 10
Max = 15
Verify:
Manual: (10+12+11+15+13)/5 = 61/5 = 12.2 ✓
Mathematical Proof
Why It Works
The algorithm maintains the invariant:
m2 = Σ(xi - mean)² for all values seen
When adding a new value xₙ:
New mean: μₙ = μₙ₋₁ + (xₙ - μₙ₋₁)/n
M2ₙ = M2ₙ₋₁ + (xₙ - μₙ₋₁)(xₙ - μₙ)
This ensures:
Variance = M2 / (n - 1)
Proof uses the fact that the sum of squared differences can be updated incrementally by accounting for the shift in mean.
Full proof: Welford (1962)
Implementation in Saturn
// From actual Saturn codebase
async function recordPing(monitorId: string, duration: number) {
const stats = await db.monitorStats.findUnique({
where: { monitorId },
});
if (!stats) {
// First run - initialize
await db.monitorStats.create({
data: {
monitorId,
count: 1,
mean: duration,
m2: 0,
min: duration,
max: duration,
},
});
return;
}
// Update using Welford
const count = stats.count + 1;
const delta = duration - stats.mean;
const mean = stats.mean + delta / count;
const delta2 = duration - mean;
const m2 = stats.m2 + delta * delta2;
await db.monitorStats.update({
where: { monitorId },
data: {
count,
mean,
m2,
min: Math.min(stats.min, duration),
max: Math.max(stats.max, duration),
},
});
// Check for anomalies
if (count >= 10) { // Baseline established
const variance = m2 / (count - 1);
const stddev = Math.sqrt(variance);
const zScore = (duration - mean) / stddev;
if (zScore > 3.0) {
await createAnomalyIncident(monitorId, {
zScore,
duration,
mean,
stddev,
});
}
}
}
Benefits
1. Memory Efficiency
| Approach | Memory per Monitor |
|---|---|
| Store all values | 50 runs × 8 bytes = 400 bytes |
| 1000 runs × 8 bytes = 8 KB | |
| 10,000 runs × 8 bytes = 80 KB | |
| Welford | 40 bytes (constant) |
For 1 million monitors with 1000 runs each:
- Traditional: 8 GB
- Welford: 40 MB (200× smaller!)
2. Speed
Update time is constant:
- No iteration over historical values
- No recalculation from scratch
- Single database write per ping
3. Numerical Stability
Welford's algorithm is more numerically stable than naive variance calculation:
# Naive (unstable for large means)
variance = sum(x**2)/n - (sum(x)/n)**2
# Can have catastrophic cancellation if values are large
Welford avoids this by tracking differences from the mean.
Limitations
1. Cannot Calculate Median
Median requires storing values or using approximations:
Saturn's approach: Store last 50 run durations for median
interface ExtendedStats extends WelfordState {
recentDurations: number[]; // Ring buffer, last 50
}
Memory: 40 bytes (Welford) + 400 bytes (50 × 8) = 440 bytes
2. Cannot Calculate Percentiles
P95/P99 also require storing values or using sketches (t-digest, etc.)
Saturn's approach: Store last 100 runs for percentile calculation
3. Weighted Updates
Standard Welford doesn't support weighted observations (e.g., recent runs more important).
Saturn's approach: Optional exponential decay factor for adapting to trends.
Extensions in Saturn
Rolling Window
Limit statistics to last N runs:
// Only consider last 100 runs
const maxCount = 100;
if (state.count >= maxCount) {
// Reset or use exponential moving average
state.count = maxCount;
}
Outlier Rejection
Exclude extreme outliers from baseline:
// Don't update stats if value is >10σ (likely corrupted)
if (Math.abs((newValue - state.mean) / stddev) > 10) {
return state; // Ignore this value
}
Comparing Approaches
| Feature | Naive | Welford | T-Digest |
|---|---|---|---|
| Memory | O(n) | O(1) | O(log n) |
| Update time | O(n) | O(1) | O(log n) |
| Mean | ✓ | ✓ | ✗ |
| Variance | ✓ | ✓ | ✗ |
| Median | ✓ | ✗ | ✓ (approx) |
| Percentiles | ✓ | ✗ | ✓ (approx) |
| Exact | ✓ | ✓ | ✗ |
Saturn uses Welford + ring buffer for best of both worlds.
Further Reading
- Welford, B. P. (1962). "Note on a method for calculating corrected sums of squares and products"
- Knuth, TAOCP Vol 2, Section 4.2.2
- Wikipedia: Algorithms for calculating variance
Next Steps
- Anomaly Rules — How statistics are used for detection
- Anomaly Tuning — Optimize detection parameters
- Analytics — Visualize statistical trends