Koder Flow — SLOs
Why this exists
#066 shipped Prometheus metrics under the koder_flow_* namespace (plus
legacy gitea_* dual-emit per #106), and contrib/flow-monitoring-mixin/
ships Grafana dashboards. Until this spec, no canonical SLO existed:
every operator picked thresholds ad hoc, alerts were hand-rolled per
deploy, and there was no shared yardstick for "Flow is healthy" vs
"Flow is degraded".
This spec declares the canonical SLOs for flow.koder.dev. The
flow-monitoring-mixin/alerts.libsonnet (FLOW-175) renders these into
PrometheusAlertRule YAML; SREs deploy the rendered output against the
prod Prometheus.
R1 — Service Level Objectives
| SLI | Objective | Window |
|---|---|---|
| SLO-1: Web availability | 99.9% of GET / responses are 2xx | 30 days rolling |
| SLO-2: API availability | 99.9% of /api/v1/* responses are non-5xx (2xx and 4xx both count as "available") | 30 days rolling |
| SLO-3: Git push latency | p95 of koder_flow_reference_transaction_duration_seconds{stage="committed"} < 500ms | 5 min rolling |
| SLO-4: Backup success | ≥95% of BackupRun rows reach status=success | 7 days rolling |
| SLO-5: Scheduler liveness | koder_flow_backups_scheduler_ticks_total increases at least once every 5 minutes | continuous |
| SLO-6: CRON parse safety | koder_flow_backups_scheduler_cron_parse_failures_total{policy_id} is zero | continuous |
SLO-3 depends on the FLOW-183 histogram landing. Until it does, the alert that consumes it (FlowGitPushP95LatencyHigh) renders but does not fire, because the buckets stay empty.
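To illustrate why empty buckets keep the alert silent, here is a minimal Python sketch of a bucket-based quantile estimate. It is a simplified approximation of how a Prometheus-style histogram quantile works (linear interpolation inside the matching bucket), not the mixin's code; the function name and the bucket boundaries are assumptions chosen for the example.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, with the last bound being +inf. Simplified sketch only.
    """
    if not buckets or buckets[-1][1] == 0:
        return math.nan  # no observations: the quantile is undefined
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # fall back to the highest finite bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical bucket layout; the real FLOW-183 boundaries are not given here.
empty = [(0.1, 0), (0.25, 0), (0.5, 0), (1.0, 0), (math.inf, 0)]
busy = [(0.1, 50), (0.25, 80), (0.5, 90), (1.0, 98), (math.inf, 100)]

p95_empty = estimate_quantile(0.95, empty)
p95_busy = estimate_quantile(0.95, busy)

alert_empty = p95_empty > 0.5  # NaN compares False, so nothing fires
alert_busy = p95_busy > 0.5    # p95 falls in the (0.5, 1.0] bucket, so it fires
```

With no observations the estimate is NaN, and any comparison against the 500ms threshold evaluates to false, which matches the "renders but does not fire" behavior described above.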
R2 — Error budget interpretation
For a 99.9% / 30d objective the budget is 43.2 minutes of downtime per 30-day window (0.1% of 43,200 minutes). Burn-rate alerts fire when the budget is being consumed faster than 14.4× (page-soon) or 6× (ticket-grade) the rate that would exhaust it exactly at the end of the window.
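The arithmetic behind these figures can be sketched in a few lines of Python. The function names are illustrative and not part of the mixin.

```python
def error_budget_minutes(objective, window_days):
    """Minutes of allowed unavailability for an objective over a window."""
    return (1.0 - objective) * window_days * 24 * 60

def hours_to_exhaustion(window_days, burn_rate):
    """Hours until the whole budget is gone at a constant burn rate."""
    return window_days * 24 / burn_rate

budget = error_budget_minutes(0.999, 30)    # about 43.2 minutes
page_runway = hours_to_exhaustion(30, 14.4)  # about 50 hours
ticket_runway = hours_to_exhaustion(30, 6)   # about 120 hours (5 days)
```

A sustained 14.4× burn leaves roughly two days before the budget is gone, while a 6× burn leaves about five days. That difference is why the first is page-soon and the second is ticket-grade.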
R3 — Multi-window burn-rate alert rules
For SLO-1 and SLO-2 the alerts use the standard short-window + long-window pair (SRE Workbook §5):
| Severity | Long window | Short window | Burn factor |
|---|---|---|---|
| page | 1h | 5min | 14.4× |
| ticket | 6h | 30min | 6× |
alerts.libsonnet renders both for SLO-1 and SLO-2 with names
Flow{Web,API}AvailabilityBudgetBurn{Fast,Slow}.
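As a rough sketch of the multi-window condition, the Python below evaluates the two rules from the table against already-computed error ratios. It assumes the usual formulation in which both the long and the short window must exceed factor × (1 − objective). The class and function names are hypothetical, and the actual PromQL rendered by alerts.libsonnet may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BurnRule:
    severity: str
    long_window_hours: float
    short_window_hours: float
    factor: float

# Values copied from the table above.
RULES = {
    "page": BurnRule("page", 1.0, 5 / 60, 14.4),
    "ticket": BurnRule("ticket", 6.0, 0.5, 6.0),
}

def threshold(rule, objective=0.999):
    """Error ratio above which the rule's burn factor is exceeded."""
    return rule.factor * (1.0 - objective)

def fires(rule, long_error_ratio, short_error_ratio, objective=0.999):
    """Both windows must be over the threshold for the alert to fire."""
    limit = threshold(rule, objective)
    return long_error_ratio > limit and short_error_ratio > limit

def budget_spent_in_long_window(rule, window_days=30):
    """Fraction of the SLO-window budget consumed at the rule's burn
    factor over its long window."""
    return rule.factor * rule.long_window_hours / (window_days * 24)

still_burning = fires(RULES["page"], 0.02, 0.02)   # True: both windows hot
recovered = fires(RULES["page"], 0.02, 0.001)      # False: short window is calm
```

The short window is what lets an alert clear quickly once the error rate recovers. With these numbers, the page rule corresponds to about 2% of the 30-day budget spent within one hour, and the ticket rule to about 5% spent within six hours.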
R4 — Single-window alerts (for the remaining SLOs)
| Alert | Condition | Severity |
|---|---|---|
| FlowGitPushP95LatencyHigh | p95 over 5min > 500ms | ticket |
| FlowGitPushP99LatencyHigh | p99 over 5min > 2s | ticket |
| FlowBackupFailureRateHigh | failed/total over 7d > 5% | page |
| FlowSchedulerTickStalled | increase(...ticks_total[10m]) == 0 | page |
| FlowCronParseFailureRecurring | any increase(...cron_parse_failures_total[10m]) > 0 | ticket |
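The counter-based and ratio-based conditions in this table can be sketched in Python as follows. This is a simplified model under stated assumptions: it takes raw counter samples from the 10-minute window, ignores counter resets and the extrapolation Prometheus applies in increase(), and treats a window with fewer than two samples as "no data, no alert". All function names are hypothetical.

```python
def counter_increase(samples):
    """Increase across ordered counter samples, or None without enough data."""
    if len(samples) < 2:
        return None
    return float(samples[-1] - samples[0])

def scheduler_tick_stalled(tick_samples):
    """FlowSchedulerTickStalled: the tick counter did not move in the window."""
    delta = counter_increase(tick_samples)
    return delta is not None and delta == 0

def cron_parse_failure_recurring(failure_samples):
    """FlowCronParseFailureRecurring: any new parse failure in the window."""
    delta = counter_increase(failure_samples)
    return delta is not None and delta > 0

def backup_failure_rate_high(failed, total, max_failure_ratio=0.05):
    """FlowBackupFailureRateHigh: more than 5% of runs failed over 7 days."""
    if total == 0:
        return False  # no runs in the window, nothing to judge
    return failed / total > max_failure_ratio
```

Note that exactly 5% failures (for example 1 failed run out of 20) does not trip the backup alert, consistent with SLO-4 requiring at least 95% success.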
R5 — Where these run
- Prometheus instance scraping flow.koder.dev/metrics (infra/observe/prom once landed; meanwhile the staging Prometheus on the operator's chosen host).
- Alertmanager routes page → on-call paging; ticket → the operations inbox of the Flow team.
- Grafana mixin (dashboards_out/) embeds matching panels with the burn-rate visualization so investigating an alert lands on a pre-built board.
R6 — Out of scope
- Composite "Flow is up" alert (rolled by Alertmanager grouping, not by the rule definitions).
- Per-org / per-repo SLOs (multi-tenant SLO breakdown; deferred until Flow grows multi-tenant-by-default surfaces).
- Synthetic probes from outside the network (covered by an external uptime monitor — out of band).
R7 — Owner + revision cadence
- Owner: products/dev/flow/engine.
- Reviewed each Flow release wave (operator confirms thresholds still match the production reality).
- Threshold changes go through normal PR review; tighten over time as the service matures.