flow

slo specs/slo/flow.kmd

Koder Flow — SLOs

Why this exists

#066 shipped Prometheus metrics under the koder_flow_* namespace (plus the legacy gitea_* names, dual-emitted per #106), and contrib/flow-monitoring-mixin/ ships the Grafana dashboards. Until this spec, no canonical SLO existed: every operator picked thresholds ad hoc, alerts were hand-rolled per deploy, and there was no shared yardstick for "Flow is healthy" vs "Flow is degraded".

This spec declares the canonical SLOs for flow.koder.dev. The flow-monitoring-mixin/alerts.libsonnet (FLOW-175) renders these into PrometheusAlertRule YAML; SREs deploy the rendered output against the prod Prometheus.

R1 — Service Level Objectives

SLI	Objective	Window
SLO-1: Web availability	99.9% of `GET /` responses are 2xx	30 days rolling
SLO-2: API availability	99.9% of `/api/v1/*` responses are non-5xx (2xx and 4xx both count as "available")	30 days rolling
SLO-3: Git push latency	p95 of `koder_flow_reference_transaction_duration_seconds{stage="committed"}` < 500ms	5 min rolling
SLO-4: Backup success	≥95% of `BackupRun` rows reach `status=success`	7 days rolling
SLO-5: Scheduler liveness	`koder_flow_backups_scheduler_ticks_total` increases at least once every 5 minutes	continuous
SLO-6: CRON parse safety	`koder_flow_backups_scheduler_cron_parse_failures_total{policy_id}` does not increase for any `policy_id`	continuous

SLO-3 depends on the FLOW-183 histogram landing. Until it does, the alert that consumes it (FlowGitPushP95LatencyHigh) renders but cannot fire, because the histogram buckets stay empty.
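The following is a minimal sketch of why an empty histogram cannot trip the p95 alert. It estimates a quantile from cumulative buckets by linear interpolation, in the spirit of PromQL's `histogram_quantile` but not a reimplementation of it. The function name `estimate_quantile` and the bucket boundaries are hypothetical; the spec does not state which buckets FLOW-183 will use.

```python
import math
from typing import Sequence, Tuple

# (upper_bound_seconds, cumulative_count), sorted by bound, last bound is +Inf.
Buckets = Sequence[Tuple[float, float]]


def estimate_quantile(q: float, buckets: Buckets) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    Returns NaN when the histogram holds no observations.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if not buckets or buckets[-1][1] == 0:
        return math.nan  # empty histogram: no estimate is possible

    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket; report the highest
                # finite bound instead of interpolating towards infinity.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical bucket layout, for illustration only.
populated = [(0.1, 50), (0.25, 80), (0.5, 90), (1.0, 98), (math.inf, 100)]
empty = [(bound, 0) for bound, _ in populated]

p95_populated = estimate_quantile(0.95, populated)  # lands in the 0.5s-1.0s bucket
p95_empty = estimate_quantile(0.95, empty)          # NaN

# SLO-3 threshold is 500ms; any comparison against NaN is False.
alert_fires_populated = p95_populated > 0.5
alert_fires_empty = p95_empty > 0.5
```

With the populated sample the estimate exceeds 500ms and the condition is true. With empty buckets the estimate is NaN, so the condition is false and the alert stays silent.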

R2 — Error budget interpretation

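For a 99.9% / 30d objective the budget is 43.2 minutes of downtime per 30-day window (0.1% of 43,200 minutes). A burn rate of 1× consumes the budget exactly over the full window. Burn-rate alerts fire when the budget is being consumed faster than 14.4× (page-soon) or 6× (ticket-grade) that rate.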
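The arithmetic behind these numbers can be checked with a short sketch. The helper names are hypothetical, and the alert windows passed to `budget_consumed` are the long windows from the burn-rate table in R3.

```python
from datetime import timedelta

SLO_WINDOW = timedelta(days=30)


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime for an availability objective over a window."""
    return window * (1.0 - slo)


def time_to_exhaustion(window: timedelta, burn_rate: float) -> timedelta:
    """How long the whole budget lasts at a constant burn rate."""
    return window / burn_rate


def budget_consumed(burn_rate: float, alert_window: timedelta,
                    slo_window: timedelta) -> float:
    """Fraction of the budget spent if burn_rate holds for alert_window."""
    return burn_rate * (alert_window / slo_window)


budget_minutes = error_budget(0.999, SLO_WINDOW).total_seconds() / 60
fast_hours = time_to_exhaustion(SLO_WINDOW, 14.4).total_seconds() / 3600
slow_days = time_to_exhaustion(SLO_WINDOW, 6.0).total_seconds() / 86400
fast_share = budget_consumed(14.4, timedelta(hours=1), SLO_WINDOW)
slow_share = budget_consumed(6.0, timedelta(hours=6), SLO_WINDOW)
```

Under these definitions a sustained 14.4× burn exhausts the 43.2-minute budget in about 50 hours and spends about 2% of it within the 1h window. A sustained 6× burn exhausts it in about 5 days and spends about 5% within the 6h window.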

R3 — Multi-window burn-rate alert rules

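For SLO-1 and SLO-2 the alerts use the standard long-window + short-window pair (SRE Workbook, chapter 5, "Alerting on SLOs"). An alert fires only while the error ratio exceeds the burn factor times the error budget rate in both windows: the long window confirms the burn is significant, and the short window confirms it is still happening.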

Severity	Long window	Short window	Burn factor
page	1h	5min	14.4×
ticket	6h	30min	6×

alerts.libsonnet renders both for SLO-1 and SLO-2 with names Flow{Web,API}AvailabilityBudgetBurn{Fast,Slow}.
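As a rough illustration of the two-window condition, the sketch below evaluates the table above against error ratios supplied per window. It is not the logic of alerts.libsonnet, which this spec does not show. The mapping of the `Fast` suffix to the page rule and `Slow` to the ticket rule is an assumption. Error ratios are plain inputs because the spec does not name the HTTP request metrics behind SLO-1 and SLO-2.

```python
from dataclasses import dataclass
from typing import List, Mapping

SLO_TARGET = 0.999  # objective for SLO-1 and SLO-2


@dataclass(frozen=True)
class BurnRule:
    suffix: str        # assumed mapping: Fast = page, Slow = ticket
    severity: str
    long_window: str
    short_window: str
    factor: float


RULES = (
    BurnRule("Fast", "page", "1h", "5m", 14.4),
    BurnRule("Slow", "ticket", "6h", "30m", 6.0),
)


def firing_alerts(sli: str, error_ratio: Mapping[str, float]) -> List[str]:
    """Return the burn-rate alert names that fire for one SLI.

    sli is "Web" or "API"; error_ratio maps a window label such as "1h"
    to the fraction of failed requests observed over that window.
    """
    budget_rate = 1.0 - SLO_TARGET
    names = []
    for rule in RULES:
        threshold = rule.factor * budget_rate
        # Both windows must exceed the threshold for the rule to fire.
        if (error_ratio[rule.long_window] > threshold
                and error_ratio[rule.short_window] > threshold):
            names.append(f"Flow{sli}AvailabilityBudgetBurn{rule.suffix}")
    return names


# A sharp outage that is still ongoing: only the fast rule fires.
outage = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004}
# A brief spike that has not moved the long windows: nothing fires.
spike = {"5m": 0.05, "30m": 0.01, "1h": 0.002, "6h": 0.001}
# A steady low-grade burn: only the slow rule fires.
slow_burn = {"5m": 0.008, "30m": 0.008, "1h": 0.008, "6h": 0.008}
```

The `spike` case shows why the pair matters: a 5% error ratio over five minutes alone would page under a single-window rule, but the long window filters it out.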

R4 — Single-window alerts (for the remaining SLOs)

Alert	Condition	Severity
`FlowGitPushP95LatencyHigh`	p95 over 5min > 500ms	ticket
`FlowGitPushP99LatencyHigh`	p99 over 5min > 2s	ticket
`FlowBackupFailureRateHigh`	failed/total over 7d > 5%	page
`FlowSchedulerTickStalled`	`increase(...ticks_total[10m]) == 0`	page
`FlowCronParseFailureRecurring`	any `increase(...cron_parse_failures_total[10m]) > 0`	ticket
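The counter-based and ratio-based conditions in this table can be sketched as plain functions. This is a simplified model under stated assumptions, not the rendered rules: `counter_increase` handles counter resets but omits the extrapolation PromQL's `increase()` applies, and all function names are hypothetical. Treating a window with no backup runs as "not failing" is also an assumption.

```python
from typing import Optional, Sequence


def counter_increase(samples: Sequence[float]) -> Optional[float]:
    """Increase of a counter across the samples in one window.

    A drop between consecutive samples is treated as a counter reset.
    Returns None when fewer than two samples exist, mirroring the
    no-data case in which a rule has nothing to compare.
    """
    if len(samples) < 2:
        return None
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def scheduler_tick_stalled(tick_samples: Sequence[float]) -> bool:
    """FlowSchedulerTickStalled: no ticks over the window."""
    return counter_increase(tick_samples) == 0


def cron_parse_failure_recurring(failure_samples: Sequence[float]) -> bool:
    """FlowCronParseFailureRecurring: any new parse failure in the window."""
    increase = counter_increase(failure_samples)
    return increase is not None and increase > 0


def backup_failure_rate_high(failed: int, total: int,
                             threshold: float = 0.05) -> bool:
    """FlowBackupFailureRateHigh: failed/total above 5% over the window."""
    if total == 0:
        return False  # assumption: no runs means no failure-rate signal
    return failed / total > threshold
```

Note that `backup_failure_rate_high(1, 20)` is false because exactly 5% does not exceed the threshold, which matches the "≥95% succeed" wording of SLO-4.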

R5 — Where these run

Prometheus instance scraping flow.koder.dev/metrics (infra/observe/prom once landed; meanwhile the staging Prometheus on the operator's chosen host).
Alertmanager routes page → on-call paging; ticket → the operations inbox of the Flow team.
Grafana mixin (dashboards_out/) embeds matching panels with the burn-rate visualization so investigating an alert lands on a pre-built board.

R6 — Out of scope

Composite "Flow is up" alert (rolled by Alertmanager grouping, not by the rule definitions).
Per-org / per-repo SLOs (a multi-tenant SLO breakdown, deferred until Flow grows multi-tenant-by-default surfaces).
Synthetic probes from outside the network (covered by an external uptime monitor — out of band).

R7 — Owner + revision cadence

Owner: products/dev/flow/engine.
Reviewed at each Flow release wave (the operator confirms that the thresholds still match production reality).
Threshold changes go through normal PR review; tighten over time as the service matures.