Koder Flow — SLOs

Why this exists

#066 shipped Prometheus metrics under the koder_flow_* namespace (+ legacy gitea_* dual-emit per #106). The contrib/flow-monitoring-mixin/ ships Grafana dashboards. Until this spec, no canonical SLO existed — every operator picked thresholds ad-hoc, alerts got hand-rolled per deploy, and there was no shared yardstick for "Flow is healthy" vs "Flow is degraded".

This spec declares the canonical SLOs for flow.koder.dev. The flow-monitoring-mixin/alerts.libsonnet (FLOW-175) renders these into PrometheusAlertRule YAML; SREs deploy the rendered output against the prod Prometheus.

R1 — Service Level Objectives

| SLI | Objective | Window |
|---|---|---|
| SLO-1: Web availability | 99.9% of GET / responses are 2xx | 30 days rolling |
| SLO-2: API availability | 99.9% of /api/v1/* responses are non-5xx (2xx and 4xx both count as "available") | 30 days rolling |
| SLO-3: Git push latency | p95 of koder_flow_reference_transaction_duration_seconds{stage="committed"} < 500ms | 5 min rolling |
| SLO-4: Backup success | ≥95% of BackupRun rows reach status=success | 7 days rolling |
| SLO-5: Scheduler liveness | koder_flow_backups_scheduler_ticks_total increases at least once every 5 minutes | continuous |
| SLO-6: CRON parse safety | koder_flow_backups_scheduler_cron_parse_failures_total{policy_id} is zero | continuous |
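SLO-1 and SLO-2 classify responses differently: the web SLI counts only 2xx as good, while the API SLI counts everything below 500. A minimal Python sketch of that distinction follows. The function names and the `status_counts` mapping (status code to request count) are hypothetical and are not part of the mixin.

```python
import math
from typing import Mapping


def web_availability(status_counts: Mapping[int, int]) -> float:
    """SLO-1 style SLI: fraction of responses with a 2xx status."""
    total = sum(status_counts.values())
    if total == 0:
        return math.nan  # no traffic in the window: the SLI is undefined
    good = sum(n for code, n in status_counts.items() if 200 <= code < 300)
    return good / total


def api_availability(status_counts: Mapping[int, int]) -> float:
    """SLO-2 style SLI: fraction of responses that are not 5xx."""
    total = sum(status_counts.values())
    if total == 0:
        return math.nan
    good = sum(n for code, n in status_counts.items() if code < 500)
    return good / total
```

With the same traffic, a 404 lowers the web SLI but leaves the API SLI untouched, which is why the two objectives cannot share one recording rule.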

SLO-3 depends on the FLOW-183 histogram landing. Until it does, the alert that consumes it (FlowGitPushP95LatencyHigh) renders but does not fire — buckets stay empty.
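To illustrate why empty buckets keep the alert silent, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation within a bucket, which is the approach Prometheus documents for `histogram_quantile` on classic histograms. It is an approximation written for this spec, not the Prometheus implementation; the function names and the bucket boundaries in the usage note are assumptions.

```python
import math
from typing import Sequence, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: Sequence[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets sorted by upper bound.

    The last bucket must be the +Inf bucket holding the total count.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank falls in the +Inf bucket
            span = count - prev_count
            if span == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return math.nan


def p95_latency_high(buckets: Sequence[Bucket], threshold_s: float = 0.5) -> bool:
    """Condition of FlowGitPushP95LatencyHigh; NaN never compares greater."""
    return histogram_quantile(0.95, buckets) > threshold_s
```

Calling `p95_latency_high([(0.1, 0), (0.5, 0), (math.inf, 0)])` returns `False`, because the estimate is NaN and a comparison against NaN is never true. That matches the behaviour described above: the rule exists, but cannot fire until the histogram receives observations.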

R2 — Error budget interpretation

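For a 99.9% / 30d objective the error budget is 0.1% of 30 days, i.e. 43.2 minutes of downtime per window. The burn rate is how fast that budget is being consumed relative to the rate that would use it up exactly at the end of the window. Burn-rate alerts fire when the burn rate exceeds 14.4× (page-soon; at that pace the budget is gone in about 50 hours) or 6× (ticket-grade; gone in 5 days).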
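The arithmetic behind these numbers can be checked with a few lines of Python. This is only a worked calculation; the helper names are made up for the example.

```python
def error_budget_minutes(objective: float, window_days: float) -> float:
    """Allowed downtime, in minutes, for an availability objective over a window."""
    return (1.0 - objective) * window_days * 24 * 60


def hours_to_exhaustion(burn_factor: float, window_days: float) -> float:
    """Hours until the whole budget is spent at a constant burn factor."""
    return window_days * 24 / burn_factor


def budget_fraction_consumed(burn_factor: float, alert_window_hours: float,
                             window_days: float) -> float:
    """Share of the budget spent if the burn factor holds for the alert window."""
    return burn_factor * alert_window_hours / (window_days * 24)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999, 30), 1))           # about 43.2 minutes
    print(round(hours_to_exhaustion(14.4, 30), 1))              # about 50 hours
    print(round(hours_to_exhaustion(6, 30), 1))                 # about 120 hours
    print(round(budget_fraction_consumed(14.4, 1, 30), 3))      # about 2% of the budget
```

Read this way, a 14.4× burn sustained for one hour spends roughly 2% of the monthly budget, and a 6× burn sustained for six hours spends roughly 5%.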

R3 — Multi-window burn-rate alert rules

For SLO-1 and SLO-2 the alerts use the standard short-window + long-window pair (SRE Workbook §5):

| Severity | Long window | Short window | Burn factor |
|---|---|---|---|
| page | 1h | 5min | 14.4× |
| ticket | 6h | 30min | 6× |

alerts.libsonnet renders both for SLO-1 and SLO-2 with names Flow{Web,API}AvailabilityBudgetBurn{Fast,Slow}.
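The multi-window logic can be sketched in Python as follows: an alert fires only when both the long and the short window burn faster than the factor allows, so a brief spike or an already-recovered incident does not alert. The `BurnRule` type, the window keys, and the `firing_severities` helper are illustrative assumptions; the 6× factor for the ticket row is taken from R2. The rendered rules in alerts.libsonnet remain the source of truth.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class BurnRule:
    severity: str
    long_window: str
    short_window: str
    factor: float


# Window pairs from R3; burn factors from R2.
RULES = (
    BurnRule("page", "1h", "5m", 14.4),
    BurnRule("ticket", "6h", "30m", 6.0),
)


def firing_severities(objective: float,
                      error_ratio_by_window: Mapping[str, float]) -> List[str]:
    """Return the severities whose long AND short windows both exceed the threshold."""
    budget = 1.0 - objective  # allowed error ratio, e.g. 0.001 for 99.9%
    fired = []
    for rule in RULES:
        threshold = rule.factor * budget
        long_bad = error_ratio_by_window[rule.long_window] > threshold
        short_bad = error_ratio_by_window[rule.short_window] > threshold
        if long_bad and short_bad:
            fired.append(rule.severity)
    return fired
```

For example, with a 99.9% objective the page threshold is an error ratio of about 1.44% and the ticket threshold is 0.6%. If the 1h ratio is 1.6% but the 5m ratio has already dropped to 0.05%, nothing pages, because the short window shows the burn has stopped.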

R4 — Single-window alerts (for the remaining SLOs)

| Alert | Condition | Severity |
|---|---|---|
| FlowGitPushP95LatencyHigh | p95 over 5min > 500ms | ticket |
| FlowGitPushP99LatencyHigh | p99 over 5min > 2s | ticket |
| FlowBackupFailureRateHigh | failed/total over 7d > 5% | page |
| FlowSchedulerTickStalled | increase(...ticks_total[10m]) == 0 | page |
| FlowCronParseFailureRecurring | any increase(...cron_parse_failures_total[10m]) > 0 | ticket |
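The counter-based conditions above can be approximated in Python to show what each rule tests. This is a simplified sketch: `counter_increase` handles counter resets but does not extrapolate to the window edges the way PromQL `increase()` does, and all function names are hypothetical.

```python
from typing import Optional, Sequence


def counter_increase(samples: Sequence[float]) -> Optional[float]:
    """Total increase across ordered counter samples within one window.

    A drop between two samples is treated as a counter reset, so the new
    value counts as the increase. Returns None with fewer than two samples.
    """
    if len(samples) < 2:
        return None
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def scheduler_tick_stalled(tick_samples: Sequence[float]) -> bool:
    """FlowSchedulerTickStalled: no tick recorded during the window."""
    inc = counter_increase(tick_samples)
    return inc is not None and inc == 0


def cron_parse_failure_recurring(failure_samples: Sequence[float]) -> bool:
    """FlowCronParseFailureRecurring: at least one new parse failure."""
    inc = counter_increase(failure_samples)
    return inc is not None and inc > 0


def backup_failure_rate_high(failed: int, total: int,
                             threshold: float = 0.05) -> bool:
    """FlowBackupFailureRateHigh: failed/total strictly above the threshold."""
    return total > 0 and failed / total > threshold
```

Note the asymmetry: the stall check alerts on the absence of change, while the parse-failure check alerts on any change at all. With one backup per day, a single failure in seven days (1/7, about 14%) already exceeds the 5% threshold.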

R5 — Where these run

  • Prometheus instance scraping flow.koder.dev/metrics (infra/observe/prom once landed; meanwhile the staging Prometheus on the operator's chosen host).
  • Alertmanager routes page → on-call paging; ticket → the operations inbox of the Flow team.
  • Grafana mixin (dashboards_out/) embeds matching panels with the burn-rate visualization so investigating an alert lands on a pre-built board.

R6 — Out of scope

  • Composite "Flow is up" alert (rolled by Alertmanager grouping, not by the rule definitions).
  • Per-org / per-repo SLOs (a multi-tenant SLO breakdown, deferred until Flow grows multi-tenant-by-default surfaces).
  • Synthetic probes from outside the network (covered by an external uptime monitor — out of band).

R7 — Owner + revision cadence

  • Owner: products/dev/flow/engine.
  • Reviewed at each Flow release wave (the operator confirms that thresholds still match production reality).
  • Threshold changes go through normal PR review; tighten over time as the service matures.