Expand-migrate-contract migration recipes
migrations specs/migrations/expand-migrate-contract.kmd
Concrete recipes for the expand → migrate → contract pattern mandated by `policies/always-on.kmd § R3.1`, per storage tier. Components copy, paste, and tweak; they don't reinvent the pattern (and they don't run a big-bang ALTER that locks up the fleet).
Specification body
Spec: Expand-migrate-contract recipes (R3.1)
Status: draft v0.1 (2026-05-24). Recipes validated in production in ≥ 1 component will be promoted to
stable. Components consult this doc before touching schema/layout so they don't take down the fleet.
The rule (from always-on.kmd § R3.1)
Every migration that changes the shape of data spans at least 3 releases:
- Expand (release N): adds the new shape; code in N writes both shapes and reads the new one preferentially, falling back to the old one.
- Migrate (between N and N+1): an asynchronous backfill converts old rows to the new shape (background job, idempotent, resumable, per-tenant).
- Contract (release N+1, after the migrate is confirmed): code stops writing the old shape; optionally drops the column/table.
Validate that N+1 and N coexist during the rollout interval (T4
in specs/testing/always-on-recipes.kmd).
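One way to reason about the rule is to model, per release, which shapes the code writes and which it can read. The sketch below is only an illustration: the `Release` type, the `N-1`/`N`/`N+1` line-up, and the `CanCoexist` check are hypothetical names, not part of R3.1 or T4.

```go
package main

// Form is a bitset of the data shapes a release can handle.
type Form uint8

const (
	Old Form = 1 << iota // legacy shape
	New                  // new canonical shape
)

// Release describes which shapes a release writes and which it can read.
type Release struct {
	Name   string
	Writes Form
	Reads  Form
}

// Hypothetical line-up for one migration. Contract reading only New
// assumes the backfill has been confirmed complete.
var (
	PreExpand = Release{Name: "N-1", Writes: Old, Reads: Old}
	Expand    = Release{Name: "N", Writes: Old | New, Reads: Old | New}
	Contract  = Release{Name: "N+1", Writes: New, Reads: New}
)

// CanCoexist reports whether a and b can serve traffic side by side:
// each must write at least one shape that the other is able to read.
func CanCoexist(a, b Release) bool {
	return a.Writes&b.Reads != 0 && b.Writes&a.Reads != 0
}
```

Under these assumptions `CanCoexist(PreExpand, Expand)` and `CanCoexist(Expand, Contract)` hold, while `CanCoexist(PreExpand, Contract)` does not, which is why the expand release has to sit between the two.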
Why this exists
R3.1 is the dimension that most often breaks rollouts in practice. ALTER TABLE ... ADD COLUMN NOT NULL DEFAULT now() on a live table can lock
writes for minutes. Worse: dropping a column in release N while an
N-1 client already in production still reads it means a silent 500 for the user. Without a
formal pattern, each component invents a different path and the Stack
accumulates debt from unsafe migrations.
By storage tier
Postgres / kdb-next (SQL OLTP)
Expand: add a nullable column
-- release N: migration 0042_users_add_phone.up.sql
ALTER TABLE users
    ADD COLUMN phone TEXT; -- nullable, no default: metadata-only change, no table rewrite
-- Index online (won't block writes); CONCURRENTLY cannot run inside a transaction block
CREATE INDEX CONCURRENTLY idx_users_phone ON users (phone);
Code change in release N:
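import "database/sql"

var db *sql.DB // assumed: opened elsewhere with a Postgres driver

// Write: populate both the old and the new column during expand.
type UserUpdate struct {
	ID          int64
	LegacyPhone *string // deprecated, kept until the contract release
	Phone       *string // new canonical column
}

func WriteUser(u UserUpdate) error {
	if u.Phone != nil && u.LegacyPhone == nil {
		u.LegacyPhone = u.Phone // dual-write
	}
	// A nil *string is sent as NULL; this assumes the caller passes the full desired state.
	_, err := db.Exec(
		"UPDATE users SET phone = $1, legacy_phone = $2 WHERE id = $3",
		u.Phone, u.LegacyPhone, u.ID)
	return err
}

// Read: prefer the new column, fall back to the old one.
func ReadUser(id int64) (User, error) {
	var phone, legacy sql.NullString
	err := db.QueryRow(
		"SELECT phone, legacy_phone FROM users WHERE id = $1", id).Scan(&phone, &legacy)
	if err != nil { return User{}, err }
	if phone.Valid { return User{Phone: phone.String}, nil }
	return User{Phone: legacy.String}, nil
}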
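The read rule can also be pulled out into a pure function so it is unit-testable without a database. This is a minimal sketch; `resolvePhone` is a hypothetical helper, and it additionally reports the case where neither column holds a value, which the recipe above does not distinguish.

```go
package main

// resolvePhone picks the value a reader should return during the expand
// phase: the new column wins when it is populated, otherwise the legacy
// column is used. A nil pointer stands for SQL NULL.
// ok is false when neither column holds a value.
func resolvePhone(phone, legacy *string) (value string, ok bool) {
	if phone != nil {
		return *phone, true
	}
	if legacy != nil {
		return *legacy, true
	}
	return "", false
}
```

Keeping the preference order in one place means the contract release only has to delete the `legacy` branch.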
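Migrate: per-tenant backfill
// Run by a separate worker; idempotent, with a per-tenant checkpoint.
// db is an assumed package-level *sql.DB.
func BackfillPhone(ctx context.Context, tenantID int64) error {
	type item struct {
		id     int64
		legacy string
	}
	cursor := loadCheckpoint(tenantID)
	for {
		rows, err := db.QueryContext(ctx, `
			SELECT id, legacy_phone FROM users
			WHERE koder_user_id = $1 AND phone IS NULL
			  AND legacy_phone IS NOT NULL AND id > $2
			ORDER BY id LIMIT 1000`,
			tenantID, cursor)
		if err != nil { return err }
		var batch []item
		for rows.Next() {
			var it item
			if err := rows.Scan(&it.id, &it.legacy); err != nil {
				rows.Close()
				return err
			}
			batch = append(batch, it)
		}
		rows.Close()
		if err := rows.Err(); err != nil { return err }
		if len(batch) == 0 { break }
		for _, it := range batch {
			// "AND phone IS NULL" keeps the UPDATE idempotent and lets live writes win.
			_, err := db.ExecContext(ctx,
				"UPDATE users SET phone = $1 WHERE id = $2 AND phone IS NULL",
				normalizePhone(it.legacy), it.id)
			if err != nil { return err }
		}
		cursor = batch[len(batch)-1].id
		saveCheckpoint(tenantID, cursor)
	}
	return nil
}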
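The resumability and idempotency claims can be checked against an in-memory model before touching a real table. The following is a sketch under stated assumptions: `row` and `backfill` are hypothetical, an empty `Phone` stands for NULL, and `maxBatches` simulates a worker that crashes after a number of batches.

```go
package main

// row models one users row; an empty Phone means "not yet migrated".
type row struct {
	ID          int64
	LegacyPhone string
	Phone       string
}

// backfill converts at most maxBatches batches of batchSize rows, starting
// after the checkpoint cursor, and returns the new checkpoint.
// rows must be sorted by ID, mirroring ORDER BY id in the SQL recipe.
func backfill(rows []row, cursor int64, batchSize, maxBatches int) int64 {
	for b := 0; b < maxBatches; b++ {
		n := 0
		for i := range rows {
			r := &rows[i]
			if r.ID <= cursor || r.Phone != "" {
				continue // behind the checkpoint, or already in the new form
			}
			if n == batchSize {
				break // batch full; the next batch resumes from cursor
			}
			r.Phone = r.LegacyPhone
			cursor = r.ID // checkpoint advances with every converted row
			n++
		}
		if n == 0 {
			break // nothing left in the old form
		}
	}
	return cursor
}
```

Calling `backfill` again with the returned checkpoint picks up where the previous run stopped, and a rerun from cursor 0 changes nothing because rows that already have `Phone` set are skipped, including rows written by live traffic.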
Contract: drop the old column
-- release N+1 (after the R1.1 window has elapsed, default 180 days, and the backfill is confirmed)
-- migration 0043_users_drop_legacy_phone.up.sql
ALTER TABLE users DROP COLUMN legacy_phone;
Checklist before running 0043:
- Stage 1 confirmed: 100% of users.phone populated (run `SELECT count(*) FROM users WHERE phone IS NULL AND legacy_phone IS NOT NULL`; expect 0).
- Window R1.1 elapsed: at least `window_duration_days` since N rolled out fully. Default 180 days; check `koder.toml [compat]`.
- No code path reads `legacy_phone` (grep the monorepo).
- T4 (specs/testing/always-on-recipes.kmd § T4) green for the drop.
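The checklist lends itself to an automated gate in front of the contract migration. A minimal sketch follows; `DropPreflight` and its field names are hypothetical and would be filled from your own tooling (the count query, the rollout log, the grep result, and the T4 run).

```go
package main

// DropPreflight gathers the facts the checklist asks for before a DROP.
type DropPreflight struct {
	UnmigratedRows       int  // rows with phone IS NULL AND legacy_phone IS NOT NULL
	DaysSinceFullRollout int  // days since release N reached the whole fleet
	WindowDurationDays   int  // from koder.toml [compat]; Stack default is 180
	LegacyReaders        int  // code paths still reading the old column
	T4Green              bool // T4 passed for the drop migration
}

// Blockers returns every unmet checklist item; an empty result means
// the drop may proceed.
func (p DropPreflight) Blockers() []string {
	var blockers []string
	if p.UnmigratedRows > 0 {
		blockers = append(blockers, "backfill incomplete")
	}
	if p.DaysSinceFullRollout < p.WindowDurationDays {
		blockers = append(blockers, "R1.1 window not elapsed")
	}
	if p.LegacyReaders > 0 {
		blockers = append(blockers, "legacy_phone still read by code")
	}
	if !p.T4Green {
		blockers = append(blockers, "T4 not green")
	}
	return blockers
}
```

Returning all blockers at once, rather than failing on the first, lets the operator see the full distance to a safe drop in one run.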
Forbidden one-shot patterns
-- ✗ ALTER ... ADD COLUMN NOT NULL DEFAULT … on a populated table: rewrites it on PG < 11, or on any version when the default is volatile
ALTER TABLE users ADD COLUMN phone TEXT NOT NULL DEFAULT '';
-- ✗ ALTER ... DROP COLUMN inside same release as the code change
ALTER TABLE users DROP COLUMN legacy_phone; -- in release N, NOT OK
-- ✗ CREATE INDEX (without CONCURRENTLY) on busy table — blocks writes
CREATE INDEX idx_users_phone ON users (phone);
-- ✗ Adding NOT NULL via ALTER without prior backfill
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;
NOT NULL: 3 steps
-- step 1 (release N): add nullable column
ALTER TABLE users ADD COLUMN phone TEXT;
-- step 2 (release N..N+1): backfill via worker (see Migrate section above)
-- step 3 (release N+1, AFTER 100% populated): tighten constraint
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;
kdb (Koder DB — TiKV-backed) — RFC-001 substrate
kdb's key-prefix model and versioned descriptors per
rfcs/stack-RFC-001-kdb-as-unified-data-plane.kmd make expand/migrate/
contract a key-namespace operation, not DDL.
Expand: dual-write to new key shape
// release N: write to BOTH old and new key namespaces
err = kdb.PutMulti(ctx, []kdb.Op{
	{Key: oldKey(tenantID, itemID), Value: encodeOld(payload)},
	{Key: newKey(tenantID, itemID), Value: encodeNew(payload)},
})
Reads prefer new namespace, fall back to old:
v, err := kdb.Get(ctx, newKey(tenantID, itemID))
if errors.Is(err, kdb.ErrNotFound) {
	v, err = kdb.Get(ctx, oldKey(tenantID, itemID))
}
Migrate: per-tenant range scan + transform
// background job, resumable via checkpoint
func migrateKDBPrefix(ctx context.Context, tenantID int64) error {
	cursor := loadCheckpoint(tenantID)
	for {
		// Assumes From excludes the checkpointed key; with an inclusive
		// bound the last key is re-read and the final page never empties.
		scan, err := kdb.RangeScan(ctx, kdb.Range{
			From:  oldKeyPrefix(tenantID, cursor),
			To:    oldKeyPrefixEnd(tenantID),
			Limit: 1000,
		})
		if err != nil { return err }
		if len(scan) == 0 { break }
		var ops []kdb.Op
		for _, entry := range scan {
			migrated := transform(entry.Value) // named to avoid shadowing the builtin new
			ops = append(ops, kdb.Op{
				Key: newKeyFromOld(entry.Key), Value: migrated,
			})
		}
		if err := kdb.PutMulti(ctx, ops); err != nil { return err }
		cursor = scan[len(scan)-1].Key
		saveCheckpoint(tenantID, cursor)
	}
	return nil
}
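Whether the paging loop terminates depends on how the scan treats its lower bound, which this spec does not state for `kdb.RangeScan`. The sketch below shows the exclusive-cursor convention on a sorted in-memory slice; `scanAfter` and `pageAll` are hypothetical stand-ins, not kdb APIs.

```go
package main

import "sort"

// scanAfter returns up to limit keys strictly greater than cursor.
// keys must be sorted; an empty cursor starts from the beginning.
func scanAfter(keys []string, cursor string, limit int) []string {
	if limit <= 0 {
		return nil
	}
	// First index whose key sorts after the cursor.
	i := sort.Search(len(keys), func(i int) bool { return keys[i] > cursor })
	end := i + limit
	if end > len(keys) {
		end = len(keys)
	}
	return keys[i:end]
}

// pageAll walks keys page by page, checkpointing on the last key of each
// page, and returns the number of non-empty pages and every key visited.
func pageAll(keys []string, limit int) (pages int, visited []string) {
	cursor := ""
	for {
		page := scanAfter(keys, cursor, limit)
		if len(page) == 0 {
			return pages, visited
		}
		pages++
		visited = append(visited, page...)
		cursor = page[len(page)-1] // exclusive: the next page starts after it
	}
}
```

With five keys and a limit of 2 this visits each key exactly once in three pages. If the real scan bound turns out to be inclusive, the checkpoint would need to be advanced to the successor of the last key instead.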
Contract: stop writing old prefix, GC old keys
// release N+1 (after migration confirmed)
// 1. Code stops writing old prefix
// 2. Async GC worker deletes oldKey range after window R1.1 expires:
err = kdb.DeleteRange(ctx, kdb.Range{
	From: oldKeyPrefix(tenantID, 0),
	To:   oldKeyPrefixEnd(tenantID),
})
kdb-specific guidance
- Range scans + checkpoint: `kdb.RangeScan` with paging avoids loading a whole tenant into memory.
- TiKV regions: very large per-tenant migrations may benefit from splitting the range across regions and migrating in parallel; see `infra/data/kdb/docs/` for the canonical split helpers.
- Multi-tenancy: per `policies/multi-tenant-by-default.kmd`, migrate one tenant at a time; don't lock all tenants together.
Redis / KV (hot-path cache or session store)
Expand: dual-write with TTL
// release N: write to both keys, same TTL
// rdb is the go-redis client; naming it "redis" would shadow the package
pipe := rdb.Pipeline()
pipe.Set(ctx, oldKey(uid), encodeOld(v), ttl)
pipe.Set(ctx, newKey(uid), encodeNew(v), ttl)
_, err := pipe.Exec(ctx)
Read prefers new, fallback to old:
// rdb is the go-redis client; redis.Nil is the package's cache-miss sentinel
v, err := rdb.Get(ctx, newKey(uid)).Result()
if errors.Is(err, redis.Nil) {
	v, err = rdb.Get(ctx, oldKey(uid)).Result()
}
Migrate: scan + transform
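// SCAN cursor-based iteration; idempotent. rdb is the go-redis client.
var cursor uint64
for {
	keys, next, err := rdb.Scan(ctx, cursor, oldKeyPattern(), 1000).Result()
	if err != nil { return err }
	for _, k := range keys {
		v, err := rdb.Get(ctx, k).Result()
		if errors.Is(err, redis.Nil) { continue } // expired since SCAN returned it
		if err != nil { return err }
		ttl, err := rdb.TTL(ctx, k).Result()
		if err != nil { return err }
		if ttl <= 0 && ttl != -1 { continue } // gone (-2) or expiring right now
		if ttl == -1 { ttl = 0 }              // no expiry: go-redis stores 0 as "no expiration"
		// SetNX never overwrites a value the dual-write path already stored.
		if err := rdb.SetNX(ctx, translateKey(k), transform(v), ttl).Err(); err != nil {
			return err
		}
	}
	if next == 0 { break }
	cursor = next
}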
Contract: drop old prefix
// release N+1: stop dual-write, then GC
// Option A: rely on TTL expiry (waits up to max TTL)
// Option B: scan + DEL the old prefix explicitly (rdb is the go-redis client)
for cursor := uint64(0); ; {
	keys, next, err := rdb.Scan(ctx, cursor, oldKeyPattern(), 1000).Result()
	if err != nil { return err }
	if len(keys) > 0 {
		if err := rdb.Del(ctx, keys...).Err(); err != nil { return err }
	}
	if next == 0 { break }
	cursor = next
}
Redis-specific guidance
- `KEYS pattern` is forbidden in production (blocks the single Redis thread). Use `SCAN` cursor iteration.
- For tens of millions of keys, batch DEL in groups of 1000 with a sleep between batches to avoid evictions.
- TTL preservation during migrate: read the TTL of each key, set the new key with the same remaining TTL.
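TTL preservation has a few edge cases worth isolating. The Redis `TTL` command answers -2 for a missing key and -1 for a key without expiry; the sketch below assumes the client surfaces those sentinels as raw `time.Duration` values and treats an expiration of 0 as "no expiry" (the go-redis convention, as far as I know). `carryTTL` is a hypothetical helper.

```go
package main

import "time"

// carryTTL maps the remaining TTL read from the old key to the expiration
// to use for the new key. ok is false when the key should be skipped.
func carryTTL(remaining time.Duration) (expiration time.Duration, ok bool) {
	switch {
	case remaining == -1:
		// Old key has no expiry: write the new key without one.
		return 0, true
	case remaining <= 0:
		// -2: key vanished between SCAN and TTL.
		//  0: less than one time unit left; copying it with expiration 0
		//     would accidentally make the new key persistent.
		return 0, false
	default:
		return remaining, true
	}
}
```

Skipping keys that are about to expire loses nothing, since the old key would be gone before any reader could fall back to it.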
S3 / object storage (kdrive blobs, attachments, exports)
Expand: dual-write paths
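// release N: upload to the new path, then server-side copy to the legacy path
// s3client is the AWS SDK v2 client; ptrf is an assumed helper (fmt.Sprintf returning *string)
oldKey := fmt.Sprintf("koder/uploads/%d/%s", oldFlat, blobID)
newKey := fmt.Sprintf("koder/%d/%s/%s", uid, wsID, blobID)
if _, err := s3client.PutObject(ctx, &s3.PutObjectInput{
	Bucket: &bucket, Key: &newKey, Body: body,
}); err != nil { return err }
// CopySource must be URL-encoded when the key contains special characters.
if _, err := s3client.CopyObject(ctx, &s3.CopyObjectInput{
	Bucket:     &bucket,
	CopySource: ptrf("%s/%s", bucket, newKey),
	Key:        &oldKey,
}); err != nil { return err }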
Reads prefer new, fall back to old:
obj, err := s3client.GetObject(ctx, &s3.GetObjectInput{Bucket: &bucket, Key: &newKey})
if isNotFound(err) {
	obj, err = s3client.GetObject(ctx, &s3.GetObjectInput{Bucket: &bucket, Key: &oldKey})
}
Migrate: server-side copy per tenant
// background job: uses S3's server-side copy (no download/upload)
listInput := &s3.ListObjectsV2Input{
	Bucket: &bucket,
	Prefix: ptrf("koder/uploads/%d/", oldFlat),
}
paginator := s3.NewListObjectsV2Paginator(s3client, listInput)
for paginator.HasMorePages() {
	page, err := paginator.NextPage(ctx)
	if err != nil { return err }
	for _, obj := range page.Contents {
		newKey := translateS3Key(*obj.Key)
		_, err := s3client.CopyObject(ctx, &s3.CopyObjectInput{
			Bucket:     &bucket,
			CopySource: ptrf("%s/%s", bucket, *obj.Key),
			Key:        &newKey,
		})
		if err != nil { return err }
	}
}
Contract: lifecycle policy or batch delete
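// release N+1: lifecycle rule that deletes the legacy prefix after the R1.1 window.
// Expiration.Days counts from each object's creation date, not from the day the rule
// is enabled, so objects already older than the limit expire as soon as the rule is active.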
{
"Rules": [{
"ID": "expire-legacy-uploads",
"Filter": { "Prefix": "koder/uploads/" },
"Status": "Enabled",
"Expiration": { "Days": 180 }
}]
}
S3-specific guidance
- Server-side copy (`s3:CopyObject`) avoids egress and is atomic per object.
- For very large buckets, S3 Batch Operations handles billions of objects with retry and reporting.
- Multipart blobs > 5 GB require `s3:UploadPartCopy`; same idempotency story.
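For the multipart case, each `UploadPartCopy` call copies one byte range of the source, expressed as `bytes=first-last` with both ends inclusive. The sketch below only computes those ranges; `copyRanges` is a hypothetical helper, and the real part-size limits (5 MiB to 5 GiB per part except the last, at most 10,000 parts) still have to be respected by the caller.

```go
package main

import "fmt"

// copyRanges splits an object of size bytes into CopySourceRange values
// ("bytes=first-last", inclusive) covering at most partSize bytes each.
func copyRanges(size, partSize int64) []string {
	if size <= 0 || partSize <= 0 {
		return nil
	}
	var ranges []string
	for start := int64(0); start < size; start += partSize {
		end := start + partSize - 1
		if end > size-1 {
			end = size - 1 // last part may be shorter
		}
		ranges = append(ranges, fmt.Sprintf("bytes=%d-%d", start, end))
	}
	return ranges
}
```

Because the ranges are a pure function of the object size and part size, a retried job produces the same parts, which is what keeps the multipart path as idempotent as the single-call copy.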
Common pitfalls (per-tier)
| Pitfall | Why it bites | Fix |
|---|---|---|
| Single-release ADD + use | Old code on N-1 servers can't read the new column; deserialization fails | Always expand+migrate+contract across ≥ 3 releases |
| Migrate runs synchronously in API request | Request times out on large tenants; user sees 504 | Background worker, per-tenant checkpoint, idempotent |
| Backfill without `WHERE phone IS NULL` | Re-overwrites already-migrated rows; race with live writes | Always filter on "still in old form" |
| DROP COLUMN in same release as code change | Client N-1 still in production tries to read dropped column → 500 | DROP only after window R1.1 elapsed |
| Lifecycle delete before clients drained | Reads from N-1 clients hit 404 (object gone) | Lifecycle window ≥ R1.1 window_duration_days |
| Migration job without resumability | One crash mid-tenant restarts the whole tenant; idempotency check expensive | Per-tenant checkpoint, idempotent UPDATEs |
Test pairing
Each migration MUST pair with T4 (specs/testing/always-on-recipes.kmd § T4):
- Restore prod-like snapshot.
- Start sustained read+write load.
- Apply migration in another connection.
- Assert peak per-second latency < 100 ms.
Migrations that fail T4 don't ship until reworked into the expand-migrate-contract pattern.
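The latency assertion can be sketched as follows. The text does not define "peak per-second latency" precisely, so this assumes it means the highest per-second mean; swap in a percentile if T4 specifies one. `sample`, `peakPerSecondMs`, and `passesT4` are hypothetical names.

```go
package main

// sample is one request observed during the T4 run.
type sample struct {
	AtMs      int64   // milliseconds since the start of the run
	LatencyMs float64 // observed request latency
}

// peakPerSecondMs groups samples into one-second buckets, averages each
// bucket, and returns the highest average. It returns 0 for no samples.
func peakPerSecondMs(samples []sample) float64 {
	sum := map[int64]float64{}
	count := map[int64]int{}
	for _, s := range samples {
		sec := s.AtMs / 1000
		sum[sec] += s.LatencyMs
		count[sec]++
	}
	peak := 0.0
	for sec, total := range sum {
		if mean := total / float64(count[sec]); mean > peak {
			peak = mean
		}
	}
	return peak
}

// passesT4 applies the 100 ms ceiling quoted in the text.
func passesT4(samples []sample) bool {
	return peakPerSecondMs(samples) < 100
}
```

Bucketing by second keeps a single slow request from hiding inside a run-wide average, while still tolerating isolated outliers within a busy second.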
Per-component override
A component can TIGHTEN the migration cadence (e.g., window_duration_days = 365
implies DROP only after 365 days elapsed). Cannot LOOSEN below
Stack defaults (R1.1).
<component>/koder.toml [compat] is the source of truth for the
component's own window; recipes here use the Stack default 180.
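The tighten-only rule can be enforced when the component config is loaded. This is a minimal sketch: `effectiveWindowDays` is a hypothetical helper, and rejecting a too-small value (rather than silently clamping it) is a design choice, not something the policy prescribes.

```go
package main

import "fmt"

// stackDefaultWindowDays is the Stack default quoted in this spec.
const stackDefaultWindowDays = 180

// effectiveWindowDays resolves a component's window_duration_days.
// 0 means "not set" and yields the Stack default; values below the
// default are rejected because a component may only tighten the rule.
func effectiveWindowDays(componentDays int) (int, error) {
	switch {
	case componentDays == 0:
		return stackDefaultWindowDays, nil
	case componentDays < stackDefaultWindowDays:
		return 0, fmt.Errorf("window_duration_days = %d is below the Stack default %d",
			componentDays, stackDefaultWindowDays)
	default:
		return componentDays, nil
	}
}
```

Failing loudly at load time surfaces a misconfigured `koder.toml` long before a contract migration would rely on the shorter window.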
Status
- v0.1 (2026-05-24): 4 tiers covered (Postgres/kdb-next, kdb, Redis, S3). Patterns extracted from the `services/foundation/id/engine` LGPD cascade migration + multi-tenant rollout.
- v0.2 planned: vector DB (embedding migrations between dims/models), time-series Timescale hypertables, queue/stream schema evolution (event versioning).
- Promotion to v1.0: when ≥ 3 components ship a real expand-migrate-contract migration following these recipes and validate them in production rollouts.
References
- policies/always-on.kmd
- policies/multi-tenant-by-default.kmd
- specs/testing/always-on-recipes.kmd
- rfcs/stack-RFC-001-kdb-as-unified-data-plane.kmd