AI voice mode UI

ai-ui specs/ai-ui/voice-mode.kmd

Fullscreen voice conversation mode UI: waveform (input/output color-coded), push-to-talk + always-on toggle, barge-in visual, mute, end session. Extends voice/wake-word.kmd with conversational mode semantics. Required for Talk product + future Kortex voice mode.

When this spec applies

Primary triggers

Render fullscreen voice conversation UI

All triggers

Build conversational voice mode in any AI product
Implement Talk voice surface

Spec — AI voice mode UI

Base spec: voice/wake-word.kmd covers the toggles and backend. This spec covers the UX of the live conversational mode. Trigger: the mic button in multimodal-input.kmd (#116). Impl ticket: services/ai/ai#115.

Principles

Fullscreen focus — voice mode = dedicated surface, not inline.
Visual feedback — waveform input vs output color-coded.
Barge-in supported — the user can speak while the assistant is speaking; the UX must signal the transition.
One-tap escape — mute / end always reachable.
Optional transcription overlay — toggle live STT visible.

R1 — Anatomia

┌─────────────────────────────────────────────┐
│  [✕]                                        │  ← top: end session
│                                             │
│            [Assistant Icon]                 │
│         Generated by AI — verify            │
│                                             │
│   ╭─╮╭─╮╭─╮╭─╮╭─╮╭─╮╭─╮╭─╮╭─╮              │
│  ─╯ ╰╯ ╰╯ ╰╯ ╰╯ ╰╯ ╰╯ ╰╯ ╰╯ ╰─             │  ← waveform (output = assistant talking)
│                                             │
│   "Sure, here's what I think..."            │  ← optional transcript overlay
│                                             │
│                                             │
│                                             │
│       ┌────────────────────────┐            │
│       │  Push to talk    [PTT] │            │  ← mode toggle
│       │  Always on       [AON] │            │
│       └────────────────────────┘            │
│                                             │
│                  [🎙 mute]                  │
└─────────────────────────────────────────────┘

Slots:

Slot	Function
End button (✕)	Close voice mode; return to composer or chat surface
Avatar	Assistant icon (configurable)
AI disclaimer	Subtle label per `ai-disclaimer.kmd` R1 tier 1
Waveform	Animated; color shifts per state (R2)
Transcript overlay	Optional toggle; shows STT live (input) + TTS source (output)
Mode toggle	Push-to-talk vs Always-on
Mute button	1-tap mute mic; visual indicator state

R2 — Waveform states

Color coding per themes/color-roles.kmd:

State	Waveform color	Behavior
idle	`text-muted` flat line	no input/output
listening (input)	`accent` animated bars	input audio captured
processing	`text-muted` pulsing dot	waiting for response
speaking (output)	`success` animated bars	output audio playing
barge-in transition	Crossfade accent ↔ success	brief overlap when user speaks during output

Visual MUST honor reduced-motion: replace animation with static "Listening..." / "Speaking..." labels (see R10 for the i18n keys).
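
The state table and the reduced-motion rule can be sketched as a single lookup. This is a minimal illustration, not the component's actual code: the names `VoiceState`, `WaveformPresentation`, and `presentWaveform` are hypothetical, and treating `processing` as label-only under reduced motion is an assumption (the spec names only the listening and speaking labels). The color roles and i18n keys come from R2 and R10.

```typescript
// Hypothetical types; only the color roles, states, and i18n keys come from the spec.
type VoiceState = "idle" | "listening" | "processing" | "speaking";
type ColorRole = "text-muted" | "accent" | "success";
type Shape = "flat-line" | "bars" | "pulsing-dot" | "label";

interface WaveformPresentation {
  colorRole: ColorRole;
  shape: Shape;
  labelKey?: string; // i18n key, set only when a static label replaces the animation
}

// Animated presentation per state, mirroring the R2 table.
const ANIMATED: Record<VoiceState, WaveformPresentation> = {
  idle: { colorRole: "text-muted", shape: "flat-line" },
  listening: { colorRole: "accent", shape: "bars" },
  processing: { colorRole: "text-muted", shape: "pulsing-dot" },
  speaking: { colorRole: "success", shape: "bars" },
};

// Assumption: "processing" also falls back to its label, since a pulsing dot is animation.
const LABEL_KEYS: Record<Exclude<VoiceState, "idle">, string> = {
  listening: "ai.voice.state.listening",
  processing: "ai.voice.state.processing",
  speaking: "ai.voice.state.speaking",
};

function presentWaveform(state: VoiceState, reducedMotion: boolean): WaveformPresentation {
  const animated = ANIMATED[state];
  // The idle flat line has no motion, so it is kept as-is under reduced motion.
  if (!reducedMotion || state === "idle") {
    return animated;
  }
  return { colorRole: animated.colorRole, shape: "label", labelKey: LABEL_KEYS[state] };
}
```

A renderer would call `presentWaveform(state, prefersReducedMotion)` on every state change and draw whatever shape it returns.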

R3 — Push-to-talk vs Always-on

Mode	Behavior
Push-to-talk (PTT)	Hold button (or spacebar) to record; release to send. Default for noisy environments.
Always-on (AON)	Continuous capture with VAD (voice activity detection). Default for hands-free.

Toggle persists per-user. Cross-link voice/wake-word.kmd R1 toggles (voice.enabled, talkMode).
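
As a rough sketch of how the two modes differ at capture time, and of a per-user persisted toggle: the names `TalkMode`, `CaptureSignals`, `shouldCapture`, and `TalkModePreferences` are hypothetical, and the in-memory `Map` merely stands in for whatever store backs the `talkMode` toggle in voice/wake-word.kmd R1. The fallback mode is a constructor argument because the spec leaves the default to the product.

```typescript
type TalkMode = "ptt" | "aon";

interface CaptureSignals {
  pttHeld: boolean;           // PTT button or spacebar currently held
  vadSpeechDetected: boolean; // voice activity detector currently reports speech
}

// PTT captures only while the control is held; AON captures while VAD reports speech.
function shouldCapture(mode: TalkMode, signals: CaptureSignals): boolean {
  return mode === "ptt" ? signals.pttHeld : signals.vadSpeechDetected;
}

// Hypothetical per-user store; a real one would persist beyond process memory.
class TalkModePreferences {
  private readonly byUser = new Map<string, TalkMode>();
  private readonly fallback: TalkMode;

  constructor(fallback: TalkMode) {
    this.fallback = fallback;
  }

  get(userId: string): TalkMode {
    return this.byUser.get(userId) ?? this.fallback;
  }

  set(userId: string, mode: TalkMode): void {
    this.byUser.set(userId, mode);
  }
}
```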

R4 — Barge-in

When user speaks during assistant output (barge-in):

Per voice/wake-word.kmd R5: bargeIn: true → output audio fades out + input fades in.
Visual: waveform color crossfades accent ↔ success over ~200ms.
Output TTS interrupted; new input streamed to backend.
Audit: barge-in event logged.
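
The barge-in rules above fit into a small transition function alongside the normal idle → listening → processing → speaking cycle. This is a sketch under assumptions: the event and effect names are hypothetical, and ignoring user speech during output when `bargeIn` is false is a guess, since this spec does not describe that case.

```typescript
type VoiceState = "idle" | "listening" | "processing" | "speaking";
type VoiceEvent = "userSpeechStart" | "userSpeechEnd" | "outputStart" | "outputEnd";
// Side effects the UI layer is expected to carry out (hypothetical names).
type Effect = "streamInput" | "interruptTts" | "crossfadeWaveform" | "logBargeIn";

interface Step {
  next: VoiceState;
  effects: Effect[];
}

function step(state: VoiceState, event: VoiceEvent, bargeIn: boolean): Step {
  switch (state) {
    case "idle":
      if (event === "userSpeechStart") return { next: "listening", effects: ["streamInput"] };
      break;
    case "listening":
      if (event === "userSpeechEnd") return { next: "processing", effects: [] };
      break;
    case "processing":
      if (event === "outputStart") return { next: "speaking", effects: [] };
      break;
    case "speaking":
      if (event === "outputEnd") return { next: "idle", effects: [] };
      if (event === "userSpeechStart" && bargeIn) {
        // Barge-in: stop TTS, crossfade the waveform (~200ms), stream new input, audit.
        return {
          next: "listening",
          effects: ["interruptTts", "crossfadeWaveform", "streamInput", "logBargeIn"],
        };
      }
      break;
  }
  // Any other combination leaves the state unchanged (assumption).
  return { next: state, effects: [] };
}
```

Returning effects as data keeps the function pure, so the T2 and T6 cases can be checked without audio or a backend.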

R5 — Transcript overlay

Toggle button (default OFF). When ON:

Live STT text appears overlay below waveform (input state).
Live TTS source text appears (output state).
Auto-scroll; max 3 lines visible.
After voice mode ends: transcript persisted to conversation history per conversation-history.kmd (#115).
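
A minimal sketch of the overlay's windowing behavior follows. `TranscriptOverlay` and its members are hypothetical names. Two assumptions are baked in: a "line" is one transcript entry rather than one wrapped visual row, and entries keep accumulating while the overlay is hidden so that the full transcript is still available for persistence.

```typescript
type Speaker = "user" | "assistant";

interface TranscriptLine {
  speaker: Speaker; // "user" = live STT text, "assistant" = TTS source text
  text: string;
}

class TranscriptOverlay {
  static readonly MAX_VISIBLE = 3;

  enabled = false; // R5: toggle defaults to OFF
  private readonly lines: TranscriptLine[] = [];

  append(speaker: Speaker, text: string): void {
    this.lines.push({ speaker, text });
  }

  // Auto-scroll: only the most recent entries are shown, and only when enabled.
  visible(): TranscriptLine[] {
    return this.enabled ? this.lines.slice(-TranscriptOverlay.MAX_VISIBLE) : [];
  }

  // Full transcript, e.g. for saving to conversation history when the session ends.
  full(): TranscriptLine[] {
    return this.lines.slice();
  }
}
```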

R6 — End session

Top-right ✕ button OR swipe-down gesture (mobile):

End TTS playback gracefully (fade out 200ms).
Disconnect WebSocket from services/ai/voice.
Return user to composer OR chat history (configurable per product).
Final transcript saved to conversation history.
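
One way to keep the teardown order explicit is to express it as data that the UI layer then executes. The sketch below is illustrative only: `EndStep`, `ReturnTarget`, and `endSessionSteps` are hypothetical names, and the step order simply follows the list above.

```typescript
// Where the user lands after voice mode closes (configurable per product).
type ReturnTarget = "composer" | "chat-history";

type EndStep =
  | { kind: "fadeOutTts"; durationMs: number }
  | { kind: "disconnectVoiceSocket" }
  | { kind: "navigate"; target: ReturnTarget }
  | { kind: "saveTranscript" };

// Ordered teardown plan for the end button or the swipe-down gesture.
function endSessionSteps(target: ReturnTarget): EndStep[] {
  return [
    { kind: "fadeOutTts", durationMs: 200 },
    { kind: "disconnectVoiceSocket" },
    { kind: "navigate", target },
    { kind: "saveTranscript" },
  ];
}
```

Both entry points (✕ and swipe-down) can call the same function, which keeps T8 a single assertion on the returned list.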

R7 — Mute

1-tap toggle bottom-center mic button:

Mute: mic input gated locally; backend still connected.
Unmute: input resumes.
Mute state announced via aria-live.
Visual: mic icon strikethrough.
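
Local gating can be sketched as a filter that sits between mic capture and the sender. `MicGate` is a hypothetical name. "Muted" matches the announcement expected by T7; "Unmuted" is an assumed counterpart. The gate never touches the connection, which is what keeps the backend connected while muted.

```typescript
class MicGate {
  private muted = false;

  // Flips the mute state and returns the text to push into the aria-live region.
  toggle(): { muted: boolean; announcement: string } {
    this.muted = !this.muted;
    return { muted: this.muted, announcement: this.muted ? "Muted" : "Unmuted" };
  }

  // Returns the frame to forward, or null when muted (frame dropped locally).
  filter(frame: Float32Array): Float32Array | null {
    return this.muted ? null : frame;
  }
}
```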

R8 — Surface bindings

Surface	API
Flutter	`KoderVoiceModeSheet({onEnd, onMute, onBargeIn, transcriptToggle})` in `koder_kit/lib/src/ai/voice_mode_sheet.dart`
Web	`<koder-voice-mode-sheet>`
Compose Android	`KoderVoiceModeSheet` in `koder-design-compose` (future)
SwiftUI iOS	Same in `koder-design-swift` (future)
CLI / TUI	Plain prompt-and-response; mic via system; no waveform

R9 — Accessibility

Sheet: role="dialog" aria-modal="true" aria-label="Voice conversation".
Waveform: aria-hidden="true" (visual only); state announced via aria-live ("Listening", "Speaking").
Buttons: keyboard accessible; spacebar for PTT.
Reduced-motion: waveform replaced by labels.
Screen reader: announces transition states.
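
For the spacebar PTT binding, a held key emits repeated `keydown` events, so a handler has to ignore repeats to avoid restarting the recording. The sketch below assumes a hypothetical `PttKeyController`. `KeyLike` mirrors the `type`, `code`, and `repeat` fields of a DOM `KeyboardEvent`, so real events can be passed in directly.

```typescript
// Minimal event shape; a DOM KeyboardEvent carries the same three fields.
interface KeyLike {
  type: string;    // "keydown" or "keyup"
  code: string;    // physical key, e.g. "Space"
  repeat: boolean; // true for auto-repeat while the key is held
}

type PttAction = "startRecording" | "stopAndSend" | "none";

class PttKeyController {
  private held = false;

  handle(event: KeyLike): PttAction {
    if (event.code !== "Space") return "none";
    if (event.type === "keydown") {
      // Ignore auto-repeat and duplicate keydowns while already recording.
      if (event.repeat || this.held) return "none";
      this.held = true;
      return "startRecording";
    }
    if (event.type === "keyup" && this.held) {
      this.held = false;
      return "stopAndSend";
    }
    return "none";
  }
}
```

In a browser this would typically be wired to `keydown` and `keyup` listeners on the dialog, with the space key's default action suppressed while the sheet is open so the page does not scroll.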

R10 — i18n

Key	en-US	pt-BR
`ai.voice.mode.title`	"Voice mode"	"Modo de voz"
`ai.voice.mode.ptt`	"Push to talk"	"Pressionar para falar"
`ai.voice.mode.aon`	"Always on"	"Sempre ativo"
`ai.voice.mode.mute`	"Mute"	"Silenciar"
`ai.voice.mode.unmute`	"Unmute"	"Reativar"
`ai.voice.mode.end`	"End conversation"	"Encerrar conversa"
`ai.voice.mode.transcript_toggle`	"Show transcript"	"Mostrar transcrição"
`ai.voice.state.listening`	"Listening..."	"Ouvindo..."
`ai.voice.state.processing`	"Processing..."	"Processando..."
`ai.voice.state.speaking`	"Speaking..."	"Falando..."
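
A lookup over these keys might look like the sketch below. It includes only a subset of the table for brevity. The function name `t` and the fallback chain (requested locale, then en-US, then the key itself) are assumptions, not something this spec defines.

```typescript
type Locale = "en-US" | "pt-BR";

// Subset of the R10 table, copied verbatim.
const MESSAGES: Record<Locale, Record<string, string>> = {
  "en-US": {
    "ai.voice.mode.mute": "Mute",
    "ai.voice.mode.unmute": "Unmute",
    "ai.voice.mode.end": "End conversation",
    "ai.voice.state.listening": "Listening...",
    "ai.voice.state.speaking": "Speaking...",
  },
  "pt-BR": {
    "ai.voice.mode.mute": "Silenciar",
    "ai.voice.mode.unmute": "Reativar",
    "ai.voice.mode.end": "Encerrar conversa",
    "ai.voice.state.listening": "Ouvindo...",
    "ai.voice.state.speaking": "Falando...",
  },
};

// Assumed fallback chain: requested locale, then en-US, then the raw key.
function t(key: string, locale: Locale): string {
  return MESSAGES[locale][key] ?? MESSAGES["en-US"][key] ?? key;
}
```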

R11 — Per-preset

Cosmetic only.

T-suite

T1 Mount: voice mode sheet renders; default mode PTT or AON per user pref.
T2 State transitions: idle → listening → processing → speaking → idle.
T3 Waveform colors: each state correct color.
T4 PTT mode: hold spacebar → recording; release → send.
T5 AON mode: VAD-triggered start/stop.
T6 Barge-in: speak during output → crossfade accent ↔ success; output TTS interrupted.
T7 Mute: 1-tap → mic gated; aria announces "Muted".
T8 End: tap ✕ → fade out + cleanup + return to composer; transcript saved.
T9 Transcript toggle: enable → overlay visible during conversation.
T10 Reduced-motion: waveform replaced by text labels.
T11 A11y: aria-live announces state transitions.
N1 Mic permission revoked mid-session: graceful end + prompt to re-grant.

Cross-link

Base spec: voice/wake-word.kmd (toggles + backend)
Companion: multimodal-input.kmd (#116 — composer mic button entry point), conversation-history.kmd (transcript persist), ai-disclaimer.kmd (disclaimer label R1)
Backend: services/ai/voice/, services/ai/ai/backlog/pending/115-cli-desktop-voice-input.md

References

meta/docs/stack/specs/voice/wake-word.kmd
meta/docs/stack/specs/ai-ui/multimodal-input.kmd
meta/docs/stack/specs/ai-ui/chat-message-bubble.kmd