Audio capture and voice mode
This page answers the voice question for the extracted cli.js: does Claude Code support voice, and how is it designed?
Yes. The runtime contains a source-confirmed voice dictation path, implemented as local microphone capture plus remote transcription: /voice enables hold-to-talk or tap-to-toggle recording; audio is captured through a native N-API module or an OS recorder fallback; a voice stream receives transcript chunks; and the final transcript is injected into the prompt input.
Source anchors
| Semantic alias | Source | Approximate location | String or symbol | Meaning |
|---|---|---|---|---|
| AudioNativeAddonRequire | cli.js | line ~11, byte 0x81d | require("/$bunfs/root/audio-capture.node") | Main bundle can load the embedded audio native addon. |
| AudioCaptureShim | claude-code-pkg/audio-capture.js | line ~11 | require("/$bunfs/root/audio-capture.node") | Retained JS shim references the audio N-API binary from the original Bun payload. |
| LegacyVoiceEnabledSetting | cli.js | line ~185, byte 0x116cb1 | voiceEnabled:y.boolean | Legacy/global voice setting surface. |
| VoiceModeSettingsSchema | cli.js | line ~185, byte 0x11d5e6 | Voice mode settings (hold-to-talk / tap-to-toggle dictation) | Structured voice settings schema. |
| VoiceLanguageSetting | cli.js | line ~185, byte 0x11c36f | Preferred language for Claude responses and voice dictation | Voice dictation shares the language setting. |
| VoiceIdleState | cli.js | line ~605, byte 0x38c87d | voiceState:"idle" | TUI voice state starts idle. |
| VoiceInterimTranscriptState | cli.js | line ~605, byte 0x38c89f | voiceInterimTranscript | Interim transcript is part of TUI state. |
| VoiceAudioLevelsState | cli.js | line ~605, byte 0x38c8b9 | voiceAudioLevels | Audio-level visualization state. |
| PushToTalkAction | cli.js | line ~605, byte 0x38dbbb | voice:pushToTalk | Keybinding/action for voice recording. |
| NativeAudioWrapper | cli.js | line ~7745, byte 0xb5d6e1 | writeNativePlaybackData:()=>Kb5 | Native audio wrapper exports playback and recording methods. |
| AudioNapiLoadedLog | cli.js | line ~7745, byte 0xb5dd23 | audio-capture-napi loaded | Native audio addon load path succeeds when available. |
| MicrophoneAccessGuard | cli.js | line ~7745, byte 0xb5e43d | Voice mode requires microphone access | Remote/no-device guard for voice mode. |
| WslRecorderFallbackGuard | cli.js | line ~7747, byte 0xb5e530 | Voice mode could not find a working audio recorder in WSL | WSL fallback error. |
| SoxRecorderRequirement | cli.js | line ~7751, byte 0xb5e76d | Voice mode requires SoX for audio recording | SoX fallback requirement. |
| VoiceAccountGate | cli.js | line ~7754, byte 0xb5ef70 | Voice mode requires a Claude.ai account | Voice stream is account-gated. |
| VoiceSlashCommand | cli.js | line ~7756, byte 0xb5f941 | Toggle voice mode | /voice slash command description. |
| VoiceTuiComponents | cli.js | line ~9355 | VoiceIndicator, VoiceWarmupHint | TUI components render recording/warmup state. |
| VoiceFinishRecording | cli.js | line ~9514, byte 0xc8c6f9 | finishRecording: stopping recording | Recording state machine finalizes capture. |
| VoiceWebSocketBuffering | cli.js | line ~9514, byte 0xc8da9a | startRecording: buffering audio while WebSocket connects | Captured chunks are buffered until the transcription stream is ready. |
| VoiceFinalTranscript | cli.js | line ~9514, byte 0xc8ccec | Final transcript assembled | Transcription stream produces final text. |
| VoiceTranscriptInjection | cli.js | line ~9514, byte 0xc8ce2c | Injecting transcript | Final transcript is injected into the input buffer. |
| VoiceConnectionFailureTelemetry | cli.js | line ~9514, byte 0xc8cec1 | voice_transcription_connection_failed | Voice stream connection failure telemetry. |
| VoiceStreamErrorPath | cli.js | line ~9514, byte 0xc8e327 | voice_stream error | Voice stream error path. |
| VoiceStreamAuthFailure | cli.js | line ~9514, byte 0xc8e7dc | voice_stream_no_auth | Voice stream auth failure telemetry. |
| VoiceDiscoveryHint | cli.js | line ~9553, byte 0xcba243 | Use /voice to enable push-to-talk dictation | User-visible discovery hint. |
High-level design
```mermaid
flowchart TD
    User["/voice command or push-to-talk key"] --> Settings[voice.enabled / voice.mode / autoSubmit]
    Settings --> UI[VoiceProvider state]
    UI --> Capture{capture backend}
    Capture -->|native available| Native[audio-capture.node N-API]
    Capture -->|Linux/WSL fallback| Recorder[arecord or SoX rec]
    Native --> Chunks[audio chunks + audio levels]
    Recorder --> Chunks
    Chunks --> Stream[voice transcription stream]
    Stream --> Interim[interim transcript]
    Stream --> Final[final transcript]
    Final --> Inject[input buffer injection]
    Inject --> Prompt[normal prompt submission path]
```

Voice mode is dictation, not a separate agent loop. After transcription, the result flows back into the same text prompt pipeline used by keyboard input.
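The voiceState strings in the bundle ("idle", warmup, recording, transcribing) imply a small state machine driving the flow above. The sketch below is illustrative: the state and event names are assumptions reconstructed from the source strings, not the bundle's actual identifiers.

```typescript
// Hypothetical voice state machine implied by the voiceState strings.
type VoiceState = "idle" | "warmingUp" | "recording" | "transcribing";
type VoiceEvent =
  | "pushToTalkPressed"   // hold-to-talk key down / tap toggle on
  | "captureReady"        // capture backend reports audio flowing
  | "pushToTalkReleased"  // key up / tap toggle off
  | "finalTranscript"     // stream delivered the final text
  | "error";

function nextVoiceState(state: VoiceState, event: VoiceEvent): VoiceState {
  switch (state) {
    case "idle":
      return event === "pushToTalkPressed" ? "warmingUp" : "idle";
    case "warmingUp":
      if (event === "captureReady") return "recording";
      return event === "error" ? "idle" : "warmingUp";
    case "recording":
      if (event === "pushToTalkReleased") return "transcribing";
      return event === "error" ? "idle" : "recording";
    case "transcribing":
      return event === "finalTranscript" || event === "error"
        ? "idle"
        : "transcribing";
  }
}
```

Errors collapse back to idle from every non-idle state, which matches the observation below that failures surface as voiceError without replacing the text-input path.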
User-facing controls
| Surface | Meaning |
|---|---|
| /voice | Toggle or configure voice mode. |
| /voice hold | Hold-to-talk dictation. |
| /voice tap | Tap-to-toggle dictation. |
| /voice off | Disable voice mode. |
| voice:pushToTalk | TUI keybinding/action; the default chat binding includes space for push-to-talk. |
| voice.autoSubmit | Setting that can submit after transcript injection rather than only inserting text. |
| language | Preferred language for Claude responses and voice dictation. |
The state model contains voiceState, voiceError, voiceInterimTranscript, voiceAudioLevels, voiceWarmingUp, and awaitingVoiceSubmitDoubleTap, which explains the visible warmup/recording/transcribing feedback.
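Taken together, those fields suggest a state shape like the following. The field names match the source strings; the types and the initializer are assumptions for illustration.

```typescript
// Illustrative shape of the TUI voice state named in the bundle.
interface VoiceTuiState {
  voiceState: "idle" | "warmingUp" | "recording" | "transcribing";
  voiceError: string | null;
  voiceInterimTranscript: string;
  voiceAudioLevels: number[];            // drives the level visualization
  voiceWarmingUp: boolean;
  awaitingVoiceSubmitDoubleTap: boolean; // double-tap-to-submit handshake
}

function initialVoiceState(): VoiceTuiState {
  return {
    voiceState: "idle",
    voiceError: null,
    voiceInterimTranscript: "",
    voiceAudioLevels: [],
    voiceWarmingUp: false,
    awaitingVoiceSubmitDoubleTap: false,
  };
}
```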
Capture backends
Native N-API path
The original Bun payload ships audio-capture.js and audio-capture.node; the final repository layout retains only the JS shim. cli.js loads the native addon when it is available at runtime and exports wrappers such as:
- startNativeRecording
- stopNativeRecording
- isNativeRecordingActive
- startNativePlayback
- writeNativePlaybackData
- stopNativePlayback
- microphoneAuthorizationStatus
- isNativeAudioAvailable
The audio-capture-napi loaded anchor confirms the runtime attempts to use this addon when available.
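A common pattern for optional native addons is a guarded require that degrades to the fallback path when the binary is absent. This sketch shows that pattern; only the addon path string and the "audio-capture-napi loaded" log line appear in the source, and the function shape here is an assumption.

```typescript
// Hedged sketch: probing for an optional N-API addon at runtime.
type NativeAudio = { startNativeRecording?: () => void };

function loadNativeAudio(requireFn: (id: string) => unknown): NativeAudio | null {
  try {
    const addon = requireFn("/$bunfs/root/audio-capture.node") as NativeAudio;
    console.log("audio-capture-napi loaded");
    return addon;
  } catch {
    return null; // addon missing or unloadable: fall back to OS recorders
  }
}
```

Injecting `requireFn` rather than calling require directly is an illustrative choice that keeps the probe testable.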
Recorder fallback path
When native capture is unavailable, the runtime falls back to command-line recorders:
- Linux/WSL can use arecord.
- A SoX rec path exists and produces user-facing guidance when missing.
- WSL has explicit failure messaging when no working recorder exists.
This fallback design keeps the JS/TUI voice state machine independent from any one capture backend.
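The selection logic implied by the error strings can be sketched as a simple priority chain. The error messages are taken from the source anchors above; the function name, probe flags, and ordering are assumptions.

```typescript
// Illustrative capture-backend selection: native addon first, then
// arecord, then SoX's rec, with the source's WSL/SoX error strings.
function pickRecorder(opts: {
  nativeAvailable: boolean;
  hasArecord: boolean;
  hasSoxRec: boolean;
  isWsl: boolean;
}): "native" | "arecord" | "rec" {
  if (opts.nativeAvailable) return "native";
  if (opts.hasArecord) return "arecord";
  if (opts.hasSoxRec) return "rec";
  throw new Error(
    opts.isWsl
      ? "Voice mode could not find a working audio recorder in WSL"
      : "Voice mode requires SoX for audio recording"
  );
}
```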
Transcription stream and injection
The recording flow has two phases:
- Capture phase: start local recording, collect chunks, and surface audio levels.
- Stream phase: connect to the voice stream, buffer audio while the WebSocket-like stream is connecting, send audio frames, receive interim/final transcript messages, and close/finalize.
```mermaid
sequenceDiagram
    autonumber
    participant UI as TUI voice action
    participant Capture as Native/fallback capture
    participant Buffer as Audio buffer
    participant Stream as Voice stream
    participant Input as Prompt input buffer
    UI->>Capture: startRecording
    Capture-->>UI: audio levels / chunks
    UI->>Stream: open transcription stream
    Capture->>Buffer: buffer chunks while stream connects
    Stream-->>UI: ready
    Buffer->>Stream: flush buffered chunks
    Capture->>Stream: send live audio frames
    Stream-->>UI: interim transcript
    UI->>Capture: finishRecording
    Capture-->>Stream: final audio / close
    Stream-->>UI: final transcript
    UI->>Input: Injecting transcript
```

The source strings Final transcript assembled and Injecting transcript confirm that the transcribed text is not merely displayed; it becomes input to the regular prompt flow.
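The "startRecording: buffering audio while WebSocket connects" log line implies a hold-and-flush buffer between capture and stream. This is a minimal sketch of that behavior under assumed names; the bundle's actual buffering code is not recoverable.

```typescript
// Sketch: hold captured chunks until the transcription stream is ready,
// then flush in order and pass subsequent chunks straight through.
class BufferedVoiceStream {
  private pending: Uint8Array[] = [];
  private ready = false;

  constructor(private send: (chunk: Uint8Array) => void) {}

  push(chunk: Uint8Array): void {
    if (this.ready) this.send(chunk);
    else this.pending.push(chunk); // stream still connecting
  }

  markReady(): void {
    this.ready = true;
    for (const chunk of this.pending) this.send(chunk); // flush in order
    this.pending = [];
  }
}
```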
Availability and gates
Voice mode is constrained by environment and account state:
| Gate | Source-confirmed behavior |
|---|---|
| Local audio device | Remote/no-device environments show Voice mode requires microphone access... run Claude Code locally instead. |
| Account/auth | /voice can report Voice mode requires a Claude.ai account; stream errors include voice_stream_no_auth. |
| Recorder dependencies | WSL and SoX-specific errors guide the user when no recorder backend is available. |
| Settings | voiceEnabled, voice.enabled, voice.mode, voice.autoSubmit, and language all affect behavior. |
| Feature/availability check | The TUI renders voice indicators only when the availability helper says voice can run. |
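The gates in the table compose into a single availability decision. The check below is a hedged reconstruction: the user-facing strings come from the source anchors, while the helper name, flag names, and check order are assumptions.

```typescript
// Illustrative combined availability check for voice mode.
function voiceAvailable(env: {
  voiceEnabled: boolean;
  hasAudioDevice: boolean;
  hasClaudeAiAccount: boolean;
  hasRecorderBackend: boolean;
}): { ok: boolean; reason?: string } {
  if (!env.voiceEnabled)
    return { ok: false, reason: "voice disabled in settings" };
  if (!env.hasAudioDevice)
    return { ok: false, reason: "Voice mode requires microphone access" };
  if (!env.hasClaudeAiAccount)
    return { ok: false, reason: "Voice mode requires a Claude.ai account" };
  if (!env.hasRecorderBackend)
    return { ok: false, reason: "no working audio recorder" };
  return { ok: true };
}
```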
The current evidence supports documenting voice as supported local dictation, not as always available in every environment.
Telemetry and error handling
The bundle contains voice-specific telemetry/error names:
| Event/string | Meaning |
|---|---|
| tengu_voice_toggled | Voice setting changed. |
| tengu_voice_recording_started | Local recording began. |
| tengu_voice_recording_completed | Recording completed. |
| voice_transcription_connection_failed | Could not connect to the transcription stream. |
| voice_transcription_no_audio_signal | Capture produced no usable audio signal. |
| voice_transcription_no_speech | Speech was not detected in the recorded audio. |
| voice_stream_no_auth | Voice stream rejected/failed auth. |
| voice_stream error | General stream failure. |
Failures are surfaced in the TUI as voiceError and do not replace the normal text-input path.
Relationship to media native modules
The older Media native modules inventory correctly identified audio-capture.node as shipped payload. This page adds the missing runtime call path: the main bundle can load the module when present, starts/stops recording, falls back to OS recorders, and injects the resulting transcript.
Caveats
- The .node binary itself is stripped. This page documents the JavaScript call boundary, exported wrapper names, and user-visible behavior, not native implementation details such as device enumeration internals.
- The stream endpoint and server-side transcription implementation are not recoverable from this source alone. The bundle proves a client-side voice stream and auth/error handling, not the backend model details.
- Voice mode should be described as dictation. There is no evidence here that the agent loop itself becomes audio-native; text remains the prompt handoff after transcription.
Created and maintained by Yingting Huang.