
Audio capture and voice mode

This page answers the voice question for the extracted cli.js: does Claude Code support voice, and how is it designed?

Yes. The runtime contains a source-confirmed voice dictation path: local microphone capture plus remote transcription. /voice enables hold-to-talk or tap-to-toggle recording; audio is captured through a native N-API module or an OS command-line recorder fallback; a voice stream delivers interim and final transcript chunks; and the final transcript is injected into the prompt input.

Source anchors

| Semantic alias | Source | Approximate location | String or symbol | Meaning |
| --- | --- | --- | --- | --- |
| AudioNativeAddonRequire | cli.js | line ~11, byte 0x81d | require("/$bunfs/root/audio-capture.node") | Main bundle can load the embedded audio native addon. |
| AudioCaptureShim | claude-code-pkg/audio-capture.js | line ~11 | require("/$bunfs/root/audio-capture.node") | Retained JS shim references the audio N-API binary from the original Bun payload. |
| LegacyVoiceEnabledSetting | cli.js | line ~185, byte 0x116cb1 | voiceEnabled:y.boolean | Legacy/global voice setting surface. |
| VoiceModeSettingsSchema | cli.js | line ~185, byte 0x11d5e6 | Voice mode settings (hold-to-talk / tap-to-toggle dictation) | Structured voice settings schema. |
| VoiceLanguageSetting | cli.js | line ~185, byte 0x11c36f | Preferred language for Claude responses and voice dictation | Voice dictation shares the language setting. |
| VoiceIdleState | cli.js | line ~605, byte 0x38c87d | voiceState:"idle" | TUI voice state starts idle. |
| VoiceInterimTranscriptState | cli.js | line ~605, byte 0x38c89f | voiceInterimTranscript | Interim transcript is part of TUI state. |
| VoiceAudioLevelsState | cli.js | line ~605, byte 0x38c8b9 | voiceAudioLevels | Audio-level visualization state. |
| PushToTalkAction | cli.js | line ~605, byte 0x38dbbb | voice:pushToTalk | Keybinding/action for voice recording. |
| NativeAudioWrapper | cli.js | line ~7745, byte 0xb5d6e1 | writeNativePlaybackData:()=>Kb5 | Native audio wrapper exports playback and recording methods. |
| AudioNapiLoadedLog | cli.js | line ~7745, byte 0xb5dd23 | audio-capture-napi loaded | Native audio addon load path succeeds when available. |
| MicrophoneAccessGuard | cli.js | line ~7745, byte 0xb5e43d | Voice mode requires microphone access | Remote/no-device guard for voice mode. |
| WslRecorderFallbackGuard | cli.js | line ~7747, byte 0xb5e530 | Voice mode could not find a working audio recorder in WSL | WSL fallback error. |
| SoxRecorderRequirement | cli.js | line ~7751, byte 0xb5e76d | Voice mode requires SoX for audio recording | SoX fallback requirement. |
| VoiceAccountGate | cli.js | line ~7754, byte 0xb5ef70 | Voice mode requires a Claude.ai account | Voice stream is account-gated. |
| VoiceSlashCommand | cli.js | line ~7756, byte 0xb5f941 | Toggle voice mode | /voice slash command description. |
| VoiceTuiComponents | cli.js | line ~9355 | VoiceIndicator, VoiceWarmupHint | TUI components render recording/warmup state. |
| VoiceFinishRecording | cli.js | line ~9514, byte 0xc8c6f9 | finishRecording: stopping recording | Recording state machine finalizes capture. |
| VoiceWebSocketBuffering | cli.js | line ~9514, byte 0xc8da9a | startRecording: buffering audio while WebSocket connects | Captured chunks are buffered until the transcription stream is ready. |
| VoiceFinalTranscript | cli.js | line ~9514, byte 0xc8ccec | Final transcript assembled | Transcription stream produces final text. |
| VoiceTranscriptInjection | cli.js | line ~9514, byte 0xc8ce2c | Injecting transcript | Final transcript is injected into the input buffer. |
| VoiceConnectionFailureTelemetry | cli.js | line ~9514, byte 0xc8cec1 | voice_transcription_connection_failed | Voice stream connection failure telemetry. |
| VoiceStreamErrorPath | cli.js | line ~9514, byte 0xc8e327 | voice_stream error | Voice stream error path. |
| VoiceStreamAuthFailure | cli.js | line ~9514, byte 0xc8e7dc | voice_stream_no_auth | Voice stream auth failure telemetry. |
| VoiceDiscoveryHint | cli.js | line ~9553, byte 0xcba243 | Use /voice to enable push-to-talk dictation | User-visible discovery hint. |

High-level design

```mermaid
flowchart TD
  User["/voice command or push-to-talk key"] --> Settings[voice.enabled / voice.mode / autoSubmit]
  Settings --> UI[VoiceProvider state]
  UI --> Capture{capture backend}
  Capture -->|native available| Native[audio-capture.node N-API]
  Capture -->|Linux/WSL fallback| Recorder[arecord or SoX rec]
  Native --> Chunks[audio chunks + audio levels]
  Recorder --> Chunks
  Chunks --> Stream[voice transcription stream]
  Stream --> Interim[interim transcript]
  Stream --> Final[final transcript]
  Final --> Inject[input buffer injection]
  Inject --> Prompt[normal prompt submission path]
```

Voice mode is dictation, not a separate agent loop. After transcription, the result flows back into the same text prompt pipeline used by keyboard input.

User-facing controls

| Surface | Meaning |
| --- | --- |
| /voice | Toggle or configure voice mode. |
| /voice hold | Hold-to-talk dictation. |
| /voice tap | Tap-to-toggle dictation. |
| /voice off | Disable voice mode. |
| voice:pushToTalk | TUI keybinding/action; the default chat binding includes space for push-to-talk. |
| voice.autoSubmit | Setting that can submit after transcript injection rather than only inserting text. |
| language | Preferred language for Claude responses and voice dictation. |
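The coexistence of the legacy voiceEnabled flag and the structured voice.* settings suggests a merge step when settings are resolved. The sketch below is an assumed reconstruction: the setting names come from the source anchors, but the interface shape, the helper name resolveVoiceSettings, and the precedence order (structured settings win over the legacy flag) are illustrative guesses.

```typescript
// Hypothetical settings shape inferred from the voice.* keys and the legacy
// voiceEnabled flag; the merge precedence is an assumption, not source-confirmed.
interface VoiceSettings {
  enabled: boolean;                       // voice.enabled
  mode: "hold-to-talk" | "tap-to-toggle"; // voice.mode
  autoSubmit: boolean;                    // voice.autoSubmit
}

function resolveVoiceSettings(
  legacyVoiceEnabled?: boolean,
  modern?: Partial<VoiceSettings>,
): VoiceSettings {
  return {
    enabled: modern?.enabled ?? legacyVoiceEnabled ?? false,
    mode: modern?.mode ?? "hold-to-talk",
    autoSubmit: modern?.autoSubmit ?? false,
  };
}
```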

The state model contains voiceState, voiceError, voiceInterimTranscript, voiceAudioLevels, voiceWarmingUp, and awaitingVoiceSubmitDoubleTap, which explains the visible warmup/recording/transcribing feedback.
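The state fields above imply a small state machine behind the TUI feedback. Here is a hedged TypeScript sketch: the field names are the ones found in the bundle, but the event names and transition logic are an assumed reconstruction, not decompiled behavior.

```typescript
type VoiceState = "idle" | "warmingUp" | "recording" | "transcribing";

// Field names match the bundle's TUI state; types are inferred.
interface VoiceUiState {
  voiceState: VoiceState;
  voiceError: string | null;
  voiceInterimTranscript: string;
  voiceAudioLevels: number[]; // recent levels for the waveform indicator
  voiceWarmingUp: boolean;
  awaitingVoiceSubmitDoubleTap: boolean;
}

const initial: VoiceUiState = {
  voiceState: "idle",
  voiceError: null,
  voiceInterimTranscript: "",
  voiceAudioLevels: [],
  voiceWarmingUp: false,
  awaitingVoiceSubmitDoubleTap: false,
};

// Assumed transitions: pushToTalk starts warmup, finishRecording moves to
// transcribing, and a final transcript (or an error) returns the UI to idle.
function reduce(s: VoiceUiState, ev:
  | { type: "pushToTalk" }
  | { type: "warmedUp" }
  | { type: "interim"; text: string }
  | { type: "finishRecording" }
  | { type: "finalTranscript" }
  | { type: "error"; message: string }): VoiceUiState {
  switch (ev.type) {
    case "pushToTalk":      return { ...initial, voiceState: "warmingUp", voiceWarmingUp: true };
    case "warmedUp":        return { ...s, voiceState: "recording", voiceWarmingUp: false };
    case "interim":         return { ...s, voiceInterimTranscript: ev.text };
    case "finishRecording": return { ...s, voiceState: "transcribing" };
    case "finalTranscript": return { ...initial };
    case "error":           return { ...initial, voiceError: ev.message };
  }
}
```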

Capture backends

Native N-API path

The original Bun payload ships audio-capture.js and audio-capture.node; the final repository layout retains only the JS shim. cli.js loads the native addon when it is available at runtime and exports wrappers such as:

  • startNativeRecording
  • stopNativeRecording
  • isNativeRecordingActive
  • startNativePlayback
  • writeNativePlaybackData
  • stopNativePlayback
  • microphoneAuthorizationStatus
  • isNativeAudioAvailable

The audio-capture-napi loaded anchor confirms the runtime attempts to use this addon when available.
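The load-then-fallback behavior implied by that anchor can be sketched as a guarded require. The helper name loadNativeAudio and the trimmed interface below are illustrative, not the bundle's own symbols; only the addon path and the log string come from the source.

```typescript
// Illustrative sketch: prefer the embedded N-API addon when it loads,
// otherwise return null so callers can choose the OS-recorder fallback.
interface NativeAudio {
  startNativeRecording(): void;
  stopNativeRecording(): void;
  isNativeRecordingActive(): boolean;
}

function loadNativeAudio(
  requireFn: (id: string) => unknown,
  path: string,
): NativeAudio | null {
  try {
    const addon = requireFn(path) as NativeAudio;
    console.log("audio-capture-napi loaded"); // matches the source anchor
    return addon;
  } catch {
    return null; // caller falls back to arecord / SoX
  }
}
```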

Recorder fallback path

When native capture is unavailable, the runtime falls back to command-line recorders:

  • Linux/WSL can use arecord.
  • A SoX rec path exists and produces user-facing guidance when missing.
  • WSL has explicit failure messaging when no working recorder exists.

This fallback design keeps the JS/TUI voice state machine independent from any one capture backend.
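A hedged sketch of that selection: the recorder names (arecord, SoX's rec) and the two error strings come from the bundle, while the probing order, the argv details, and the injected available predicate (standing in for a PATH lookup) are assumptions for illustration.

```typescript
// Pick a CLI recorder command for platforms without the native addon.
// `available` abstracts a PATH lookup (e.g. `which`), keeping this testable.
type RecorderChoice =
  | { kind: "arecord"; argv: string[] }
  | { kind: "sox"; argv: string[] };

function chooseRecorder(
  platform: "linux" | "wsl" | "darwin",
  available: (cmd: string) => boolean,
): RecorderChoice {
  if ((platform === "linux" || platform === "wsl") && available("arecord")) {
    // 16 kHz mono signed 16-bit is a common speech-capture format (assumed here).
    return { kind: "arecord", argv: ["arecord", "-f", "S16_LE", "-r", "16000", "-c", "1"] };
  }
  if (available("rec")) {
    // SoX ships the `rec` front end; format mirrors the arecord choice.
    return { kind: "sox", argv: ["rec", "-b", "16", "-r", "16000", "-c", "1", "-t", "raw", "-"] };
  }
  throw new Error(
    platform === "wsl"
      ? "Voice mode could not find a working audio recorder in WSL"
      : "Voice mode requires SoX for audio recording",
  );
}
```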

Transcription stream and injection

The recording flow has two phases:

  1. Capture phase: start local recording, collect chunks, and surface audio levels.
  2. Stream phase: connect to the voice stream, buffer audio while the WebSocket-like stream is connecting, send audio frames, receive interim/final transcript messages, and close/finalize.
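The "buffer while the stream connects" step reduces to a small queue that flushes on open. The class and method names below are illustrative; only the buffering behavior itself is what the startRecording anchor confirms.

```typescript
// Queue audio chunks until the transcription stream signals readiness,
// then flush in order and send subsequent chunks directly.
class BufferedVoiceSender {
  private pending: Uint8Array[] = [];
  private open = false;

  constructor(private send: (chunk: Uint8Array) => void) {}

  push(chunk: Uint8Array): void {
    // "startRecording: buffering audio while WebSocket connects"
    if (this.open) this.send(chunk);
    else this.pending.push(chunk);
  }

  onOpen(): void {
    this.open = true;
    for (const chunk of this.pending) this.send(chunk);
    this.pending = [];
  }
}
```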
```mermaid
sequenceDiagram
  autonumber
  participant UI as TUI voice action
  participant Capture as Native/fallback capture
  participant Buffer as Audio buffer
  participant Stream as Voice stream
  participant Input as Prompt input buffer
  UI->>Capture: startRecording
  Capture-->>UI: audio levels / chunks
  UI->>Stream: open transcription stream
  Capture->>Buffer: buffer chunks while stream connects
  Stream-->>UI: ready
  Buffer->>Stream: flush buffered chunks
  Capture->>Stream: send live audio frames
  Stream-->>UI: interim transcript
  UI->>Capture: finishRecording
  Capture-->>Stream: final audio / close
  Stream-->>UI: final transcript
  UI->>Input: Injecting transcript
```

The source strings Final transcript assembled and Injecting transcript confirm that the transcribed text is not merely displayed; it becomes input to the regular prompt flow.
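The injection handoff amounts to appending the final transcript to the pending input and optionally submitting. This sketch assumes voice.autoSubmit gates the submit step and that a voice transcript joins any existing draft with a space; both details, and the function name, are illustrative.

```typescript
// Inject a final transcript into the prompt input; autoSubmit decides whether
// the resulting text is also sent through the normal prompt submission path.
function injectTranscript(
  currentInput: string,
  finalTranscript: string,
  autoSubmit: boolean,
  submit: (prompt: string) => void,
): string {
  const next = currentInput.length > 0
    ? `${currentInput} ${finalTranscript}`
    : finalTranscript;
  if (autoSubmit) submit(next);
  return next;
}
```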

Availability and gates

Voice mode is constrained by environment and account state:

| Gate | Source-confirmed behavior |
| --- | --- |
| Local audio device | Remote/no-device environments show Voice mode requires microphone access... run Claude Code locally instead. |
| Account/auth | /voice can report Voice mode requires a Claude.ai account; stream errors include voice_stream_no_auth. |
| Recorder dependencies | WSL and SoX-specific errors guide the user when no recorder backend is available. |
| Settings | voiceEnabled, voice.enabled, voice.mode, voice.autoSubmit, and language all affect behavior. |
| Feature/availability check | The TUI renders voice indicators only when the availability helper says voice can run. |

The current evidence supports documenting voice as supported local dictation, not as always available in every environment.

Telemetry and error handling

The bundle contains voice-specific telemetry/error names:

| Event/string | Meaning |
| --- | --- |
| tengu_voice_toggled | Voice setting changed. |
| tengu_voice_recording_started | Local recording began. |
| tengu_voice_recording_completed | Recording completed. |
| voice_transcription_connection_failed | Could not connect to the transcription stream. |
| voice_transcription_no_audio_signal | Capture produced no usable audio signal. |
| voice_transcription_no_speech | Speech was not detected in the recorded audio. |
| voice_stream_no_auth | Voice stream rejected/failed auth. |
| voice_stream error | General stream failure. |

Failures are surfaced in the TUI as voiceError and do not replace the normal text-input path.
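Mapping stream failures onto a telemetry event plus a voiceError message can be sketched as a small classifier. The event names and the account message are the source-confirmed strings above; the dispatcher shape, the err.code values, and the remaining messages are assumptions.

```typescript
// Translate a voice-stream failure into a telemetry event name and a
// user-visible voiceError, leaving the keyboard input path untouched.
function classifyVoiceFailure(err: { code?: string }): { event: string; voiceError: string } {
  switch (err.code) {
    case "no_auth":
      return { event: "voice_stream_no_auth", voiceError: "Voice mode requires a Claude.ai account" };
    case "connect":
      return { event: "voice_transcription_connection_failed", voiceError: "Could not connect to the transcription stream" };
    case "no_audio_signal":
      return { event: "voice_transcription_no_audio_signal", voiceError: "No audio signal detected" };
    case "no_speech":
      return { event: "voice_transcription_no_speech", voiceError: "No speech detected" };
    default:
      return { event: "voice_stream error", voiceError: "Voice stream failed" };
  }
}
```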

Relationship to media native modules

The older Media native modules inventory correctly identified audio-capture.node as shipped payload. This page adds the missing runtime call path: the main bundle can load the module when present, starts/stops recording, falls back to OS recorders, and injects the resulting transcript.

Caveats

  • The .node binary itself is stripped. This page documents the JavaScript call boundary, exported wrapper names, and user-visible behavior, not native implementation details such as device enumeration internals.
  • The stream endpoint and server-side transcription implementation are not recoverable from this source alone. The bundle proves a client-side voice stream and auth/error handling, not the backend model details.
  • Voice mode should be described as dictation. There is no evidence here that the agent loop itself becomes audio-native; text remains the prompt handoff after transcription.

Created and maintained by Yingting Huang.