
Audio capture and voice mode

This page answers the voice question for the extracted cli.js: does Claude Code support voice, and how is it designed?

Yes. The runtime contains a source-confirmed voice dictation path: local microphone capture plus remote transcription. /voice enables hold-to-talk or tap-to-toggle recording; audio is captured through a native N-API module or an OS command-line recorder fallback; a voice stream delivers interim and final transcript chunks; and the final transcript is injected into the prompt input.

Source anchors

| Semantic alias | Source | Approximate location | String or symbol | Meaning |
| --- | --- | --- | --- | --- |
| AudioNativeAddonRequire | cli.js | line ~11, byte 0x81d | require("/$bunfs/root/audio-capture.node") | Main bundle can load the embedded audio native addon. |
| AudioCaptureShim | claude-code-pkg/audio-capture.js | line ~11 | require("/$bunfs/root/audio-capture.node") | Retained JS shim references the audio N-API binary from the original Bun payload. |
| LegacyVoiceEnabledSetting | cli.js | line ~185, byte 0x116cb1 | voiceEnabled:y.boolean | Legacy/global voice setting surface. |
| VoiceModeSettingsSchema | cli.js | line ~185, byte 0x11d5e6 | Voice mode settings (hold-to-talk / tap-to-toggle dictation) | Structured voice settings schema. |
| VoiceLanguageSetting | cli.js | line ~185, byte 0x11c36f | Preferred language for Claude responses and voice dictation | Voice dictation shares the language setting. |
| VoiceIdleState | cli.js | line ~605, byte 0x38c87d | voiceState:"idle" | TUI voice state starts idle. |
| VoiceInterimTranscriptState | cli.js | line ~605, byte 0x38c89f | voiceInterimTranscript | Interim transcript is part of TUI state. |
| VoiceAudioLevelsState | cli.js | line ~605, byte 0x38c8b9 | voiceAudioLevels | Audio-level visualization state. |
| PushToTalkAction | cli.js | line ~605, byte 0x38dbbb | voice:pushToTalk | Keybinding/action for voice recording. |
| NativeAudioWrapper | cli.js | line ~7745, byte 0xb5d6e1 | writeNativePlaybackData:()=>Kb5 | Native audio wrapper exports playback and recording methods. |
| AudioNapiLoadedLog | cli.js | line ~7745, byte 0xb5dd23 | audio-capture-napi loaded | Native audio addon load path succeeds when available. |
| MicrophoneAccessGuard | cli.js | line ~7745, byte 0xb5e43d | Voice mode requires microphone access | Remote/no-device guard for voice mode. |
| WslRecorderFallbackGuard | cli.js | line ~7747, byte 0xb5e530 | Voice mode could not find a working audio recorder in WSL | WSL fallback error. |
| SoxRecorderRequirement | cli.js | line ~7751, byte 0xb5e76d | Voice mode requires SoX for audio recording | SoX fallback requirement. |
| VoiceAccountGate | cli.js | line ~7754, byte 0xb5ef70 | Voice mode requires a Claude.ai account | Voice stream is account-gated. |
| VoiceSlashCommand | cli.js | line ~7756, byte 0xb5f941 | Toggle voice mode | /voice slash command description. |
| VoiceTuiComponents | cli.js | line ~9355 | VoiceIndicator, VoiceWarmupHint | TUI components render recording/warmup state. |
| VoiceFinishRecording | cli.js | line ~9514, byte 0xc8c6f9 | finishRecording: stopping recording | Recording state machine finalizes capture. |
| VoiceWebSocketBuffering | cli.js | line ~9514, byte 0xc8da9a | startRecording: buffering audio while WebSocket connects | Captured chunks are buffered until the transcription stream is ready. |
| VoiceFinalTranscript | cli.js | line ~9514, byte 0xc8ccec | Final transcript assembled | Transcription stream produces final text. |
| VoiceTranscriptInjection | cli.js | line ~9514, byte 0xc8ce2c | Injecting transcript | Final transcript is injected into the input buffer. |
| VoiceConnectionFailureTelemetry | cli.js | line ~9514, byte 0xc8cec1 | voice_transcription_connection_failed | Voice stream connection failure telemetry. |
| VoiceStreamErrorPath | cli.js | line ~9514, byte 0xc8e327 | voice_stream error | Voice stream error path. |
| VoiceStreamAuthFailure | cli.js | line ~9514, byte 0xc8e7dc | voice_stream_no_auth | Voice stream auth failure telemetry. |
| VoiceDiscoveryHint | cli.js | line ~9553, byte 0xcba243 | Use /voice to enable push-to-talk dictation | User-visible discovery hint. |

High-level design

```mermaid
flowchart TD
  User["/voice command or push-to-talk key"] --> Settings[voice.enabled / voice.mode / autoSubmit]
  Settings --> UI[VoiceProvider state]
  UI --> Capture{capture backend}
  Capture -->|native available| Native[audio-capture.node N-API]
  Capture -->|Linux/WSL fallback| Recorder[arecord or SoX rec]
  Native --> Chunks[audio chunks + audio levels]
  Recorder --> Chunks
  Chunks --> Stream[voice transcription stream]
  Stream --> Interim[interim transcript]
  Stream --> Final[final transcript]
  Final --> Inject[input buffer injection]
  Inject --> Prompt[normal prompt submission path]
```

Voice mode is dictation, not a separate agent loop. After transcription, the result flows back into the same text prompt pipeline used by keyboard input.

User-facing controls

| Surface | Meaning |
| --- | --- |
| /voice | Toggle or configure voice mode. |
| /voice hold | Hold-to-talk dictation. |
| /voice tap | Tap-to-toggle dictation. |
| /voice off | Disable voice mode. |
| voice:pushToTalk | TUI keybinding/action; the default chat binding includes space for push-to-talk. |
| voice.autoSubmit | Setting that can submit after transcript injection rather than only inserting text. |
| language | Preferred language for Claude responses and voice dictation. |
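The coexistence of the legacy voiceEnabled flag and the structured voice.* settings suggests a merge step when settings are resolved. The sketch below is an assumed reconstruction: the setting names come from the source anchors, but the interface shape, the helper name resolveVoiceSettings, and the precedence order (structured settings win over the legacy flag) are illustrative guesses.

```typescript
// Hypothetical settings shape inferred from the voice.* keys and the legacy
// voiceEnabled flag; the merge precedence is an assumption, not source-confirmed.
interface VoiceSettings {
  enabled: boolean;                       // voice.enabled
  mode: "hold-to-talk" | "tap-to-toggle"; // voice.mode
  autoSubmit: boolean;                    // voice.autoSubmit
}

function resolveVoiceSettings(
  legacyVoiceEnabled?: boolean,
  modern?: Partial<VoiceSettings>,
): VoiceSettings {
  return {
    enabled: modern?.enabled ?? legacyVoiceEnabled ?? false,
    mode: modern?.mode ?? "hold-to-talk",
    autoSubmit: modern?.autoSubmit ?? false,
  };
}
```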

The state model contains voiceState, voiceError, voiceInterimTranscript, voiceAudioLevels, voiceWarmingUp, and awaitingVoiceSubmitDoubleTap, which explains the visible warmup/recording/transcribing feedback.
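The state fields above imply a small state machine behind the TUI feedback. Here is a hedged TypeScript sketch: the field names are the ones found in the bundle, but the event names and transition logic are an assumed reconstruction, not decompiled behavior.

```typescript
type VoiceState = "idle" | "warmingUp" | "recording" | "transcribing";

// Field names match the bundle's TUI state; types are inferred.
interface VoiceUiState {
  voiceState: VoiceState;
  voiceError: string | null;
  voiceInterimTranscript: string;
  voiceAudioLevels: number[]; // recent levels for the waveform indicator
  voiceWarmingUp: boolean;
  awaitingVoiceSubmitDoubleTap: boolean;
}

const initial: VoiceUiState = {
  voiceState: "idle",
  voiceError: null,
  voiceInterimTranscript: "",
  voiceAudioLevels: [],
  voiceWarmingUp: false,
  awaitingVoiceSubmitDoubleTap: false,
};

// Assumed transitions: pushToTalk starts warmup, finishRecording moves to
// transcribing, and a final transcript (or an error) returns the UI to idle.
function reduce(s: VoiceUiState, ev:
  | { type: "pushToTalk" }
  | { type: "warmedUp" }
  | { type: "interim"; text: string }
  | { type: "finishRecording" }
  | { type: "finalTranscript" }
  | { type: "error"; message: string }): VoiceUiState {
  switch (ev.type) {
    case "pushToTalk":      return { ...initial, voiceState: "warmingUp", voiceWarmingUp: true };
    case "warmedUp":        return { ...s, voiceState: "recording", voiceWarmingUp: false };
    case "interim":         return { ...s, voiceInterimTranscript: ev.text };
    case "finishRecording": return { ...s, voiceState: "transcribing" };
    case "finalTranscript": return { ...initial };
    case "error":           return { ...initial, voiceError: ev.message };
  }
}
```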

Capture backends

Native N-API path

The original Bun payload ships audio-capture.js and audio-capture.node; the final repository layout retains only the JS shim. cli.js loads the native addon when it is available at runtime and exports wrappers such as:

  • startNativeRecording
  • stopNativeRecording
  • isNativeRecordingActive
  • startNativePlayback
  • writeNativePlaybackData
  • stopNativePlayback
  • microphoneAuthorizationStatus
  • isNativeAudioAvailable

The audio-capture-napi loaded anchor confirms the runtime attempts to use this addon when available.
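The load-then-fallback behavior implied by that anchor can be sketched as a guarded require. The helper name loadNativeAudio and the trimmed interface below are illustrative, not the bundle's own symbols; only the addon path and the log string come from the source.

```typescript
// Illustrative sketch: prefer the embedded N-API addon when it loads,
// otherwise return null so callers can choose the OS-recorder fallback.
interface NativeAudio {
  startNativeRecording(): void;
  stopNativeRecording(): void;
  isNativeRecordingActive(): boolean;
}

function loadNativeAudio(
  requireFn: (id: string) => unknown,
  path: string,
): NativeAudio | null {
  try {
    const addon = requireFn(path) as NativeAudio;
    console.log("audio-capture-napi loaded"); // matches the source anchor
    return addon;
  } catch {
    return null; // caller falls back to arecord / SoX
  }
}
```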

Recorder fallback path

When native capture is unavailable, the runtime falls back to command-line recorders:

  • Linux/WSL can use arecord.
  • A SoX rec path exists and produces user-facing guidance when missing.
  • WSL has explicit failure messaging when no working recorder exists.

This fallback design keeps the JS/TUI voice state machine independent from any one capture backend.
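A hedged sketch of that selection: the recorder names (arecord, SoX's rec) and the two error strings come from the bundle, while the probing order, the argv details, and the injected available predicate (standing in for a PATH lookup) are assumptions for illustration.

```typescript
// Pick a CLI recorder command for platforms without the native addon.
// `available` abstracts a PATH lookup (e.g. `which`), keeping this testable.
type RecorderChoice =
  | { kind: "arecord"; argv: string[] }
  | { kind: "sox"; argv: string[] };

function chooseRecorder(
  platform: "linux" | "wsl" | "darwin",
  available: (cmd: string) => boolean,
): RecorderChoice {
  if ((platform === "linux" || platform === "wsl") && available("arecord")) {
    // 16 kHz mono signed 16-bit is a common speech-capture format (assumed here).
    return { kind: "arecord", argv: ["arecord", "-f", "S16_LE", "-r", "16000", "-c", "1"] };
  }
  if (available("rec")) {
    // SoX ships the `rec` front end; format mirrors the arecord choice.
    return { kind: "sox", argv: ["rec", "-b", "16", "-r", "16000", "-c", "1", "-t", "raw", "-"] };
  }
  throw new Error(
    platform === "wsl"
      ? "Voice mode could not find a working audio recorder in WSL"
      : "Voice mode requires SoX for audio recording",
  );
}
```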

Transcription stream and injection

The recording flow has two phases:

  1. Capture phase: start local recording, collect chunks, and surface audio levels.
  2. Stream phase: connect to the voice stream, buffer audio while the WebSocket-like stream is connecting, send audio frames, receive interim/final transcript messages, and close/finalize.
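The "buffer while the stream connects" step reduces to a small queue that flushes on open. The class and method names below are illustrative; only the buffering behavior itself is what the startRecording anchor confirms.

```typescript
// Queue audio chunks until the transcription stream signals readiness,
// then flush in order and send subsequent chunks directly.
class BufferedVoiceSender {
  private pending: Uint8Array[] = [];
  private open = false;

  constructor(private send: (chunk: Uint8Array) => void) {}

  push(chunk: Uint8Array): void {
    // "startRecording: buffering audio while WebSocket connects"
    if (this.open) this.send(chunk);
    else this.pending.push(chunk);
  }

  onOpen(): void {
    this.open = true;
    for (const chunk of this.pending) this.send(chunk);
    this.pending = [];
  }
}
```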
```mermaid
sequenceDiagram
  autonumber
  participant UI as TUI voice action
  participant Capture as Native/fallback capture
  participant Buffer as Audio buffer
  participant Stream as Voice stream
  participant Input as Prompt input buffer
  UI->>Capture: startRecording
  Capture-->>UI: audio levels / chunks
  UI->>Stream: open transcription stream
  Capture->>Buffer: buffer chunks while stream connects
  Stream-->>UI: ready
  Buffer->>Stream: flush buffered chunks
  Capture->>Stream: send live audio frames
  Stream-->>UI: interim transcript
  UI->>Capture: finishRecording
  Capture-->>Stream: final audio / close
  Stream-->>UI: final transcript
  UI->>Input: Injecting transcript
```

The source strings Final transcript assembled and Injecting transcript confirm that the transcribed text is not merely displayed; it becomes input to the regular prompt flow.
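The injection handoff amounts to appending the final transcript to the pending input and optionally submitting. This sketch assumes voice.autoSubmit gates the submit step and that a voice transcript joins any existing draft with a space; both details, and the function name, are illustrative.

```typescript
// Inject a final transcript into the prompt input; autoSubmit decides whether
// the resulting text is also sent through the normal prompt submission path.
function injectTranscript(
  currentInput: string,
  finalTranscript: string,
  autoSubmit: boolean,
  submit: (prompt: string) => void,
): string {
  const next = currentInput.length > 0
    ? `${currentInput} ${finalTranscript}`
    : finalTranscript;
  if (autoSubmit) submit(next);
  return next;
}
```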

Availability and gates

Voice mode is constrained by environment and account state:

| Gate | Source-confirmed behavior |
| --- | --- |
| Local audio device | Remote/no-device environments show Voice mode requires microphone access... run Claude Code locally instead. |
| Account/auth | /voice can report Voice mode requires a Claude.ai account; stream errors include voice_stream_no_auth. |
| Recorder dependencies | WSL and SoX-specific errors guide the user when no recorder backend is available. |
| Settings | voiceEnabled, voice.enabled, voice.mode, voice.autoSubmit, and language all affect behavior. |
| Feature/availability check | The TUI renders voice indicators only when the availability helper says voice can run. |

The current evidence supports documenting voice as supported local dictation, not as always available in every environment.

Telemetry and error handling

The bundle contains voice-specific telemetry/error names:

| Event/string | Meaning |
| --- | --- |
| tengu_voice_toggled | Voice setting changed. |
| tengu_voice_recording_started | Local recording began. |
| tengu_voice_recording_completed | Recording completed. |
| voice_transcription_connection_failed | Could not connect to the transcription stream. |
| voice_transcription_no_audio_signal | Capture produced no usable audio signal. |
| voice_transcription_no_speech | Speech was not detected in the recorded audio. |
| voice_stream_no_auth | Voice stream rejected/failed auth. |
| voice_stream error | General stream failure. |

Failures are surfaced in the TUI as voiceError and do not replace the normal text-input path.
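Mapping stream failures onto a telemetry event plus a voiceError message can be sketched as a small classifier. The event names and the account message are the source-confirmed strings above; the dispatcher shape, the err.code values, and the remaining messages are assumptions.

```typescript
// Translate a voice-stream failure into a telemetry event name and a
// user-visible voiceError, leaving the keyboard input path untouched.
function classifyVoiceFailure(err: { code?: string }): { event: string; voiceError: string } {
  switch (err.code) {
    case "no_auth":
      return { event: "voice_stream_no_auth", voiceError: "Voice mode requires a Claude.ai account" };
    case "connect":
      return { event: "voice_transcription_connection_failed", voiceError: "Could not connect to the transcription stream" };
    case "no_audio_signal":
      return { event: "voice_transcription_no_audio_signal", voiceError: "No audio signal detected" };
    case "no_speech":
      return { event: "voice_transcription_no_speech", voiceError: "No speech detected" };
    default:
      return { event: "voice_stream error", voiceError: "Voice stream failed" };
  }
}
```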

Relationship to media native modules

The older Media native modules inventory correctly identified audio-capture.node as shipped payload. This page adds the missing runtime call path: the main bundle can load the module when present, starts/stops recording, falls back to OS recorders, and injects the resulting transcript.

Caveats

  • The .node binary itself is stripped. This page documents the JavaScript call boundary, exported wrapper names, and user-visible behavior, not native implementation details such as device enumeration internals.
  • The stream endpoint and server-side transcription implementation are not recoverable from this source alone. The bundle proves a client-side voice stream and auth/error handling, not the backend model details.
  • Voice mode should be described as dictation. There is no evidence here that the agent loop itself becomes audio-native; text remains the prompt handoff after transcription.

Created and maintained by Yingting Huang.