Model selection, calls, usage, quota, and billing

This page documents how cli.js selects models dynamically, how many logical model roles are visible, how provider calls are made, and how rate limits, errors, usage, quota, and billing are surfaced.

Scope: model aliases and precedence, main/helper/subagent/advisor/fallback model roles, Messages API request construction, streaming and retry behavior, rate-limit headers/events, token/cost accounting, headless budget guards, quota checks, and billing/extra-usage UI surfaces.

Source anchors

| Semantic alias | Source | Approximate location | String or symbol | Meaning |
| --- | --- | --- | --- | --- |
| DefaultModelResolvers | cli.js | line ~253, byte 0x20eb5f | getDefaultSonnetModel, getDefaultOpusModel, getDefaultHaikuModel, getDefaultMainLoopModel | Resolver exports for the model family defaults. |
| SmallFastModelOverride | cli.js | line ~253, byte 0x20eb73 | ANTHROPIC_SMALL_FAST_MODEL | Small/fast helper model override. |
| MainModelEnvOverride | cli.js | line ~253, byte 0x20ed29 | ANTHROPIC_MODEL | Environment-level main model override. |
| PerTurnModelResolver | cli.js | line ~253, byte 0x20ef88 | nG({permissionMode,mainLoopModel,exceeds200kTokens}) | Per-turn model resolver; plan mode can alter the selected model. |
| ModelAliasResolver | cli.js | line ~253, byte 0x20fca5 | case "opusplan", case "sonnet", case "haiku", case "opus", case "best" | Alias-to-concrete-model mapping. |
| StartupModelPrecedence | cli.js | line ~791, byte 0x43e8ae | hgK({cli,env,settings,agentFrontmatter}) | Startup model precedence across CLI, env, settings, and agent frontmatter. |
| FallbackModelResolver | cli.js | line ~791, byte 0x43e853 | ygK({cli:{fallbackModel}}) | Fallback-model resolver. |
| StartupModelState | cli.js | line ~19293, byte 0xd90439 | startup_resolve_model | Root startup path stores effective and initial model state. |
| ModelSelectionFlag | cli.js | line ~19525, byte 0xdc18ed | --model <model> | Root model-selection flag. |
| FallbackModelFlag | cli.js | line ~19525, byte 0xdc1b5d | --fallback-model <model> | Print-mode overload fallback flag. |
| AdvisorModelSetting | cli.js | line ~185, byte 0x11ca03 | advisorModel | Settings surface for the server-side advisor tool model. |
| SubagentModelOverride | cli.js | line ~2844, byte 0x7a7606 | CLAUDE_CODE_SUBAGENT_MODEL | Subagent model override. |
| AutoModeClassifierConfig | cli.js | line ~3091, byte 0x7bb35c | tengu_auto_mode_config, twoStageClassifier | Auto-mode classifier model/config selection. |
| AutoModeRequestSource | cli.js | line ~3091, byte 0x7b8fa0 | querySource:"auto_mode" | Auto-mode classifier provider request source. |
| MemoryHelperModel | cli.js | line ~1975, byte 0x50d55b | Select memories relevant to:, model:iv() | Memory relevance helper uses the Sonnet resolver. |
| QuotaProbeRequest | cli.js | line ~793, byte 0x440bcc | source:"quota_check", max_tokens:1, messages:[{..."quota"}] | Quota probe sends a tiny helper request. |
| ProviderRequestWrapper | cli.js | line ~404, byte 0x2c1a5c | [API REQUEST], x-client-request-id | Fetch wrapper logs requests and injects a client request ID. |
| SseStreamDetector | cli.js | line ~404, byte 0x2c1b55 | text/event-stream | Streaming response detection. |
| BedrockStreamDetector | cli.js | line ~404, byte 0x2c1b87 | vnd.amazon.eventstream | Bedrock event-stream detection. |
| TokenCountHelper | cli.js | line ~4966, byte 0x927f0d | source:"count_tokens", beta.messages.create | Token-count helper request. |
| ApiUsageTelemetry | cli.js | line ~2027, byte 0x52af3a | api_request, input_tokens, output_tokens, cache_read_tokens, cost_usd | API request telemetry/accounting. |
| SessionUsageAccumulator | cli.js | line ~11, byte 0x9ed2 | totalCostUSD, modelUsage, RV8 | Session-level cost and per-model usage accumulator. |
| HeadlessUsageResult | cli.js | line ~2004, byte 0x516656 | total_cost_usd, usage, modelUsage | Headless result schema includes usage and cost. |
| SdkRetryDelayParser | cli.js | line ~51, byte 0x28cea | retry-after-ms, retry-after, status 429, status >=500 | SDK retry behavior and retry-delay parsing. |
| RuntimeRateLimitClassifier | cli.js | line ~793, byte 0x446445 | status 429, status 529, overloaded_error | Runtime error classification for rate limit and overload. |
| OverloadFallbackTelemetry | cli.js | line ~5543, byte 0x98b8f2 | tengu_api_opus_fallback_triggered, api_request_retry_exhausted | Retry loop and overload fallback behavior. |
| UnifiedRateLimitHeaders | cli.js | line ~793, byte 0x4406aa | anthropic-ratelimit-unified-* headers | Unified rate-limit/quota header parsing. |
| RateLimitEventFrame | cli.js | line ~19349, byte 0xda5683 | rate_limit_event | Rate-limit state changes are emitted to headless/SDK streams. |
| MaxBudgetFlag | cli.js | line ~19525, byte 0xdc06b1 | --max-budget-usd <amount> | Headless API-spend budget flag. |
| MaxBudgetErrorResult | cli.js | line ~19323, byte 0xda0191 | error_max_budget_usd | Headless result when the dollar budget is exceeded. |
| UsageLimitMessage | cli.js | line ~793, byte 0x43f2bb | usage limit, extra usage spending limit | User-visible limit/overage messages. |
| BillingUpgradeGuidance | cli.js | line ~3118, byte 0x7d57c1 | hasBillingAccess, /extra-usage, /upgrade | Billing/overage guidance in rate-limit UI. |
| ApiUsageBillingStatus | cli.js | line ~6634, byte 0xa9074a | API Usage Billing | Status-line billing type for API-key/console-style usage. |

Model selection precedence

Model selection is a layered resolver, not one static constant.

flowchart TD
CLI[--model / -m] --> Startup[hgK startup resolver]
AgentFrontmatter[agent frontmatter model] --> Startup
Env[ANTHROPIC_MODEL] --> Startup
Settings[settings model] --> Startup
Default[default main loop model] --> Startup
Startup --> State[mainLoopModelOverride + initialMainLoopModel]
State --> Turn[nG per-turn resolver]
Permission[permission mode / plan mode] --> Turn
Context[context size, e.g. >200k] --> Turn
Turn --> Request[Provider request]

The root startup path calls hgK(...), then stores two pieces of state:

| State | Meaning |
| --- | --- |
| effectiveModel / mainLoopModelOverride | The override currently applied to the loop. |
| initialMainLoopModel | The model originally selected by startup/env/settings. |

The visible precedence is:

  1. CLI --model, including default as a special alias for the default concrete model.
  2. Agent frontmatter model when present and not inherit.
  3. ANTHROPIC_MODEL.
  4. Settings model.
  5. Default main-loop model resolver.
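
The ordering above can be sketched as a small first-match resolver. This is an illustration only: `resolveStartupModel`, its argument shape, and the `defaultModel` parameter are hypothetical names chosen to mirror the list, not the minified `hgK(...)` implementation.

```javascript
// Hypothetical sketch of the startup precedence; names are illustrative and
// do not come from the bundle.
function resolveStartupModel({ cli, agentFrontmatter, env, settings }, defaultModel) {
  // 1. CLI --model wins; "default" is a special alias for the default model.
  if (cli && cli.model) {
    return cli.model === "default" ? defaultModel : cli.model;
  }
  // 2. Agent frontmatter model, unless it is missing or set to "inherit".
  if (agentFrontmatter && agentFrontmatter.model && agentFrontmatter.model !== "inherit") {
    return agentFrontmatter.model;
  }
  // 3. ANTHROPIC_MODEL environment variable.
  if (env && env.ANTHROPIC_MODEL) return env.ANTHROPIC_MODEL;
  // 4. Settings model.
  if (settings && settings.model) return settings.model;
  // 5. Default main-loop model.
  return defaultModel;
}
```

With this shape, an `inherit` frontmatter value simply falls through to the environment, settings, and default layers.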

Resume can also restore the model: Sa5(...) scans prior assistant messages and IG(...) reapplies a compatible restored model if no stronger override is active.

Logical model roles

There is no fixed “number of concrete models” baked into the CLI. Concrete IDs depend on provider, feature flags, aliases, environment variables, settings, and account capabilities. The source does show a fixed set of logical model roles:

| Role | Resolver / setting | Purpose |
| --- | --- | --- |
| Main loop model | R7(), lJ(), --model, ANTHROPIC_MODEL, settings | Normal assistant turns. Defaults to the default main-loop model, commonly Sonnet unless account/provider logic chooses otherwise. |
| Default Sonnet | iv(), ANTHROPIC_DEFAULT_SONNET_MODEL | Everyday/default work; also used by memory relevance/fact extraction helpers. |
| Default Opus / best | nv(), alias opus, alias best, opusplan in plan mode | More capable/plan-mode work and “best” alias. |
| Default Haiku / small-fast | SxH(), LL(), ANTHROPIC_SMALL_FAST_MODEL, ANTHROPIC_DEFAULT_HAIKU_MODEL | Lightweight helper requests such as quota probing, some web/search/count/token/helper paths, and small-fast mode when available. |
| Auto-mode classifier | tengu_auto_mode_config.model else R7(), twoStageClassifier | Classifies tool/action safety for auto mode with querySource:"auto_mode". |
| Memory helper | iv() | Selects relevant memories and extracts facts using JSON-schema outputs. |
| Advisor tool model | advisorModel | Server-side advisor tool model override. |
| Subagent model | CLAUDE_CODE_SUBAGENT_MODEL, agent model/frontmatter, or inherit | Lets subagents use an explicit model or inherit from the main loop. |
| Fallback model | --fallback-model / ygK | Print/headless overload fallback when the primary model repeatedly returns overload. |

The answer to “how many models” is therefore that the CLI uses multiple logical model roles and does not hard-code one universal count of concrete models. In a normal local session, the main loop may use one model, while helper calls can use Sonnet or small-fast/Haiku, auto-mode can make classifier calls, and subagents, the advisor tool, and the fallback path can introduce additional models.

Alias and dynamic mapping

The alias resolver maps user-facing names to current concrete IDs:

| Alias | Source-confirmed behavior |
| --- | --- |
| sonnet | Resolves through iv(). |
| haiku | Resolves through SxH(). |
| opus | Resolves through nv(). |
| best | Resolves through Itq(), which currently points at the Opus resolver. |
| opusplan | Resolves to Sonnet normally but can switch to Opus in plan mode through nG(...). |
| default | Treated as the current default concrete model in CLI/fallback handling. |

Because aliases are resolved at runtime, docs should prefer “Sonnet/Opus/Haiku resolver” unless a concrete build-specific model ID is the point of the discussion.
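
A minimal sketch of runtime alias resolution follows. The `resolvers` object stands in for `iv()`, `nv()`, and `SxH()`, and the returned IDs are made-up placeholders. Passing unknown strings through unchanged and defaulting `default` to the Sonnet resolver are assumptions made for illustration. The context-size input of the per-turn resolver is not modeled.

```javascript
// Placeholder resolvers standing in for iv(), nv(), and SxH(). The IDs are
// invented for illustration; real IDs depend on build, provider, and account.
const resolvers = {
  sonnet: () => "example-sonnet-id",
  opus: () => "example-opus-id",
  haiku: () => "example-haiku-id",
};

function resolveAlias(alias, { permissionMode, defaultModel } = {}) {
  switch (alias) {
    case "sonnet":
      return resolvers.sonnet();
    case "haiku":
      return resolvers.haiku();
    case "opus":
    case "best": // "best" currently points at the Opus resolver
      return resolvers.opus();
    case "opusplan":
      // Sonnet normally, Opus while the permission mode is "plan".
      return permissionMode === "plan" ? resolvers.opus() : resolvers.sonnet();
    case "default":
      // Assumption: fall back to the Sonnet resolver when no default is given.
      return defaultModel || resolvers.sonnet();
    default:
      // Assumption: anything else is treated as a concrete model ID.
      return alias;
  }
}
```

Because `opusplan` is resolved per call, the same session setting can yield different concrete models on consecutive turns as the permission mode changes.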

Provider call path

Provider calls share a common shape even when the backend differs.

sequenceDiagram
participant ContextLoop as Context/model loop
participant Client as Provider client Su(...)
participant Fetch as fetch wrapper Uv1
participant Provider as Anthropic/Bedrock/Vertex/etc.
participant Accounting as usage/cost state
ContextLoop->>Client: model, messages, system, tools, thinking, betas, metadata
Client->>Fetch: beta.messages.create(...)
Fetch->>Fetch: add x-client-request-id, log [API REQUEST]
Fetch->>Provider: HTTP(S) request
Provider-->>Fetch: text/event-stream or provider event stream
Fetch-->>ContextLoop: streaming deltas / final response
ContextLoop->>Accounting: input/output/cache tokens, duration, cost, request id

Confirmed request ingredients include:

| Request ingredient | Source evidence |
| --- | --- |
| Model | model:<resolver result> in main/helper requests. |
| Messages/system | Main loop and helper calls pass messages, system, and sometimes skipSystemPromptPrefix. |
| Tools/tool choice | Count-token/helper and web-search paths can include tool schemas or tool choice. |
| Thinking/effort | --thinking, --thinking-display, --max-thinking-tokens, effort settings. |
| Betas | Ru(model) and TP(...) add model/provider beta headers. |
| Metadata | metadata:C3H() appears in helper/provider calls. |
| Extra body params | $9H() contributes additional API body settings. |

The fetch wrapper logs [API REQUEST] <path> x-client-request-id=<id> source=<source> and detects streaming content types. For first-party/AWS-like first-party paths it injects x-client-request-id; for Bedrock it also recognizes vnd.amazon.eventstream.
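
The wrapper's three visible jobs (ID injection, request logging, stream detection) can be sketched as pure helpers. The function names are hypothetical, and preserving a caller-supplied ID is an assumption. Only the header name, the log format, and the two content-type strings come from the source.

```javascript
// Hypothetical helpers; the bundled wrapper is minified and differs in detail.
function withClientRequestId(init, requestId) {
  const headers = { ...(init.headers || {}) };
  // Assumption: an ID already supplied by the caller is left untouched.
  if (!headers["x-client-request-id"]) {
    headers["x-client-request-id"] = requestId;
  }
  return { ...init, headers };
}

function formatRequestLog(path, requestId, source) {
  return `[API REQUEST] ${path} x-client-request-id=${requestId} source=${source}`;
}

function detectStreamKind(contentType) {
  const value = (contentType || "").toLowerCase();
  if (value.includes("text/event-stream")) return "sse";
  if (value.includes("vnd.amazon.eventstream")) return "bedrock-eventstream";
  return "none";
}
```

Substring matching rather than equality keeps detection working when the content type carries parameters such as a charset.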

Streaming, retries, and errors

Streaming

The runtime uses provider streaming, with source-confirmed surfaces for:

  • text/event-stream for ordinary streaming responses;
  • vnd.amazon.eventstream for Bedrock event streams;
  • stream deltas that carry input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens, and context_management.

Retry behavior

There are two visible retry layers:

| Layer | Behavior |
| --- | --- |
| SDK/client retry | Parses retry-after-ms and retry-after; retries status 408, 409, 429, and >=500 according to max-retry policy. |
| Claude Code loop retry | Classifies provider/API errors, retries selected retryable failures, handles auth refresh paths, and can switch to fallback model on repeated overload. |
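
The SDK-layer rules can be sketched as two small functions. This is a simplified illustration, assuming headers arrive as a plain object with lowercase keys. The helper names are hypothetical, and any backoff used when no header is present is out of scope.

```javascript
// Returns a finite number, or null for missing/blank/non-numeric values.
function toFiniteNumber(value) {
  if (value === undefined || value === null || String(value).trim() === "") return null;
  const n = Number(value);
  return Number.isFinite(n) ? n : null;
}

// Returns the server-requested delay in milliseconds, or null if none is given.
function parseRetryDelayMs(headers, now = Date.now()) {
  // retry-after-ms is checked first because it is the more precise hint.
  const ms = toFiniteNumber(headers["retry-after-ms"]);
  if (ms !== null) return ms;

  const retryAfter = headers["retry-after"];
  const seconds = toFiniteNumber(retryAfter);
  if (seconds !== null) return seconds * 1000;

  // Retry-After may also be an HTTP date.
  const date = Date.parse(retryAfter);
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}

function isRetryableStatus(status) {
  return status === 408 || status === 409 || status === 429 || status >= 500;
}
```

Status 529 is retryable under the `>=500` rule, which is why overload handling also appears in the loop-level classification.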

The runtime classifies:

| Condition | Classification / behavior |
| --- | --- |
| HTTP 429 | Rate limit. |
| HTTP 529 or "type":"overloaded_error" | Server overload; can trigger fallback logic. |
| HTTP 413 with context-window wording | Prompt/context too long; UI directs the user toward /compact or reducing context. |
| Repeated overload with --fallback-model | Emits tengu_api_opus_fallback_triggered and raises a fallback-model transition. |
| Retry exhaustion | Emits api_request_retry_exhausted and throws a wrapped execution error. |
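
The classification rows above can be approximated as follows. The category labels, function names, and the `threshold` parameter are hypothetical. The source does not state how many consecutive overloads trigger the fallback, so that count is left to the caller. The 413 check is simplified and ignores the context-window wording test.

```javascript
// Hypothetical classifier mirroring the table; labels are illustrative.
function classifyApiError({ status, body = "" }) {
  const text = typeof body === "string" ? body : JSON.stringify(body);
  if (status === 429) return "rate_limit";
  if (status === 529 || text.includes('"type":"overloaded_error"')) return "overloaded";
  if (status === 413) return "prompt_too_long"; // simplified: wording not checked
  return "other";
}

// Picks the model for the next attempt after an overload.
// `threshold` is an assumed parameter, not a value taken from the bundle.
function nextModelAfterOverload({ model, fallbackModel, consecutiveOverloads, threshold }) {
  if (fallbackModel && consecutiveOverloads >= threshold) return fallbackModel;
  return model;
}
```

Without a configured fallback model, `nextModelAfterOverload` keeps returning the primary model, which matches the fallback being opt-in through `--fallback-model`.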

Usage and cost accounting

cli.js maintains session-level usage state in the global runtime envelope:

| State | Meaning |
| --- | --- |
| totalCostUSD | Accumulated API cost estimate for the current run/session envelope. |
| modelUsage | Per-model token/cost usage map. |
| totalAPIDuration / totalAPIDurationWithoutRetries | Total provider time with and without retry time. |
| hasUnknownModelCost | Set when the runtime cannot price a model. |

After a successful API call, telemetry includes:

  • input_tokens
  • output_tokens
  • cache_read_tokens
  • cache_creation_tokens
  • cost_usd
  • cost_usd_micros
  • duration_ms
  • request_id
  • model speed (fast / normal)
  • query source
  • effort level when present

Headless result frames include total_cost_usd, usage, and modelUsage, so SDK/print-mode consumers can account for the entire run rather than only the final message.
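
A simplified accumulator shows how the session state and the per-model map relate. The pricing table, its placeholder numbers, and the entry field names are all assumptions, and cache tokens are left out to keep the sketch short. Only `totalCostUSD`, `modelUsage`, and `hasUnknownModelCost` are names from the source.

```javascript
// Hypothetical pricing table in USD per million tokens. The numbers are
// placeholders for illustration, not real prices.
const PRICING = { "example-model": { input: 1, output: 2 } };

function createUsageState() {
  return { totalCostUSD: 0, modelUsage: {}, hasUnknownModelCost: false };
}

function recordUsage(state, model, usage) {
  const inputTokens = usage.input_tokens || 0;
  const outputTokens = usage.output_tokens || 0;

  // Token counts are always tracked, even when the model cannot be priced.
  const entry = state.modelUsage[model] ||
    (state.modelUsage[model] = { inputTokens: 0, outputTokens: 0, costUSD: 0 });
  entry.inputTokens += inputTokens;
  entry.outputTokens += outputTokens;

  const price = PRICING[model];
  if (!price) {
    state.hasUnknownModelCost = true;
    return state;
  }
  const cost = (inputTokens * price.input + outputTokens * price.output) / 1e6;
  entry.costUSD += cost;
  state.totalCostUSD += cost;
  return state;
}
```

Once `hasUnknownModelCost` is set, `totalCostUSD` should be read as a lower bound, since the unpriced calls contributed tokens but no dollars.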

Budget guards

The root flag --max-budget-usd <amount> is a print/headless budget guard. The headless loop checks vW()>=maxBudgetUsd after events and emits a final result with subtype error_max_budget_usd when exceeded.

The emitted result contains:

  • elapsed duration;
  • API duration;
  • turn count;
  • total_cost_usd;
  • usage;
  • modelUsage;
  • permission denials;
  • a user-readable error such as Reached maximum budget ($<amount>).

This is local run-budget enforcement. It is separate from server-side account quota/rate limits.
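
The guard amounts to a running-total check after each event. In the sketch below, the event shape, `num_turns`, `error`, and the `success` subtype are illustrative assumptions. `error_max_budget_usd`, `total_cost_usd`, and the message wording are taken from the source.

```javascript
// Hypothetical headless loop: each event carries the cost it added.
function runWithBudget(events, maxBudgetUsd) {
  let totalCostUsd = 0;
  let numTurns = 0;

  for (const event of events) {
    totalCostUsd += event.costUsd || 0;
    if (event.type === "assistant") numTurns += 1;

    // The check runs after the event, so the run can overshoot the budget
    // by at most the cost of the event that crossed it.
    if (maxBudgetUsd !== undefined && totalCostUsd >= maxBudgetUsd) {
      return {
        type: "result",
        subtype: "error_max_budget_usd",
        num_turns: numTurns,
        total_cost_usd: totalCostUsd,
        error: `Reached maximum budget ($${maxBudgetUsd})`,
      };
    }
  }
  return { type: "result", subtype: "success", num_turns: numTurns, total_cost_usd: totalCostUsd };
}
```

Because the comparison is `>=`, a run that lands exactly on the budget stops as well.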

Quota, rate limit, and billing surfaces

Quota probing

The function anchored by source:"quota_check" creates a client with maxRetries:0, selects LL() as the helper model, and sends a one-token messages.create request whose only user message has the content "quota". This is a low-cost probe designed to surface quota/rate-limit headers rather than to generate meaningful text.
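
The probe's ingredients can be laid out as plain data. `buildQuotaProbe` and the wrapper object are hypothetical. How the bundle attaches the `source` tag to the call is not shown here, so it is kept as a separate field rather than placed in the request body.

```javascript
// Hypothetical builder that collects the probe's source-confirmed settings.
function buildQuotaProbe(helperModel) {
  return {
    // No retries: a rejected probe is itself the signal being looked for.
    clientOptions: { maxRetries: 0 },
    request: {
      model: helperModel,
      max_tokens: 1, // one output token keeps the probe cheap
      messages: [{ role: "user", content: "quota" }],
    },
    source: "quota_check",
  };
}
```

The response text is irrelevant to the caller; the value of the call lies in the response headers and status.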

Unified rate-limit headers

The runtime parses Anthropic unified rate-limit headers such as:

| Header family | Meaning |
| --- | --- |
| anthropic-ratelimit-unified-representative-claim | Which limit bucket is currently representative. |
| anthropic-ratelimit-unified-reset | Reset timestamp for the active limit. |
| anthropic-ratelimit-unified-overage-status | Whether extra usage/overage is allowed, warning, or rejected. |
| anthropic-ratelimit-unified-overage-reset | Reset timestamp for overage status. |
| anthropic-ratelimit-unified-overage-disabled-reason | Admin/seat/group reason why extra usage is unavailable. |
| anthropic-ratelimit-unified-5h-utilization / ...-5h-reset | Five-hour/session window utilization and reset. |
| anthropic-ratelimit-unified-7d-utilization / ...-7d-reset | Seven-day/weekly window utilization and reset. |
| anthropic-ratelimit-unified-overage-utilization / ...-overage-reset | Extra-usage utilization and reset. |

Parsed state is stored as the current rate-limit state and projected into headless streams as rate_limit_event frames.
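
A minimal sketch of header collection and change detection is shown below. It keeps header values as raw strings because their exact formats are not documented on this page. The function names and the plain-object header shape are assumptions.

```javascript
const UNIFIED_PREFIX = "anthropic-ratelimit-unified-";

// Collects every unified rate-limit header, keyed by its suffix
// (for example "reset" or "5h-utilization"). Values stay as raw strings.
function parseUnifiedRateLimitHeaders(headers) {
  const state = {};
  for (const [name, value] of Object.entries(headers)) {
    const key = name.toLowerCase();
    if (key.startsWith(UNIFIED_PREFIX)) {
      state[key.slice(UNIFIED_PREFIX.length)] = value;
    }
  }
  return state;
}

// True when any field differs, which is when an event would be worth emitting.
function rateLimitStateChanged(previous, next) {
  const keys = new Set([...Object.keys(previous), ...Object.keys(next)]);
  for (const key of keys) {
    if (previous[key] !== next[key]) return true;
  }
  return false;
}
```

Comparing the new state with the stored one before emitting avoids flooding a headless stream with identical frames on every response.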

User-visible limit and billing messages

The UI distinguishes several user-facing cases:

| Surface | Meaning |
| --- | --- |
| five_hour | “session limit” / five-hour style limit. |
| seven_day | weekly limit. |
| seven_day_opus | Opus-specific limit. |
| seven_day_sonnet | Sonnet-specific limit. |
| overage | usage or extra-usage spending limit. |
| /extra-usage | Suggested when extra usage can be requested/enabled. |
| /upgrade | Suggested for Pro/Max-style upgrade paths when applicable. |
| hasBillingAccess | Gates whether the user can manage billing/extra usage. |
| API Usage Billing | Status-line billing type for API/console billing mode. |

This confirms that billing/quota handling goes beyond surfacing a raw API error. The CLI parses quota headers, maintains local limit state, emits SDK/headless events, and renders plan/billing-specific guidance.

Relationship between usage, quota, and billing

| Concern | Owner | Source-confirmed mechanism |
| --- | --- | --- |
| Per-request usage | Provider response + runtime accounting | Token/cache/cost fields collected after API calls. |
| Per-run budget | Local headless loop | --max-budget-usd and error_max_budget_usd. |
| Account quota/rate limits | Provider/server headers | anthropic-ratelimit-unified-* parsing and rate_limit_event. |
| Billing/overage UI | Account state + server headers + OAuth account role | /extra-usage, /upgrade, billing-access checks, API Usage Billing. |

Caveats

  • Concrete model names and aliases are build/account/provider dependent. The logical roles above are safer anchors than one hard-coded model count.
  • Some rate_limit_error and SDK examples in the bundle are embedded documentation strings. This page treats them as evidence only when connected to runtime classification, request wrapping, header parsing, or result schemas.
  • Cost is an estimate derived from known model pricing tables and response usage. hasUnknownModelCost exists because not every model can be priced by the local table.
  • --fallback-model is documented by the CLI as print-mode-only. Interactive model changes use /model, Remote Control set_model, or session state transitions rather than the fallback flag.

Created and maintained by Yingting Huang.