Model selection, calls, usage, quota, and billing

This page documents how cli.js selects models dynamically, how many logical model roles are visible, how provider calls are made, and how rate limits, errors, usage, quota, and billing are surfaced.

Scope: model aliases and precedence, main/helper/subagent/advisor/fallback model roles, Messages API request construction, streaming and retry behavior, rate-limit headers/events, token/cost accounting, headless budget guards, quota checks, and billing/extra-usage UI surfaces.

Source anchors

Semantic alias	Source	Approximate location	String or symbol	Meaning
DefaultModelResolvers	`cli.js`	line ~253, byte `0x20eb5f`	`getDefaultSonnetModel`, `getDefaultOpusModel`, `getDefaultHaikuModel`, `getDefaultMainLoopModel`	Resolver exports for the model family defaults.
SmallFastModelOverride	`cli.js`	line ~253, byte `0x20eb73`	`ANTHROPIC_SMALL_FAST_MODEL`	Small/fast helper model override.
MainModelEnvOverride	`cli.js`	line ~253, byte `0x20ed29`	`ANTHROPIC_MODEL`	Environment-level main model override.
PerTurnModelResolver	`cli.js`	line ~253, byte `0x20ef88`	`nG({permissionMode,mainLoopModel,exceeds200kTokens})`	Per-turn model resolver; plan mode can alter the selected model.
ModelAliasResolver	`cli.js`	line ~253, byte `0x20fca5`	`case "opusplan"`, `case "sonnet"`, `case "haiku"`, `case "opus"`, `case "best"`	Alias-to-concrete-model mapping.
StartupModelPrecedence	`cli.js`	line ~791, byte `0x43e8ae`	`hgK({cli,env,settings,agentFrontmatter})`	Startup model precedence across CLI, env, settings, and agent frontmatter.
FallbackModelResolver	`cli.js`	line ~791, byte `0x43e853`	`ygK({cli:{fallbackModel}})`	Fallback-model resolver.
StartupModelState	`cli.js`	line ~19293, byte `0xd90439`	`startup_resolve_model`	Root startup path stores effective and initial model state.
ModelSelectionFlag	`cli.js`	line ~19525, byte `0xdc18ed`	`--model <model>`	Root model-selection flag.
FallbackModelFlag	`cli.js`	line ~19525, byte `0xdc1b5d`	`--fallback-model <model>`	Print-mode overload fallback flag.
AdvisorModelSetting	`cli.js`	line ~185, byte `0x11ca03`	`advisorModel`	Settings surface for the server-side advisor tool model.
SubagentModelOverride	`cli.js`	line ~2844, byte `0x7a7606`	`CLAUDE_CODE_SUBAGENT_MODEL`	Subagent model override.
AutoModeClassifierConfig	`cli.js`	line ~3091, byte `0x7bb35c`	`tengu_auto_mode_config`, `twoStageClassifier`	Auto-mode classifier model/config selection.
AutoModeRequestSource	`cli.js`	line ~3091, byte `0x7b8fa0`	`querySource:"auto_mode"`	Auto-mode classifier provider request source.
MemoryHelperModel	`cli.js`	line ~1975, byte `0x50d55b`	`Select memories relevant to:`, `model:iv()`	Memory relevance helper uses the Sonnet resolver.
QuotaProbeRequest	`cli.js`	line ~793, byte `0x440bcc`	`source:"quota_check"`, `max_tokens:1`, `messages:[{..."quota"}]`	Quota probe sends a tiny helper request.
ProviderRequestWrapper	`cli.js`	line ~404, byte `0x2c1a5c`	`[API REQUEST]`, `x-client-request-id`	Fetch wrapper logs requests and injects a client request ID.
SseStreamDetector	`cli.js`	line ~404, byte `0x2c1b55`	`text/event-stream`	Streaming response detection.
BedrockStreamDetector	`cli.js`	line ~404, byte `0x2c1b87`	`vnd.amazon.eventstream`	Bedrock event-stream detection.
TokenCountHelper	`cli.js`	line ~4966, byte `0x927f0d`	`source:"count_tokens"`, `beta.messages.create`	Token-count helper request.
ApiUsageTelemetry	`cli.js`	line ~2027, byte `0x52af3a`	`api_request`, `input_tokens`, `output_tokens`, `cache_read_tokens`, `cost_usd`	API request telemetry/accounting.
SessionUsageAccumulator	`cli.js`	line ~11, byte `0x9ed2`	`totalCostUSD`, `modelUsage`, `RV8`	Session-level cost and per-model usage accumulator.
HeadlessUsageResult	`cli.js`	line ~2004, byte `0x516656`	`total_cost_usd`, `usage`, `modelUsage`	Headless result schema includes usage and cost.
SdkRetryDelayParser	`cli.js`	line ~51, byte `0x28cea`	`retry-after-ms`, `retry-after`, status `429`, status `>=500`	SDK retry behavior and retry-delay parsing.
RuntimeRateLimitClassifier	`cli.js`	line ~793, byte `0x446445`	status `429`, status `529`, `overloaded_error`	Runtime error classification for rate limit and overload.
OverloadFallbackTelemetry	`cli.js`	line ~5543, byte `0x98b8f2`	`tengu_api_opus_fallback_triggered`, `api_request_retry_exhausted`	Retry loop and overload fallback behavior.
UnifiedRateLimitHeaders	`cli.js`	line ~793, byte `0x4406aa`	`anthropic-ratelimit-unified-*` headers	Unified rate-limit/quota header parsing.
RateLimitEventFrame	`cli.js`	line ~19349, byte `0xda5683`	`rate_limit_event`	Rate-limit state changes are emitted to headless/SDK streams.
MaxBudgetFlag	`cli.js`	line ~19525, byte `0xdc06b1`	`--max-budget-usd <amount>`	Headless API-spend budget flag.
MaxBudgetErrorResult	`cli.js`	line ~19323, byte `0xda0191`	`error_max_budget_usd`	Headless result when the dollar budget is exceeded.
UsageLimitMessage	`cli.js`	line ~793, byte `0x43f2bb`	`usage limit`, `extra usage spending limit`	User-visible limit/overage messages.
BillingUpgradeGuidance	`cli.js`	line ~3118, byte `0x7d57c1`	`hasBillingAccess`, `/extra-usage`, `/upgrade`	Billing/overage guidance in rate-limit UI.
ApiUsageBillingStatus	`cli.js`	line ~6634, byte `0xa9074a`	`API Usage Billing`	Status-line billing type for API-key/console-style usage.

Model selection precedence

Model selection is a layered resolver, not one static constant.

flowchart TD
    CLI[--model / -m] --> Startup[hgK startup resolver]
    AgentFrontmatter[agent frontmatter model] --> Startup
    Env[ANTHROPIC_MODEL] --> Startup
    Settings[settings model] --> Startup
    Default[default main loop model] --> Startup
    Startup --> State[mainLoopModelOverride + initialMainLoopModel]
    State --> Turn[nG per-turn resolver]
    Permission[permission mode / plan mode] --> Turn
    Context[context size, e.g. >200k] --> Turn
    Turn --> Request[Provider request]

The root startup path calls hgK(...), then stores two pieces of state:

State	Meaning
`effectiveModel` / `mainLoopModelOverride`	The override currently applied to the loop.
`initialMainLoopModel`	The model originally selected by startup/env/settings.

The visible precedence is:

CLI --model, including default as a special alias for the default concrete model.
Agent frontmatter model when present and not inherit.
ANTHROPIC_MODEL.
Settings model.
Default main-loop model resolver.

Resume can also restore the model: Sa5(...) scans prior assistant messages and IG(...) reapplies a compatible restored model if no stronger override is active.

Logical model roles

There is no fixed “number of concrete models” baked into the CLI. Concrete IDs depend on provider, feature flags, aliases, environment variables, settings, and account capabilities. The source does show a fixed set of logical model roles:

Role	Resolver / setting	Purpose
Main loop model	`R7()`, `lJ()`, `--model`, `ANTHROPIC_MODEL`, settings	Normal assistant turns. Defaults to the default main-loop model, commonly Sonnet unless account/provider logic chooses otherwise.
Default Sonnet	`iv()`, `ANTHROPIC_DEFAULT_SONNET_MODEL`	Everyday/default work; also used by memory relevance/fact extraction helpers.
Default Opus / best	`nv()`, alias `opus`, alias `best`, `opusplan` in plan mode	More capable/plan-mode work and “best” alias.
Default Haiku / small-fast	`SxH()`, `LL()`, `ANTHROPIC_SMALL_FAST_MODEL`, `ANTHROPIC_DEFAULT_HAIKU_MODEL`	Lightweight helper requests such as quota probing, some web/search/count/token/helper paths, and small-fast mode when available.
Auto-mode classifier	`tengu_auto_mode_config.model` else `R7()`, `twoStageClassifier`	Classifies tool/action safety for auto mode with `querySource:"auto_mode"`.
Memory helper	`iv()`	Selects relevant memories and extracts facts using JSON-schema outputs.
Advisor tool model	`advisorModel`	Server-side advisor tool model override.
Subagent model	`CLAUDE_CODE_SUBAGENT_MODEL`, agent model/frontmatter, or inherit	Lets subagents use an explicit model or inherit from the main loop.
Fallback model	`--fallback-model` / `ygK`	Print/headless overload fallback when the primary model repeatedly returns overload.

The important answer to “how many models” is therefore: the CLI uses multiple logical model roles; it does not hard-code one universal count of concrete models. In a normal local session, the main loop may use one model, while helper calls can use Sonnet or small-fast/Haiku, auto-mode can make classifier calls, and subagents/advisor/fallback can introduce additional models.

Alias and dynamic mapping

The alias resolver maps user-facing names to current concrete IDs:

Alias	Source-confirmed behavior
`sonnet`	Resolves through `iv()`.
`haiku`	Resolves through `SxH()`.
`opus`	Resolves through `nv()`.
`best`	Resolves through `Itq()`, which currently points at the Opus resolver.
`opusplan`	Resolves to Sonnet normally but can switch to Opus in plan mode through `nG(...)`.
`default`	Treated as the current default concrete model in CLI/fallback handling.

Because aliases are resolved at runtime, docs should prefer “Sonnet/Opus/Haiku resolver” unless a concrete build-specific model ID is the point of the discussion.

Provider call path

Provider calls share a common shape even when the backend differs.

sequenceDiagram
    participant ContextLoop as Context/model loop
    participant Client as Provider client Su(...)
    participant Fetch as fetch wrapper Uv1
    participant Provider as Anthropic/Bedrock/Vertex/etc.
    participant Accounting as usage/cost state

    ContextLoop->>Client: model, messages, system, tools, thinking, betas, metadata
    Client->>Fetch: beta.messages.create(...)
    Fetch->>Fetch: add x-client-request-id, log [API REQUEST]
    Fetch->>Provider: HTTP(S) request
    Provider-->>Fetch: text/event-stream or provider event stream
    Fetch-->>ContextLoop: streaming deltas / final response
    ContextLoop->>Accounting: input/output/cache tokens, duration, cost, request id

Confirmed request ingredients include:

Request ingredient	Source evidence
Model	`model:<resolver result>` in main/helper requests.
Messages/system	Main loop and helper calls pass `messages`, `system`, and sometimes `skipSystemPromptPrefix`.
Tools/tool choice	Count-token/helper and web-search paths can include tool schemas or tool choice.
Thinking/effort	`--thinking`, `--thinking-display`, `--max-thinking-tokens`, effort settings.
Betas	`Ru(model)` and `TP(...)` add model/provider beta headers.
Metadata	`metadata:C3H()` appears in helper/provider calls.
Extra body params	`$9H()` contributes additional API body settings.

The fetch wrapper logs [API REQUEST] <path> x-client-request-id=<id> source=<source> and detects streaming content types. For first-party/AWS-like first-party paths it injects x-client-request-id; for Bedrock it also recognizes vnd.amazon.eventstream.

Streaming, retries, and errors

Streaming

The runtime uses provider streaming, with source-confirmed surfaces for:

text/event-stream for ordinary streaming responses;
vnd.amazon.eventstream for Bedrock event streams;
stream deltas that carry input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens, and context_management.

Retry behavior

There are two visible retry layers:

Layer	Behavior
SDK/client retry	Parses `retry-after-ms` and `retry-after`; retries status `408`, `409`, `429`, and `>=500` according to max-retry policy.
Claude Code loop retry	Classifies provider/API errors, retries selected retryable failures, handles auth refresh paths, and can switch to fallback model on repeated overload.

The runtime classifies:

Condition	Classification / behavior
HTTP `429`	Rate limit.
HTTP `529` or `"type":"overloaded_error"`	Server overload; can trigger fallback logic.
HTTP `413` with context-window wording	Prompt/context too long; UI directs the user toward `/compact` or reducing context.
Repeated overload with `--fallback-model`	Emits `tengu_api_opus_fallback_triggered` and raises a fallback-model transition.
Retry exhaustion	Emits `api_request_retry_exhausted`/throws a wrapped execution error.

Usage and cost accounting

cli.js maintains session-level usage state in the global runtime envelope:

State	Meaning
`totalCostUSD`	Accumulated API cost estimate for the current run/session envelope.
`modelUsage`	Per-model token/cost usage map.
`totalAPIDuration` / `totalAPIDurationWithoutRetries`	Total provider time with and without retry time.
`hasUnknownModelCost`	Set when the runtime cannot price a model.

After a successful API call, telemetry includes:

input_tokens
output_tokens
cache_read_tokens
cache_creation_tokens
cost_usd
cost_usd_micros
duration_ms
request_id
model speed (fast / normal)
query source
effort level when present

Headless result frames include total_cost_usd, usage, and modelUsage, so SDK/print-mode consumers can account for the entire run rather than only the final message.

Budget guards

The root flag --max-budget-usd <amount> is a print/headless budget guard. The headless loop checks vW()>=maxBudgetUsd after events and emits a final result with subtype error_max_budget_usd when exceeded.

The emitted result contains:

elapsed duration;
API duration;
turn count;
total_cost_usd;
usage;
modelUsage;
permission denials;
a user-readable error such as Reached maximum budget ($<amount>).

This is local run-budget enforcement. It is separate from server-side account quota/rate limits.

Quota, rate limit, and billing surfaces

Quota probing

The function anchored by source:"quota_check" creates a client with maxRetries:0, selects LL() as the helper model, and sends a one-token messages.create request with the user content quota. This is a low-cost probe designed to surface quota/rate-limit headers rather than to generate meaningful text.

Unified rate-limit headers

The runtime parses Anthropic unified rate-limit headers such as:

Header family	Meaning
`anthropic-ratelimit-unified-representative-claim`	Which limit bucket is currently representative.
`anthropic-ratelimit-unified-reset`	Reset timestamp for the active limit.
`anthropic-ratelimit-unified-overage-status`	Whether extra usage/overage is allowed, warning, or rejected.
`anthropic-ratelimit-unified-overage-reset`	Reset timestamp for overage status.
`anthropic-ratelimit-unified-overage-disabled-reason`	Admin/seat/group reason why extra usage is unavailable.
`anthropic-ratelimit-unified-5h-utilization` / `...-5h-reset`	Five-hour/session window utilization and reset.
`anthropic-ratelimit-unified-7d-utilization` / `...-7d-reset`	Seven-day/weekly window utilization and reset.
`anthropic-ratelimit-unified-overage-utilization` / `...-overage-reset`	Extra-usage utilization and reset.

Parsed state is stored as the current rate-limit state and projected into headless streams as rate_limit_event frames.

User-visible limit and billing messages

The UI distinguishes several user-facing cases:

Surface	Meaning
`five_hour`	“session limit” / five-hour style limit.
`seven_day`	weekly limit.
`seven_day_opus`	Opus-specific limit.
`seven_day_sonnet`	Sonnet-specific limit.
`overage`	usage or extra-usage spending limit.
`/extra-usage`	Suggested when extra usage can be requested/enabled.
`/upgrade`	Suggested for Pro/Max-style upgrade paths when applicable.
`hasBillingAccess`	Gates whether the user can manage billing/extra usage.
`API Usage Billing`	Status-line billing type for API/console billing mode.

This confirms that billing/quota handling is not just a raw API error. The CLI parses quota headers, maintains local limit state, emits SDK/headless events, and renders plan/billing-specific guidance.

Relationship between usage, quota, and billing

Concern	Owner	Source-confirmed mechanism
Per-request usage	Provider response + runtime accounting	Token/cache/cost fields collected after API calls.
Per-run budget	Local headless loop	`--max-budget-usd` and `error_max_budget_usd`.
Account quota/rate limits	Provider/server headers	`anthropic-ratelimit-unified-*` parsing and `rate_limit_event`.
Billing/overage UI	Account state + server headers + OAuth account role	`/extra-usage`, `/upgrade`, billing-access checks, `API Usage Billing`.

Caveats

Concrete model names and aliases are build/account/provider dependent. The logical roles above are safer anchors than one hard-coded model count.
Some rate_limit_error and SDK examples in the bundle are embedded documentation strings. This page treats them as evidence only when connected to runtime classification, request wrapping, header parsing, or result schemas.
Cost is an estimate derived from known model pricing tables and response usage. hasUnknownModelCost exists because not every model can be priced by the local table.
--fallback-model is documented by the CLI as print-mode-only. Interactive model changes use /model, Remote Control set_model, or session state transitions rather than the fallback flag.

Created and maintained by Yingting Huang.