> ## Documentation Index
> Fetch the complete documentation index at: https://docs.ollim.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Choose a model provider

> Use your Claude subscription, switch to a cheaper alternative, or bring your own backend.

ollim-bot authenticates through Claude Code OAuth by default — no API key,
no provider config, nothing to set up. Your Claude subscription handles
everything. If you want to experiment with other models or reduce costs,
swap providers with a couple of environment variables in your `.env` file.

| I want to...                                     | Go to                                                        |
| ------------------------------------------------ | ------------------------------------------------------------ |
| Just use my Claude subscription (recommended)    | [Default: Claude subscription](#default-claude-subscription) |
| Use a cheaper subscription from another provider | [Alternative subscriptions](#alternative-subscriptions)      |
| Pay per token instead of a flat subscription     | [Pay-per-token providers](#pay-per-token-providers)          |
| Run models locally for full data sovereignty     | [Self-hosted models](#self-hosted-models)                    |
| Use Bedrock, Vertex AI, or a custom gateway      | [Advanced provider options](#advanced-provider-options)      |
| Pick the right Claude model for my use case      | [Choosing a model](#choosing-a-model)                        |

## Default: Claude subscription

Out of the box, ollim-bot uses your Claude subscription via Claude Code
OAuth. The model you get depends on your subscription tier:

| Subscription | Default model | Opus access               |
| ------------ | ------------- | ------------------------- |
| Pro          | Sonnet 4.6    | Available (with fallback) |
| Max          | Opus 4.6      | Default                   |

Switch models at runtime with the `/model` slash command in Discord:

```bash theme={null}
/model opus
/model sonnet
/model haiku
```

The agent resolves aliases to the latest version automatically — `sonnet`
currently maps to Sonnet 4.6, `opus` to Opus 4.6, `haiku` to Haiku 4.5.

<Note>
  Claude Code may fall back to Sonnet if you hit your Opus usage threshold
  on a subscription plan.
</Note>

## Alternative subscriptions

Don't want a Claude subscription? Several providers offer their own coding
subscriptions with Anthropic Messages API-compatible endpoints. Set two
environment variables and ollim-bot uses their models instead — no code
changes.

| Provider                                                          | Cost        | Models                             | Notes                                                                                 |
| ----------------------------------------------------------------- | ----------- | ---------------------------------- | ------------------------------------------------------------------------------------- |
| [Z.AI](https://z.ai)                                              | \$3–49/mo   | GLM-5, GLM-4.7                     | Free tier (GLM-4.7-Flash). GLM-5 (744B/40B active MoE, MIT license) released Feb 2026 |
| [Qwen](https://alibabacloud.com/help/en/model-studio/coding-plan) | \$10–50/mo  | Qwen3.5, Qwen3-Coder-Next + others | Multi-model subscription. Qwen3.5 supports 1M context and multimodal                  |
| [MiniMax](https://platform.minimax.io)                            | \$10–150/mo | MiniMax M2.5                       | SWE-Bench 80.2%, 100+ tok/s, \$0.30/M input on API                                    |
| [Kimi](https://kimi.com/code)                                     | \~\$7/week  | Kimi K2.5                          | 1T params (32B active MoE), agent swarm up to 100 sub-agents                          |

All of these use the same pattern — `ANTHROPIC_BASE_URL` and
`ANTHROPIC_AUTH_TOKEN` in your `.env` file:

```bash title=".env (Z.AI example)" theme={null}
ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic
ANTHROPIC_AUTH_TOKEN=your-zai-api-key
```

<AccordionGroup>
  <Accordion title="Z.AI GLM setup">
    ```bash title=".env" theme={null}
    ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic
    ANTHROPIC_AUTH_TOKEN=your-zai-api-key
    ANTHROPIC_DEFAULT_SONNET_MODEL=glm-4.7
    ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.5-air
    ```

    GLM-4.7 runs at roughly 5–7x cheaper than Claude Sonnet 4.6. Z.AI
    offers subscription plans (Lite \$3/mo, Pro \$15/mo, Max \~\$60/mo)
    with prompt-based quotas, or pay-per-token. GLM-4.7-Flash and
    GLM-4.5-Flash are free.

    **GLM-5** (released February 2026) is their frontier model — 744B
    total / 40B active MoE, MIT license, \$1.00/M input / \$3.20/M
    output via API. Requires server-grade hardware for self-hosting.
  </Accordion>

  <Accordion title="Qwen / Alibaba Cloud setup">
    ```bash title=".env" theme={null}
    ANTHROPIC_BASE_URL=https://dashscope-intl.aliyuncs.com/apps/anthropic
    ANTHROPIC_AUTH_TOKEN=your-dashscope-api-key
    ANTHROPIC_MODEL=qwen3.5-plus
    ANTHROPIC_SMALL_FAST_MODEL=qwen3.5-coder
    ```

    One subscription (\$10–50/mo) covers six models.
    Switch between Qwen3.5-Plus, Qwen3-Coder, GLM-4.7, Kimi-K2.5, and
    MiniMax M2.5 without changing providers.

    **Qwen3.5** (released February 2026) adds 1M context, native
    multimodal, and 201 languages. **Qwen3-Coder-Next** (80B/3B active
    MoE, Apache 2.0) scores 70.6 on SWE-Bench Verified — purpose-built
    for coding agents.
  </Accordion>

  <Accordion title="MiniMax setup">
    ```bash title=".env" theme={null}
    ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
    ANTHROPIC_AUTH_TOKEN=your-minimax-api-key
    ANTHROPIC_MODEL=minimax-m2.5
    ANTHROPIC_SMALL_FAST_MODEL=minimax-m2.5-lightning
    ```

    Subscription tiers: Starter \$10/mo (100 prompts/5h), Plus \$20/mo
    (300/5h), Max \$50/mo (1,000/5h). 100+ tokens per second throughput.

    **M2.5** (released February 2026) hits 80.2% on SWE-Bench Verified
    and completes tasks 37% faster than M2.1. API pricing:
    \$0.30/M input, \$1.20/M output. Open weights under Modified MIT.
  </Accordion>

  <Accordion title="Kimi / Moonshot AI setup">
    ```bash title=".env" theme={null}
    ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic
    ANTHROPIC_AUTH_TOKEN=your-moonshot-api-key
    ANTHROPIC_MODEL=kimi-k2.5
    ANTHROPIC_SMALL_FAST_MODEL=kimi-k2
    ```

    Weekly membership at \~\$7/week with 300–1,200 API calls per 5-hour
    window. 256K context window and 100 tokens/second output speed.

    **K2.5** (released January 2026) is a 1T parameter / 32B active MoE
    with native multimodal and an agent swarm that runs up to 100
    sub-agents in parallel. Open weights under Modified MIT.
  </Accordion>
</AccordionGroup>

## Pay-per-token providers

If you prefer paying for what you use instead of a flat subscription:

| Provider                                  | Input cost | Output cost | Models        | Notes                                                                       |
| ----------------------------------------- | ---------- | ----------- | ------------- | --------------------------------------------------------------------------- |
| [DeepSeek](https://platform.deepseek.com) | \$0.28/1M  | \$0.42/1M   | DeepSeek V3.2 | Cheapest option. Thinking integrated with tool use                          |
| [OpenRouter](https://openrouter.ai)       | Varies     | Varies      | 400+ models   | Gateway with unified billing. 24 free models, `openrouter/free` auto-router |

<AccordionGroup>
  <Accordion title="DeepSeek setup">
    ```bash title=".env" theme={null}
    ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
    ANTHROPIC_AUTH_TOKEN=your-deepseek-api-key
    ANTHROPIC_MODEL=deepseek-chat
    ANTHROPIC_SMALL_FAST_MODEL=deepseek-chat
    ```

    The cheapest per-token option — roughly 10x cheaper than Claude
    Sonnet 4.6. No subscription required. **V3.2** is the first model
    to integrate thinking directly into tool use, supporting both
    thinking and non-thinking modes. Image input is not supported
    through the Anthropic compatibility endpoint.

    Open weights (MIT license) — distilled 32B variants are available
    for single-GPU self-hosting via Ollama (`ollama pull deepseek-v3.2`).
  </Accordion>

  <Accordion title="OpenRouter setup">
    ```bash title=".env" theme={null}
    ANTHROPIC_BASE_URL=https://openrouter.ai/api
    ANTHROPIC_AUTH_TOKEN=sk-or-v1-your-key
    ANTHROPIC_API_KEY=
    ```

    Gateway to 400+ models with unified billing. 24 models are free
    without a credit card, and `openrouter/free` auto-routes to a
    compatible free model. The empty `ANTHROPIC_API_KEY=` prevents
    Claude Code from authenticating directly with Anthropic. **Only
    Claude models are guaranteed to work** — non-Claude models require
    a [translation proxy](https://github.com/luohy15/y-router).
  </Accordion>
</AccordionGroup>

<Warning>
  Alternative models are community-supported and not tested by Anthropic.
  ollim-bot's agentic loop depends on reliable tool use — test thoroughly
  before relying on a non-Claude model for daily operations. Provider
  endpoints and pricing may change without notice.
</Warning>

## Self-hosted models

Run models locally for full data sovereignty — no tokens leave your
network. As of early 2026, all three major inference backends natively
support the Anthropic Messages API with tool calling.

All self-hosted setups use the same `.env` pattern:

| Variable                     | Purpose                                                                                                                                                                     |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ANTHROPIC_BASE_URL`         | Points ollim-bot at your local inference server instead of Anthropic's API                                                                                                  |
| `ANTHROPIC_AUTH_TOKEN`       | Any non-empty string — local backends require the header but don't validate it. This also tells the Agent SDK to skip Claude OAuth, so no Anthropic account is needed.      |
| `ANTHROPIC_MODEL`            | The model name your backend serves (must match exactly)                                                                                                                     |
| `ANTHROPIC_SMALL_FAST_MODEL` | Model used for lightweight tasks like [subagent](/extending/subagents) work and background [routines](/scheduling/routines). Can be the same model or a smaller/faster one. |

<AccordionGroup>
  <Accordion title="Ollama">
    [Ollama](https://ollama.com) (v0.17+) runs open models locally with
    a native Anthropic-compatible endpoint, tool calling, and streaming.

    <Tabs>
      <Tab title="Install script">
        ```bash theme={null}
        curl -fsSL https://ollama.com/install.sh -o /tmp/ollama-install.sh
        less /tmp/ollama-install.sh   # inspect the script first
        sh /tmp/ollama-install.sh
        ```
      </Tab>

      <Tab title="Docker (recommended)">
        ```bash theme={null}
        docker run -d --gpus all \
          --name ollama \
          -p 127.0.0.1:11434:11434 \
          -v ollama:/root/.ollama \
          ollama/ollama
        ```

        Omit `--gpus all` if you don't have an NVIDIA GPU — but expect
        noticeably slower inference.

        <Tip>
          If port 11434 is already in use (e.g., another Ollama instance),
          map to a different host port: `-p 127.0.0.1:11435:11434`. Update
          `ANTHROPIC_BASE_URL` to `http://localhost:11435` to match.
        </Tip>
      </Tab>
    </Tabs>

    Pull a model **before starting the bot** — the bot reports a model
    error if it starts before the pull finishes:

    ```bash theme={null}
    ollama pull qwen3.5:2b
    curl http://localhost:11434/
    ```

    If you used Docker, prefix commands with `docker exec ollama`:

    ```bash theme={null}
    docker exec ollama ollama pull qwen3.5:2b
    ```

    Then add these variables to your `.env`:

    ```bash title=".env" theme={null}
    ANTHROPIC_BASE_URL=http://localhost:11434
    ANTHROPIC_AUTH_TOKEN=ollama
    ANTHROPIC_MODEL=qwen3.5:2b
    ANTHROPIC_SMALL_FAST_MODEL=qwen3.5:2b
    ```

    <Note>
      Ollama model names use the Ollama registry format (e.g.,
      `qwen3.5:2b`, `qwen3.5:latest`) — not HuggingFace model IDs.
      Browse available models at [ollama.com/search](https://ollama.com/search).
    </Note>

    <Warning>
      Ollama binds to all network interfaces by default. For security,
      bind to localhost only by setting `OLLAMA_HOST=127.0.0.1` in your
      environment, or use `127.0.0.1` in the Docker `-p` flag as shown
      above.
    </Warning>

    v0.17 (February 2026) ships a new inference engine with up to 40%
    faster prompt processing, improved multi-GPU tensor parallelism, and
    better KV cache management for long conversations.

    Tool use works well with larger models — expect tinkering with
    smaller ones. Local inference is still slower than
    cloud providers. Not recommended as a primary backend for a bot
    that needs sub-second response times.

    Once your `.env` is configured, return to
    [step 6 of the quickstart](/getting-started/quickstart#start-the-bot)
    to start the bot.
  </Accordion>

  <Accordion title="vLLM">
    [vLLM](https://docs.vllm.ai/) (v0.16+) exposes a native Anthropic
    `/v1/messages` endpoint with tool calling — the best option for
    production multi-GPU deployments:

    ```bash title=".env" theme={null}
    ANTHROPIC_BASE_URL=http://localhost:8000
    ANTHROPIC_AUTH_TOKEN=vllm
    ANTHROPIC_MODEL=qwen3-coder-next
    ANTHROPIC_SMALL_FAST_MODEL=qwen3-coder
    ```

    v0.16 (February 2026) adds async scheduling with pipeline
    parallelism for \~31% throughput improvement. See the
    [vLLM Claude Code integration docs](https://docs.vllm.ai/en/latest/serving/integrations/claude_code/)
    for full setup.
  </Accordion>

  <Accordion title="llama.cpp server">
    [llama.cpp server](https://github.com/ggml-org/llama.cpp/tree/master/tools/server)
    added Anthropic Messages API support in January 2026 — the most
    lightweight option for single-GPU setups:

    ```bash title=".env" theme={null}
    ANTHROPIC_BASE_URL=http://localhost:8080
    ANTHROPIC_AUTH_TOKEN=llamacpp
    ANTHROPIC_MODEL=qwen3-coder-next
    ANTHROPIC_SMALL_FAST_MODEL=qwen3-coder
    ```

    Supports tools, vision, streaming, and token counting. Up to 35%
    faster with NVFP4/FP8 quantization on NVIDIA GPUs. See the
    [Hugging Face walkthrough](https://huggingface.co/blog/ggml-org/anthropic-messages-api-in-llamacpp)
    for setup details.
  </Accordion>

  <Accordion title="Bifrost proxy">
    If your inference server only speaks OpenAI Chat Completions,
    route through a [Bifrost gateway](https://github.com/maximhq/bifrost)
    — it translates to Anthropic format automatically with sub-millisecond
    overhead, load balancing, and a built-in web UI.

    ```bash theme={null}
    # Start Bifrost (Docker or npx)
    docker run -p 8080:8080 maximhq/bifrost
    # or: npx -y @maximhq/bifrost
    ```

    ```bash title=".env" theme={null}
    ANTHROPIC_BASE_URL=http://localhost:8080/anthropic
    ANTHROPIC_AUTH_TOKEN=bifrost
    ANTHROPIC_MODEL=your-model-name
    ANTHROPIC_SMALL_FAST_MODEL=your-model-name
    ```

    Bifrost is open source (Apache 2.0) and supports OpenAI, Ollama,
    and vLLM backends among others.
  </Accordion>
</AccordionGroup>

### Recommended models for self-hosting

Pick based on your hardware. For reliable tool calling, use high-quality
quantizations (q8 or fp16 — these preserve the precision models need for
structured output like tool calls).

<Warning>
  Most self-hosted models require a dedicated GPU with at least 24GB VRAM.
  Smaller models like Qwen3.5 (2B) can run on consumer GPUs (8GB+) but
  with reduced tool-calling reliability. If you don't have the hardware,
  consider [alternative subscriptions](#alternative-subscriptions) or
  [pay-per-token providers](#pay-per-token-providers) instead.
</Warning>

Models marked "MoE" (Mixture of Experts) only activate a fraction of their
total parameters per request — so an 80B model with 3B active runs on
hardware sized for 3B, not 80B.

| Model                           | Ollama name      | Active params  | Hardware          | Tool use                         | License      |
| ------------------------------- | ---------------- | -------------- | ----------------- | -------------------------------- | ------------ |
| **Qwen3.5**                     | `qwen3.5:2b`     | 2.3B           | 8GB+ VRAM         | Good — tool calling + multimodal | Apache 2.0   |
| **Qwen3-Coder-Next**            | —                | 3B (80B MoE)   | 32GB+ VRAM        | Excellent — SWE-Bench 70.6       | Apache 2.0   |
| **Qwen3.5-35B-A3B**             | `qwen3.5:latest` | 3B (35B MoE)   | 32GB VRAM         | Good — 1M context                | Apache 2.0   |
| **GLM-4.7-Flash**               | —                | 30B            | 24GB+ VRAM        | Good — interleaved reasoning     | Open weights |
| **DeepSeek V3.2 (32B distill)** | `deepseek-v3.2`  | 32B            | 24GB+ VRAM        | Good — thinking + tool use       | MIT          |
| **GLM-5**                       | —                | 40B (744B MoE) | Multi-GPU cluster | Excellent — frontier open-source | MIT          |
| **MiniMax M2.5**                | —                | Large MoE      | Multi-GPU cluster | Excellent — SWE-Bench 80.2%      | Modified MIT |
| **Kimi K2.5**                   | —                | 32B (1T MoE)   | Multi-GPU cluster | Excellent — 1,500 tool calls     | Modified MIT |

<Note>
  The **Ollama name** column shows the tag to use with `ollama pull`. Models
  marked "—" are not yet available on Ollama — use vLLM or llama.cpp with the
  HuggingFace model ID instead. Check [ollama.com/search](https://ollama.com/search)
  for current availability.
</Note>

<Tip>
  Self-hosting works well for data sovereignty and experimentation.
  Tool use in local models improved sharply in early 2026 with native Anthropic endpoints in Ollama and vLLM —
  but cloud providers still win on latency and reliability for a bot
  that needs to respond quickly throughout the day.
</Tip>

## Model version pinning

By default, model aliases (`opus`, `sonnet`, `haiku`) resolve to the latest
version. Pin specific versions with these environment variables:

| Variable                         | Description            |
| -------------------------------- | ---------------------- |
| `ANTHROPIC_DEFAULT_OPUS_MODEL`   | Pin the `opus` alias   |
| `ANTHROPIC_DEFAULT_SONNET_MODEL` | Pin the `sonnet` alias |
| `ANTHROPIC_DEFAULT_HAIKU_MODEL`  | Pin the `haiku` alias  |

These also work with alternative providers — set them to the provider's
model IDs (e.g. `glm-4.7`, `deepseek-chat`, `kimi-k2.5`).

<AccordionGroup>
  <Accordion title="Claude model IDs by provider">
    | Model      | Anthropic API               | Amazon Bedrock                                | Google Vertex AI            |
    | ---------- | --------------------------- | --------------------------------------------- | --------------------------- |
    | Opus 4.6   | `claude-opus-4-6`           | `us.anthropic.claude-opus-4-6-v1`             | `claude-opus-4-6`           |
    | Sonnet 4.6 | `claude-sonnet-4-6`         | `us.anthropic.claude-sonnet-4-6`              | `claude-sonnet-4-6`         |
    | Haiku 4.5  | `claude-haiku-4-5-20251001` | `us.anthropic.claude-haiku-4-5-20251001-v1:0` | `claude-haiku-4-5@20251001` |

    ```bash title=".env (Bedrock example)" theme={null}
    ANTHROPIC_DEFAULT_OPUS_MODEL=us.anthropic.claude-opus-4-6-v1
    ANTHROPIC_DEFAULT_SONNET_MODEL=us.anthropic.claude-sonnet-4-6
    ANTHROPIC_DEFAULT_HAIKU_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0
    ```

    <Warning>
      Pin all three models when using Bedrock or Vertex AI. Without
      pinning, aliases resolve to the latest version — which may not be
      available in your deployment yet.
    </Warning>
  </Accordion>
</AccordionGroup>

## Per-routine model override

Background routines can override the model in their YAML frontmatter:

```yaml title="routines/quick-email-check.md" theme={null}
---
id: "d1e2f3a4"
cron: "0 */3 * * *"
description: "Email check"
background: true
model: "haiku"
---
Check for new important emails. Save a summary to pending updates.
```

The `model` field accepts aliases (`opus`, `sonnet`, `haiku`) and only
applies to background routines. See [Routines](/scheduling/routines)
for all frontmatter fields.

## Choosing a model

For most ollim-bot use, **Sonnet 4.6 handles tool calling, scheduling, and
conversation as well as Opus 4.6** — at 40% of the cost. Opus pulls ahead on
deep reasoning and complex multi-step debugging. Haiku 4.5 is ideal for
lightweight background routines where speed matters more than depth.

| Model      | Best for                                         | Tool calling | Agentic coding       | Deep reasoning | Speed   | Cost   |
| ---------- | ------------------------------------------------ | ------------ | -------------------- | -------------- | ------- | ------ |
| Opus 4.6   | Complex multi-step tasks, novel problem-solving  | Excellent    | 65.4% Terminal-Bench | 91.3% GPQA     | Slowest | \$\$\$ |
| Sonnet 4.6 | Daily conversations, routines, most agentic work | Excellent    | 59.1% Terminal-Bench | 74.1% GPQA     | Fast    | \$\$   |
| Haiku 4.5  | Background routines, email triage, quick checks  | Good         | 41.8% Terminal-Bench | —              | Fastest | \$     |

<Note>
  Haiku has a 200k context window (vs. 1M for Sonnet and Opus). If your main session exceeds 200k tokens, the bot automatically upgrades [interactive forks](/core-usage/forks#haiku-auto-upgrade) to sonnet to avoid failures. This makes haiku best suited for short-lived background routines rather than long interactive sessions.
</Note>

<Tip>
  **Sonnet 4.6 is the sweet spot for ollim-bot.** It matches Opus on
  tau2-bench tool calling (91.7% vs 91.9%), beats it on knowledge work
  tasks (GDPval-AA: 1633 vs 1606 Elo), and is 70% more token-efficient.
  It's the default.
</Tip>

<AccordionGroup>
  <Accordion title="Full agentic benchmark comparison">
    All scores use extended/adaptive thinking unless noted. Benchmarks are
    selected for relevance to agentic tool-calling bots like ollim-bot.

    | Benchmark             | What it measures                     | Opus 4.6  | Sonnet 4.6 | Haiku 4.5 |
    | --------------------- | ------------------------------------ | --------- | ---------- | --------- |
    | SWE-bench Verified    | Real-world software engineering      | 80.8%     | 79.6%      | 73.3%     |
    | Terminal-Bench 2.0    | Agentic CLI coding                   | **65.4%** | 59.1%      | 41.8%     |
    | tau2-bench Retail     | Multi-step tool calling (retail)     | 91.9%     | **91.7%**  | —         |
    | tau2-bench Telecom    | Multi-step tool calling (telecom)    | **99.3%** | 97.9%      | —         |
    | OSWorld               | Agentic computer use                 | **72.7%** | 72.5%      | 22.0%     |
    | MCP Atlas             | Scaled tool use                      | 59.5%     | **61.3%**  | —         |
    | GDPval-AA (Elo)       | Economically valuable knowledge work | 1606      | **1633**   | —         |
    | Finance Agent         | Financial tool use                   | 60.7%     | **63.3%**  | —         |
    | ARC-AGI-2             | Novel problem-solving                | **68.8%** | 58.3%      | —         |
    | GPQA Diamond          | Graduate-level scientific reasoning  | **91.3%** | 74.1%      | —         |
    | Humanity's Last Exam  | Hardest questions (with tools)       | **53.1%** | 19.1%      | —         |
    | BrowseComp            | Web search and information discovery | **84.0%** | —          | —         |
    | MRCR v2 8-needle @ 1M | Long-context retrieval accuracy      | **76.0%** | —          | —         |

    **Key pattern**: Sonnet 4.6 matches or beats Opus on practical tool
    calling (tau2-bench, MCP Atlas, Finance Agent, GDPval-AA). Opus leads
    on deep reasoning (GPQA, ARC-AGI-2, Humanity's Last Exam) and
    long-context retrieval — tasks that matter for complex debugging, not
    typical daily bot interactions.

    **Haiku 4.5** achieves 73.3% on SWE-bench Verified — matching Claude
    Sonnet 4.5 — at one-third the cost and 4-5x the speed. It reaches \~90%
    of Sonnet 4.5's agentic coding performance per Augment's evaluation.

    Sources:
    [Anthropic Opus 4.6](https://www.anthropic.com/news/claude-opus-4-6),
    [Anthropic Sonnet 4.6](https://www.anthropic.com/news/claude-sonnet-4-6),
    [Anthropic Haiku 4.5](https://www.anthropic.com/news/claude-haiku-4-5),
    [Vellum benchmarks](https://www.vellum.ai/blog/claude-opus-4-6-benchmarks),
    [Anthropic model overview](https://platform.claude.com/docs/en/about-claude/models/all-models).
    Scores current as of February 2026.
  </Accordion>
</AccordionGroup>

## Claude pricing

For most users, a Claude subscription costs less than API
pay-as-you-go. The average Claude Code developer uses the equivalent of
\$130/month in API tokens — covered by a \$20 Pro plan.

| Plan    | Cost     | Default model | Opus access               | Rate limits   |
| ------- | -------- | ------------- | ------------------------- | ------------- |
| Pro     | \$20/mo  | Sonnet 4.6    | Available (with fallback) | \~45 msgs/5hr |
| Max 5x  | \$100/mo | Opus 4.6      | Default                   | 5x Pro        |
| Max 20x | \$200/mo | Opus 4.6      | Default                   | 20x Pro       |

<Note>
  On the Pro plan, Claude Code may fall back from Opus to Sonnet when you
  hit a usage threshold. The exact limit is not published. Max plans have
  higher thresholds — Max 20x rarely triggers fallback.
</Note>

<AccordionGroup>
  <Accordion title="API token pricing and breakeven analysis">
    **Per million tokens (standard on-demand):**

    | Model      | Input  | Output  | Cache read (90% off) | Batch (50% off)       |
    | ---------- | ------ | ------- | -------------------- | --------------------- |
    | Opus 4.6   | \$5.00 | \$25.00 | \$0.50 in            | $2.50 in / $12.50 out |
    | Sonnet 4.6 | \$3.00 | \$15.00 | \$0.30 in            | $1.50 in / $7.50 out  |
    | Haiku 4.5  | \$1.00 | \$5.00  | \$0.10 in            | $0.50 in / $2.50 out  |

    Extended thinking tokens are billed at output token rates. Long context
    (>200K input) doubles the input cost and adds 50% to the output cost.

    **Breakeven analysis** (assuming 3:1 input-to-output ratio):

    | Plan    | Monthly cost | Breakeven on Sonnet  | Breakeven on Opus  |
    | ------- | ------------ | -------------------- | ------------------ |
    | Pro     | \$20         | \~3.3M tokens/month  | \~2M tokens/month  |
    | Max 5x  | \$100        | \~16.7M tokens/month | \~10M tokens/month |
    | Max 20x | \$200        | \~33.3M tokens/month | \~20M tokens/month |

    For context, Anthropic reports the average Claude Code developer uses
    ~~\$6/day (~~\$130/month) in API-equivalent costs, and the 90th percentile
    is under \$12/day (\~\$260/month). **Pro at \$20/month covers what would be
    \$130+ on the API** — a subscription is the clear winner for regular use.

    API pay-as-you-go only wins at very low usage (under \~3M tokens/month on
    Sonnet) or when you need guaranteed access without rate limit resets.

    Source: [Anthropic API pricing](https://platform.claude.com/docs/en/about-claude/pricing),
    [Claude Code costs](https://code.claude.com/docs/en/costs).
  </Accordion>
</AccordionGroup>

## Advanced provider options

If you need pay-as-you-go API billing, cloud provider infrastructure, or
a custom LLM gateway, these options are available but require more setup.

<AccordionGroup>
  <Accordion title="Anthropic API key">
    For pay-as-you-go billing instead of a subscription:

    ```bash title=".env" theme={null}
    ANTHROPIC_API_KEY=sk-ant-...
    ```

    This bypasses Claude Code OAuth entirely. You pay per token at
    [Anthropic's API rates](https://docs.anthropic.com/en/docs/about-claude/models/all-models#model-comparison-table).
  </Accordion>

  <Accordion title="Amazon Bedrock">
    Set these environment variables in your `.env` file:

    | Variable                  | Required    | Description                                                 |
    | ------------------------- | ----------- | ----------------------------------------------------------- |
    | `CLAUDE_CODE_USE_BEDROCK` | Yes         | Set to `1` to enable Bedrock                                |
    | `AWS_REGION`              | Yes         | AWS region (e.g. `us-east-1`) — not read from `.aws` config |
    | `AWS_ACCESS_KEY_ID`       | Conditional | AWS access key (one auth method required)                   |
    | `AWS_SECRET_ACCESS_KEY`   | Conditional | AWS secret key                                              |
    | `AWS_SESSION_TOKEN`       | No          | Session token for temporary credentials                     |
    | `AWS_PROFILE`             | Conditional | AWS SSO profile name (alternative to access keys)           |

    ```bash title=".env" theme={null}
    CLAUDE_CODE_USE_BEDROCK=1
    AWS_REGION=us-east-1
    AWS_ACCESS_KEY_ID=AKIA...
    AWS_SECRET_ACCESS_KEY=...
    ```

    Bedrock supports five authentication methods: AWS CLI config, environment
    variable access keys, SSO profiles, Management Console credentials, and
    Bedrock API keys (`AWS_BEARER_TOKEN_BEDROCK`).

    <Warning>
      `AWS_REGION` is required and is not read from your AWS CLI configuration.
      Always set it explicitly.
    </Warning>

    IAM permissions required: `bedrock:InvokeModel`,
    `bedrock:InvokeModelWithResponseStream`, `bedrock:ListInferenceProfiles`.

    For full IAM policy details, credential chain options, and guardrail
    configuration, see the
    [Claude Code Bedrock docs](https://code.claude.com/docs/en/amazon-bedrock).
  </Accordion>

  <Accordion title="Google Vertex AI">
    Set these environment variables in your `.env` file:

    | Variable                         | Required | Description                                                 |
    | -------------------------------- | -------- | ----------------------------------------------------------- |
    | `CLAUDE_CODE_USE_VERTEX`         | Yes      | Set to `1` to enable Vertex AI                              |
    | `CLOUD_ML_REGION`                | Yes      | GCP region (e.g. `us-east5`) or `global`                    |
    | `ANTHROPIC_VERTEX_PROJECT_ID`    | Yes      | Your GCP project ID                                         |
    | `GOOGLE_APPLICATION_CREDENTIALS` | No       | Path to service account JSON (alternative to `gcloud auth`) |

    ```bash title=".env" theme={null}
    CLAUDE_CODE_USE_VERTEX=1
    CLOUD_ML_REGION=us-east5
    ANTHROPIC_VERTEX_PROJECT_ID=my-project-id
    ```

    Authenticate with `gcloud auth application-default login` or provide a
    service account key via `GOOGLE_APPLICATION_CREDENTIALS`.

    IAM role required: `roles/aiplatform.user`

    <Note>
      Model access approval on Vertex AI can take 24–48 hours. Not all models
      are available in all regions.
    </Note>

    For full GCP setup, region-specific configuration, and credential details,
    see the
    [Claude Code Vertex AI docs](https://code.claude.com/docs/en/google-vertex-ai).
  </Accordion>

  <Accordion title="Custom LLM gateway">
    Point ollim-bot at any endpoint that implements the
    [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) —
    a [Bifrost gateway](https://github.com/maximhq/bifrost), vLLM, or
    your own gateway.

    | Variable               | Required | Description                                   |
    | ---------------------- | -------- | --------------------------------------------- |
    | `ANTHROPIC_BASE_URL`   | Yes      | Base URL for the Messages API endpoint        |
    | `ANTHROPIC_AUTH_TOKEN` | No       | Static API key sent as `Authorization` header |

    ```bash title=".env" theme={null}
    ANTHROPIC_BASE_URL=https://your-gateway:8080/anthropic
    ANTHROPIC_AUTH_TOKEN=your-gateway-key
    ```

    The gateway must expose `/v1/messages` and forward the `anthropic-beta`
    and `anthropic-version` headers. For general LLM gateway setup, see the
    [Claude Code LLM gateway docs](https://code.claude.com/docs/en/llm-gateway).
  </Accordion>

  <Accordion title="Cross-provider pricing">
    Global endpoint pricing is identical across Anthropic API, Amazon Bedrock,
    and Google Vertex AI — no markup. Regional endpoints add a 10% premium
    for data residency compliance.

    **Per million tokens (global endpoints):**

    | Model             | Anthropic API | Bedrock | Vertex AI | Regional (+10%) |
    | ----------------- | ------------- | ------- | --------- | --------------- |
    | Opus 4.6 input    | \$5.00        | \$5.00  | \$5.00    | \$5.50          |
    | Opus 4.6 output   | \$25.00       | \$25.00 | \$25.00   | \$27.50         |
    | Sonnet 4.6 input  | \$3.00        | \$3.00  | \$3.00    | \$3.30          |
    | Sonnet 4.6 output | \$15.00       | \$15.00 | \$15.00   | \$16.50         |
    | Haiku 4.5 input   | \$1.00        | \$1.00  | \$1.00    | \$1.10          |
    | Haiku 4.5 output  | \$5.00        | \$5.00  | \$5.00    | \$5.50          |

    **Feature availability:**

    | Feature                | Anthropic API | Bedrock       | Vertex AI     |
    | ---------------------- | ------------- | ------------- | ------------- |
    | Prompt caching         | Yes           | Yes           | Yes           |
    | Batch API (50% off)    | Yes           | Yes           | Yes           |
    | Extended thinking      | Yes           | Yes           | Yes           |
    | Fast mode (6x pricing) | Yes           | Not confirmed | Not confirmed |
    | 1M context (beta)      | Yes           | Verify        | Verify        |
    | New model availability | First         | Delayed       | Delayed       |
    | Provisioned throughput | No            | Yes           | Yes           |

    **Choose based on your infrastructure**, not pricing — the per-token cost
    is the same. Bedrock and Vertex AI add value through IAM integration,
    compliance frameworks, and provisioned throughput for predictable
    workloads. The Anthropic API gets new features and models first.

    Source: [Anthropic pricing](https://platform.claude.com/docs/en/about-claude/pricing),
    [Bedrock pricing](https://aws.amazon.com/bedrock/pricing/),
    [Vertex AI pricing](https://cloud.google.com/vertex-ai/generative-ai/pricing).
    Pricing current as of February 2026.
  </Accordion>
</AccordionGroup>

## Next steps

<Columns cols={2}>
  <Card title="Configuration reference" icon="sliders" href="/configuration/reference">
    All environment variables and configuration options.
  </Card>

  <Card title="Self-host ollim-bot" icon="server" href="/self-hosting/guide">
    Fork, configure, and deploy your own instance.
  </Card>

  <Card title="Routines" icon="clock" href="/scheduling/routines">
    Per-routine model overrides and background fork configuration.
  </Card>

  <Card title="Slash commands" icon="terminal" href="/core-usage/slash-commands">
    The /model command and other runtime controls.
  </Card>
</Columns>
