ollim-bot authenticates through Claude Code OAuth by default — no API key, no provider config, nothing to set up. Your Claude subscription handles everything. If you want to experiment with other models or reduce costs, swap providers with a couple of environment variables in your .env file.
| I want to… | Go to |
|---|---|
| Just use my Claude subscription (recommended) | Default: Claude subscription |
| Use a cheaper subscription from another provider | Alternative subscriptions |
| Pay per token instead of a flat subscription | Pay-per-token providers |
| Run models locally for full data sovereignty | Self-hosted models |
| Use Bedrock, Vertex AI, or a custom gateway | Advanced provider options |
| Pick the right Claude model for my use case | Choosing a model |

Default: Claude subscription

Out of the box, ollim-bot uses your Claude subscription via Claude Code OAuth. The model you get depends on your subscription tier:
| Subscription | Default model | Opus access |
|---|---|---|
| Pro | Sonnet 4.6 | Available (with fallback) |
| Max | Opus 4.6 | Default |
Switch models at runtime with the /model slash command in Discord:
/model opus
/model sonnet
/model haiku
The agent resolves aliases to the latest version automatically — sonnet currently maps to Sonnet 4.6, opus to Opus 4.6, haiku to Haiku 4.5.
Claude Code may fall back to Sonnet if you hit your Opus usage threshold on a subscription plan.

Alternative subscriptions

Don’t want a Claude subscription? Several providers offer their own coding subscriptions with Anthropic Messages API-compatible endpoints. Set two environment variables and ollim-bot uses their models instead — no code changes.
| Provider | Cost | Models | Notes |
|---|---|---|---|
| Z.AI | $3–49/mo | GLM-5, GLM-4.7 | Free tier (GLM-4.7-Flash). GLM-5 (744B/40B active MoE, MIT license) released Feb 2026 |
| Qwen | $10–50/mo | Qwen3.5, Qwen3-Coder-Next + others | Multi-model subscription. Qwen3.5 supports 1M context and multimodal |
| MiniMax | $10–150/mo | MiniMax M2.5 | SWE-Bench 80.2%, 100+ tok/s, $0.30/M input on API |
| Kimi | ~$7/week | Kimi K2.5 | 1T params (32B active MoE), agent swarm up to 100 sub-agents |
All of these use the same pattern — ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN in your .env file:
.env
ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic
ANTHROPIC_AUTH_TOKEN=your-zai-api-key
ANTHROPIC_DEFAULT_SONNET_MODEL=glm-4.7
ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.5-air
GLM-4.7 is roughly 5–7x cheaper than Claude Sonnet 4.6. Z.AI offers subscription plans (Lite $3/mo, Pro $15/mo, Max ~$60/mo) with prompt-based quotas, or pay-per-token. GLM-4.7-Flash and GLM-4.5-Flash are free.

GLM-5 (released February 2026) is their frontier model — 744B total / 40B active MoE, MIT license, $1.00/M input / $3.20/M output via API. It requires server-grade hardware for self-hosting.
.env
ANTHROPIC_BASE_URL=https://dashscope-intl.aliyuncs.com/apps/anthropic
ANTHROPIC_AUTH_TOKEN=your-dashscope-api-key
ANTHROPIC_MODEL=qwen3.5-plus
ANTHROPIC_SMALL_FAST_MODEL=qwen3.5-coder
One subscription ($10–50/mo) covers six models. Switch between Qwen3.5-Plus, Qwen3-Coder, GLM-4.7, Kimi-K2.5, and MiniMax M2.5 without changing providers.

Qwen3.5 (released February 2026) adds 1M context, native multimodal, and 201 languages. Qwen3-Coder-Next (80B/3B active MoE, Apache 2.0) scores 70.6 on SWE-Bench Verified — purpose-built for coding agents.
.env
ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
ANTHROPIC_AUTH_TOKEN=your-minimax-api-key
ANTHROPIC_MODEL=minimax-m2.5
ANTHROPIC_SMALL_FAST_MODEL=minimax-m2.5-lightning
Subscription tiers: Starter $10/mo (100 prompts/5h), Plus $20/mo (300/5h), Max $50/mo (1,000/5h). Throughput is 100+ tokens per second.

M2.5 (released February 2026) hits 80.2% on SWE-Bench Verified and completes tasks 37% faster than M2.1. API pricing: $0.30/M input, $1.20/M output. Open weights under Modified MIT.
.env
ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic
ANTHROPIC_AUTH_TOKEN=your-moonshot-api-key
ANTHROPIC_MODEL=kimi-k2.5
ANTHROPIC_SMALL_FAST_MODEL=kimi-k2
Weekly membership at ~$7/week with 300–1,200 API calls per 5-hour window. 256K context window and 100 tokens/second output speed.

K2.5 (released January 2026) is a 1T parameter / 32B active MoE with native multimodal and an agent swarm that runs up to 100 sub-agents in parallel. Open weights under Modified MIT.

Pay-per-token providers

If you prefer paying for what you use instead of a flat subscription:
| Provider | Input cost | Output cost | Models | Notes |
|---|---|---|---|---|
| DeepSeek | $0.28/1M | $0.42/1M | DeepSeek V3.2 | Cheapest option. Thinking integrated with tool use |
| OpenRouter | Varies | Varies | 400+ models | Gateway with unified billing. 24 free models, openrouter/free auto-router |
.env
ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
ANTHROPIC_AUTH_TOKEN=your-deepseek-api-key
ANTHROPIC_MODEL=deepseek-chat
ANTHROPIC_SMALL_FAST_MODEL=deepseek-chat
DeepSeek is the cheapest per-token option — roughly 10x cheaper than Claude Sonnet 4.6 — and no subscription is required. V3.2 is the first model to integrate thinking directly into tool use, supporting both thinking and non-thinking modes. Image input is not supported through the Anthropic compatibility endpoint.

Open weights (MIT license) — distilled 32B variants are available for single-GPU self-hosting via Ollama (ollama pull deepseek-v3.2).
.env
ANTHROPIC_BASE_URL=https://openrouter.ai/api
ANTHROPIC_AUTH_TOKEN=sk-or-v1-your-key
ANTHROPIC_API_KEY=
Gateway to 400+ models with unified billing. 24 models are free without a credit card, and openrouter/free auto-routes to a compatible free model. The empty ANTHROPIC_API_KEY= prevents Claude Code from authenticating directly with Anthropic. Only Claude models are guaranteed to work — non-Claude models require a translation proxy.
Alternative models are community-supported and not tested by Anthropic. ollim-bot’s agentic loop depends on reliable tool use — test thoroughly before relying on a non-Claude model for daily operations. Provider endpoints and pricing may change without notice.

Self-hosted models

Run models locally for full data sovereignty — no tokens leave your network. As of early 2026, all three major inference backends natively support the Anthropic Messages API with tool calling. All self-hosted setups use the same .env pattern:
| Variable | Purpose |
|---|---|
| ANTHROPIC_BASE_URL | Points ollim-bot at your local inference server instead of Anthropic’s API |
| ANTHROPIC_AUTH_TOKEN | Any non-empty string — local backends require the header but don’t validate it. This also tells the Agent SDK to skip Claude OAuth, so no Anthropic account is needed. |
| ANTHROPIC_MODEL | The model name your backend serves (must match exactly) |
| ANTHROPIC_SMALL_FAST_MODEL | Model used for lightweight tasks like subagent work and background routines. Can be the same model or a smaller/faster one. |
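Before starting the bot, it can help to confirm all four variables are actually set. The helper below is a hypothetical sketch (not part of ollim-bot) that scans a .env file for the variables in the table above:

```python
# Hypothetical helper: report which self-hosted variables are missing from .env.
from pathlib import Path

REQUIRED = [
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_MODEL",
    "ANTHROPIC_SMALL_FAST_MODEL",
]

def missing_vars(env_path: str = ".env") -> list[str]:
    """Return the required variables not set (or set to empty) in the .env file."""
    present = set()
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            if value.strip():  # an empty value does not count as set
                present.add(key.strip())
    return [v for v in REQUIRED if v not in present]
```

Running `missing_vars()` against an incomplete .env returns the names still to fill in.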
Ollama (v0.17+) runs open models locally with a native Anthropic-compatible endpoint, tool calling, and streaming.
curl -fsSL https://ollama.com/install.sh -o /tmp/ollama-install.sh
less /tmp/ollama-install.sh   # inspect the script first
sh /tmp/ollama-install.sh
Pull a model before starting the bot — the bot reports a model error if it starts before the pull finishes:
ollama pull qwen3.5:2b
curl http://localhost:11434/
If you used Docker, prefix commands with docker exec ollama:
docker exec ollama ollama pull qwen3.5:2b
Then add these variables to your .env:
.env
ANTHROPIC_BASE_URL=http://localhost:11434
ANTHROPIC_AUTH_TOKEN=ollama
ANTHROPIC_MODEL=qwen3.5:2b
ANTHROPIC_SMALL_FAST_MODEL=qwen3.5:2b
Ollama model names use the Ollama registry format (e.g., qwen3.5:2b, qwen3.5:latest) — not HuggingFace model IDs. Browse available models at ollama.com/search.
Ollama can be reachable beyond localhost depending on how it is installed. To keep it local-only, set OLLAMA_HOST=127.0.0.1 in your environment for a native install, or bind the published port to loopback in Docker (e.g. -p 127.0.0.1:11434:11434).
v0.17 (February 2026) ships a new inference engine with up to 40% faster prompt processing, improved multi-GPU tensor parallelism, and better KV cache management for long conversations.

Tool use works well with larger models — expect tinkering with smaller ones. Local inference is still slower than cloud providers, so it is not recommended as the primary backend for a bot that needs sub-second response times.

Once your .env is configured, return to step 6 of the quickstart to start the bot.
vLLM (v0.16+) exposes a native Anthropic /v1/messages endpoint with tool calling — the best option for production multi-GPU deployments:
.env
ANTHROPIC_BASE_URL=http://localhost:8000
ANTHROPIC_AUTH_TOKEN=vllm
ANTHROPIC_MODEL=qwen3-coder-next
ANTHROPIC_SMALL_FAST_MODEL=qwen3-coder
v0.16 (February 2026) adds async scheduling with pipeline parallelism for ~31% throughput improvement. See the vLLM Claude Code integration docs for full setup.
llama.cpp server added Anthropic Messages API support in January 2026 — the most lightweight option for single-GPU setups:
.env
ANTHROPIC_BASE_URL=http://localhost:8080
ANTHROPIC_AUTH_TOKEN=llamacpp
ANTHROPIC_MODEL=qwen3-coder-next
ANTHROPIC_SMALL_FAST_MODEL=qwen3-coder
Supports tools, vision, streaming, and token counting. Up to 35% faster with NVFP4/FP8 quantization on NVIDIA GPUs. See the Hugging Face walkthrough for setup details.
If your inference server only speaks OpenAI Chat Completions, route through a Bifrost gateway — it translates to Anthropic format automatically with sub-millisecond overhead, load balancing, and a built-in web UI.
# Start Bifrost (Docker or npx)
docker run -p 8080:8080 maximhq/bifrost
# or: npx -y @maximhq/bifrost
.env
ANTHROPIC_BASE_URL=http://localhost:8080/anthropic
ANTHROPIC_AUTH_TOKEN=bifrost
ANTHROPIC_MODEL=your-model-name
ANTHROPIC_SMALL_FAST_MODEL=your-model-name
Bifrost is open source (Apache 2.0) and supports OpenAI, Ollama, and vLLM backends among others.
Pick based on your hardware. For reliable tool calling, use high-quality quantizations (q8 or fp16 — these preserve the precision models need for structured output like tool calls).
Most self-hosted models require a dedicated GPU with at least 24GB VRAM. Smaller models like Qwen3.5 (2B) can run on consumer GPUs (8GB+) but with reduced tool-calling reliability. If you don’t have the hardware, consider alternative subscriptions or pay-per-token providers instead.
Models marked “MoE” (Mixture of Experts) activate only a fraction of their parameters per token. The weights for all experts must still fit in memory (often via quantization or CPU offload), but inference speed tracks the active parameter count — an 80B model with 3B active generates tokens at roughly 3B-model speed.
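A rough sizing sketch makes the trade-off concrete. This is illustrative only — it assumes all expert weights resident at a uniform quantization and ignores KV cache and activation memory:

```python
# Illustrative MoE sizing (assumption: all expert weights resident in memory).
def weight_memory_gb(total_params_billion: float, bits_per_param: int = 4) -> float:
    """Approximate resident weight memory for the full parameter count."""
    return total_params_billion * bits_per_param / 8

def compute_ratio_vs_dense(total_params_billion: float, active_params_billion: float) -> float:
    """Per-token compute scales with active params, not total."""
    return total_params_billion / active_params_billion

# Qwen3-Coder-Next from the table: 80B total, 3B active.
print(round(weight_memory_gb(80), 1))           # 40.0 GB of weights at 4-bit
print(round(compute_ratio_vs_dense(80, 3), 1))  # 26.7x less compute per token than dense 80B
```

So memory budgets follow the total parameter count while speed follows the active count — which is why the table's cluster-scale models are still fast once loaded.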
| Model | Ollama name | Active params | Hardware | Tool use | License |
|---|---|---|---|---|---|
| Qwen3.5 | qwen3.5:2b | 2.3B | 8GB+ VRAM | Good — tool calling + multimodal | Apache 2.0 |
| Qwen3-Coder-Next | — | 3B (80B MoE) | 32GB+ VRAM | Excellent — SWE-Bench 70.6 | Apache 2.0 |
| Qwen3.5-35B-A3B | qwen3.5:latest | 3B (35B MoE) | 32GB VRAM | Good — 1M context | Apache 2.0 |
| GLM-4.7-Flash | — | 30B | 24GB+ VRAM | Good — interleaved reasoning | Open weights |
| DeepSeek V3.2 (32B distill) | deepseek-v3.2 | 32B | 24GB+ VRAM | Good — thinking + tool use | MIT |
| GLM-5 | — | 40B (744B MoE) | Multi-GPU cluster | Excellent — frontier open-source | MIT |
| MiniMax M2.5 | — | Large MoE | Multi-GPU cluster | Excellent — SWE-Bench 80.2% | Modified MIT |
| Kimi K2.5 | — | 32B (1T MoE) | Multi-GPU cluster | Excellent — 1,500 tool calls | Modified MIT |
The Ollama name column shows the tag to use with ollama pull. Models marked “—” are not yet available on Ollama — use vLLM or llama.cpp with the HuggingFace model ID instead. Check ollama.com/search for current availability.
Self-hosting works well for data sovereignty and experimentation. Tool use in local models improved sharply in early 2026 with native Anthropic endpoints in Ollama and vLLM — but cloud providers still win on latency and reliability for a bot that needs to respond quickly throughout the day.

Model version pinning

By default, model aliases (opus, sonnet, haiku) resolve to the latest version. Pin specific versions with these environment variables:
| Variable | Description |
|---|---|
| ANTHROPIC_DEFAULT_OPUS_MODEL | Pin the opus alias |
| ANTHROPIC_DEFAULT_SONNET_MODEL | Pin the sonnet alias |
| ANTHROPIC_DEFAULT_HAIKU_MODEL | Pin the haiku alias |
These also work with alternative providers — set them to the provider’s model IDs (e.g. glm-4.7, deepseek-chat, kimi-k2.5).
| Model | Anthropic API | Amazon Bedrock | Google Vertex AI |
|---|---|---|---|
| Opus 4.6 | claude-opus-4-6 | us.anthropic.claude-opus-4-6-v1 | claude-opus-4-6 |
| Sonnet 4.6 | claude-sonnet-4-6 | us.anthropic.claude-sonnet-4-6 | claude-sonnet-4-6 |
| Haiku 4.5 | claude-haiku-4-5-20251001 | us.anthropic.claude-haiku-4-5-20251001-v1:0 | claude-haiku-4-5@20251001 |
.env (Bedrock example)
ANTHROPIC_DEFAULT_OPUS_MODEL=us.anthropic.claude-opus-4-6-v1
ANTHROPIC_DEFAULT_SONNET_MODEL=us.anthropic.claude-sonnet-4-6
ANTHROPIC_DEFAULT_HAIKU_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0
Pin all three models when using Bedrock or Vertex AI. Without pinning, aliases resolve to the latest version — which may not be available in your deployment yet.
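The resolution order can be sketched as follows — a pin variable wins over the built-in alias default. This is a simplified illustration, not Claude Code's actual implementation; the fallback IDs are the Anthropic API names from the table above:

```python
# Simplified sketch of alias resolution with version pinning (illustrative only).
import os

ALIAS_DEFAULTS = {
    "opus": "claude-opus-4-6",
    "sonnet": "claude-sonnet-4-6",
    "haiku": "claude-haiku-4-5-20251001",
}

def resolve_model(alias: str) -> str:
    """ANTHROPIC_DEFAULT_<ALIAS>_MODEL overrides the built-in default if set."""
    pinned = os.environ.get(f"ANTHROPIC_DEFAULT_{alias.upper()}_MODEL")
    return pinned or ALIAS_DEFAULTS[alias]

os.environ["ANTHROPIC_DEFAULT_SONNET_MODEL"] = "glm-4.7"
print(resolve_model("sonnet"))  # glm-4.7
print(resolve_model("opus"))    # claude-opus-4-6
```

This is also why the pin variables work with alternative providers: the resolved string is passed through to whatever endpoint ANTHROPIC_BASE_URL points at.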

Per-routine model override

Background routines can override the model in their YAML frontmatter:
routines/quick-email-check.md
---
id: "d1e2f3a4"
cron: "0 */3 * * *"
description: "Email check"
background: true
model: "haiku"
---
Check for new important emails. Save a summary to pending updates.
The model field accepts aliases (opus, sonnet, haiku) and only applies to background routines. See Routines for all frontmatter fields.

Choosing a model

For most ollim-bot use, Sonnet 4.6 handles tool calling, scheduling, and conversation as well as Opus 4.6 — at 40% of the cost. Opus pulls ahead on deep reasoning and complex multi-step debugging. Haiku 4.5 is ideal for lightweight background routines where speed matters more than depth.
| Model | Best for | Tool calling | Agentic coding | Deep reasoning | Speed | Cost |
|---|---|---|---|---|---|---|
| Opus 4.6 | Complex multi-step tasks, novel problem-solving | Excellent | 65.4% Terminal-Bench | 91.3% GPQA | Slowest | $$$ |
| Sonnet 4.6 | Daily conversations, routines, most agentic work | Excellent | 59.1% Terminal-Bench | 74.1% GPQA | Fast | $$ |
| Haiku 4.5 | Background routines, email triage, quick checks | Good | 41.8% Terminal-Bench | — | Fastest | $ |
Haiku has a 200k context window (vs. 1M for Sonnet and Opus). If your main session exceeds 200k tokens, the bot automatically upgrades interactive forks to sonnet to avoid failures. This makes haiku best suited for short-lived background routines rather than long interactive sessions.
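The upgrade behavior can be sketched like this — an illustration of the rule above, not ollim-bot's actual code:

```python
# Sketch of the automatic fork upgrade described above (illustrative only).
CONTEXT_WINDOWS = {"haiku": 200_000, "sonnet": 1_000_000, "opus": 1_000_000}

def fork_model(requested: str, session_tokens: int) -> str:
    """Upgrade an interactive fork to sonnet when the session exceeds the
    requested model's context window."""
    if session_tokens > CONTEXT_WINDOWS[requested]:
        return "sonnet"
    return requested

print(fork_model("haiku", 250_000))  # sonnet — session too large for haiku
print(fork_model("haiku", 50_000))   # haiku
```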
Sonnet 4.6 is the sweet spot for ollim-bot. It matches Opus on tau2-bench tool calling (91.7% vs 91.9%), beats it on knowledge work tasks (GDPval-AA: 1633 vs 1606 Elo), and is 70% more token-efficient. It’s the default.
All scores use extended/adaptive thinking unless noted. Benchmarks are selected for relevance to agentic tool-calling bots like ollim-bot.
| Benchmark | What it measures | Opus 4.6 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|---|
| SWE-bench Verified | Real-world software engineering | 80.8% | 79.6% | 73.3% |
| Terminal-Bench 2.0 | Agentic CLI coding | 65.4% | 59.1% | 41.8% |
| tau2-bench Retail | Multi-step tool calling (retail) | 91.9% | 91.7% | — |
| tau2-bench Telecom | Multi-step tool calling (telecom) | 99.3% | 97.9% | — |
| OSWorld | Agentic computer use | 72.7% | 72.5% | 22.0% |
| MCP Atlas | Scaled tool use | 59.5% | 61.3% | — |
| GDPval-AA (Elo) | Economically valuable knowledge work | 1606 | 1633 | — |
| Finance Agent | Financial tool use | 60.7% | 63.3% | — |
| ARC-AGI-2 | Novel problem-solving | 68.8% | 58.3% | — |
| GPQA Diamond | Graduate-level scientific reasoning | 91.3% | 74.1% | — |
| Humanity’s Last Exam | Hardest questions (with tools) | 53.1% | 19.1% | — |
| BrowseComp | Web search and information discovery | 84.0% | — | — |
| MRCR v2 8-needle @ 1M | Long-context retrieval accuracy | 76.0% | — | — |
Key pattern: Sonnet 4.6 matches or beats Opus on practical tool calling (tau2-bench, MCP Atlas, Finance Agent, GDPval-AA). Opus leads on deep reasoning (GPQA, ARC-AGI-2, Humanity’s Last Exam) and long-context retrieval — tasks that matter for complex debugging, not typical daily bot interactions.

Haiku 4.5 achieves 73.3% on SWE-bench Verified — matching Claude Sonnet 4.5 — at one-third the cost and 4-5x the speed. It reaches ~90% of Sonnet 4.5’s agentic coding performance per Augment’s evaluation.

Sources: Anthropic Opus 4.6, Anthropic Sonnet 4.6, Anthropic Haiku 4.5, Vellum benchmarks, Anthropic model overview. Scores current as of February 2026.

Claude pricing

For most users, a Claude subscription costs less than API pay-as-you-go. The average Claude Code developer uses the equivalent of $130/month in API tokens — covered by a $20 Pro plan.
| Plan | Cost | Default model | Opus access | Rate limits |
|---|---|---|---|---|
| Pro | $20/mo | Sonnet 4.6 | Available (with fallback) | ~45 msgs/5hr |
| Max 5x | $100/mo | Opus 4.6 | Default | 5x Pro |
| Max 20x | $200/mo | Opus 4.6 | Default | 20x Pro |
On the Pro plan, Claude Code may fall back from Opus to Sonnet when you hit a usage threshold. The exact limit is not published. Max plans have higher thresholds — Max 20x rarely triggers fallback.
Per million tokens (standard on-demand):
| Model | Input | Output | Cache read (90% off) | Batch (50% off) |
|---|---|---|---|---|
| Opus 4.6 | $5.00 | $25.00 | $0.50 | $2.50 in / $12.50 out |
| Sonnet 4.6 | $3.00 | $15.00 | $0.30 | $1.50 in / $7.50 out |
| Haiku 4.5 | $1.00 | $5.00 | $0.10 | $0.50 in / $2.50 out |
Extended thinking tokens are billed at output token rates. Long context (>200K input) doubles the input cost and adds 50% to the output cost.

Breakeven analysis (assuming a 3:1 input-to-output ratio):
| Plan | Monthly cost | Breakeven on Sonnet | Breakeven on Opus |
|---|---|---|---|
| Pro | $20 | ~3.3M tokens/month | ~2M tokens/month |
| Max 5x | $100 | ~16.7M tokens/month | ~10M tokens/month |
| Max 20x | $200 | ~33.3M tokens/month | ~20M tokens/month |
For context, Anthropic reports the average Claude Code developer uses $6/day ($130/month) in API-equivalent costs, and the 90th percentile is under $12/day (~$260/month). Pro at $20/month covers what would be $130+ on the API — a subscription is the clear winner for regular use.

API pay-as-you-go only wins at very low usage (under ~3M tokens/month on Sonnet) or when you need guaranteed access without rate limit resets.

Source: Anthropic API pricing, Claude Code costs.
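The breakeven figures above follow from a blended per-token rate at the assumed 3:1 input-to-output mix:

```python
# Reproduce the breakeven table (3:1 input:output, standard on-demand rates).
RATES = {  # $ per million tokens: (input, output)
    "sonnet": (3.00, 15.00),
    "opus": (5.00, 25.00),
}

def blended_rate(model: str) -> float:
    """Cost per million tokens at a 3:1 input-to-output mix."""
    inp, out = RATES[model]
    return (3 * inp + out) / 4

def breakeven_millions(plan_cost: float, model: str) -> float:
    """Monthly token volume (in millions) where API cost matches the plan."""
    return plan_cost / blended_rate(model)

print(round(breakeven_millions(20, "sonnet"), 1))   # 3.3 — Pro plan on Sonnet
print(round(breakeven_millions(20, "opus"), 1))     # 2.0 — Pro plan on Opus
print(round(breakeven_millions(100, "sonnet"), 1))  # 16.7 — Max 5x on Sonnet
```

A different input-to-output ratio shifts the blended rate, so treat these numbers as rough guides rather than exact thresholds.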

Advanced provider options

If you need pay-as-you-go API billing, cloud provider infrastructure, or a custom LLM gateway, these options are available but require more setup.
For pay-as-you-go billing instead of a subscription:
.env
ANTHROPIC_API_KEY=sk-ant-...
This bypasses Claude Code OAuth entirely. You pay per token at Anthropic’s API rates.
Set these environment variables in your .env file:
| Variable | Required | Description |
|---|---|---|
| CLAUDE_CODE_USE_BEDROCK | Yes | Set to 1 to enable Bedrock |
| AWS_REGION | Yes | AWS region (e.g. us-east-1) — not read from .aws config |
| AWS_ACCESS_KEY_ID | Conditional | AWS access key (one auth method required) |
| AWS_SECRET_ACCESS_KEY | Conditional | AWS secret key |
| AWS_SESSION_TOKEN | No | Session token for temporary credentials |
| AWS_PROFILE | Conditional | AWS SSO profile name (alternative to access keys) |
.env
CLAUDE_CODE_USE_BEDROCK=1
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
Bedrock supports five authentication methods: AWS CLI config, environment variable access keys, SSO profiles, Management Console credentials, and Bedrock API keys (AWS_BEARER_TOKEN_BEDROCK).
AWS_REGION is required and is not read from your AWS CLI configuration. Always set it explicitly.
IAM permissions required: bedrock:InvokeModel, bedrock:InvokeModelWithResponseStream, bedrock:ListInferenceProfiles.

For full IAM policy details, credential chain options, and guardrail configuration, see the Claude Code Bedrock docs.
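A minimal policy sketch covering those three actions might look like the following (an illustration only — scope Resource down to specific inference profiles in production):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
        "bedrock:ListInferenceProfiles"
      ],
      "Resource": "*"
    }
  ]
}
```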
Set these environment variables in your .env file:
| Variable | Required | Description |
|---|---|---|
| CLAUDE_CODE_USE_VERTEX | Yes | Set to 1 to enable Vertex AI |
| CLOUD_ML_REGION | Yes | GCP region (e.g. us-east5) or global |
| ANTHROPIC_VERTEX_PROJECT_ID | Yes | Your GCP project ID |
| GOOGLE_APPLICATION_CREDENTIALS | No | Path to service account JSON (alternative to gcloud auth) |
.env
CLAUDE_CODE_USE_VERTEX=1
CLOUD_ML_REGION=us-east5
ANTHROPIC_VERTEX_PROJECT_ID=my-project-id
Authenticate with gcloud auth application-default login or provide a service account key via GOOGLE_APPLICATION_CREDENTIALS.

IAM role required: roles/aiplatform.user
Model access approval on Vertex AI can take 24–48 hours. Not all models are available in all regions.
For full GCP setup, region-specific configuration, and credential details, see the Claude Code Vertex AI docs.
Point ollim-bot at any endpoint that implements the Anthropic Messages API — a Bifrost gateway, vLLM, or your own gateway.
| Variable | Required | Description |
|---|---|---|
| ANTHROPIC_BASE_URL | Yes | Base URL for the Messages API endpoint |
| ANTHROPIC_AUTH_TOKEN | No | Static API key sent as Authorization header |
.env
ANTHROPIC_BASE_URL=https://your-gateway:8080/anthropic
ANTHROPIC_AUTH_TOKEN=your-gateway-key
The gateway must expose /v1/messages and forward the anthropic-beta and anthropic-version headers. For general LLM gateway setup, see the Claude Code LLM gateway docs.
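To illustrate, here is the shape of a minimal Messages API request a compliant gateway must accept. The URL, key, and model are placeholders — in practice they come from your .env:

```python
# Shape of a minimal Anthropic Messages API request (placeholder values).
import json

url = "https://your-gateway:8080/anthropic/v1/messages"
headers = {
    "authorization": "Bearer your-gateway-key",
    "anthropic-version": "2023-06-01",  # the gateway must forward this upstream
    "content-type": "application/json",
}
body = json.dumps({
    "model": "claude-sonnet-4-6",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "ping"}],
})
print(json.loads(body)["model"])  # claude-sonnet-4-6
```

A gateway that returns a valid message object for a request of this shape will generally work with ollim-bot's agentic loop.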
Global endpoint pricing is identical across Anthropic API, Amazon Bedrock, and Google Vertex AI — no markup. Regional endpoints add a 10% premium for data residency compliance.

Per million tokens (global endpoints):
| Model | Anthropic API | Bedrock | Vertex AI | Regional (+10%) |
|---|---|---|---|---|
| Opus 4.6 input | $5.00 | $5.00 | $5.00 | $5.50 |
| Opus 4.6 output | $25.00 | $25.00 | $25.00 | $27.50 |
| Sonnet 4.6 input | $3.00 | $3.00 | $3.00 | $3.30 |
| Sonnet 4.6 output | $15.00 | $15.00 | $15.00 | $16.50 |
| Haiku 4.5 input | $1.00 | $1.00 | $1.00 | $1.10 |
| Haiku 4.5 output | $5.00 | $5.00 | $5.00 | $5.50 |
Feature availability:
| Feature | Anthropic API | Bedrock | Vertex AI |
|---|---|---|---|
| Prompt caching | Yes | Yes | Yes |
| Batch API (50% off) | Yes | Yes | Yes |
| Extended thinking | Yes | Yes | Yes |
| Fast mode (6x pricing) | Yes | Not confirmed | Not confirmed |
| 1M context (beta) | Yes | Verify | Verify |
| New model availability | First | Delayed | Delayed |
| Provisioned throughput | No | Yes | Yes |
Choose based on your infrastructure, not pricing — the per-token cost is the same. Bedrock and Vertex AI add value through IAM integration, compliance frameworks, and provisioned throughput for predictable workloads. The Anthropic API gets new features and models first.

Source: Anthropic pricing, Bedrock pricing, Vertex AI pricing. Pricing current as of February 2026.

Next steps

Configuration reference

All environment variables and configuration options.

Self-host ollim-bot

Fork, configure, and deploy your own instance.

Routines

Per-routine model overrides and background fork configuration.

Slash commands

The /model command and other runtime controls.