.env file.
| I want to… | Go to |
|---|---|
| Just use my Claude subscription (recommended) | Default: Claude subscription |
| Use a cheaper subscription from another provider | Alternative subscriptions |
| Pay per token instead of a flat subscription | Pay-per-token providers |
| Run models locally for full data sovereignty | Self-hosted models |
| Use Bedrock, Vertex AI, or a custom gateway | Advanced provider options |
| Pick the right Claude model for my use case | Choosing a model |
Default: Claude subscription
Out of the box, ollim-bot uses your Claude subscription via Claude Code OAuth. The model you get depends on your subscription tier:
| Subscription | Default model | Opus access |
|---|---|---|
| Pro | Sonnet 4.6 | Available (with fallback) |
| Max | Opus 4.6 | Default |
Switch models at runtime with the /model slash command in Discord: sonnet currently maps to Sonnet 4.6, opus to Opus 4.6, and haiku to Haiku 4.5.
Claude Code may fall back to Sonnet if you hit your Opus usage threshold
on a subscription plan.
Alternative subscriptions
Don’t want a Claude subscription? Several providers offer their own coding subscriptions with Anthropic Messages API-compatible endpoints. Set two environment variables and ollim-bot uses their models instead — no code changes.
| Provider | Cost | Models | Notes |
|---|---|---|---|
| Z.AI | $3–49/mo | GLM-5, GLM-4.7 | Free tier (GLM-4.7-Flash). GLM-5 (744B/40B active MoE, MIT license) released Feb 2026 |
| Qwen | $10–50/mo | Qwen3.5, Qwen3-Coder-Next + others | Multi-model subscription. Qwen3.5 supports 1M context and multimodal |
| MiniMax | $10–150/mo | MiniMax M2.5 | SWE-Bench 80.2%, 100+ tok/s, $0.30/M input on API |
| Kimi | ~$7/week | Kimi K2.5 | 1T params (32B active MoE), agent swarm up to 100 sub-agents |
Each provider follows the same pattern: set ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN in your .env file:
Z.AI GLM setup
.env (Z.AI example)
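A minimal sketch of a Z.AI configuration. The base URL and model name are illustrative — verify both against Z.AI's current Claude Code integration docs:

```shell
# Z.AI's Anthropic-compatible endpoint (URL illustrative; confirm in Z.AI docs)
ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic
# Your Z.AI API key
ANTHROPIC_AUTH_TOKEN=your-zai-api-key
# Optional: select a GLM model explicitly
ANTHROPIC_MODEL=glm-5
```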
Qwen / Alibaba Cloud setup
.env
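A sketch for Qwen via Alibaba Cloud. The proxy URL below is an assumption based on Alibaba's Claude Code guide — confirm the current endpoint in the ModelStudio docs:

```shell
# Alibaba Cloud's Anthropic-compatible proxy (URL illustrative; confirm in ModelStudio docs)
ANTHROPIC_BASE_URL=https://dashscope-intl.aliyuncs.com/api/v2/apps/claude-code-proxy
# Your DashScope API key
ANTHROPIC_AUTH_TOKEN=your-dashscope-api-key
```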
MiniMax setup
.env
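A sketch for MiniMax. Both the URL and model name are illustrative — check MiniMax's docs for current values:

```shell
# MiniMax's Anthropic-compatible endpoint (URL illustrative; confirm in MiniMax docs)
ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
# Your MiniMax API key
ANTHROPIC_AUTH_TOKEN=your-minimax-api-key
# Model name illustrative
ANTHROPIC_MODEL=minimax-m2.5
```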
Kimi / Moonshot AI setup
.env
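A sketch for Kimi via Moonshot AI. The endpoint URL is illustrative — confirm it in Moonshot's docs:

```shell
# Moonshot AI's Anthropic-compatible endpoint (URL illustrative; confirm in Moonshot docs)
ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic
# Your Moonshot API key
ANTHROPIC_AUTH_TOKEN=your-moonshot-api-key
# Model name illustrative
ANTHROPIC_MODEL=kimi-k2.5
```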
Pay-per-token providers
If you prefer paying for what you use instead of a flat subscription:
| Provider | Input cost | Output cost | Models | Notes |
|---|---|---|---|---|
| DeepSeek | $0.28/1M | $0.42/1M | DeepSeek V3.2 | Cheapest option. Thinking integrated with tool use |
| OpenRouter | Varies | Varies | 400+ models | Gateway with unified billing. 24 free models, openrouter/free auto-router |
DeepSeek setup
.env
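A sketch for DeepSeek. The base URL is illustrative — verify it against DeepSeek's API docs:

```shell
# DeepSeek's Anthropic-compatible endpoint (URL illustrative; confirm in DeepSeek docs)
ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
# Your DeepSeek API key
ANTHROPIC_AUTH_TOKEN=your-deepseek-api-key
ANTHROPIC_MODEL=deepseek-chat
```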
DeepSeek V3.2 is also open-weights, so you can self-host it instead (ollama pull deepseek-v3.2).
OpenRouter setup
.env
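A sketch of the OpenRouter configuration described below. The base URL is an assumption — verify it against OpenRouter's docs:

```shell
# OpenRouter as a gateway (URL illustrative; confirm in OpenRouter docs)
ANTHROPIC_BASE_URL=https://openrouter.ai/api
# Your OpenRouter API key
ANTHROPIC_AUTH_TOKEN=your-openrouter-key
# Empty value stops Claude Code from authenticating directly with Anthropic
ANTHROPIC_API_KEY=
# Auto-route to a compatible free model
ANTHROPIC_MODEL=openrouter/free
```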
openrouter/free auto-routes to a compatible free model. The empty ANTHROPIC_API_KEY= prevents Claude Code from authenticating directly with Anthropic. Only Claude models are guaranteed to work — non-Claude models require a translation proxy.
Self-hosted models
Run models locally for full data sovereignty — no tokens leave your network. As of early 2026, all three major inference backends natively support the Anthropic Messages API with tool calling. All self-hosted setups use the same .env pattern:
| Variable | Purpose |
|---|---|
| ANTHROPIC_BASE_URL | Points ollim-bot at your local inference server instead of Anthropic’s API |
| ANTHROPIC_AUTH_TOKEN | Any non-empty string — local backends require the header but don’t validate it. This also tells the Agent SDK to skip Claude OAuth, so no Anthropic account is needed. |
| ANTHROPIC_MODEL | The model name your backend serves (must match exactly) |
| ANTHROPIC_SMALL_FAST_MODEL | Model used for lightweight tasks like subagent work and background routines. Can be the same model or a smaller/faster one. |
Ollama
Ollama (v0.17+) runs open models locally with a native Anthropic-compatible endpoint, tool calling, and streaming. Install it with the install script or with Docker (recommended).
Pull a model before starting the bot — the bot reports a model error if it starts before the pull finishes. If you used Docker, prefix commands with docker exec ollama.
Then add the four variables above to your .env.
Ollama model names use the Ollama registry format (e.g., qwen3.5:2b, qwen3.5:latest) — not HuggingFace model IDs. Browse available models at ollama.com/search.
v0.17 (February 2026) ships a new inference engine with up to 40% faster prompt processing, improved multi-GPU tensor parallelism, and better KV cache management for long conversations.
Tool use works well with larger models — expect tinkering with smaller ones. Local inference is still slower than cloud providers. Not recommended as a primary backend for a bot that needs sub-second response times.
Once your .env is configured, return to step 6 of the quickstart to start the bot.
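A concrete sketch of the Ollama flow, using the qwen3.5 tags from the recommended-models table below; port 11434 is Ollama's default:

```shell
# Pull models first (prefix with `docker exec ollama` for Docker installs)
ollama pull qwen3.5:latest
ollama pull qwen3.5:2b

# .env — any non-empty token works; local backends don't validate it
ANTHROPIC_BASE_URL=http://localhost:11434
ANTHROPIC_AUTH_TOKEN=ollama
ANTHROPIC_MODEL=qwen3.5:latest
ANTHROPIC_SMALL_FAST_MODEL=qwen3.5:2b
```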
vLLM
vLLM (v0.16+) exposes a native Anthropic /v1/messages endpoint with tool calling — the best option for production multi-GPU deployments.
v0.16 (February 2026) adds async scheduling with pipeline parallelism for ~31% throughput improvement. See the vLLM Claude Code integration docs for full setup.
.env
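A minimal vLLM sketch. The HuggingFace model ID and port are illustrative — match them to whatever you serve:

```shell
# Serve an open model (model ID and port illustrative)
vllm serve Qwen/Qwen3-Coder-Next --port 8000

# .env — ANTHROPIC_MODEL must match the served model name exactly
ANTHROPIC_BASE_URL=http://localhost:8000
ANTHROPIC_AUTH_TOKEN=vllm
ANTHROPIC_MODEL=Qwen/Qwen3-Coder-Next
```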
llama.cpp server
llama.cpp server added Anthropic Messages API support in January 2026 — the most lightweight option for single-GPU setups.
Supports tools, vision, streaming, and token counting. Up to 35% faster with NVFP4/FP8 quantization on NVIDIA GPUs. See the Hugging Face walkthrough for setup details.
.env
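A minimal llama.cpp sketch. The GGUF path, port, and model name are illustrative:

```shell
# Start llama-server with a local GGUF file (path and model name illustrative)
llama-server -m ./models/qwen3.5-35b-a3b-q8_0.gguf --port 8080

# .env
ANTHROPIC_BASE_URL=http://localhost:8080
ANTHROPIC_AUTH_TOKEN=local
ANTHROPIC_MODEL=qwen3.5-35b-a3b
```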
Bifrost proxy
If your inference server only speaks OpenAI Chat Completions, route through a Bifrost gateway — it translates to Anthropic format automatically with sub-millisecond overhead, load balancing, and a built-in web UI.
Bifrost is open source (Apache 2.0) and supports OpenAI, Ollama, and vLLM backends among others.
.env
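A sketch pointing the bot at a local Bifrost instance. The port and route are assumptions — check Bifrost's docs for the exact address:

```shell
# Route requests through the Bifrost gateway instead of the backend directly
# (port and path illustrative; confirm in Bifrost's docs)
ANTHROPIC_BASE_URL=http://localhost:8080/anthropic
ANTHROPIC_AUTH_TOKEN=bifrost
```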
Recommended models for self-hosting
Pick based on your hardware. For reliable tool calling, use high-quality quantizations (q8 or fp16 — these preserve the precision models need for structured output like tool calls). Models marked “MoE” (Mixture of Experts) only activate a fraction of their total parameters per request — so an 80B model with 3B active runs on hardware sized for 3B, not 80B.
| Model | Ollama name | Active params | Hardware | Tool use | License |
|---|---|---|---|---|---|
| Qwen3.5 | qwen3.5:2b | 2.3B | 8GB+ VRAM | Good — tool calling + multimodal | Apache 2.0 |
| Qwen3-Coder-Next | — | 3B (80B MoE) | 32GB+ VRAM | Excellent — SWE-Bench 70.6 | Apache 2.0 |
| Qwen3.5-35B-A3B | qwen3.5:latest | 3B (35B MoE) | 32GB VRAM | Good — 1M context | Apache 2.0 |
| GLM-4.7-Flash | — | 30B | 24GB+ VRAM | Good — interleaved reasoning | Open weights |
| DeepSeek V3.2 (32B distill) | deepseek-v3.2 | 32B | 24GB+ VRAM | Good — thinking + tool use | MIT |
| GLM-5 | — | 40B (744B MoE) | Multi-GPU cluster | Excellent — frontier open-source | MIT |
| MiniMax M2.5 | — | Large MoE | Multi-GPU cluster | Excellent — SWE-Bench 80.2% | Modified MIT |
| Kimi K2.5 | — | 32B (1T MoE) | Multi-GPU cluster | Excellent — 1,500 tool calls | Modified MIT |
The Ollama name column shows the tag to use with ollama pull. Models marked “—” are not yet available on Ollama — use vLLM or llama.cpp with the HuggingFace model ID instead. Check ollama.com/search for current availability.
Model version pinning
By default, model aliases (opus, sonnet, haiku) resolve to the latest
version. Pin specific versions with these environment variables:
| Variable | Description |
|---|---|
| ANTHROPIC_DEFAULT_OPUS_MODEL | Pin the opus alias |
| ANTHROPIC_DEFAULT_SONNET_MODEL | Pin the sonnet alias |
| ANTHROPIC_DEFAULT_HAIKU_MODEL | Pin the haiku alias |
These variables also accept third-party model names (e.g. glm-4.7, deepseek-chat, kimi-k2.5).
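For example, pinning haiku to a dated release and mapping sonnet to a third-party model:

```shell
# Pin haiku to a dated build instead of "latest"
ANTHROPIC_DEFAULT_HAIKU_MODEL=claude-haiku-4-5-20251001
# Aliases can also map to third-party model names
ANTHROPIC_DEFAULT_SONNET_MODEL=glm-4.7
```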
Claude model IDs by provider
| Model | Anthropic API | Amazon Bedrock | Google Vertex AI |
|---|---|---|---|
| Opus 4.6 | claude-opus-4-6 | us.anthropic.claude-opus-4-6-v1:0 | claude-opus-4-6 |
| Sonnet 4.6 | claude-sonnet-4-6 | us.anthropic.claude-sonnet-4-6-v1:0 | claude-sonnet-4-6 |
| Haiku 4.5 | claude-haiku-4-5-20251001 | us.anthropic.claude-haiku-4-5-20251001-v1:0 | claude-haiku-4-5@20251001 |
.env (Bedrock example)
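For instance, selecting the dated Haiku build from the table on Bedrock (combine with the Bedrock variables under Advanced provider options):

```shell
CLAUDE_CODE_USE_BEDROCK=1
AWS_REGION=us-east-1
ANTHROPIC_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0
```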
Per-routine model override
Background routines can override the model in their YAML frontmatter:
routines/quick-email-check.md
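A minimal sketch of such a routine file — model is the documented field; the routine body is illustrative:

```yaml
---
model: haiku
---
Check the inbox and flag anything urgent.
```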
The model field accepts aliases (opus, sonnet, haiku) and only applies to background routines. See Routines for all frontmatter fields.
Choosing a model
For most ollim-bot use, Sonnet 4.6 handles tool calling, scheduling, and conversation as well as Opus 4.6 — at 40% of the cost. Opus pulls ahead on deep reasoning and complex multi-step debugging. Haiku 4.5 is ideal for lightweight background routines where speed matters more than depth.
| Model | Best for | Tool calling | Agentic coding | Deep reasoning | Speed | Cost |
|---|---|---|---|---|---|---|
| Opus 4.6 | Complex multi-step tasks, novel problem-solving | Excellent | 65.4% Terminal-Bench | 91.3% GPQA | Slowest | $$$ |
| Sonnet 4.6 | Daily conversations, routines, most agentic work | Excellent | 59.1% Terminal-Bench | 74.1% GPQA | Fast | $$ |
| Haiku 4.5 | Background routines, email triage, quick checks | Good | 41.8% Terminal-Bench | — | Fastest | $ |
Haiku has a 200k context window (vs. 1M for Sonnet and Opus). If your main session exceeds 200k tokens, the bot automatically upgrades interactive forks to sonnet to avoid failures. This makes haiku best suited for short-lived background routines rather than long interactive sessions.
Full agentic benchmark comparison
All scores use extended/adaptive thinking unless noted. Benchmarks are selected for relevance to agentic tool-calling bots like ollim-bot.
Key pattern: Sonnet 4.6 matches or beats Opus on practical tool calling (tau2-bench, MCP Atlas, Finance Agent, GDPval-AA). Opus leads on deep reasoning (GPQA, ARC-AGI-2, Humanity’s Last Exam) and long-context retrieval — tasks that matter for complex debugging, not typical daily bot interactions.
Haiku 4.5 achieves 73.3% on SWE-bench Verified — matching Claude Sonnet 4.5 — at one-third the cost and 4-5x the speed. It reaches ~90% of Sonnet 4.5’s agentic coding performance per Augment’s evaluation.
Sources: Anthropic Opus 4.6, Anthropic Sonnet 4.6, Anthropic Haiku 4.5, Vellum benchmarks, Anthropic model overview. Scores current as of February 2026.
| Benchmark | What it measures | Opus 4.6 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|---|
| SWE-bench Verified | Real-world software engineering | 80.8% | 79.6% | 73.3% |
| Terminal-Bench 2.0 | Agentic CLI coding | 65.4% | 59.1% | 41.8% |
| tau2-bench Retail | Multi-step tool calling (retail) | 91.9% | 91.7% | — |
| tau2-bench Telecom | Multi-step tool calling (telecom) | 99.3% | 97.9% | — |
| OSWorld | Agentic computer use | 72.7% | 72.5% | 22.0% |
| MCP Atlas | Scaled tool use | 59.5% | 61.3% | — |
| GDPval-AA (Elo) | Economically valuable knowledge work | 1606 | 1633 | — |
| Finance Agent | Financial tool use | 60.7% | 63.3% | — |
| ARC-AGI-2 | Novel problem-solving | 68.8% | 58.3% | — |
| GPQA Diamond | Graduate-level scientific reasoning | 91.3% | 74.1% | — |
| Humanity’s Last Exam | Hardest questions (with tools) | 53.1% | 19.1% | — |
| BrowseComp | Web search and information discovery | 84.0% | — | — |
| MRCR v2 8-needle @ 1M | Long-context retrieval accuracy | 76.0% | — | — |
Claude pricing
For most users, a Claude subscription costs less than API pay-as-you-go. The average Claude Code developer uses the equivalent of $130/month in API tokens — covered by a $20 Pro plan.
| Plan | Cost | Default model | Opus access | Rate limits |
|---|---|---|---|---|
| Pro | $20/mo | Sonnet 4.6 | Available (with fallback) | ~45 msgs/5hr |
| Max 5x | $100/mo | Opus 4.6 | Default | 5x Pro |
| Max 20x | $200/mo | Opus 4.6 | Default | 20x Pro |
On the Pro plan, Claude Code may fall back from Opus to Sonnet when you
hit a usage threshold. The exact limit is not published. Max plans have
higher thresholds — Max 20x rarely triggers fallback.
API token pricing and breakeven analysis
Per million tokens (standard on-demand):
| Model | Input | Output | Cache read (90% off) | Batch (50% off) |
|---|---|---|---|---|
| Opus 4.6 | $5.00 | $25.00 | $0.50 in | $12.50 out |
| Sonnet 4.6 | $3.00 | $15.00 | $0.30 in | $7.50 out |
| Haiku 4.5 | $1.00 | $5.00 | $0.10 in | $2.50 out |
Extended thinking tokens are billed at output token rates. Long context (>200K input) doubles the input cost and adds 50% to the output cost.
Breakeven analysis (assuming 3:1 input-to-output ratio):
| Plan | Monthly cost | Breakeven on Sonnet | Breakeven on Opus |
|---|---|---|---|
| Pro | $20 | ~3.3M tokens/month | ~2M tokens/month |
| Max 5x | $100 | ~16.7M tokens/month | ~10M tokens/month |
| Max 20x | $200 | ~33.3M tokens/month | ~20M tokens/month |
For context, Anthropic reports the average Claude Code developer uses $6/day ($130/month) in API-equivalent costs, and the 90th percentile is under $12/day (~$260/month). Pro at $20/month covers what would be $130+ on the API — a subscription is the clear winner for regular use.
API pay-as-you-go only wins at very low usage (under ~3M tokens/month on Sonnet) or when you need guaranteed access without rate limit resets.
Source: Anthropic API pricing, Claude Code costs.
Advanced provider options
If you need pay-as-you-go API billing, cloud provider infrastructure, or a custom LLM gateway, these options are available but require more setup.
Anthropic API key
For pay-as-you-go billing instead of a subscription, set ANTHROPIC_API_KEY in your .env. This bypasses Claude Code OAuth entirely. You pay per token at Anthropic’s API rates.
.env
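A one-line sketch (key value is a placeholder):

```shell
# Pay-as-you-go: an Anthropic API key replaces Claude Code OAuth
ANTHROPIC_API_KEY=sk-ant-your-key-here
```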
Amazon Bedrock
Set these environment variables in your .env file:
| Variable | Required | Description |
|---|---|---|
| CLAUDE_CODE_USE_BEDROCK | Yes | Set to 1 to enable Bedrock |
| AWS_REGION | Yes | AWS region (e.g. us-east-1) — not read from .aws config |
| AWS_ACCESS_KEY_ID | Conditional | AWS access key (one auth method required) |
| AWS_SECRET_ACCESS_KEY | Conditional | AWS secret key |
| AWS_SESSION_TOKEN | No | Session token for temporary credentials |
| AWS_PROFILE | Conditional | AWS SSO profile name (alternative to access keys) |
.env
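A sketch using static access keys (values are placeholders; SSO users can set AWS_PROFILE instead):

```shell
CLAUDE_CODE_USE_BEDROCK=1
AWS_REGION=us-east-1
# One auth method is required — access keys shown here
AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
```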
Bedrock supports five authentication methods: AWS CLI config, environment variable access keys, SSO profiles, Management Console credentials, and Bedrock API keys (AWS_BEARER_TOKEN_BEDROCK).
IAM permissions required: bedrock:InvokeModel, bedrock:InvokeModelWithResponseStream, bedrock:ListInferenceProfiles.
For full IAM policy details, credential chain options, and guardrail configuration, see the Claude Code Bedrock docs.
Google Vertex AI
Set these environment variables in your .env file:
| Variable | Required | Description |
|---|---|---|
| CLAUDE_CODE_USE_VERTEX | Yes | Set to 1 to enable Vertex AI |
| CLOUD_ML_REGION | Yes | GCP region (e.g. us-east5) or global |
| ANTHROPIC_VERTEX_PROJECT_ID | Yes | Your GCP project ID |
| GOOGLE_APPLICATION_CREDENTIALS | No | Path to service account JSON (alternative to gcloud auth) |
.env
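A minimal sketch (project ID is a placeholder):

```shell
CLAUDE_CODE_USE_VERTEX=1
CLOUD_ML_REGION=us-east5
ANTHROPIC_VERTEX_PROJECT_ID=your-gcp-project-id
```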
Authenticate with gcloud auth application-default login or provide a service account key via GOOGLE_APPLICATION_CREDENTIALS.
IAM role required: roles/aiplatform.user
Model access approval on Vertex AI can take 24–48 hours. Not all models are available in all regions.
For full GCP setup, region-specific configuration, and credential details, see the Claude Code Vertex AI docs.
Custom LLM gateway
Point ollim-bot at any endpoint that implements the Anthropic Messages API — a Bifrost gateway, vLLM, or your own gateway.
| Variable | Required | Description |
|---|---|---|
| ANTHROPIC_BASE_URL | Yes | Base URL for the Messages API endpoint |
| ANTHROPIC_AUTH_TOKEN | No | Static API key sent as Authorization header |
.env
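A sketch with a hypothetical internal gateway address:

```shell
# Hypothetical internal gateway URL — replace with your own
ANTHROPIC_BASE_URL=https://llm-gateway.internal.example.com
# Optional: omit if your gateway handles auth upstream
ANTHROPIC_AUTH_TOKEN=your-gateway-key
```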
The gateway must expose /v1/messages and forward the anthropic-beta and anthropic-version headers. For general LLM gateway setup, see the Claude Code LLM gateway docs.
Cross-provider pricing
Global endpoint pricing is identical across Anthropic API, Amazon Bedrock, and Google Vertex AI — no markup. Regional endpoints add a 10% premium for data residency compliance.
Per million tokens (global endpoints):
| Model | Anthropic API | Bedrock | Vertex AI | Regional (+10%) |
|---|---|---|---|---|
| Opus 4.6 input | $5.00 | $5.00 | $5.00 | $5.50 |
| Opus 4.6 output | $25.00 | $25.00 | $25.00 | $27.50 |
| Sonnet 4.6 input | $3.00 | $3.00 | $3.00 | $3.30 |
| Sonnet 4.6 output | $15.00 | $15.00 | $15.00 | $16.50 |
| Haiku 4.5 input | $1.00 | $1.00 | $1.00 | $1.10 |
| Haiku 4.5 output | $5.00 | $5.00 | $5.00 | $5.50 |
Feature availability:
| Feature | Anthropic API | Bedrock | Vertex AI |
|---|---|---|---|
| Prompt caching | Yes | Yes | Yes |
| Batch API (50% off) | Yes | Yes | Yes |
| Extended thinking | Yes | Yes | Yes |
| Fast mode (6x pricing) | Yes | Not confirmed | Not confirmed |
| 1M context (beta) | Yes | Verify | Verify |
| New model availability | First | Delayed | Delayed |
| Provisioned throughput | No | Yes | Yes |
Choose based on your infrastructure, not pricing — the per-token cost is the same. Bedrock and Vertex AI add value through IAM integration, compliance frameworks, and provisioned throughput for predictable workloads. The Anthropic API gets new features and models first.
Source: Anthropic pricing, Bedrock pricing, Vertex AI pricing. Pricing current as of February 2026.
Next steps
Configuration reference
All environment variables and configuration options.
Self-host ollim-bot
Fork, configure, and deploy your own instance.
Routines
Per-routine model overrides and background fork configuration.
Slash commands
The /model command and other runtime controls.
