ollim-bot authenticates through Claude Code OAuth by default — no API key, no provider config, nothing to set up. Your Claude subscription handles everything. If you want to experiment with other models or reduce costs, swap providers with a couple of environment variables in your .env file.
| I want to… | Go to |
|---|---|
| Just use my Claude subscription (recommended) | Default: Claude subscription |
| Use a cheaper subscription from another provider | Alternative subscriptions |
| Pay per token instead of a flat subscription | Pay-per-token providers |
| Run models locally for full data sovereignty | Self-hosted models |
| Use Bedrock, Vertex AI, or a custom gateway | Advanced provider options |
| Pick the right Claude model for my use case | Choosing a model |

Default: Claude subscription

Out of the box, ollim-bot uses your Claude subscription via Claude Code OAuth. The model you get depends on your subscription tier:
| Subscription | Default model | Opus access |
|---|---|---|
| Pro | Sonnet 4.6 | Available (with fallback) |
| Max | Opus 4.6 | Default |
Switch models at runtime with the /model slash command in Discord:
/model opus
/model sonnet
/model haiku
The agent resolves aliases to the latest version automatically — sonnet currently maps to Sonnet 4.6, opus to Opus 4.6, haiku to Haiku 4.5.
Claude Code may fall back to Sonnet if you hit your Opus usage threshold on a subscription plan.

Alternative subscriptions

Don’t want a Claude subscription? Several providers offer their own coding subscriptions with Anthropic Messages API-compatible endpoints. Set two environment variables and ollim-bot uses their models instead — no code changes.
| Provider | Cost | Models | Notes |
|---|---|---|---|
| Z.AI | $3–49/mo | GLM-5, GLM-4.7 | Free tier (GLM-4.7-Flash). GLM-5 (744B/40B active MoE, MIT license) released Feb 2026 |
| Qwen | $10–50/mo | Qwen3.5, Qwen3-Coder-Next + others | Multi-model subscription. Qwen3.5 supports 1M context and multimodal |
| MiniMax | $10–150/mo | MiniMax M2.5 | SWE-Bench 80.2%, 100+ tok/s, $0.30/M input on API |
| Kimi | ~$7/week | Kimi K2.5 | 1T params (32B active MoE), agent swarm up to 100 sub-agents |
All of these use the same pattern — ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN in your .env file:
.env
ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic
ANTHROPIC_AUTH_TOKEN=your-zai-api-key
ANTHROPIC_DEFAULT_SONNET_MODEL=glm-4.7
ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.5-air
GLM-4.7 is roughly 5–7x cheaper than Claude Sonnet 4.6. Z.AI offers subscription plans (Lite $3/mo, Pro $15/mo, Max ~$60/mo) with prompt-based quotas, or pay-per-token. GLM-4.7-Flash and GLM-4.5-Flash are free.

GLM-5 (released February 2026) is their frontier model — 744B total / 40B active MoE, MIT license, $1.00/M input / $3.20/M output via API. It requires server-grade hardware for self-hosting.
.env
ANTHROPIC_BASE_URL=https://dashscope-intl.aliyuncs.com/apps/anthropic
ANTHROPIC_AUTH_TOKEN=your-dashscope-api-key
ANTHROPIC_MODEL=qwen3.5-plus
ANTHROPIC_SMALL_FAST_MODEL=qwen3.5-coder
One subscription ($10–50/mo) covers six models. Switch between Qwen3.5-Plus, Qwen3-Coder, GLM-4.7, Kimi-K2.5, and MiniMax M2.5 without changing providers.

Qwen3.5 (released February 2026) adds 1M context, native multimodal, and 201 languages. Qwen3-Coder-Next (80B/3B active MoE, Apache 2.0) scores 70.6 on SWE-Bench Verified — purpose-built for coding agents.
.env
ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
ANTHROPIC_AUTH_TOKEN=your-minimax-api-key
ANTHROPIC_MODEL=minimax-m2.5
ANTHROPIC_SMALL_FAST_MODEL=minimax-m2.5-lightning
Subscription tiers: Starter $10/mo (100 prompts/5h), Plus $20/mo (300/5h), Max $50/mo (1,000/5h). Throughput is 100+ tokens per second.

M2.5 (released February 2026) hits 80.2% on SWE-Bench Verified and completes tasks 37% faster than M2.1. API pricing: $0.30/M input, $1.20/M output. Open weights under Modified MIT.
.env
ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic
ANTHROPIC_AUTH_TOKEN=your-moonshot-api-key
ANTHROPIC_MODEL=kimi-k2.5
ANTHROPIC_SMALL_FAST_MODEL=kimi-k2
Weekly membership at ~$7/week with 300–1,200 API calls per 5-hour window. 256K context window and 100 tokens/second output speed.

K2.5 (released January 2026) is a 1T parameter / 32B active MoE with native multimodal and an agent swarm that runs up to 100 sub-agents in parallel. Open weights under Modified MIT.

Pay-per-token providers

If you prefer paying for what you use instead of a flat subscription:
| Provider | Input cost | Output cost | Models | Notes |
|---|---|---|---|---|
| DeepSeek | $0.28/1M | $0.42/1M | DeepSeek V3.2 | Cheapest option. Thinking integrated with tool use |
| OpenRouter | Varies | Varies | 400+ models | Gateway with unified billing. 24 free models, openrouter/free auto-router |
.env
ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
ANTHROPIC_AUTH_TOKEN=your-deepseek-api-key
ANTHROPIC_MODEL=deepseek-chat
ANTHROPIC_SMALL_FAST_MODEL=deepseek-chat
DeepSeek is the cheapest per-token option — roughly 10x cheaper than Claude Sonnet 4.6 — and no subscription is required. V3.2 is the first model to integrate thinking directly into tool use, supporting both thinking and non-thinking modes. Image input is not supported through the Anthropic compatibility endpoint.

Open weights (MIT license) — distilled 32B variants are available for single-GPU self-hosting via Ollama (ollama pull deepseek-v3.2).
.env
ANTHROPIC_BASE_URL=https://openrouter.ai/api
ANTHROPIC_AUTH_TOKEN=sk-or-v1-your-key
ANTHROPIC_API_KEY=
Gateway to 400+ models with unified billing. 24 models are free without a credit card, and openrouter/free auto-routes to a compatible free model. The empty ANTHROPIC_API_KEY= prevents Claude Code from authenticating directly with Anthropic. Only Claude models are guaranteed to work — non-Claude models require a translation proxy.
Alternative models are community-supported and not tested by Anthropic. ollim-bot’s agentic loop depends on reliable tool use — test thoroughly before relying on a non-Claude model for daily operations. Provider endpoints and pricing may change without notice.

Self-hosted models

Run models locally for full data sovereignty — no tokens leave your network. As of early 2026, all three major inference backends natively support the Anthropic Messages API with tool calling. All self-hosted setups use the same .env pattern:
| Variable | Purpose |
|---|---|
| ANTHROPIC_BASE_URL | Points ollim-bot at your local inference server instead of Anthropic’s API |
| ANTHROPIC_AUTH_TOKEN | Any non-empty string — local backends require the header but don’t validate it. This also tells the Agent SDK to skip Claude OAuth, so no Anthropic account is needed. |
| ANTHROPIC_MODEL | The model name your backend serves (must match exactly) |
| ANTHROPIC_SMALL_FAST_MODEL | Model used for lightweight tasks like subagent work and background routines. Can be the same model or a smaller/faster one. |
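Before starting the bot, it can help to confirm all four variables are actually set. The helper below is a hypothetical sketch (not part of ollim-bot) that scans a .env file for the variables in the table above:

```python
# Hypothetical helper: report which self-hosted variables are missing from .env.
from pathlib import Path

REQUIRED = [
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_MODEL",
    "ANTHROPIC_SMALL_FAST_MODEL",
]

def missing_vars(env_path: str = ".env") -> list[str]:
    """Return the required variables not set (or set to empty) in the .env file."""
    present = set()
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            if value.strip():  # an empty value does not count as set
                present.add(key.strip())
    return [v for v in REQUIRED if v not in present]
```

Running `missing_vars()` against an incomplete .env returns the names still to fill in.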
Ollama (v0.17+) runs open models locally with a native Anthropic-compatible endpoint, tool calling, and streaming.
curl -fsSL https://ollama.com/install.sh -o /tmp/ollama-install.sh
less /tmp/ollama-install.sh   # inspect the script first
sh /tmp/ollama-install.sh
Pull a model before starting the bot — the bot reports a model error if it starts before the pull finishes:
ollama pull qwen3.5:2b
curl http://localhost:11434/
If you used Docker, prefix commands with docker exec ollama:
docker exec ollama ollama pull qwen3.5:2b
Then add these variables to your .env:
.env
ANTHROPIC_BASE_URL=http://localhost:11434
ANTHROPIC_AUTH_TOKEN=ollama
ANTHROPIC_MODEL=qwen3.5:2b
ANTHROPIC_SMALL_FAST_MODEL=qwen3.5:2b
Ollama model names use the Ollama registry format (e.g., qwen3.5:2b, qwen3.5:latest) — not HuggingFace model IDs. Browse available models at ollama.com/search.
Ollama can be reachable beyond localhost depending on how it is installed. To keep it local-only, set OLLAMA_HOST=127.0.0.1 in your environment for a native install, or bind the published port to loopback in Docker (e.g. -p 127.0.0.1:11434:11434).
v0.17 (February 2026) ships a new inference engine with up to 40% faster prompt processing, improved multi-GPU tensor parallelism, and better KV cache management for long conversations.

Tool use works well with larger models — expect tinkering with smaller ones. Local inference is still slower than cloud providers, so it is not recommended as the primary backend for a bot that needs sub-second response times.

Once your .env is configured, return to step 6 of the quickstart to start the bot.
vLLM (v0.16+) exposes a native Anthropic /v1/messages endpoint with tool calling — the best option for production multi-GPU deployments:
.env
ANTHROPIC_BASE_URL=http://localhost:8000
ANTHROPIC_AUTH_TOKEN=vllm
ANTHROPIC_MODEL=qwen3-coder-next
ANTHROPIC_SMALL_FAST_MODEL=qwen3-coder
v0.16 (February 2026) adds async scheduling with pipeline parallelism for ~31% throughput improvement. See the vLLM Claude Code integration docs for full setup.
llama.cpp server added Anthropic Messages API support in January 2026 — the most lightweight option for single-GPU setups:
.env
ANTHROPIC_BASE_URL=http://localhost:8080
ANTHROPIC_AUTH_TOKEN=llamacpp
ANTHROPIC_MODEL=qwen3-coder-next
ANTHROPIC_SMALL_FAST_MODEL=qwen3-coder
Supports tools, vision, streaming, and token counting. Up to 35% faster with NVFP4/FP8 quantization on NVIDIA GPUs. See the Hugging Face walkthrough for setup details.
If your inference server only speaks OpenAI Chat Completions, route through a Bifrost gateway — it translates to Anthropic format automatically with sub-millisecond overhead, load balancing, and a built-in web UI.
# Start Bifrost (Docker or npx)
docker run -p 8080:8080 maximhq/bifrost
# or: npx -y @maximhq/bifrost
.env
ANTHROPIC_BASE_URL=http://localhost:8080/anthropic
ANTHROPIC_AUTH_TOKEN=bifrost
ANTHROPIC_MODEL=your-model-name
ANTHROPIC_SMALL_FAST_MODEL=your-model-name
Bifrost is open source (Apache 2.0) and supports OpenAI, Ollama, and vLLM backends among others.
Pick based on your hardware. For reliable tool calling, use high-quality quantizations (q8 or fp16 — these preserve the precision models need for structured output like tool calls).
Most self-hosted models require a dedicated GPU with at least 24GB VRAM. Smaller models like Qwen3.5 (2B) can run on consumer GPUs (8GB+) but with reduced tool-calling reliability. If you don’t have the hardware, consider alternative subscriptions or pay-per-token providers instead.
Models marked “MoE” (Mixture of Experts) activate only a fraction of their parameters per token. The weights for all experts must still fit in memory (often via quantization or CPU offload), but inference speed tracks the active parameter count — an 80B model with 3B active generates tokens at roughly 3B-model speed.
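A rough sizing sketch makes the trade-off concrete. This is illustrative only — it assumes all expert weights resident at a uniform quantization and ignores KV cache and activation memory:

```python
# Illustrative MoE sizing (assumption: all expert weights resident in memory).
def weight_memory_gb(total_params_billion: float, bits_per_param: int = 4) -> float:
    """Approximate resident weight memory for the full parameter count."""
    return total_params_billion * bits_per_param / 8

def compute_ratio_vs_dense(total_params_billion: float, active_params_billion: float) -> float:
    """Per-token compute scales with active params, not total."""
    return total_params_billion / active_params_billion

# Qwen3-Coder-Next from the table: 80B total, 3B active.
print(round(weight_memory_gb(80), 1))           # 40.0 GB of weights at 4-bit
print(round(compute_ratio_vs_dense(80, 3), 1))  # 26.7x less compute per token than dense 80B
```

So memory budgets follow the total parameter count while speed follows the active count — which is why the table's cluster-scale models are still fast once loaded.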
| Model | Ollama name | Active params | Hardware | Tool use | License |
|---|---|---|---|---|---|
| Qwen3.5 | qwen3.5:2b | 2.3B | 8GB+ VRAM | Good — tool calling + multimodal | Apache 2.0 |
| Qwen3-Coder-Next | — | 3B (80B MoE) | 32GB+ VRAM | Excellent — SWE-Bench 70.6 | Apache 2.0 |
| Qwen3.5-35B-A3B | qwen3.5:latest | 3B (35B MoE) | 32GB VRAM | Good — 1M context | Apache 2.0 |
| GLM-4.7-Flash | — | 30B | 24GB+ VRAM | Good — interleaved reasoning | Open weights |
| DeepSeek V3.2 (32B distill) | deepseek-v3.2 | 32B | 24GB+ VRAM | Good — thinking + tool use | MIT |
| GLM-5 | — | 40B (744B MoE) | Multi-GPU cluster | Excellent — frontier open-source | MIT |
| MiniMax M2.5 | — | Large MoE | Multi-GPU cluster | Excellent — SWE-Bench 80.2% | Modified MIT |
| Kimi K2.5 | — | 32B (1T MoE) | Multi-GPU cluster | Excellent — 1,500 tool calls | Modified MIT |
The Ollama name column shows the tag to use with ollama pull. Models marked “—” are not yet available on Ollama — use vLLM or llama.cpp with the HuggingFace model ID instead. Check ollama.com/search for current availability.
Self-hosting works well for data sovereignty and experimentation. Tool use in local models improved sharply in early 2026 with native Anthropic endpoints in Ollama and vLLM — but cloud providers still win on latency and reliability for a bot that needs to respond quickly throughout the day.

Model version pinning

By default, model aliases (opus, sonnet, haiku) resolve to the latest version. Pin specific versions with these environment variables:
| Variable | Description |
|---|---|
| ANTHROPIC_DEFAULT_OPUS_MODEL | Pin the opus alias |
| ANTHROPIC_DEFAULT_SONNET_MODEL | Pin the sonnet alias |
| ANTHROPIC_DEFAULT_HAIKU_MODEL | Pin the haiku alias |
These also work with alternative providers — set them to the provider’s model IDs (e.g. glm-4.7, deepseek-chat, kimi-k2.5).
| Model | Anthropic API | Amazon Bedrock | Google Vertex AI |
|---|---|---|---|
| Opus 4.6 | claude-opus-4-6 | us.anthropic.claude-opus-4-6-v1 | claude-opus-4-6 |
| Sonnet 4.6 | claude-sonnet-4-6 | us.anthropic.claude-sonnet-4-6 | claude-sonnet-4-6 |
| Haiku 4.5 | claude-haiku-4-5-20251001 | us.anthropic.claude-haiku-4-5-20251001-v1:0 | claude-haiku-4-5@20251001 |
.env (Bedrock example)
ANTHROPIC_DEFAULT_OPUS_MODEL=us.anthropic.claude-opus-4-6-v1
ANTHROPIC_DEFAULT_SONNET_MODEL=us.anthropic.claude-sonnet-4-6
ANTHROPIC_DEFAULT_HAIKU_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0
Pin all three models when using Bedrock or Vertex AI. Without pinning, aliases resolve to the latest version — which may not be available in your deployment yet.
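The resolution order can be sketched as follows — a pin variable wins over the built-in alias default. This is a simplified illustration, not Claude Code's actual implementation; the fallback IDs are the Anthropic API names from the table above:

```python
# Simplified sketch of alias resolution with version pinning (illustrative only).
import os

ALIAS_DEFAULTS = {
    "opus": "claude-opus-4-6",
    "sonnet": "claude-sonnet-4-6",
    "haiku": "claude-haiku-4-5-20251001",
}

def resolve_model(alias: str) -> str:
    """ANTHROPIC_DEFAULT_<ALIAS>_MODEL overrides the built-in default if set."""
    pinned = os.environ.get(f"ANTHROPIC_DEFAULT_{alias.upper()}_MODEL")
    return pinned or ALIAS_DEFAULTS[alias]

os.environ["ANTHROPIC_DEFAULT_SONNET_MODEL"] = "glm-4.7"
print(resolve_model("sonnet"))  # glm-4.7
print(resolve_model("opus"))    # claude-opus-4-6
```

This is also why the pin variables work with alternative providers: the resolved string is passed through to whatever endpoint ANTHROPIC_BASE_URL points at.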

Per-routine model override

Background routines can override the model in their YAML frontmatter:
routines/quick-email-check.md
---
id: "d1e2f3a4"
cron: "0 */3 * * *"
description: "Email check"
background: true
model: "haiku"
---
Check for new important emails. Save a summary to pending updates.
The model field accepts aliases (opus, sonnet, haiku) and only applies to background routines. See Routines for all frontmatter fields.

Choosing a model

For most ollim-bot use, Sonnet 4.6 handles tool calling, scheduling, and conversation as well as Opus 4.6 — at 40% of the cost. Opus pulls ahead on deep reasoning and complex multi-step debugging. Haiku 4.5 is ideal for lightweight background routines where speed matters more than depth.
| Model | Best for | Tool calling | Agentic coding | Deep reasoning | Speed | Cost |
|---|---|---|---|---|---|---|
| Opus 4.6 | Complex multi-step tasks, novel problem-solving | Excellent | 65.4% Terminal-Bench | 91.3% GPQA | Slowest | $$$ |
| Sonnet 4.6 | Daily conversations, routines, most agentic work | Excellent | 59.1% Terminal-Bench | 74.1% GPQA | Fast | $$ |
| Haiku 4.5 | Background routines, email triage, quick checks | Good | 41.8% Terminal-Bench | — | Fastest | $ |
Haiku has a 200k context window (vs. 1M for Sonnet and Opus). If your main session exceeds 200k tokens, the bot automatically upgrades interactive forks to sonnet to avoid failures. This makes haiku best suited for short-lived background routines rather than long interactive sessions.
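The upgrade behavior can be sketched like this — an illustration of the rule above, not ollim-bot's actual code:

```python
# Sketch of the automatic fork upgrade described above (illustrative only).
CONTEXT_WINDOWS = {"haiku": 200_000, "sonnet": 1_000_000, "opus": 1_000_000}

def fork_model(requested: str, session_tokens: int) -> str:
    """Upgrade an interactive fork to sonnet when the session exceeds the
    requested model's context window."""
    if session_tokens > CONTEXT_WINDOWS[requested]:
        return "sonnet"
    return requested

print(fork_model("haiku", 250_000))  # sonnet — session too large for haiku
print(fork_model("haiku", 50_000))   # haiku
```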
Sonnet 4.6 is the sweet spot for ollim-bot. It matches Opus on tau2-bench tool calling (91.7% vs 91.9%), beats it on knowledge work tasks (GDPval-AA: 1633 vs 1606 Elo), and is 70% more token-efficient. It’s the default.
All scores use extended/adaptive thinking unless noted. Benchmarks are selected for relevance to agentic tool-calling bots like ollim-bot.
| Benchmark | What it measures | Opus 4.6 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|---|
| SWE-bench Verified | Real-world software engineering | 80.8% | 79.6% | 73.3% |
| Terminal-Bench 2.0 | Agentic CLI coding | 65.4% | 59.1% | 41.8% |
| tau2-bench Retail | Multi-step tool calling (retail) | 91.9% | 91.7% | — |
| tau2-bench Telecom | Multi-step tool calling (telecom) | 99.3% | 97.9% | — |
| OSWorld | Agentic computer use | 72.7% | 72.5% | 22.0% |
| MCP Atlas | Scaled tool use | 59.5% | 61.3% | — |
| GDPval-AA (Elo) | Economically valuable knowledge work | 1606 | 1633 | — |
| Finance Agent | Financial tool use | 60.7% | 63.3% | — |
| ARC-AGI-2 | Novel problem-solving | 68.8% | 58.3% | — |
| GPQA Diamond | Graduate-level scientific reasoning | 91.3% | 74.1% | — |
| Humanity’s Last Exam | Hardest questions (with tools) | 53.1% | 19.1% | — |
| BrowseComp | Web search and information discovery | 84.0% | — | — |
| MRCR v2 8-needle @ 1M | Long-context retrieval accuracy | 76.0% | — | — |
Key pattern: Sonnet 4.6 matches or beats Opus on practical tool calling (tau2-bench, MCP Atlas, Finance Agent, GDPval-AA). Opus leads on deep reasoning (GPQA, ARC-AGI-2, Humanity’s Last Exam) and long-context retrieval — tasks that matter for complex debugging, not typical daily bot interactions.

Haiku 4.5 achieves 73.3% on SWE-bench Verified — matching Claude Sonnet 4.5 — at one-third the cost and 4-5x the speed. It reaches ~90% of Sonnet 4.5’s agentic coding performance per Augment’s evaluation.

Sources: Anthropic Opus 4.6, Anthropic Sonnet 4.6, Anthropic Haiku 4.5, Vellum benchmarks, Anthropic model overview. Scores current as of February 2026.

Claude pricing

For most users, a Claude subscription costs less than API pay-as-you-go. The average Claude Code developer uses the equivalent of $130/month in API tokens — covered by a $20 Pro plan.
| Plan | Cost | Default model | Opus access | Rate limits |
|---|---|---|---|---|
| Pro | $20/mo | Sonnet 4.6 | Available (with fallback) | ~45 msgs/5hr |
| Max 5x | $100/mo | Opus 4.6 | Default | 5x Pro |
| Max 20x | $200/mo | Opus 4.6 | Default | 20x Pro |
On the Pro plan, Claude Code may fall back from Opus to Sonnet when you hit a usage threshold. The exact limit is not published. Max plans have higher thresholds — Max 20x rarely triggers fallback.
Per million tokens (standard on-demand):
| Model | Input | Output | Cache read (90% off) | Batch (50% off) |
|---|---|---|---|---|
| Opus 4.6 | $5.00 | $25.00 | $0.50 | $2.50 in / $12.50 out |
| Sonnet 4.6 | $3.00 | $15.00 | $0.30 | $1.50 in / $7.50 out |
| Haiku 4.5 | $1.00 | $5.00 | $0.10 | $0.50 in / $2.50 out |
Extended thinking tokens are billed at output token rates. Long context (>200K input) doubles the input cost and adds 50% to the output cost.

Breakeven analysis (assuming a 3:1 input-to-output ratio):
| Plan | Monthly cost | Breakeven on Sonnet | Breakeven on Opus |
|---|---|---|---|
| Pro | $20 | ~3.3M tokens/month | ~2M tokens/month |
| Max 5x | $100 | ~16.7M tokens/month | ~10M tokens/month |
| Max 20x | $200 | ~33.3M tokens/month | ~20M tokens/month |
For context, Anthropic reports the average Claude Code developer uses $6/day ($130/month) in API-equivalent costs, and the 90th percentile is under $12/day (~$260/month). Pro at $20/month covers what would be $130+ on the API — a subscription is the clear winner for regular use.

API pay-as-you-go only wins at very low usage (under ~3M tokens/month on Sonnet) or when you need guaranteed access without rate limit resets.

Source: Anthropic API pricing, Claude Code costs.
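The breakeven figures above follow from a blended per-token rate at the assumed 3:1 input-to-output mix:

```python
# Reproduce the breakeven table (3:1 input:output, standard on-demand rates).
RATES = {  # $ per million tokens: (input, output)
    "sonnet": (3.00, 15.00),
    "opus": (5.00, 25.00),
}

def blended_rate(model: str) -> float:
    """Cost per million tokens at a 3:1 input-to-output mix."""
    inp, out = RATES[model]
    return (3 * inp + out) / 4

def breakeven_millions(plan_cost: float, model: str) -> float:
    """Monthly token volume (in millions) where API cost matches the plan."""
    return plan_cost / blended_rate(model)

print(round(breakeven_millions(20, "sonnet"), 1))   # 3.3 — Pro plan on Sonnet
print(round(breakeven_millions(20, "opus"), 1))     # 2.0 — Pro plan on Opus
print(round(breakeven_millions(100, "sonnet"), 1))  # 16.7 — Max 5x on Sonnet
```

A different input-to-output ratio shifts the blended rate, so treat these numbers as rough guides rather than exact thresholds.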

Advanced provider options

If you need pay-as-you-go API billing, cloud provider infrastructure, or a custom LLM gateway, these options are available but require more setup.
For pay-as-you-go billing instead of a subscription:
.env
ANTHROPIC_API_KEY=sk-ant-...
This bypasses Claude Code OAuth entirely. You pay per token at Anthropic’s API rates.
Set these environment variables in your .env file:
| Variable | Required | Description |
|---|---|---|
| CLAUDE_CODE_USE_BEDROCK | Yes | Set to 1 to enable Bedrock |
| AWS_REGION | Yes | AWS region (e.g. us-east-1) — not read from .aws config |
| AWS_ACCESS_KEY_ID | Conditional | AWS access key (one auth method required) |
| AWS_SECRET_ACCESS_KEY | Conditional | AWS secret key |
| AWS_SESSION_TOKEN | No | Session token for temporary credentials |
| AWS_PROFILE | Conditional | AWS SSO profile name (alternative to access keys) |
.env
CLAUDE_CODE_USE_BEDROCK=1
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
Bedrock supports five authentication methods: AWS CLI config, environment variable access keys, SSO profiles, Management Console credentials, and Bedrock API keys (AWS_BEARER_TOKEN_BEDROCK).
AWS_REGION is required and is not read from your AWS CLI configuration. Always set it explicitly.
IAM permissions required: bedrock:InvokeModel, bedrock:InvokeModelWithResponseStream, bedrock:ListInferenceProfiles.

For full IAM policy details, credential chain options, and guardrail configuration, see the Claude Code Bedrock docs.
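A minimal policy sketch covering those three actions might look like the following (an illustration only — scope Resource down to specific inference profiles in production):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
        "bedrock:ListInferenceProfiles"
      ],
      "Resource": "*"
    }
  ]
}
```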
Set these environment variables in your .env file:
| Variable | Required | Description |
|---|---|---|
| CLAUDE_CODE_USE_VERTEX | Yes | Set to 1 to enable Vertex AI |
| CLOUD_ML_REGION | Yes | GCP region (e.g. us-east5) or global |
| ANTHROPIC_VERTEX_PROJECT_ID | Yes | Your GCP project ID |
| GOOGLE_APPLICATION_CREDENTIALS | No | Path to service account JSON (alternative to gcloud auth) |
.env
CLAUDE_CODE_USE_VERTEX=1
CLOUD_ML_REGION=us-east5
ANTHROPIC_VERTEX_PROJECT_ID=my-project-id
Authenticate with gcloud auth application-default login or provide a service account key via GOOGLE_APPLICATION_CREDENTIALS.

IAM role required: roles/aiplatform.user
Model access approval on Vertex AI can take 24–48 hours. Not all models are available in all regions.
For full GCP setup, region-specific configuration, and credential details, see the Claude Code Vertex AI docs.
Point ollim-bot at any endpoint that implements the Anthropic Messages API — a Bifrost gateway, vLLM, or your own gateway.
| Variable | Required | Description |
|---|---|---|
| ANTHROPIC_BASE_URL | Yes | Base URL for the Messages API endpoint |
| ANTHROPIC_AUTH_TOKEN | No | Static API key sent as Authorization header |
.env
ANTHROPIC_BASE_URL=https://your-gateway:8080/anthropic
ANTHROPIC_AUTH_TOKEN=your-gateway-key
The gateway must expose /v1/messages and forward the anthropic-beta and anthropic-version headers. For general LLM gateway setup, see the Claude Code LLM gateway docs.
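To illustrate, here is the shape of a minimal Messages API request a compliant gateway must accept. The URL, key, and model are placeholders — in practice they come from your .env:

```python
# Shape of a minimal Anthropic Messages API request (placeholder values).
import json

url = "https://your-gateway:8080/anthropic/v1/messages"
headers = {
    "authorization": "Bearer your-gateway-key",
    "anthropic-version": "2023-06-01",  # the gateway must forward this upstream
    "content-type": "application/json",
}
body = json.dumps({
    "model": "claude-sonnet-4-6",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "ping"}],
})
print(json.loads(body)["model"])  # claude-sonnet-4-6
```

A gateway that returns a valid message object for a request of this shape will generally work with ollim-bot's agentic loop.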
Global endpoint pricing is identical across Anthropic API, Amazon Bedrock, and Google Vertex AI — no markup. Regional endpoints add a 10% premium for data residency compliance.

Per million tokens (global endpoints):
| Model | Anthropic API | Bedrock | Vertex AI | Regional (+10%) |
|---|---|---|---|---|
| Opus 4.6 input | $5.00 | $5.00 | $5.00 | $5.50 |
| Opus 4.6 output | $25.00 | $25.00 | $25.00 | $27.50 |
| Sonnet 4.6 input | $3.00 | $3.00 | $3.00 | $3.30 |
| Sonnet 4.6 output | $15.00 | $15.00 | $15.00 | $16.50 |
| Haiku 4.5 input | $1.00 | $1.00 | $1.00 | $1.10 |
| Haiku 4.5 output | $5.00 | $5.00 | $5.00 | $5.50 |
Feature availability:
| Feature | Anthropic API | Bedrock | Vertex AI |
|---|---|---|---|
| Prompt caching | Yes | Yes | Yes |
| Batch API (50% off) | Yes | Yes | Yes |
| Extended thinking | Yes | Yes | Yes |
| Fast mode (6x pricing) | Yes | Not confirmed | Not confirmed |
| 1M context (beta) | Yes | Verify | Verify |
| New model availability | First | Delayed | Delayed |
| Provisioned throughput | No | Yes | Yes |
Choose based on your infrastructure, not pricing — the per-token cost is the same. Bedrock and Vertex AI add value through IAM integration, compliance frameworks, and provisioned throughput for predictable workloads. The Anthropic API gets new features and models first.

Source: Anthropic pricing, Bedrock pricing, Vertex AI pricing. Pricing current as of February 2026.

Next steps

Configuration reference

All environment variables and configuration options.

Self-host ollim-bot

Fork, configure, and deploy your own instance.

Routines

Per-routine model overrides and background fork configuration.

Slash commands

The /model command and other runtime controls.