Running tests
No extra configuration is needed. The conftest.py sets default environment
variables (OLLIM_USER_NAME=TestUser, OLLIM_BOT_NAME=test-bot) so tests
run without a .env file.
Test philosophy
Tests verify real behavior against real data. The guiding principle: mock only what you cannot control.
What gets tested with real instances:
- Dataclass construction and field defaults (Routine, Reminder, BudgetState)
- File I/O: JSONL reading/writing, markdown parsing, roundtrip serialization
- State transitions: ping budget refill, session compaction, permission approval
- Configuration loading and validation
What gets mocked:
- Discord API calls (channel.send(), message objects): these require an active gateway connection
- Agent/Client creation in fork execution: these start real Claude API calls
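As a rough illustration of that principle, the sketch below exercises a real dataclass and its state transition while replacing only the Discord-facing call with a mock. BudgetSketch and ping are hypothetical stand-ins invented for this example; the project's BudgetState and its API are not shown in this guide and may look quite different.

```python
import asyncio
from dataclasses import dataclass
from unittest.mock import AsyncMock, MagicMock


@dataclass
class BudgetSketch:
    """Hypothetical stand-in; the real BudgetState fields are not shown here."""

    capacity: int = 3
    available: int = 3

    def spend(self) -> bool:
        # Real state transition: consume one ping if any are left.
        if self.available == 0:
            return False
        self.available -= 1
        return True


async def ping(channel, budget: BudgetSketch, text: str) -> bool:
    """Send text only while the budget allows it (hypothetical helper)."""
    if not budget.spend():
        return False
    await channel.send(text)  # needs a gateway connection in production
    return True


def test_ping_stops_when_budget_is_empty() -> None:
    budget = BudgetSketch(capacity=1, available=1)  # real object, real state
    channel = MagicMock()
    channel.send = AsyncMock()  # the one piece the test cannot control

    assert asyncio.run(ping(channel, budget, "first")) is True
    assert asyncio.run(ping(channel, budget, "second")) is False
    assert budget.available == 0
    channel.send.assert_awaited_once_with("first")
```

The budget logic runs for real, so a bug in spend() would fail the test; only the network boundary is faked.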
Test structure
Dependencies
| Package | Version | Purpose |
|---|---|---|
| pytest | >=9.0.2 | Test framework |
| pytest-asyncio | >=1.3.0 | Async test support |
| pytest-cov | >=7.0.0 | Coverage reporting |
All three packages are in the dev dependency group and are installed by uv sync.
The data_dir fixture
Every test that touches the filesystem uses the data_dir fixture from
conftest.py. It redirects all module-level path constants to a
tmp_path directory, so tests never read or write the real
~/.ollim-bot/. Each test gets an isolated temp
directory that is cleaned up automatically.
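A minimal sketch of how such a fixture could be written, assuming pytest's built-in monkeypatch and tmp_path fixtures. The storage namespace and its DATA_DIR constant are hypothetical stand-ins for a source module; the real conftest.py patches the project's own constants, whose names are not listed here.

```python
import types
from pathlib import Path

import pytest

# Hypothetical stand-in for a source module with a module-level path
# constant; the real constant names are not shown in this guide.
storage = types.SimpleNamespace(DATA_DIR=Path.home() / ".ollim-bot")


def redirect_paths(monkeypatch: pytest.MonkeyPatch, root: Path) -> Path:
    """Point the known path constants at root and return it."""
    monkeypatch.setattr(storage, "DATA_DIR", root)
    return root


@pytest.fixture
def data_dir(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> Path:
    # monkeypatch restores the original constants after each test.
    return redirect_paths(monkeypatch, tmp_path)


def test_writes_land_in_temp_dir(data_dir: Path) -> None:
    (storage.DATA_DIR / "state.json").write_text("{}", encoding="utf-8")
    assert (data_dir / "state.json").exists()
```

Splitting the patching into a plain helper is a choice made for this sketch; it keeps the redirection logic callable outside a pytest run.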
File organization
All tests live in tests/ as module-level functions, with no test classes. Each file maps to a source module:
| Test file | Source module |
|---|---|
| test_agent_streaming.py | agent_streaming.py |
| test_agent_tools.py | agent_tools.py |
| test_bot.py | bot.py |
| test_cli.py | main.py, routine_cmd.py, reminder_cmd.py |
| test_config.py | config.py |
| test_embeds.py | embeds.py, views.py |
| test_forks.py | forks.py, fork_state.py |
| test_formatting.py | formatting.py |
| test_inquiries.py | inquiries.py |
| test_permissions.py | permissions.py |
| test_ping_budget.py | ping_budget.py |
| test_reminders.py | scheduling/reminders.py |
| test_routines.py | scheduling/routines.py |
| test_runtime_config.py | runtime_config.py |
| test_scheduler_prompts.py | scheduling/preamble.py, prompts.py |
| test_sessions.py | sessions.py |
| test_skills.py | skills.py |
| test_storage.py | storage.py |
| test_stream_compact.py | streamer.py, sessions.py |
| test_streamer.py | streamer.py |
| test_subagents.py | subagents.py |
| test_tool_policy.py | tool_policy.py |
| test_tool_restrictions.py | agent_tools.py |
| test_webhook.py | webhook.py |
| evals/test_judge.py | evals/judge.py |
| evals/test_runner.py | evals/runner.py |
| evals/test_scenario.py | evals/scenario.py |
| evals/test_eval_cmd.py | evals/eval_cmd.py |
| evals/test_results.py | evals/results.py |
| test_behavior_prompts.py | ADHD prompt content regression tests |
| test_behavior_scenarios.py | ADHD multi-step interaction flow regression tests |
| test_behavior_context.py | ADHD context assembly regression tests |
ADHD behavior evals
The eval system tests whether the bot responds appropriately to simulated ADHD users. It runs multi-turn conversations between a Haiku-powered user-proxy (playing a simulated ADHD user) and the real bot agent, then scores the transcript with a Sonnet-powered LLM judge.
What it tests
Each scenario defines a persona (personality, goal, opening message, max turns) and criteria the judge scores on a 1-5 scale. Criteria are ADHD-specific behavioral anchors, such as whether the bot gives exactly one next step instead of an option dump, uses warm tone instead of productivity-coach language, and avoids adding cognitive load. There are 6 scenarios in src/ollim_bot/evals/cases/:
| Scenario ID | Goal type | What it tests |
|---|---|---|
| overwhelmed-by-tasks | file_created | Bot gives one clear next step, not a prioritized list |
| vague-routine-request | file_created | Bot guides a vague request into a concrete routine |
| proactive-morning-checkin | qualitative | Morning check-in feels like care, not noise |
| frustrated-nothing-works | qualitative | Bot handles frustration without dismissing or lecturing |
| reminder-negotiation | qualitative | Bot negotiates a missed commitment without guilt |
| context-recovery-after-gap | qualitative | Bot recovers context after the user left mid-conversation |
Scenarios with goal_type: file_created pass when the bot creates a routine
or reminder file. Scenarios with goal_type: qualitative always run to
max_turns and rely entirely on the judge’s scoring.
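To make the two goal types concrete, here is a hedged sketch of a scenario record and a stop rule. ScenarioSketch, its field names, and should_stop are all hypothetical; the real schema under the evals package may differ, and the idea that a file_created scenario ends as soon as the file exists is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ScenarioSketch:
    """Hypothetical shape; the real scenario schema may differ."""

    scenario_id: str
    goal_type: Literal["file_created", "qualitative"]
    opening_message: str
    max_turns: int
    criteria: tuple[str, ...]


def should_stop(scenario: ScenarioSketch, turn: int, file_created: bool) -> bool:
    """Decide whether the conversation loop can end after this turn."""
    if turn >= scenario.max_turns:
        return True
    # Assumption: file_created scenarios may end once the file exists;
    # qualitative scenarios always use every turn and defer to the judge.
    return scenario.goal_type == "file_created" and file_created


routine = ScenarioSketch(
    scenario_id="vague-routine-request",
    goal_type="file_created",
    opening_message="I want to be better at mornings",
    max_turns=4,
    criteria=("guides toward one concrete routine",),
)
```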
Running evals
Evals require Claude authentication and make real API calls, so they are not
part of the fast pytest suite. Run them separately when you want to validate
ADHD behavior quality.
ADHD behavior regression tests
Three test modules (33 tests, 441 lines) protect ADHD-specific behavior without requiring API calls:
- test_behavior_prompts.py: verifies prompt content includes required ADHD-specific instructions
- test_behavior_scenarios.py: tests multi-step interaction flows and scenario data model integrity
- test_behavior_context.py: tests context assembly for eval environments
These modules run as part of the regular pytest suite and catch regressions in
prompt content and scenario definitions.
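A prompt content regression test can be as simple as checking that required instructions still appear in the prompt text. The sketch below shows the general idea only; SYSTEM_PROMPT, REQUIRED_PHRASES, and missing_phrases are invented for illustration and do not reflect the project's actual prompts or test helpers.

```python
# Hypothetical prompt text and required phrases, for illustration only.
SYSTEM_PROMPT = (
    "Offer exactly one next step. "
    "Keep a warm tone. "
    "Do not add cognitive load."
)
REQUIRED_PHRASES = ("one next step", "warm tone", "cognitive load")


def missing_phrases(prompt: str, required: tuple[str, ...]) -> list[str]:
    """Return the required phrases that the prompt no longer contains."""
    lowered = prompt.lower()
    return [phrase for phrase in required if phrase.lower() not in lowered]


def test_prompt_keeps_required_instructions() -> None:
    # Asserting on the list of missing phrases makes failures self-describing.
    assert missing_phrases(SYSTEM_PROMPT, REQUIRED_PHRASES) == []
```

Because nothing here calls an API, a test like this can run on every commit.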
Counterfactual trajectory testing
The counterfactual command answers a different question from ADHD behavior
evals: “what would the agent have done here if I changed X?” It replays a
real production transcript up to a chosen point, applies an intervention, and
prints the new response next to the original so you can see exactly what
changed.
Use it when you want to validate a prompt edit, tool restriction, or model
swap against a session that already happened — without waiting for similar
behavior to recur in production.
| Tool | Question it answers | Input |
|---|---|---|
| ollim-bot eval | Does the agent handle simulated ADHD users well? | Synthetic scenarios |
| counterfactual | How would the agent have responded here with different settings? | A real past session and a rewind point |
The command uv tool install --editable . installs counterfactual alongside
ollim-bot as a separate top-level command. It is not an ollim-bot
subcommand.
How it works
counterfactual truncates the session’s JSONL transcript at the rewind point,
then uses the Claude Agent SDK’s fork feature to resume from the truncated
state with modified ClaudeAgentOptions.
It can run two forks in parallel — a baseline (same settings as the
original) and a variant (with the intervention) — so you can distinguish
sampling noise from the intervention’s effect. The truncated file and both
fork sessions are cleaned up after the run.
Source: src/ollim_bot/eval/counterfactual.py and
src/ollim_bot/eval/counterfactual_cli.py.
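The truncation step can be pictured with the sketch below. It is not the project's implementation: truncate_at is a hypothetical helper, the uuid and type field names are assumptions about the transcript records, and the choice to keep only the records before the rewind message is also an assumption. The sketch does not call the Claude Agent SDK and does not try to detect tool-result records.

```python
import json
from pathlib import Path


def truncate_at(transcript: Path, rewind_uuid: str, out: Path) -> dict:
    """Write the records before the rewind point to out; return the rewind record."""
    records = [
        json.loads(line)
        for line in transcript.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    for index, record in enumerate(records):
        if record.get("uuid") == rewind_uuid:
            break
    else:
        raise ValueError(f"no record with uuid {rewind_uuid!r}")
    # Assumed field name: only user messages are valid rewind points.
    if record.get("type") != "user":
        raise ValueError("rewind point must be a user message")
    if index == 0:
        raise ValueError("rewind point cannot be the first record")
    kept = records[:index]
    out.write_text("".join(json.dumps(r) + "\n" for r in kept), encoding="utf-8")
    return record
```

Under these assumptions, the returned record is the user message that a fork would replay against the truncated history.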
Running a test
The session argument accepts UUID prefixes, prev, prev-N, or slug names,
matching claude-history conventions. The rewind_uuid must be a
user message UUID (assistant and tool-result UUIDs are rejected), and it
cannot be the first record in the session.
See the counterfactual CLI reference
for every flag and default.
Bundled Claude Code skill
The source repo ships a counterfactual-test Claude Code skill at
.claude/skills/counterfactual-test/SKILL.md. It walks Claude Code through
picking a session and rewind point, choosing an intervention, and interpreting
the output. The skill sets disable-model-invocation: true, so it runs only
when you type /counterfactual-test in a Claude Code session inside the
source repo — the model will not trigger it automatically.
This is a Claude Code dev-harness skill, separate from
ollim-bot runtime skills loaded by the bot at startup.
Cost and gotchas
- Default caps are $0.50 per run and 5 turns; --with-baseline doubles cost by running two forks.
- Discord MCP tools are not connected: pick rewind points where the original response did not depend on ping_user, discord_embed, or other Discord-side tools, or the comparison is invalid.
- Variant runs use bypassPermissions: tools denied in production can succeed on replay, which may change tool selection behavior.
- Profile drift: the variant uses the current IDENTITY.md and USER.md. If these changed since the original session, differences may reflect profile edits rather than the intervention.
- Interrupted runs may leave orphaned JSONL files under ~/.claude/projects/. Delete them manually if cleanup did not run.
Writing tests
Basic pattern
Tests follow a three-part structure: set up state, call the function, assert the result. Use data_dir as a fixture parameter whenever the test involves file I/O.
For tests that only need a temp directory without path redirection, use
pytest’s built-in tmp_path.
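The three-part structure might look like the sketch below, which uses tmp_path because no path redirection is involved. append_jsonl is a hypothetical function under test, written here only so the example is self-contained.

```python
import json
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Hypothetical function under test: append one JSON object per line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def test_append_jsonl_adds_one_line_per_record(tmp_path: Path) -> None:
    # Set up state.
    path = tmp_path / "log.jsonl"
    append_jsonl(path, {"id": 1})

    # Call the function.
    append_jsonl(path, {"id": 2})

    # Assert the result with a plain assert statement.
    lines = path.read_text(encoding="utf-8").splitlines()
    assert [json.loads(line) for line in lines] == [{"id": 1}, {"id": 2}]
```

The test is a module-level function and reads real data back from disk instead of mocking the file system.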
Async tests
Most async tests use a sync _run() helper that drives coroutines
through the event loop, so no @pytest.mark.asyncio marker is needed.
Assertions
Use direct assertions rather than assertion helpers.
Next steps
Development guide
Dev setup, project structure, and code conventions.
CLI reference
All ollim-bot subcommands and flags.
Architecture overview
Module map and data flow through the system.
Troubleshooting
Common issues and debugging techniques.
