## Running tests
No extra configuration is needed. The `conftest.py` sets default environment
variables (`OLLIM_USER_NAME=TestUser`, `OLLIM_BOT_NAME=test-bot`) so tests
run without a `.env` file.

## Test philosophy
Tests verify real behavior against real data. The guiding principle: mock only what you cannot control.

What gets tested with real instances:

- Dataclass construction and field defaults (`Routine`, `Reminder`, `BudgetState`)
- File I/O — JSONL reading/writing, markdown parsing, roundtrip serialization
- State transitions — ping budget refill, session compaction, permission approval
- Configuration loading and validation

What gets mocked:

- Discord API calls (`channel.send()`, message objects) — these require an active gateway connection
- `Agent`/`Client` creation in fork execution — these start real Claude API calls
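For illustration, a minimal sketch of mocking a Discord send call with the standard library's `AsyncMock` — the `notify` function here is a hypothetical stand-in, not code from the project:

```python
import asyncio
from unittest.mock import AsyncMock

def test_notify_sends_to_channel():
    # Stand-in for a Discord channel; send() becomes an awaitable mock,
    # so no gateway connection is needed.
    channel = AsyncMock()

    async def notify(channel, text):
        # Hypothetical code under test.
        await channel.send(text)

    asyncio.run(notify(channel, "ping"))
    channel.send.assert_awaited_once_with("ping")
```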
## Test structure

### Dependencies
| Package | Version | Purpose |
|---|---|---|
| `pytest` | >=9.0.2 | Test framework |
| `pytest-asyncio` | >=1.3.0 | Async test support |
| `pytest-cov` | >=7.0.0 | Coverage reporting |
All three are declared in the `dev` dependency group and installed by `uv sync`.
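These pins would live in `pyproject.toml`; a plausible sketch, assuming the standard `[dependency-groups]` table (PEP 735) that `uv sync` installs by default:

```toml
[dependency-groups]
dev = [
    "pytest>=9.0.2",
    "pytest-asyncio>=1.3.0",
    "pytest-cov>=7.0.0",
]
```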
### The `data_dir` fixture
Every test that touches the filesystem uses the `data_dir` fixture from
`conftest.py`. It redirects all module-level path constants away from the real
`~/.ollim-bot/` into a `tmp_path` directory. Each test gets an isolated temp
directory that is cleaned up automatically.
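A minimal sketch of how such a fixture can work, using pytest's `monkeypatch` — the module and constant names here are illustrative stand-ins, not the project's actual ones:

```python
from pathlib import Path

import pytest

class fake_storage:
    # Stand-in for a source module with a module-level path constant.
    DATA_DIR = Path.home() / ".ollim-bot"

@pytest.fixture
def data_dir(tmp_path, monkeypatch):
    # Point the constant at the per-test temp dir; monkeypatch restores
    # the original value when the test finishes.
    monkeypatch.setattr(fake_storage, "DATA_DIR", tmp_path)
    return tmp_path

def test_writes_stay_in_tmp(data_dir):
    (fake_storage.DATA_DIR / "state.jsonl").write_text("{}\n")
    assert fake_storage.DATA_DIR == data_dir
```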
### File organization
All tests live in `tests/` as module-level functions — no test classes. Each file maps to a source module:
| Test file | Source module |
|---|---|
| `test_agent_streaming.py` | `agent_streaming.py` |
| `test_agent_tools.py` | `agent_tools.py` |
| `test_bot.py` | `bot.py` |
| `test_cli.py` | `main.py`, `routine_cmd.py`, `reminder_cmd.py` |
| `test_config.py` | `config.py` |
| `test_embeds.py` | `embeds.py`, `views.py` |
| `test_forks.py` | `forks.py`, `fork_state.py` |
| `test_formatting.py` | `formatting.py` |
| `test_inquiries.py` | `inquiries.py` |
| `test_permissions.py` | `permissions.py` |
| `test_ping_budget.py` | `ping_budget.py` |
| `test_reminders.py` | `scheduling/reminders.py` |
| `test_routines.py` | `scheduling/routines.py` |
| `test_runtime_config.py` | `runtime_config.py` |
| `test_scheduler_prompts.py` | `scheduling/preamble.py`, `prompts.py` |
| `test_sessions.py` | `sessions.py` |
| `test_skills.py` | `skills.py` |
| `test_storage.py` | `storage.py` |
| `test_stream_compact.py` | `streamer.py`, `sessions.py` |
| `test_streamer.py` | `streamer.py` |
| `test_subagents.py` | `subagents.py` |
| `test_tool_policy.py` | `tool_policy.py` |
| `test_tool_restrictions.py` | `agent_tools.py` |
| `test_webhook.py` | `webhook.py` |
| `evals/test_judge.py` | `evals/judge.py` |
| `evals/test_runner.py` | `evals/runner.py` |
| `evals/test_scenario.py` | `evals/scenario.py` |
| `evals/test_eval_cmd.py` | `evals/eval_cmd.py` |
| `evals/test_results.py` | `evals/results.py` |
| `test_behavior_prompts.py` | ADHD prompt content regression tests |
| `test_behavior_scenarios.py` | ADHD multi-step interaction flow regression tests |
| `test_behavior_context.py` | ADHD context assembly regression tests |
## ADHD behavior evals
The eval system tests whether the bot responds appropriately to simulated ADHD users. It runs multi-turn conversations between a Haiku-powered user-proxy (playing a simulated ADHD user) and the real bot agent, then scores the transcript with a Sonnet-powered LLM judge.

### What it tests
Each scenario defines a persona (personality, goal, opening message, max turns) and criteria the judge scores on a 1-5 scale. Criteria are ADHD-specific behavioral anchors — things like whether the bot gives exactly one next step instead of an option dump, uses warm tone instead of productivity-coach language, and avoids adding cognitive load. There are six scenarios in `src/ollim_bot/evals/cases/`:
| Scenario ID | Goal type | What it tests |
|---|---|---|
| `overwhelmed-by-tasks` | `file_created` | Bot gives one clear next step, not a prioritized list |
| `vague-routine-request` | `file_created` | Bot guides a vague request into a concrete routine |
| `proactive-morning-checkin` | `qualitative` | Morning check-in feels like care, not noise |
| `frustrated-nothing-works` | `qualitative` | Bot handles frustration without dismissing or lecturing |
| `reminder-negotiation` | `qualitative` | Bot negotiates a missed commitment without guilt |
| `context-recovery-after-gap` | `qualitative` | Bot recovers context after the user left mid-conversation |
Scenarios with `goal_type: file_created` pass when the bot creates a routine
or reminder file. Scenarios with `goal_type: qualitative` always run to
`max_turns` and rely entirely on the judge's scoring.
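The two goal types can be modeled roughly like this — a hypothetical sketch with field names invented for illustration, not the actual schema from `evals/scenario.py`:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    # Illustrative fields only; the real schema lives in evals/scenario.py.
    scenario_id: str
    goal_type: str                      # "file_created" or "qualitative"
    opening_message: str
    max_turns: int
    criteria: list[str] = field(default_factory=list)  # judge scores each 1-5

def conversation_done(s: Scenario, turn: int, file_created: bool) -> bool:
    # file_created scenarios can stop early once the file exists;
    # qualitative ones always run to max_turns and rely on the judge.
    if s.goal_type == "file_created" and file_created:
        return True
    return turn >= s.max_turns
```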
### Running evals
Evals require Claude authentication and make real API calls — they are not
part of the fast `pytest` suite. Run them separately when you want to validate
ADHD behavior quality.

### ADHD behavior regression tests
Three test modules (33 tests, 441 lines) protect ADHD-specific behavior without requiring API calls:

- `test_behavior_prompts.py` — verifies prompt content includes required ADHD-specific instructions
- `test_behavior_scenarios.py` — tests multi-step interaction flows and scenario data model integrity
- `test_behavior_context.py` — tests context assembly for eval environments

These run as part of the regular `pytest` suite and catch regressions in
prompt content and scenario definitions.
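A prompt-content regression test can be as simple as asserting that required phrases survive edits — a hypothetical sketch (the constant and phrases are invented; the real checks live in `test_behavior_prompts.py`):

```python
# Hypothetical prompt constant; the real one would be imported from the
# bot's prompt module.
SYSTEM_PROMPT = (
    "Offer exactly one next step, never a prioritized list. "
    "Keep the tone warm; avoid productivity-coach language."
)

def test_prompt_demands_single_next_step():
    assert "one next step" in SYSTEM_PROMPT

def test_prompt_forbids_coach_tone():
    assert "productivity-coach" in SYSTEM_PROMPT
```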
## Writing tests

### Basic pattern
Tests follow a three-part structure: set up state, call the function, assert the result. Use `data_dir` as a fixture parameter whenever the test involves file I/O.
For tests that only need a temp directory without path redirection, use
pytest's built-in `tmp_path`.
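The pattern looks like this in practice — a sketch with a hypothetical function under test, not code from the project:

```python
import json
from pathlib import Path

def append_record(path: Path, record: dict) -> None:
    # Hypothetical function under test: append one JSONL line.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def test_append_record(tmp_path):
    # 1. Set up state
    path = tmp_path / "events.jsonl"
    # 2. Call the function
    append_record(path, {"kind": "reminder", "id": 1})
    # 3. Assert the result
    lines = path.read_text().splitlines()
    assert json.loads(lines[0]) == {"kind": "reminder", "id": 1}
```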
### Async tests
Most async tests use a sync `_run()` helper that drives coroutines
through the event loop — no `@pytest.mark.asyncio` needed.
### Assertions

Use direct assertions rather than assertion helpers.
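That is, prefer plain `assert` statements — which pytest rewrites to show both operands on failure — over wrapper helpers that hide the comparison. A sketch with invented names:

```python
def refill(budget: dict, amount: int) -> dict:
    # Hypothetical function under test.
    return {**budget, "pings": budget["pings"] + amount}

def test_refill():
    result = refill({"pings": 0}, 3)
    assert result["pings"] == 3
    # avoid: assert_ping_count(result, 3)  # a helper would obscure the diff
```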
## Next steps

- **Development guide** — dev setup, project structure, and code conventions.
- **CLI reference** — all `ollim-bot` subcommands and flags.
- **Architecture overview** — module map and data flow through the system.
- **Troubleshooting** — common issues and debugging techniques.
