ollim-bot has a pytest-based test suite covering data structures, storage I/O, scheduling, permissions, forks, and more. Tests run against real files in temp directories rather than mocking internal behavior.

Running tests

To run the full suite:
uv run pytest
To run with coverage reporting:
uv run pytest --cov
To run a single test file:
uv run pytest tests/test_ping_budget.py
No extra configuration is needed. The conftest.py sets default environment variables (OLLIM_USER_NAME=TestUser, OLLIM_BOT_NAME=test-bot) so tests run without a .env file.
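One way a conftest.py can provide such defaults is os.environ.setdefault, which fills in a value only when the variable is not already set. The following is a minimal sketch of that idea, not the project's actual conftest.py. Only the two variable names and values come from the text above; the helper name and the dictionary are hypothetical.

```python
import os

# Defaults named in the text above. Anything already set in the
# environment (for example by CI) is left untouched.
TEST_ENV_DEFAULTS = {
    "OLLIM_USER_NAME": "TestUser",
    "OLLIM_BOT_NAME": "test-bot",
}


def apply_test_env_defaults(environ=os.environ):
    """Hypothetical helper: set each default only if the key is missing."""
    for key, value in TEST_ENV_DEFAULTS.items():
        environ.setdefault(key, value)
    return environ


# pytest imports conftest.py before the test modules in its directory, so
# calling this at module level makes the values visible to code that reads
# the environment at import time.
apply_test_env_defaults()
```

With setdefault, a developer who does have a .env file or exported variables keeps their own values, while a clean checkout still runs.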

Test philosophy

Tests verify real behavior against real data. The guiding principle: mock only what you cannot control. What gets tested with real instances:
  • Dataclass construction and field defaults (Routine, Reminder, BudgetState)
  • File I/O — JSONL reading/writing, markdown parsing, roundtrip serialization
  • State transitions — ping budget refill, session compaction, permission approval
  • Configuration loading and validation
What gets mocked:
  • Discord API calls (channel.send(), message objects) — these require an active gateway connection
  • Agent/Client creation in fork execution — these start real Claude API calls
The rule is not “no mocks” but “no gratuitous mocks”: if the code under test can run with real objects and temp files, it does.
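The split between real and mocked pieces can be illustrated with a small, self-contained test. This is a sketch under stated assumptions: send_reminder_count and the reminders.jsonl layout are hypothetical and do not come from the ollim-bot source. The file is real, and only the Discord-facing channel object is replaced with unittest.mock.AsyncMock.

```python
import asyncio
import json
from pathlib import Path
from unittest.mock import AsyncMock


async def send_reminder_count(channel, reminders_file: Path) -> None:
    """Hypothetical code under test: read real JSONL, post a summary."""
    lines = [ln for ln in reminders_file.read_text().splitlines() if ln.strip()]
    reminders = [json.loads(ln) for ln in lines]
    await channel.send(f"{len(reminders)} reminders pending")


def test_send_reminder_count(tmp_path: Path) -> None:
    # Real file on disk: nothing about storage is mocked.
    reminders_file = tmp_path / "reminders.jsonl"
    reminders_file.write_text('{"id": "a"}\n{"id": "b"}\n')

    # Only the Discord boundary is replaced, since it needs a gateway.
    channel = AsyncMock()

    asyncio.run(send_reminder_count(channel, reminders_file))

    # The mock records the awaited call so the message can be checked.
    channel.send.assert_awaited_once_with("2 reminders pending")
```

If the parsing logic breaks, this test fails for the same reason production would, because the file handling is exercised for real.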

Test structure

Dependencies

Package          Version    Purpose
pytest           >=9.0.2    Test framework
pytest-asyncio   >=1.3.0    Async test support
pytest-cov       >=7.0.0    Coverage reporting
All three are in the dev dependency group and installed by uv sync.

The data_dir fixture

Every test that touches the filesystem uses the data_dir fixture from conftest.py. It redirects all module-level path constants to a tmp_path directory:
import pytest


@pytest.fixture()
def data_dir(tmp_path, monkeypatch):
    """Redirect all data file paths to a temp directory."""
    import ollim_bot.forks as forks_mod
    import ollim_bot.inquiries as inquiries_mod
    import ollim_bot.ping_budget as ping_budget_mod
    import ollim_bot.runtime_config as runtime_config_mod
    import ollim_bot.scheduling.reminders as reminders_mod
    import ollim_bot.scheduling.routines as routines_mod
    import ollim_bot.sessions as sessions_mod
    import ollim_bot.storage as storage_mod

    state_dir = tmp_path / "state"
    monkeypatch.setattr(storage_mod, "DATA_DIR", tmp_path)
    monkeypatch.setattr(storage_mod, "STATE_DIR", state_dir)
    monkeypatch.setattr(routines_mod, "ROUTINES_DIR", tmp_path / "routines")
    monkeypatch.setattr(reminders_mod, "REMINDERS_DIR", tmp_path / "reminders")
    monkeypatch.setattr(inquiries_mod, "INQUIRIES_FILE", state_dir / "inquiries.json")
    monkeypatch.setattr(ping_budget_mod, "BUDGET_FILE", state_dir / "ping_budget.json")
    monkeypatch.setattr(runtime_config_mod, "CONFIG_FILE", state_dir / "config.json")
    monkeypatch.setattr(sessions_mod, "SESSIONS_FILE", state_dir / "sessions.json")
    monkeypatch.setattr(sessions_mod, "HISTORY_FILE", state_dir / "session_history.jsonl")
    monkeypatch.setattr(sessions_mod, "FORK_MESSAGES_FILE", state_dir / "fork_messages.json")
    monkeypatch.setattr(forks_mod, "_UPDATES_FILE", state_dir / "pending_updates.json")

    import ollim_bot.skills as skills_mod
    import ollim_bot.webhook as webhook_mod

    monkeypatch.setattr(skills_mod, "SKILLS_DIR", tmp_path / "skills")
    monkeypatch.setattr(webhook_mod, "WEBHOOKS_DIR", tmp_path / "webhooks")
    return tmp_path
This means tests never touch ~/.ollim-bot/. Each test gets an isolated temp directory that is cleaned up automatically.
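Patching a module-level constant only redirects code that looks the constant up on the module at call time. The sketch below shows that mechanism with stand-ins: storage here is a hypothetical namespace object rather than the real ollim_bot.storage module, and unittest.mock.patch.object plays the role of monkeypatch.setattr (both swap an attribute and restore it afterwards).

```python
import tempfile
import types
from pathlib import Path
from unittest import mock

# Hypothetical stand-in for a module that owns a path constant.
storage = types.SimpleNamespace(DATA_DIR=Path.home() / ".example-bot")


def budget_file() -> Path:
    # The attribute is looked up on every call, so a patched value is seen.
    return storage.DATA_DIR / "state" / "ping_budget.json"


def demo() -> tuple[Path, Path, Path]:
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        # Same swap-and-restore idea as monkeypatch.setattr in the fixture.
        with mock.patch.object(storage, "DATA_DIR", tmp_path):
            during = budget_file()
        # Outside the patch the original constant is back in place.
        after = budget_file()
    return tmp_path, during, after
```

A module that copies the value with `from storage import DATA_DIR` holds its own binding and would not see the patch, which is one reason a fixture like this patches each module's constant by name.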

File organization

All tests live in tests/ as module-level functions — no test classes. Each file maps to a source module:
Test file                     Source module
test_agent_streaming.py       agent_streaming.py
test_agent_tools.py           agent_tools.py
test_bot.py                   bot.py
test_cli.py                   main.py, routine_cmd.py, reminder_cmd.py
test_config.py                config.py
test_embeds.py                embeds.py, views.py
test_forks.py                 forks.py, fork_state.py
test_formatting.py            formatting.py
test_inquiries.py             inquiries.py
test_permissions.py           permissions.py
test_ping_budget.py           ping_budget.py
test_reminders.py             scheduling/reminders.py
test_routines.py              scheduling/routines.py
test_runtime_config.py        runtime_config.py
test_scheduler_prompts.py     scheduling/preamble.py, prompts.py
test_sessions.py              sessions.py
test_skills.py                skills.py
test_storage.py               storage.py
test_stream_compact.py        streamer.py, sessions.py
test_streamer.py              streamer.py
test_subagents.py             subagents.py
test_tool_policy.py           tool_policy.py
test_tool_restrictions.py     agent_tools.py
test_webhook.py               webhook.py
evals/test_judge.py           evals/judge.py
evals/test_runner.py          evals/runner.py
evals/test_scenario.py        evals/scenario.py
evals/test_eval_cmd.py        evals/eval_cmd.py
evals/test_results.py         evals/results.py
test_behavior_prompts.py      ADHD prompt content regression tests
test_behavior_scenarios.py    ADHD multi-step interaction flow regression tests
test_behavior_context.py      ADHD context assembly regression tests

ADHD behavior evals

The eval system tests whether the bot responds appropriately to simulated ADHD users. It runs multi-turn conversations between a Haiku-powered user-proxy (playing a simulated ADHD user) and the real bot agent, then scores the transcript with a Sonnet-powered LLM judge.

What it tests

Each scenario defines a persona (personality, goal, opening message, max turns) and criteria the judge scores on a 1-5 scale. Criteria are ADHD-specific behavioral anchors — things like whether the bot gives exactly one next step instead of an option dump, uses warm tone instead of productivity-coach language, and avoids adding cognitive load. There are 6 scenarios in src/ollim_bot/evals/cases/:
Scenario ID                  Goal type      What it tests
overwhelmed-by-tasks         file_created   Bot gives one clear next step, not a prioritized list
vague-routine-request        file_created   Bot guides a vague request into a concrete routine
proactive-morning-checkin    qualitative    Morning check-in feels like care, not noise
frustrated-nothing-works     qualitative    Bot handles frustration without dismissing or lecturing
reminder-negotiation         qualitative    Bot negotiates a missed commitment without guilt
context-recovery-after-gap   qualitative    Bot recovers context after the user left mid-conversation
Scenarios with goal_type: file_created pass when the bot creates a routine or reminder file. Scenarios with goal_type: qualitative always run to max_turns and rely entirely on the judge’s scoring.
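The scenario shape described above can be pictured as a small data model. This is a sketch only: the class names, field names, and the file_goal_met helper are assumptions made for illustration, and the real definitions in the evals package may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class GoalType(str, Enum):
    FILE_CREATED = "file_created"
    QUALITATIVE = "qualitative"


@dataclass
class Persona:
    personality: str
    goal: str
    opening_message: str
    max_turns: int


@dataclass
class Scenario:
    id: str
    goal_type: GoalType
    persona: Persona
    # Behavioral anchors the judge scores on a 1-5 scale.
    criteria: list[str] = field(default_factory=list)


def file_goal_met(scenario: Scenario, created_files: list[str]) -> Optional[bool]:
    """Mechanical goal check for one finished conversation.

    Returns None for qualitative scenarios, which have no file goal and
    are decided by the judge's scores alone.
    """
    if scenario.goal_type is GoalType.QUALITATIVE:
        return None
    return len(created_files) > 0
```

Returning None rather than False for qualitative scenarios keeps "no mechanical goal" distinct from "goal missed" when results are summarized.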

Running evals

# Run all scenarios
ollim-bot eval run

# Run one scenario with full transcript
ollim-bot eval run overwhelmed-by-tasks --verbose

# View past results
ollim-bot eval results

# Compare latest vs previous for regression detection
ollim-bot eval compare
See CLI reference for the full list of subcommands and flags.
Evals require Claude authentication and make real API calls — they are not part of the fast pytest suite. Run them separately when you want to validate ADHD behavior quality.

ADHD behavior regression tests

Three test modules (33 tests, 441 lines) protect ADHD-specific behavior without requiring API calls:
  • test_behavior_prompts.py — verifies prompt content includes required ADHD-specific instructions
  • test_behavior_scenarios.py — tests multi-step interaction flows and scenario data model integrity
  • test_behavior_context.py — tests context assembly for eval environments
These run as part of the regular pytest suite and catch regressions in prompt content and scenario definitions.
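A prompt-content regression test of this kind can be very small. The sketch below uses a made-up prompt and made-up required phrases; neither is taken from the ollim-bot prompts, and missing_phrases is a hypothetical helper.

```python
# Hypothetical prompt text and required phrases, for illustration only.
SYSTEM_PROMPT = """\
Offer exactly one next step at a time.
Use a warm tone; avoid productivity-coach language.
"""

REQUIRED_PHRASES = [
    "exactly one next step",
    "warm tone",
]


def missing_phrases(prompt: str, required: list[str]) -> list[str]:
    """Return the required phrases that do not appear in the prompt."""
    lowered = prompt.lower()
    return [phrase for phrase in required if phrase.lower() not in lowered]


def test_prompt_keeps_required_instructions() -> None:
    # Comparing against an empty list makes the failure message name
    # exactly which instructions were dropped.
    assert missing_phrases(SYSTEM_PROMPT, REQUIRED_PHRASES) == []
```

Checks like this need no API call, so they can guard prompt edits on every run of the fast suite.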

Counterfactual trajectory testing

The counterfactual command answers a different question from ADHD behavior evals: “what would the agent have done here if I changed X?” It replays a real production transcript up to a chosen point, applies an intervention, and prints the new response next to the original so you can see exactly what changed. Use it when you want to validate a prompt edit, tool restriction, or model swap against a session that already happened — without waiting for similar behavior to recur in production.
Tool             Question it answers                                                Input
ollim-bot eval   Does the agent handle simulated ADHD users well?                   Synthetic scenarios
counterfactual   How would the agent have responded here with different settings?   A real past session and a rewind point
counterfactual is a separate top-level command, not an ollim-bot subcommand. It is installed alongside ollim-bot when you run uv tool install --editable . (the trailing dot is part of the command).

How it works

counterfactual truncates the session’s JSONL transcript at the rewind point, then uses the Claude Agent SDK’s fork feature to resume from the truncated state with modified ClaudeAgentOptions. It can run two forks in parallel — a baseline (same settings as the original) and a variant (with the intervention) — so you can distinguish sampling noise from the intervention’s effect. The truncated file and both fork sessions are cleaned up after the run. Source: src/ollim_bot/eval/counterfactual.py and src/ollim_bot/eval/counterfactual_cli.py.
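The truncation step can be sketched on its own. The code below is an illustration under stated assumptions, not the code in counterfactual.py: it assumes one JSON object per line with "uuid" and "type" keys, and it keeps only the records strictly before the rewind message. Whether the real tool keeps or drops the rewind message itself is not stated here. The validation mirrors the rules for rewind_uuid given under "Running a test".

```python
import json
from pathlib import Path


def truncate_transcript(src: Path, dst: Path, rewind_uuid: str) -> int:
    """Copy the records that precede rewind_uuid from src to dst.

    The "uuid" and "type" field names are assumptions of this sketch.
    Returns the number of records kept.
    """
    lines = [ln for ln in src.read_text().splitlines() if ln.strip()]
    records = [json.loads(ln) for ln in lines]

    index = next(
        (i for i, rec in enumerate(records) if rec.get("uuid") == rewind_uuid),
        None,
    )
    if index is None:
        raise ValueError(f"no record with uuid {rewind_uuid!r}")
    if index == 0:
        raise ValueError("rewind point cannot be the first record")
    if records[index].get("type") != "user":
        raise ValueError("rewind point must be a user message")

    # Write the original line text so the kept prefix matches the
    # session file exactly instead of being re-serialized.
    dst.write_text("".join(ln + "\n" for ln in lines[:index]))
    return index
```

Writing to a separate dst path leaves the production transcript untouched, which matches the description of a temporary truncated file that is cleaned up after the run.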

Running a test

# 1. Find the session and the message UUID you want to rewind to
claude-history sessions --cwd ~/.ollim-bot --since 7d
claude-history transcript <session_id> --cwd ~/.ollim-bot

# 2. Replay with an intervention
counterfactual <session> <rewind_uuid> --append "Respond in one sentence."
The session argument accepts UUID prefixes, prev, prev-N, or slug names — matching claude-history conventions. The rewind_uuid must be a user message UUID (assistant and tool-result UUIDs are rejected), and it cannot be the first record in the session. See the counterfactual CLI reference for every flag and default.

Bundled Claude Code skill

The source repo ships a counterfactual-test Claude Code skill at .claude/skills/counterfactual-test/SKILL.md. It walks Claude Code through picking a session and rewind point, choosing an intervention, and interpreting the output. The skill sets disable-model-invocation: true, so it runs only when you type /counterfactual-test in a Claude Code session inside the source repo — the model will not trigger it automatically. This is a Claude Code dev-harness skill, separate from ollim-bot runtime skills loaded by the bot at startup.

Cost and gotchas

  • Default caps are $0.50 per run and 5 turns; --with-baseline doubles cost by running two forks.
  • Discord MCP tools are not connected — pick rewind points where the original response did not depend on ping_user, discord_embed, or other Discord-side tools, or the comparison is invalid.
  • Variant runs use bypassPermissions — tools denied in production can succeed on replay, which may change tool selection behavior.
  • Profile drift — the variant uses the current IDENTITY.md and USER.md. If these changed since the original session, differences may reflect profile edits rather than the intervention.
  • Interrupted runs may leave orphaned JSONL files under ~/.claude/projects/. Delete them manually if cleanup did not run.

Writing tests

Basic pattern

Tests follow a three-part structure: set up state, call the function, assert the result.
from ollim_bot import ping_budget


def test_load_returns_defaults_when_no_file(data_dir):
    state = ping_budget.load()

    assert state.capacity == 5
    assert state.available == 5.0
    assert state.refill_rate_minutes == 90
    assert state.critical_used == 0
    assert state.daily_used == 0
Use data_dir as a fixture parameter whenever the test involves file I/O. For tests that only need a temp directory without path redirection, use pytest’s built-in tmp_path.
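For the tmp_path case, the same three-part structure applies. The sketch below is self-contained and uses two hypothetical JSONL helpers (append_jsonl and read_jsonl) in place of real ollim-bot functions, which are not shown on this page.

```python
import json
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Hypothetical helper: append one record as a JSON line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Hypothetical helper: read every record back in file order."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def test_jsonl_roundtrip(tmp_path: Path) -> None:
    # Set up state: a real file inside pytest's per-test temp directory.
    log = tmp_path / "state" / "history.jsonl"
    append_jsonl(log, {"id": 1, "text": "first"})
    append_jsonl(log, {"id": 2, "text": "second"})

    # Call the function under test.
    records = read_jsonl(log)

    # Assert the result.
    assert records == [
        {"id": 1, "text": "first"},
        {"id": 2, "text": "second"},
    ]
```

Because the helpers take the path as an argument, no module constant needs redirecting and tmp_path alone is enough.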

Async tests

Most async tests use a sync _run() helper that drives coroutines through the event loop — no @pytest.mark.asyncio needed:
import asyncio


def _run(coro):
    return asyncio.get_event_loop().run_until_complete(coro)


def test_many_concurrent_appends(data_dir):
    """Stress test: 20 concurrent append_update calls — all must survive."""

    async def _scenario():
        tasks = [asyncio.create_task(append_update(f"update-{i}")) for i in range(20)]
        await asyncio.gather(*tasks)

        result = await pop_pending_updates()
        messages = sorted([u.message for u in result])
        expected = sorted([f"update-{i}" for i in range(20)])
        assert messages == expected

    _run(_scenario())

Assertions

Use direct assertions rather than assertion helpers:
# Equality
assert loaded.capacity == 5

# Membership
assert "5/5 available" in status

# Boolean
assert result is True

# Approximate (for floats)
import pytest
assert loaded.available == pytest.approx(3.0, abs=0.01)

# Exceptions
with pytest.raises(ValueError):
    parse_cron("not a cron")

Next steps

  • Development guide: dev setup, project structure, and code conventions.
  • CLI reference: all ollim-bot subcommands and flags.
  • Architecture overview: module map and data flow through the system.
  • Troubleshooting: common issues and debugging techniques.