ollim-bot has a pytest-based test suite covering data structures, storage I/O, scheduling, permissions, forks, and more. Tests run against real files in temp directories rather than mocking internal behavior.

Running tests

uv run pytest
To run with coverage reporting:
uv run pytest --cov
To run a single test file:
uv run pytest tests/test_ping_budget.py
No extra configuration is needed. The conftest.py sets default environment variables (OLLIM_USER_NAME=TestUser, OLLIM_BOT_NAME=test-bot) so tests run without a .env file.
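A sketch of how conftest.py might set those defaults (the actual file may differ; setdefault means an already-exported variable still wins):

```python
import os

# Documented defaults from the test suite; setdefault keeps any
# values already present in the real environment.
os.environ.setdefault("OLLIM_USER_NAME", "TestUser")
os.environ.setdefault("OLLIM_BOT_NAME", "test-bot")
```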

Test philosophy

Tests verify real behavior against real data. The guiding principle: mock only what you cannot control. What gets tested with real instances:
  • Dataclass construction and field defaults (Routine, Reminder, BudgetState)
  • File I/O — JSONL reading/writing, markdown parsing, roundtrip serialization
  • State transitions — ping budget refill, session compaction, permission approval
  • Configuration loading and validation
What gets mocked:
  • Discord API calls (channel.send(), message objects) — these require an active gateway connection
  • Agent/Client creation in fork execution — these start real Claude API calls
This is not “no mocks.” It is no gratuitous mocks. If the code under test can run with real objects and temp files, it does.
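The Discord side of that rule looks like this in miniature — a hedged sketch with a hypothetical notify coroutine, not actual ollim-bot code. Only the gateway-dependent object is mocked; the logic around it runs for real:

```python
import asyncio
from unittest.mock import AsyncMock

# Hypothetical coroutine under test; the real bot awaits
# channel.send() the same way.
async def notify(channel, text: str) -> None:
    await channel.send(text)

def test_notify_sends_once():
    channel = AsyncMock()  # stands in for a discord.TextChannel
    asyncio.run(notify(channel, "reminder: stretch break"))
    channel.send.assert_awaited_once_with("reminder: stretch break")
```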

Test structure

Dependencies

Package          Version   Purpose
pytest           >=9.0.2   Test framework
pytest-asyncio   >=1.3.0   Async test support
pytest-cov       >=7.0.0   Coverage reporting
All three are in the dev dependency group and installed by uv sync.

The data_dir fixture

Every test that touches the filesystem uses the data_dir fixture from conftest.py. It redirects all module-level path constants to a tmp_path directory:
@pytest.fixture()
def data_dir(tmp_path, monkeypatch):
    """Redirect all data file paths to a temp directory."""
    import ollim_bot.forks as forks_mod
    import ollim_bot.inquiries as inquiries_mod
    import ollim_bot.ping_budget as ping_budget_mod
    import ollim_bot.runtime_config as runtime_config_mod
    import ollim_bot.scheduling.reminders as reminders_mod
    import ollim_bot.scheduling.routines as routines_mod
    import ollim_bot.sessions as sessions_mod
    import ollim_bot.storage as storage_mod

    state_dir = tmp_path / "state"
    monkeypatch.setattr(storage_mod, "DATA_DIR", tmp_path)
    monkeypatch.setattr(storage_mod, "STATE_DIR", state_dir)
    monkeypatch.setattr(routines_mod, "ROUTINES_DIR", tmp_path / "routines")
    monkeypatch.setattr(reminders_mod, "REMINDERS_DIR", tmp_path / "reminders")
    monkeypatch.setattr(inquiries_mod, "INQUIRIES_FILE", state_dir / "inquiries.json")
    monkeypatch.setattr(ping_budget_mod, "BUDGET_FILE", state_dir / "ping_budget.json")
    monkeypatch.setattr(runtime_config_mod, "CONFIG_FILE", state_dir / "config.json")
    monkeypatch.setattr(sessions_mod, "SESSIONS_FILE", state_dir / "sessions.json")
    monkeypatch.setattr(sessions_mod, "HISTORY_FILE", state_dir / "session_history.jsonl")
    monkeypatch.setattr(sessions_mod, "FORK_MESSAGES_FILE", state_dir / "fork_messages.json")
    monkeypatch.setattr(forks_mod, "_UPDATES_FILE", state_dir / "pending_updates.json")

    import ollim_bot.skills as skills_mod
    import ollim_bot.webhook as webhook_mod

    monkeypatch.setattr(skills_mod, "SKILLS_DIR", tmp_path / "skills")
    monkeypatch.setattr(webhook_mod, "WEBHOOKS_DIR", tmp_path / "webhooks")
    return tmp_path
This means tests never touch ~/.ollim-bot/. Each test gets an isolated temp directory that is cleaned up automatically.

File organization

All tests live in tests/ as module-level functions — no test classes. Each file maps to a source module:
Test file                      Source module
test_agent_streaming.py        agent_streaming.py
test_agent_tools.py            agent_tools.py
test_bot.py                    bot.py
test_cli.py                    main.py, routine_cmd.py, reminder_cmd.py
test_config.py                 config.py
test_embeds.py                 embeds.py, views.py
test_forks.py                  forks.py, fork_state.py
test_formatting.py             formatting.py
test_inquiries.py              inquiries.py
test_permissions.py            permissions.py
test_ping_budget.py            ping_budget.py
test_reminders.py              scheduling/reminders.py
test_routines.py               scheduling/routines.py
test_runtime_config.py         runtime_config.py
test_scheduler_prompts.py      scheduling/preamble.py, prompts.py
test_sessions.py               sessions.py
test_skills.py                 skills.py
test_storage.py                storage.py
test_stream_compact.py         streamer.py, sessions.py
test_streamer.py               streamer.py
test_subagents.py              subagents.py
test_tool_policy.py            tool_policy.py
test_tool_restrictions.py      agent_tools.py
test_webhook.py                webhook.py
evals/test_judge.py            evals/judge.py
evals/test_runner.py           evals/runner.py
evals/test_scenario.py         evals/scenario.py
evals/test_eval_cmd.py         evals/eval_cmd.py
evals/test_results.py          evals/results.py
test_behavior_prompts.py       ADHD prompt content regression tests
test_behavior_scenarios.py     ADHD multi-step interaction flow regression tests
test_behavior_context.py       ADHD context assembly regression tests

ADHD behavior evals

The eval system tests whether the bot responds appropriately to simulated ADHD users. It runs multi-turn conversations between a Haiku-powered user-proxy (playing a simulated ADHD user) and the real bot agent, then scores the transcript with a Sonnet-powered LLM judge.

What it tests

Each scenario defines a persona (personality, goal, opening message, max turns) and criteria the judge scores on a 1-5 scale. Criteria are ADHD-specific behavioral anchors — things like whether the bot gives exactly one next step instead of an option dump, uses warm tone instead of productivity-coach language, and avoids adding cognitive load. There are 6 scenarios in src/ollim_bot/evals/cases/:
Scenario ID                  Goal type      What it tests
overwhelmed-by-tasks         file_created   Bot gives one clear next step, not a prioritized list
vague-routine-request        file_created   Bot guides a vague request into a concrete routine
proactive-morning-checkin    qualitative    Morning check-in feels like care, not noise
frustrated-nothing-works     qualitative    Bot handles frustration without dismissing or lecturing
reminder-negotiation         qualitative    Bot negotiates a missed commitment without guilt
context-recovery-after-gap   qualitative    Bot recovers context after the user left mid-conversation
Scenarios with goal_type: file_created pass when the bot creates a routine or reminder file. Scenarios with goal_type: qualitative always run to max_turns and rely entirely on the judge’s scoring.
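The scenario shape described above can be sketched as a dataclass. The field names here are illustrative only — the real definitions live in evals/scenario.py and src/ollim_bot/evals/cases/:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    # Illustrative fields; not the actual evals/scenario.py API.
    scenario_id: str
    personality: str        # persona the Haiku user-proxy plays
    goal: str
    opening_message: str
    max_turns: int
    goal_type: str = "qualitative"  # or "file_created"
    criteria: list[str] = field(default_factory=list)  # judged on a 1-5 scale

def passes_early(scenario: Scenario, file_was_created: bool) -> bool:
    # file_created scenarios can pass as soon as the bot writes a routine
    # or reminder file; qualitative scenarios always run to max_turns and
    # are settled entirely by the judge.
    return scenario.goal_type == "file_created" and file_was_created
```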

Running evals

# Run all scenarios
ollim-bot eval run

# Run one scenario with full transcript
ollim-bot eval run overwhelmed-by-tasks --verbose

# View past results
ollim-bot eval results

# Compare latest vs previous for regression detection
ollim-bot eval compare
See CLI reference for the full list of subcommands and flags.
Evals require Claude authentication and make real API calls — they are not part of the fast pytest suite. Run them separately when you want to validate ADHD behavior quality.

ADHD behavior regression tests

Three test modules (33 tests, 441 lines) protect ADHD-specific behavior without requiring API calls:
  • test_behavior_prompts.py — verifies prompt content includes required ADHD-specific instructions
  • test_behavior_scenarios.py — tests multi-step interaction flows and scenario data model integrity
  • test_behavior_context.py — tests context assembly for eval environments
These run as part of the regular pytest suite and catch regressions in prompt content and scenario definitions.
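A prompt-content regression test of this kind might look like the following sketch — build_system_prompt and the quoted wording are assumptions for illustration, not the real ollim_bot prompt API:

```python
# Hypothetical stand-in for the bot's prompt builder.
def build_system_prompt() -> str:
    return (
        "Give exactly one next step, never an option dump. "
        "Keep a warm tone and avoid productivity-coach language."
    )

def test_prompt_demands_single_next_step():
    # Cheap, API-free guard: if someone edits the prompt and drops the
    # ADHD-specific instruction, this fails in the fast suite.
    prompt = build_system_prompt()
    assert "one next step" in prompt
    assert "warm tone" in prompt
```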

Writing tests

Basic pattern

Tests follow a three-part structure: set up state, call the function, assert the result.
def test_load_returns_defaults_when_no_file(data_dir):
    state = ping_budget.load()

    assert state.capacity == 5
    assert state.available == 5.0
    assert state.refill_rate_minutes == 90
    assert state.critical_used == 0
    assert state.daily_used == 0
Use data_dir as a fixture parameter whenever the test involves file I/O. For tests that only need a temp directory without path redirection, use pytest’s built-in tmp_path.

Async tests

Most async tests use a sync _run() helper that drives coroutines through the event loop — no @pytest.mark.asyncio needed:
def _run(coro):
    return asyncio.get_event_loop().run_until_complete(coro)


def test_many_concurrent_appends(data_dir):
    """Stress test: 20 concurrent append_update calls — all must survive."""

    async def _scenario():
        tasks = [asyncio.create_task(append_update(f"update-{i}")) for i in range(20)]
        await asyncio.gather(*tasks)

        result = await pop_pending_updates()
        messages = sorted([u.message for u in result])
        expected = sorted([f"update-{i}" for i in range(20)])
        assert messages == expected

    _run(_scenario())

Assertions

Use direct assertions rather than assertion helpers:
# Equality
assert loaded.capacity == 5

# Membership
assert "5/5 available" in status

# Boolean
assert result is True

# Approximate (for floats)
from pytest import approx
assert loaded.available == approx(3.0, abs=0.01)

# Exceptions
with pytest.raises(ValueError):
    parse_cron("not a cron")

Next steps

Development guide

Dev setup, project structure, and code conventions.

CLI reference

All ollim-bot subcommands and flags.

Architecture overview

Module map and data flow through the system.

Troubleshooting

Common issues and debugging techniques.