> ## Documentation Index
> Fetch the complete documentation index at: https://docs.ollim.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Testing

> Run the test suite and understand the real-data, no-gratuitous-mocks testing approach.

ollim-bot has a pytest-based test suite covering data structures, storage I/O,
scheduling, permissions, forks, and more. Tests run against real files in temp
directories rather than mocking internal behavior.

## Running tests

```bash theme={null}
uv run pytest
```

To run with coverage reporting:

```bash theme={null}
uv run pytest --cov
```

To run a single test file:

```bash theme={null}
uv run pytest tests/test_ping_budget.py
```

<Note>
  No extra configuration is needed. The `conftest.py` sets default environment
  variables (`OLLIM_USER_NAME=TestUser`, `OLLIM_BOT_NAME=test-bot`) so tests
  run without a `.env` file.
</Note>

## Test philosophy

Tests verify real behavior against real data. The guiding principle: mock only what you cannot control.

**What gets tested with real instances:**

* Dataclass construction and field defaults (`Routine`, `Reminder`, `BudgetState`)
* File I/O — JSONL reading/writing, markdown parsing, roundtrip serialization
* State transitions — ping budget refill, session compaction, permission approval
* Configuration loading and validation

**What gets mocked:**

* Discord API calls (`channel.send()`, message objects) — these require an active gateway connection
* Agent/Client creation in fork execution — these start real Claude API calls

This is not "no mocks." It is no *gratuitous* mocks. If the code under test
can run with real objects and temp files, it does.

## Test structure

### Dependencies

| Package          | Version | Purpose            |
| ---------------- | ------- | ------------------ |
| `pytest`         | >=9.0.2 | Test framework     |
| `pytest-asyncio` | >=1.3.0 | Async test support |
| `pytest-cov`     | >=7.0.0 | Coverage reporting |

All three are in the `dev` dependency group and installed by `uv sync`.

### The `data_dir` fixture

Every test that touches the filesystem uses the `data_dir` fixture from
`conftest.py`. It redirects all module-level path constants to a
`tmp_path` directory:

```python theme={null}
@pytest.fixture()
def data_dir(tmp_path, monkeypatch):
    """Redirect all data file paths to a temp directory."""
    import ollim_bot.forks as forks_mod
    import ollim_bot.inquiries as inquiries_mod
    import ollim_bot.ping_budget as ping_budget_mod
    import ollim_bot.runtime_config as runtime_config_mod
    import ollim_bot.scheduling.reminders as reminders_mod
    import ollim_bot.scheduling.routines as routines_mod
    import ollim_bot.sessions as sessions_mod
    import ollim_bot.storage as storage_mod

    state_dir = tmp_path / "state"
    monkeypatch.setattr(storage_mod, "DATA_DIR", tmp_path)
    monkeypatch.setattr(storage_mod, "STATE_DIR", state_dir)
    monkeypatch.setattr(routines_mod, "ROUTINES_DIR", tmp_path / "routines")
    monkeypatch.setattr(reminders_mod, "REMINDERS_DIR", tmp_path / "reminders")
    monkeypatch.setattr(inquiries_mod, "INQUIRIES_FILE", state_dir / "inquiries.json")
    monkeypatch.setattr(ping_budget_mod, "BUDGET_FILE", state_dir / "ping_budget.json")
    monkeypatch.setattr(runtime_config_mod, "CONFIG_FILE", state_dir / "config.json")
    monkeypatch.setattr(sessions_mod, "SESSIONS_FILE", state_dir / "sessions.json")
    monkeypatch.setattr(sessions_mod, "HISTORY_FILE", state_dir / "session_history.jsonl")
    monkeypatch.setattr(sessions_mod, "FORK_MESSAGES_FILE", state_dir / "fork_messages.json")
    monkeypatch.setattr(forks_mod, "_UPDATES_FILE", state_dir / "pending_updates.json")

    import ollim_bot.skills as skills_mod
    import ollim_bot.webhook as webhook_mod

    monkeypatch.setattr(skills_mod, "SKILLS_DIR", tmp_path / "skills")
    monkeypatch.setattr(webhook_mod, "WEBHOOKS_DIR", tmp_path / "webhooks")
    return tmp_path
```

This means tests never touch `~/.ollim-bot/`. Each test gets an isolated temp
directory that is cleaned up automatically.

### File organization

All tests live in `tests/` as module-level functions — no test classes. Each file maps to a source module:

| Test file                    | Source module                                     |
| ---------------------------- | ------------------------------------------------- |
| `test_agent_streaming.py`    | `agent_streaming.py`                              |
| `test_agent_tools.py`        | `agent_tools.py`                                  |
| `test_bot.py`                | `bot.py`                                          |
| `test_cli.py`                | `main.py`, `routine_cmd.py`, `reminder_cmd.py`    |
| `test_config.py`             | `config.py`                                       |
| `test_embeds.py`             | `embeds.py`, `views.py`                           |
| `test_forks.py`              | `forks.py`, `fork_state.py`                       |
| `test_formatting.py`         | `formatting.py`                                   |
| `test_inquiries.py`          | `inquiries.py`                                    |
| `test_permissions.py`        | `permissions.py`                                  |
| `test_ping_budget.py`        | `ping_budget.py`                                  |
| `test_reminders.py`          | `scheduling/reminders.py`                         |
| `test_routines.py`           | `scheduling/routines.py`                          |
| `test_runtime_config.py`     | `runtime_config.py`                               |
| `test_scheduler_prompts.py`  | `scheduling/preamble.py`, `prompts.py`            |
| `test_sessions.py`           | `sessions.py`                                     |
| `test_skills.py`             | `skills.py`                                       |
| `test_storage.py`            | `storage.py`                                      |
| `test_stream_compact.py`     | `streamer.py`, `sessions.py`                      |
| `test_streamer.py`           | `streamer.py`                                     |
| `test_subagents.py`          | `subagents.py`                                    |
| `test_tool_policy.py`        | `tool_policy.py`                                  |
| `test_tool_restrictions.py`  | `agent_tools.py`                                  |
| `test_webhook.py`            | `webhook.py`                                      |
| `evals/test_judge.py`        | `evals/judge.py`                                  |
| `evals/test_runner.py`       | `evals/runner.py`                                 |
| `evals/test_scenario.py`     | `evals/scenario.py`                               |
| `evals/test_eval_cmd.py`     | `evals/eval_cmd.py`                               |
| `evals/test_results.py`      | `evals/results.py`                                |
| `test_behavior_prompts.py`   | ADHD prompt content regression tests              |
| `test_behavior_scenarios.py` | ADHD multi-step interaction flow regression tests |
| `test_behavior_context.py`   | ADHD context assembly regression tests            |

## ADHD behavior evals

The eval system tests whether the bot responds appropriately to simulated ADHD
users. It runs multi-turn conversations between a Haiku-powered user-proxy
(playing a simulated ADHD user) and the real bot agent, then scores the
transcript with a Sonnet-powered LLM judge.

### What it tests

Each scenario defines a **persona** (personality, goal, opening message, max
turns) and **criteria** the judge scores on a 1-5 scale. Criteria are
ADHD-specific behavioral anchors — things like whether the bot gives exactly
one next step instead of an option dump, uses warm tone instead of
productivity-coach language, and avoids adding cognitive load.

There are 6 scenarios in `src/ollim_bot/evals/cases/`:

| Scenario ID                  | Goal type      | What it tests                                             |
| ---------------------------- | -------------- | --------------------------------------------------------- |
| `overwhelmed-by-tasks`       | `file_created` | Bot gives one clear next step, not a prioritized list     |
| `vague-routine-request`      | `file_created` | Bot guides a vague request into a concrete routine        |
| `proactive-morning-checkin`  | `qualitative`  | Morning check-in feels like care, not noise               |
| `frustrated-nothing-works`   | `qualitative`  | Bot handles frustration without dismissing or lecturing   |
| `reminder-negotiation`       | `qualitative`  | Bot negotiates a missed commitment without guilt          |
| `context-recovery-after-gap` | `qualitative`  | Bot recovers context after the user left mid-conversation |

Scenarios with `goal_type: file_created` pass when the bot creates a routine
or reminder file. Scenarios with `goal_type: qualitative` always run to
`max_turns` and rely entirely on the judge's scoring.

### Running evals

```bash theme={null}
# Run all scenarios
ollim-bot eval run

# Run one scenario with full transcript
ollim-bot eval run overwhelmed-by-tasks --verbose

# View past results
ollim-bot eval results

# Compare latest vs previous for regression detection
ollim-bot eval compare
```

See [CLI reference](/development/cli-reference#eval) for the full list of
subcommands and flags.

<Note>
  Evals require Claude authentication and make real API calls — they are not
  part of the fast `pytest` suite. Run them separately when you want to validate
  ADHD behavior quality.
</Note>

### ADHD behavior regression tests

Three test modules (33 tests, 441 lines) protect ADHD-specific behavior
without requiring API calls:

* `test_behavior_prompts.py` — verifies prompt content includes required
  ADHD-specific instructions
* `test_behavior_scenarios.py` — tests multi-step interaction flows and
  scenario data model integrity
* `test_behavior_context.py` — tests context assembly for eval environments

These run as part of the regular `pytest` suite and catch regressions in
prompt content and scenario definitions.

## Counterfactual trajectory testing

The `counterfactual` command answers a different question from ADHD behavior
evals: **"what would the agent have done here if I changed X?"** It replays a
real production transcript up to a chosen point, applies an intervention, and
prints the new response next to the original so you can see exactly what
changed.

Use it when you want to validate a prompt edit, tool restriction, or model
swap against a session that already happened — without waiting for similar
behavior to recur in production.

| Tool             | Question it answers                                              | Input                                  |
| ---------------- | ---------------------------------------------------------------- | -------------------------------------- |
| `ollim-bot eval` | Does the agent handle simulated ADHD users well?                 | Synthetic scenarios                    |
| `counterfactual` | How would the agent have responded here with different settings? | A real past session and a rewind point |

<Note>
  `counterfactual` is a separate top-level command installed alongside
  `ollim-bot` by `uv tool install --editable .`. It is not an `ollim-bot`
  subcommand.
</Note>

### How it works

`counterfactual` truncates the session's JSONL transcript at the rewind point,
then uses the Claude Agent SDK's fork feature to resume from the truncated
state with modified [`ClaudeAgentOptions`](https://github.com/anthropics/claude-agent-sdk).
It can run two forks in parallel — a **baseline** (same settings as the
original) and a **variant** (with the intervention) — so you can distinguish
sampling noise from the intervention's effect. The truncated file and both
fork sessions are cleaned up after the run.

Source: `src/ollim_bot/eval/counterfactual.py` and
`src/ollim_bot/eval/counterfactual_cli.py`.

### Running a test

```bash theme={null}
# 1. Find the session and the message UUID you want to rewind to
claude-history sessions --cwd ~/.ollim-bot --since 7d
claude-history transcript <session_id> --cwd ~/.ollim-bot

# 2. Replay with an intervention
counterfactual <session> <rewind_uuid> --append "Respond in one sentence."
```

The `session` argument accepts UUID prefixes, `prev`, `prev-N`, or slug names —
matching `claude-history` conventions. The `rewind_uuid` must be a
**user message** UUID (assistant and tool-result UUIDs are rejected), and it
cannot be the first record in the session.

See the [counterfactual CLI reference](/development/cli-reference#counterfactual)
for every flag and default.

### Bundled Claude Code skill

The source repo ships a `counterfactual-test` Claude Code skill at
`.claude/skills/counterfactual-test/SKILL.md`. It walks Claude Code through
picking a session and rewind point, choosing an intervention, and interpreting
the output. The skill sets `disable-model-invocation: true`, so it runs only
when you type `/counterfactual-test` in a Claude Code session inside the
source repo — the model will not trigger it automatically.

This is a Claude Code dev-harness skill, separate from
[ollim-bot runtime skills](/extending/skills) loaded by the bot at startup.

### Cost and gotchas

* Default caps are `$0.50` per run and `5` turns; `--with-baseline` doubles
  cost by running two forks.
* **Discord MCP tools are not connected** — pick rewind points where the
  original response did not depend on `ping_user`, `discord_embed`, or other
  Discord-side tools, or the comparison is invalid.
* **Variant runs use `bypassPermissions`** — tools denied in production can
  succeed on replay, which may change tool selection behavior.
* **Profile drift** — the variant uses the current `IDENTITY.md` and
  `USER.md`. If these changed since the original session, differences may
  reflect profile edits rather than the intervention.
* Interrupted runs may leave orphaned JSONL files under
  `~/.claude/projects/`. Delete them manually if cleanup did not run.

## Writing tests

### Basic pattern

Tests follow a three-part structure: set up state, call the function, assert the result.

```python theme={null}
def test_load_returns_defaults_when_no_file(data_dir):
    state = ping_budget.load()

    assert state.capacity == 5
    assert state.available == 5.0
    assert state.refill_rate_minutes == 90
    assert state.critical_used == 0
    assert state.daily_used == 0
```

Use `data_dir` as a fixture parameter whenever the test involves file I/O.
For tests that only need a temp directory without path redirection, use
pytest's built-in `tmp_path`.

### Async tests

<Tabs>
  <Tab title="async via _run() helper">
    Most async tests use a sync `_run()` helper that drives coroutines
    through the event loop — no `@pytest.mark.asyncio` needed:

    ```python theme={null}
    def _run(coro):
        return asyncio.get_event_loop().run_until_complete(coro)


    def test_many_concurrent_appends(data_dir):
        """Stress test: 20 concurrent append_update calls — all must survive."""

        async def _scenario():
            tasks = [asyncio.create_task(append_update(f"update-{i}")) for i in range(20)]
            await asyncio.gather(*tasks)

            result = await pop_pending_updates()
            messages = sorted([u.message for u in result])
            expected = sorted([f"update-{i}" for i in range(20)])
            assert messages == expected

        _run(_scenario())
    ```
  </Tab>

  <Tab title="sync test with real objects">
    For sync test functions that work with synchronous APIs:

    ```python theme={null}
    def test_append_and_list_routines(data_dir):
        r1 = Routine.new(message="morning", cron="0 8 * * *")
        r2 = Routine.new(message="evening", cron="0 18 * * *")

        append_routine(r1)
        append_routine(r2)
        result = list_routines()

        assert len(result) == 2
        messages = {r.message for r in result}
        assert messages == {"morning", "evening"}
    ```
  </Tab>
</Tabs>

### Assertions

Use direct assertions rather than assertion helpers:

```python theme={null}
# Equality
assert loaded.capacity == 5

# Membership
assert "5/5 available" in status

# Boolean
assert result is True

# Approximate (for floats)
from pytest import approx
assert loaded.available == approx(3.0, abs=0.01)

# Exceptions
with pytest.raises(ValueError):
    parse_cron("not a cron")
```

## Next steps

<Columns cols={2}>
  <Card title="Development guide" icon="code" href="/development/guide">
    Dev setup, project structure, and code conventions.
  </Card>

  <Card title="CLI reference" icon="terminal" href="/development/cli-reference">
    All ollim-bot subcommands and flags.
  </Card>

  <Card title="Architecture overview" icon="sitemap" href="/architecture/overview">
    Module map and data flow through the system.
  </Card>

  <Card title="Troubleshooting" icon="wrench" href="/development/troubleshooting">
    Common issues and debugging techniques.
  </Card>
</Columns>
