This page covers the internal streaming pipeline — it’s aimed at developers working on the bot’s source code. For how messages appear in Discord from a user perspective, see Conversations.
Agent responses stream token-by-token from the Claude Agent SDK into Discord messages. Two modules handle this: agent_streaming.py consumes the SDK response (parsing StreamEvent messages, handling auto-compaction retries, and detecting fork interrupts), while streamer.py bridges the resulting text deltas to Discord — handling rate limits, the 2000-character message cap, and typing indicators during tool execution pauses.

Overview

The streaming pipeline has three stages:
  1. Agent.stream_chat() yields text deltas and StreamStatus signals. It delegates SDK response consumption to stream_response() in agent_streaming.py, which handles the message loop, auto-compaction retry, and fork interrupt detection. Raw SSE events are parsed by StreamParser in streamer.py.
  2. stream_to_channel() buffers text into a progressively edited Discord message and routes StreamStatus signals to an ephemeral status message
  3. A background editor task flushes the buffer on a fixed interval, keeping edits throttled and showing typing indicators during pauses
The key design constraint is Discord’s rate limit on message edits (roughly 5 edits per 5 seconds per channel). The streamer stays well within this by editing at a fixed interval rather than on every delta.

Text deltas and status signals

Agent.stream_chat() is an async generator that yields two types of output: text strings (for display) and StreamStatus signals (for phase transitions like thinking, tool use, and compaction). StreamParser.feed() processes individual SSE event dicts from the SDK:
| Event type | Handling |
| --- | --- |
| content_block_delta with text | Yielded as a text string |
| content_block_delta with input_json_delta | Accumulated for tool label formatting (not yielded as text) |
| content_block_start with tool_use | Captures the tool name for label emission |
| content_block_stop for tool use | Emits a StreamStatus(kind="tool_start") signal with a formatted label |
| content_block_start with thinking | Emits a StreamStatus(kind="thinking_start") signal |
Tool labels are rendered progressively as subdued text in Discord (using the -# markdown small-text prefix). Each label flushes when the next tool starts or when text arrives — rather than batching all labels until text appears. This eliminates visual jumps when multiple tools complete before the agent writes text. StreamParser._drain() converts deferred labels into -# *{label}* text, -# *~~{label}~~ — denied (use /permissions ask to approve)* if the tool was denied by the permission system, or -# *~~{label}~~ — error* if the tool returned an error or failed during execution (interrupted failures are skipped). MCP tool names are displayed without the mcp__<server>__ prefix — the formatter strips the prefix generically for all MCP servers, so mcp__discord__discord_embed renders as discord_embed in the stream label.

Nested tool activity

When the agent delegates work to a subagent (the Agent tool in Claude Code CLI 2.1+), the status message surfaces which tool the subagent is currently using, rather than showing a static timer with no visibility into subagent work. stream_response() listens for TaskProgressMessage events from the SDK, extracts last_tool_name, and yields a StreamStatus(kind="task_progress") signal with a formatted label. The label format is agent_name(description) · tool_name, for example:
ollim-bot-guide(search for docs) · Read... (12s)
The description is truncated to 40 characters. MCP tool names are stripped of the mcp__<server>__ prefix, matching the convention used for top-level tool labels. On the Discord side, stream_to_channel() updates the existing status message label without resetting the timer — so the elapsed time reflects how long the subagent has been running overall, not how long the current tool has been active.
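A hypothetical helper for this label format (the constant name, the ellipsis truncation style, and the function name are assumptions for illustration):

```python
MAX_DESC_LEN = 40  # assumed name for the 40-character description cap

def format_task_label(agent_name: str, description: str, tool_name: str) -> str:
    """Build an agent_name(description) · tool_name label for task_progress signals."""
    if len(description) > MAX_DESC_LEN:
        description = description[:MAX_DESC_LEN - 1] + "…"
    if tool_name.startswith("mcp__"):
        # Strip the mcp__<server>__ prefix, matching top-level tool labels
        tool_name = tool_name.split("__", 2)[-1]
    return f"{agent_name}({description}) · {tool_name}"
```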
When an enter_fork tool fires during streaming, stream_response() interrupts the SDK client and suppresses remaining stream events. The loop continues to drain messages so the ResultMessage still saves the session ID, but no further text deltas are yielded.

Throttled editing

stream_to_channel() consumes text deltas and StreamStatus signals, maintaining a text buffer. A background editor coroutine flushes the buffer to Discord on a fixed schedule, while StreamStatus signals control an ephemeral status message (for thinking indicators, tool labels, and compaction progress).

Timing constants

| Constant | Value | Purpose |
| --- | --- | --- |
| FIRST_FLUSH_DELAY | 0.2s | Initial delay so the first message accumulates a meaningful chunk |
| EDIT_INTERVAL | 0.5s | Responsive feel while staying within Discord's rate limits |
| MAX_MSG_LEN | 2000 | Discord's maximum message length |
| STATUS_TICK | 1.0s | Interval between timer ticks on the status message (e.g., "Thinking… (1s)") |

Flush cycle

The editor runs this loop:
  1. Wait FIRST_FLUSH_DELAY before the first flush
  2. Flush the buffer — send a new message or edit the existing one
  3. Wait EDIT_INTERVAL
  4. If a status message is active (thinking or tool use), update its timer display
  5. Else if new content arrived (stale flag is set), flush again
  6. Else if the response isn’t done, send a typing indicator
  7. Repeat from step 3 until the delta stream ends
The stale flag is set whenever new text arrives from the generator and cleared after each successful flush. This avoids unnecessary edits when no new text has accumulated.
Discord.py handles HTTP 429 rate-limit responses transparently, so even if edits occasionally bunch up, the library retries automatically.
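The loop above can be sketched as follows. flush(), update_status_timer(), send_typing(), and the flag attributes are assumed names for illustration, not the real streamer.py API:

```python
import asyncio

FIRST_FLUSH_DELAY = 0.2
EDIT_INTERVAL = 0.5

async def editor_loop(state) -> None:
    """Background editor: flush buffered text on a fixed interval."""
    await asyncio.sleep(FIRST_FLUSH_DELAY)
    await state.flush()                      # first flush: send or edit a message
    while not state.done:
        await asyncio.sleep(EDIT_INTERVAL)
        if state.status_active:              # thinking / tool-use status message
            await state.update_status_timer()
        elif state.stale:                    # new text arrived since last flush
            await state.flush()
            state.stale = False
        else:
            await state.send_typing()        # keep the typing indicator alive
```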

Overflow handling

When the buffer exceeds 2000 characters, the streamer splits across multiple messages at natural boundaries — preferring the last newline, then the last space, then a hard cut at 2000 characters. A natural break is only accepted if it falls within the last 200 characters of the window, so messages use most of the available capacity rather than splitting too early.
  1. The current message is finalized at the chosen split point (via msg.edit() or initial channel.send())
  2. A new message is sent with the overflow text
  3. If the overflow itself exceeds 2000 characters, the process repeats in a loop until all accumulated text is dispatched
  4. The msg_start index tracks where the current message begins in the full buffer, so the streamer always knows which slice to send
Each new message created during overflow is registered with track_message() for fork session tracking. This ensures that if the response came from a background fork, the user can reply to any of the overflow messages to resume that fork.
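The split-point preference described above (last newline, then last space, then a hard cut, with natural breaks accepted only in the final 200 characters) can be sketched as a small function; the names here are assumptions, not the real streamer.py API:

```python
MAX_MSG_LEN = 2000
NATURAL_BREAK_WINDOW = 200  # assumed name for the 200-character acceptance window

def find_split(text: str) -> int:
    """Return the index at which to split a buffer longer than MAX_MSG_LEN."""
    window = text[:MAX_MSG_LEN]
    for candidate in (window.rfind("\n"), window.rfind(" ")):
        # Only accept a natural break if it uses most of the capacity
        if candidate >= MAX_MSG_LEN - NATURAL_BREAK_WINDOW:
            return candidate
    return MAX_MSG_LEN  # no natural break late enough: hard cut
```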

Typing indicators

The streamer shows Discord typing indicators (channel.typing()) when the agent is working but not producing text — typically between tool executions. This happens in the editor loop: when the interval fires, no status message is active, and the stale flag is false (no new text), the editor sends a typing indicator instead of editing the message. During active tool use or thinking, the ephemeral status message handles visibility instead.
Before each stream_to_channel() call, the bot sends an initial channel.typing(). Then, immediately after starting the editor task, the streamer shows a “Thinking…” status message — before the API sends its first SSE event. This eliminates the dead zone between client.query() and the first event where only the typing indicator was visible. When the real thinking_start event arrives, _set_status sees the same label and keeps the timer running without resetting it. If text arrives first (no thinking phase), the text handler clears the status automatically.

Interrupt on new message

When the user sends a new message while a response is streaming, the bot interrupts the current response:
  1. on_message checks if the agent lock is held (meaning a response is in progress)
  2. If locked and not compacting, it calls agent.interrupt(), which cancels pending permission requests and interrupts the SDK client
  3. The interrupted stream_chat() generator stops yielding deltas
  4. stream_to_channel() finishes its final flush with whatever text accumulated
  5. The new message is processed with a fresh stream_to_channel() call
Interrupts are skipped while the agent is auto-compacting (agent.is_compacting). An interrupt during compaction would kill the post-compaction response, and the new message would trigger a redundant compaction cycle. The new message still queues behind the lock and runs after compaction finishes. The /interrupt slash command provides the same behavior on demand.
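A minimal sketch of the interrupt guard in steps 1 and 2, assuming a function name and attribute layout that are illustrative only:

```python
async def maybe_interrupt(agent, agent_lock) -> None:
    """Interrupt an in-progress response, unless the agent is compacting."""
    if agent_lock.locked() and not agent.is_compacting:
        # Cancels pending permission requests and interrupts the SDK client;
        # the in-flight stream_to_channel() call then performs its final flush.
        await agent.interrupt()
    # If compacting, do nothing: the new message queues behind the lock
    # and runs after compaction finishes.
```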

Empty responses

If the delta stream produces no text at all — and no fork entry was requested, and no auto-compaction occurred — the streamer sends a fallback message:
no response — try again.
This covers edge cases where the agent’s entire response was tool use with no text output. The fallback is suppressed when a fork entry was requested (the agent called enter_fork) or when auto-compaction occurred — both are legitimate reasons for an empty buffer.

Message tracking

Every message sent by the streamer — both initial messages and overflow continuations — is registered via track_message(message_id). This feeds into the fork session tracking system: when a background fork streams a response, the message IDs are collected so that a user reply to any of those messages can resume the fork’s session. See reply-to-fork context for the full tracking lifecycle.

Auto-compaction annotation

When the SDK auto-compacts context mid-response, the streamer renders a visible annotation in the DM so you know what happened. The flow:
  1. The SDK emits a SystemMessage with subtype="compact_boundary" and ends the stream. stream_response() detects this, extracts the pre-compaction token count from compact_metadata, and yields a StreamStatus event (kind="compact_start") including an optional compact_tokens count.
  2. The streamer flushes any pre-compaction text to its own message
  3. An ephemeral status message appears with a timer: -# *Auto-compacting 5k tokens... (3s)*
  4. stream_response() re-sends the original query against the freshly compacted context (via client.query()), then streams the new response
  5. When the post-compaction response arrives, the streamer edits the status to a permanent annotation: -# *auto-compacted · 5k tokens · 8s*
  6. Post-compaction content continues in a new message
The annotation stays visible in chat history — the streamer does not delete it after compaction finishes. This gives you a clear record of when and why context was compacted.
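A minimal sketch of the permanent annotation text, assuming token counts are rounded to the nearest thousand (the function name and rounding rule are assumptions):

```python
def compaction_annotation(tokens: int, elapsed_s: int) -> str:
    """Format the permanent annotation; -# is Discord's small-text markdown prefix."""
    return f"-# *auto-compacted · {round(tokens / 1000)}k tokens · {elapsed_s}s*"
```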

Context usage warning

Auto-compaction handles context overflow automatically, but a heads-up before it triggers lets you compact on your own terms. After each response, stream_response() checks how much of the 200k context window the current input_tokens occupy. If context exceeds 60% and no auto-compaction just happened, stream_to_channel() sends a small annotation below the response. Two tiers of warning:
| Context usage | Annotation |
| --- | --- |
| 60–79% | -# *context: 67% (134k) — consider /compact* |
| 80%+ | -# *context: 85% (170k) — compaction soon, /compact recommended* |
The annotation is skipped after auto-compaction — context is freshly compacted, so there is nothing to warn about. The flow:
  1. stream_response() captures input_tokens from the final ResultMessage.usage
  2. After all text is yielded, if context exceeds 60% of 200k and the response was not auto-compacted, it yields a StreamStatus(kind="context_warning") with input_tokens and context_pct
  3. stream_to_channel() stores the warning signal and — after the final flush — sends the annotation as a small italic message
  4. The warning message is registered via track_message() for fork session tracking, so replying to it resumes the correct session
The 80% escalation threshold is defined as _ESCALATE_PCT in streamer.py. The base 60% threshold is _WARN_PCT in agent_streaming.py.

Next steps

Context flow

How context moves between sessions, forks, and pending updates.

Session management

Session IDs, lifecycle events, and compaction.

Forks

Interactive forks, exit strategies, and idle timeout.

Conversations

DM interface, message flow, and interrupt behavior.

Development guide

How to modify the streaming pipeline and other core modules.