Streaming (SSE)

Render a typing effect by streaming the answer as it is generated.

Audience: developers building chat UIs.
What you will accomplish: consume POST /api/v1/chat/stream as Server-Sent Events and render tokens as they arrive.

The endpoint

POST http://127.0.0.1:8000/api/v1/chat/stream
Content-Type: application/json

The request body is identical to /api/v1/chat — same q, mode, lang, top_k, score_threshold. The response is a text/event-stream.
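Because the body is shared, one helper can target either endpoint and change only the path. The sketch below is illustrative: `chatRequest` and `BASE` are hypothetical names, and it assumes /api/v1/chat is also a JSON POST, which this page implies but does not state.

```javascript
// Base URL taken from the endpoint shown above.
const BASE = "http://127.0.0.1:8000/api/v1";

// Hypothetical helper: builds fetch() arguments for either endpoint from one body.
// Assumption: the non-streaming endpoint is also a JSON POST.
function chatRequest(body, { stream = false } = {}) {
  return {
    url: `${BASE}/chat${stream ? "/stream" : ""}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Only q is set here; mode, lang, top_k and score_threshold go in the same object.
const body = { q: "What is the return policy?" };
const plain = chatRequest(body);
const streamed = chatRequest(body, { stream: true });
```

Switching a client from the blocking call to the streaming one then only changes the URL and how the response is read.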

Event order

Frames arrive in this order:

event: token    data: {"delta": "You can "}
event: token    data: {"delta": "return..."}
event: sources  data: {"sources": [ … ]}
event: done     data: {"meta": { … }}

token (repeated) — incremental answer text in delta.
sources — the structured citations, once retrieval is resolved.
done — final meta, with the same fields as the non-streaming response (including grounded / grounded_score).
error — emitted on failure instead of done, with no internal detail.
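One way to model these four event types in a UI is a small reducer that folds each frame into view state. This is a minimal sketch, not the reference client's implementation: `initialState`, `applyEvent`, and the `status` field are hypothetical names, and only `delta`, `sources`, and `meta` come from the payloads listed above.

```javascript
// Hypothetical view state for one streamed answer.
function initialState() {
  return { answer: "", sources: null, meta: null, status: "streaming" };
}

// Fold one parsed frame (event name + JSON payload) into the state.
function applyEvent(state, event, payload) {
  switch (event) {
    case "token": // incremental answer text
      return { ...state, answer: state.answer + payload.delta };
    case "sources": // structured citations
      return { ...state, sources: payload.sources };
    case "done": // final meta, same fields as the non-streaming response
      return { ...state, meta: payload.meta, status: "done" };
    case "error": // replaces done; the payload carries no internal detail
      return { ...state, status: "error" };
    default: // ignore event types this client does not know
      return state;
  }
}
```

Keeping the reducer free of DOM calls makes it easy to unit-test the event handling separately from rendering.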

Read the stream in the browser

EventSource can only issue GET requests, so it cannot send this POST. Instead, read the response body stream directly with fetch and split it on blank lines into SSE frames.

const res = await fetch("http://127.0.0.1:8000/api/v1/chat/stream", {
  method: "POST",
  headers: { "Content-Type": "application/json", "X-User-Id": "alice" },
  body: JSON.stringify({ q: "What is the return policy?" }),
});

const reader = res.body.getReader();
const dec = new TextDecoder();
let buf = "";
let answer = "";

for (;;) {
  const { done, value } = await reader.read();
  if (done) break;
  buf += dec.decode(value, { stream: true });

  // SSE frames are separated by a blank line.
  let sep;
  while ((sep = buf.indexOf("\n\n")) !== -1) {
    const frame = buf.slice(0, sep);
    buf = buf.slice(sep + 2);

    let event = "message";
    let data = "";
    for (const line of frame.split("\n")) {
      if (line.startsWith("event:")) event = line.slice(6).trim();
      if (line.startsWith("data:")) data += line.slice(5).trim();
    }
    if (!data) continue;
    const payload = JSON.parse(data);

    if (event === "token") answer += payload.delta;       // render incrementally
    else if (event === "sources") renderCitations(payload.sources);
    else if (event === "done") finalize(payload.meta);    // reconcile grounding
    else if (event === "error") showError();
  }
}

To inspect the raw stream from a terminal, send the same request with curl:

curl -N -X POST http://127.0.0.1:8000/api/v1/chat/stream \
  -H "Content-Type: application/json" \
  -H "X-User-Id: alice" \
  -d '{"q":"What is the return policy?"}'
# -N disables curl buffering so you see token frames as they arrive.

Verify your result

Verify: You receive one or more token frames, then a single sources frame, then a done frame.
Verify: Concatenating every token.delta reproduces the full answer.
Verify: done.meta carries grounded and correlation_id, matching the non-streaming shape.
Verify: On failure you receive an error frame (and no done).
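These checks can be automated against a recorded stream. The sketch below assumes you have collected the parsed frames into an array of `{ event, data }` objects in arrival order. `checkStream` is a hypothetical helper, not part of the API.

```javascript
// events: parsed frames in arrival order, e.g. { event: "token", data: { delta: "..." } }.
// Returns a list of violated expectations plus the reassembled answer.
function checkStream(events) {
  const problems = [];
  const names = events.map((e) => e.event);

  // Failure path: an error frame is expected instead of done.
  if (names.includes("error")) {
    if (names.includes("done")) problems.push("error and done both present");
    return { problems, answer: null };
  }

  // Success path: one or more token frames, then exactly sources, then done.
  const firstNonToken = names.findIndex((n) => n !== "token");
  const tokenCount = firstNonToken === -1 ? names.length : firstNonToken;
  if (tokenCount < 1) problems.push("expected at least one token frame first");
  if (names.slice(tokenCount).join(",") !== "sources,done") {
    problems.push("expected a single sources frame followed by done");
  }

  // Concatenating every delta should reproduce the full answer.
  const answer = events
    .filter((e) => e.event === "token")
    .map((e) => e.data.delta)
    .join("");
  return { problems, answer };
}
```

An empty `problems` array means the recorded stream matched the expected order. The returned `answer` can then be compared with the text your UI rendered.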

Common failure modes

Input rejected before any token → a guardrail (prompt-injection) block returns a 400 error frame before streaming begins.
Stop mid-stream → abort the fetch (e.g. via an AbortController); the reference web client in web/ ships a Stop button that does exactly this.
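A Stop button along those lines might be wired up as follows. This is a sketch under stated assumptions, not the code from web/: `startStream` and `isAbort` are hypothetical helpers, and `fetchImpl` is injectable only so the logic can be exercised without a running server.

```javascript
// Hypothetical helper: starts the streaming request and returns a stop() handle.
function startStream(body, fetchImpl = fetch) {
  const controller = new AbortController();
  const response = fetchImpl("http://127.0.0.1:8000/api/v1/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json", "X-User-Id": "alice" },
    body: JSON.stringify(body),
    signal: controller.signal, // aborting this signal cancels the request and its body stream
  });
  return {
    response, // Promise<Response>, read it exactly as in the loop above
    signal: controller.signal,
    stop: () => controller.abort(),
  };
}

// After stop(), the pending fetch() or reader.read() rejects with an error
// named "AbortError"; treat that as a user action rather than a failure.
function isAbort(err) {
  return err != null && err.name === "AbortError";
}
```

In the UI, call `stop()` from the button's click handler and wrap the read loop in try/catch, ignoring errors for which `isAbort(err)` is true so the partial answer stays on screen.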

The shared request body and modes live in Chatting.
Reconcile the done verdict using Trust & citations.
Handle the error frame using Errors & rate limits.
