Streaming (SSE)
Render a typing effect by streaming the answer as it is generated.
Audience: developers building chat UIs.
What you will accomplish: consume POST /api/v1/chat/stream as Server-Sent Events and
render tokens as they arrive.
The endpoint
POST http://127.0.0.1:8000/api/v1/chat/stream
Content-Type: application/json

The request body is identical to /api/v1/chat: the same q, mode, lang, top_k, and
score_threshold fields. The response is a text/event-stream.
Event order
Frames arrive in this order:
event: token
data: {"delta": "You can "}

event: token
data: {"delta": "return..."}

event: sources
data: {"sources": [ … ]}

event: done
data: {"meta": { … }}

- token (repeated): incremental answer text in delta.
- sources: the structured citations, once retrieval is resolved.
- done: final meta, with the same fields as the non-streaming response (including grounded / grounded_score).
- error: emitted on failure instead of done, with no internal detail.
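The frame layout above can be turned into a small parser. The following is a minimal sketch, not the reference client's code: parseSseFrame is a hypothetical helper name, and it assumes every data payload is JSON, as in the frames shown. It follows the general SSE wire format (comment lines start with a colon, one leading space after the colon is dropped, multiple data lines are joined with a newline).

```javascript
// Parse one SSE frame (the text between two blank lines) into { event, data }.
// Hypothetical helper for illustration; assumes the data payload is JSON.
function parseSseFrame(frame) {
  let event = "message"; // SSE default when the frame has no event: field
  const dataLines = [];
  for (const rawLine of frame.split("\n")) {
    const line = rawLine.replace(/\r$/, ""); // tolerate CRLF line endings
    if (line === "" || line.startsWith(":")) continue; // blank or comment line
    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.startsWith(" ")) value = value.slice(1); // drop one leading space
    if (field === "event") event = value;
    else if (field === "data") dataLines.push(value);
  }
  if (dataLines.length === 0) return null; // nothing to dispatch
  return { event, data: JSON.parse(dataLines.join("\n")) };
}

const parsed = parseSseFrame('event: token\ndata: {"delta": "You can "}');
// parsed.event is "token" and parsed.data.delta is "You can "
```

Keeping the parser separate from the read loop makes it easy to unit-test against recorded frames without a running server.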
Read the stream in the browser
EventSource only issues GET requests, so it cannot call this POST endpoint. Read the
response body stream directly instead and split it on blank lines into SSE frames.
const res = await fetch("http://127.0.0.1:8000/api/v1/chat/stream", {
  method: "POST",
  headers: { "Content-Type": "application/json", "X-User-Id": "alice" },
  body: JSON.stringify({ q: "What is the return policy?" }),
});
const reader = res.body.getReader();
const dec = new TextDecoder();
let buf = "";    // undecoded remainder that does not yet form a full frame
let answer = ""; // answer text accumulated from token deltas
for (;;) {
  const { done, value } = await reader.read();
  if (done) break;
  buf += dec.decode(value, { stream: true });
  // SSE frames are separated by a blank line.
  let sep;
  while ((sep = buf.indexOf("\n\n")) !== -1) {
    const frame = buf.slice(0, sep);
    buf = buf.slice(sep + 2);
    let event = "message"; // SSE default event name
    let data = "";
    for (const line of frame.split("\n")) {
      if (line.startsWith("event:")) event = line.slice(6).trim();
      // Multiple data: lines in one frame are joined with a newline.
      else if (line.startsWith("data:")) data += (data ? "\n" : "") + line.slice(5).trim();
    }
    if (!data) continue;
    const payload = JSON.parse(data);
    if (event === "token") answer += payload.delta; // render incrementally
    else if (event === "sources") renderCitations(payload.sources);
    else if (event === "done") finalize(payload.meta); // reconcile grounding
    else if (event === "error") showError();
  }
}

The same request with curl:

curl -N -X POST http://127.0.0.1:8000/api/v1/chat/stream \
-H "Content-Type: application/json" \
-H "X-User-Id: alice" \
-d '{"q":"What is the return policy?"}'
# -N disables curl buffering so you see token frames as they arrive.

Verify your result
- Verify: You receive one or more token frames, then a single sources frame, then a done frame.
- Verify: Concatenating every token.delta reproduces the full answer.
- Verify: done.meta carries grounded and correlation_id, matching the non-streaming shape.
- Verify: On failure you receive an error frame (and no done).
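The ordering checks in this list can be automated against a recorded list of event names. This is a rough sketch under stated assumptions: checkEventOrder is a hypothetical helper, and it assumes an error frame is always the last frame of a failed stream, which the text implies but does not state outright.

```javascript
// Check recorded event names against the documented order:
// one or more token frames, one sources frame, then done;
// or, on failure, an error frame and no done.
// Hypothetical helper; assumes error is the final frame of a failed stream.
function checkEventOrder(events) {
  if (events.includes("error")) {
    return !events.includes("done") && events[events.length - 1] === "error";
  }
  const sourcesAt = events.indexOf("sources");
  const tokens = events.slice(0, sourcesAt === -1 ? 0 : sourcesAt);
  return (
    tokens.length >= 1 &&
    tokens.every((name) => name === "token") &&
    events.length === sourcesAt + 2 && // exactly one frame follows sources
    events[sourcesAt + 1] === "done"
  );
}

const ok = checkEventOrder(["token", "token", "sources", "done"]); // true
const bad = checkEventOrder(["sources", "done"]); // false: no token frames
```

In a test, push each parsed event name into an array while reading the stream, then pass the array to the check once the stream ends.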
Common failure modes
- Input rejected before any token → a guardrail (prompt-injection) block returns a 400 error frame before streaming begins.
- Stop mid-stream → abort the fetch (e.g. via an AbortController); the reference web client in web/ ships a Stop button that does exactly this.
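Stopping mid-stream might look like the sketch below. It is not the code from the reference client in web/: startStream and its fetchImpl parameter are hypothetical names introduced here, and the injectable fetch exists only so the function can be exercised without a server. AbortController and the signal option of fetch are standard browser APIs.

```javascript
// Start the stream with an AbortSignal attached so a Stop button can cancel it.
// startStream and fetchImpl are hypothetical names used for this sketch.
function startStream(body, fetchImpl = fetch) {
  const controller = new AbortController();
  const response = fetchImpl("http://127.0.0.1:8000/api/v1/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json", "X-User-Id": "alice" },
    body: JSON.stringify(body),
    signal: controller.signal, // aborting rejects the fetch and any pending read
  });
  return { response, stop: () => controller.abort() };
}

// Usage sketch (stopButton is an assumed DOM element):
// const { response, stop } = startStream({ q: "What is the return policy?" });
// stopButton.addEventListener("click", stop);
// response.catch((err) => { if (err.name !== "AbortError") throw err; });
```

After stop() is called, the pending fetch or reader.read() rejects with an AbortError, so the read loop should treat that error as a normal user-initiated stop rather than a failure.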
Related next steps
- The shared request body and modes live in Chatting.
- Reconcile the done verdict using Trust & citations.
- Handle the error frame using Errors & rate limits.