Architecture
A high-level map of the system so you know which component does what.
Audience: developers and operators who want the mental model.
What you will accomplish: understand the layers, the 8-node pipeline, and the three request lifecycles.
The layers
- FastAPI app (main.py): the HTTP surface, versioned at /api/v1 (the legacy /api/* routes still work but are deprecated).
- Middleware chain: correlation ID injection, request timing, proxy-aware rate limiting, CORS, and API-key auth on protected ingest/review/feedback-list routes.
- LangGraph orchestrator (graph/builder.py): the 8-node conversation pipeline that turns a question into a grounded, verified answer.
- Redis: conversation memory + running summary, rate-limit counters, the ingest registry, the durable ingest queue, and the learning_review queue.
- ChromaDB: the vector store (cosine distance) holding ingested document chunks, plus a separate synthesized_answers collection used only in learning modes.
The 8-node pipeline
The orchestrator runs these nodes in order:
Step 1: load_memory (Redis)
Load the user’s conversation history and running summary.
Step 2: condense_query (LLM)
Context-aware rewrite — condense a multi-turn follow-up into a standalone search query. Skipped on the first turn. Generation still uses the original question. See Retrieval.
Step 3: retrieve_context (Chroma)
Mode-aware relevance score gate, then MMR (or hybrid) retrieval of diverse chunks.
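The gate-then-diversify idea in this step can be sketched in plain Python. This is a minimal illustration, not the project's retriever: the function name gate_then_mmr, the threshold min_score, and the values of k and lambda_mult are all assumptions, and it works on cosine similarity over raw vectors rather than going through ChromaDB.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def gate_then_mmr(query_vec, chunks, min_score=0.3, k=2, lambda_mult=0.5):
    """Select up to k chunk ids: score gate first, then MMR.

    chunks is a list of (chunk_id, vector) pairs. min_score, k and
    lambda_mult are illustrative values, not the project's settings.
    """
    # Score gate: drop chunks that are not similar enough to the query.
    scored = [(cid, vec, cosine(query_vec, vec)) for cid, vec in chunks]
    candidates = [c for c in scored if c[2] >= min_score]

    selected = []
    while candidates and len(selected) < k:

        def mmr_score(cand):
            _, vec, relevance = cand
            # Redundancy: highest similarity to anything already selected.
            redundancy = max((cosine(vec, s[1]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return [cid for cid, _, _ in selected]
```

With lambda_mult=1.0 the redundancy term vanishes and the selection reduces to a plain relevance ranking; lower values trade relevance for diversity among the returned chunks.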
Step 4: generate_answer (LLM)
Mode-specific prompt → resilient LLM call (retries + circuit breaker).
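A resilient call of this kind usually combines bounded retries with a circuit breaker that fails fast once the backend looks unhealthy. The sketch below shows one common shape of that pattern; the class name CircuitBreaker, the exception CircuitOpenError, and every threshold are hypothetical and are not taken from the project's code.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected untried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, retries=2, backoff=0.0, **kwargs):
        # While open, reject immediately until the cool-down has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: close the circuit and try again.
            self.opened_at = None
            self.failures = 0

        last_error = None
        for attempt in range(retries + 1):
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                last_error = exc
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    # Too many consecutive failures: open and stop retrying.
                    self.opened_at = self.clock()
                    break
                if attempt < retries:
                    time.sleep(backoff * (2 ** attempt))  # exponential backoff
            else:
                self.failures = 0  # any success resets the failure count
                return result
        raise last_error
```

A caller would wrap the LLM request as breaker.call(send_prompt, prompt, retries=2, backoff=0.5), where send_prompt stands for whatever function actually performs the request.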
Step 5: verify_answer
Groundedness gate. Strict mode refuses an answer unsupported by the retrieved chunks. See Groundedness.
Step 6: self_ingest (Chroma)
Learning modes only: capture a synthesized answer (embed now in learning, or queue for review in learning_review).
Step 7: summarize
Update the rolling conversation summary for long-term context.
Step 8: store_memory (Redis)
Persist the updated conversation back to Redis (TTL-based expiry).
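The eight steps above can be sketched as plain functions applied in order to a shared state dictionary. This is only a stand-in to make the control flow concrete: the real pipeline is built with LangGraph in graph/builder.py, and the state keys, the stub bodies, the refusal text, and the run_pipeline helper below are all hypothetical.

```python
def load_memory(state):
    # Stand-in for the Redis read; history is passed in by the caller here.
    state.setdefault("history", [])
    return state


def condense_query(state):
    # Skipped on the first turn. String joining stands in for the LLM rewrite.
    if state["history"]:
        state["search_query"] = f"{state['history'][-1]} {state['question']}"
    else:
        state["search_query"] = state["question"]
    return state


def retrieve_context(state):
    state["chunks"] = []  # placeholder: no vector search in this sketch
    return state


def generate_answer(state):
    # Generation uses the original question, not the condensed query.
    state["answer"] = f"(answer to: {state['question']})"
    return state


def verify_answer(state):
    # Crude stand-in for the groundedness gate: strict mode refuses
    # when no retrieved chunk is available to support the answer.
    if state["mode"] == "strict" and not state["chunks"]:
        state["answer"] = "I cannot answer that from the ingested documents."
        state["refused"] = True
    return state


def self_ingest(state):
    # Learning modes only: embed now, or queue for review.
    if state["mode"] == "learning":
        state["self_ingest"] = "embedded"
    elif state["mode"] == "learning_review":
        state["self_ingest"] = "queued_for_review"
    return state


def summarize(state):
    state["summary"] = f"{len(state['history']) + 1} turn(s) so far"
    return state


def store_memory(state):
    # Stand-in for the Redis write with TTL.
    state["history"] = state["history"] + [state["question"]]
    return state


NODES = [load_memory, condense_query, retrieve_context, generate_answer,
         verify_answer, self_ingest, summarize, store_memory]


def run_pipeline(question, mode="strict", history=None):
    state = {"question": question, "mode": mode,
             "history": list(history or []), "trace": []}
    for node in NODES:
        state = node(state)
        state["trace"].append(node.__name__)  # record execution order
    return state
```

Calling run_pipeline("and hybrid?", mode="learning_review", history=["What is MMR?"]) exercises the follow-up path: the search query is condensed from the history, while the generated answer still refers to the original question.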
The node order, expressed compactly:
load_memory → condense_query → retrieve_context → generate_answer
→ verify_answer → self_ingest → summarize → store_memory
The README architecture diagram
┌─────────────────────────────────────────────────────┐
│                     User Query                      │
└────────────────────────┬────────────────────────────┘
                         │
                         ▼
               FastAPI POST /api/chat
                         │
                         ▼
          ┌──────────────────────────────┐
          │  LangGraph Orchestrator      │
          │                              │
          │  1. load_memory      (Redis) │
          │  2. condense_query   (LLM)   │  ← context-aware rewrite (multi-turn)
          │  3. retrieve_context (Chroma)│  ← mode-aware gate + mmr/hybrid
          │  4. generate_answer  (LLM)   │  ← mode-specific prompt
          │  5. verify_answer            │  ← groundedness gate (strict refuses)
          │  6. self_ingest      (Chroma)│  ← learning mode only
          │  7. summarize                │
          │  8. store_memory     (Redis) │
          └──────────────────────────────┘
                         │
                         ▼
                 Response to User
Request lifecycles
- Sync chat (POST /api/v1/chat): runs the full pipeline and returns a typed { answer, sources[], meta } envelope in one response.
- Streaming (POST /api/v1/chat/stream): runs the same pipeline but streams tokens as Server-Sent Events in the order token → sources → done (with error on failure). The groundedness refusal in strict mode applies to the stored answer and the done meta; tokens already streamed cannot be retracted.
- Async ingest (POST /api/v1/ingest or /ingest/upload): returns 202 Accepted with status=queued and a Location header, then processes in the background. Poll GET /api/v1/ingest/status/{doc_id}. See Deployment for inline vs queue modes.
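The streaming order token → sources → done, with error on failure, can be illustrated with a small generator that emits Server-Sent Events frames. The frame layout (an event: line, a data: line, then a blank line) follows the SSE wire format; the function names and the JSON payload shapes are assumptions made for this sketch, not the service's actual schema.

```python
import json


def sse_frame(event, data):
    """Format one SSE frame: event name, JSON payload, blank-line terminator."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_chat(token_iter, sources, meta):
    """Yield frames as token* -> sources -> done, or error if generation fails."""
    try:
        for token in token_iter:
            yield sse_frame("token", {"text": token})
        yield sse_frame("sources", {"sources": sources})
        # done carries the final meta, e.g. whether strict mode refused.
        yield sse_frame("done", {"meta": meta})
    except Exception as exc:
        # Tokens already yielded stay sent; the client only learns of the failure.
        yield sse_frame("error", {"message": str(exc)})
```

Because each frame is yielded as soon as it is produced, anything emitted before a failure or a strict-mode refusal has already reached the client, which is why only the done meta can carry the final verdict.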
Verify your result
- Verify: You can name all 8 pipeline nodes in order.
- Verify: You know which lifecycle (sync / streaming / async ingest) a given endpoint uses.
- Verify: You know that Redis backs memory + rate limiting + queues, and ChromaDB backs retrieval.
Common pitfalls
- Assuming streaming can retract a refusal: once tokens stream they are sent; only the stored answer and done meta reflect a strict-mode refusal.
- Treating synthesized_answers as the main collection: it is consulted only in learning modes and never pollutes strict/open retrieval.
Related next steps
- Dig into the score gate and MMR/hybrid in Retrieval.
- Understand the verification step in Groundedness.
- Compare the four behaviors in Chat modes.