Architecture

A high-level map of the system so you know which component does what.

Audience: developers and operators who want the mental model.
What you will accomplish: understand the layers, the 8-node pipeline, and the three request lifecycles.

The layers

FastAPI app (main.py) — the HTTP surface, versioned at /api/v1 (legacy /api/* still works, deprecated).
Middleware chain — correlation ID injection, request timing, proxy-aware rate limiting, CORS, and API-key auth on protected ingest/review/feedback-list routes.
LangGraph orchestrator (graph/builder.py) — the 8-node conversation pipeline that turns a question into a grounded, verified answer.
Redis — conversation memory + running summary, rate-limit counters, the ingest registry, the durable ingest queue, and the learning_review queue.
ChromaDB — the vector store (cosine distance) holding ingested document chunks, plus a separate synthesized_answers collection used only in learning modes.
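
To make the middleware layer concrete, here is a minimal sketch of correlation ID injection written as a plain ASGI wrapper using only the standard library. It is not the project's middleware: the header name `x-correlation-id`, the `demo_app` stand-in for the FastAPI app, and the `call` helper are all assumptions made for illustration.

```python
import asyncio
import uuid

# Hypothetical header name; the real middleware may use a different one.
HEADER = b"x-correlation-id"


def correlation_id_middleware(app):
    """Wrap an ASGI app so every HTTP response carries a correlation ID."""

    async def wrapped(scope, receive, send):
        if scope["type"] != "http":
            await app(scope, receive, send)
            return

        # Reuse the caller's ID when present, otherwise mint a new one.
        incoming = dict(scope.get("headers", []))
        cid = incoming.get(HEADER, uuid.uuid4().hex.encode())

        async def send_with_id(message):
            # Only the response-start message carries headers.
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((HEADER, cid))
                message = {**message, "headers": headers}
            await send(message)

        await app(scope, receive, send_with_id)

    return wrapped


async def demo_app(scope, receive, send):
    """Tiny stand-in for the FastAPI app: always answers 200 'ok'."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


def call(app, headers=()):
    """Drive one request through the app and return the sent ASGI messages."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/", "headers": list(headers)}
    asyncio.run(app(scope, receive, send))
    return sent
```

Calling `call(correlation_id_middleware(demo_app))` returns the two ASGI messages, with the generated ID attached to the first. Timing, rate limiting, and auth would presumably wrap the app in the same way, each layer seeing the request before the next.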

The 8-node pipeline

The orchestrator runs these nodes in order:

Step 1: load_memory (Redis)
Load the user’s conversation history and running summary.
Step 2: condense_query (LLM)
Context-aware rewrite — condense a multi-turn follow-up into a standalone search query. Skipped on the first turn. Generation still uses the original question. See Retrieval.
Step 3: retrieve_context (Chroma)
Mode-aware relevance score gate, then MMR (or hybrid) retrieval of diverse chunks.
Step 4: generate_answer (LLM)
Mode-specific prompt → resilient LLM call (retries + circuit breaker).
Step 5: verify_answer
Groundedness gate. Strict mode refuses an answer unsupported by the retrieved chunks. See Groundedness.
Step 6: self_ingest (Chroma)
Learning modes only — capture a synthesized answer (embed now in learning, or queue for review in learning_review).
Step 7: summarize
Update the rolling conversation summary for long-term context.
Step 8: store_memory (Redis)
Persist the updated conversation back to Redis (TTL-based expiry).

The node order, expressed compactly:

load_memory → condense_query → retrieve_context → generate_answer
  → verify_answer → self_ingest → summarize → store_memory
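
The node order above can be sketched as a list of functions applied in sequence to a shared state dictionary. This is a plain-Python stand-in, not the LangGraph code in graph/builder.py: only the node names come from this page, while the state keys (`question`, `history`, `search_query`, `trace`) and every node body are assumptions for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Node = Callable[[State], State]


def load_memory(state: State) -> State:
    # Stand-in for the Redis read: default to an empty history.
    return {**state, "history": list(state.get("history", []))}


def condense_query(state: State) -> State:
    # First turn: nothing to condense, so search with the question as-is.
    if not state["history"]:
        return {**state, "search_query": state["question"]}
    # Stand-in for the LLM rewrite: prepend the previous user turn.
    previous = state["history"][-1]
    return {**state, "search_query": f"{previous} {state['question']}"}


def passthrough(state: State) -> State:
    # Placeholder for nodes this sketch does not model.
    return state


def store_memory(state: State) -> State:
    # Stand-in for the Redis write: append the original question.
    return {**state, "history": state["history"] + [state["question"]]}


PIPELINE: List[Tuple[str, Node]] = [
    ("load_memory", load_memory),
    ("condense_query", condense_query),
    ("retrieve_context", passthrough),
    ("generate_answer", passthrough),
    ("verify_answer", passthrough),
    ("self_ingest", passthrough),
    ("summarize", passthrough),
    ("store_memory", store_memory),
]


def run_pipeline(question: str, history: List[str] = None) -> State:
    """Run every node in order and record the order in state['trace']."""
    state: State = {"question": question, "history": history or [], "trace": []}
    for name, node in PIPELINE:
        state = node(state)
        state["trace"] = state["trace"] + [name]
    return state
```

In this sketch, `run_pipeline("And hybrid?", history=["What is MMR?"])` produces a condensed `search_query` while leaving `question` untouched, mirroring the rule that retrieval uses the rewrite and generation still uses the original question.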

The README architecture diagram

┌─────────────────────────────────────────────────────┐
│                   User Query                        │
└────────────────────────┬────────────────────────────┘
                         │
                         ▼
              FastAPI  POST /api/chat
                         │
                         ▼
          ┌──────────────────────────────┐
          │     LangGraph Orchestrator   │
          │                              │
          │  1. load_memory   (Redis)    │
          │  2. condense_query  (LLM)    │ ← context-aware rewrite (multi-turn)
          │  3. retrieve_context (Chroma)│ ← mode-aware gate + mmr/hybrid
          │  4. generate_answer  (LLM)   │ ← mode-specific prompt
          │  5. verify_answer            │ ← groundedness gate (strict refuses)
          │  6. self_ingest  (Chroma)    │ ← learning modes only
          │  7. summarize                │
          │  8. store_memory  (Redis)    │
          └──────────────────────────────┘
                         │
                         ▼
                  Response to User

Request lifecycles

Sync chat (POST /api/v1/chat) — runs the full pipeline and returns a typed { answer, sources[], meta } envelope in one response.
Streaming (POST /api/v1/chat/stream) — runs the same pipeline but streams tokens as Server-Sent Events in the order token → sources → done (with error on failure). The groundedness refusal in strict mode applies to the stored answer and the done meta; tokens already streamed cannot be retracted.
Async ingest (POST /api/v1/ingest or /ingest/upload) — returns 202 Accepted with status=queued and a Location header, then processes the document in the background. Poll GET /api/v1/ingest/status/{doc_id} for progress. See Deployment for inline vs queue modes.
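
The async ingest lifecycle implies a client-side polling loop, sketched below. The HTTP call is injected as `fetch_status` so the loop stays independent of any particular client library. Only the `queued` status is documented on this page; the `processing` state, the response shape, and the function and parameter names are assumptions.

```python
import time
from typing import Any, Callable, Dict

# Only "queued" appears in the text above; "processing" is an assumed
# intermediate state. Any other status is treated as terminal.
PENDING = {"queued", "processing"}


def wait_for_ingest(
    fetch_status: Callable[[str], Dict[str, Any]],
    doc_id: str,
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll until the document leaves a pending state or the timeout passes."""
    deadline = clock() + timeout
    while True:
        body = fetch_status(doc_id)
        if body.get("status") not in PENDING:
            return body
        if clock() >= deadline:
            raise TimeoutError(f"ingest {doc_id} still pending after {timeout}s")
        sleep(interval)
```

In a real client, `fetch_status` would issue GET /api/v1/ingest/status/{doc_id} and return the decoded JSON body. A fixed interval keeps the sketch short; a production client might prefer backoff.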

Verify your result

Verify: You can name all 8 pipeline nodes in order.
Verify: You know which lifecycle (sync / streaming / async ingest) a given endpoint uses.
Verify: You know that Redis backs memory + rate limiting + queues, and ChromaDB backs retrieval.

Common pitfalls

Assuming streaming can retract a refusal — once tokens stream they are sent; only the stored answer and done meta reflect a strict-mode refusal.
Treating synthesized_answers as the main collection — it is consulted only in learning modes and never pollutes strict/open retrieval.
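
The first pitfall follows from the event order of the streaming lifecycle, which the sketch below reproduces with a plain generator. The frame layout follows the standard Server-Sent Events wire format; the payload fields (`text`, `refused`) and the function names are hypothetical, not the service's actual schema.

```python
import json
from typing import Any, Iterable, Iterator, List


def sse(event: str, data: Any) -> str:
    """Format one Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_answer(
    tokens: Iterable[str], sources: List[str], grounded: bool
) -> Iterator[str]:
    """Yield frames in the order token -> sources -> done."""
    # Tokens leave the server as soon as they are produced.
    for tok in tokens:
        yield sse("token", {"text": tok})
    yield sse("sources", sources)
    # Verification runs after generation, so a refusal can only surface
    # here. "refused" is a hypothetical meta field name.
    yield sse("done", {"refused": not grounded})
```

With `grounded=False`, the token frames are still emitted before the `done` frame reports the refusal, so a client that renders tokens live should check the `done` meta before treating the text as final.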

Dig into the score gate and MMR/hybrid in Retrieval.
Understand the verification step in Groundedness.
Compare the four behaviors in Chat modes.
