Architecture

A high-level map of the system so you know which component does what.

Audience: developers and operators who want the mental model.

What you will accomplish: understand the layers, the 8-node pipeline, and the three request lifecycles.

The layers

  • FastAPI app (main.py) — the HTTP surface, versioned at /api/v1 (legacy /api/* still works, deprecated).
  • Middleware chain — correlation ID injection, request timing, proxy-aware rate limiting, CORS, and API-key auth on protected ingest/review/feedback-list routes.
  • LangGraph orchestrator (graph/builder.py) — the 8-node conversation pipeline that turns a question into a grounded, verified answer.
  • Redis — conversation memory + running summary, rate-limit counters, the ingest registry, the durable ingest queue, and the learning_review queue.
  • ChromaDB — the vector store (cosine distance) holding ingested document chunks, plus a separate synthesized_answers collection used only in learning modes.
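As a rough illustration of how a middleware chain wraps a request handler, the sketch below composes two of the listed concerns (correlation ID injection and request timing) as plain Python wrappers. It is not the project's middleware: the header names `X-Correlation-ID` and `X-Process-Time-Ms`, the dict-based request/response shape, and `chat_handler` are all assumptions made for the example.

```python
import time
import uuid
from typing import Any, Callable, Dict

Message = Dict[str, Any]
Handler = Callable[[Message], Message]


def with_correlation_id(handler: Handler) -> Handler:
    """Reuse the caller's correlation ID or mint one, and echo it on the response."""
    def wrapped(request: Message) -> Message:
        headers = request.setdefault("headers", {})
        # Header name is an assumption for this sketch.
        cid = headers.get("X-Correlation-ID") or uuid.uuid4().hex
        headers["X-Correlation-ID"] = cid
        response = handler(request)
        response.setdefault("headers", {})["X-Correlation-ID"] = cid
        return response
    return wrapped


def with_timing(handler: Handler) -> Handler:
    """Measure how long the wrapped handler takes and report it in a header."""
    def wrapped(request: Message) -> Message:
        start = time.perf_counter()
        response = handler(request)
        elapsed_ms = (time.perf_counter() - start) * 1000
        response.setdefault("headers", {})["X-Process-Time-Ms"] = f"{elapsed_ms:.2f}"
        return response
    return wrapped


def chat_handler(request: Message) -> Message:
    """Hypothetical stand-in for a route handler."""
    return {"status": 200, "body": {"answer": "..."}}


# The outermost wrapper runs first, so the correlation ID exists before timing starts.
app = with_correlation_id(with_timing(chat_handler))
```

Calling `app({"headers": {"X-Correlation-ID": "abc"}})` returns a response that carries the same ID plus a timing header; a request without the header gets a freshly generated ID.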

The 8-node pipeline

The orchestrator runs these nodes in order:

  1. load_memory (Redis)

    Load the user’s conversation history and running summary.

  2. condense_query (LLM)

    Context-aware rewrite — condense a multi-turn follow-up into a standalone search query. Skipped on the first turn. Generation still uses the original question. See Retrieval.

  3. retrieve_context (Chroma)

    Mode-aware relevance score gate, then MMR (or hybrid) retrieval of diverse chunks.

  4. generate_answer (LLM)

    Mode-specific prompt → resilient LLM call (retries + circuit breaker).

  5. verify_answer

    Groundedness gate. Strict mode refuses an answer unsupported by the retrieved chunks. See Groundedness.

  6. self_ingest (Chroma)

    Learning modes only — capture a synthesized answer (embed now in learning, or queue for review in learning_review).

  7. summarize

    Update the rolling conversation summary for long-term context.

  8. store_memory (Redis)

    Persist the updated conversation back to Redis (TTL-based expiry).

The node order, expressed compactly:

load_memory → condense_query → retrieve_context → generate_answer
  → verify_answer → self_ingest → summarize → store_memory
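The ordering and the two conditional steps can be illustrated with a small plain-Python runner. This is a sketch under stated assumptions, not the LangGraph graph from graph/builder.py: the nodes are stand-ins that only record their names, the `history` and `mode` state keys are hypothetical, and skipping is done in the runner here, whereas the real nodes may decide this internally.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]

NODE_ORDER: List[str] = [
    "load_memory", "condense_query", "retrieve_context", "generate_answer",
    "verify_answer", "self_ingest", "summarize", "store_memory",
]
LEARNING_MODES = {"learning", "learning_review"}


def make_node(name: str) -> Node:
    """Build a stand-in node that only appends its name to state['trace']."""
    def node(state: State) -> State:
        return {**state, "trace": [*state.get("trace", []), name]}
    return node


NODES: Dict[str, Node] = {name: make_node(name) for name in NODE_ORDER}


def should_run(name: str, state: State) -> bool:
    """Apply the two conditions described in the step list."""
    if name == "condense_query":
        # Skipped on the first turn, i.e. when there is no prior history.
        return bool(state.get("history"))
    if name == "self_ingest":
        # Learning modes only.
        return state.get("mode") in LEARNING_MODES
    return True


def run_pipeline(state: State) -> State:
    """Run the nodes in their fixed order, skipping those that do not apply."""
    for name in NODE_ORDER:
        if should_run(name, state):
            state = NODES[name](state)
    return state
```

For example, `run_pipeline({"mode": "strict", "history": []})["trace"]` lists six nodes (no `condense_query`, no `self_ingest`), while a follow-up turn in `learning` mode runs all eight.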

The README architecture diagram

┌─────────────────────────────────────────────────────┐
│                   User Query                        │
└────────────────────────┬────────────────────────────┘
                         │
                         ▼
              FastAPI  POST /api/chat
                         │
                         ▼
          ┌──────────────────────────────┐
          │     LangGraph Orchestrator   │
          │                              │
          │  1. load_memory   (Redis)    │
          │  2. condense_query  (LLM)    │ ← context-aware rewrite (multi-turn)
          │  3. retrieve_context (Chroma)│ ← mode-aware gate + mmr/hybrid
          │  4. generate_answer  (LLM)   │ ← mode-specific prompt
          │  5. verify_answer            │ ← groundedness gate (strict refuses)
          │  6. self_ingest  (Chroma)    │ ← learning mode only
          │  7. summarize                │
          │  8. store_memory  (Redis)    │
          └──────────────────────────────┘
                         │
                         ▼
                  Response to User

Request lifecycles

  • Sync chat (POST /api/v1/chat) — runs the full pipeline and returns a typed { answer, sources[], meta } envelope in a single response.
  • Streaming (POST /api/v1/chat/stream) — runs the same pipeline but streams the answer as Server-Sent Events in the order token → sources → done, with an error event on failure. In strict mode, a groundedness refusal applies to the stored answer and the done meta; tokens already streamed cannot be retracted.
  • Async ingest (POST /api/v1/ingest or POST /api/v1/ingest/upload) — returns 202 Accepted with status=queued and a Location header, then processes the document in the background. Poll GET /api/v1/ingest/status/{doc_id} for progress. See Deployment for inline vs queue modes.
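The streaming event order can be sketched as a generator that emits Server-Sent Events frames. Only the event names (token, sources, done, error) come from the description above; the payload fields (`text`, `meta.refused`, `message`) and the `stream_chat` helper are assumptions for illustration, not the service's actual schema.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def sse_frame(event: str, data: Any) -> str:
    """Format one SSE frame: an event line, a single JSON data line, a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_chat(
    tokens: Iterable[str],
    sources: List[Dict[str, str]],
    refused: bool = False,
) -> Iterator[str]:
    """Yield frames in the order token -> sources -> done, or error on failure."""
    try:
        for tok in tokens:
            # Once yielded, a token frame has left the server and cannot be recalled.
            yield sse_frame("token", {"text": tok})
        yield sse_frame("sources", sources)
        # A strict-mode refusal can only be reported here, after the tokens.
        yield sse_frame("done", {"meta": {"refused": refused}})
    except Exception as exc:
        yield sse_frame("error", {"message": str(exc)})
```

Consuming `stream_chat(["Hel", "lo"], [{"id": "doc-1"}], refused=True)` produces two token frames, then sources, then done. The refusal flag arrives only in the final frame, which is why a client that wants to honor strict mode has to check the done meta rather than rely on the streamed text.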

Verify your result

  • You can name all 8 pipeline nodes in order.
  • You know which lifecycle (sync / streaming / async ingest) a given endpoint uses.
  • You know that Redis backs memory + rate limiting + queues, and ChromaDB backs retrieval.

Common pitfalls

  • Assuming streaming can retract a refusal — tokens that have already been streamed cannot be recalled; only the stored answer and the done meta reflect a strict-mode refusal.
  • Treating synthesized_answers as the main collection — it is consulted only in learning modes and never pollutes strict/open retrieval.
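The second pitfall amounts to a rule about which collections a query may touch in each mode. A minimal sketch of that rule follows, assuming the four mode names mentioned on this page (strict, open, learning, learning_review) and a hypothetical main collection name `documents`; only `synthesized_answers` is a name taken from the text.

```python
from typing import List

MAIN_COLLECTION = "documents"  # hypothetical name for the ingested-chunk collection
SYNTHESIZED_COLLECTION = "synthesized_answers"
LEARNING_MODES = frozenset({"learning", "learning_review"})


def collections_for_mode(mode: str) -> List[str]:
    """Return the collections retrieval may consult for the given mode."""
    names = [MAIN_COLLECTION]
    if mode in LEARNING_MODES:
        # Synthesized answers are consulted only in learning modes.
        names.append(SYNTHESIZED_COLLECTION)
    return names
```

With this rule, `collections_for_mode("strict")` and `collections_for_mode("open")` return only the main collection, so synthesized answers never reach strict or open retrieval.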