Observability

Trace a single request end-to-end and know when the service is healthy.

Audience: operators monitoring the backend.

What you will accomplish: wire logs to an aggregator and use the health and readiness probes correctly.

Correlation IDs

Every request is tagged with a correlation ID, carried in the X-Correlation-Id header. The ID propagates through all log calls, graph nodes, and service layers, so one request can be traced across Redis operations, ChromaDB retrievals, LLM calls, and ingest steps.

  • The correlation-ID middleware preserves an incoming X-Correlation-Id header, or injects one when the request has none, and stores the value in a contextvars.ContextVar for async-safe propagation.
  • The value also appears as meta.correlation_id in chat responses. Quote it in support requests.
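
The propagation described above can be illustrated with a minimal sketch that uses only the standard library. The names correlation_id_var, resolve_correlation_id, CorrelationIdFilter, and handle_request are hypothetical, since this guide does not show the real middleware code, and the framework wiring is omitted. It shows why a ContextVar is async-safe: two concurrent requests each see only their own ID.

```python
import asyncio
import contextvars
import logging
import uuid

# Hypothetical name; the service's actual variable is not shown in this guide.
correlation_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "correlation_id", default="-"
)


def resolve_correlation_id(headers: dict[str, str]) -> str:
    """Preserve an incoming X-Correlation-Id, or generate a new one."""
    for name, value in headers.items():
        if name.lower() == "x-correlation-id" and value:
            return value
    return uuid.uuid4().hex


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id_var.get()
        return True


async def handle_request(headers: dict[str, str]) -> str:
    """Stand-in for one request: set the ID, then read it back after awaiting."""
    token = correlation_id_var.set(resolve_correlation_id(headers))
    try:
        await asyncio.sleep(0)  # yield to other tasks, as real I/O would
        return correlation_id_var.get()
    finally:
        correlation_id_var.reset(token)


async def main() -> list[str]:
    # Each coroutine runs as its own task with its own copy of the context.
    return await asyncio.gather(
        handle_request({"X-Correlation-Id": "req-a"}),
        handle_request({"X-Correlation-Id": "req-b"}),
    )
```

Running asyncio.run(main()) returns the two IDs in request order, and the variable falls back to its default outside any request.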

Request timing

The request-timing middleware logs method, path, status code, duration (ms), and the correlation ID for every request.
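
As a rough illustration of what such a middleware does, here is a sketch written against the plain ASGI interface. It is not the service's actual code: the class name, logger name, and log line layout are assumptions, and for simplicity it reads the correlation ID from the request header rather than from the ContextVar.

```python
import logging
import time

logger = logging.getLogger("request_timing")  # hypothetical logger name


class RequestTimingMiddleware:
    """Pure-ASGI sketch: time each HTTP request and log one summary line."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass through lifespan and websocket events untouched.
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()
        status = {"code": 500}  # assumed when the app fails before responding

        async def send_wrapper(message):
            # The status code is only visible on the response-start message.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            headers = dict(scope.get("headers", []))
            correlation_id = headers.get(b"x-correlation-id", b"-").decode()
            logger.info(
                "%s %s %d %.1fms correlation_id=%s",
                scope["method"],
                scope["path"],
                status["code"],
                duration_ms,
                correlation_id,
            )
```

Wrapping the send callable is what lets the middleware see the status code without buffering the response body.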

Structured logging

Logs are plain text by default. Set LOG_FORMAT=json for ingestion by Datadog, CloudWatch, or ELK. Each log line includes timestamp, level, correlation_id, and message fields.

LOG_FORMAT=json        # "text" (default) or "json" for log aggregators
LOG_LEVEL=INFO

Logs go to the console and a rotating file: logs/app.log is capped at 10 MB with 5 rotated backups.
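
For readers who want to reproduce a similar setup elsewhere, the following sketch shows one way to get the same behavior with the standard logging module. JsonFormatter and build_handlers are hypothetical names, the text format string is a placeholder, and treating 10 MB as 10 * 1024 * 1024 bytes is an assumption; the service's real configuration code is not shown in this guide.

```python
import json
import logging
import os
from datetime import datetime, timezone
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            # Falls back to "-" when no filter has attached an ID to the record.
            "correlation_id": getattr(record, "correlation_id", "-"),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_handlers(log_dir: str = "logs") -> list[logging.Handler]:
    """Console plus rotating file, formatted according to LOG_FORMAT."""
    if os.getenv("LOG_FORMAT", "text").lower() == "json":
        formatter: logging.Formatter = JsonFormatter()
    else:
        # Placeholder text layout; the real one is not documented here.
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    os.makedirs(log_dir, exist_ok=True)
    file_handler = RotatingFileHandler(
        os.path.join(log_dir, "app.log"),
        maxBytes=10 * 1024 * 1024,  # 10 MB cap (assumed binary megabytes)
        backupCount=5,  # keep 5 rotated backups
    )
    handlers: list[logging.Handler] = [logging.StreamHandler(), file_handler]
    for handler in handlers:
        handler.setFormatter(formatter)
    return handlers
```

One JSON object per line is the shape most aggregators can parse without a custom pipeline, which is why the formatter emits a flat dictionary.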

Health vs readiness probes

Two distinct probes:

  • GET /health: cached startup flags. Returns ok or degraded based on Redis and ChromaDB connectivity as recorded at startup, not at request time.
  • GET /ready: live probe. Returns 200 with {"status": "ready"} only if both Redis and ChromaDB respond at request time, or 503 with dependency-specific error detail if either is down. Use this for Kubernetes readiness probes and load-balancer health checks.

Check both probes with curl:

curl http://127.0.0.1:8000/health
# → ok  (or degraded if Redis/ChromaDB unreachable at startup)

curl -i http://127.0.0.1:8000/ready
# → 200  {"status": "ready"}        when both deps respond
# → 503  {... dependency detail ...} when one is down

If it fails: "Connection refused" means the server isn't running or isn't listening on that address. A 503 from /ready names the failing dependency in its error detail.
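
The same check can be scripted, for example in a deploy step that waits for the service. This is a sketch using only the standard library; interpret_ready and check_ready are hypothetical helper names, and the 503 body is passed through as-is because its exact shape is not documented here.

```python
import json
import urllib.error
import urllib.request


def interpret_ready(status: int, body: str) -> tuple[bool, str]:
    """Map a /ready response to (is_ready, detail)."""
    if status == 200:
        try:
            payload = json.loads(body)
        except ValueError:
            return False, f"unexpected body: {body}"
        ready = isinstance(payload, dict) and payload.get("status") == "ready"
        return ready, body
    # For a 503 the body carries the dependency-specific detail.
    return False, f"HTTP {status}: {body}"


def check_ready(
    base_url: str = "http://127.0.0.1:8000", timeout: float = 2.0
) -> tuple[bool, str]:
    """Call GET /ready and report whether the service can take traffic."""
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=timeout) as resp:
            return interpret_ready(resp.status, resp.read().decode(errors="replace"))
    except urllib.error.HTTPError as exc:
        # urllib raises on non-2xx responses; the body is still readable.
        return interpret_ready(exc.code, exc.read().decode(errors="replace"))
    except OSError as exc:
        # Connection refused, DNS failures, and timeouts: server not reachable.
        return False, f"unreachable: {exc}"
```

HTTPError is caught before OSError because it is a subclass of it; reversing the order would hide the 503 detail behind the generic "unreachable" message.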

Verify your result

  • Verify: Your client logs X-Correlation-Id from every response.
  • Verify: Log aggregation is enabled with LOG_FORMAT=json.
  • Verify: Kubernetes readiness uses /ready; a quick liveness check can use /health.

Common failure modes