Evaluation

How answer and retrieval quality are measured. These tools are for maintainers — they are not part of the request path.

Audience: project maintainers and contributors.
What you will accomplish: know which quality tools exist and when each runs.

RAGAS offline harness

eval/run_ragas.py is an offline harness that scores answers against a golden set (eval/golden.jsonl) on four RAGAS metrics: faithfulness, answer relevancy, context precision, and context recall.
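To make the shape of this harness concrete, here is a minimal sketch of the two bookkeeping steps around the scoring: loading a JSONL golden set and averaging per-sample scores into one number per metric. It does not call RAGAS and is not the contents of eval/run_ragas.py. The field names (question, ground_truth), the metric key spellings, and the helper names load_golden and summarize are assumptions made for illustration; check eval/golden.jsonl for the real schema.

```python
import json
from statistics import mean

# The four metrics named above. The exact key spelling is an assumption.
METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")

# Hypothetical required fields; the real golden.jsonl schema may differ.
REQUIRED_FIELDS = {"question", "ground_truth"}


def load_golden(lines):
    """Parse JSONL lines into records, skipping blanks and rejecting incomplete rows."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        records.append(record)
    return records


def summarize(per_sample_scores):
    """Average each metric over all samples (one dict of scores per sample)."""
    return {m: mean(sample[m] for sample in per_sample_scores) for m in METRICS}
```

In use, the lines would come from an open file handle, for example load_golden(open("eval/golden.jsonl", encoding="utf-8")), and the per-sample score dicts would come from whatever the scoring step returns.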

Hermetic retrieval-regression test

tests/test_retrieval_regression.py (marked @pytest.mark.retrieval) seeds a tiny labeled corpus using the real FastEmbed model and asserts that the expected documents are retrieved. It is hermetic: once the model is cached locally it needs no network or live services, so it runs deterministically.

Terminal

# Run only the retrieval-regression test
pytest -m retrieval

If it fails: The first run downloads the FastEmbed model, so it needs outbound network access once. After the model is cached, the test runs offline.
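The seed-and-assert pattern behind this test can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of tests/test_retrieval_regression.py: the embedder below is a bag-of-words stand-in so the sketch runs with the standard library alone (the real test uses the FastEmbed model), and the corpus, queries, and function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedder: bag-of-words term counts. NOT FastEmbed."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=1):
    """Return the ids of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc_id: cosine(q, embed(corpus[doc_id])), reverse=True)
    return ranked[:k]


# Tiny labeled corpus and queries (hypothetical data).
CORPUS = {
    "billing": "Invoices are issued monthly and payment is due in thirty days.",
    "auth": "Users sign in with a password and an optional second factor.",
    "backup": "Database backups run nightly and are kept for two weeks.",
}
LABELED_QUERIES = [
    ("when is payment due for invoices", "billing"),
    ("how do users sign in", "auth"),
    ("how long are database backups kept", "backup"),
]


# The real test carries @pytest.mark.retrieval; the marker is omitted here
# so the sketch has no third-party imports.
def test_expected_documents_are_retrieved():
    for query, expected_id in LABELED_QUERIES:
        assert expected_id in retrieve(query, CORPUS, k=1), query
```

Because the corpus and the expected ids are fixed in the test file, a failure points at a change in embedding or ranking behavior rather than at external data.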

Opt-in eval CI

A separate workflow that does not run on PRs, .github/workflows/eval.yml (triggered by workflow_dispatch and a weekly schedule), seeds a corpus (eval/seed_corpus.py), runs the RAGAS harness, and uploads the results as artifacts. It is opt-in by design so the expensive, non-deterministic eval never blocks ordinary PRs.

Verify your result

Verify: You know the RAGAS harness (eval/run_ragas.py) runs offline and does not gate PRs.
Verify: You can run the hermetic retrieval check with pytest -m retrieval.
Verify: You know eval.yml runs on demand / on a schedule, not on PRs.

Common failure modes

Understand what’s being measured in Retrieval and Groundedness.
Close the quality loop with user feedback in Rating answers.
