Evaluation
How answer and retrieval quality are measured. These tools are for maintainers — they are not part of the request path.
Audience: project maintainers and contributors.
What you will accomplish: know which quality tools exist and when each runs.
RAGAS offline harness
eval/run_ragas.py is an offline harness that scores answers on RAGAS metrics:
faithfulness, answer relevancy, context precision, and context recall,
against a golden set (eval/golden.jsonl).
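As a rough illustration of what a harness like this has to do before scoring, the sketch below loads a JSONL golden set and checks that every record has the fields the metrics need. The field names `question` and `ground_truth` and the helper `load_golden` are assumptions made for this example; the actual schema of eval/golden.jsonl and the internals of eval/run_ragas.py may differ.

```python
import json
from pathlib import Path

# Hypothetical field names; the real schema of eval/golden.jsonl may differ.
REQUIRED_FIELDS = ("question", "ground_truth")


def load_golden(path):
    """Read one JSON object per line, skipping blank lines.

    Raises ValueError if a record lacks a required field, so a malformed
    golden set fails early instead of producing misleading scores.
    """
    records = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        records.append(record)
    return records
```

In a setup like this, the golden set would supply the question and the reference answer, while the generated answer and the retrieved contexts would come from running the system under evaluation.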
Hermetic retrieval-regression test
tests/test_retrieval_regression.py (marked @pytest.mark.retrieval) seeds a tiny labeled
corpus using the real FastEmbed model and asserts that expected documents are retrieved.
It is hermetic (no live services, and no network access once the embedding model is cached), so it runs deterministically.
# Run only the retrieval-regression test
pytest -m retrieval
If it fails: the first run downloads the FastEmbed model, so ensure outbound network access for that initial fetch. After that, the test runs offline.
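The shape of such a retrieval-regression test can be sketched as follows. This is not the project's test: the corpus, the queries, and the helpers `embed`, `cosine`, and `retrieve` are hypothetical, and a bag-of-words vector stands in for the FastEmbed model so the example runs with the standard library alone.

```python
import math
from collections import Counter

# Tiny labeled corpus: document id -> text. Contents are illustrative only.
CORPUS = {
    "doc-billing": "how to update a billing address and payment method",
    "doc-auth": "reset a forgotten password and enable two factor login",
    "doc-export": "export project data as csv or json files",
}

# Labeled queries: query text -> id of the document that must be retrieved.
EXPECTED = {
    "change my payment method": "doc-billing",
    "forgotten password reset": "doc-auth",
    "export data to csv": "doc-export",
}


def embed(text):
    """Stand-in embedding: a bag-of-words count vector (NOT FastEmbed)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=1):
    """Return the ids of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(
        CORPUS,
        key=lambda doc_id: cosine(q, embed(CORPUS[doc_id])),
        reverse=True,
    )
    return ranked[:k]


# In the real suite this function would carry the @pytest.mark.retrieval
# marker so that `pytest -m retrieval` selects it.
def test_expected_documents_are_retrieved():
    for query, doc_id in EXPECTED.items():
        assert doc_id in retrieve(query, k=1), query
```

Because the corpus and labels are fixed and nothing external is called, a failure points at a change in the embedding or ranking logic, not at flaky infrastructure.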
Opt-in eval CI
A separate non-PR workflow, .github/workflows/eval.yml (workflow_dispatch + a weekly
schedule), seeds a corpus (eval/seed_corpus.py), runs the RAGAS harness, and uploads the
results as artifacts. It is opt-in by design so the expensive, non-deterministic eval never
blocks ordinary PRs.
Verify your result
- Verify: You know RAGAS (eval/run_ragas.py) is offline and not CI-gated.
- Verify: You can run the hermetic retrieval check with pytest -m retrieval.
- Verify: You know eval.yml runs on demand / on a schedule, not on PRs.
Common failure modes
Related next steps
- Understand what’s being measured in Retrieval and Groundedness.
- Close the quality loop with user feedback in Rating answers.