RAG Evaluation Checklist

Use labeled questions

Build an evaluation file with questions, expected source documents, acceptable answers, and examples where no answer should be returned. Include recent documents, old documents, duplicate documents, and documents with conflicting claims.
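One possible shape for such a file is JSON Lines, one labeled case per line. The sketch below is illustrative only: the field names (`question`, `expected_sources`, `acceptable_answers`, `should_abstain`) and the sample file names are assumptions, not a standard format.

```python
import json
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One labeled question. All field names here are hypothetical."""
    question: str
    expected_sources: list = field(default_factory=list)
    acceptable_answers: list = field(default_factory=list)
    should_abstain: bool = False  # True when no answer should be returned


def load_cases(lines):
    """Parse JSONL lines into EvalCase objects and reject inconsistent labels."""
    cases = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        case = EvalCase(**json.loads(line))
        if case.should_abstain and case.expected_sources:
            raise ValueError(f"line {number}: abstain case lists expected sources")
        if not case.should_abstain and not case.expected_sources:
            raise ValueError(f"line {number}: answerable case has no expected sources")
        cases.append(case)
    return cases


# Hypothetical sample data: one answerable case and one no-answer case.
SAMPLE_LINES = [
    '{"question": "How long is the refund window?", '
    '"expected_sources": ["refund-policy.md"], "acceptable_answers": ["30 days"]}',
    '{"question": "What will the product cost in five years?", "should_abstain": true}',
]
```

Validating labels at load time keeps a mislabeled case from silently skewing later metrics.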

Measure recall at k before judging answer quality. If the right passage never reaches the model, prompt engineering will only hide the real problem.
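As a minimal sketch, recall at k can be computed per question as the fraction of expected sources that appear in the top k retrieved results, then averaged. The function names and the assumption that documents are identified by string IDs are illustrative.

```python
def recall_at_k(retrieved_ids, expected_ids, k):
    """Fraction of expected documents found in the top-k retrieved IDs."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("recall is undefined when no sources are expected")
    top_k = set(retrieved_ids[:k])
    return len(expected & top_k) / len(expected)


def mean_recall_at_k(runs, k):
    """Average recall at k over (retrieved_ids, expected_ids) pairs."""
    if not runs:
        raise ValueError("no runs to score")
    scores = [recall_at_k(retrieved, expected, k) for retrieved, expected in runs]
    return sum(scores) / len(scores)
```

Cases where no answer should be returned have no expected sources, so they are best scored separately rather than passed to this function.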

Inspect chunks, not only answers

Chunking decisions are product decisions. A support bot may need short policy sections; a legal review tool may need larger context blocks with section headers and document metadata.

Review failed queries by looking at retrieved chunks. Common failures include missing titles, broken tables, boilerplate crowding out the useful passage, and embeddings that treat two product versions as interchangeable.
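Manual review can be triaged with a few cheap heuristics first. The sketch below flags some of the failures listed above; the chunk layout (`id`, `title`, `text`), the boilerplate phrases, and the length threshold are all assumptions to adapt to your own corpus, and it does not attempt to detect broken tables or version confusion.

```python
# Hypothetical markers; replace with phrases that recur in your own documents.
BOILERPLATE_MARKERS = ("all rights reserved", "subscribe to our newsletter")


def flag_chunk(chunk, min_chars=40):
    """Return a list of likely problems for one retrieved chunk."""
    issues = []
    text = chunk.get("text", "")
    if not chunk.get("title"):
        issues.append("missing title")
    if len(text.strip()) < min_chars:
        issues.append("too short")
    lowered = text.lower()
    if any(marker in lowered for marker in BOILERPLATE_MARKERS):
        issues.append("boilerplate")
    return issues


def review_failures(failed_queries):
    """Build (query, rank, chunk_id, issues) rows for chunks with problems.

    failed_queries is a list of (query, retrieved_chunks) pairs.
    """
    rows = []
    for query, chunks in failed_queries:
        for rank, chunk in enumerate(chunks, start=1):
            issues = flag_chunk(chunk)
            if issues:
                rows.append((query, rank, chunk.get("id"), issues))
    return rows
```

The output is only a starting point for reading the chunks themselves; a chunk with no flags can still be the wrong passage.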

Require citations and uncertainty

A RAG answer without citations is hard to audit. Ask the system to cite the exact documents it used, report a confidence level, and say so when the retrieved context does not support an answer.
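These requirements can be enforced mechanically if the system returns a structured answer. The following is a sketch under assumed conventions: the answer is a dict with hypothetical keys `citations`, `confidence` (a number in [0, 1]), and `abstained`.

```python
def validate_answer(answer, retrieved_ids):
    """Return a list of problems with a structured answer; empty means it passes."""
    problems = []
    if answer.get("abstained"):
        return problems  # an explicit "no supported answer" needs no citations
    citations = answer.get("citations", [])
    if not citations:
        problems.append("no citations")
    unknown = [doc_id for doc_id in citations if doc_id not in retrieved_ids]
    if unknown:
        problems.append(f"cites documents that were not retrieved: {unknown}")
    confidence = answer.get("confidence")
    if confidence is None or not 0.0 <= confidence <= 1.0:
        problems.append("confidence missing or outside [0, 1]")
    return problems
```

Checking citations against the retrieved set catches a common failure where the model names a plausible document that was never in its context.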

Use a separate check for groundedness: every material claim should map back to retrieved evidence. This matters most where a wrong answer is costly, such as customer support, healthcare, finance, and enterprise knowledge bases.
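A crude way to approximate a groundedness check is lexical overlap: split the answer into sentences and flag any sentence whose content words mostly do not appear in a single retrieved chunk. This is a sketch, not a recommended production method; the stopword list and the 0.6 threshold are arbitrary assumptions, and overlap cannot detect paraphrase or negation.

```python
import re

# Small illustrative stopword list; a real check would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "or", "for", "on", "with"}


def content_tokens(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def unsupported_claims(answer, chunks, threshold=0.6):
    """Return answer sentences whose best overlap with any chunk is below threshold."""
    chunk_tokens = [content_tokens(chunk) for chunk in chunks]
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        claim = content_tokens(sentence)
        if not claim:
            continue  # nothing checkable in this sentence
        best = max((len(claim & ct) / len(claim) for ct in chunk_tokens), default=0.0)
        if best < threshold:
            flagged.append(sentence)
    return flagged
```

Flagged sentences are candidates for review, not proof of hallucination; stronger checks typically replace the overlap score with an entailment model or a second model acting as judge.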

Plan for drift

Documents change, product names change, and teams forget to remove stale content. Track index freshness, failed retrievals, low-confidence answers, and user corrections.
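Index freshness can be tracked by comparing when each source document last changed with when it was last indexed. The sketch below assumes each record carries hypothetical `id`, `source_modified_at`, and `indexed_at` fields holding timezone-aware datetimes, and that 30 days is an acceptable maximum age; both are assumptions.

```python
from datetime import datetime, timedelta, timezone


def stale_documents(docs, now, max_age=timedelta(days=30)):
    """Return (doc_id, reason) pairs for index entries that need re-indexing."""
    stale = []
    for doc in docs:
        if doc["source_modified_at"] > doc["indexed_at"]:
            stale.append((doc["id"], "source changed after indexing"))
        elif now - doc["indexed_at"] > max_age:
            stale.append((doc["id"], "index entry older than max_age"))
    return stale


# Example call; pass the current time explicitly so results are reproducible:
# stale_documents(docs, now=datetime.now(timezone.utc))
```

Passing `now` as an argument rather than reading the clock inside the function makes the check easy to test and to replay against historical snapshots.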

A mature RAG system has a data owner, an indexing schedule, access-control checks, and a way to remove revoked or sensitive content quickly.
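One way to make access control and fast removal testable is to filter retrieved chunks at query time, before they reach the model. This is a minimal sketch: the chunk fields `doc_id` and `allowed_groups`, and the idea of a revoked-ID set maintained by the data owner, are assumptions rather than features of any particular vector store.

```python
def filter_chunks(chunks, user_groups, revoked_doc_ids):
    """Keep only chunks the user may see and whose document is not revoked."""
    groups = set(user_groups)
    allowed = []
    for chunk in chunks:
        if chunk["doc_id"] in revoked_doc_ids:
            continue  # revoked content is dropped even before the index is rebuilt
        if not groups & set(chunk["allowed_groups"]):
            continue  # user shares no group with this document
        allowed.append(chunk)
    return allowed
```

A query-time revocation list gives an immediate removal path while the slower re-indexing job catches up.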

Practical checklist

1. Create labeled questions with expected sources.
2. Measure retrieval before answer quality.
3. Inspect failed chunks manually.
4. Require citations for material claims.
5. Monitor stale documents and access control.
