Jul 16, 2026 · AI
RAG in practice: when to use it, when not, and how to measure quality
RAG is the default answer to "I want a chatbot that knows my documents." And often it's the right one. But it's also where most projects get stuck: it's built by guesswork, never measured, and nobody knows why it sometimes answers badly. This is the practical, no-hype version of when to use RAG, how to build it and how to know if it works. It's the natural follow-up to LLMs in production.
1. What RAG is (and isn't)
RAG = retrieve first, generate second. For each question, you search your documents for the relevant chunks and pass them to the model as context, so it answers with your information instead of what it "remembers." It's not magic and it's not retraining the model: it's a search engine in front of a generator. If retrieval fails, the answer fails, no matter how good the model is.
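To make "a search engine in front of a generator" concrete, here is a minimal sketch of the two steps. Everything in it is an assumption for illustration: the sample chunks, the function names, and the word-count similarity, which is a toy stand-in for real embeddings. The final call to a model is left out; the sketch stops at the prompt it would receive.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: dict[str, str], k: int = 2) -> list[str]:
    """Step 1: return the ids of the k chunks most similar to the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(text))), cid) for cid, text in chunks.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    # Chunks with no overlap at all are dropped rather than passed as noise.
    return [cid for score, cid in scored[:k] if score > 0]


def build_prompt(question: str, chunks: dict[str, str], ids: list[str]) -> str:
    """Step 2: the generator only sees the retrieved chunks, tagged with their source."""
    context = "\n".join(f"[{cid}] {chunks[cid]}" for cid in ids)
    return (
        "Answer using only the context below and cite the source id. "
        "If the context does not contain the answer, say \"I don't know\".\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical knowledge base: chunk id -> chunk text.
CHUNKS = {
    "refunds.md": "Refunds are issued within 14 days of the return request.",
    "shipping.md": "Standard shipping takes 3 to 5 business days.",
    "warranty.md": "The warranty covers hardware defects for two years.",
}

question = "How many days do refunds take?"
top_ids = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, CHUNKS, top_ids)
```

If `retrieve` returns the wrong chunk, the prompt contains the wrong context, and no model can recover from that. That is why retrieval gets measured on its own.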
2. When YES and when NO
When YES:
- Large or changing knowledge base (docs, tickets, manuals).
- Answers that must cite the source.
- Information that updates and you don't want to retrain anything.
- Support, internal search, assistants over documentation.
When NO:
- The knowledge fits entirely in the prompt (few stable docs).
- The task needs no external data (classify, translate, rewrite).
- You need an exact value: a database query is better.
- Precise keyword search: sometimes full-text search is enough.
Short rule: RAG adds infrastructure (indexing, embeddings, retrieval). If you don't need it, it's needless complexity.
3. How it's built, piece by piece
- Chunking: split documents into meaningful chunks (by section, not a blind fixed character count). Size and overlap matter more than they seem.
- Embeddings: turn each chunk into a vector. Pick an embedding model suited to your language and domain; not all perform equally outside English.
- Vector store: pgvector on Postgres for most cases; dedicated stores (Qdrant, Pinecone) only with millions of vectors or complex filters.
- Retrieval: fetch the k nearest chunks. Often hybrid is best: combine vector search with keyword (BM25) so you don't miss literal matches.
- Reranking: a reranker rescores the retrieved candidates before they reach the model. It lifts quality a lot when the candidate set is noisy.
- Generation: the model answers only from the given chunks, citing the source and with explicit permission to say "I don't know."
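As an illustration of the chunking step, here is a minimal sketch that splits by section and packs whole paragraphs, instead of cutting at a fixed character count. It assumes markdown-style `#` headings and blank-line paragraph breaks. The function names, the `max_chars` and `overlap` parameters, and the sample document are hypothetical, and the right values depend on your documents and embedding model.

```python
import re


def split_sections(doc: str) -> list[tuple[str, str]]:
    """Split a markdown-style document into (heading, body) pairs."""
    sections: list[tuple[str, str]] = []
    heading, lines = "", []
    for line in doc.splitlines():
        if re.match(r"#{1,6} ", line):
            if heading or any(l.strip() for l in lines):
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    if heading or any(l.strip() for l in lines):
        sections.append((heading, "\n".join(lines).strip()))
    return sections


def chunk_section(heading: str, body: str, max_chars: int = 400, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks; repeat the last `overlap` paragraphs in the next chunk."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = current + [para]
        if current and sum(len(p) for p in candidate) > max_chars:
            chunks.append(current)
            # Carry the tail of the previous chunk forward as overlap.
            current = current[-overlap:] + [para] if overlap else [para]
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Prefix each chunk with its heading so it stays meaningful on its own.
    return [f"{heading}\n" + "\n\n".join(c) for c in chunks]


DOC = """# Refunds
Refunds are issued within 14 days.

Contact support to start a return.

Gift cards cannot be refunded.

# Shipping
Standard shipping takes 3 to 5 business days.
"""

chunks = [c for h, b in split_sections(DOC) for c in chunk_section(h, b, max_chars=70)]
```

With these inputs the Refunds section becomes two chunks that share the middle paragraph, and Shipping stays a single chunk. No paragraph is cut in half, and every chunk carries its section heading.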
4. How to measure quality (what almost nobody does)
Without measurement, RAG is faith. Always separate two layers:
- Retrieval quality: are the right chunks among those retrieved? Metrics: recall@k (does the good source appear in the top-k?) and precision@k (how much noise?).
- Generation quality: is the answer faithful to those chunks (no making things up) and does it actually answer the question? Metrics: faithfulness and relevance.
To measure you need an evaluation set: 30-100 real questions with their expected answer/source. With it you run the evaluation before and after each change (a different embedding model, a different chunk size, adding reranking) and you know whether you improved or regressed. Without that set, every tweak is blind.
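The retrieval half of that evaluation fits in a few lines. The sketch below computes recall@k and precision@k over a small evaluation set and compares two retriever versions. The questions, chunk ids, and the two hard-coded "retrievers" are hypothetical stand-ins; in practice each retriever would be a call into your own pipeline. Faithfulness and relevance are not covered here, since they usually need a human reviewer or a model acting as judge.

```python
from statistics import mean


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the expected sources that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the k result slots filled by expected sources (the rest is noise)."""
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k


def evaluate(retriever, eval_set: list[dict], k: int = 3) -> dict[str, float]:
    """Average both metrics over every question in the evaluation set."""
    recalls, precisions = [], []
    for case in eval_set:
        retrieved = retriever(case["question"])
        recalls.append(recall_at_k(retrieved, case["sources"], k))
        precisions.append(precision_at_k(retrieved, case["sources"], k))
    return {"recall@k": mean(recalls), "precision@k": mean(precisions)}


# Hypothetical evaluation set: real questions plus the source that should be found.
Q1 = "How long do refunds take?"
Q2 = "What does the warranty cover?"
EVAL_SET = [
    {"question": Q1, "sources": {"refunds.md"}},
    {"question": Q2, "sources": {"warranty.md"}},
]

# Stand-ins for two pipeline versions, e.g. before and after adding reranking.
BEFORE = {Q1: ["shipping.md", "faq.md", "refunds.md"], Q2: ["warranty.md", "faq.md", "pricing.md"]}
AFTER = {Q1: ["refunds.md", "faq.md", "shipping.md"], Q2: ["warranty.md", "refunds.md", "faq.md"]}

before = evaluate(BEFORE.__getitem__, EVAL_SET, k=2)
after = evaluate(AFTER.__getitem__, EVAL_SET, k=2)
```

Here `before` comes out at recall@2 = 0.5 and precision@2 = 0.25, and `after` at 1.0 and 0.5. Running the same comparison around every change is what turns a tweak into a measured improvement or a caught regression.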
5. Common mistakes we see
- Blaming the model when retrieval fails. In the projects we see, roughly 80% of "hallucinations" in RAG trace back to badly retrieved chunks, not the LLM.
- Blind chunking. Cutting by fixed length splits ideas in half and wrecks retrieval.
- Vector search only. You miss literal matches (codes, exact names); hybrid fixes it.
- Not reindexing when changing embeddings. Mixing vectors from different models gives incoherent results.
- Zero evaluation. Without metrics, the system silently degrades when the data changes.
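For the "vector search only" mistake, one common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one option among several. The chunk ids and both rankings are hypothetical, and `c = 60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Merge several ranked lists: each list contributes 1 / (c + rank) per chunk."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (c + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for the query "What does error E4012 mean?"
vector_hits = ["billing.md", "pricing.md", "plans.md"]  # semantically close, misses the code
keyword_hits = ["error-E4012.md", "billing.md"]         # literal match on "E4012"

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

The chunk with the literal error code never appears in the vector results, but after fusion it ranks second, behind the one chunk both searches agree on.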
FAQ
What is RAG in one sentence?
Giving the model, for each question, the relevant chunks of your documents so it answers with your information. Retrieve first, generate second.
When is it NOT worth it?
When knowledge fits in the prompt, the task needs no external data, or you need an exact database value. RAG adds infrastructure.
How do you measure it?
Separately: retrieval (recall@k, precision@k) and generation (faithfulness, relevance), with a set of real questions and their expected source.
Do you need a dedicated vector DB?
Not always. pgvector on Postgres is enough for small/medium volumes; dedicated ones make sense with millions of vectors or complex filters.
Want an assistant that truly knows your documents?
We design RAG systems with hybrid retrieval, evaluation and cost control. Fixed price by milestones.
Related resources
- LLMs in production — cost, latency and evaluation of LLMs.
- Automate support without breaking the CRM — where a RAG assistant fits.
- The stack we use in 2026 — Postgres + pgvector as the base.
Published: July 16, 2026 · Written by the RoviDev studio.