Retrieval-augmented generation is easy to demo and hard to operate. Here is what production-grade RAG actually requires.
Retrieval-augmented generation makes for a compelling demo. Wire an LLM to a vector store and answers appear. Operating that system safely in a regulated enterprise is a different discipline entirely.
Production RAG needs evaluation harnesses, retrieval-quality monitoring, citation enforcement, and guardrails against prompt injection. Without them, you are shipping a confident, unaccountable system into a high-stakes context.
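As one illustration of what an evaluation harness for retrieval quality might measure, here is a minimal sketch that scores a retriever against a small hand-labeled set using recall@k and mean reciprocal rank. The names `EvalCase`, `recall_at_k`, `reciprocal_rank`, and `evaluate` are hypothetical, as is the assumption that the retriever can be wrapped as a function returning ranked chunk ids; a real harness would also track answer quality, not only retrieval.

```python
from typing import Callable, Sequence

# Hypothetical labeled case: a query plus the chunk ids a reviewer judged relevant.
EvalCase = tuple[str, set[str]]


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    search: Callable[[str], Sequence[str]],
    cases: Sequence[EvalCase],
    k: int = 8,
) -> dict[str, float]:
    """Average recall@k and reciprocal rank over all labeled cases."""
    if not cases:
        raise ValueError("need at least one labeled case")
    recalls, ranks = [], []
    for query, relevant in cases:
        retrieved = search(query)  # assumed: ranked list of chunk ids
        recalls.append(recall_at_k(retrieved, relevant, k))
        ranks.append(reciprocal_rank(retrieved, relevant))
    return {
        "recall_at_k": sum(recalls) / len(cases),
        "mrr": sum(ranks) / len(cases),
    }
```

Running a harness like this on every index rebuild or embedding-model change, and alerting when either number drops, is one way to turn "retrieval-quality monitoring" into a concrete regression check.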
    # Production RAG: retrieve, ground, and *cite*; never answer from
    # outside the retrieved, access-checked context.
    chunks = retriever.search(query, k=8, filters={"acl": user.groups})
    context = "\n\n".join(f"[{c.id}] {c.text}" for c in chunks)

    answer = llm.complete(
        system="Answer only from CONTEXT. Cite sources as [id]. "
               "If unsupported, say you don't know.",
        prompt=f"CONTEXT:\n{context}\n\nQUESTION: {query}",
        temperature=0,
    )
    assert_citations_resolve(answer, chunks)  # block ungrounded claims

The teams succeeding treat RAG as an engineering system with the same observability and governance they would demand of any other production service handling sensitive data.
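The snippet above calls `assert_citations_resolve` without defining it. One possible shape for that check is sketched below, assuming citations appear in the answer as `[id]` and that each chunk exposes an `id` attribute. The `Chunk` dataclass, the `UngroundedAnswerError` exception, and the regular expression are illustrative assumptions, not part of any particular library.

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Chunk:
    """Hypothetical stand-in for whatever the retriever returns."""
    id: str
    text: str


class UngroundedAnswerError(ValueError):
    """Raised when an answer cites nothing or cites an unknown source."""


# Assumed citation format: a bracketed id such as [doc-17].
CITATION = re.compile(r"\[([^\[\]]+)\]")


def assert_citations_resolve(answer: str, chunks: Iterable[Chunk]) -> None:
    """Fail unless every [id] in the answer matches a retrieved chunk."""
    known = {str(c.id) for c in chunks}
    cited = set(CITATION.findall(answer))
    if not cited:
        # Assumption: an uncited answer is treated as ungrounded.
        raise UngroundedAnswerError("answer contains no citations")
    unknown = cited - known
    if unknown:
        raise UngroundedAnswerError(
            f"citations not found in retrieved context: {sorted(unknown)}"
        )
```

Two caveats apply to this sketch. It only confirms that cited ids exist among the retrieved chunks, not that the cited text actually supports the claim, which would need a separate entailment or human-review step. It also rejects an uncited "I don't know" reply, so a caller following the system prompt above would need to recognize that refusal before running the check.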