
Production RAG System with Evaluation Layer
A hybrid retrieval architecture that combines dense vector similarity search (pgvector with an HNSW index) and BM25 keyword ranking for high-precision document retrieval.
Timeline
1 Month
Role
Full Stack Engineer
Team
Solo
Status
Completed
Key Challenges
- Optimizing hybrid search with pgvector
- Streaming answers in real-time
- Evaluating LLM faithfulness
- Monorepo configuration
Key Learnings
- Vector database design with pgvector
- LLM-as-a-Judge evaluation techniques
- Turborepo workspace management
- Next.js App Router streaming APIs
Situation
As AI applications scale, hallucination and retrieval inaccuracy become critical bottlenecks. Relying purely on semantic search often misses exact keyword matches, and without an automated way to evaluate the generated answers, systems can silently degrade over time. The need was for a robust Retrieval-Augmented Generation (RAG) architecture that not only retrieved accurate context but also constantly monitored its own performance.
Task
The goal was to design and implement a complete RAG pipeline from scratch. It needed to:
- Ingest large documents and store them efficiently.
- Retrieve context using a hybrid approach (combining semantic and keyword search).
- Provide a streaming web interface for real-time interaction.
- Include an automated evaluation layer to score the faithfulness of the LLM's responses against a golden dataset.
Action
I engineered the solution as a Turborepo monorepo to cleanly separate the core RAG engine, utility scripts, and the web application.
1. Document Ingestion & Storage
- Built a data ingestion pipeline to chunk large documents into smaller pieces.
- Generated vector embeddings using OpenAI's text-embedding-3-small.
- Stored chunks and embeddings in Supabase using PostgreSQL + pgvector, managed by Drizzle ORM for type-safe database operations.
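The chunking step can be illustrated with a minimal sketch. The write-up does not state the chunk size or splitting strategy, so the fixed-size character window with overlap below is an assumption, and `chunkText`, `chunkSize`, and `overlap` are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical helper: splits text into overlapping fixed-size character windows.
// The default sizes are illustrative, not the project's actual settings.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in [0, chunkSize)");
  }
  const chunks: string[] = [];
  // Each window starts `step` characters after the previous one,
  // so consecutive chunks share `overlap` characters.
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some text twice. A production pipeline might split on sentence or token boundaries instead of raw characters.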
2. Hybrid Retrieval Engine
- Implemented hybrid search that combines dense semantic search (pgvector cosine similarity over an HNSW index) with sparse keyword matching (BM25, a TF-IDF-style ranking function), so that both paraphrased queries and exact-term queries retrieve relevant chunks.
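The write-up does not say how the dense and sparse result lists are merged. One common option is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than the project's actual method. The function name and the chunk ids are hypothetical; `k = 60` is the constant commonly used with RRF.

```typescript
// Reciprocal Rank Fusion: each input is a list of chunk ids ordered best-first,
// e.g. one list from the vector query and one from the BM25 keyword query.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // Rank is 1-based; a larger k flattens the gap between adjacent ranks.
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage with made-up ids: chunks found by both retrievers rise to the top.
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // dense (semantic) ranking
  ["b", "d", "a"], // sparse (keyword) ranking
]);
```

RRF uses only rank positions, so it avoids having to normalize cosine similarities and BM25 scores onto a shared scale.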
3. Generation & Streaming API
- Retrieved top relevant chunks and prompted GPT-4o-mini to generate context-grounded answers.
- Built a Next.js App Router endpoint to stream answers back to the client in real-time, prepending citation metadata so users know exactly where the information came from.
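A minimal sketch of prepending citation metadata to a streamed answer, using only the web-standard `ReadableStream` and `TextEncoder` APIs. The wire format (one JSON line of citations followed by raw answer text), the `Citation` shape, and all function names are assumptions for illustration; the project's actual format is not described.

```typescript
// Hypothetical citation shape; the real metadata fields are not specified.
interface Citation {
  chunkId: string;
  source: string;
}

// Assumed wire format: one JSON line of citations, then the raw answer text.
// JSON.stringify escapes newlines, so the first "\n" always ends the header.
function encodeCitationHeader(citations: Citation[]): string {
  return JSON.stringify({ citations }) + "\n";
}

// Emits the citation header first, then forwards tokens as they arrive.
function streamWithCitations(
  citations: Citation[],
  tokens: AsyncIterable<string>,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        controller.enqueue(encoder.encode(encodeCitationHeader(citations)));
        for await (const token of tokens) {
          controller.enqueue(encoder.encode(token));
        }
        controller.close();
      } catch (err) {
        controller.error(err);
      }
    },
  });
}

// Client-side counterpart: separates the header line from the answer text.
function splitCitationHeader(body: string): { citations: Citation[]; answer: string } {
  const newline = body.indexOf("\n");
  if (newline === -1) throw new Error("missing citation header");
  const header = JSON.parse(body.slice(0, newline)) as { citations: Citation[] };
  return { citations: header.citations, answer: body.slice(newline + 1) };
}
```

In an App Router route handler, the stream could be returned directly with `new Response(streamWithCitations(citations, tokens))`, where `tokens` is whatever async iterable the LLM client exposes.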
4. Automated Evaluation Pipeline
- Created an LLM-as-a-Judge pipeline that runs automated evaluations on the system's outputs.
- Scored answer faithfulness on a scale of 0 to 1 against a curated golden Q&A dataset, persisting these metrics back to Supabase to actively monitor response drift over time.
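The scoring and drift-monitoring steps might look roughly like the sketch below. It assumes the judge model is prompted to reply with JSON such as `{"score": 0.8}`; that reply format, the function names, and the `tolerance` default are all hypothetical, and the call to the judge model itself is left out.

```typescript
// Parses a judge reply, assumed to be JSON like {"score": 0.8}.
// Returns null for malformed replies or scores outside [0, 1].
function parseJudgeScore(raw: string): number | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    const score = (parsed as { score?: unknown } | null)?.score;
    if (typeof score !== "number" || !Number.isFinite(score)) return null;
    if (score < 0 || score > 1) return null;
    return score;
  } catch {
    return null;
  }
}

// Mean faithfulness across one evaluation run over the golden dataset.
function meanFaithfulness(scores: number[]): number {
  if (scores.length === 0) throw new Error("no scores to aggregate");
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Flags a run whose mean fell more than `tolerance` below a stored baseline.
function hasDrifted(current: number, baseline: number, tolerance = 0.05): boolean {
  return baseline - current > tolerance;
}
```

Rejecting malformed judge replies, rather than coercing them to a number, keeps a misbehaving judge from silently skewing the stored metrics.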
Result
The outcome is a highly modular, production-ready RAG system that is transparent about its performance.
- Enhanced Reliability: Hybrid search improved retrieval precision over pure semantic search, particularly for queries that depend on exact keyword matches.
- Seamless User Experience: The Next.js streaming API delivers the answer as it is generated, so the web UI feels responsive in the same way ChatGPT does.
- Clean Maintainability: The monorepo structure allows the core RAG logic (packages/rag-core) to be swapped, tested, or upgraded without touching the front-end web application.
- Built-in Quality Assurance: The dedicated evaluation layer scores answer faithfulness on every run, turning a manual debugging step into a continuous integration check.
