
Building a Production-Ready RAG System From Scratch

A deep dive into building a Retrieval-Augmented Generation system with multi-provider LLM integration, intelligent memory management, and a modern UI - developed entirely from scratch.

November 30, 2025
10 min read
Ashraf Y.

I just wrapped up a major milestone on my RAG System - a hands-on build I developed entirely from scratch to understand how modern Retrieval-Augmented Generation systems really work under the hood. This wasn't just about connecting APIs; it was about building a production-grade system with security, performance, and user experience in mind.

Why Build From Scratch?

While there are many RAG frameworks and libraries available, building from scratch gave me invaluable insights into:

  • How LLM providers differ in their APIs and capabilities
  • The intricacies of vector search and semantic retrieval
  • Memory management and token optimization strategies
  • Security considerations for API key handling
  • The challenges of building reliable, production-ready AI systems

System Architecture Overview

The system consists of several key components that work together seamlessly:

  1. LLM Orchestration Layer: Multi-provider support with unified interface
  2. Security Layer: Encrypted API key management
  3. Memory Management: Token-aware state with intelligent context handling
  4. Vector Database: ChromaDB for semantic search
  5. Document Pipeline: Robust ingestion and processing
  6. User Interface: Streamlit-based frontend with modern UX

Let's dive into each component.

Multi-Provider LLM Integration

One of the core features is support for multiple LLM providers through a unified orchestration layer:

from typing import Dict, List

class LLMOrchestrator:
    def __init__(self, provider: str, api_key: str):
        self.provider = provider
        self.client = self._initialize_client(provider, api_key)
    
    def _initialize_client(self, provider: str, api_key: str):
        if provider == "groq":
            return GroqClient(api_key)
        elif provider == "openai":
            return OpenAIClient(api_key)
        elif provider == "gemini":
            return GeminiClient(api_key)
        elif provider == "deepseek":
            return DeepSeekClient(api_key)
        else:
            raise ValueError(f"Unsupported provider: {provider}")
    
    def generate(self, messages: List[Dict], **kwargs):
        return self.client.chat_completion(messages, **kwargs)

This abstraction allows seamless switching between providers without changing the core application logic. Each provider has its own adapter that handles provider-specific quirks and API differences.
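As an illustration, here is a minimal sketch of what one of these adapters could look like, using Groq's official Python SDK; the default model name and return shape are assumptions for this sketch, not the project's actual adapter code.

from typing import Dict, List

from groq import Groq  # Groq's official Python SDK

class GroqClient:
    """Adapter that hides Groq-specific details behind chat_completion()."""

    def __init__(self, api_key: str, model: str = "llama-3.1-8b-instant"):
        self.client = Groq(api_key=api_key)
        self.model = model  # default model is an assumption for this sketch

    def chat_completion(self, messages: List[Dict], **kwargs) -> str:
        # Groq exposes an OpenAI-compatible chat completions endpoint
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            **kwargs,
        )
        return response.choices[0].message.content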

Secure API Key Management

Security was a top priority. I implemented encrypted API key storage using Fernet symmetric encryption with user-specific key derivation:

import base64

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def derive_key(user_id: str, salt: bytes) -> bytes:
    # Derive a per-user Fernet key from the user ID and a stored salt
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt,
        iterations=100000,
    )
    return base64.urlsafe_b64encode(kdf.derive(user_id.encode()))

def encrypt_api_key(api_key: str, user_id: str) -> str:
    cipher = Fernet(derive_key(user_id, SALT))
    return cipher.encrypt(api_key.encode()).decode()

This ensures that API keys are never stored in plain text and are encrypted per-user, adding an extra layer of security.
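Decryption is the mirror image; a minimal sketch, reusing the derive_key helper and SALT from above:

def decrypt_api_key(encrypted_key: str, user_id: str) -> str:
    # Only the same user-derived key can decrypt the stored ciphertext
    cipher = Fernet(derive_key(user_id, SALT))
    return cipher.decrypt(encrypted_key.encode()).decode()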

Intelligent Memory Management

One of the biggest challenges in RAG systems is managing context effectively. I implemented a sophisticated memory management system with three key features:

1. Token-Aware State

The system tracks token usage in real-time to prevent exceeding model limits:

class MemoryManager:
    def __init__(self, max_tokens: int = 4096):
        self.max_tokens = max_tokens
        self.messages = []
        self.current_tokens = 0
    
    def count_tokens(self, text: str) -> int:
        # Approximate token count (1 token ≈ 4 chars)
        return len(text) // 4
    
    def add_message(self, role: str, content: str):
        tokens = self.count_tokens(content)
        
        if self.current_tokens + tokens > self.max_tokens:
            self._trim_context()
        
        self.messages.append({"role": role, "content": content})
        self.current_tokens += tokens
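In use, every turn passes through the manager before being sent to the model (the messages below are made-up examples):

memory = MemoryManager(max_tokens=4096)
memory.add_message("user", "What does the uploaded report say about Q3 revenue?")
memory.add_message("assistant", "According to report.pdf, Q3 revenue grew 12% year over year.")
print(memory.current_tokens)  # running total, kept below max_tokens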

2. Automatic Summarization

When the context grows too large, the system automatically summarizes older messages:

def _trim_context(self):
    # Summarize oldest messages to free up token space
    old_messages = self.messages[:5]
    summary = self._generate_summary(old_messages)
    
    # Replace old messages with summary
    self.messages = [
        {"role": "system", "content": f"Previous context: {summary}"}
    ] + self.messages[5:]
    
    # Recalculate token count
    self.current_tokens = sum(
        self.count_tokens(msg["content"]) 
        for msg in self.messages
    )

3. Context-Trim Logic

The system intelligently decides what to keep and what to trim based on relevance and recency.
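The post doesn't detail this logic, but "relevance and recency" can be expressed as a simple per-message score; the weights and the keyword-overlap relevance measure below are illustrative assumptions:

def score_message(message: dict, query: str, age: int, recency_weight: float = 0.6) -> float:
    """Messages with higher scores are kept; lower scores are trimmed or summarized first."""
    # Recency: newer messages (smaller age) score higher
    recency = 1.0 / (1.0 + age)
    # Relevance: crude keyword overlap between the message and the current query
    query_terms = set(query.lower().split())
    message_terms = set(message["content"].lower().split())
    relevance = len(query_terms & message_terms) / max(len(query_terms), 1)
    return recency_weight * recency + (1 - recency_weight) * relevance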

Vector Search with ChromaDB

For semantic retrieval, I used ChromaDB with HuggingFace BGE embeddings:

from typing import Dict, List

import chromadb
from chromadb.utils import embedding_functions

class VectorStore:
    def __init__(self):
        self.client = chromadb.PersistentClient(path="./chroma_db")
        # Run the BGE model locally via sentence-transformers
        self.embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name="BAAI/bge-small-en-v1.5"
        )
        self.collection = self.client.get_or_create_collection(
            name="documents",
            embedding_function=self.embedding_fn
        )
    
    def add_documents(self, documents: List[Dict]):
        self.collection.add(
            documents=[doc["content"] for doc in documents],
            metadatas=[doc["metadata"] for doc in documents],
            ids=[doc["id"] for doc in documents]
        )
    
    def search(self, query: str, n_results: int = 5, filter: Dict = None):
        return self.collection.query(
            query_texts=[query],
            n_results=n_results,
            where=filter
        )

The BGE embeddings provide excellent semantic search quality while being lightweight enough for fast retrieval.
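Typical usage boils down to two calls (the document and query below are made up for illustration):

store = VectorStore()
store.add_documents([
    {
        "id": "report.pdf_0",
        "content": "Q3 revenue grew 12% year over year.",
        "metadata": {"source": "report.pdf", "chunk_index": 0},
    }
])

results = store.search("How did revenue change in Q3?", n_results=3)
print(results["documents"][0])  # top matching chunks for the query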

Document Processing Pipeline

The document pipeline handles everything from ingestion to storage:

Multi-Format Parsing

import os
from datetime import datetime
from typing import Dict, List

class DocumentProcessor:
    def __init__(self):
        self.parsers = {
            ".pdf": PDFParser(),
            ".txt": TextParser(),
            ".docx": DocxParser(),
            ".md": MarkdownParser(),
        }
    
    def process(self, file_path: str) -> List[Dict]:
        ext = os.path.splitext(file_path)[1]
        parser = self.parsers.get(ext)
        
        if not parser:
            raise ValueError(f"Unsupported file format: {ext}")
        
        text = parser.parse(file_path)
        chunks = self._chunk_text(text)
        return self._enrich_metadata(chunks, file_path)

Intelligent Chunking

Documents are split into semantically meaningful chunks:

def _chunk_text(self, text: str, chunk_size: int = 500, overlap: int = 50):
    chunks = []
    sentences = text.split(". ")
    current_chunk = ""
    
    for sentence in sentences:
        if len(current_chunk) + len(sentence) < chunk_size:
            current_chunk += sentence + ". "
        else:
            chunks.append(current_chunk.strip())
            # Overlap: carry the tail of the previous chunk into the next one
            overlap_text = current_chunk[-overlap:] if overlap else ""
            current_chunk = overlap_text + sentence + ". "
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

Metadata Enrichment

Each chunk is enriched with metadata for better filtering and retrieval:

def _enrich_metadata(self, chunks: List[str], file_path: str):
    return [
        {
            "content": chunk,
            "metadata": {
                "source": file_path,
                "chunk_index": i,
                "total_chunks": len(chunks),
                "timestamp": datetime.now().isoformat(),
            },
            "id": f"{file_path}_{i}"
        }
        for i, chunk in enumerate(chunks)
    ]
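That metadata is what makes filtered retrieval possible, for example restricting a search to a single source file via the where clause used in VectorStore.search (the filename here is hypothetical):

results = store.search(
    "What were the key findings?",
    n_results=5,
    filter={"source": "uploads/research_paper.pdf"},
)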

System Prompting for Retrieval Fidelity

To prevent hallucinations, I implemented strict system prompts:

SYSTEM_PROMPT = """You are a helpful AI assistant with access to a document database.

CRITICAL RULES:
1. ONLY answer questions using information from the retrieved documents
2. If information is not in the documents, explicitly say "I don't have that information"
3. Always cite the source document when providing answers
4. Never make up or infer information not present in the documents
5. If asked about something outside the documents, politely redirect to available information

Format your responses with:
- Clear, concise answers
- Source citations: [Source: filename.pdf, Chunk X]
- Confidence indicators when appropriate
"""

This ensures the system stays grounded in the retrieved context and doesn't hallucinate.
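Tying retrieval and prompting together, here is a minimal sketch of how retrieved chunks might be injected into the final message list; the exact context formatting is an assumption, chosen to match the citation format requested in the system prompt:

from typing import Dict, List

def build_messages(query: str, retrieved: List[Dict]) -> List[Dict]:
    # Concatenate retrieved chunks with their sources so the model can cite them
    context = "\n\n".join(
        f"[Source: {doc['metadata']['source']}, Chunk {doc['metadata']['chunk_index']}]\n{doc['content']}"
        for doc in retrieved
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]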

Modern UI with Streamlit

The user interface was designed with UX best practices in mind:

Key Features

  1. API Key Configuration Modal: No command-line hassle, secure input right in the UI
  2. Document Upload Interface: Drag-and-drop support for easy document ingestion
  3. One-Click Actions: Clear database, reset chat history
  4. Message Management: Copy-to-clipboard for individual messages
  5. Export Functionality: Download full conversation as PDF
  6. Real-Time Feedback: Typing indicators and message timestamps

Here's a simplified look at the Streamlit code behind these features:

import streamlit as st

# Configuration Modal
with st.sidebar:
    with st.expander("🔑 API Configuration"):
        provider = st.selectbox("Provider", ["Groq", "OpenAI", "Gemini", "DeepSeek"])
        api_key = st.text_input("API Key", type="password")
        if st.button("Save Configuration"):
            save_encrypted_key(provider, api_key, st.session_state.user_id)
            st.success("Configuration saved securely!")

# Document Upload
uploaded_files = st.file_uploader(
    "Upload Documents",
    accept_multiple_files=True,
    type=["pdf", "txt", "docx", "md"]
)

if uploaded_files:
    with st.spinner("Processing documents..."):
        process_and_store_documents(uploaded_files)
    st.success(f"Processed {len(uploaded_files)} documents!")

# Chat Interface
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])
        st.caption(f"🕒 {message['timestamp']}")
        if st.button("📋 Copy", key=f"copy_{message['id']}"):
            st.write("Copied to clipboard!")
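The snippet above only renders the chat history; the question-answer loop that ties everything together looks roughly like the sketch below, assuming vector_store and orchestrator are instances of the classes from earlier sections and build_messages is the helper shown in the prompting section:

from datetime import datetime
from uuid import uuid4

if query := st.chat_input("Ask a question about your documents"):
    st.session_state.messages.append({
        "id": str(uuid4()),
        "role": "user",
        "content": query,
        "timestamp": datetime.now().strftime("%H:%M"),
    })

    with st.spinner("Searching documents..."):
        # Retrieve supporting chunks, then generate a grounded answer
        results = vector_store.search(query, n_results=5)
        retrieved = [
            {"content": doc, "metadata": meta}
            for doc, meta in zip(results["documents"][0], results["metadatas"][0])
        ]
        answer = orchestrator.generate(build_messages(query, retrieved))

    st.session_state.messages.append({
        "id": str(uuid4()),
        "role": "assistant",
        "content": answer,
        "timestamp": datetime.now().strftime("%H:%M"),
    })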

Performance Optimizations

Several optimizations ensure the system runs smoothly:

  1. Caching: Frequently accessed embeddings are cached (a minimal sketch of this follows the list)
  2. Batch Processing: Documents are processed in batches
  3. Lazy Loading: Vector store is loaded only when needed
  4. Connection Pooling: Reuse LLM API connections
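As an example of the first point, query embeddings can be memoized so repeated questions skip the embedding step entirely; a minimal sketch, assuming the BGE model is loaded directly through sentence-transformers:

from functools import lru_cache

from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

@lru_cache(maxsize=1024)
def embed_query(query: str) -> tuple:
    # lru_cache needs hashable values, so return the vector as a tuple of floats
    return tuple(_model.encode(query))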

Lessons Learned

Building this system from scratch taught me valuable lessons:

  1. LLM APIs vary significantly: Each provider has quirks that need careful handling
  2. Token management is crucial: Poor memory management leads to errors and high costs
  3. Vector search isn't magic: Chunk size, overlap, and embedding quality matter immensely
  4. Security can't be an afterthought: Encrypted storage and secure handling are essential
  5. UX makes or breaks AI tools: Even the best AI is useless if users can't interact with it effectively

Future Enhancements

There's always room for improvement:

  • [ ] Support for more document formats (HTML, XML, JSON)
  • [ ] Advanced filtering options (date ranges, custom metadata)
  • [ ] Multi-language support with language-specific embeddings
  • [ ] Hybrid search (combining semantic and keyword search)
  • [ ] Query rewriting for better retrieval
  • [ ] Conversation threading and branching
  • [ ] Analytics dashboard for usage insights

Conclusion

Building a RAG system from scratch is a challenging but incredibly rewarding experience. It forced me to understand every component deeply - from LLM APIs to vector databases, from security to UX design.

The result is a production-ready system that handles real-world use cases with:

  • Multiple LLM provider support
  • Secure API key management
  • Intelligent memory management
  • Precise vector search
  • Robust document processing
  • Modern, intuitive UI

If you're interested in building AI systems, I highly recommend starting from first principles. The knowledge gained from understanding how these systems work under the hood is invaluable.

The complete source code and documentation are available on GitHub. Feel free to explore, contribute, or use it as a learning resource for your own RAG projects!


Tech Stack: Python • ChromaDB • HuggingFace • Streamlit • Cryptography • LlamaIndex concepts

Key Takeaway: Building from scratch isn't about reinventing the wheel - it's about understanding the wheel so well that you can build better vehicles.

Tags

#rag
#llm
#ai
#vector-search
#chromadb
#python
#streamlit