Building a Production-Ready RAG System From Scratch
A deep dive into building a Retrieval-Augmented Generation system with multi-provider LLM integration, intelligent memory management, and a modern UI - developed entirely from scratch.
I just wrapped up a major milestone on my RAG System - a hands-on build I developed entirely from scratch to understand how modern Retrieval-Augmented Generation systems really work under the hood. This wasn't just about connecting APIs; it was about building a production-grade system with security, performance, and user experience in mind.
Why Build From Scratch?
While there are many RAG frameworks and libraries available, building from scratch gave me invaluable insights into:
How LLM providers differ in their APIs and capabilities
The intricacies of vector search and semantic retrieval
Memory management and token optimization strategies
Security considerations for API key handling
The challenges of building reliable, production-ready AI systems
System Architecture Overview
The system consists of several key components that work together seamlessly:
LLM Orchestration Layer: Multi-provider support with unified interface
Security Layer: Encrypted API key management
Memory Management: Token-aware state with intelligent context handling
Vector Database: ChromaDB for semantic search
Document Pipeline: Robust ingestion and processing
User Interface: Streamlit-based frontend with modern UX
Let's dive into each component.
Multi-Provider LLM Integration
One of the core features is support for multiple LLM providers through a unified orchestration layer:
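A condensed sketch of the pattern rather than the project's exact classes - the OpenAI adapter shows the real SDK shape, the other adapters are placeholders, and class and model names are illustrative:

from abc import ABC, abstractmethod

from openai import OpenAI


class LLMProvider(ABC):
    """Unified interface that every provider adapter implements."""

    @abstractmethod
    def complete(self, messages: list[dict], **kwargs) -> str: ...


class OpenAIProvider(LLMProvider):
    def __init__(self, api_key: str, model: str = "gpt-4o-mini"):
        self.client = OpenAI(api_key=api_key)
        self.model = model

    def complete(self, messages: list[dict], **kwargs) -> str:
        resp = self.client.chat.completions.create(
            model=self.model, messages=messages, **kwargs
        )
        return resp.choices[0].message.content


# Adapters for Groq, Gemini, and DeepSeek follow the same shape, each
# absorbing its provider's API differences behind the common interface.


class LLMOrchestrator:
    """Routes requests to the active adapter; switching is a config change."""

    def __init__(self, adapters: dict[str, LLMProvider]):
        self.adapters = adapters

    def ask(self, provider: str, messages: list[dict], **kwargs) -> str:
        return self.adapters[provider].complete(messages, **kwargs)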
This abstraction allows seamless switching between providers without changing the core application logic. Each provider has its own adapter that handles provider-specific quirks and API differences.
Secure API Key Management
Security was a top priority. I implemented encrypted API key storage using Fernet symmetric encryption with user-specific key derivation:
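A minimal sketch of the approach, assuming PBKDF2-HMAC for the per-user derivation; function names, salt handling, and the iteration count are illustrative, not the project's exact code:

import base64

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC


def _derive_fernet_key(user_secret: str, salt: bytes) -> bytes:
    # Stretch a user-specific secret into the 32-byte urlsafe-base64 key
    # format that Fernet expects.
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(), length=32, salt=salt, iterations=480_000
    )
    return base64.urlsafe_b64encode(kdf.derive(user_secret.encode()))


def encrypt_api_key(api_key: str, user_secret: str, salt: bytes) -> bytes:
    return Fernet(_derive_fernet_key(user_secret, salt)).encrypt(api_key.encode())


def decrypt_api_key(token: bytes, user_secret: str, salt: bytes) -> str:
    return Fernet(_derive_fernet_key(user_secret, salt)).decrypt(token).decode()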
This ensures that API keys are never stored in plain text and are encrypted per-user, adding an extra layer of security.
Intelligent Memory Management
One of the biggest challenges in RAG systems is managing context effectively. I implemented a sophisticated memory management system with three key features:
1. Token-Aware State
The system tracks token usage in real-time to prevent exceeding model limits:
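A minimal sketch of the idea, counting with tiktoken; the class name and the drop-oldest eviction policy are illustrative assumptions:

import tiktoken


class TokenAwareMemory:
    """Keeps conversation history inside a model's context budget."""

    def __init__(self, max_tokens: int = 6000):
        self.max_tokens = max_tokens
        self.encoding = tiktoken.get_encoding("cl100k_base")
        self.messages: list[dict] = []

    def _count(self, text: str) -> int:
        return len(self.encoding.encode(text))

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        # Evict the oldest turns until the running total fits the budget.
        while sum(self._count(m["content"]) for m in self.messages) > self.max_tokens:
            self.messages.pop(0)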
2. Semantic Chunking
Documents are split into semantically meaningful chunks:
def _chunk_text(self, text: str, chunk_size: int = 500, overlap: int = 50):
    """Split text on sentence boundaries into roughly chunk_size-character chunks."""
    chunks = []
    sentences = text.split(". ")
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) < chunk_size:
            current_chunk += sentence + ". "
        else:
            chunks.append(current_chunk.strip())
            # Overlap: seed the next chunk with the tail of the previous one
            # so context is preserved across chunk boundaries.
            current_chunk = current_chunk[-overlap:] + sentence + ". "
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
3. Metadata Enrichment
Each chunk is enriched with metadata for better filtering and retrieval:
from datetime import datetime
from typing import Dict, List


def _enrich_metadata(self, chunks: List[str], file_path: str) -> List[Dict]:
    """Attach source, position, and timestamp metadata to each chunk."""
    return [
        {
            "content": chunk,
            "metadata": {
                "source": file_path,
                "chunk_index": i,
                "total_chunks": len(chunks),
                "timestamp": datetime.now().isoformat(),
            },
            "id": f"{file_path}_{i}",
        }
        for i, chunk in enumerate(chunks)
    ]
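From there, the enriched chunks go into ChromaDB for semantic search. A minimal sketch of the persistence and query side using ChromaDB's client API - the collection name and storage path are assumptions, and Chroma computes embeddings with its default embedding function here:

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="documents")


def store_chunks(enriched: list[dict]) -> None:
    # Ids, raw text, and metadata come straight from _enrich_metadata.
    collection.add(
        ids=[item["id"] for item in enriched],
        documents=[item["content"] for item in enriched],
        metadatas=[item["metadata"] for item in enriched],
    )


def retrieve(query: str, k: int = 4) -> dict:
    # Returns the k chunks closest to the query in embedding space.
    return collection.query(query_texts=[query], n_results=k)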
System Prompting for Retrieval Fidelity
To prevent hallucinations, I implemented strict system prompts:
SYSTEM_PROMPT = """You are a helpful AI assistant with access to a document database.
CRITICAL RULES:
1. ONLY answer questions using information from the retrieved documents
2. If information is not in the documents, explicitly say "I don't have that information"
3. Always cite the source document when providing answers
4. Never make up or infer information not present in the documents
5. If asked about something outside the documents, politely redirect to available information
Format your responses with:
- Clear, concise answers
- Source citations: [Source: filename.pdf, Chunk X]
- Confidence indicators when appropriate
"""
This ensures the system stays grounded in the retrieved context and doesn't hallucinate.
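The glue between retrieval and generation then looks roughly like this; the helper name and message layout are illustrative:

def build_messages(question: str, retrieved: list[dict]) -> list[dict]:
    # Label every chunk with its source so the model can cite it in the
    # [Source: filename, Chunk X] format the system prompt demands.
    context = "\n\n".join(
        f"[Source: {c['metadata']['source']}, Chunk {c['metadata']['chunk_index']}]\n"
        f"{c['content']}"
        for c in retrieved
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
    ]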
Modern UI with Streamlit
The user interface was designed with UX best practices in mind:
Key Features
API Key Configuration Modal: No command-line hassle, secure input right in the UI
Document Upload Interface: Drag-and-drop support for easy document ingestion
One-Click Actions: Clear database, reset chat history
Message Management: Copy-to-clipboard for individual messages
Export Functionality: Download full conversation as PDF
Real-Time Feedback: Typing indicators and message timestamps
import streamlit as st

# Configuration Modal
with st.sidebar:
    with st.expander("🔑 API Configuration"):
        provider = st.selectbox("Provider", ["Groq", "OpenAI", "Gemini", "DeepSeek"])
        api_key = st.text_input("API Key", type="password")
        if st.button("Save Configuration"):
            save_encrypted_key(provider, api_key, st.session_state.user_id)
            st.success("Configuration saved securely!")

# Document Upload
uploaded_files = st.file_uploader(
    "Upload Documents",
    accept_multiple_files=True,
    type=["pdf", "txt", "docx", "md"],
)
if uploaded_files:
    with st.spinner("Processing documents..."):
        process_and_store_documents(uploaded_files)
    st.success(f"Processed {len(uploaded_files)} documents!")

# Chat Interface
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])
        st.caption(f"🕒 {message['timestamp']}")
        if st.button("📋 Copy", key=f"copy_{message['id']}"):
            st.write("Copied to clipboard!")
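The PDF export from the feature list sits outside this snippet; a minimal sketch, assuming fpdf2 as the PDF backend and Streamlit's download_button for delivery:

from fpdf import FPDF  # fpdf2
import streamlit as st


def conversation_to_pdf(messages: list[dict]) -> bytes:
    pdf = FPDF()
    pdf.add_page()
    # Core PDF fonts are latin-1 only; register a Unicode TTF for full coverage.
    pdf.set_font("Helvetica", size=11)
    for m in messages:
        pdf.multi_cell(0, 8, f"{m['role']}: {m['content']}")
    return bytes(pdf.output())


st.download_button(
    "⬇️ Export conversation as PDF",
    data=conversation_to_pdf(st.session_state.messages),
    file_name="conversation.pdf",
    mime="application/pdf",
)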
Performance Optimizations
Several optimizations ensure the system runs smoothly:
Caching: Frequently accessed embeddings are cached (see the sketch after this list)
Batch Processing: Documents are processed in batches
Lazy Loading: Vector store is loaded only when needed
Connection Pooling: Reuse LLM API connections
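As an illustration of the caching point, embedding lookups can be memoized; a minimal sketch, assuming sentence-transformers as the embedding backend (the model choice is arbitrary):

from functools import lru_cache

from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")


@lru_cache(maxsize=4096)
def cached_embedding(text: str) -> tuple[float, ...]:
    # lru_cache memoizes repeated queries; the vector is returned as a
    # tuple because numpy arrays are not hashable.
    return tuple(_model.encode(text))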
Lessons Learned
Building this system from scratch taught me valuable lessons:
LLM APIs vary significantly: Each provider has quirks that need careful handling
Token management is crucial: Poor memory management leads to errors and high costs
What's Next
[ ] Multi-language support with language-specific embeddings
[ ] Hybrid search (combining semantic and keyword search)
[ ] Query rewriting for better retrieval
[ ] Conversation threading and branching
[ ] Analytics dashboard for usage insights
Conclusion
Building a RAG system from scratch is a challenging but incredibly rewarding experience. It forced me to understand every component deeply - from LLM APIs to vector databases, from security to UX design.
The result is a production-ready system that handles real-world use cases with:
Multiple LLM provider support
Secure API key management
Intelligent memory management
Precise vector search
Robust document processing
Modern, intuitive UI
If you're interested in building AI systems, I highly recommend starting from first principles. The knowledge gained from understanding how these systems work under the hood is invaluable.
The complete source code and documentation are available on GitHub. Feel free to explore, contribute, or use it as a learning resource for your own RAG projects!