I just wrapped up a major milestone on my RAG system - a hands-on build developed entirely from scratch to understand how modern Retrieval-Augmented Generation systems really work under the hood. This wasn't just about connecting APIs; it was about building a production-grade system with security, performance, and user experience in mind.
Why Build From Scratch?
While there are many RAG frameworks and libraries available, building from scratch gave me invaluable insights into:
- How LLM providers differ in their APIs and capabilities
- The intricacies of vector search and semantic retrieval
- Memory management and token optimization strategies
- Security considerations for API key handling
- The challenges of building reliable, production-ready AI systems
System Architecture Overview
The system consists of several key components that work together seamlessly:
- LLM Orchestration Layer: Multi-provider support with unified interface
- Security Layer: Encrypted API key management
- Memory Management: Token-aware state with intelligent context handling
- Vector Database: ChromaDB for semantic search
- Document Pipeline: Robust ingestion and processing
- User Interface: Streamlit-based frontend with modern UX
Let's dive into each component.
Multi-Provider LLM Integration
One of the core features is support for multiple LLM providers through a unified orchestration layer:
```python
from typing import Dict, List

class LLMOrchestrator:
    def __init__(self, provider: str, api_key: str):
        self.provider = provider
        self.client = self._initialize_client(provider, api_key)

    def _initialize_client(self, provider: str, api_key: str):
        # Each client is a thin adapter exposing a common chat_completion() interface.
        if provider == "groq":
            return GroqClient(api_key)
        elif provider == "openai":
            return OpenAIClient(api_key)
        elif provider == "gemini":
            return GeminiClient(api_key)
        elif provider == "deepseek":
            return DeepSeekClient(api_key)
        else:
            raise ValueError(f"Unsupported provider: {provider}")

    def generate(self, messages: List[Dict], **kwargs):
        return self.client.chat_completion(messages, **kwargs)
```
This abstraction allows seamless switching between providers without changing the core application logic. Each provider has its own adapter that handles provider-specific quirks and API differences.
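As a quick illustration, the calling code stays the same no matter which provider sits behind it (the key variables below are placeholders, not part of the project):

```python
# Same call path for every provider; only the constructor arguments change.
orchestrator = LLMOrchestrator(provider="groq", api_key=user_groq_key)
reply = orchestrator.generate(
    messages=[{"role": "user", "content": "Summarize the uploaded report."}],
    temperature=0.2,
)

# Swapping providers is a one-line change.
orchestrator = LLMOrchestrator(provider="openai", api_key=user_openai_key)
```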
Secure API Key Management
Security was a top priority. I implemented encrypted API key storage using Fernet symmetric encryption with user-specific key derivation:
```python
import base64

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def derive_key(user_id: str, salt: bytes) -> bytes:
    # Derive a per-user Fernet key from the user ID and a deployment-wide salt.
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt,
        iterations=100000,
    )
    return base64.urlsafe_b64encode(kdf.derive(user_id.encode()))

def encrypt_api_key(api_key: str, user_id: str) -> str:
    # SALT is loaded from secure configuration rather than hard-coded.
    cipher = Fernet(derive_key(user_id, SALT))
    return cipher.encrypt(api_key.encode()).decode()
```
This ensures that API keys are never stored in plain text and are encrypted per-user, adding an extra layer of security.
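For completeness, the decryption path is the mirror image; here is a short sketch that follows directly from the functions above (decryption happens in memory at request time, so the plaintext key is never persisted):

```python
def decrypt_api_key(token: str, user_id: str) -> str:
    # Re-derive the same per-user key; SALT comes from secure configuration.
    cipher = Fernet(derive_key(user_id, SALT))
    return cipher.decrypt(token.encode()).decode()
```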
Intelligent Memory Management
One of the biggest challenges in RAG systems is managing context effectively. I implemented a sophisticated memory management system with three key features:
1. Token-Aware State
The system tracks token usage in real-time to prevent exceeding model limits:
```python
class MemoryManager:
    def __init__(self, max_tokens: int = 4096):
        self.max_tokens = max_tokens
        self.messages = []
        self.current_tokens = 0

    def count_tokens(self, text: str) -> int:
        # Approximate token count (1 token ≈ 4 chars)
        return len(text) // 4

    def add_message(self, role: str, content: str):
        tokens = self.count_tokens(content)
        if self.current_tokens + tokens > self.max_tokens:
            self._trim_context()
        self.messages.append({"role": role, "content": content})
        self.current_tokens += tokens
```
2. Automatic Summarization
When the context grows too large, the system automatically summarizes older messages:
```python
def _trim_context(self):
    # Summarize the oldest messages to free up token space
    old_messages = self.messages[:5]
    summary = self._generate_summary(old_messages)

    # Replace old messages with the summary
    self.messages = [
        {"role": "system", "content": f"Previous context: {summary}"}
    ] + self.messages[5:]

    # Recalculate the token count
    self.current_tokens = sum(
        self.count_tokens(msg["content"])
        for msg in self.messages
    )
```
3. Context-Trim Logic
The system intelligently decides what to keep and what to trim based on relevance and recency.
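The exact policy is where most of the tuning lives. As an illustrative sketch only (the weights and the lexical-overlap stand-in for embedding similarity are placeholders, not the production scoring), a per-message score might combine the two signals like this:

```python
def _score_message(self, msg: dict, position: int, total: int, query: str) -> float:
    # Recency: later messages score higher (0..1).
    recency = (position + 1) / total
    # Relevance: cheap lexical overlap with the current query, standing in
    # for a proper embedding-similarity score.
    query_terms = set(query.lower().split())
    msg_terms = set(msg["content"].lower().split())
    relevance = len(query_terms & msg_terms) / max(len(query_terms), 1)
    # Illustrative weighting; real weights would be tuned empirically.
    return 0.6 * recency + 0.4 * relevance
```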
Vector Search with ChromaDB
For semantic retrieval, I used ChromaDB with HuggingFace BGE embeddings:
```python
from typing import Dict, List

import chromadb
from chromadb.utils import embedding_functions

class VectorStore:
    def __init__(self):
        self.client = chromadb.PersistentClient(path="./chroma_db")
        self.embedding_fn = embedding_functions.HuggingFaceEmbeddingFunction(
            model_name="BAAI/bge-small-en-v1.5"
        )
        self.collection = self.client.get_or_create_collection(
            name="documents",
            embedding_function=self.embedding_fn
        )

    def add_documents(self, documents: List[Dict]):
        self.collection.add(
            documents=[doc["content"] for doc in documents],
            metadatas=[doc["metadata"] for doc in documents],
            ids=[doc["id"] for doc in documents]
        )

    def search(self, query: str, n_results: int = 5, filter: Dict = None):
        return self.collection.query(
            query_texts=[query],
            n_results=n_results,
            where=filter
        )
```
The BGE embeddings provide excellent semantic search quality while being lightweight enough for fast retrieval.
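For reference, a retrieval call and the shape of ChromaDB's `query()` response look roughly like this (the query string is a placeholder; each response field is a list of lists, one inner list per query text):

```python
store = VectorStore()
results = store.search("What were the Q3 revenue drivers?", n_results=3)

# Pair each retrieved chunk with its metadata for citation later.
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(f"[{meta['source']} / chunk {meta['chunk_index']}] {doc[:80]}...")
```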
Document Processing Pipeline
The document pipeline handles everything from ingestion to storage:
Multi-Format Parsing
```python
import os
from typing import Dict, List

class DocumentProcessor:
    def __init__(self):
        # Each parser extracts plain text from its format.
        self.parsers = {
            ".pdf": PDFParser(),
            ".txt": TextParser(),
            ".docx": DocxParser(),
            ".md": MarkdownParser(),
        }

    def process(self, file_path: str) -> List[Dict]:
        ext = os.path.splitext(file_path)[1]
        parser = self.parsers.get(ext)
        if not parser:
            raise ValueError(f"Unsupported file format: {ext}")
        text = parser.parse(file_path)
        chunks = self._chunk_text(text)
        return self._enrich_metadata(chunks, file_path)
```
Intelligent Chunking
Documents are split into semantically meaningful chunks:
```python
def _chunk_text(self, text: str, chunk_size: int = 500, overlap: int = 50):
    chunks = []
    sentences = text.split(". ")
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) < chunk_size:
            current_chunk += sentence + ". "
        else:
            chunks.append(current_chunk.strip())
            # Overlap: carry the tail of the finished chunk into the next one
            # so context is preserved across chunk boundaries
            tail = current_chunk[-overlap:] if overlap else ""
            current_chunk = tail + sentence + ". "
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
```
Metadata Enrichment
Each chunk is enriched with metadata for better filtering and retrieval:
```python
from datetime import datetime

def _enrich_metadata(self, chunks: List[str], file_path: str):
    return [
        {
            "content": chunk,
            "metadata": {
                "source": file_path,
                "chunk_index": i,
                "total_chunks": len(chunks),
                "timestamp": datetime.now().isoformat(),
            },
            "id": f"{file_path}_{i}"
        }
        for i, chunk in enumerate(chunks)
    ]
```
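Putting the pipeline together, ingestion is simply the processor feeding the vector store (the file path below is a placeholder):

```python
processor = DocumentProcessor()
store = VectorStore()

# Parse, chunk, enrich, then persist the chunks for semantic retrieval.
docs = processor.process("reports/q3_summary.pdf")
store.add_documents(docs)
```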
System Prompting for Retrieval Fidelity
To prevent hallucinations, I implemented strict system prompts:
```python
SYSTEM_PROMPT = """You are a helpful AI assistant with access to a document database.

CRITICAL RULES:
1. ONLY answer questions using information from the retrieved documents
2. If information is not in the documents, explicitly say "I don't have that information"
3. Always cite the source document when providing answers
4. Never make up or infer information not present in the documents
5. If asked about something outside the documents, politely redirect to available information

Format your responses with:
- Clear, concise answers
- Source citations: [Source: filename.pdf, Chunk X]
- Confidence indicators when appropriate
"""
```
This ensures the system stays grounded in the retrieved context and doesn't hallucinate.
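The snippet above leaves out how the retrieved chunks and the system prompt are combined into a single request. Here is a sketch of one reasonable layout (the `build_rag_messages` helper and the exact message structure are assumptions for illustration, not the project's verbatim code):

```python
from typing import Dict, List

def build_rag_messages(query: str, retrieved: List[Dict]) -> List[Dict]:
    # Inline each chunk with its source so the model can cite it verbatim.
    context = "\n\n".join(
        f"[Source: {doc['metadata']['source']}, Chunk {doc['metadata']['chunk_index']}]\n"
        f"{doc['content']}"
        for doc in retrieved
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "system", "content": f"Retrieved documents:\n{context}"},
        {"role": "user", "content": query},
    ]
```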
Modern UI with Streamlit
The user interface was designed with UX best practices in mind:
Key Features
- API Key Configuration Modal: No command-line hassle, secure input right in the UI
- Document Upload Interface: Drag-and-drop support for easy document ingestion
- One-Click Actions: Clear database, reset chat history
- Message Management: Copy-to-clipboard for individual messages
- Export Functionality: Download full conversation as PDF
- Real-Time Feedback: Typing indicators and message timestamps
```python
import streamlit as st

# Configuration modal
with st.sidebar:
    with st.expander("🔑 API Configuration"):
        provider = st.selectbox("Provider", ["Groq", "OpenAI", "Gemini", "DeepSeek"])
        api_key = st.text_input("API Key", type="password")
        if st.button("Save Configuration"):
            save_encrypted_key(provider, api_key, st.session_state.user_id)
            st.success("Configuration saved securely!")

# Document upload
uploaded_files = st.file_uploader(
    "Upload Documents",
    accept_multiple_files=True,
    type=["pdf", "txt", "docx", "md"]
)
if uploaded_files:
    with st.spinner("Processing documents..."):
        process_and_store_documents(uploaded_files)
    st.success(f"Processed {len(uploaded_files)} documents!")

# Chat interface
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])
        st.caption(f"🕒 {message['timestamp']}")
        if st.button("📋 Copy", key=f"copy_{message['id']}"):
            st.write("Copied to clipboard!")
```
Performance Optimizations
Several optimizations ensure the system runs smoothly:
- Caching: Frequently accessed embeddings are cached (a minimal sketch follows this list)
- Batch Processing: Documents are processed in batches
- Lazy Loading: Vector store is loaded only when needed
- Connection Pooling: Reuse LLM API connections
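To make the caching point concrete, here is a minimal sketch of the idea using `functools.lru_cache` with sentence-transformers directly; the project itself routes embeddings through ChromaDB's embedding function, so treat this as an illustration of the pattern rather than the actual code path:

```python
from functools import lru_cache

from sentence_transformers import SentenceTransformer

# Load the embedding model once at startup instead of per request.
_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

@lru_cache(maxsize=1024)
def embed_query(query: str) -> tuple:
    # Identical repeated queries skip the model entirely.
    # lru_cache needs a hashable return value, hence the tuple.
    return tuple(_model.encode(query).tolist())
```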
Lessons Learned
Building this system from scratch taught me valuable lessons:
- LLM APIs vary significantly: Each provider has quirks that need careful handling
- Token management is crucial: Poor memory management leads to errors and high costs
- Vector search isn't magic: Chunk size, overlap, and embedding quality matter immensely
- Security can't be an afterthought: Encrypted storage and secure handling are essential
- UX makes or breaks AI tools: Even the best AI is useless if users can't interact with it effectively
Future Enhancements
There's always room for improvement:
- [ ] Support for more document formats (HTML, XML, JSON)
- [ ] Advanced filtering options (date ranges, custom metadata)
- [ ] Multi-language support with language-specific embeddings
- [ ] Hybrid search (combining semantic and keyword search)
- [ ] Query rewriting for better retrieval
- [ ] Conversation threading and branching
- [ ] Analytics dashboard for usage insights
Conclusion
Building a RAG system from scratch is a challenging but incredibly rewarding experience. It forced me to understand every component deeply - from LLM APIs to vector databases, from security to UX design.
The result is a production-ready system that handles real-world use cases with:
- Multiple LLM provider support
- Secure API key management
- Intelligent memory management
- Precise vector search
- Robust document processing
- Modern, intuitive UI
If you're interested in building AI systems, I highly recommend starting from first principles. The knowledge gained from understanding how these systems work under the hood is invaluable.
The complete source code and documentation are available on GitHub. Feel free to explore, contribute, or use it as a learning resource for your own RAG projects!
Tech Stack: Python • ChromaDB • HuggingFace • Streamlit • Cryptography • LlamaIndex concepts
Key Takeaway: Building from scratch isn't about reinventing the wheel - it's about understanding the wheel so well that you can build better vehicles.



