RAG with Ollama: Local LLM, Private Data, Zero External API

Jul 2, 2026 · 7 min

RAG (Retrieval-Augmented Generation) lets an LLM answer questions from your own documents. Most tutorials pipe everything through OpenAI or Anthropic. Here’s how to build the same pipeline entirely on your own machine with Ollama.

Why Local?

Three practical reasons:

GDPR: sensitive data never leaves your infrastructure
Cost: zero API calls, zero per-token billing
Offline: works without an internet connection

The trade-off: you need a sufficiently powerful machine (at least 8 GB of VRAM for 7B models).
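
As a rough sanity check on that figure, weight memory is approximately the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope estimate only: it ignores the KV cache, activations, and runtime overhead, and the bit widths are illustrative assumptions, not the exact quantization formats Ollama ships.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# Compare a 7B model at three illustrative precisions
for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB of weights")
```

Whatever remains of the 8 GB goes to context and runtime buffers, which is why a quantized 7B model is a comfortable fit while a 16-bit one is not.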

Stack

Ollama            → local LLM server (Llama 3.1, Mistral, Phi-3...)
nomic-embed-text  → local embedding model (served by Ollama)
ChromaDB          → in-memory or persistent vector store
Python            → glue

Pipeline in 4 Steps

1. Install Ollama and Models

# Install Ollama (Linux/Mac)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the LLM and embedding model
ollama pull llama3.1:8b
ollama pull nomic-embed-text
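
Before moving on, it can help to confirm that the server is running and both models are present. The sketch below uses only the standard library and assumes Ollama is listening on its default address, http://localhost:11434, and that its /api/tags endpoint returns a JSON object with a "models" list. The helper names (installed_models, missing_models, normalize) are hypothetical, chosen for this example.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434"  # assumed default address; adjust if yours differs
REQUIRED = ["llama3.1:8b", "nomic-embed-text"]


def normalize(name: str) -> str:
    """Add the implicit ':latest' tag when a model name has none."""
    return name if ":" in name else f"{name}:latest"


def missing_models(installed: list[str], required: list[str]) -> list[str]:
    """Return the required models that are not in the installed list."""
    have = {normalize(n) for n in installed}
    return [r for r in required if normalize(r) not in have]


def installed_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> Optional[list[str]]:
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON
        return None
    return [m["name"] for m in payload.get("models", [])]
```

If installed_models() returns None, the server is probably not running. Otherwise, missing_models(installed_models(), REQUIRED) lists what still needs an ollama pull.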

2. Index Documents

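import chromadb
import ollama

def embed(text: str) -> list[float]:
    """Embed text with the local nomic-embed-text model served by Ollama."""
    resp = ollama.embeddings(model="nomic-embed-text", prompt=text)
    return resp["embedding"]

# In-memory client: the index is lost when the process exits
client = chromadb.Client()
collection = client.create_collection("docs")

# Placeholder documents; replace with your own content
documents = [
    {"id": "doc1", "text": "The company GDPR policy states that..."},
    {"id": "doc2", "text": "Customer data is stored in..."},
]

# Store each document's text alongside its embedding
for doc in documents:
    collection.add(
        ids=[doc["id"]],
        embeddings=[embed(doc["text"])],
        documents=[doc["text"]],
    )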
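
The loop above embeds each document as a single unit, which suits short texts. Longer documents are usually split into smaller passages first so that retrieval returns focused excerpts. Here is a minimal sketch of fixed-size chunking with overlap. The function names and the default sizes are assumptions for illustration, and character-based splitting is the simplest option, not necessarily the best one for your corpus.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of at most `size` characters.

    Consecutive windows share `overlap` characters, so a sentence cut at a
    boundary still appears whole in at least one chunk when it is short enough.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk.strip():  # skip whitespace-only windows
            chunks.append(chunk)
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def chunk_records(doc_id: str, text: str, size: int = 500, overlap: int = 50) -> list[dict]:
    """Build {"id", "text"} records, one per chunk, with ids derived from the document id."""
    return [
        {"id": f"{doc_id}-{i}", "text": chunk}
        for i, chunk in enumerate(chunk_text(text, size, overlap))
    ]
```

The records have the same shape as the entries in documents, so they can be fed to the same indexing loop without changes.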

3. Retrieval: Find Relevant Passages

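def retrieve(query: str, n: int = 3) -> list[str]:
    """Return the n stored passages closest to the query."""
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=n,
    )
    # Results are nested per query; we sent one query, so take index 0
    return results["documents"][0]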
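
To make the retrieval step less of a black box, here is what a nearest-neighbour lookup does conceptually: score every stored vector against the query vector and keep the best n. This is a brute-force illustration in plain Python, not ChromaDB's implementation. ChromaDB uses an approximate index and its configured distance metric, which is not necessarily cosine similarity. The names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" to show the ranking
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.0], toy, k=2))
```

Brute force is fine for a few thousand passages. A vector store earns its keep once the collection is too large to scan on every query.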

4. Generation with Context

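def rag(question: str) -> str:
    """Answer a question using only passages retrieved from the local index."""
    passages = retrieve(question)
    # Separate passages with blank lines so the model can tell them apart
    context = "\n\n".join(passages)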

    response = ollama.chat(
        model="llama3.1:8b",
        messages=[{
            "role": "user",
            "content": f"""Answer the question using ONLY the context provided.
If the answer is not in the context, say so clearly.

Context:
{context}

Question: {question}"""
        }]
    )
    return response["message"]["content"]
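
One possible refinement is to build the prompt in a separate function, which makes it easy to inspect and test without calling the model. The sketch below keeps the same instructions as rag() and adds two things that are not in the original: numbered passages, so the model can be asked to cite them, and an explicit placeholder when retrieval returns nothing. build_prompt is a hypothetical helper name.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble the grounded prompt from a question and its retrieved passages."""
    if passages:
        # Number the passages so the answer can refer back to them
        context = "\n\n".join(
            f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1)
        )
    else:
        context = "(no relevant passages found)"
    return (
        "Answer the question using ONLY the context provided.\n"
        "If the answer is not in the context, say so clearly.\n"
        "Cite the passage numbers you relied on, e.g. [1].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


print(build_prompt("Where is customer data stored?", ["Customer data is stored in..."]))
```

Printing the prompt for a few real questions is a quick way to spot retrieval problems before blaming the model.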

What Changes vs Cloud RAG

Local vs cloud embedding: nomic-embed-text is noticeably less accurate than OpenAI’s text-embedding-3-large on complex corpora. On domain-specific documents the gap is real, so always benchmark on your actual data before choosing.
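
One simple way to run that benchmark is a hit rate at k: write a handful of questions, note which document should answer each one, and measure how often that document shows up in the top k results. The sketch below is a minimal version under stated assumptions. retrieve_ids is a hypothetical variant of retrieve() that returns document ids (for example results["ids"][0]) instead of passage texts, and the evaluation set is something you write by hand.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    retrieve_ids: Callable[[str, int], list[str]],
    k: int = 3,
) -> float:
    """Fraction of questions whose expected document id appears in the top k results.

    eval_set holds (question, expected_document_id) pairs.
    """
    if not eval_set:
        raise ValueError("eval_set must contain at least one question")
    hits = sum(
        1 for question, expected_id in eval_set
        if expected_id in retrieve_ids(question, k)
    )
    return hits / len(eval_set)
```

Running the same evaluation set against two embedding models gives a like-for-like comparison on your own corpus, which matters more than a public leaderboard score.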

Latency: on an M2 MacBook or a Linux box with a mid-range GPU, expect roughly 2–5 s per response from a 7B-class model. That is fine for internal tools, but too slow for real-time use.
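
To check those figures on your own hardware, a small timing helper is enough. The sketch below times any zero-argument callable and reports the median, minimum, and maximum. The warm-up call is there because the first request typically includes loading the model into memory. measure_latency is a hypothetical helper, and the default run counts are arbitrary.

```python
import statistics
import time
from typing import Callable


def measure_latency(fn: Callable[[], object], runs: int = 5, warmup: int = 1) -> dict[str, float]:
    """Time repeated calls to fn and return median, min, and max in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        fn()  # discard warm-up calls, e.g. the initial model load
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }
```

With the pipeline from this article, a call such as measure_latency(lambda: rag("Where is customer data stored?")) would time the full retrieve-and-generate round trip.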

ChromaDB in production: use persistent mode or swap to pgvector (PostgreSQL) for higher data volumes.

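# Persistent ChromaDB: the index is written to disk and survives restarts
client = chromadb.PersistentClient(path="/data/chroma")
# Reuse the collection if it already exists instead of failing on a second run
collection = client.get_or_create_collection("docs")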

Best Use Cases

Internal HR knowledge base (contracts, policies)
Customer support on proprietary documentation
Assistant on medical or legal data
Any context where data must not leave the internal network

Stéphanie Caumont

AI Product Owner