How I Built a RAG System in Rails Using Nomic Embeddings and OpenAI

Retrieval-Augmented Generation (RAG) is a practical way to bring your own data into LLM workflows. Instead of fine-tuning, you give the model context that makes its answers specific and trustworthy.

In this post, I’ll walk through how I wired up a RAG pipeline inside a Rails app using:

  • Nomic embeddings (open-source, high-quality, self-hostable)
  • PgVector for vector search
  • OpenAI for response generation

The result is a system that feels light and flexible, and doesn’t lock you into a single vendor.


🧠 What Is RAG, Really?

Think of RAG as a two-step handshake:

  1. Find the right data → Embed the query, search your knowledge base, and pull back relevant snippets.
  2. Generate with context → Hand both the query and those snippets to an LLM so it answers with precision.

It looks like this:

[ User Question ]
↓
[ Embed with Nomic ]
↓
[ Vector Search in PgVector ]
↓
[ Retrieve Relevant Chunks ]
↓
[ Assemble Prompt ]
↓
[ Generate Answer with OpenAI ]

This avoids heavy fine-tuning and keeps your system adaptable.
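
In Rails terms, the whole loop fits in one service object. Here’s the rough shape (a sketch, not the final code; each helper is fleshed out in the steps below):

def answer_question(question)
  top_chunks = retrieve_chunks(question)                     # embed the query + vector search
  prompt     = build_contextual_prompt(question, top_chunks) # stitch context into a prompt
  generate_answer(prompt)                                     # hand it to OpenAI
end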


🧰 The Stack

  • Rails → Controllers, persistence, orchestration
  • FastAPI → Lightweight Python service serving Nomic embeddings
  • Nomic Embedding Model → For semantic search
  • PgVector → PostgreSQL extension for vector queries
  • OpenAI GPT-4 / GPT-3.5 → Generation step

🛠 Step 1: Running Nomic Locally

I wanted to avoid API costs and limits, so I served embeddings locally via FastAPI and sentence-transformers:

# e.g. save as main.py and run: uvicorn main:app --port 8000
from fastapi import FastAPI, Request
from sentence_transformers import SentenceTransformer

app = FastAPI()

# trust_remote_code is required for Nomic's custom model code; note that
# Nomic embed models expect task prefixes like "search_query: " /
# "search_document: " (see the model card)
model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

@app.post("/embed")
async def embed(req: Request):
    # expects a JSON body like {"input": "some text"}
    data = await req.json()
    input_text = data["input"]
    embedding = model.encode(input_text).tolist()
    return { "embedding": embedding }

This internal API cleanly replaces OpenAI’s /embeddings.

📄 Step 2: Chunk and Store Data

Split content into short passages (~100–300 words). Embed each passage and store it in Postgres with pgvector:

class AddEmbeddingToDocuments < ActiveRecord::Migration[7.1]
  def change
    enable_extension "vector" # CREATE EXTENSION IF NOT EXISTS vector;

    # the :vector column type comes from the neighbor gem
    add_column :documents, :embedding, :vector, limit: 768 # Nomic v2-moe embeddings are 768-dimensional
  end
end
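
The migration only adds the column; ingestion is a loop over your source content. A minimal sketch, assuming a Document model with has_neighbors :embedding (neighbor gem) and the get_embedding helper from Step 3; the word-count chunking here is deliberately naive:

def ingest(text, words_per_chunk: 200)
  text.split.each_slice(words_per_chunk) do |words|
    chunk = words.join(" ")
    # neighbor serializes the Ruby array into a pgvector value
    Document.create!(content: chunk, embedding: get_embedding(chunk))
  end
end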

🤖 Step 3: Embed Queries

Your Rails controller can call the FastAPI service:

def get_embedding(text)
  response = Faraday.post(
    "http://localhost:8000/embed",
    { input: text }.to_json,
    "Content-Type" => "application/json"
  )
  JSON.parse(response.body)["embedding"]
end

Use the same embedding model for both queries and documents — consistency matters.

🔍 Step 4: Vector Search

Find the closest matches with cosine distance (pgvector’s <=> operator):

Document
  .order(Arel.sql("embedding <=> '[#{query_vector.join(',')}]'"))
  .limit(5)

These top matches form the “knowledge pack” for the LLM.
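
Wrapped in a helper (the name and the k parameter are mine), retrieval becomes one call that Step 5 can consume:

def retrieve_chunks(question, k: 5)
  query_vector = get_embedding(question)
  Document
    .order(Arel.sql("embedding <=> '[#{query_vector.join(',')}]'"))
    .limit(k)
end

top_chunks = retrieve_chunks(user_input)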

🧾 Step 5: Prompt Assembly and Generation

Concatenate the retrieved passages into the prompt, then hand everything to the model:

client = OpenAI::Client.new # ruby-openai gem

client.chat(
  parameters: {
    model: "gpt-4",
    messages: [
      { role: "system", content: "Answer using the provided context." },
      { role: "user", content: build_contextual_prompt(user_input, top_chunks) }
    ]
  }
)
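
build_contextual_prompt is plain string assembly. A minimal version, assuming top_chunks is the set of Document records from Step 4:

def build_contextual_prompt(question, top_chunks)
  context = top_chunks.map(&:content).join("\n---\n")

  <<~PROMPT
    Answer the question using only the context below.

    Context:
    #{context}

    Question: #{question}
  PROMPT
end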

✅ Why Nomic for Embeddings?

  • Open-source and multilingual
  • Runs locally → no token limits, no vendor lock-in
  • Solid benchmarks (MTEB) and practical retrieval quality

💡 Why Still Use OpenAI?

For me, the generation step is where OpenAI models shine. By decoupling the embedding layer, I get flexibility: I can swap LLMs later without rebuilding the pipeline.
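
In practice, that decoupling means the only code that knows about OpenAI is one small wrapper. A sketch (assuming the ruby-openai gem from Step 5); swapping models or providers touches a single method:

def generate_answer(prompt)
  client = OpenAI::Client.new

  response = client.chat(
    parameters: {
      model: "gpt-4", # swap the model (or the whole client) here
      messages: [
        { role: "system", content: "Answer using the provided context." },
        { role: "user", content: prompt }
      ]
    }
  )

  response.dig("choices", 0, "message", "content")
end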

🧠 Takeaways

  • RAG doesn’t need to be a heavyweight system.
  • Pairing open-source embeddings with OpenAI generation creates a powerful hybrid.
  • With Rails + PgVector, vector search feels like a natural extension of your existing app.

If you’re new to RAG, start small. Build the pipeline end-to-end with one document table and a FastAPI service. Don’t overengineer at first; once you see it work on a toy dataset, scaling to production is mostly a matter of indexing and monitoring.