Building a RAG System from Scratch with TypeScript (Part 1)

This is a RAG tutorial. My goal is for you to learn and understand the concepts and problems that RAG solves in a clear and simple way. Let's get into it!
Here's a real scenario: you've just been hired by a mid-size company that has 400 internal HR documents: vacation policies, benefits guides, compliance docs. Employees can never find anything, so they ping HR constantly with questions that are already answered somewhere in those files. The ask: build an internal tool where an employee types a question in plain English and gets a direct answer, sourced from the actual documents.
That's what RAG (retrieval-augmented generation) is for: giving an LLM grounded access to your data so it can answer questions it otherwise couldn't answer or, worse, would answer with a hallucination.
In this article we'll build that system from scratch, step by step, in TypeScript. No frameworks, no magic wrappers. By the end of Part 1 you'll be able to understand RAG and have a working retrieval pipeline. Part 2 wires in the generation and hardens it for real use.
Let's go.
Before We Start: What You Need
If you want to do the exercises yourself, you'll need two API keys and a local clone of the GitHub repo. If you'd rather just read, I kept the explanations deliberately simple so you can still get the gist from the article alone.
1. Clone the repo
All the code for this series lives here:
git clone https://github.com/garosan/rag-tutorial
cd rag-tutorial
npm install
2. OpenAI API Key (for embeddings)
We use OpenAI's text-embedding-3-small model to convert text into vectors. This is the only part of the project that uses OpenAI.
To get your key:
Go to platform.openai.com
Sign up or log in
Navigate to API Keys and create a new key
Add credit, or make sure your account already has a balance
For this tutorial, with our small set of policy documents, running everything end to end should cost less than $0.01.
3. Anthropic API Key (for generation)
We use Claude for the generation step in Part 2, but you'll need the key set up now.
To get your key:
Go to console.anthropic.com
Sign up or log in
Navigate to API Keys and create a new key
Add credit to your Anthropic account as well
Cost: We use claude-haiku-4-5, Anthropic's fastest and most affordable model, which is more than capable for this use case. Pricing is as low as $1/$5 per million input/output tokens, so for the handful of calls in this tutorial, expect to spend a few cents at most.
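To make "a few cents" concrete, here is a rough back-of-the-envelope calculation using the per-million-token prices quoted above. The number of calls and the token counts are made-up assumptions for illustration, not measurements from the repo.

```typescript
// Prices quoted above for claude-haiku-4-5, in USD per million tokens.
const INPUT_PRICE_PER_MTOK = 1;
const OUTPUT_PRICE_PER_MTOK = 5;

// Estimate the cost of a batch of calls from total token counts.
function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1e6) * INPUT_PRICE_PER_MTOK +
    (outputTokens / 1e6) * OUTPUT_PRICE_PER_MTOK
  );
}

// Hypothetical workload: 10 calls, each with ~2,000 input and ~300 output tokens.
const calls = 10;
const cost = estimateCostUSD(calls * 2000, calls * 300);
console.log(`Estimated cost: $${cost.toFixed(3)}`);
```

With those assumed numbers the total comes to about 3.5 cents, and most of it is input tokens, which is worth remembering when we get to the naive approach.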
4. Set up your environment
Copy the example env file and add your keys:
cp .env.example .env
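A quick check that both keys are actually set can save you a confusing API error later. This is a sketch, not code from the repo: it assumes the variables are named OPENAI_API_KEY and ANTHROPIC_API_KEY, which are the names the official OpenAI and Anthropic SDKs read by default. Check .env.example for the names the repo actually uses. The findMissingKeys helper is hypothetical.

```typescript
// Assumed variable names; confirm them against .env.example in the repo.
const REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"];

// Returns the names of required variables that are missing or blank.
function findMissingKeys(
  env: Record<string, string | undefined>,
  required: string[] = REQUIRED_KEYS
): string[] {
  return required.filter((key) => (env[key] ?? "").trim() === "");
}

// Example with a hand-written object; in a real script you would pass process.env.
const missing = findMissingKeys({ OPENAI_API_KEY: "sk-example", ANTHROPIC_API_KEY: "" });
if (missing.length > 0) {
  console.log(`Missing environment variables: ${missing.join(", ")}`);
}
```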
You're ready. Let's build.
How to Follow Along
Each step in this tutorial maps to a file in the repo. At the end of each section you'll see a command like:
npm run part1:naive
Run it, read the output, then keep reading. The code is meant to be executed, not just skimmed. Each file builds on the previous one, so by the time you reach the end of Part 1, you'll have a working retrieval system you built piece by piece.
The Setup: Meridian Health
To make this concrete, we're working with a fictional company called Meridian Health. They have four policy documents sitting in src/data/policies/:
vacation-policy.txt
remote-work-policy.txt
health-benefits.txt
code-of-conduct.txt
In a real system you might have hundreds or thousands of these. For now, four is enough to demonstrate exactly where things break.
Step 1: The Naive Approach
The most obvious thing to try: load all the documents and dump them into the prompt. If the LLM can read text, just give it all the text, right?
const files = fs.readdirSync(POLICIES_DIR).filter((f) => f.endsWith(".txt"));

// Concatenate every policy file into one big string, with a header per file
let allContent = "";
for (const file of files) {
  const content = fs.readFileSync(path.join(POLICIES_DIR, file), "utf-8");
  allContent += `\n\n--- ${file} ---\n${content}`;
}

// Send everything, plus the question, in a single prompt
const response = await anthropic.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: `Here are the policy documents: ${allContent}\n\nQuestion: ${question}`
  }]
});
Run it:
npm run part1:naive
It works. You ask "How many vacation days do I get after 3 years?" and Claude answers correctly, citing the right section.
So you think: great, I'll just load all 400 documents and do the same thing.
And then you hit a wall.
Every model has a hard limit called the context window: the maximum number of tokens it can process in a single request. If you exceed it, the API rejects the request entirely. But even when everything fits, stuffing irrelevant content into the prompt degrades answer quality, and you pay for every input token on every request. This is the core problem RAG solves.
You don't need to feed the whole library to the model. You need to find the relevant pages first, then pass only those.
That's the idea everything else builds on.
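To get a feel for when the naive approach stops fitting, you can estimate token counts before sending anything. The sketch below uses the common rule of thumb of roughly four characters per token for English text. That ratio is only an approximation, not the tokenizer the API uses, and the document size and the 200,000-token window are illustrative assumptions.

```typescript
// Rough heuristic: ~4 characters per token for English text (approximation only).
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// True if the prompt plus the room reserved for the reply fits in the context window.
function fitsInContext(
  promptTokens: number,
  contextWindow: number,
  reservedForOutput: number
): boolean {
  return promptTokens + reservedForOutput <= contextWindow;
}

// Hypothetical numbers: 400 documents of ~20,000 characters each.
const docCount = 400;
const charsPerDoc = 20000;
const promptTokens = docCount * Math.ceil(charsPerDoc / CHARS_PER_TOKEN);
console.log(promptTokens); // 2000000 estimated tokens
console.log(fitsInContext(promptTokens, 200000, 1024)); // false
```

Under these assumptions the full library is about ten times larger than the window, so no amount of prompt tweaking makes it fit.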
Step 2: Chunking
So we need to be selective about what we pass to the model. The way RAG does this is by splitting documents into smaller pieces called chunks. This is standard terminology you'll see everywhere in the RAG ecosystem.
A chunk is just a small section of text, typically a few hundred characters. Instead of passing an entire 10-page policy document to the model, you split it into 20 or 30 chunks and later retrieve only the ones relevant to the question being asked.
Two parameters matter when chunking:
chunkSize controls how big each piece is. Too small and you lose context (a chunk that just says "15 days" with no surrounding explanation is useless). Too large and you're back to the original problem.
overlap is the subtle one. Consecutive chunks share a small strip of text at their edges. Why? Because answers often live at chunk boundaries. If a policy says "employees in their first two years get 10 days, after year three this increases to 15 days" and you cut it right down the middle, neither chunk tells the full story. Overlap ensures both chunks carry enough surrounding context to be useful.
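Here is a minimal sketch of how those two parameters interact, using a plain character-based sliding window. It is not the repo's implementation, which may split differently (for example on sentence or paragraph boundaries); the chunkText name and the sizes used are illustrative.

```typescript
// Split text into fixed-size character chunks, where each chunk starts with
// the last `overlap` characters of the previous one.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be >= 0 and smaller than chunkSize");
  }
  const pieces: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    pieces.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return pieces;
}

const sample =
  "Employees in their first two years get 10 days, after year three this increases to 15 days.";
const sampleChunks = chunkText(sample, 40, 10);
sampleChunks.forEach((piece, i) => console.log(i, JSON.stringify(piece)));
```

Each chunk begins with the final 10 characters of the one before it, so a sentence that straddles a boundary shows up, at least partly, in both chunks.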
If you want a closer look at the implementation, the chunking function is in src/part1/02-chunking.ts. For now, let's run it:
npm run part1:chunking
You'll see each document split into several chunks, with a sample printed to the terminal. Four documents become a collection of individual, retrievable pieces.
But we still have a problem. We have chunks. We have a question. How do we know which chunks are relevant?
Step 3: Embeddings
This is where the "intelligence" in the retrieval pipeline comes from.
A keyword search would fail here. An employee might ask "Can I take time off around Christmas?" and the vacation policy never uses the phrase "time off around Christmas." It says "holiday blackout periods" and "Q4 restrictions." A keyword search finds nothing. A smarter search finds the right section immediately.
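To see the failure in code, here is a tiny keyword matcher that counts the words a query and a passage share. The passage is a made-up sentence built from the phrases mentioned above, not a quote from the repo's policy files, and the function names are illustrative.

```typescript
// Lowercase a string and pull out its words, ignoring punctuation.
function toWords(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Count how many distinct words the query and the passage have in common.
function keywordOverlap(query: string, passage: string): number {
  const passageWords = new Set(toWords(passage));
  const queryWords = new Set(toWords(query));
  let shared = 0;
  queryWords.forEach((word) => {
    if (passageWords.has(word)) shared++;
  });
  return shared;
}

const employeeQuestion = "Can I take time off around Christmas?";
// Hypothetical policy wording for illustration.
const policySentence =
  "Holiday blackout periods and Q4 restrictions apply to vacation requests.";
console.log(keywordOverlap(employeeQuestion, policySentence)); // 0
```

Zero shared words means a keyword ranker has no signal at all, even though a human reader would connect the two instantly.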
That smarter search is powered by embeddings.
An embedding is a list of numbers, a vector, that represents the meaning of a piece of text. OpenAI's text-embedding-3-small model converts any string into a list of 1536 numbers like [0.023141, -0.007829, 0.041205, -0.019342, ...]. The key property: text with similar meaning produces vectors that point in similar directions. "Time off around Christmas" and "Q4 holiday blackout periods" end up close together in that 1536-dimensional space, even though they share no words.
The function below calls the OpenAI API and returns that list of 1536 numbers for any text you pass in:
export async function embedText(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding; // 1536 numbers
}
You call this once for every chunk at "index time", store the results, and then call it again for each incoming query at "query time." The expensive work happens upfront. In a production system you'd persist these vectors to a database so you don't recompute them on every run.
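One simple way to avoid recomputing embeddings on every run is to serialize them at index time and load them back at query time. The sketch below shows just that round trip with JSON. The StoredChunk shape and both function names are assumptions for illustration (the repo's EmbeddedChunk type may look different), and the vectors are shortened to three numbers for readability.

```typescript
// Shape assumed for illustration; the repo's EmbeddedChunk type may differ.
interface StoredChunk {
  text: string;
  source: string;
  embedding: number[];
}

// Index time: turn the embedded chunks into a string that can be saved.
function serializeIndex(chunks: StoredChunk[]): string {
  return JSON.stringify({ model: "text-embedding-3-small", chunks });
}

// Query time: parse a saved index and check every vector has the same length.
function deserializeIndex(json: string): StoredChunk[] {
  const parsed = JSON.parse(json) as { model: string; chunks: StoredChunk[] };
  const dims = parsed.chunks[0]?.embedding.length ?? 0;
  for (const chunk of parsed.chunks) {
    if (chunk.embedding.length !== dims) {
      throw new Error(`Inconsistent embedding length in ${chunk.source}`);
    }
  }
  return parsed.chunks;
}

// Toy vectors with 3 numbers each; real ones would have 1536.
const saved = serializeIndex([
  { text: "Vacation accrues monthly.", source: "vacation-policy.txt", embedding: [0.1, 0.2, 0.3] },
  { text: "Remote work needs approval.", source: "remote-work-policy.txt", embedding: [0.3, 0.1, 0.0] },
]);
const restored = deserializeIndex(saved);
console.log(restored.length, restored[0].source);
```

In a Node script the string could be written with fs.writeFileSync and read back with fs.readFileSync. Recording the model name alongside the vectors matters because embeddings from different models are not comparable.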
Run it:
npm run part1:embeddings
You'll see a progress counter as each chunk gets embedded, and then a sample of what an embedding actually looks like: an array of floating point numbers that means nothing to a human but everything to a similarity search.
Step 4: The Vector Store
Now we have chunks with embeddings. We need somewhere to store them and a way to search them.
In production you'd use a dedicated vector database like Pinecone, Chroma, or pgvector. For this tutorial, a plain in-memory array works perfectly and keeps the focus on the concepts rather than infrastructure setup.
export class VectorStore {
  private entries: EmbeddedChunk[] = [];

  add(chunks: EmbeddedChunk[]): void {
    this.entries.push(...chunks);
  }

  search(queryEmbedding: number[], topK: number = 3) {
    return this.entries
      .map((entry) => ({
        ...entry,
        score: this.cosineSimilarity(queryEmbedding, entry.embedding),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }

  private cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0, magA = 0, magB = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      magA += a[i] * a[i];
      magB += b[i] * b[i];
    }
    return dot / (Math.sqrt(magA) * Math.sqrt(magB));
  }
}
The add method is straightforward: it stores all our embedded chunks in memory. The search method is where the interesting part happens. It takes a query embedding, scores every single chunk against it, sorts by score, and returns the top topK results.
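The cosineSimilarity function is the core of the whole retrieval system. It computes the cosine of the angle between two vectors, ignoring their lengths. A score of 1.0 means the vectors point in exactly the same direction, which for embeddings means the texts are about as close in meaning as they can be. A score near 0.0 means they're essentially unrelated, and the score can go as low as -1.0 for vectors pointing in opposite directions. The closer a chunk's score is to 1.0, the more likely it is to be relevant to the query.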
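A few tiny two-dimensional vectors make the scoring easier to picture. The standalone function below follows the same formula as the class method, with two guards added as my own assumptions: it rejects vectors of different lengths and returns 0 for an all-zero vector instead of dividing by zero.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  const denominator = Math.sqrt(magA) * Math.sqrt(magB);
  // Added guard: a zero vector has no direction, so treat it as "no similarity".
  return denominator === 0 ? 0 : dot / denominator;
}

console.log(cosineSimilarity([3, 4], [6, 8]));  // 1  (same direction, different length)
console.log(cosineSimilarity([1, 0], [0, 1]));  // 0  (perpendicular)
console.log(cosineSimilarity([1, 0], [-1, 0])); // -1 (opposite direction)
```

Note that [3, 4] and [6, 8] score a perfect 1 even though one is twice as long: cosine similarity only cares about direction, which is why it works well for comparing embeddings.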
Run it:
npm run part1:vector-store
When you run this, you'll see the index being built (same chunking and embedding process as before), and then the actual search results for the query "How many vacation days do I get after 3 years?". You'll see the top 3 chunks returned, each with a similarity score and a short preview of the chunk text. Keep in mind this is just a preview: the full chunk contains more context than what's displayed.
Step 5: Putting It Together
The last step of Part 1 wires everything into a clean retrieve() function that you can call with any question.
async function retrieve(query: string, topK = 3) {
  // Build the index (rebuilt on every call here, to keep the example simple)
  const chunks = loadAndChunkAllDocs();
  const embeddedChunks = await embedAllChunks(chunks);
  const store = new VectorStore();
  store.add(embeddedChunks);

  // Embed the query and search
  const queryEmbedding = await embedText(query);
  return store.search(queryEmbedding, topK);
}
In a real system you'd separate the index-building step (done once, on document ingestion) from the query step (done on every user request). Here they're combined for simplicity.
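Here is one way that separation could look, as a sketch rather than the repo's code. The names buildIndex, retrieveFrom, and Embedder are hypothetical. To keep the example runnable without an API key, it uses a toy stand-in embedder that just checks for a few hand-picked words; in real use you would pass embedText instead.

```typescript
type Embedder = (text: string) => Promise<number[]>;

interface IndexedChunk {
  text: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  const denominator = Math.sqrt(magA) * Math.sqrt(magB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Index time: embed every chunk once and keep the results.
async function buildIndex(texts: string[], embed: Embedder): Promise<IndexedChunk[]> {
  const index: IndexedChunk[] = [];
  for (const text of texts) {
    index.push({ text, embedding: await embed(text) });
  }
  return index;
}

// Query time: embed only the question, then rank the stored chunks.
async function retrieveFrom(index: IndexedChunk[], query: string, embed: Embedder, topK = 3) {
  const queryEmbedding = await embed(query);
  return index
    .map((chunk) => ({ text: chunk.text, score: cosine(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Toy stand-in for a real embedding model: one dimension per hand-picked word.
const VOCAB = ["vacation", "remote", "health"];
const fakeEmbed: Embedder = async (text) =>
  VOCAB.map((word) => (text.toLowerCase().includes(word) ? 1 : 0));

async function demo(): Promise<string> {
  const index = await buildIndex(
    [
      "Vacation days accrue monthly.",
      "Remote work requires manager approval.",
      "Health plans renew in January.",
    ],
    fakeEmbed
  );
  const results = await retrieveFrom(index, "How much vacation do I get?", fakeEmbed, 1);
  return results[0].text;
}

demo().then((top) => console.log(top));
```

With this split, the index is built once and each question costs a single embedding call rather than one call per chunk.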
Run it:
npm run part1:retrieval
Try the three sample questions in the file. For each one, you'll see which chunk was retrieved, which document it came from, and its similarity score. The system finds the right information without knowing anything about keywords, document structure, or which file to look in.
What We've Built
Now you can explain to anyone how a retrieval pipeline works and implement it yourself:
First, you split your documents into small overlapping chunks so they can be retrieved individually
Then you convert each chunk into a vector using OpenAI embeddings, a numerical representation of its meaning
You store those vectors and use cosine similarity to score how relevant each chunk is to a given question
Finally, any plain-English query returns the most semantically relevant chunks, no keywords needed
Great work getting here. In the second part of this tutorial we will add the G in RAG, the generation part: we will take those retrieved chunks, pass them to Claude with a carefully designed prompt, and get grounded, accurate answers back. We will also look at what breaks in a real system and how to fix it.
Follow me on X or LinkedIn for Part 2, and let me know what you thought of this tutorial. Any feedback is useful. See you very soon!




