RAG Pipeline
1
Embed Query
all-MiniLM-L6-v2 (384-dim)
Question is encoded into a 384-dimensional dense vector via the LLM service embedding endpoint
~60ms
2
Vector Search
MongoDB cosine similarity
Retrieve top_k × 4 candidate chunks from MongoDB using cosine distance over pre-computed embeddings
~10ms
3
Cross-Encoder Rerank
BAAI/bge-reranker-v2-m3
Score each (question, chunk) pair with a cross-encoder model on GPU, then return the top_k most relevant results
~180ms
4
Generate Answer
Qwen3-32B via vLLM
Retrieved chunks are injected as context into a system prompt, then the LLM generates a grounded answer via streaming completions
~9s
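Steps 2 and 3 form a two-stage retrieval: a cheap cosine search over-fetches top_k × 4 candidates, then a slower pairwise scorer keeps the best top_k. The sketch below illustrates only that control flow. It assumes in-memory chunks and uses a toy word-overlap scorer as a stand-in for BAAI/bge-reranker-v2-m3; the function names, the chunk dictionary layout, and the 2-dimensional vectors are hypothetical and not taken from the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_rerank_score(question, text):
    """Stand-in for a cross-encoder: counts shared lowercase words."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def retrieve_then_rerank(query_vec, question, chunks, rerank_score,
                         top_k=5, overfetch=4):
    # Stage 1: vector search keeps top_k * overfetch candidates.
    candidates = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )[: top_k * overfetch]
    # Stage 2: rescore each (question, chunk) pair and keep the top_k.
    reranked = sorted(
        candidates,
        key=lambda c: rerank_score(question, c["text"]),
        reverse=True,
    )
    return reranked[:top_k]


chunks = [
    {"text": "Espresso originated in Italy.", "embedding": [1.0, 0.0]},
    {"text": "A latte is espresso with steamed milk.", "embedding": [0.9, 0.1]},
    {"text": "French press uses a plunger.", "embedding": [0.0, 1.0]},
]

# The espresso chunk wins the vector search, but the reranker promotes the latte chunk.
top = retrieve_then_rerank([1.0, 0.0], "what is a latte", chunks,
                           toy_rerank_score, top_k=1, overfetch=2)
print(top[0]["text"])
```

Over-fetching gives the reranker room to promote a chunk that the embedding search ranked lower, at the cost of scoring four times as many pairs.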
1.000
MRR
1.000
Hit@1
0.994
NDCG@5
0.925
Keyword Recall
How Each Metric Is Calculated
MRR (Mean Reciprocal Rank)
1.000
score = 1 / rank_of_correct_source
MRR = avg(scores)
For each query, find where the expected source appears in the ranked results. If it's rank 1, score = 1/1 = 1.0. If rank 3, score = 1/3 = 0.33. Average across all queries.
All 20 queries returned the correct source at rank #1, so each scored 1/1 = 1.0. Average = 1.000
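The MRR calculation can be sketched in a few lines of Python. This is an illustrative version, not the code in scripts/evaluate.py: the function names are hypothetical, and scoring a query as 0 when the expected source never appears is an assumed convention.

```python
def reciprocal_rank(ranked_sources, expected_source):
    """Return 1 / rank of the first match, or 0.0 if the source is absent."""
    for rank, source in enumerate(ranked_sources, start=1):
        if source == expected_source:
            return 1.0 / rank
    return 0.0  # assumed convention for a miss


def mean_reciprocal_rank(results):
    """results: list of (ranked_sources, expected_source) pairs."""
    scores = [reciprocal_rank(ranked, expected) for ranked, expected in results]
    return sum(scores) / len(scores)


example = [
    (["Espresso", "Latte"], "Espresso"),          # rank 1 -> 1.0
    (["Latte", "Mocha", "Cortado"], "Cortado"),   # rank 3 -> 1/3
]
mrr = mean_reciprocal_rank(example)
print(round(mrr, 3))
```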
Hit@1
1.000
hit = 1 if correct_source in top_1 else 0
Hit@1 = avg(hits)
Binary check: did the correct source appear as the #1 result? 1 = yes, 0 = no. Average across all queries gives the fraction of "perfect" retrievals.
20 out of 20 queries had the correct source as their top result. 20/20 = 1.000
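A minimal Python sketch of the Hit@1 check, with hypothetical function names and the same (ranked_sources, expected_source) input shape assumed purely for illustration:

```python
def hit_at_1(ranked_sources, expected_source):
    """1 if the expected source is the top result, else 0."""
    return 1 if ranked_sources[:1] == [expected_source] else 0


def mean_hit_at_1(results):
    """results: list of (ranked_sources, expected_source) pairs."""
    hits = [hit_at_1(ranked, expected) for ranked, expected in results]
    return sum(hits) / len(hits)


example = [
    (["Espresso", "Latte"], "Espresso"),   # hit
    (["Latte", "Cortado"], "Cortado"),     # correct source at rank 2: miss
]
hit_rate = mean_hit_at_1(example)
print(hit_rate)
```

Unlike MRR, a correct source at rank 2 earns nothing here, so Hit@1 is the stricter of the two metrics.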
NDCG@5 (Normalized Discounted Cumulative Gain)
0.994
DCG@k = Σ rel(i) / log₂(i+1), summed over positions i = 1..k
NDCG@k = DCG@k / ideal_DCG@k
Measures ranking quality of the top 5 results. Each relevant document contributes a score, but documents at lower positions are penalized logarithmically. Normalized against the ideal (all relevant docs at the top).
The correct source always appears at rank #1. However, when the same source has multiple chunks in the top 5, the ideal ranking would group all matching chunks first. A few queries had matching chunks at positions 3-5 instead of 2-3, slightly reducing the average score to 0.994.
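The effect of a matching chunk slipping to a lower position can be reproduced with a short Python sketch. It assumes binary relevance (1 if a chunk comes from the expected source, 0 otherwise) and computes the ideal DCG by sorting the retrieved list's own relevances; both are assumptions for illustration, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the first k positions (1-indexed)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=5):
    """DCG divided by the DCG of the same relevances sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


grouped = ndcg_at_k([1, 1, 0, 0, 0])  # matching chunks at ranks 1 and 2
spread = ndcg_at_k([1, 0, 1, 0, 0])   # second matching chunk at rank 3
print(grouped, round(spread, 4))
```

With the second matching chunk at rank 3 instead of rank 2, the query scores about 0.92 rather than 1.0, even though rank 1 is still correct.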
Keyword Recall
0.925
recall = keywords_found_in_answer / total_expected_keywords
Keyword Recall = avg(recall) across queries
For each query, check how many of the expected keywords appear (case-insensitive) in the LLM-generated answer. A query expecting ["espresso", "milk"] where the answer contains "espresso" but not "milk" scores 1/2 = 0.50.
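The suite has 43 expected keywords across 20 queries, and most were found in the generated answers. Roughly 3 queries each missed 1 keyword (the LLM phrased it differently or omitted it). For example, three two-keyword queries scoring 0.5 each would give (17 × 1.0 + 3 × 0.5) / 20 = 0.925.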
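A minimal Python sketch of the keyword check, assuming a plain case-insensitive substring match (the exact matching rule used by the evaluation script is not stated here, and the function name is hypothetical):

```python
def keyword_recall(answer, expected_keywords):
    """Fraction of expected keywords found in the answer, ignoring case."""
    text = answer.lower()
    found = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return found / len(expected_keywords)


# The example from the description: "espresso" present, "milk" absent.
partial = keyword_recall("A cortado is ESPRESSO cut with warm dairy.",
                         ["espresso", "milk"])

# One breakdown consistent with the reported average: 17 full scores, 3 halves.
per_query = [1.0] * 17 + [0.5] * 3
average = sum(per_query) / len(per_query)
print(partial, average)
```

Because the score is averaged per query rather than per keyword, a missed keyword costs more on a two-keyword query than on a three-keyword one.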
Ground-Truth Test Suite
20 pairs across 4 categories
| # | Category | Question | Expected Source | Expected Keywords | Hit@1 |
|---|---|---|---|---|---|
| 1 | Origins | Where did espresso originate? | Espresso | Italy, Italian | ✓ |
| 2 | Origins | What is the history of Turkish coffee? | Turkish Coffee | Ottoman, Turkey | ✓ |
| 3 | Origins | Where does Vietnamese coffee come from? | Vietnamese Coffee | Vietnam, condensed milk | ✓ |
| 4 | Origins | What is the origin of Dalgona coffee? | Dalgona Coffee | Korea, whipped | ✓ |
| 5 | Origins | Where was the flat white invented? | Flat White | Australia, New Zealand | ✓ |
| 6 | Preparation | How do you make a cappuccino? | Cappuccino | espresso, steamed milk, foam | ✓ |
| 7 | Preparation | How does a French press work? | French Press | plunger, steep | ✓ |
| 8 | Preparation | How do you use an AeroPress? | AeroPress | pressure, filter | ✓ |
| 9 | Preparation | How does a Moka pot brew coffee? | Moka Pot | stovetop, steam | ✓ |
| 10 | Preparation | How is iced coffee different from cold brew? | Iced Coffee | ice, cold | ✓ |
| 11 | Composition | What is a latte made of? | Latte | espresso, milk | ✓ |
| 12 | Composition | What ingredients are in a mocha? | Mocha | espresso, chocolate | ✓ |
| 13 | Composition | What is an affogato? | Affogato | espresso, gelato | ✓ |
| 14 | Composition | What makes Irish coffee special? | Irish Coffee | whiskey, cream | ✓ |
| 15 | Composition | What is a cortado? | Cortado | espresso, milk | ✓ |
| 16 | Variations | What is a ristretto? | Ristretto | short, espresso | ✓ |
| 17 | Variations | What is a macchiato? | Macchiato | espresso, milk | ✓ |
| 18 | Variations | What is a Frappuccino? | Frappuccino | Starbucks, blended | ✓ |
| 19 | Variations | What is a cafe cubano? | Cafe Cubano | sugar, espresso, Cuban | ✓ |
| 20 | Variations | What is a lungo? | Lungo | long, water, espresso | ✓ |
Evaluated with python scripts/evaluate.py --top-k 5 against the live pipeline. All 20 queries returned the correct source at rank #1.
Prometheus Metrics
Total Queries
-
Errors
-
Avg Embedding Latency
-
Avg Retrieval Latency
-
Avg Rerank Latency
-
Avg Generation Latency
-
Avg Docs Retrieved
-
Sources
-
Total Chunks
-
Embedding Model
all-MiniLM-L6-v2
Chunk Strategy
Structure-aware (see below)
Vector Dimensions
384
Data Source
Wikipedia articles
Chunking Strategy
Documents are split using a structure-aware pipeline that preserves semantic boundaries, rather than a naive fixed-window approach.
1. Section splitting: Wikipedia == headers == are detected to split the article into logical sections (e.g. History, Preparation, Variations).
2. Sentence-boundary chunking: Large sections are split at sentence boundaries (never mid-sentence) into chunks of up to 500 tokens (cl100k_base tokenizer). Small sections stay as one chunk.
3. Heading prepend: Each chunk is prefixed with its section heading so every chunk is self-contained and the embedding captures the topic context.
4. Fragment merging: Orphan paragraphs without a heading are merged into the previous section to avoid tiny, context-less chunks.
No sliding-window overlap is used; section boundaries provide natural context instead.
Model context limit: 512 tokens (all-MiniLM-L6-v2), with a 500-token chunk cap as safety buffer.
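The chunking steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the project's implementation: token counts use whitespace-separated words as a stand-in for the cl100k_base tokenizer, sentences are split with a naive regex, text before the first header is filed under a hypothetical "Introduction" heading, and all function names are invented for the example.

```python
import re

MAX_TOKENS = 500  # chunk cap from the description


def count_tokens(text):
    # Assumption: whitespace words approximate cl100k_base token counts.
    return len(text.split())


def split_sections(article):
    """Split on '== Heading ==' lines and return (heading, body) pairs."""
    sections, heading, lines = [], "Introduction", []
    for line in article.splitlines():
        match = re.fullmatch(r"\s*==+\s*(.+?)\s*==+\s*", line)
        if match:
            sections.append((heading, " ".join(lines)))
            heading, lines = match.group(1), []
        elif line.strip():
            # Paragraphs without their own heading stay with the current section.
            lines.append(line.strip())
    sections.append((heading, " ".join(lines)))
    return [(h, body) for h, body in sections if body]


def chunk_section(heading, body, max_tokens=MAX_TOKENS):
    """Pack whole sentences into chunks, then prepend the heading to each."""
    chunks, current = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", body):
        candidate = " ".join(current + [sentence])
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return [f"{heading}\n{chunk}" for chunk in chunks]


def chunk_article(article, max_tokens=MAX_TOKENS):
    return [chunk
            for heading, body in split_sections(article)
            for chunk in chunk_section(heading, body, max_tokens)]


article = ("Coffee intro sentence.\n== History ==\nFirst. Second. Third.\n"
           "== Preparation ==\nGrind beans.")
pieces = chunk_article(article, max_tokens=2)
print(pieces)
```

In this sketch the cap applies to the body text only, so the prepended heading adds a few tokens on top, and a single sentence longer than the cap is kept whole rather than cut mid-sentence.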
Documents