Coffee RAG - DAT Workspace

Embed

Retrieve

Rerank

Generate

Try:

RAG Pipeline

Embed Query

all-MiniLM-L6-v2 (384-dim)

Question is encoded into a 384-dimensional dense vector via the LLM service embedding endpoint

~60ms

Vector Search

MongoDB cosine similarity

Retrieve top_k × 4 candidate chunks from MongoDB using cosine distance over pre-computed embeddings

~10ms

Cross-Encoder Rerank

BAAI/bge-reranker-v2-m3

Score each (question, chunk) pair with a cross-encoder model on GPU, then return the top_k most relevant results

~180ms

Generate Answer

Qwen3-32B via vLLM

Retrieved chunks are injected as context into a system prompt, then the LLM generates a grounded answer via streaming completions

~9s

1.000

MRR

1.000

Hit@1

0.994

NDCG@5

0.925

Keyword Recall

How Each Metric Is Calculated

MRR (Mean Reciprocal Rank) 1.000

score = 1 / rank_of_correct_source MRR = avg(scores)

For each query, find where the expected source appears in the ranked results. If it's rank 1, score = 1/1 = 1.0. If rank 3, score = 1/3 = 0.33. Average across all queries.

All 20 queries returned the correct source at rank #1, so each scored 1/1 = 1.0. Average = 1.000

Hit@1 1.000

hit = 1 if correct_source in top_1 else 0 Hit@1 = avg(hits)

Binary check: did the correct source appear as the #1 result? 1 = yes, 0 = no. Average across all queries gives the fraction of "perfect" retrievals.

20 out of 20 queries had the correct source as their top result. 20/20 = 1.000

NDCG@5 (Normalized Discounted Cumulative Gain) 0.994

DCG@k = Σ rel(i) / log₂(i+1) NDCG = DCG / ideal_DCG

Measures ranking quality of the top 5 results. Each relevant document contributes a score, but documents at lower positions are penalized logarithmically. Normalized against the ideal (all relevant docs at the top).

The correct source always appears at rank #1. However, when the same source has multiple chunks in the top 5, the ideal ranking would group all matching chunks first. A few queries had matching chunks at positions 3-5 instead of 2-3, slightly reducing the score to 0.994

Keyword Recall 0.925

recall = keywords_found_in_answer / total_expected_keywords avg across queries

For each query, check how many of the expected keywords appear (case-insensitive) in the LLM-generated answer. A query expecting ["espresso", "milk"] where the answer contains "espresso" but not "milk" scores 1/2 = 0.50.

43 keywords across 20 queries. Most were found in the generated answers. Roughly 3 queries each missed 1 keyword (LLM phrased it differently or omitted it), giving an average of 0.925

Ground-Truth Test Suite 20 pairs across 4 categories

#	Category	Question	Expected Source	Expected Keywords	Hit@1
1	Origins	Where did espresso originate?	Espresso	`Italy` `Italian`	✓
2	Origins	What is the history of Turkish coffee?	Turkish Coffee	`Ottoman` `Turkey`	✓
3	Origins	Where does Vietnamese coffee come from?	Vietnamese Coffee	`Vietnam` `condensed milk`	✓
4	Origins	What is the origin of Dalgona coffee?	Dalgona Coffee	`Korea` `whipped`	✓
5	Origins	Where was the flat white invented?	Flat White	`Australia` `New Zealand`	✓
6	Preparation	How do you make a cappuccino?	Cappuccino	`espresso` `steamed milk` `foam`	✓
7	Preparation	How does a French press work?	French Press	`plunger` `steep`	✓
8	Preparation	How do you use an AeroPress?	AeroPress	`pressure` `filter`	✓
9	Preparation	How does a Moka pot brew coffee?	Moka Pot	`stovetop` `steam`	✓
10	Preparation	How is iced coffee different from cold brew?	Iced Coffee	`ice` `cold`	✓
11	Composition	What is a latte made of?	Latte	`espresso` `milk`	✓
12	Composition	What ingredients are in a mocha?	Mocha	`espresso` `chocolate`	✓
13	Composition	What is an affogato?	Affogato	`espresso` `gelato`	✓
14	Composition	What makes Irish coffee special?	Irish Coffee	`whiskey` `cream`	✓
15	Composition	What is a cortado?	Cortado	`espresso` `milk`	✓
16	Variations	What is a ristretto?	Ristretto	`short` `espresso`	✓
17	Variations	What is a macchiato?	Macchiato	`espresso` `milk`	✓
18	Variations	What is a Frappuccino?	Frappuccino	`Starbucks` `blended`	✓
19	Variations	What is a cafe cubano?	Cafe Cubano	`sugar` `espresso` `Cuban`	✓
20	Variations	What is a lungo?	Lungo	`long` `water` `espresso`	✓

Evaluated with python scripts/evaluate.py --top-k 5 against the live pipeline. All 20 queries returned the correct source at rank #1.

Prometheus Metrics Fetching...

Total Queries -

Errors -

Avg Embedding Latency -

Avg Retrieval Latency -

Avg Rerank Latency -

Avg Generation Latency -

Avg Docs Retrieved -

Sources -

Total Chunks -

Embedding Model all-MiniLM-L6-v2

Chunk Strategy Structure-aware (see below)

Vector Dimensions 384

Data Source Wikipedia articles

Chunking Strategy

Documents are split using a structure-aware pipeline that preserves semantic boundaries, rather than a naive fixed-window approach.

Section splitting — Wikipedia == headers == are detected to split the article into logical sections (e.g. History, Preparation, Variations).

Sentence-boundary chunking — Large sections are split at sentence boundaries (never mid-sentence) into chunks of up to 500 tokens (cl100k_base tokenizer). Small sections stay as one chunk.

Heading prepend — Each chunk is prefixed with its section heading so every chunk is self-contained and the embedding captures the topic context.

Fragment merging — Orphan paragraphs without a heading are merged into the previous section to avoid tiny, context-less chunks.

No sliding-window overlap is used — section boundaries provide natural context instead. Model context limit: 512 tokens (all-MiniLM-L6-v2), with a 500-token chunk cap as safety buffer.

Documents