Coffee Knowledge Base

RAG Pipeline Demo
Embed → Retrieve → Rerank → Generate
RAG Pipeline
1. Embed Query: all-MiniLM-L6-v2 (384-dim), ~60ms
The question is encoded into a 384-dimensional dense vector via the LLM service embedding endpoint.
2. Vector Search: MongoDB cosine similarity, ~10ms
The top_k × 4 candidate chunks are retrieved from MongoDB, ranked by cosine similarity against pre-computed embeddings.
3. Cross-Encoder Rerank: BAAI/bge-reranker-v2-m3, ~180ms
Each (question, chunk) pair is scored with a cross-encoder model on GPU, and the top_k most relevant results are returned.
4. Generate Answer: Qwen3-32B via vLLM, ~9s
Retrieved chunks are injected as context into a system prompt, and the LLM generates a grounded answer via streaming completions.
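The retrieve-then-rerank handoff in steps 2 and 3 can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the in-memory chunk list stands in for MongoDB, the 2-dimensional vectors stand in for 384-dimensional embeddings, and overlap_score is a hypothetical word-overlap scorer standing in for the BAAI/bge-reranker-v2-m3 cross-encoder. All function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_candidates(query_vec, chunks, top_k, oversample=4):
    """Step 2: keep the top_k * oversample chunks by cosine similarity."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[: top_k * oversample]


def rerank(question, candidates, score_pair, top_k):
    """Step 3: rescore each (question, chunk) pair and keep the best top_k."""
    ranked = sorted(
        candidates,
        key=lambda c: score_pair(question, c["text"]),
        reverse=True,
    )
    return ranked[:top_k]


def overlap_score(question, text):
    """Hypothetical stand-in for a cross-encoder: number of shared words."""
    return len(set(question.lower().split()) & set(text.lower().split()))


# Toy data (assumed): 2-dim vectors instead of 384-dim embeddings.
chunks = [
    {"source": "Espresso", "text": "Espresso originated in Italy", "embedding": [1.0, 0.0]},
    {"source": "Latte", "text": "A latte is espresso with steamed milk", "embedding": [0.8, 0.6]},
    {"source": "French Press", "text": "A plunger presses the grounds", "embedding": [0.0, 1.0]},
]

candidates = retrieve_candidates([0.8, 0.5], chunks, top_k=1, oversample=2)
best = rerank("where in italy did espresso originate", candidates, overlap_score, top_k=1)
print([c["source"] for c in candidates], "->", best[0]["source"])
```

In this toy run the vector search ranks the Latte chunk ahead of the Espresso chunk, and the rerank step promotes Espresso to the top. Oversampling exists for exactly this case: a correct chunk that the first stage ranks below top_k still reaches the reranker.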
MRR: 1.000
Hit@1: 1.000
NDCG@5: 0.994
Keyword Recall: 0.925
How Each Metric Is Calculated
MRR (Mean Reciprocal Rank) 1.000
score = 1 / rank_of_correct_source    MRR = avg(scores)
For each query, find where the expected source appears in the ranked results. If it's rank 1, score = 1/1 = 1.0. If rank 3, score = 1/3 = 0.33. Average across all queries.
All 20 queries returned the correct source at rank #1, so each scored 1/1 = 1.0. Average = 1.000
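As a minimal sketch of the MRR calculation described above: the function names and the (ranked_sources, expected_source) input shape are assumptions for illustration and are not taken from scripts/evaluate.py.

```python
def reciprocal_rank(ranked_sources, expected_source):
    """Return 1 / rank of the first match, or 0.0 if the source is absent."""
    for rank, source in enumerate(ranked_sources, start=1):
        if source == expected_source:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """Average the reciprocal rank over (ranked_sources, expected_source) pairs."""
    scores = [reciprocal_rank(ranked, expected) for ranked, expected in results]
    return sum(scores) / len(scores)


# Two illustrative queries: one hit at rank 1, one hit at rank 3.
results = [
    (["Espresso", "Latte", "Mocha"], "Espresso"),
    (["Latte", "Mocha", "Cortado"], "Cortado"),
]
mrr = mean_reciprocal_rank(results)
print(round(mrr, 3))  # (1/1 + 1/3) / 2
```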
Hit@1 1.000
hit = 1 if correct_source in top_1 else 0    Hit@1 = avg(hits)
Binary check: did the correct source appear as the #1 result? 1 = yes, 0 = no. Average across all queries gives the fraction of "perfect" retrievals.
20 out of 20 queries had the correct source as their top result. 20/20 = 1.000
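The binary check can be sketched the same way. This is an illustrative version with assumed names, generalized to Hit@k with k defaulting to 1.

```python
def hit_at_k(ranked_sources, expected_source, k=1):
    """Return 1 if the expected source is among the first k results, else 0."""
    return 1 if expected_source in ranked_sources[:k] else 0


def mean_hit_at_k(results, k=1):
    """Fraction of queries whose expected source appears in the top k."""
    hits = [hit_at_k(ranked, expected, k) for ranked, expected in results]
    return sum(hits) / len(hits)


# Illustrative data: the second query only finds its source at rank 2.
results = [
    (["Espresso", "Latte"], "Espresso"),
    (["Latte", "Cortado"], "Cortado"),
]
hit1 = mean_hit_at_k(results, k=1)
hit2 = mean_hit_at_k(results, k=2)
print(hit1, hit2)
```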
NDCG@5 (Normalized Discounted Cumulative Gain) 0.994
DCG@k = Σ rel(i) / log₂(i+1)    NDCG = DCG / ideal_DCG
Measures ranking quality of the top 5 results. Each relevant document contributes a score, but documents at lower positions are penalized logarithmically. Normalized against the ideal (all relevant docs at the top).
The correct source always appears at rank #1. However, when the same source has multiple chunks in the top 5, the ideal ranking would group all matching chunks first. A few queries had their additional matching chunks at positions 3-5 instead of 2-3, slightly reducing the score to 0.994.
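The formula above can be sketched as follows. This version assumes binary relevance (1 if a chunk comes from the expected source, 0 otherwise) and computes the ideal DCG by sorting the retrieved relevances; the actual evaluation script may define either differently.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@k = sum of rel(i) / log2(i + 1) over positions i = 1..k."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """Normalize DCG@k by the DCG of the same relevances in ideal order."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg


# Matching chunks at ranks 1 and 3: correct at the top, but not grouped.
split_ranking = ndcg_at_k([1, 0, 1, 0, 0], k=5)
# Matching chunks at ranks 1 and 2: already the ideal order.
grouped_ranking = ndcg_at_k([1, 1, 0, 0, 0], k=5)
print(round(split_ranking, 3), grouped_ranking)
```

The first case mirrors the situation described above: the correct source is at rank #1, but a second matching chunk sitting below a non-matching one pulls NDCG under 1.0.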
Keyword Recall 0.925
recall = keywords_found_in_answer / total_expected_keywords    avg across queries
For each query, check how many of the expected keywords appear (case-insensitive) in the LLM-generated answer. A query expecting ["espresso", "milk"] where the answer contains "espresso" but not "milk" scores 1/2 = 0.50.
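43 keywords across 20 queries. Most were found in the generated answers. Roughly 3 queries each missed 1 keyword (the LLM phrased it differently or omitted it), giving an average of 0.925. This is consistent with three two-keyword queries scoring 0.5 each: (17 × 1.0 + 3 × 0.5) / 20 = 0.925.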
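A minimal sketch of the per-query recall, assuming a case-insensitive substring match (the matching rule and function name are assumptions; the evaluation script may tokenize differently):

```python
def keyword_recall(answer, expected_keywords):
    """Fraction of expected keywords found (case-insensitive) in the answer."""
    if not expected_keywords:
        return 0.0  # assumption: a query with no keywords contributes nothing
    text = answer.lower()
    found = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return found / len(expected_keywords)


# The example from the text: "espresso" is present, "milk" is not.
partial = keyword_recall(
    "A latte combines Espresso with a dairy base.", ["espresso", "milk"]
)
full = keyword_recall(
    "A latte is espresso topped with steamed milk.", ["espresso", "milk"]
)
print(partial, full)
```

Substring matching is lenient: "steam" would also count as found inside "steamed". A word-boundary match would be stricter.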
Ground-Truth Test Suite (20 pairs across 4 categories)
# | Category | Question | Expected Source | Expected Keywords | Hit@1
1 | Origins | Where did espresso originate? | Espresso | Italy, Italian | 1
2 | Origins | What is the history of Turkish coffee? | Turkish Coffee | Ottoman, Turkey | 1
3 | Origins | Where does Vietnamese coffee come from? | Vietnamese Coffee | Vietnam, condensed milk | 1
4 | Origins | What is the origin of Dalgona coffee? | Dalgona Coffee | Korea, whipped | 1
5 | Origins | Where was the flat white invented? | Flat White | Australia, New Zealand | 1
6 | Preparation | How do you make a cappuccino? | Cappuccino | espresso, steamed milk, foam | 1
7 | Preparation | How does a French press work? | French Press | plunger, steep | 1
8 | Preparation | How do you use an AeroPress? | AeroPress | pressure, filter | 1
9 | Preparation | How does a Moka pot brew coffee? | Moka Pot | stovetop, steam | 1
10 | Preparation | How is iced coffee different from cold brew? | Iced Coffee | ice, cold | 1
11 | Composition | What is a latte made of? | Latte | espresso, milk | 1
12 | Composition | What ingredients are in a mocha? | Mocha | espresso, chocolate | 1
13 | Composition | What is an affogato? | Affogato | espresso, gelato | 1
14 | Composition | What makes Irish coffee special? | Irish Coffee | whiskey, cream | 1
15 | Composition | What is a cortado? | Cortado | espresso, milk | 1
16 | Variations | What is a ristretto? | Ristretto | short, espresso | 1
17 | Variations | What is a macchiato? | Macchiato | espresso, milk | 1
18 | Variations | What is a Frappuccino? | Frappuccino | Starbucks, blended | 1
19 | Variations | What is a cafe cubano? | Cafe Cubano | sugar, espresso, Cuban | 1
20 | Variations | What is a lungo? | Lungo | long, water, espresso | 1
Evaluated with python scripts/evaluate.py --top-k 5 against the live pipeline. All 20 queries returned the correct source at rank #1.
Prometheus Metrics
Total Queries -
Errors -
Avg Embedding Latency -
Avg Retrieval Latency -
Avg Rerank Latency -
Avg Generation Latency -
Avg Docs Retrieved -
Sources -
Total Chunks -
Embedding Model: all-MiniLM-L6-v2
Chunk Strategy: Structure-aware (see below)
Vector Dimensions: 384
Data Source: Wikipedia articles
Chunking Strategy
Documents are split using a structure-aware pipeline that preserves semantic boundaries, rather than a naive fixed-window approach.
1. Section splitting: Wikipedia == headers == are detected to split the article into logical sections (e.g. History, Preparation, Variations).
2. Sentence-boundary chunking: Large sections are split at sentence boundaries (never mid-sentence) into chunks of up to 500 tokens (cl100k_base tokenizer). Small sections stay as one chunk.
3. Heading prepend: Each chunk is prefixed with its section heading so every chunk is self-contained and the embedding captures the topic context.
4. Fragment merging: Orphan paragraphs without a heading are merged into the previous section to avoid tiny, context-less chunks.
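No sliding-window overlap is used; section boundaries provide natural context instead. Model context limit: 512 tokens (all-MiniLM-L6-v2), with a 500-token chunk cap as a safety buffer. Because the cap is counted with cl100k_base rather than the embedding model's own WordPiece tokenizer, the buffer is approximate.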
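The chunking steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's chunker: count_tokens is a hypothetical whitespace-based stand-in for a cl100k_base token count, the "Introduction" label for text before the first header is invented for the example, and fragment merging is only approximated by appending heading-less paragraphs to the current section.

```python
import re

MAX_TOKENS = 500  # chunk cap from the text


def count_tokens(text):
    """Hypothetical stand-in for a cl100k_base token count: whitespace words."""
    return len(text.split())


def split_sections(article):
    """Split on Wikipedia-style '== Heading ==' lines into (heading, body) pairs."""
    sections = []
    heading, lines = "Introduction", []  # assumed label for leading text
    for line in article.splitlines():
        match = re.fullmatch(r"\s*==+\s*(.+?)\s*==+\s*", line)
        if match:
            if lines:
                sections.append((heading, " ".join(lines)))
            heading, lines = match.group(1), []
        elif line.strip():
            # Paragraphs without their own heading stay in the current section.
            lines.append(line.strip())
    if lines:
        sections.append((heading, " ".join(lines)))
    return sections


def chunk_section(heading, body, max_tokens=MAX_TOKENS):
    """Pack whole sentences into chunks, each prefixed with the heading."""
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    budget = max_tokens - count_tokens(heading)
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        if current and count_tokens(" ".join(candidate)) > budget:
            chunks.append(heading + "\n" + " ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(heading + "\n" + " ".join(current))
    return chunks


article = (
    "== History ==\n"
    "Espresso originated in Italy. It spread quickly.\n"
    "== Preparation ==\n"
    "Water is forced through grounds.\n"
)

# A tiny cap of 6 tokens forces the History section to split in two.
small_chunks = [
    chunk
    for heading, body in split_sections(article)
    for chunk in chunk_section(heading, body, max_tokens=6)
]
for chunk in small_chunks:
    print(repr(chunk))
```

A single sentence longer than the budget is emitted as one oversized chunk in this sketch, since it never splits mid-sentence. A production chunker would need a fallback for that case.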
Documents