Evaluation is the single most critical component of a GenAI system. If you cannot measure it, you cannot improve it. Most teams fall into one of two traps:
- The manual slog: collecting production logs and checking every answer against the knowledge base by hand.
- The shotgun approach: using tools like RAGAS to randomly generate a test set from your documents.
This post proposes a surgical approach: a customizable workflow that uses topic modeling to build a "Golden Dataset" that actually tests your system's weak points.
The workflow: from random to surgical
Step 1: know your data (topic modeling)
Gather your knowledge-base documents and run a topic modeling tool such as BERTopic to cluster them into topics.
- Why: this gives you a map of what your knowledge base actually contains.
- The insight: the topic distribution reveals scarce topics, exactly the ones a random sampler would likely miss.
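This scarcity check can be sketched in a few lines, assuming each document has already been assigned a topic label (the labels and threshold below are illustrative, not prescribed by any tool):

```python
from collections import Counter

def find_scarce_topics(doc_topics, scarcity_threshold=0.05):
    """Return topics whose share of the corpus falls below scarcity_threshold.
    These are the topics a uniform random sampler would likely miss."""
    counts = Counter(doc_topics)
    total = len(doc_topics)
    return {t: c for t, c in counts.items() if c / total < scarcity_threshold}

# Hypothetical topic labels, as a tool like BERTopic might assign them.
doc_topics = ["billing"] * 60 + ["onboarding"] * 35 + ["gdpr_deletion"] * 3 + ["sso_setup"] * 2
print(find_scarce_topics(doc_topics))  # {'gdpr_deletion': 3, 'sso_setup': 2}
```

The two rare topics surface immediately, whereas a random 10% sample of this corpus would, on average, contain zero or one document from each of them.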
Step 2: targeted generation
Based on these topics, generate question-answer pairs to force coverage of weak areas.
- RAGAS (shotgun): random spread of questions.
- This method (scalpel): force-generate adversarial questions for scarce topics.
Tip: you can scale how much context you pass to the LLM with topic size (feed small topics whole, sample from large ones) to balance generation cost against coverage.
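One way to "force coverage" is to allocate the question-generation budget inversely to topic frequency, so scarce topics get disproportionately many questions. A minimal sketch under that assumption (topic counts, the floor, and the budget are illustrative):

```python
def question_quotas(topic_counts, total_questions=100, floor=3):
    """Allocate a question-generation budget inversely to topic frequency:
    the rarer the topic, the more questions it gets, with a floor so that
    no topic goes untested."""
    inverse = {t: 1.0 / c for t, c in topic_counts.items()}
    norm = sum(inverse.values())
    return {t: max(floor, round(total_questions * w / norm)) for t, w in inverse.items()}

counts = {"billing": 60, "onboarding": 35, "gdpr_deletion": 3, "sso_setup": 2}
print(question_quotas(counts))
# {'billing': 3, 'onboarding': 3, 'gdpr_deletion': 38, 'sso_setup': 57}
```

This is the scalpel in numbers: the two scarce topics receive 95 of the 100 questions, while the dominant topics still get a minimum floor of coverage.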
Step 3: the "fool's gold" filter (quality control)
Introduce a critic loop to filter hallucinated pairs:
- Generate a QA pair.
- Have a separate critic model answer using only the context.
- Compare the two answers with a simple string-overlap metric or an LLM judge; discard the pair if they disagree.
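The critic loop can be sketched with a cheap token-overlap F1 standing in for the judge. The `critic_answer` hook and the threshold are assumptions for illustration; in practice the critic would be a second LLM call:

```python
from collections import Counter

def token_f1(a, b):
    """Token-level F1 between two answers; a crude stand-in for an LLM judge."""
    ta, tb = a.lower().split(), b.lower().split()
    common = sum((Counter(ta) & Counter(tb)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)

def critic_filter(pairs, critic_answer, threshold=0.6):
    """Drop QA pairs whose gold answer disagrees with an independent critic
    that answers from the context alone (the hallucinated "fool's gold")."""
    return [
        (q, gold, ctx)
        for q, gold, ctx in pairs
        if token_f1(gold, critic_answer(q, ctx)) >= threshold
    ]

# Stub critic for illustration; a real one would query a separate model.
critic = lambda question, context: "within 30 days"
pairs = [
    ("What is the refund window?", "30 days", "Refunds are accepted within 30 days."),
    ("Who founded the company?", "Ada Lovelace", "Refunds are accepted within 30 days."),
]
print(critic_filter(pairs, critic))  # keeps only the first, grounded pair
```

The second pair is discarded because its answer cannot be reproduced from the context alone, which is exactly the failure mode this filter exists to catch.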
The multi-hop dilemma
This method excels at single-hop QA, but multi-hop questions are harder.
- The challenge: real users ask questions that bridge multiple documents.
- The reality: knowledge graphs handle complex multi-document reasoning best, but building and maintaining one is heavyweight.
- The compromise: use topic intersection to find overlapping documents for multi-hop generation.
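The topic-intersection compromise reduces to a pairwise set overlap over per-document topic sets. A sketch, with invented file names and topics:

```python
from itertools import combinations

def multihop_candidates(doc_topics, min_shared=1):
    """Find document pairs whose topic sets overlap; each pair is a candidate
    context for generating a question that bridges both documents."""
    return [
        (doc_a, doc_b, topics_a & topics_b)
        for (doc_a, topics_a), (doc_b, topics_b) in combinations(doc_topics.items(), 2)
        if len(topics_a & topics_b) >= min_shared
    ]

docs = {
    "pricing.md": {"billing", "plans"},
    "refunds.md": {"billing", "support"},
    "sso.md": {"auth"},
}
print(multihop_candidates(docs))  # [('pricing.md', 'refunds.md', {'billing'})]
```

Feeding both documents of a candidate pair into the generator, with the shared topic as the bridging theme, yields multi-hop questions without the cost of building a full knowledge graph.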
Why this approach wins
- Edge case coverage for scarce topics.
- Surgical control over what gets tested.
- Higher quality data with critic filtering.
- Flexibility without proprietary DSLs.