How to Build an LLM Evaluation Dataset
The dataset is the part of evaluation that teams most often get wrong, and a flawed dataset quietly corrupts every decision built on it. If the dataset is unrepresentative, your scores look good while real users suffer. If it changes underneath you, you cannot tell whether a score moved because the system changed or because the test changed. If it only contains easy cases, it gives a false sense of safety. Getting the dataset right is more important than getting the metrics or the judge right, because the metrics and the judge operate on whatever the dataset contains.
The best inputs come from your own production logs, because real users phrase questions in ways no one on the team would invent: misspelled, ambiguous, abbreviated, and combining multiple intents. If you have an observability layer, mine its traces for representative requests across the range of query types you serve. If you are pre-launch, generate candidate inputs with an LLM but treat them as a stopgap and replace them with real traffic as soon as you have it. Aim for a set that mirrors the actual distribution of usage, including the relative frequency of each query type, so that aggregate scores reflect real impact.
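One way to keep the sample proportional to real usage is stratified sampling over logged requests. The sketch below is a minimal illustration, assuming each log record is a dict with a `query_type` field; that field name and the `sample_representative` helper are hypothetical, not part of any particular observability tool.

```python
import random
from collections import defaultdict


def sample_representative(logs, n, seed=0):
    """Sample roughly n logged requests, preserving each query type's share of traffic."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group records by their (assumed) query_type field.
    by_type = defaultdict(list)
    for record in logs:
        by_type[record["query_type"]].append(record)

    total = len(logs)
    sample = []
    for query_type, records in sorted(by_type.items()):
        # Quota proportional to this type's share of traffic, with a floor of one
        # so that rare query types are not dropped entirely.
        quota = max(1, round(n * len(records) / total))
        sample.extend(rng.sample(records, min(quota, len(records))))
    return sample
```

Because of rounding and the floor of one example per type, the result can differ slightly from `n` when there are many rare query types.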
Each example needs something to score against: a reference answer for correctness, relevance labels for retrieval, or a rubric for judge-based scoring. Labeling is the expensive step, and the practical accelerator is LLM-assisted labeling with human review: have a strong model propose the reference answer or relevance labels, then have a human verify and correct them. This can be several times faster than labeling from scratch, and it produces labels good enough for evaluation as long as the human review is genuine rather than rubber-stamping. For retrieval datasets, the labels record which documents are relevant to each query; they are the input that makes context precision and recall computable.
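To show why relevance labels are the prerequisite for retrieval metrics, here is a minimal sketch of set-based precision and recall for a single query. It is a simplification: some evaluation frameworks define context precision with rank weighting, which this version ignores. The function name and the document ids are illustrative.

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall for one query.

    retrieved_ids: document ids the retriever returned.
    relevant_ids: document ids a human-reviewed label marks as relevant.
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)

    # Precision: what fraction of the retrieved documents are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant documents were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Without the labeled `relevant_ids`, neither number can be computed, which is why the labeling step cannot be skipped for retrieval evaluation.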
A dataset of only common, easy cases certifies that the system handles the easy cases, which you already knew. The value is in the hard cases: rare but important query types, adversarial inputs designed to trigger policy violations, inputs at the boundaries (empty, very long, multilingual), and the specific failure modes you have seen before. Deliberately over-sample these relative to their natural frequency in a dedicated slice, while keeping a separate representative slice for measuring real-world impact. The edge-case slice catches regressions; the representative slice estimates user impact.
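Keeping the two slices separate only helps if scores are reported per slice rather than blended. A minimal sketch, assuming each evaluation result is a dict carrying a `slice` tag and a boolean `passed` field (both names are hypothetical):

```python
from collections import defaultdict


def pass_rate_by_slice(results):
    """Compute the pass rate for each slice separately.

    results: iterable of dicts with 'slice' (str) and 'passed' (bool).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        totals[result["slice"]] += 1
        passes[result["slice"]] += int(result["passed"])
    return {name: passes[name] / totals[name] for name in totals}
```

Reporting the slices side by side avoids the situation where over-sampled hard cases drag down a number that is supposed to estimate typical user experience.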
Treat the dataset as versioned code, stored in source control with a changelog. The entire value of offline evaluation depends on the dataset being stable: when a score changes, you need to know it was the system that changed and not the test. When you do add or modify examples, bump the version and note it, so historical scores remain interpretable. A floating, casually edited dataset produces score movements that no one can attribute, which destroys trust in the evaluation.
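Alongside a version number in source control, a content hash stored with every score makes it easy to check that two runs used the same examples. The following is one possible sketch, assuming examples are JSON-serializable dicts with an `id` field; the helper names and record layout are illustrative.

```python
import hashlib
import json


def dataset_fingerprint(examples):
    """Return a short hash that changes whenever any example changes."""
    # Sort by id and serialize with sorted keys so that neither example
    # order nor key order affects the fingerprint.
    canonical = json.dumps(
        sorted(examples, key=lambda e: e["id"]),
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def score_record(examples, version, metric, value):
    """Bundle a score with the dataset version and hash it was measured on."""
    return {
        "dataset_version": version,
        "dataset_hash": dataset_fingerprint(examples),
        "metric": metric,
        "value": value,
    }
```

If two score records carry different hashes, the scores were measured on different tests and should not be compared directly.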
The dataset is never finished. Every time observability or a user report surfaces a real failure that the dataset did not catch, add that case with its correct expected output. This is the single most valuable source of new examples, because it directly targets the gaps in your current coverage, and it guarantees that a bug fixed once stays fixed: the next change that reintroduces it fails the evaluation. Over months, this feedback loop grows a dataset that covers the real long tail far better than any upfront curation could.
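The feedback loop can be as simple as a helper that appends each confirmed failure to the dataset as a regression example. This is a sketch under assumed conventions: the field names (`input`, `expected`, `slice`, `source`) and the `reg-` id prefix are hypothetical.

```python
import hashlib


def _normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def add_regression_case(dataset, query, expected, source):
    """Append a production failure as a new example.

    Returns False without modifying the dataset if an example with the
    same normalized input already exists, True otherwise.
    """
    key = _normalize(query)
    if any(_normalize(example["input"]) == key for example in dataset):
        return False

    dataset.append({
        # Derive the id from the normalized input so it is stable across runs.
        "id": "reg-" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:8],
        "input": query,
        "expected": expected,
        "slice": "regression",
        "source": source,  # e.g. a trace id or ticket reference
    })
    return True
```

Recording the `source` keeps each regression example traceable to the incident that motivated it, which helps later when deciding whether an example has gone stale.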
Periodically review the dataset for problems that accumulate: near-duplicate examples that inflate the weight of one query type, stale examples whose correct answer has since changed, and coverage gaps in query types that have grown since the last review. A dataset that grows only by accretion eventually becomes lopsided. A quarterly curation pass keeps it representative and trustworthy.
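Part of the curation pass can be automated. As a rough sketch, near-duplicates can be flagged with token-level Jaccard similarity; the 0.8 threshold and the `input` and `id` field names are assumptions to tune for your data, and flagged pairs still need a human decision.

```python
from itertools import combinations


def _tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def near_duplicates(examples, threshold=0.8):
    """Return id pairs whose inputs have token Jaccard similarity >= threshold."""
    pairs = []
    for a, b in combinations(examples, 2):
        tokens_a, tokens_b = _tokens(a["input"]), _tokens(b["input"])
        union = tokens_a | tokens_b
        # Jaccard similarity: shared tokens divided by all distinct tokens.
        if union and len(tokens_a & tokens_b) / len(union) >= threshold:
            pairs.append((a["id"], b["id"]))
    return pairs
```

This compares every pair, so its cost grows quadratically with dataset size. That is usually acceptable for a few thousand examples; beyond that, an approximate method would be a better fit.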
The dataset matters more than the metric. Source from real traffic, label with LLM-assisted human review, over-sample edge cases in a dedicated slice, version it like code, and grow it from every production failure so fixed bugs stay fixed.
How Big Should the Dataset Be
Teams often stall on dataset size, either waiting to amass thousands of examples before they start or assuming a dozen is enough. The useful answer is that you can start small and grow with purpose. A few dozen well-chosen examples per major query type are enough to catch the large, obvious regressions that matter most, and that is far better than no evaluation while you wait for a perfect dataset. As the dataset grows, statistical power grows with it: more examples per segment narrow the confidence interval on each metric, which lets you detect smaller real changes without mistaking them for noise. The practical target is enough examples in each segment you care about that a meaningful regression in that segment would move the score outside its confidence interval rather than staying within the range that sampling noise alone could explain.
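To make the link between segment size and detectable change concrete, the sketch below computes a Wilson score interval for a pass rate. The Wilson interval is one common choice for binomial proportions, used here for illustration; the text does not prescribe a specific method, and the function name is hypothetical.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives about 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width
```

At an 80% pass rate, 24 passes out of 30 examples gives an interval of roughly 0.63 to 0.90, while 240 out of 300 gives roughly 0.75 to 0.84. With 30 examples, a drop from 80% to 70% is well inside the noise; with 300, it falls outside the interval.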
Quality matters more than raw size. A hundred carefully labeled, diverse examples that cover your real failure modes are worth more than a thousand near-duplicate easy cases that inflate the count while testing the same thing repeatedly. Watch especially for redundancy, where many examples exercise one common path and leave the hard paths thin, because this produces a dataset that is large on paper but blind where it matters. The right way to grow is deliberately, adding examples that cover a gap, a new query type, or a real failure, rather than dumping in whatever traffic is convenient. A focused dataset that grows in response to real gaps beats a large one that grew by accretion.
The Dataset as a Living Asset
A well-maintained evaluation dataset becomes one of the most valuable assets a team owns, because it encodes an accumulated definition of what good means for the product, validated against real failures. It is what lets a new team member change a prompt with confidence, what lets you safely evaluate a cheaper model, and what turns a model-provider update from a risk into a checkable event. The dataset is also what makes the rest of the evaluation pipeline work: the judge and the metrics are only as good as the examples they run on. Invest in the dataset first, and the rest of evaluation follows.