QA · LLM Evaluation · CI/CD · April 6, 2026 · 10 min read

Testing LLM Systems: Beyond Assert Equals

Every QA engineer's instinct is to write a deterministic assertion: given this input, expect exactly this output. That instinct is correct for 99% of software. It fails completely for large language models.

When I built the evaluation pipeline for DocExtract AI, a document extraction system that pulls structured fields from unstructured PDFs, the first thing I had to accept was that assert result == expected is not a test strategy for LLM output — it is a reliability trap. A correctly extracted date field might come back as "2024-01-15", "January 15, 2024", or "15 Jan 2024" depending on model temperature, context window state, and document formatting. All three are correct. A brittle assertion fails two of them.

This post covers the full evaluation architecture: golden eval sets, RAGAS scoring, LLM-as-judge rubrics, Brier score calibration, and the CI regression gate that ties it all together. The underlying principle throughout is the same: measure quality floors, not exact outputs.

Why Traditional Assertions Fail for LLMs

Traditional software is deterministic by design. The same input produces the same output. A unit test that passes today passes tomorrow. QA pipelines built on this assumption fail in three ways when applied to LLMs:

  • Non-determinism at inference time. Even with temperature set to zero, model outputs can vary across API versions, infrastructure changes, and context window differences. A test suite with hard-coded expected strings becomes a maintenance burden that fails on every model update, not because quality degraded but because phrasing shifted.
  • Semantic equivalence is not string equality. "The tenant is responsible for utilities" and "Utilities are the tenant's responsibility" are the same extraction. String comparison rejects one. Embedding similarity helps but still misses domain-specific synonyms and paraphrase patterns.
  • Failure modes are distributional, not binary. An LLM that hallucinates on one call in a hundred is not "broken" the way a null pointer exception is broken. It is underperforming on a quality dimension that requires statistical measurement, not pass/fail gates alone.

The alternative is to define what "good enough" means quantitatively and measure whether the system consistently stays above that threshold. That is what a proper LLM evaluation pipeline does.

Golden Eval Sets: Designing for Adversarial Coverage

A golden eval set is a curated collection of (input, expected output) pairs that represent the full range of difficulty your system will encounter in production. For DocExtract AI, this meant 24 fixtures covering the breadth of document types the extractor handles.

The easy cases are straightforward: clean PDFs with machine-readable text, standard field names, consistent formatting. But easy cases only tell you the system works under ideal conditions. What catches real regressions is the adversarial subset. DocExtract AI's eval set includes 8 adversarial fixtures:

  • Contradictory documents. A lease amendment that overrides a clause in the base lease. The extractor must identify which value is authoritative and not average or concatenate the two.
  • Truncated inputs. Documents where OCR or upload processing cut off mid-sentence. The extractor must handle partial context without hallucinating the missing content.
  • Multi-language content. Documents where field labels are in one language and values are in another — common in bilingual contracts. These stress-test whether the extractor is pattern-matching on layout versus semantically understanding content.
  • High-noise OCR. Scanned documents with degraded text where character recognition errors are frequent. "Tenant" might appear as "T3nant" or "Tena nt". Robust extraction must tolerate character-level noise without treating each variant as a separate field.

Designing adversarial fixtures requires thinking like an attacker. Ask: what document characteristics would make the most common failure mode in this system trigger? For extraction systems, that is usually ambiguity, multi-source conflicts, and input degradation. Build fixtures that deliberately invoke each failure mode.

# Fixture structure for golden eval set
{
  "fixture_id": "lease_contradictory_amendment_01",
  "category": "adversarial",
  "adversarial_type": "contradictory_documents",
  "input": {
    "base_lease_text": "...",
    "amendment_text": "..."
  },
  "expected": {
    "monthly_rent": "2400", # amendment value wins
    "authoritative_source": "amendment"
  },
  "evaluation_notes": "Base lease says 2200; amendment overrides to 2400"
}

The 24-fixture set establishes a baseline extraction accuracy of 94.6%. This number did not come from a target — it came from running the pipeline against the ground-truth labels and measuring what the system actually achieves. That measured baseline then becomes the floor the CI gate enforces.
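
Deriving the floor from measured runs rather than a declared target can be sketched in a few lines; the result schema here is an assumption for illustration:

```python
from collections import defaultdict

def measure_baselines(results: list[dict]) -> dict[str, float]:
    """Compute accuracy floors from actual fixture runs, overall and
    per category ('standard' / 'adversarial').
    results: [{"fixture_id": str, "category": str, "correct": bool}, ...]
    """
    by_category: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["correct"])
    floors = {
        cat: sum(outcomes) / len(outcomes)
        for cat, outcomes in by_category.items()
    }
    floors["overall"] = sum(r["correct"] for r in results) / len(results)
    return floors
```

Whatever numbers come out of this run become the gate thresholds — the floor is a measurement, not an aspiration.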

RAGAS Scoring: Three Dimensions of Retrieval Quality

Extraction accuracy tells you whether the right value came out. It does not tell you why the system succeeded or failed, or whether the reasoning behind an answer was grounded in the source document. RAGAS (Retrieval Augmented Generation Assessment) adds three complementary dimensions that decompose quality into measurable components.

The DocExtract AI pipeline weights each dimension based on what matters most for a document extraction use case:

  • Context recall (weight: 0.35). Did the retrieval step surface all the document sections needed to answer the question? A high context recall score means the relevant passages were in the model's context window. A low score means the extractor is answering from partial information — which is where truncation and missing-page failures manifest. The 0.35 weight reflects that retrieval gaps are the second most damaging failure mode after faithfulness issues.
  • Faithfulness (weight: 0.40). Does the extracted answer stay within what the source document actually says? Faithfulness is the anti-hallucination metric. A faithfulness score of 1.0 means every claim in the output is attributable to the input document. A low score means the model is generating plausible-sounding but unsupported content — the most dangerous failure mode for a system whose outputs are used in legal or financial decisions. It carries the highest weight for this reason.
  • Answer relevancy (weight: 0.25). Is the output actually answering the question that was asked? Relevancy catches cases where the extractor returns a real value from the document but from the wrong field — pulling the lease start date when the end date was requested, for instance. The lower weight reflects that relevancy failures are more recoverable than faithfulness failures.

from ragas.metrics import context_recall, faithfulness, answer_relevancy

# Weights reflect failure-mode severity for document extraction:
# faithfulness (the anti-hallucination metric) carries the most weight.
RAGAS_WEIGHTS = {
  "context_recall": 0.35,
  "faithfulness": 0.40,
  "answer_relevancy": 0.25,
}

def weighted_ragas_score(scores: dict) -> float:
  """Composite score; `scores` holds per-metric results from a
  ragas evaluation run, keyed by metric name."""
  return sum(
    scores[metric] * weight
    for metric, weight in RAGAS_WEIGHTS.items()
  )

The composite weighted score is tracked per fixture and per category. When a regression happens, the dimension breakdown tells you immediately whether you have a retrieval problem, a hallucination problem, or a targeting problem — which guides the fix rather than just signaling that something broke.
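
As an illustration of how the breakdown guides the fix, a sketch that names the dimensions that regressed past tolerance and maps each to its failure class — the baseline numbers here are hypothetical:

```python
# Hypothetical rolling baselines, for illustration only.
BASELINE = {"context_recall": 0.91, "faithfulness": 0.95, "answer_relevancy": 0.88}

def diagnose_regression(current: dict, baseline: dict,
                        tolerance: float = 0.05) -> list[str]:
    """Name each dimension that dropped past tolerance, mapped to the
    failure class it indicates."""
    labels = {
        "context_recall": "retrieval problem",
        "faithfulness": "hallucination problem",
        "answer_relevancy": "targeting problem",
    }
    return [
        labels[metric]
        for metric in labels
        if baseline[metric] - current.get(metric, 0.0) > tolerance
    ]
```

A faithfulness drop points at the prompt or the model; a context recall drop points at chunking and retrieval; a relevancy drop points at field targeting. That routing is the payoff of decomposed scoring.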

LLM-as-Judge: Writing a Rubric That Grades Consistently

Some extraction quality cannot be captured by string matching or even RAGAS scores. A judge needs to assess whether a multi-sentence summary of a contract clause is accurate, whether the tone of a flagged anomaly is appropriately cautious, or whether a conflict explanation correctly identifies which document takes precedence. These require semantic judgment.

LLM-as-judge uses a second model call to evaluate the first model's output against a rubric. The key insight is that the judge must be given a structured rubric and required to extract evidence — not just rate quality on a 1-10 scale. Unconstrained scores are inconsistent across calls. A rubric anchors the judge to specific criteria.

A well-designed rubric for document extraction has three components:

  1. Scoring criteria with explicit anchors. Each score level must be defined with a description of what output it corresponds to. "4: The extracted value matches the ground truth and the reasoning cites the specific document section" is actionable. "4: Good" is not.
  2. Required evidence extraction. The judge is asked to quote the specific document passage that supports or contradicts the extracted value. This prevents the judge from scoring based on plausibility rather than attribution, and it surfaces the evidence for human reviewers when a score is unexpected.
  3. Failure mode categorization. When the score falls below a threshold, the judge categorizes the failure: hallucination, field targeting error, truncation artifact, or confidence mismatch. This feeds directly into the diagnostic pipeline.

JUDGE_RUBRIC = """
You are evaluating a document extraction result.

Score the extraction from 1-5 using these anchors:
5: Exact match or semantically equivalent; evidence is explicitly quoted
4: Correct value; evidence present but not precisely quoted
3: Partially correct; missing sub-fields or minor inaccuracy
2: Plausible but unsupported; no document evidence found
1: Incorrect or hallucinated; contradicts source document

Required output format:
- score: (integer 1-5)
- evidence: (quoted passage from document, or "none found")
- failure_type: (null | hallucination | field_error | truncation | confidence_mismatch)
- reasoning: (one sentence)
"""

Requiring structured output in a specified format makes judge responses parseable without post-processing heuristics. The evidence field is the most valuable: when a human audits a low score, the quoted passage tells them exactly what the judge was looking at.

Brier Score: Is the System Honest About Its Confidence?

Most LLM extraction systems return a confidence score alongside the extracted value. The Brier score measures whether those confidence scores are calibrated — whether a field the system rates as 90% confident is actually correct 90% of the time.

The Brier score is the mean squared error between predicted probabilities and actual binary outcomes. A score of 0 is perfect calibration. A score of 1 is maximally wrong. For DocExtract AI, the Brier score is computed per fixture run and tracked over time:

def brier_score(predictions: list[dict]) -> float:
  """
  predictions: list of {"confidence": float, "correct": bool}
  Lower is better. 0.0 = perfect calibration.
  """
  return sum(
    (p["confidence"] - int(p["correct"])) ** 2
    for p in predictions
  ) / len(predictions)

Why does calibration matter? An overconfident extractor that is wrong 20% of the time but always returns 0.95 confidence will fool downstream systems into skipping human review on incorrect extractions. An underconfident extractor that is correct 95% of the time but flags everything for review creates unnecessary manual work. Brier score tracking catches both problems before they reach production, and its trend over model versions is a leading indicator of confidence drift even when raw accuracy holds steady.
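
The Brier score is a single number; to see where miscalibration lives, the same predictions can be bucketed by stated confidence and compared against observed accuracy. A sketch — the bucket count is arbitrary, and real reliability diagrams need enough volume per bucket to be meaningful:

```python
def calibration_buckets(predictions: list[dict], n_buckets: int = 5) -> list[dict]:
    """Group predictions by stated confidence and compare each bucket's
    mean confidence against its observed accuracy. A well-calibrated
    system shows a small gap in every bucket.
    predictions: [{"confidence": float, "correct": bool}, ...]
    """
    buckets: list[list[dict]] = [[] for _ in range(n_buckets)]
    for p in predictions:
        idx = min(int(p["confidence"] * n_buckets), n_buckets - 1)
        buckets[idx].append(p)
    report = []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue
        mean_conf = sum(p["confidence"] for p in bucket) / len(bucket)
        accuracy = sum(p["correct"] for p in bucket) / len(bucket)
        report.append({
            "range": (i / n_buckets, (i + 1) / n_buckets),
            "mean_confidence": round(mean_conf, 3),
            "accuracy": round(accuracy, 3),
            "gap": round(mean_conf - accuracy, 3),
        })
    return report
```

The overconfident extractor described above shows up here as a large positive gap in the top bucket — visible even when the aggregate Brier score looks tolerable.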

CI Regression Gates: Quality Metrics as Merge Blockers

All the measurement in the world is useless if it does not connect to deployment decisions. The evaluation pipeline feeds a CI regression gate that runs on every pull request and blocks merge if extraction accuracy drops below the 94.6% baseline established by the golden eval set.

The gate has three checks:

  1. Accuracy floor. Overall extraction accuracy across all 24 fixtures must be at or above 94.6%. This is the primary gate. A drop of any size below this threshold fails the check, requiring a review of what changed and why.
  2. Adversarial subset floor. The 8 adversarial fixtures are checked independently. A change that maintains overall accuracy but degrades performance on the adversarial cases might be hiding a regression behind easy-case gains. Adversarial cases are intentionally harder; they are the ones most likely to reflect real production edge cases.
  3. RAGAS faithfulness floor. Faithfulness score must not drop more than 0.05 from the rolling baseline. This catches prompt changes that improve format compliance but introduce hallucination — a common failure mode when prompts are over-optimized for surface-level correctness.

# CI step: run evals and check gates
- name: Run LLM evaluation suite
  run: python -m pytest evals/ -v --json-report --json-report-file=eval_results.json

- name: Check regression gates
  run: |
    python scripts/check_gates.py \
      --results eval_results.json \
      --accuracy-floor 0.946 \
      --adversarial-floor 0.946 \
      --faithfulness-max-drop 0.05

- name: Post results to PR
  run: python scripts/post_eval_comment.py --results eval_results.json

The final step posts a summary comment to the pull request with the accuracy score, RAGAS breakdown, Brier score, and a pass/fail status per gate. This makes the evaluation results visible to reviewers without requiring them to dig through CI logs. When a gate fails, the comment identifies which fixtures regressed, which failure categories appeared, and whether the drop is in accuracy, faithfulness, or both.
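
The core of a gate-check script like the one the workflow invokes can be sketched as a pure function that a thin argparse wrapper calls with the flags shown above. The result schema here is an assumption for illustration, not the actual eval_results.json layout:

```python
def check_gates(results: dict, accuracy_floor: float,
                adversarial_floor: float,
                faithfulness_max_drop: float) -> list[str]:
    """Return human-readable gate failures; an empty list means all pass.
    Assumed schema: overall and adversarial accuracy, plus current and
    rolling-baseline faithfulness scores."""
    failures = []
    if results["accuracy"] < accuracy_floor:
        failures.append(
            f"accuracy {results['accuracy']:.3f} below floor {accuracy_floor}")
    if results["adversarial_accuracy"] < adversarial_floor:
        failures.append(
            f"adversarial accuracy {results['adversarial_accuracy']:.3f} "
            f"below floor {adversarial_floor}")
    drop = results["baseline_faithfulness"] - results["faithfulness"]
    if drop > faithfulness_max_drop:
        failures.append(
            f"faithfulness dropped {drop:.3f} from rolling baseline")
    return failures
```

Keeping the checks in a pure function makes the gate itself unit-testable — the CI wrapper only loads the JSON, calls it, and exits non-zero when the returned list is non-empty.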

Key Principle: Measuring Quality Floors, Not Asserting Exact Outputs

The throughline across every component of this pipeline is a shift in what testing means for LLM systems. Traditional testing asks: is this output exactly right? LLM testing asks: is this system reliably staying above a quality threshold across a distribution of inputs?

This is not a lowering of standards. It is a more honest accounting of what LLMs are. A model that achieves 94.6% extraction accuracy on a 24-fixture eval set including 8 adversarial cases is demonstrably reliable. A model tested with a handful of hand-crafted happy-path assertions that all pass is not — because those assertions say nothing about the adversarial cases, the calibration, the faithfulness, or what happens when the accuracy floor moves.

The practical consequence is that the evaluation suite is a living artifact. As new failure modes appear in production, they get added as adversarial fixtures. As the model improves, the accuracy floor gets raised. The CI gate is not a fixed bar — it is the current best-known performance that any future change must at least preserve.

24 golden eval fixtures · 94.6% extraction accuracy baseline · 3 CI gate checks per pull request

Where to Go From Here

If you are building an LLM system and relying on ad-hoc assertions or manual spot-checks, the first investment to make is a golden eval set. You do not need 24 fixtures to start — 10 well-chosen cases with a few adversarial examples will immediately surface failure modes that manual testing misses, and give you a quantitative baseline to defend against regressions.

From there, adding RAGAS scoring is a one-dependency addition that turns a binary pass/fail into a diagnostic signal. The LLM-as-judge layer adds depth for cases where semantic evaluation is needed. Brier score tracking is a late-stage addition that pays off once you have enough prediction volume to detect calibration drift. And the CI gate is the step that makes all the measurement actionable — without it, you are monitoring quality but not protecting it.

The full DocExtract AI evaluation pipeline is part of my QA portfolio. If you are working on LLM system quality and want to dig into the implementation, you can find more detail on the QA portfolio page, including the eval framework architecture, test fixture design patterns, and the CI gate configuration.