# 06 · Eval harness

> **Audience.** Engineers who write evals and wire them into the release
> pipeline.
> ⟵ [internal index](./README.md)

The eval harness is what we sell. Without it, we're just shipping
weights and hoping. With it, we can say "this model passes 96% of the
tests you care about" and mean it.

## 01 · Principles

1. **Evals before prompts.** We write the test suite before the model,
   every engagement. If we can't articulate a pass/fail condition, we
   don't understand the problem yet.
2. **Tests are code.** Eval suites are Python, versioned in git,
   reviewed in PRs. No spreadsheets.
3. **Cheap by default, expensive when needed.** Most evals use
   deterministic, rule-based scorers and run in < 1 s each. We use
   LLM-as-judge (the `rubric` scorer) sparingly.
4. **Golden traces are committed.** Every engagement includes a
   snapshot of real prompt/response pairs that must keep working
   across releases.
5. **Drift is a first-class event.** When pass rate moves > 2 pp, the
   Dashboard alerts and the Updates pipeline pauses.

## 02 · Anatomy

`libs/evals/` is the harness; `libs/evals/suites/` holds the concrete
suites.

```
libs/evals/
├── bilbs_evals/
│   ├── __init__.py
│   ├── cli.py                      # `bilbs-eval` entry
│   ├── suite.py                    # Suite / Test / Result types
│   ├── runner.py                   # async executor
│   ├── scorers/
│   │   ├── exact.py                # exact-match
│   │   ├── contains.py
│   │   ├── regex.py
│   │   ├── json_schema.py
│   │   ├── structured_diff.py
│   │   ├── semantic.py             # cosine sim via embeddings
│   │   ├── rubric.py               # LLM-as-judge (explicit flag)
│   │   └── latency.py
│   ├── fixtures/                   # test fixtures (shared across suites)
│   ├── reporters/
│   │   ├── junit.py
│   │   ├── json.py
│   │   ├── pdf.py
│   │   └── grafana.py
│   └── drift.py                    # drift detector + alarms
├── suites/
│   ├── demo/                       # default examples
│   ├── northbeam/
│   │   └── v3/
│   │       ├── suite.yaml
│   │       ├── cases/
│   │       └── golden/
│   └── ...
├── tests/
└── pyproject.toml
```

## 03 · A suite in ~60 lines

Example — `suites/demo/simple/suite.yaml`:

```yaml
id: demo-simple
version: 1.0.0
owner: eng@groupebilbs.com

metadata:
  description: Smoke suite used by the monorepo CI.
  acceptance_gate:
    pass_rate_min: 0.98
    p95_latency_ms_max: 400
  runs_per_case: 3                    # for flaky scoring
  timeout_seconds_per_case: 30

deployments:                           # which runtime configs to test against
  - route: /api/v1/chat
    system_prompt: "You are a helpful assistant."

cases:
  - id: arithmetic-01
    prompt: "What is 17 * 23?"
    scorer: contains
    expected: "391"

  - id: json-output
    prompt: "Return JSON {name: 'Ada', born: 1815}. Only JSON."
    scorer: json_schema
    expected_schema:
      type: object
      required: [name, born]
      properties:
        name: {const: Ada}
        born: {const: 1815}

  - id: no-hallucination
    prompt: "Who is the 47th president of France?"
    scorer: rubric
    rubric:
      model: weights:policy-judge-v2    # a small on-box judge
      pass_criteria: |
        The response acknowledges that France has not had 47
        presidents or that the information is not available.
        Any confident fabrication fails.

  - id: latency-budget
    prompt: "Say 'ok' and nothing else."
    scorer: latency
    expected:
      p95_ms_max: 180
```

The harness runs these cases concurrently, writes JUnit, JSON, and PDF
reports, and feeds the results to the drift detector.
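
To make the `acceptance_gate` fields concrete, here is a minimal sketch of how a gate check could be computed from per-case results. `CaseResult`, `p95`, and `gate_passes` are hypothetical names for illustration; the real types live in `suite.py` and may look different. The sketch also assumes a nearest-rank percentile and does not model how `runs_per_case` repetitions are aggregated.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case outcome (not the harness's real Result type)."""
    case_id: str
    passed: bool
    latency_ms: float


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    # Integer ceil(0.95 * n) avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def gate_passes(
    results: list[CaseResult],
    pass_rate_min: float,
    p95_latency_ms_max: float,
) -> bool:
    """True when both the pass-rate and latency thresholds are met."""
    pass_rate = sum(r.passed for r in results) / len(results)
    latency = p95([r.latency_ms for r in results])
    return pass_rate >= pass_rate_min and latency <= p95_latency_ms_max
```

With the demo gate above, `gate_passes(results, 0.98, 400)` would tolerate at most two failures in a 100-case run.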

## 04 · Scorers

Pick the cheapest scorer that works.

| Scorer | Cost | When |
|--------|------|------|
| `exact` | free | Binary correct/not (math, schema-bound). |
| `contains` | free | Answer must include a phrase. |
| `regex` | free | Structured outputs, codes, IDs. |
| `json_schema` | cheap | Strict-structure outputs (tools, APIs). |
| `structured_diff` | cheap | Compare generated dicts/graphs against expected, ignoring order. |
| `semantic` | cheap (~5 ms) | Paraphrase-tolerant match via cosine-sim over embeddings. |
| `rubric` | expensive | LLM-as-judge. Only when you must. Always pinned model. |
| `latency` | free | p50/p95 targets. |

Custom scorers: subclass `bilbs_evals.scorers.Scorer`, register in
`scorers/__init__.py`.
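
As a rough illustration of what a custom scorer might look like, the sketch below defines a stand-in `Scorer` base class so the block runs on its own. The real `bilbs_evals.scorers.Scorer` interface is not shown in this document, so the `score` signature, the `Score` type, the `REGISTRY` dict, and the `word_limit` scorer itself are all assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class Score:
    """Hypothetical scorer result."""
    passed: bool
    detail: str = ""


class Scorer(ABC):
    """Stand-in for bilbs_evals.scorers.Scorer; the real API may differ."""
    name: str

    @abstractmethod
    def score(self, response: str, expected: Any) -> Score:
        ...


class WordLimitScorer(Scorer):
    """Passes when the response has at most `expected` words."""
    name = "word_limit"

    def score(self, response: str, expected: Any) -> Score:
        limit = int(expected)
        count = len(response.split())
        return Score(passed=count <= limit, detail=f"{count} words (limit {limit})")


# Assumed registration mechanism: a name -> class mapping in scorers/__init__.py.
REGISTRY: dict[str, type[Scorer]] = {WordLimitScorer.name: WordLimitScorer}
```

A case would then reference it by name (`scorer: word_limit`), assuming the harness resolves scorers through such a registry.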

## 05 · LLM-as-judge — use sparingly

We reach for `rubric` when no deterministic scorer fits, for example
when judging tone or whether the response acknowledged uncertainty.
Rules:

1. **Pin the judge model.** Use a weight reference (`weights:...`) not
   a floating tag.
2. **Use a small judge.** A 3–8B judge trained on a narrow rubric task
   beats a general 70B.
3. **Pair with ground truth.** Every rubric case has a human-labelled
   pass/fail on a sample; we re-calibrate quarterly.
4. **Log the judge's rationale.** Stored in the eval report for
   review.
5. **Budget it.** Keep `runs_per_case` × number of rubric cases ×
   judge latency (ms) low enough that the suite finishes in under
   10 minutes.
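
The budget in rule 5 is simple arithmetic. The sketch below works it through for a worst case where judge calls run one after another; the function names and the 600-second constant are illustrative, and real runs with concurrency would finish sooner.

```python
JUDGE_BUDGET_SECONDS = 600  # the 10-minute ceiling from rule 5


def judge_seconds(runs_per_case: int, rubric_cases: int, judge_ms: float) -> float:
    """Sequential judge time: runs x cases x per-call latency, in seconds."""
    return runs_per_case * rubric_cases * judge_ms / 1000


def within_judge_budget(runs_per_case: int, rubric_cases: int, judge_ms: float) -> bool:
    """True when the sequential judge time fits the budget."""
    return judge_seconds(runs_per_case, rubric_cases, judge_ms) <= JUDGE_BUDGET_SECONDS
```

For example, 3 runs × 40 rubric cases × 2,500 ms per judgement comes to 300 s, which fits; 100 rubric cases at the same settings comes to 750 s, which does not.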

## 06 · Golden traces

Each engagement's golden traces live in
`suites/<customer-slug>/golden/<version>.jsonl`. Each row is a JSON
object (pretty-printed here for readability):

```json
{
  "id": "golden-0042",
  "prompt": "…",
  "reference_response": "…",
  "captured_at": "2026-04-11T14:22:00Z",
  "deployment": "/api/v1/chat",
  "meta": { "customer": "northbeam", "version": "v3.0.0" }
}
```

CI replays every golden trace against the latest build. A trace fails
if its BLEU score drops by more than 2 pp, its semantic similarity
drops by more than 3 pp, or an `exact` or `regex` check breaks. Any
failure blocks the merge.
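
The failure rule can be sketched as a small comparison between a baseline replay and the current one. This is an illustration under assumptions: `ReplayScores` is a hypothetical type, both scores are taken to be on a 0-100 scale so that one percentage point equals 1.0, and the third condition is read as "an exact or regex check that used to pass now fails".

```python
from dataclasses import dataclass


@dataclass
class ReplayScores:
    """Hypothetical per-trace scores from one replay."""
    bleu: float          # 0-100 scale (assumed)
    semantic_sim: float  # 0-100 scale (assumed)
    pattern_ok: bool     # exact/regex checks all pass


BLEU_DROP_MAX_PP = 2.0
SEMANTIC_DROP_MAX_PP = 3.0


def golden_failure_reasons(baseline: ReplayScores, current: ReplayScores) -> list[str]:
    """Return the reasons a golden trace fails; an empty list means it passes."""
    reasons = []
    if baseline.bleu - current.bleu > BLEU_DROP_MAX_PP:
        reasons.append("bleu")
    if baseline.semantic_sim - current.semantic_sim > SEMANTIC_DROP_MAX_PP:
        reasons.append("semantic_sim")
    if baseline.pattern_ok and not current.pattern_ok:
        reasons.append("pattern")
    return reasons
```

Returning the list of reasons rather than a bare boolean keeps the report useful when more than one condition trips at once.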

Adding a golden: use `bilbs-eval capture` in a paired session with
the customer; the CLI records the prompt/response at point-in-time
and commits it.

## 07 · Running suites

```bash
bilbs-eval list                       # installed suites
bilbs-eval run demo-simple           # against the local dev runtime
bilbs-eval run northbeam/v3 \
  --runtime https://northbeam.bilbs.ai \
  --out report.json \
  --pdf report.pdf \
  --junit results.xml
```

Flags:
- `--concurrency=N` — parallel cases. Default = CPU count.
- `--filter=<regex>` — subset of cases.
- `--record` — saves the model's response alongside the test, used
  for golden capture.
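
For intuition, `--filter` and `--concurrency` can be pictured as a regex selection followed by a semaphore-bounded `asyncio.gather`. This is a sketch only, not the code in `runner.py`: `run_cases` and `run_one` are hypothetical names, and whether the real filter uses `re.search` or a full match is an assumption.

```python
import asyncio
import os
import re
from collections.abc import Awaitable, Callable, Iterable


async def run_cases(
    case_ids: Iterable[str],
    run_one: Callable[[str], Awaitable[object]],
    pattern: str | None = None,
    concurrency: int | None = None,
) -> dict[str, object]:
    """Run the selected cases with at most `concurrency` in flight."""
    selected = [c for c in case_ids if pattern is None or re.search(pattern, c)]
    # Default mirrors the documented behaviour: one slot per CPU.
    limit = asyncio.Semaphore(concurrency or os.cpu_count() or 1)

    async def guarded(case_id: str) -> tuple[str, object]:
        async with limit:
            return case_id, await run_one(case_id)

    pairs = await asyncio.gather(*(guarded(c) for c in selected))
    return dict(pairs)
```

Because the cases are I/O-bound calls to a runtime, a value above the CPU count may still be reasonable; the default is only a starting point.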

### CI integration

```yaml
# .github/workflows/evals.yml
jobs:
  evals:
    runs-on: ubuntu-latest-8-cores
    steps:
      - uses: actions/checkout@v4
      - run: just bootstrap
      - run: bilbs-eval run demo-simple --junit results.xml
      - uses: test-summary/action@v2
        with: { paths: "results.xml" }
```

Every PR runs the `demo-simple` suite (fast, < 2 min) plus the golden
replay for any engagement whose config changed. The nightly job runs
every client suite in full.
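
One way to decide which golden replays a PR needs is to map its changed file paths to engagement slugs. The sketch below assumes that "config changed" means any file under `libs/evals/suites/<customer-slug>/` was touched and that `demo` is not an engagement; `changed_engagements` is a hypothetical helper, not part of the CLI.

```python
from pathlib import PurePosixPath

SUITES_ROOT = PurePosixPath("libs/evals/suites")
NON_ENGAGEMENT_DIRS = {"demo"}  # assumed: demo suites are not customer engagements


def changed_engagements(changed_paths: list[str]) -> set[str]:
    """Return the customer slugs whose suite files appear in a change list."""
    slugs: set[str] = set()
    for raw in changed_paths:
        try:
            rel = PurePosixPath(raw).relative_to(SUITES_ROOT)
        except ValueError:
            continue  # path is outside the suites directory
        # Need at least <slug>/<something> to count as an engagement file.
        if len(rel.parts) >= 2 and rel.parts[0] not in NON_ENGAGEMENT_DIRS:
            slugs.add(rel.parts[0])
    return slugs
```

The change list itself could come from something like `git diff --name-only` against the base branch.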

## 08 · Drift detector

`bilbs_evals/drift.py` runs the production-deployed model against a
small stable sample of the suite every hour. Any dimension (pass
rate, p95 latency, refusal rate, token count median) that moves
> 2σ from the rolling baseline fires an alert:

- Dashboard tile turns amber / red.
- Alert fires via the standard alertmanager path.
- If the customer is on the Updates plan, the weekly report flags it.

False positives (for example, a sharp improvement in the model) are
tagged by the on-call engineer and excluded from the baseline.
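
The 2σ rule can be sketched with the standard library. This is an illustration rather than the contents of `drift.py`: the window size, the use of the sample standard deviation, and the handling of a zero-variance baseline are all assumptions.

```python
from statistics import mean, stdev


def drifted(baseline: list[float], latest: float, sigmas: float = 2.0) -> bool:
    """True when `latest` is more than `sigmas` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline)
    sd = stdev(baseline)
    if sd == 0:
        # A perfectly flat baseline: any change at all counts as drift.
        return latest != mu
    return abs(latest - mu) > sigmas * sd
```

The same function would be applied per dimension (pass rate, p95 latency, refusal rate, median token count), each with its own rolling window.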

## 09 · Reporting

Every run produces:
- **JSON** — full structured output (`report.json`). Canonical.
- **JUnit XML** — for CI dashboards.
- **PDF** — executive summary for customers. Pie chart (pass/fail),
  latency histogram, top 10 failing cases with prompts + outputs.
- **Grafana push** — metrics emitted via OTLP, landed on the
  dashboard.

PDF reports are **self-contained** — no web fonts, no external
images. Customers review them offline.
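
As an illustration of the JUnit output, the sketch below builds a minimal `<testsuite>` document with `xml.etree.ElementTree`. The per-case dict shape (`id`, `passed`, `latency_ms`, `detail`) is an assumption about the report data, and the real `reporters/junit.py` likely emits more attributes.

```python
import xml.etree.ElementTree as ET


def to_junit(suite_id: str, results: list[dict]) -> str:
    """Render case results as a minimal JUnit XML string."""
    failed = [r for r in results if not r["passed"]]
    suite = ET.Element(
        "testsuite",
        name=suite_id,
        tests=str(len(results)),
        failures=str(len(failed)),
    )
    for r in results:
        # JUnit expresses durations in seconds.
        case = ET.SubElement(
            suite, "testcase", name=r["id"], time=f"{r['latency_ms'] / 1000:.3f}"
        )
        if not r["passed"]:
            ET.SubElement(case, "failure", message=r.get("detail", "failed"))
    return ET.tostring(suite, encoding="unicode")
```

The resulting string can be written to `results.xml` and picked up by the CI summary step.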

## 10 · Writing a new suite — checklist

- [ ] `suite.yaml` with an id, version, owner, acceptance gate.
- [ ] At least 30 cases covering the golden-path + 3–5 edge cases.
- [ ] Rubric cases labelled with sample human verdicts (committed
      alongside).
- [ ] Runs locally against the dev runtime (`bilbs-eval run <id>`).
- [ ] Runs in CI under 5 min.
- [ ] Drift baseline initialised with at least 72 h of production
      samples.
- [ ] Customer sign-off recorded in the engagement folder.
- [ ] Golden traces captured (≥ 10).
- [ ] Added to `bilbs-eval list` by placing under `suites/`.
- [ ] Runbook entry written for "what if this suite fails a release?"

## 11 · Anti-patterns

- **Suites that only pass on the weights we just trained.** A suite
  should be independent of any particular set of weights: it tests the
  product, not the model.
- **Huge rubric suites.** If > 50% of cases use LLM-as-judge, rethink.
  Probably you're testing a fuzzy concept that needs sharper rules.
- **Suites without golden traces.** Easy to silently regress. Add at
  least 10.
- **Suites owned by no one.** Assign an `owner` email; that person
  gets pinged on drift.
- **Suites that take longer than a merge cadence.** Target < 5 min for
  any suite that gates merges; longer suites run nightly.
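
Several of these anti-patterns can be caught mechanically. The sketch below lints a parsed `suite.yaml` (passed in as a plain dict) for three of them; `lint_suite` is a hypothetical helper and the messages are illustrative.

```python
def lint_suite(suite: dict, golden_count: int) -> list[str]:
    """Return a list of anti-pattern warnings for one suite."""
    problems = []
    cases = suite.get("cases", [])
    rubric_cases = sum(1 for c in cases if c.get("scorer") == "rubric")
    if cases and rubric_cases / len(cases) > 0.5:
        problems.append(f"{rubric_cases}/{len(cases)} cases use LLM-as-judge (> 50%)")
    if not suite.get("owner"):
        problems.append("no owner set")
    if golden_count < 10:
        problems.append(f"only {golden_count} golden traces (need at least 10)")
    return problems
```

Running a check like this in CI would turn the list above from advice into a gate, if the team wants that.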

## 12 · Example drift response

Scenario: pass rate drops from 97% to 91% overnight after a `runtime`
bump. What to do:

1. `bilbs-eval run <suite> --filter=@failed-today`.
2. Look at the diff between yesterday and today's reports
   (`bilbs-eval diff <a.json> <b.json>`).
3. Reproduce one failing case locally
   (`bilbs-eval run <suite> --filter=<id>`).
4. Bisect runtime commits with
   `just bisect runtime <good-sha> <bad-sha> "bilbs-eval run <filter>"`.
5. File the culprit, revert, re-release.

Total expected time: 60–120 min from alert to rollback.
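
The report comparison in step 2 boils down to finding cases that passed yesterday and fail today. The sketch below shows that logic on two parsed `report.json` files; the report layout (a top-level `cases` list with `id` and `passed` fields) is an assumption, and the actual output of `bilbs-eval diff` may differ.

```python
import json
from pathlib import Path


def newly_failing(before: dict, after: dict) -> list[str]:
    """Case ids that passed in `before` and fail in `after`, sorted."""
    passed_before = {c["id"] for c in before["cases"] if c["passed"]}
    return sorted(
        c["id"] for c in after["cases"] if not c["passed"] and c["id"] in passed_before
    )


def newly_failing_from_files(before_path: str, after_path: str) -> list[str]:
    """Convenience wrapper that loads two report files from disk."""
    before = json.loads(Path(before_path).read_text())
    after = json.loads(Path(after_path).read_text())
    return newly_failing(before, after)
```

Cases that were already failing yesterday are deliberately excluded, so the list points at what the runtime bump changed.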

---

Next: [07-box-manufacturing.md](./07-box-manufacturing.md).
