Start here

How Compass answers, proves, and improves.

Before scorecards or evaluation runs, we need one shared picture of the product: a person asks, Compass interprets the task, uses approved data, returns an answer, and leaves a record we can inspect.

If a score changes, a reviewer should be able to open the conversation and see the answer, the evidence, and the reason in one place.

1. Ask

A user asks for a lookup, ranking, comparison, citation, follow-up, or explanation.

2. Answer

Compass resolves the task, uses approved data, and renders the answer the user sees.

3. Prove

Compass stores the transcript, citations, artifacts, traces, snapshots, and verdicts so the answer can be reviewed.

Base mechanics

Compass is a policy-answering system, not a free-form chatbot.

For review purposes, the idea is simple: every answer should leave enough evidence behind that we can understand what Compass thought, what data it used, what it returned, and how it was judged.

User asks

A person sends a policy question: a district lookup, a ranking, a comparison, a follow-up, or a request for sources.

Compass answers

Compass interprets the task, resolves approved data, executes deterministic data work, and renders a user-facing answer.

The system records proof

Messages, artifacts, citations, traces, snapshots, and verdicts become the case file that reviewers inspect later.

Plain-language frame: the answer is the visible product; the evidence file is how Compass proves what happened; evaluation is how we decide whether that evidence met the expected behavior.

What happens after Send

A user message moves through a few predictable stages before Compass replies.

This is the non-technical version of the path. The implementation can change, but the review model should always preserve these stages.

1. Message: User ask

Compass receives the prompt and the conversation history.

2. Understand: Task shape

Compass decides what kind of answer is needed.

3. Resolve: Approved choices

Districts, metrics, years, peer sets, and source families are resolved.

4. Execute: Data work

The selected query, ranking, count, comparison, or lookup runs.

5. Render: Answer

Compass writes the table, explanation, sources, and export-ready artifact.

6. Persist: Evidence

The session stores messages, artifacts, snapshots, trace ids, and verdicts.

Why this matters for evaluation: when an answer is wrong, the review page should show which stage broke. Was the wrong district selected, was the right data retrieved but sorted incorrectly, or was the answer rendered without enough source support?
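To make the "which stage broke" question concrete, here is a minimal sketch. The `Stage` enum and `first_broken_stage` helper are hypothetical names chosen for illustration, not Compass internals, and attributing a bad sort to the Execute stage is an assumption.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """The six review stages, in the order a message passes through them."""
    MESSAGE = 1
    UNDERSTAND = 2
    RESOLVE = 3
    EXECUTE = 4
    RENDER = 5
    PERSIST = 6


def first_broken_stage(stage_ok: dict[Stage, bool]) -> Optional[Stage]:
    """Return the earliest stage marked as failed, or None if none failed.

    Stages missing from stage_ok are treated as passed (an assumption).
    """
    for stage in Stage:  # Enum iteration follows definition order
        if not stage_ok.get(stage, True):
            return stage
    return None


# Right data retrieved but sorted incorrectly: here the break is in Execute.
result = first_broken_stage({Stage.RESOLVE: True, Stage.EXECUTE: False})
print(result.name if result else "no stage failed")
```

A review page built on this idea would show the first failing stage next to the verdict, so the reviewer starts at the break rather than at the top of the trace.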

What Compass produces

A conversation becomes a case file with four parts.

This is the bridge from normal use to evaluation. Reviewers are not just reading a transcript; they are reading an evidence package.

Assignment

The user ask, scenario, case, expected behavior, and any known criteria.

what good requires

Evidence

The actual answer, table, chart, CSV, citations, source panel, trace, and snapshot.

what happened

Judgment

Verdicts from deterministic checks, bounded judges, span assertions, and scenario-fit.

pass/fail/error

Follow-up

Human classification: correct verdict, product bug, data gap, criteria issue, scenario issue, evaluator issue, or expected limitation.

what we do next

Design implication: the interface should not make people jump between transcript, scorecard, trace, notes, and raw verdict rows. The case file should assemble them.
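The four-part case file can be sketched as a single record. The `CaseFile` class and its field types are illustrative assumptions; the deck names only the four parts, not their storage shape.

```python
from dataclasses import dataclass, field


@dataclass
class CaseFile:
    """One conversation assembled into the four parts reviewers read."""
    assignment: dict = field(default_factory=dict)  # what good requires
    evidence: dict = field(default_factory=dict)    # what happened
    judgment: list = field(default_factory=list)    # pass/fail/error verdicts
    follow_up: str = ""                             # what we do next

    def missing_parts(self) -> list[str]:
        """List the parts that are still empty, so the UI can flag gaps."""
        parts = {
            "assignment": self.assignment,
            "evidence": self.evidence,
            "judgment": self.judgment,
            "follow_up": self.follow_up,
        }
        return [name for name, value in parts.items() if not value]


# Illustrative values only.
case = CaseFile(
    assignment={"expected_behavior": "BA starting-salary ranking"},
    evidence={"answer": "table of 10 districts"},
)
print(case.missing_parts())
```

Keeping all four parts on one object mirrors the design implication: the reviewer opens one thing, and gaps are visible rather than hidden on another page.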

Why evaluation exists

Evaluation reads the case file and decides whether Compass did the job.

Once Compass records the assignment, evidence, and traces, we can ask better questions than "did the model sound good?" We can ask whether the answer met the expectation and where the system broke if it did not.

Product question

Did Compass give the user the right answer, in the right shape, with the right data and source support?

accuracy

Process question

Did Compass follow the expected path: resolve the right thing, run the right data work, and preserve useful evidence?

trace and artifact evidence

Improvement question

If it failed, who owns the fix: product logic, data coverage, scenario wording, criterion design, evaluator behavior, or observability?

actionable follow-up

Plain-language transition: evaluation is not a separate product from Compass. It is the review layer that reads Compass conversations and turns them into clear product learning.

Where evaluation attaches

Evaluation is layered because each moment asks a different question.

Now that we have a case file, the evaluation layers explain when each kind of check happens. L1 through L4 are not four names for the same thing; they are four checkpoints around the same message-and-evidence flow.

L1

Before the answer returns

Can code prove the response is safe enough to send, or should Compass repair/block it first?

This can shape the visible answer.

L2

After each turn

Did this assistant turn satisfy the criteria that apply to this answer and artifact?

This writes verdict rows.

L3

Across the conversation

What is the quality state of the session so far, especially after follow-ups or drift?

This is provisional for real users.

L4

After a generated case completes

Did the whole conversation satisfy the scenario's expected behavior?

This is the scorecard gate for generated cases.

Important distinction: organic user conversations can get L1, L2, and L3 analysis. L4 needs a known scenario/case expectation, so it belongs to generated evaluation runs and scorecard batches.
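The distinction can be stated as a small rule. This is a sketch; `applicable_layers` is a hypothetical helper, and the only input it assumes is whether a known scenario/case expectation exists.

```python
def applicable_layers(has_known_expectation: bool) -> list[str]:
    """Return the evaluation layers that can run for a conversation.

    L1 through L3 apply to every conversation. L4 needs a known
    scenario/case expectation, which only generated runs have.
    """
    layers = ["L1", "L2", "L3"]
    if has_known_expectation:
        layers.append("L4")
    return layers


print(applicable_layers(False))  # organic user conversation
print(applicable_layers(True))   # generated case attempt
```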

Shared language

Use Batch for the group, and Evaluation Run for the execution.

This clears up the words that have been overlapping: scenarios, cases, attempts, runs, sweeps, and scorecards.

Batch

A selected group of scenarios and cases that we want to evaluate. A batch can be the official scorecard batch, one accuracy-dimension batch, a bug-focused batch, a feedback replay batch, or even one scenario.

defines what we are testing

Evaluation Run

One execution of a batch against a specific Compass build, target backend, criteria set, and K-attempt setting.

creates the conversations and scores

Attempt

One generated Compass conversation for one runnable case in an evaluation run. K=3 means three attempts per case.

the unit we inspect

Scorecard

The published history of evaluation runs over an official scorecard batch. It is neither the full history of verdict rows nor a loose dashboard rollup.

the official product report

Engineering note: sweep can stay as the runner/process word in CLI, logs, and table names. The product UI should say batch and evaluation run.
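The relationship between batch, evaluation run, and attempt can be sketched as a small data model. The class and field names are assumptions for illustration. The example values are borrowed from the sample golden-v1 run shown later in the deck, and the case ids are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Batch:
    """Defines what we are testing: a named group of runnable cases."""
    name: str
    case_ids: tuple[int, ...]


@dataclass(frozen=True)
class EvaluationRun:
    """One execution of a batch against a specific build and settings."""
    batch: Batch
    build: str
    criteria_set: str
    target: str
    k_attempts: int

    def attempts(self) -> list[tuple[int, int]]:
        """One (case_id, attempt_number) pair per generated conversation."""
        return [
            (case_id, n)
            for case_id in self.batch.case_ids
            for n in range(1, self.k_attempts + 1)
        ]


# 84 placeholder case ids; K=3 means three attempts per case.
batch = Batch(name="golden-v1 scorecard", case_ids=tuple(range(1, 85)))
run = EvaluationRun(batch, build="main@9a075", criteria_set="2026-05",
                    target="staging DB", k_attempts=3)
print(len(run.attempts()))  # 84 cases x 3 attempts = 252
```

The attempt count is always cases times K, which is why the scorecard denominator is stated in cases and attempts rather than in verdict rows.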

From SSN-201

The interface should explain every judgment in plain language.

A reviewer should not need to infer how a score happened. The page should assemble the answer, expectation, evidence, and reason in one place.

This answer failed because it did not do what this scenario expected, and the transcript, trace, or artifact shows why.

Scenario context

What situation or user need are we testing? Which batch, run, feedback row, or organic session created it?

Expected behavior

What good answer shape was required: selection, data, coverage, filters, sort, citations, consistency, or process?

Observed evidence

What did Compass actually say or render, and what do the artifact, trace, snapshot, and verdict rows prove?

Why the evaluator can be trusted

Evaluation can use AI, but the judgment must stay bounded.

The trust model should be visible in the interface, not buried in implementation notes.

Keep AI on rails

deterministic checks come first when code can prove the answer
AI selects from approved checks; it does not invent standards
AI judges a specific criterion against specific evidence
every AI verdict has a written reason and can be reviewed

Make the report honest

scorecards use fixed batch membership and versioned runs
denominators are cases and attempts, not loose verdict rows
health and evaluator failures are shown separately
published numbers are reproducible point-in-time report cards

Interface implication: the browser should always show whether a judgment came from deterministic code, judge prompt, span assertion, scenario-fit, or human review.
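One way to keep the judgment source and the written reason attached to every verdict is sketched below. `JudgmentSource`, `Verdict`, and `AI_SOURCES` are hypothetical names, and treating judge prompt and scenario-fit as the AI-backed sources is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class JudgmentSource(Enum):
    DETERMINISTIC = "deterministic code"
    JUDGE_PROMPT = "judge prompt"
    SPAN_ASSERTION = "span assertion"
    SCENARIO_FIT = "scenario-fit"
    HUMAN_REVIEW = "human review"


# Assumption: these two sources are the AI-backed ones.
AI_SOURCES = {JudgmentSource.JUDGE_PROMPT, JudgmentSource.SCENARIO_FIT}


@dataclass(frozen=True)
class Verdict:
    criterion: str
    outcome: str  # "pass", "fail", or "error"
    source: JudgmentSource
    reason: str = ""

    def __post_init__(self) -> None:
        # Every AI verdict must carry a written reason a reviewer can read.
        if self.source in AI_SOURCES and not self.reason.strip():
            raise ValueError(f"{self.criterion}: AI verdict needs a written reason")

    def label(self) -> str:
        """Text a review page could show next to the verdict."""
        return f"{self.criterion}: {self.outcome} ({self.source.value})"


v = Verdict("sorted descending", "pass", JudgmentSource.DETERMINISTIC)
print(v.label())
```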

Concepts reviewers need to understand

Use a small vocabulary and repeat it everywhere.

The deck should make Compass evaluation readable to product and QA reviewers without requiring anyone to know table names.

| Object | Plain meaning | Where it appears in the UI |
| --- | --- | --- |
| Scenario | The assignment: a situation Compass should handle. | Evaluation run, scenario title, expected behavior. |
| Case | A runnable phrasing or multi-step version of a scenario. | Case/attempt selector and scorecard denominator. |
| Turn | The actual user/assistant exchange that happened. | Transcript and per-turn verdict cards. |
| Criterion | A reusable standard that must be true. | Product checks and evaluator reason rows. |
| Verdict | A pass/fail/error judgment against evidence. | Assessment panel and product/health chips. |
| Trace / artifact | The proof of how Compass produced the answer. | Evidence panel, debug links, table/chart/source preview. |

Why this matters now

The evaluation pieces exist, but the story is scattered.

When a score looks surprising, a reviewer has to jump between the scorecard, run output, verdict rows, transcript, trace, and notes. The interface should put those pieces together.

Today, review feels harder than it should

the score says pass or fail, but the reason is somewhere else
the transcript shows what happened, but not what was expected
the trace shows how Compass ran, but not whether the answer was good
the evaluation run generated useful test conversations, but they are hard to browse
health problems can look like product accuracy problems

The conversation browser fixes the shape

show the user ask and Compass answer in the same place
show what good behavior was supposed to be
show the evaluator's assessment in plain language
separate product failures from run or trace health
let scorecards and evaluation runs open the exact conversations behind the number

Operating model

The operating loop is simple: define, produce, judge, review, improve.

A scenario, check, or real conversation defines what to inspect. Compass records evidence. Evaluators write verdicts. Reviewers classify what those verdicts mean.

1. Define: What to check

Scenario, criterion, or organic conversation.

2. Produce: Session evidence

Messages, artifacts, citations, trace, snapshot.

3. Evaluate: Verdicts

Code checks, bounded judges, span assertions, scenario fit.

4. Review: Human meaning

Correct verdict, product bug, data gap, check issue.

5. Improve: Feedback writeback

Punchlist, scenario update, criterion update, product fix.

Key idea: a sweep is the backend process that creates conversations with batch/run metadata. The scorecard summarizes those conversations. Evaluation Review and Conversation Detail let reviewers inspect the evidence behind every number.

How evaluations actually fire

The four layers answer different questions.

This is the part the interface should teach: not every evaluation is the same kind of judgment, and not every judgment belongs in the scorecard.

| Layer | When | Question | Where the UI shows it |
| --- | --- | --- | --- |
| L1 Inline | Before the answer returns | Can code prove the answer or artifact is safe enough to send? | Repair/block notes and trace evidence. |
| L2 Post-turn | After each assistant turn persists | Did this turn pass the criteria that apply to this answer? | Per-turn product and health checks. |
| L3 Session | After each turn, idle window, or reviewer open | What is the quality state of the conversation so far? | Conversation summary and needs-review flag. |
| L4 Case fit | When a generated case completes | Did the whole conversation satisfy the scenario's expected behavior? | Scenario-fit verdict and scorecard result. |

Important: L4 is the whole-assignment judge. It is where the case's expected behavior language becomes the rubric for the full transcript. Organic sessions can receive L1, L2, and L3 analysis, but L4 strict case-fit needs a known scenario/case expectation.

Plain language: L4 case fit

Case fit asks: did Compass do what this scenario was testing?

Earlier layers check parts of the answer. Case fit reads the scenario's expected behavior and judges the whole conversation against that assignment.

What L1-L3 usually use

L1: code checks on the response before it returns
L2: criteria selected for this turn, plus artifacts, answer text, and trace evidence
L3: the conversation so far, to summarize quality state or flag review needs
These layers can use scenario metadata to choose checks, but they mostly judge specific criteria or session health.

What L4 adds

the scenario/case expected behavior from the database
the full case transcript, not just one assistant turn
a plain-language judge question: did this conversation satisfy this test?
a scorecard-ready result: pass, fail, or evaluator error for the whole case attempt

Example: L2 can check whether a table is sorted descending and has citations. L4 checks whether the whole answer satisfied the Sort Accuracy scenario: choose BA starting salary for a broad new-teacher prompt, return the requested top ranking, include salary values, citations, and denominator language, and avoid a blocking clarification.
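The L2 half of that example is the kind of check code can prove directly. Here is a minimal sketch; the row shape (`salary`, `citations`) and the sample rows are made up for illustration, not taken from Compass data.

```python
def is_sorted_descending(values: list[float]) -> bool:
    """True when each value is greater than or equal to the next one."""
    return all(a >= b for a, b in zip(values, values[1:]))


def l2_table_checks(rows: list[dict]) -> dict[str, bool]:
    """Run two deterministic checks on a rendered table.

    Each row is assumed to have a numeric "salary" and a "citations" list.
    """
    salaries = [row["salary"] for row in rows]
    return {
        "sorted_descending": is_sorted_descending(salaries),
        "has_citations": bool(rows) and all(row["citations"] for row in rows),
    }


# Made-up rows: the third row is out of order and has no citation.
rows = [
    {"district": "A", "salary": 61000, "citations": ["src-1"]},
    {"district": "B", "salary": 58500, "citations": ["src-2"]},
    {"district": "C", "salary": 59000, "citations": []},
]
print(l2_table_checks(rows))
```

Checks like these say nothing about whether BA starting salary was the right metric to rank or whether a clarification should have been avoided. That whole-assignment question is what L4 adds.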

Batches and evaluation runs

Review the batch first, then open the conversations behind it.

A scorecard batch, an accuracy-dimension batch, or a bug-focused batch is a named group of scenarios/cases. An evaluation run executes that batch and generates the conversations behind the result.

Evaluation Review

Run: Golden v1 Scorecard · Batch: scorecard · Filter: scorecard batches · Conversations: generated · Publishable

Evaluation runs

Golden v1 Scorecard · scorecard batch, 84 cases, latest run

Sort Accuracy batch run · dimension batch, 5 cases, 15 attempts

Clarification-loop regression · ad hoc bug batch, 12 cases

NCTQ feedback replay · feedback replay batch

Purpose of this evaluation run

Measure how the current Compass build performs against the fixed golden-v1 scorecard batch across the accuracy dimensions.

What this run contains

Cases: 84
Attempts: 252
Dimensions: 7
Score: 33%
Batch: golden-v1 scorecard
Build: main@9a075
Criteria: 2026-05
Target: staging DB
K attempts: 3 per case
Health: publishable

Each attempt is a generated Compass conversation with known expected behavior, a final case-fit judgment, and run-health evidence.

Conversations in this run

Best-paying districts for new teachers

Sort Accuracy · expected BA starting-salary ranking · attempt 1

pass

Philadelphia district selection

Selection Accuracy · expected exact district resolution · attempt 2

fail

Coverage state for unavailable data

Coverage-State Labeling · expected not reviewed vs not applicable

review

Click a row to open the conversation detail view with transcript, L1-L3 analysis, trace, artifacts, and verdict evidence.

Selected conversation preview

User: What are the 10 best-paying districts for new teachers?
Compass: I found current reviewed numeric data for 61 of 133 covered districts. The ranking below includes only districts with current numeric values...

Evaluation read

case fit: pass · L2 checks: pass

Scenario-fit is the product gate for generated cases. Lower-level verdicts explain the result; run health says whether the evaluation can be published.

Mental model

A run explains the group. Conversation detail explains one session.

This keeps the generated evaluation layer distinct from the always-on conversation review layer.

Evaluation Review

Use this when you want to understand a publishable evaluation artifact or a diagnostic run.

Batch: the selected scenarios/cases, like golden-v1 scorecard, Sort Accuracy, or a bug-regression pack
Run: one execution of that batch against a Compass build
Attempt: one generated conversation for one case and one K trial
Scorecard: the official run over a scorecard batch, publishable only when product score and run health are both known

Conversation Detail

Use this when you want to inspect one conversation deeply.

actual user/assistant transcript
L1/L2/L3 analysis that can exist for every conversation
artifacts, citations, trace, and turn snapshot
L4 case fit only when this conversation came from a scenario/case run

Naming recommendation: use Batches for selected scenario groups, Evaluation Runs for executions, and keep sweep as the backend/process word. Users should browse batches and runs, not raw sweeps.

Publication readiness

A scorecard run needs two answers: did Compass perform, and did the run work cleanly?

Product accuracy and run health should never be collapsed into one percentage.

Product score

Did the generated conversation satisfy the case's expected behavior?

scenario-fit gate · criteria diagnostics

A case attempt passes only when the case-level expectation passes. Generic low-level passes cannot dilute a user-visible failure.

Run health

Did the run produce trustworthy evidence?

trace coverage · verdict coverage · eval errors

An unhealthy run can still be useful for debugging, but it should not become the published scorecard without an explicit waiver.

Run health band

Trace coverage: 100% (all attempts linked)
Missing verdicts: 0 (required checks wrote)
Eval exceptions: 0 (judges completed)
Response errors: 0 (chat returned valid)
Warning traces: 3 (publish review needed)
Status: Review (not silently green)

This is the place for missing traces, zero-verdict valid responses, response validation errors, evaluator exceptions, and warning traces. Those are readiness signals, not Sort Accuracy failures.
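The health band can be read as a small classification rule. The sketch below is one possible reading: the `RunHealth` fields mirror the band above, but the thresholds and the "Blocked" status name are assumptions, not documented Compass behavior.

```python
from dataclasses import dataclass


@dataclass
class RunHealth:
    trace_coverage: float  # fraction of attempts linked to a trace, 0.0 to 1.0
    missing_verdicts: int
    eval_exceptions: int
    response_errors: int
    warning_traces: int

    def status(self) -> str:
        """Classify the run as "Blocked", "Review", or "Publishable"."""
        hard_failures = (
            self.trace_coverage < 1.0
            or self.missing_verdicts > 0
            or self.eval_exceptions > 0
            or self.response_errors > 0
        )
        if hard_failures:
            return "Blocked"  # assumed name for a run that cannot publish as-is
        if self.warning_traces > 0:
            return "Review"  # not silently green
        return "Publishable"


# The band shown above: everything clean except three warning traces.
band = RunHealth(trace_coverage=1.0, missing_verdicts=0, eval_exceptions=0,
                 response_errors=0, warning_traces=3)
print(band.status())
```

None of these inputs is a product verdict, which keeps run health from leaking into the accuracy percentage.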

Conversation detail

One session should be readable without a database tour.

Generated evaluation conversations get strict case fit. Organic conversations still get L1/L2/L3 analysis, but they are kept out of scorecard math, which would overstate confidence in an inferred expectation.

Generated case attempt

Scenario-fit: final product gate for the whole case
Expected behavior: loaded from the scenario/case
Lower-level verdicts: diagnostics that explain the pass/fail
Evidence: transcript, artifact, trace, snapshot, citations, raw verdict rows

Prompt features detected

Metric: BA starting salary
Entity: all covered districts
Limit: top 10
Sort: descending salary
Threshold: none
Citations: requested/required
Follow-up: no prior turn
Year: current reviewed

This comes from evaluator-only request-feature extraction. It helps reviewers see what the prompt asked for; it must not become a second runtime planner.

Rule: scenario-fit decides generated case success. L1/L2/L3 remain useful for every conversation, but their lower-level passes do not erase a failed expected behavior.
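The rule can be sketched as a scoring function in which lower-level verdicts are carried along but never change the outcome. The function names and the tuple shape are assumptions. In this sketch any scenario-fit value other than "pass" counts as not passed, whereas a real report would surface evaluator errors separately under run health.

```python
def attempt_passes(scenario_fit: str, lower_level: list[str]) -> bool:
    """Scenario-fit alone decides whether a generated attempt passes.

    lower_level verdicts are accepted so callers can keep them alongside
    the gate; they are diagnostics and deliberately do not affect the result.
    """
    return scenario_fit == "pass"


def product_score(attempts: list[tuple[str, list[str]]]) -> float:
    """Fraction of attempts whose scenario-fit verdict is "pass"."""
    if not attempts:
        raise ValueError("no attempts to score")
    passed = sum(attempt_passes(fit, lower) for fit, lower in attempts)
    return passed / len(attempts)


# Made-up attempts: (scenario-fit verdict, lower-level verdicts).
attempts = [
    ("pass", ["pass", "pass"]),
    ("fail", ["pass", "pass", "pass"]),  # low-level passes do not rescue it
    ("pass", ["fail"]),
]
print(round(product_score(attempts), 2))
```

The denominator is attempts, not verdict rows, so three passing low-level checks on a failed attempt still contribute exactly one failure.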

QA crosswalk

Every failure should become an owned next step.

This is how scorecards, NCTQ feedback, punchlist rows, and evaluator calibration stop living in separate worlds.

| Attempt | Expected vs actual | Evidence | Owner bucket |
| --- | --- | --- | --- |
| Sort Accuracy case 467 (golden-v1 attempt 2) | Expected BA salary ranking. Actual answer asked a blocking BA/MA clarification. | Scenario-fit failed; L2 sort checks skipped; trace shows clarification route. | product bug |
| Citation case 522 (feedback replay) | Expected source-backed answer. Actual table had values but no usable source panel. | Answer artifact present; citation verdict failed; source trace incomplete. | data/evidence gap |
| Selection case 611 (dimension run) | Expected exact district resolution. Actual answer chose wrong district variant. | Request features detected correct entity; resolver evidence selected wrong candidate. | product bug |

Owner buckets: product bug, data gap, scenario revision, criterion revision, evaluator issue, expected limitation, or correct verdict. The UI should force that classification before a failure disappears into a percentage.
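Forcing the classification can be as simple as refusing to close a failure without a bucket. The enum values below come from the owner-bucket list; `close_failure` and the attempt id format are hypothetical.

```python
from enum import Enum
from typing import Optional


class OwnerBucket(Enum):
    PRODUCT_BUG = "product bug"
    DATA_GAP = "data gap"
    SCENARIO_REVISION = "scenario revision"
    CRITERION_REVISION = "criterion revision"
    EVALUATOR_ISSUE = "evaluator issue"
    EXPECTED_LIMITATION = "expected limitation"
    CORRECT_VERDICT = "correct verdict"


def close_failure(attempt_id: str, bucket: Optional[OwnerBucket]) -> str:
    """Refuse to close a failed attempt until a reviewer picks a bucket."""
    if bucket is None:
        raise ValueError(f"{attempt_id}: classify the failure before closing it")
    return f"{attempt_id} -> {bucket.value}"


print(close_failure("sort-accuracy-467/attempt-2", OwnerBucket.PRODUCT_BUG))
```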

How to read one case file

The page should teach people where to look.

A reviewer should be able to learn the evaluation system by reading top-to-bottom and left-to-right.

Left rail: scope

Which evaluation run am I looking at? Golden scorecard, a dimension batch run, a bug-regression batch, or a feedback replay.

This teaches why denominator and membership matter.

Center: judgment

What was expected, what was observed, which criterion failed, and how the system classifies the failure.

This teaches the durable quality sentence.

Right rail: evidence

Transcript, artifact, source evidence, trace, turn snapshot, and the raw verdict reason.

This teaches why the answer is reviewable.

The review path: choose slice -> open conversation -> read expected behavior -> inspect actual answer -> check product and health verdicts -> classify next action.

Organic conversations

All conversations can be reviewed. Only known tests get official scores.

The system should distinguish known expectations from inferred expectations.

Generated or golden conversations

known scenario and case
known expected behavior
strict case fit allowed
eligible for scorecard denominator
case/attempt/product score is meaningful

Organic user conversations

task intent is inferred
quality signals still run
strict score requires a known expectation or reviewer link
best default label is one of: needs review, pass signal, or failure family
use for triage, not public scorecard math

Doctrine: every conversation can get evidence and analysis. Only conversations with known expectations can get strict product scores.

Scorecard as official run

The scorecard is the official report, not a ledger.

The scorecard still exists. It becomes the fixed snapshot over a versioned scorecard batch, with every number drilling into the same conversation case files.

Scorecard summary

Selection Accuracy: 0%
Data Fidelity: 3%
Sort Accuracy: 100%
Citation Accuracy: 73%

Scores link into the exact conversations, not separate report rows.

What the scorecard means

fixed batch membership defines the public denominator
scorecard history is just prior runs over scorecard batches
cases and attempts are plain-language counts
product score is separate from run and evaluator health
scenario-fit is the final product gate for generated cases
every dimension links to conversations, evidence, and review outcomes

Publication rule: a scorecard can be published only when batch membership, build/SHA, criteria set, attempts, final case-fit verdicts, and run-health evidence are present. Otherwise it is a diagnostic run, not the headline scorecard.
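The publication rule reads like a presence check over six pieces of metadata. A minimal sketch, assuming the run is available as a dictionary; the key names are invented for illustration.

```python
# Hypothetical key names, one per item in the publication rule.
REQUIRED_FOR_PUBLICATION = (
    "batch_membership",
    "build_sha",
    "criteria_set",
    "attempts",
    "case_fit_verdicts",
    "run_health",
)


def run_kind(run: dict) -> str:
    """Return "scorecard" only when every required piece is present."""
    missing = [key for key in REQUIRED_FOR_PUBLICATION if not run.get(key)]
    return "diagnostic" if missing else "scorecard"


complete = {key: "present" for key in REQUIRED_FOR_PUBLICATION}
incomplete = dict(complete, run_health=None)  # health evidence never recorded
print(run_kind(complete), run_kind(incomplete))
```

This checks only that the evidence exists. Whether the run is healthy enough to publish is a separate question answered by the run health band.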

What kind of check is this?

Criteria are the standards. Evaluators apply them.

The UI should name the source of each judgment so reviewers know what kind of evidence they are reading.

Deterministic

Code proves something directly: sort order, citation markers, requested limit, response contract, artifact shape.

best for proof

Judge prompt

AI judges a specific qualitative criterion against bounded evidence, such as coverage-state language quality.

best for interpretation

Span assertion

Trace evidence proves the process happened: planner resolved candidates, execution dispatched, persistence succeeded.

best for process

Scenario-fit is separate: it judges the full generated case against its expected behavior. It should be visible on scorecard/evaluation-run conversations, not treated as a generic live user-turn judge.

Data architecture

Behind the scenes, connect the evidence into one read model.

Keep verdicts as check-level evidence. Add snapshots that summarize the conversation for product review.

Existing evidence

chat_messages, traces, snapshots, artifacts, feedback, scenarios, cases, criteria, verdicts, and evaluation-run metadata.

New read model

conversation_analysis_runs and conversation_analysis_snapshots roll this into a product-readable state.

Run and detail views

Scorecards, dimension batch runs, bug batches, feedback replays, and organic traffic all point into the same conversation snapshots.

Integration point: `frontend_contracts.get_debug_backfill()` should stop returning an empty verdict list and start returning conversation review snapshots plus linked verdict evidence.
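As a rough picture of "review snapshots plus linked verdict evidence", a snapshot might carry a product-readable state alongside the verdicts that support it. This sketch does not describe the actual return type of `frontend_contracts.get_debug_backfill()`; every class and field name here is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class VerdictEvidence:
    criterion: str
    outcome: str  # "pass", "fail", or "error"
    reason: str


@dataclass
class ConversationReviewSnapshot:
    conversation_id: str
    product_state: str  # e.g. "pass", "fail", or "needs review"
    verdicts: list[VerdictEvidence] = field(default_factory=list)

    def failed_criteria(self) -> list[str]:
        """Criteria whose verdicts failed, in the order they were recorded."""
        return [v.criterion for v in self.verdicts if v.outcome == "fail"]


# Made-up snapshot for illustration.
snapshot = ConversationReviewSnapshot(
    conversation_id="conv-001",
    product_state="fail",
    verdicts=[
        VerdictEvidence("sorted descending", "pass", "values are in order"),
        VerdictEvidence("citations present", "fail", "no usable source panel"),
    ],
)
print(snapshot.failed_criteria())
```

Verdicts stay as check-level rows; the snapshot only summarizes them, so every run and detail view can point at the same object.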

Consolidation

Connect or retire anything that defines quality somewhere else.

The point is not to add a new page. The point is to remove ambiguity about where quality evidence lives.

Connect into Conversation Review

conversation/debug pages
scorecard dimensions
evaluation-run reports and runsets
scenario manager
NCTQ feedback rows
punchlist owner rows

Retire or demote

cumulative "Sweep live" view as the headline
verdict rows presented as trials
debug views without verdict context
scorecard drilldowns detached from transcripts
hidden one-off analysis scripts as source of truth

Implementation sequence

Build it in slices.

Start by reading existing evidence better. Add persistence and reviewer workflows only after the UX proves itself.

1. Read model

Build a `ConversationReviewSnapshot` over messages, verdicts, criteria, traces, scenarios, cases, request features, evaluation runs, and feedback.

2. Detail panel

Show expected behavior, scenario-fit gate, diagnostic verdicts, prompt features, transcript, artifacts, trace, and evidence.

3. Evaluation runs

Add run metadata, batch/build/criteria/K attempts, product score, health band, and publication eligibility.

4. Scorecard linkup

Make the scorecard the official published run and drill every score into the same conversation case file.

5. QA crosswalk

Map failed attempts to expected behavior, actual answer, scenario-fit reason, evidence, and owner bucket.

6. Review loop

Persist reviewer classifications and notes once the read shape proves useful for scorecards, feedback, and punchlist routing.

Product decision

Make Conversation Review the quality control plane.

This unifies the evaluation work instead of adding another surface people have to reconcile.

One conversation. One evidence file. Many views: organic review, evaluation-run debugging, golden scorecard, feedback triage, and punchlist routing.

What good looks like

Defined by scenarios, criteria, expected behavior, and approved product rules.

How we judge it

Layered validators, bounded judges, case-fit checks, and human review outcomes.

Where we see it

Evaluation Runs for generated batches, and Conversation Detail for the underlying session evidence.