How Compass answers, proves, and improves.
Before scorecards or evaluation runs, we need one shared picture of the product: a person asks, Compass interprets the task, uses approved data, returns an answer, and leaves a record we can inspect.
If a score changes, a reviewer should be able to open the conversation and see the answer, the evidence, and the reason in one place.
1. Ask
A user asks for a lookup, ranking, comparison, citation, follow-up, or explanation.
2. Answer
Compass resolves the task, uses approved data, and renders the answer the user sees.
3. Prove
Compass stores the transcript, citations, artifacts, traces, snapshots, and verdicts so the answer can be reviewed.
Compass is a policy-answering system, not a free-form chatbot.
For review purposes, the idea is simple: every answer should leave enough evidence behind that we can understand what Compass thought, what data it used, what it returned, and how it was judged.
User asks
A person sends a policy question: a district lookup, a ranking, a comparison, a follow-up, or a request for sources.
Compass answers
Compass interprets the task, resolves approved data, executes deterministic data work, and renders a user-facing answer.
The system records proof
Messages, artifacts, citations, traces, snapshots, and verdicts become the case file that reviewers inspect later.
Plain-language frame: the answer is the visible product; the evidence file is how Compass proves what happened; evaluation is how we decide whether that evidence met the expected behavior.
A user message moves through a few predictable stages before Compass replies.
This is the non-technical version of the path. The implementation can change, but the review model should always preserve these stages.
Compass receives the prompt and the conversation history.
Compass decides what kind of answer is needed.
Districts, metrics, years, peer sets, and source families are resolved.
The selected query, ranking, count, comparison, or lookup runs.
Compass writes the table, explanation, sources, and export-ready artifact.
The session stores messages, artifacts, snapshots, trace ids, and verdicts.
Why this matters for evaluation: when an answer is wrong, the review page should show which stage broke. Was the wrong district selected, was the right data retrieved but sorted incorrectly, or was the answer rendered without enough source support?
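To make "which stage broke" concrete, here is a minimal sketch that names the six stages above and returns the earliest one that failed. The `Stage` names and the `first_broken_stage` helper are hypothetical and only illustrate the review model; they are not Compass identifiers.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    # Hypothetical names for the six stages listed above, in order.
    RECEIVE = "receive prompt and conversation history"
    CLASSIFY = "decide what kind of answer is needed"
    RESOLVE = "resolve districts, metrics, years, peer sets, source families"
    EXECUTE = "run the query, ranking, count, comparison, or lookup"
    RENDER = "write table, explanation, sources, and artifact"
    RECORD = "store messages, artifacts, snapshots, trace ids, verdicts"


def first_broken_stage(stage_ok: dict[Stage, bool]) -> Optional[Stage]:
    """Return the earliest stage marked as failed, or None if none failed."""
    for stage in Stage:  # Enum iteration follows definition order
        if not stage_ok.get(stage, True):
            return stage
    return None


# Right data retrieved, but the ranking step sorted it incorrectly.
broken = first_broken_stage(
    {Stage.RESOLVE: True, Stage.EXECUTE: False, Stage.RENDER: False}
)
print(broken.name)
```

Reporting the earliest failed stage keeps the reviewer's attention on the root cause instead of the downstream symptoms.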
A conversation becomes a case file with four parts.
This is the bridge from normal use to evaluation. Reviewers are not just reading a transcript; they are reading an evidence package.
Assignment
The user ask, scenario, case, expected behavior, and any known criteria.
what good requires
Evidence
The actual answer, table, chart, CSV, citations, source panel, trace, and snapshot.
what happened
Judgment
Verdicts from deterministic checks, bounded judges, span assertions, and scenario-fit.
pass/fail/error
Follow-up
Human classification: correct verdict, product bug, data gap, criteria issue, scenario issue, evaluator issue, or expected limitation.
what we do next
Design implication: the interface should not make people jump between transcript, scorecard, trace, notes, and raw verdict rows. The case file should assemble them.
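One way to picture the four-part case file is as a single record that the interface assembles. This is a sketch under assumptions: the class and field names (`CaseFile`, `Assignment`, `Verdict`, `needs_classification`) are illustrative and do not come from the Compass schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Assignment:
    user_ask: str
    expected_behavior: Optional[str] = None  # None for organic conversations


@dataclass
class Verdict:
    source: str   # e.g. "deterministic", "judge", "span", "scenario_fit"
    outcome: str  # "pass", "fail", or "error"
    reason: str


@dataclass
class CaseFile:
    assignment: Assignment                                   # what good requires
    evidence: dict[str, str] = field(default_factory=dict)   # what happened
    judgment: list[Verdict] = field(default_factory=list)    # pass/fail/error
    follow_up: Optional[str] = None                          # what we do next

    def needs_classification(self) -> bool:
        """True when a non-pass verdict exists and no human has classified it."""
        has_problem = any(v.outcome != "pass" for v in self.judgment)
        return has_problem and self.follow_up is None


case = CaseFile(
    assignment=Assignment("Rank districts by starting salary", "top ranking with citations"),
    evidence={"answer": "table with 10 rows", "trace_id": "example-trace"},
    judgment=[Verdict("deterministic", "fail", "rows not sorted descending")],
)
print(case.needs_classification())
```

Keeping all four parts on one object mirrors the design implication above: the reviewer opens one thing, not five.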
Evaluation reads the case file and decides whether Compass did the job.
Once Compass records the assignment, evidence, and traces, we can ask better questions than "did the model sound good?" We can ask whether the answer met the expectation and where the system broke if it did not.
Product question
Did Compass give the user the right answer, in the right shape, with the right data and source support?
accuracy
Process question
Did Compass follow the expected path: resolve the right thing, run the right data work, and preserve useful evidence?
trace and artifact evidence
Improvement question
If it failed, is the owner product logic, data coverage, scenario wording, criterion design, evaluator behavior, or observability?
actionable follow-up
Plain-language transition: evaluation is not a separate product from Compass. It is the review layer that reads Compass conversations and turns them into clear product learning.
Evaluation is layered because each moment asks a different question.
Now that we have a case file, the evaluation layers explain when each kind of check happens. L1 through L4 are not four names for the same thing; they are four checkpoints around the same message-and-evidence flow.
L1: Before the answer returns
Can code prove the response is safe enough to send, or should Compass repair/block it first?
This can shape the visible answer.
L2: After each turn
Did this assistant turn satisfy the criteria that apply to this answer and artifact?
This writes verdict rows.
L3: Across the conversation
What is the quality state of the session so far, especially after follow-ups or drift?
For organic user conversations, this quality state is provisional.
L4: After a generated case completes
Did the whole conversation satisfy the scenario's expected behavior?
This is the scorecard gate for generated cases.
Important distinction: organic user conversations can get L1, L2, and L3 analysis. L4 needs a known scenario/case expectation, so it belongs to generated evaluation runs and scorecard batches.
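The layer rule above can be sketched as a small lookup: every conversation gets L1 through L3, and L4 is added only when a scenario/case expectation is known. The function and dictionary names are hypothetical; the layer questions are paraphrased from the text.

```python
# Questions paraphrased from the four checkpoints above.
LAYER_QUESTIONS = {
    "L1": "Can code prove the response is safe enough to send?",
    "L2": "Did this assistant turn satisfy the applicable criteria?",
    "L3": "What is the quality state of the session so far?",
    "L4": "Did the whole conversation satisfy the scenario's expected behavior?",
}


def applicable_layers(has_known_expectation: bool) -> list[str]:
    """Organic conversations get L1-L3; L4 needs a known scenario/case."""
    layers = ["L1", "L2", "L3"]
    if has_known_expectation:
        layers.append("L4")
    return layers


for layer in applicable_layers(has_known_expectation=False):
    print(layer, LAYER_QUESTIONS[layer])
```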
Use Batch for the group, and Evaluation Run for the execution.
This clears up the words that have been overlapping: scenarios, cases, attempts, runs, sweeps, and scorecards.
Batch
A selected group of scenarios and cases that we want to evaluate. A batch can be the official scorecard batch, one accuracy-dimension batch, a bug-focused batch, a feedback replay batch, or even one scenario.
defines what we are testing
Evaluation Run
One execution of a batch against a specific Compass build, target backend, criteria set, and K-attempt setting.
creates the conversations and scores
Attempt
One generated Compass conversation for one runnable case in an evaluation run. K=3 means three attempts per case.
the unit we inspect
Scorecard
The published history of evaluation runs over an official scorecard batch. It is neither the full set of historical verdict rows nor a loose dashboard rollup.
the official product report
Engineering note: sweep can stay as the runner/process word in CLI, logs, and table names. The product UI should say batch and evaluation run.
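The relationship between batch, evaluation run, and attempt can be sketched as follows. This is an illustration under assumptions: the class names, field names, case ids, and build value are hypothetical and are not taken from the Compass tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Batch:
    """Defines what we are testing: a named, fixed group of cases."""
    name: str
    case_ids: tuple[str, ...]


@dataclass(frozen=True)
class EvaluationRun:
    """One execution of a batch against a specific build and criteria set."""
    batch: Batch
    build_sha: str
    criteria_set: str
    k_attempts: int

    def attempts(self) -> list[tuple[str, int]]:
        """One (case_id, attempt_number) pair per generated conversation."""
        return [
            (case_id, k)
            for case_id in self.batch.case_ids
            for k in range(1, self.k_attempts + 1)
        ]


batch = Batch("golden-v1", ("sort-accuracy-case", "selection-accuracy-case"))
run = EvaluationRun(batch, build_sha="example-sha", criteria_set="example-criteria", k_attempts=3)
print(len(run.attempts()))  # 2 cases x K=3 attempts = 6
```

Because the batch is frozen, the denominator (cases times K) is fixed before the run starts, which is what makes later scorecard numbers comparable.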
The interface should explain every judgment in plain language.
A reviewer should not need to infer how a score happened. The page should assemble the answer, expectation, evidence, and reason in one place.
This answer failed because it did not do what this scenario expected, and the transcript, trace, or artifact shows why.
Scenario context
What situation or user need are we testing? Which batch, run, feedback row, or organic session created it?
Expected behavior
What good answer shape was required: selection, data, coverage, filters, sort, citations, consistency, or process?
Observed evidence
What did Compass actually say or render, and what do the artifact, trace, snapshot, and verdict rows prove?
Evaluation can use AI, but the judgment must stay bounded.
The trust model should be visible in the interface, not buried in implementation notes.
Keep AI on rails
- deterministic checks come first when code can prove the answer
- AI selects from approved checks; it does not invent standards
- AI judges a specific criterion against specific evidence
- every AI verdict has a written reason and can be reviewed
Make the report honest
- scorecards use fixed batch membership and versioned runs
- denominators are cases and attempts, not loose verdict rows
- health and evaluator failures are shown separately
- published numbers are reproducible point-in-time report cards
Interface implication: the browser should always show whether a judgment came from deterministic code, judge prompt, span assertion, scenario-fit, or human review.
Use a small vocabulary and repeat it everywhere.
The deck should make Compass evaluation readable to product and QA reviewers without requiring anyone to know table names.
The evaluation pieces exist, but the story is scattered.
When a score looks surprising, a reviewer has to jump between the scorecard, run output, verdict rows, transcript, trace, and notes. The interface should put those pieces together.
Today, review feels harder than it should
- the score says pass or fail, but the reason is somewhere else
- the transcript shows what happened, but not what was expected
- the trace shows how Compass ran, but not whether the answer was good
- the evaluation run generated useful test conversations, but they are hard to browse
- health problems can look like product accuracy problems
The conversation browser fixes the shape
- show the user ask and Compass answer in the same place
- show what good behavior was supposed to be
- show the evaluator's assessment in plain language
- separate product failures from run or trace health
- let scorecards and evaluation runs open the exact conversations behind the number
The operating loop is simple: define, produce, judge, review, improve.
A scenario, check, or real conversation defines what to inspect. Compass records evidence. Evaluators write verdicts. Reviewers classify what those verdicts mean.
Scenario, criterion, or organic conversation.
Messages, artifacts, citations, trace, snapshot.
Code checks, bounded judges, span assertions, scenario fit.
Correct verdict, product bug, data gap, check issue.
Punchlist, scenario update, criterion update, product fix.
Key idea: a sweep is the backend process that creates conversations with batch/run metadata. The scorecard summarizes those conversations. Evaluation Review and Conversation Detail let reviewers inspect the evidence behind every number.
The four layers answer different questions.
This is the part the interface should teach: not every evaluation is the same kind of judgment, and not every judgment belongs in the scorecard.
Important: L4 is the whole-assignment judge. It is where the case's expected behavior language becomes the rubric for the full transcript. Organic sessions can receive L1, L2, and L3 analysis, but L4 strict case-fit needs a known scenario/case expectation.
Case fit asks: did Compass do what this scenario was testing?
Earlier layers check parts of the answer. Case fit reads the scenario's expected behavior and judges the whole conversation against that assignment.
What L1-L3 usually use
- L1: code checks on the response before it returns
- L2: criteria selected for this turn, plus artifacts, answer text, and trace evidence
- L3: the conversation so far, to summarize quality state or flag review needs
- These layers can use scenario metadata to choose checks, but they mostly judge specific criteria or session health.
What L4 adds
- the scenario/case expected behavior from the database
- the full case transcript, not just one assistant turn
- a plain-language judge question: did this conversation satisfy this test?
- a scorecard-ready result: pass, fail, or evaluator error for the whole case attempt
Example: L2 can check whether a table is sorted descending and has citations. L4 checks whether the whole answer satisfied the Sort Accuracy scenario. For a broad new-teacher prompt, that means choosing BA starting salary, returning the requested top ranking, including salary values, citations, and denominator language, and avoiding a blocking clarification.
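The L2 half of this example is the kind of check code can prove directly. A minimal sketch, assuming each table row is a dictionary with hypothetical `salary` and `citations` keys (the real artifact shape is not specified in this deck):

```python
def is_sorted_descending(values: list[float]) -> bool:
    """True when each value is greater than or equal to the next (ties allowed)."""
    return all(a >= b for a, b in zip(values, values[1:]))


def every_row_cited(rows: list[dict]) -> bool:
    """True when every row carries at least one citation."""
    return all(row.get("citations") for row in rows)


# Hypothetical rendered table rows.
rows = [
    {"district": "A", "salary": 52000, "citations": ["source-1"]},
    {"district": "B", "salary": 50500, "citations": ["source-2"]},
    {"district": "C", "salary": 48000, "citations": []},
]

print(is_sorted_descending([r["salary"] for r in rows]))
print(every_row_cited(rows))
```

Both checks can pass while the conversation still fails L4, for example if Compass ranked the wrong salary metric. That gap is why case fit reads the whole transcript against the scenario.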
Review the batch first, then open the conversations behind it.
A scorecard batch, an accuracy-dimension batch, or a bug-focused batch is a named group of scenarios/cases. An evaluation run executes that batch and generates the conversations behind the result.
Evaluation runs
Purpose of this evaluation run
Measure how the current Compass build performs against the fixed golden-v1 scorecard batch across the accuracy dimensions.
What this run contains
Each attempt is a generated Compass conversation with known expected behavior, a final case-fit judgment, and run-health evidence.
Conversations in this run
Sort Accuracy · expected BA starting-salary ranking · attempt 1
Selection Accuracy · expected exact district resolution · attempt 2
Coverage-State Labeling · expected not reviewed vs not applicable
Click a row to open the conversation detail view with transcript, L1-L3 analysis, trace, artifacts, and verdict evidence.
Selected conversation preview
Evaluation read
case fit: pass · L2 checks: pass
Scenario-fit is the product gate for generated cases. Lower-level verdicts explain the result; run health says whether the evaluation can be published.
A run explains the group. Conversation detail explains one session.
This keeps the generated evaluation layer distinct from the always-on conversation review layer.
Evaluation Review
Use this when you want to understand a publishable evaluation artifact or a diagnostic run.
- Batch: the selected scenarios/cases, like golden-v1 scorecard, Sort Accuracy, or a bug-regression pack
- Run: one execution of that batch against a Compass build
- Attempt: one generated conversation for one case and one K trial
- Scorecard: the official run over a scorecard batch, publishable only when product score and run health are both known
Conversation Detail
Use this when you want to inspect one conversation deeply.
- actual user/assistant transcript
- L1/L2/L3 analysis that can exist for every conversation
- artifacts, citations, trace, and turn snapshot
- L4 case fit only when this conversation came from a scenario/case run
Naming recommendation: use Batches for selected scenario groups, Evaluation Runs for executions, and keep sweep as the backend/process word. Users should browse batches and runs, not raw sweeps.
A scorecard run needs two answers: did Compass perform, and did the run work cleanly?
Product accuracy and run health should never be collapsed into one percentage.
Product score
Did the generated conversation satisfy the case's expected behavior?
scenario-fit gate · criteria diagnostics
A case attempt passes only when the case-level expectation passes. Generic low-level passes cannot dilute a user-visible failure.
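The gate rule can be sketched in a few lines: the scenario-fit verdict alone decides the attempt, and lower-level verdicts are carried only as diagnostics. Function names are hypothetical. Excluding evaluator errors from the score denominator and reporting them separately is an assumption made here for illustration, not a rule stated in this deck.

```python
from collections import Counter
from typing import Optional


def attempt_outcome(scenario_fit: str, diagnostic_verdicts: list[str]) -> str:
    """Scenario-fit decides the attempt; diagnostics explain it but never override it."""
    if scenario_fit not in ("pass", "fail", "error"):
        raise ValueError(f"unexpected scenario-fit verdict: {scenario_fit}")
    return scenario_fit


def product_score(outcomes: list[str]) -> tuple[Optional[float], Counter]:
    """Score over judged attempts; errors are counted but kept out of the ratio (assumption)."""
    counts = Counter(outcomes)
    judged = counts["pass"] + counts["fail"]
    score = counts["pass"] / judged if judged else None
    return score, counts


outcomes = [
    attempt_outcome("fail", ["pass", "pass", "pass"]),  # low-level passes do not rescue it
    attempt_outcome("pass", ["pass", "fail"]),
    attempt_outcome("error", []),
]
score, counts = product_score(outcomes)
print(score, counts["error"])
```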
Run health
Did the run produce trustworthy evidence?
trace coverage · verdict coverage · eval errors
An unhealthy run can still be useful for debugging, but it should not become the published scorecard without an explicit waiver.
Run health band
all attempts linked
required checks wrote verdicts
judges completed
chat returned valid
publish review needed
not silently green
This is the place for missing traces, zero-verdict valid responses, response validation errors, evaluator exceptions, and warning traces. Those are readiness signals, not Sort Accuracy failures.
One session should be readable without a database tour.
Generated evaluation conversations get strict case fit. Organic conversations still get L1/L2/L3 analysis, but not overconfident scorecard math.
Generated case attempt
- Scenario-fit: final product gate for the whole case
- Expected behavior: loaded from the scenario/case
- Lower-level verdicts: diagnostics that explain the pass/fail
- Evidence: transcript, artifact, trace, snapshot, citations, raw verdict rows
Prompt features detected
This comes from evaluator-only request-feature extraction. It helps reviewers see what the prompt asked for; it must not become a second runtime planner.
Rule: scenario-fit decides generated case success. L1/L2/L3 remain useful for every conversation, but their lower-level passes do not erase a failed expected behavior.
Every failure should become an owned next step.
This is how scorecards, NCTQ feedback, punchlist rows, and evaluator calibration stop living in separate worlds.
golden-v1 attempt 2
feedback replay
dimension run
Owner buckets: product bug, data gap, scenario revision, criterion revision, evaluator issue, expected limitation, or correct verdict. The UI should force that classification before a failure disappears into a percentage.
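A sketch of how the "force classification" rule might be enforced: list every non-passing attempt that has no owner bucket yet. The enum mirrors the buckets named above; the attempt dictionary shape and the `unclassified_failures` helper are hypothetical.

```python
from enum import Enum


class OwnerBucket(Enum):
    PRODUCT_BUG = "product bug"
    DATA_GAP = "data gap"
    SCENARIO_REVISION = "scenario revision"
    CRITERION_REVISION = "criterion revision"
    EVALUATOR_ISSUE = "evaluator issue"
    EXPECTED_LIMITATION = "expected limitation"
    CORRECT_VERDICT = "correct verdict"


def unclassified_failures(attempts: list[dict]) -> list[str]:
    """Return ids of non-passing attempts that still lack an owner bucket."""
    return [
        a["attempt_id"]
        for a in attempts
        if a["outcome"] != "pass" and a.get("owner") is None
    ]


attempts = [
    {"attempt_id": "a1", "outcome": "pass"},
    {"attempt_id": "a2", "outcome": "fail", "owner": OwnerBucket.DATA_GAP},
    {"attempt_id": "a3", "outcome": "fail"},
]
print(unclassified_failures(attempts))
```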
The page should teach people where to look.
A reviewer should be able to learn the evaluation system by reading top-to-bottom and left-to-right.
Left rail: scope
Which evaluation run am I looking at? Golden scorecard, a dimension batch run, a bug-regression batch, or a feedback replay.
This teaches why denominator and membership matter.
Center: judgment
What was expected, what was observed, which criterion failed, and how the system classifies the failure.
This teaches the durable quality sentence.
Right rail: evidence
Transcript, artifact, source evidence, trace, turn snapshot, and the raw verdict reason.
This teaches why the answer is reviewable.
The review path: choose slice -> open conversation -> read expected behavior -> inspect actual answer -> check product and health verdicts -> classify next action.
All conversations can be reviewed. Only known tests get official scores.
The system should distinguish known expectations from inferred expectations.
Generated or golden conversations
- known scenario and case
- known expected behavior
- strict case fit allowed
- eligible for scorecard denominator
- case/attempt/product score is meaningful
Organic user conversations
- task intent is inferred
- quality signals still run
- strict score requires a known expectation or reviewer link
- best default label is "needs review," "pass signal," or a failure family
- use for triage, not public scorecard math
Doctrine: every conversation can get evidence and analysis. Only conversations with known expectations can get strict product scores.
The scorecard is the official report, not a ledger.
The scorecard still exists. It becomes the fixed snapshot over a versioned scorecard batch, with every number drilling into the same conversation case files.
Scorecard summary
Scores link into the exact conversations, not separate report rows.
What the scorecard means
- fixed batch membership defines the public denominator
- scorecard history is just prior runs over scorecard batches
- cases and attempts are plain-language counts
- product score is separate from run and evaluator health
- scenario-fit is the final product gate for generated cases
- every dimension links to conversations, evidence, and review outcomes
Publication rule: a scorecard can be published only when batch membership, build/SHA, criteria set, attempts, final case-fit verdicts, and run-health evidence are present. Otherwise it is a diagnostic run, not the headline scorecard.
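The publication rule reads naturally as a presence check over the six required inputs. A minimal sketch, assuming run metadata arrives as a dictionary; the key names and the `publication_status` helper are hypothetical.

```python
# Hypothetical keys for the six inputs the publication rule requires.
REQUIRED_FIELDS = (
    "batch_membership",
    "build_sha",
    "criteria_set",
    "attempts",
    "case_fit_verdicts",
    "run_health",
)


def publication_status(run: dict) -> tuple[str, list[str]]:
    """Return ("scorecard", []) when everything is present, else ("diagnostic", missing)."""
    missing = [name for name in REQUIRED_FIELDS if not run.get(name)]
    return ("scorecard" if not missing else "diagnostic", missing)


run = {
    "batch_membership": ["case-1", "case-2"],
    "build_sha": "example-sha",
    "criteria_set": "example-criteria",
    "attempts": [("case-1", 1), ("case-2", 1)],
    "case_fit_verdicts": {"case-1": "pass", "case-2": "fail"},
    # run_health is absent, so this run stays diagnostic
}
print(publication_status(run))
```

Returning the list of missing inputs, not only the status, lets the interface say why a run is diagnostic instead of leaving the reviewer to guess.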
Criteria are the standards. Evaluators apply them.
The UI should name the source of each judgment so reviewers know what kind of evidence they are reading.
Deterministic
Code proves something directly: sort order, citation markers, requested limit, response contract, artifact shape.
best for proof
Judge prompt
AI judges a specific qualitative criterion against bounded evidence, such as coverage-state language quality.
best for interpretation
Span assertion
Trace evidence proves the process happened: planner resolved candidates, execution dispatched, persistence succeeded.
best for process
Scenario-fit is separate: it judges the full generated case against its expected behavior. It should be visible on scorecard/evaluation-run conversations, not treated as a generic live user-turn judge.
Behind the scenes, connect the evidence into one read model.
Keep verdicts as check-level evidence. Add snapshots that summarize the conversation for product review.
Existing evidence
chat_messages, traces, snapshots, artifacts, feedback, scenarios, cases, criteria, verdicts, and evaluation-run metadata.
New read model
conversation_analysis_runs and conversation_analysis_snapshots roll this into a product-readable state.
Run and detail views
Scorecards, dimension batch runs, bug batches, feedback replays, and organic traffic all point into the same conversation snapshots.
Integration point: `frontend_contracts.get_debug_backfill()` should stop returning an empty verdict list and start returning conversation review snapshots plus linked verdict evidence.
Connect or retire anything that defines quality somewhere else.
The point is not to add a new page. The point is to remove ambiguity about where quality evidence lives.
Connect into Conversation Review
- conversation/debug pages
- scorecard dimensions
- evaluation-run reports and runsets
- scenario manager
- NCTQ feedback rows
- punchlist owner rows
Retire or demote
- cumulative "Sweep live" as the headline
- verdict rows presented as trials
- debug views without verdict context
- scorecard drilldowns detached from transcripts
- hidden one-off analysis scripts as source of truth
Build it in slices.
Start by reading existing evidence better. Add persistence and reviewer workflows only after the UX proves itself.
1. Read model
Build a `ConversationReviewSnapshot` over messages, verdicts, criteria, traces, scenarios, cases, request features, evaluation runs, and feedback.
2. Detail panel
Show expected behavior, scenario-fit gate, diagnostic verdicts, prompt features, transcript, artifacts, trace, and evidence.
3. Evaluation runs
Add run metadata, batch/build/criteria/K attempts, product score, health band, and publication eligibility.
4. Scorecard linkup
Make the scorecard the official published run and drill every score into the same conversation case file.
5. QA crosswalk
Map failed attempts to expected behavior, actual answer, scenario-fit reason, evidence, and owner bucket.
6. Review loop
Persist reviewer classifications and notes once the read shape proves useful for scorecards, feedback, and punchlist routing.
Make Conversation Review the quality control plane.
This unifies the evaluation work instead of adding another surface people have to reconcile.
One conversation. One evidence file. Many views: organic review, evaluation-run debugging, golden scorecard, feedback triage, and punchlist routing.
What good looks like
Defined by scenarios, criteria, expected behavior, and approved product rules.
How we judge it
Layered validators, bounded judges, case-fit checks, and human review outcomes.
Where we see it
Evaluation Runs for generated batches, and Conversation Detail for the underlying session evidence.