
Compass quality scorecard follow-up

The case set is finally complete. The read is stricter.

This follow-up compares the May 26 briefing with the May 28 combined staging ledger. The short version: coverage and traceability are now stronger, while the complete scorecard gives us a more demanding work queue.

Bottom line

Scores shifted as the case set grew — Compass stays launch-ready, with room to improve.

The corrected read is narrower.
Using the same scorecard math, May 28 lands between 80% and 87%. The case set also grew from 296 to 323 active cases.
Cases added by dimension: Selection +12, Sort +6, Data Fidelity +4, Filter +2, Citation +2, Consistency +1. Coverage-State Labeling is unchanged at 26.

| Dimension | Last score | This score | Cases then | Cases now | Plain read |
|---|---|---|---|---|---|
| Sort Accuracy | 86% | 83% | 26 | 32 | Still solid after adding six cases. |
| Filter Accuracy | 91% | 80% | 27 | 29 | Largest real drop; constraint handling needs review. |
| Selection Accuracy | 87% | 82% | 77 | 89 | Down modestly on a much larger case set. |
| Coverage-State Labeling | 83% | 85% | 26 | 26 | Improved on the same case count. |
| Citation Accuracy | 84% | 86% | 37 | 39 | Improved despite adding two cases. |
| Consistency | 89% | 87% | 28 | 29 | Mostly stable; still worth watching. |
| Data Fidelity | 83% | 80% | 75 | 79 | Slightly lower and tied for lowest score. |
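
The deltas and case totals quoted in this briefing can be re-derived from the comparison table. The sketch below is a minimal illustration, not the scorecard tooling itself: the `SCORECARD` dictionary and the `summarize` helper are hypothetical names, and the values are hand-transcribed from the table. Because the published scores are rounded percentages, the deltas are in whole percentage points.

```python
# Values transcribed by hand from the comparison table (not read from the ledger).
# Layout per dimension: (May 26 score %, May 28 score %, cases then, cases now)
SCORECARD = {
    "Sort Accuracy": (86, 83, 26, 32),
    "Filter Accuracy": (91, 80, 27, 29),
    "Selection Accuracy": (87, 82, 77, 89),
    "Coverage-State Labeling": (83, 85, 26, 26),
    "Citation Accuracy": (84, 86, 37, 39),
    "Consistency": (89, 87, 28, 29),
    "Data Fidelity": (83, 80, 75, 79),
}


def summarize(scorecard):
    """Derive score deltas, case growth, and the headline figures."""
    score_delta = {d: now - then for d, (then, now, _, _) in scorecard.items()}
    cases_added = {d: c_now - c_then for d, (_, _, c_then, c_now) in scorecard.items()}
    current = {d: now for d, (_, now, _, _) in scorecard.items()}
    largest_drop = min(score_delta, key=score_delta.get)
    return {
        "score_delta": score_delta,
        "cases_added": cases_added,
        "cases_then": sum(v[2] for v in scorecard.values()),
        "cases_now": sum(v[3] for v in scorecard.values()),
        "floor": min(current.values()),
        "ceiling": max(current.values()),
        "largest_drop": (largest_drop, score_delta[largest_drop]),
        "improved": sorted(d for d, delta in score_delta.items() if delta > 0),
    }


summary = summarize(SCORECARD)
print(summary["cases_then"], "->", summary["cases_now"])   # 296 -> 323
print(summary["floor"], "to", summary["ceiling"])          # 80 to 87
print(summary["largest_drop"])                             # ('Filter Accuracy', -11)
print(summary["improved"])
```

Running it reproduces the figures used in this briefing: 296 to 323 active cases, a current range of 80% to 87%, and Filter Accuracy as the largest drop at 11 points.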

Current May 28 read

The complete scorecard is tighter: every dimension is between 80% and 87%.

| Dimension | Score |
|---|---|
| Consistency | 87% |
| Citation | 86% |
| Coverage | 85% |
| Sort | 83% |
| Selection | 82% |
| Data Fidelity | 80% |
| Filter | 80% |

Best current score: 87%. Consistency is strongest under the corrected scorecard math.
Lowest current score: 80%. Data Fidelity and Filter Accuracy are tied at the floor.
Improved: +2 points. Citation and Coverage both improved versus May 26.
Largest real drop: -11 points. Filter Accuracy fell most after the case set expanded.

What we are learning

Most misses are about proof and task fit, not a random quality collapse.

| Miss type | Count | What it means |
|---|---|---|
| Persistence proof misses | 613 | The answer may exist, but the run did not prove it was saved and traceable. |
| Case satisfaction misses | 589 | The response passed some checks but did not fully answer the user's scenario. |
| Execution trace misses | 159 | The system did not always show that the expected execution route happened. |
| CSV/render parity misses | 140 | The exported artifact and rendered table can drift from each other. |

Proof

Make the proof chain first-class.

The biggest risk is not that every answer is wrong. It is that the system does not consistently prove what happened.

Fit

Judge the whole user ask, not only checklist fragments.

Scenario-fit misses are telling us where a technically plausible answer still falls short for the actual case.

Artifacts

Tables, CSVs, and citations need tighter parity.

Visible artifacts are where trust breaks fastest, especially when exports or citations do not match the rendered answer.
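
One way to make parity checkable is to compare the exported CSV against the rows of the rendered table, cell by cell. The sketch below is a minimal illustration under stated assumptions: `parity_diff` is a hypothetical helper, the rendered table is assumed to be available as a list of rows, and the sample data includes a deliberately wrong cell to show a drift. It is not how Compass performs this check.

```python
import csv
import io


def parity_diff(csv_text, rendered_rows):
    """Return (row index, exported row, rendered row) for every row that differs.

    A row missing on one side is reported with None for that side.
    """
    reader = csv.reader(io.StringIO(csv_text))
    exported = [[cell.strip() for cell in row] for row in reader if row]
    rendered = [[str(cell).strip() for cell in row] for row in rendered_rows]

    mismatches = []
    for i in range(max(len(exported), len(rendered))):
        e = exported[i] if i < len(exported) else None
        r = rendered[i] if i < len(rendered) else None
        if e != r:
            mismatches.append((i, e, r))
    return mismatches


# Illustrative data only: the 81% cell is a deliberate drift, not a real score.
csv_text = "Dimension,Score\nSort Accuracy,83%\nFilter Accuracy,80%\n"
rendered = [
    ["Dimension", "Score"],
    ["Sort Accuracy", "83%"],
    ["Filter Accuracy", "81%"],
]

for index, exported_row, rendered_row in parity_diff(csv_text, rendered):
    print(f"row {index}: export={exported_row} rendered={rendered_row}")
```

An empty result means the two artifacts agree row for row. Anything else points at the exact row where the export and the rendered answer diverge.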

Behavior

Coverage and uncertainty language still need control.

The model sometimes asks for clarification when it should answer, or answers without enough caveat when coverage is limited.

Appendix

Seven-dimension scorecard — May 28, 2026.

| Dimension | Cases | Missing | Sessions | Verdicts | Score |
|---|---|---|---|---|---|
| Sort Accuracy | 32/32 | 0 | 96 | 2,771 | 83% |
| Data Fidelity | 79/79 | 0 | 237 | 5,249 | 80% |
| Selection Accuracy | 89/89 | 0 | 267 | 6,758 | 82% |
| Coverage-State Labeling | 26/26 | 0 | 78 | 1,799 | 85% |
| Filter Accuracy | 29/29 | 0 | 87 | 1,880 | 80% |
| Citation Accuracy | 39/39 | 0 | 117 | 2,924 | 86% |
| Consistency | 29/29 | 0 | 87 | 2,908 | 87% |
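
The appendix rows can be sanity-checked for internal consistency. In the sketch below, `APPENDIX` and `check_appendix` are hypothetical names and the values are hand-transcribed from the table. The `sessions_per_case=3` default is an inference from the numbers (every row shows exactly three sessions per case); the source does not state it as a rule.

```python
# Values transcribed by hand from the appendix table.
# Layout: (dimension, cases scored, cases expected, missing, sessions, verdicts)
APPENDIX = [
    ("Sort Accuracy", 32, 32, 0, 96, 2771),
    ("Data Fidelity", 79, 79, 0, 237, 5249),
    ("Selection Accuracy", 89, 89, 0, 267, 6758),
    ("Coverage-State Labeling", 26, 26, 0, 78, 1799),
    ("Filter Accuracy", 29, 29, 0, 87, 1880),
    ("Citation Accuracy", 39, 39, 0, 117, 2924),
    ("Consistency", 29, 29, 0, 87, 2908),
]


def check_appendix(rows, sessions_per_case=3):
    """Return (problems, totals) for the appendix rows.

    sessions_per_case is an assumption inferred from the table, not a documented rule.
    """
    problems = []
    for name, scored, expected, missing, sessions, _verdicts in rows:
        if scored + missing != expected:
            problems.append(f"{name}: scored + missing != expected")
        if sessions != scored * sessions_per_case:
            problems.append(f"{name}: sessions != {sessions_per_case} x cases")
    totals = {
        "cases": sum(r[1] for r in rows),
        "sessions": sum(r[4] for r in rows),
        "verdicts": sum(r[5] for r in rows),
    }
    return problems, totals


problems, totals = check_appendix(APPENDIX)
print(problems)  # [] when every row is internally consistent
print(totals)    # {'cases': 323, 'sessions': 969, 'verdicts': 24289}
```

The case total matches the 323 active cases reported in the bottom line, and no dimension has missing cases.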

Sources: May 26 briefing; May 28 combined artifacts at .context/compass/pa-eval-sweeps/20260528T165456Z-combined-scorecard-backfill/combined-scorecard-after.json and combined-scorecard-after.md.