Compass quality scorecard follow-up

The case set is finally complete. The read is stricter.

This follow-up compares the May 26 briefing with the May 28 combined staging ledger. The short version: coverage and traceability are now stronger, while the complete scorecard gives us a more demanding work queue.

Bottom line

Scores shifted as the case set grew. Compass stays launch-ready, with room to improve.

The corrected read is narrower.

Using the same scorecard math, every May 28 dimension lands between 80% and 87%, compared with 83% to 91% on May 26. The case set also grew from 296 to 323 active cases, a net addition of 27.

Cases added by dimension: Selection +12, Sort +6, Data Fidelity +4, Filter +2, Citation +2, Consistency +1. Coverage-State Labeling is unchanged at 26 cases.
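The case-count arithmetic can be re-derived from the per-dimension counts in this comparison. The sketch below is only a cross-check: the dictionary names are illustrative, and the single Coverage count of 26 is assumed to apply to both dates because the note says the case count was unchanged.

```python
# Active cases per dimension, copied from the May 26 and May 28 columns of this briefing.
# Assumption: Coverage-State Labeling has 26 cases on both dates ("same case count").
cases_may26 = {"Sort": 26, "Filter": 27, "Selection": 77, "Coverage": 26,
               "Citation": 37, "Consistency": 28, "Data Fidelity": 75}
cases_may28 = {"Sort": 32, "Filter": 29, "Selection": 89, "Coverage": 26,
               "Citation": 39, "Consistency": 29, "Data Fidelity": 79}

# Cases added per dimension between the two reads.
added = {dim: cases_may28[dim] - cases_may26[dim] for dim in cases_may26}

total_may26 = sum(cases_may26.values())
total_may28 = sum(cases_may28.values())

print(total_may26, total_may28, total_may28 - total_may26)  # 296 323 27
print({dim: n for dim, n in added.items() if n})            # dimensions that gained cases
```

The per-dimension additions sum to 27, which matches the growth from 296 to 323 active cases.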

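Sort Accuracy: 86% on May 26 (26 cases), 83% on May 28 (32 cases). Still solid after adding six cases.

Filter Accuracy: 91% on May 26 (27 cases), 80% on May 28 (29 cases). Largest real drop; constraint handling needs review.

Selection Accuracy: 87% on May 26 (77 cases), 82% on May 28 (89 cases). Down modestly on a much larger case set.

Coverage-State Labeling: 83% on May 26, 85% on May 28 (26 cases on both dates). Improved on the same case count.

Citation Accuracy: 84% on May 26 (37 cases), 86% on May 28 (39 cases). Improved despite adding two cases.

Consistency: 89% on May 26 (28 cases), 87% on May 28 (29 cases). Mostly stable; still worth watching.

Data Fidelity: 83% on May 26 (75 cases), 80% on May 28 (79 cases). Slightly lower and tied for lowest score.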
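The movement per dimension is easier to compare as percentage-point deltas. A minimal sketch, using only the scores listed above (the variable names are illustrative, not taken from the scorecard artifacts):

```python
# (May 26 score, May 28 score) in percent, copied from the comparison above.
scores = {
    "Sort Accuracy": (86, 83),
    "Filter Accuracy": (91, 80),
    "Selection Accuracy": (87, 82),
    "Coverage-State Labeling": (83, 85),
    "Citation Accuracy": (84, 86),
    "Consistency": (89, 87),
    "Data Fidelity": (83, 80),
}

# Percentage-point change from May 26 to May 28.
deltas = {dim: after - before for dim, (before, after) in scores.items()}

largest_drop = min(deltas, key=deltas.get)
improved = sorted(dim for dim, change in deltas.items() if change > 0)

for dim, change in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{dim}: {change:+d} points")
print("Largest drop:", largest_drop)
print("Improved:", improved)
```

Under these numbers, Filter Accuracy has the largest drop at 11 points, and Citation Accuracy and Coverage-State Labeling are the only dimensions that improved, each by 2 points.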

Current May 28 read

The complete scorecard is tighter: every dimension is between 80% and 87%.

Consistency: 87%

Citation: 86%

Coverage: 85%

Sort: 83%

Selection: 82%

Data Fidelity: 80%

Filter: 80%
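The "tighter" claim can be checked by comparing the score band on each date. This is a rough sketch over the scores reported in this briefing; `band` is a hypothetical helper, not part of any Compass tooling.

```python
# Scores in percent for each dimension, copied from this briefing.
may26 = {"Consistency": 89, "Citation": 84, "Coverage": 83, "Sort": 86,
         "Selection": 87, "Data Fidelity": 83, "Filter": 91}
may28 = {"Consistency": 87, "Citation": 86, "Coverage": 85, "Sort": 83,
         "Selection": 82, "Data Fidelity": 80, "Filter": 80}

def band(scores):
    """Return (lowest, highest) score across all dimensions."""
    return min(scores.values()), max(scores.values())

low26, high26 = band(may26)
low28, high28 = band(may28)

# Dimensions sitting at the May 28 floor.
floor = sorted(dim for dim, score in may28.items() if score == low28)

print(f"May 26 band: {low26}% to {high26}% (spread {high26 - low26} points)")
print(f"May 28 band: {low28}% to {high28}% (spread {high28 - low28} points)")
print("At the floor:", floor)
```

The May 28 band is one point narrower than the May 26 band (7 points versus 8), and it sits lower overall, with Data Fidelity and Filter sharing the 80% floor.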

Best current score: 87%. Consistency is strongest under the corrected scorecard math.

Lowest current score: 80%. Data Fidelity and Filter Accuracy are tied at the floor.

Improved: +2 points. Citation and Coverage both improved versus May 26.

Largest real drop: -11 points. Filter Accuracy fell most after the case set expanded.

What we are learning

Most misses are about proof and task fit, not a random quality collapse.

Persistence proof misses: 613. The answer may exist, but the run did not prove it was saved and traceable.

Case satisfaction misses: 589. The response passed some checks but did not fully answer the user's scenario.

Execution trace misses: 159. The system did not always show that the expected execution route happened.

CSV/render parity misses: 140. The exported artifact and rendered table can drift from each other.
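To see how much of the miss volume comes from proof and task fit, the four counts can be turned into shares. This sketch assumes the four categories can be treated as one simple tally; the briefing does not say whether a single run can be counted in more than one category, so the percentages are indicative only.

```python
# Miss counts by category, copied from this briefing.
misses = {
    "Persistence proof": 613,
    "Case satisfaction": 589,
    "Execution trace": 159,
    "CSV/render parity": 140,
}

total = sum(misses.values())

# Share of the tally per category, in percent (assumes categories do not overlap).
shares = {name: round(100 * count / total, 1) for name, count in misses.items()}

proof_and_fit = misses["Persistence proof"] + misses["Case satisfaction"]
proof_and_fit_share = round(100 * proof_and_fit / total, 1)

print(total)                # 1501
print(shares)
print(proof_and_fit_share)  # 80.1
```

On this reading, persistence proof and case satisfaction together account for roughly four in five recorded misses.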

Proof

Make the proof chain first-class.

The biggest risk is not that every answer is wrong. It is that the system does not consistently prove what happened.

Fit

Judge the whole user ask, not only checklist fragments.

Scenario-fit misses are telling us where a technically plausible answer still falls short for the actual case.

Artifacts

Tables, CSVs, and citations need tighter parity.

Visible artifacts are where trust breaks fastest, especially when exports or citations do not match the rendered answer.
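One way to make parity checkable is to diff the exported CSV against the rows the user actually saw. The sketch below is an assumption about how such a check could look, not a description of the Compass evaluator: `parity_diff`, the sample CSV, and the sample rendered rows are all hypothetical.

```python
import csv
import io

def parity_diff(csv_text, rendered_rows):
    """Return (rows only in the CSV export, rows only in the rendered table)."""
    exported = [tuple(cell.strip() for cell in row)
                for row in csv.reader(io.StringIO(csv_text))]
    rendered = [tuple(str(cell).strip() for cell in row) for row in rendered_rows]
    only_csv = [row for row in exported if row not in rendered]
    only_rendered = [row for row in rendered if row not in exported]
    return only_csv, only_rendered

# Hypothetical artifacts: the rendered table shows a stale value for one row.
csv_text = "dimension,score\nFilter Accuracy,80%\nConsistency,87%\n"
rendered_rows = [
    ("dimension", "score"),
    ("Filter Accuracy", "80%"),
    ("Consistency", "89%"),
]

only_csv, only_rendered = parity_diff(csv_text, rendered_rows)
print(only_csv)       # [('Consistency', '87%')]
print(only_rendered)  # [('Consistency', '89%')]
```

A check like this passes only when both lists are empty, which would give a parity miss a concrete, reviewable diff instead of a bare count.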

Behavior

Coverage and uncertainty language still need control.

The model sometimes asks for clarification when it should answer, or answers without enough caveat when coverage is limited.

Appendix

Seven-dimension scorecard — May 28, 2026.

Sort Accuracy: 32/32 | 0 | 96 | 2,771 | 83%

Data Fidelity: 79/79 | 0 | 237 | 5,249 | 80%

Selection Accuracy: 89/89 | 0 | 267 | 6,758 | 82%

Coverage-State Labeling: 26/26 | 0 | 78 | 1,799 | 85%

Filter Accuracy: 29/29 | 0 | 87 | 1,880 | 80%

Citation Accuracy: 39/39 | 0 | 117 | 2,924 | 86%

Consistency: 29/29 | 0 | 87 | 2,908 | 87%
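The appendix figures can be cross-checked against the May 28 numbers in the body of this briefing. The sketch below makes no claim about what the unlabeled appendix columns mean. It only tests three relationships visible in the numbers: the case fractions are complete, the case totals sum to 323, and the third value in each row equals three times that row's case count. The tuple layout and names are illustrative.

```python
# Per dimension: (cases done, cases total, third appendix value, score in percent).
# The remaining appendix values are omitted because the source does not label them.
appendix = {
    "Sort Accuracy": (32, 32, 96, 83),
    "Data Fidelity": (79, 79, 237, 80),
    "Selection Accuracy": (89, 89, 267, 82),
    "Coverage-State Labeling": (26, 26, 78, 85),
    "Filter Accuracy": (29, 29, 87, 80),
    "Citation Accuracy": (39, 39, 117, 86),
    "Consistency": (29, 29, 87, 87),
}

all_complete = all(done == total for done, total, _, _ in appendix.values())
total_cases = sum(total for _, total, _, _ in appendix.values())
third_is_triple = all(third == 3 * total for _, total, third, _ in appendix.values())
score_band = (min(s for *_, s in appendix.values()),
              max(s for *_, s in appendix.values()))

print(all_complete, total_cases, third_is_triple, score_band)
```

All three relationships hold for the values listed here, and the score band matches the 80% to 87% range reported for May 28.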

Sources: May 26 briefing; May 28 combined artifacts at .context/compass/pa-eval-sweeps/20260528T165456Z-combined-scorecard-backfill/combined-scorecard-after.json and combined-scorecard-after.md.