Operating model
From Pinocchio to a real boy
How Compass can regain warmth, judgment, and useful follow-up motion without giving up the evidence chain that makes it trustworthy.
Before: correct machinery, low warmth. The strings are visible, and sometimes the answer reads like the mechanism.
After: grounded answers with a source-backed voice layer. Still bounded. Still inspectable.
Plain language answer
A soul layer should be a values layer, not a vibes layer.
The goal is not to let a model free-write over Compass results. The goal is to give the model a bounded packet of approved evidence and ask for structured presentation choices: what context helps, what follow-up questions are useful, and how to sound like NCTQ without inventing NCTQ.
Why the wooden version exists
Compass became mechanical on purpose.
The active architecture has two LLM stages up front, then deterministic execution, rendering, and quality. That design gave us a clean chain of custody for data values, citations, coverage states, and snapshots.
Repo anchor: docs/ARCHITECTURE.md defines the five-stage pipeline and says everything after the Planner is deterministic today.
The trust gap
Correct rows are not the same as a satisfying answer.
Compass can be right and still feel wooden: over-caveated, procedural, or blind to why a policy question matters. That matters because NCTQ is not only a database. It is a research and advocacy organization with a point of view.
The data should stay deterministic. The conversation should not sound like an audit log.
Prior QA repeatedly found answers that were basically correct but needed clearer prose, cleaner caveats, and less internal terminology.
The soul was already in the acceptance criteria
The punchlist says warmth has to be operational.
The strongest Notion thread is not "make Compass chatty." It is: make Compass fast, trustworthy, table-first, cited, transparent about school-year scope, brief first, deeper on request, and useful for the next turn.
What the voice should sound like
NCTQ's public voice is evidence-forward and practical.
The public site frames NCTQ as a nonpartisan research and advocacy organization that roots recommendations in research, creates accountability, shares promising practices, and champions equity. The policy pages translate that into concrete language district leaders can use.
Source control for personality
NCTQ context has tiers. The finalizer has to know which tier it is using.
The Notion guidance page draws the clean source map: TCD data says what districts do; Pathfinder content says what NCTQ recommends and why; the Compass cache is the runtime copy; publications provide research context; punchlist and eval pages define behavior rules, not answer facts.
Airtable to Compass
The publications source is big enough to matter.
Airtable owns title, author, URL, published date, tags, and inclusion. Compass preserves backfilled summaries, key points, recommendations, data highlights, and AI tags. The Notion publication-bundle epic is clear about the constraint: use this as governed context, not generic RAG. The current chat path does not yet surface those publications.
Verified against the staging `compass` schema on May 22, 2026. Local anchors: sync_publications_airtable.py and docs/ARCHITECTURE.md.
Texture, not authority drift
Publications supply connective tissue.
The high-volume tags in staging align to Compass questions: Teacher Prep, Pay, Evaluation, Retention, Hiring, Teacher Diversity, Observations, Differentiated Pay, and Performance Pay. This is NCTQ-curated institutional memory.
1. Use publication summaries to explain why a metric matters.
2. Use recommendations only when the source row was selected and cited.
3. Use data highlights to suggest next questions, not to replace answer rows.
Second-opinion synthesis
The finalizer should not be a second writer.
If an AI step directly rewrites the final answer, it can soften caveats, move citations, or hide validation failures. The safer pattern is a structured presentation plan that deterministic code applies, with a shadow mode first.
What the model sees
Give AI a packet, not a database.
The finalizer should receive the already validated response manifest plus a tiny set of approved context records. It should not be able to search arbitrary tables, call the web, or inspect raw user history beyond the current turn context.
CompassResponsePacket
  user_question
  route
  deterministic_body
  artifact_id
  result_snapshot
  coverage_speech_frame
  coverage_disclosures
  citation_refs[]
  selected_policy_guidance_ids[]
  selected_publication_refs[]
  claim_type_constraints[]
  allowed_followup_intents[]
The model chooses the shelf; it does not write the book. Any output references only IDs already in this packet.
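A minimal sketch of the packet as a read-only Python structure, using only the standard library. The field names come from the list above; the field types, the defaults, and the helper names `admitted_ids` and `unknown_ids` are assumptions made for illustration, not the Compass implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompassResponsePacket:
    """Bounded evidence packet for the finalizer. Field types are assumed."""
    user_question: str
    route: str
    deterministic_body: str
    artifact_id: str
    result_snapshot: tuple = ()
    coverage_speech_frame: str = ""
    coverage_disclosures: tuple = ()
    citation_refs: tuple = ()
    selected_policy_guidance_ids: tuple = ()
    selected_publication_refs: tuple = ()
    claim_type_constraints: tuple = ()
    allowed_followup_intents: tuple = ()

    def admitted_ids(self) -> frozenset:
        """Every source ID the model is allowed to reference (hypothetical helper)."""
        return (
            frozenset(self.citation_refs)
            | frozenset(self.selected_policy_guidance_ids)
            | frozenset(self.selected_publication_refs)
        )


def unknown_ids(packet: CompassResponsePacket, referenced_ids) -> set:
    """Return IDs the model referenced that are not present in the packet."""
    return set(referenced_ids) - packet.admitted_ids()
```

An empty result from `unknown_ids` means the output stayed on the shelf; any non-empty result is grounds to discard the model output.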
PresentationPlan
  warmth_level: Literal["plain", "friendly"]
  lead_style: Literal["direct", "contextual"]
  context_note:
    text
    source_tier
    source_ids[]
  suggested_questions[]:
    question
    route_hint
    source_ids[]
  forbidden_changes_detected[]
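The plan shape above could be written as typed Python roughly as follows. This is a standard-library sketch, so the `Literal` choices are enforced by hand in `__post_init__`; a Pydantic model would validate them automatically. Field types, defaults, and the class names `ContextNote` and `SuggestedQuestion` are assumptions.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Tuple, get_args

WarmthLevel = Literal["plain", "friendly"]
LeadStyle = Literal["direct", "contextual"]


@dataclass(frozen=True)
class ContextNote:
    text: str
    source_tier: str
    source_ids: Tuple[str, ...]


@dataclass(frozen=True)
class SuggestedQuestion:
    question: str
    route_hint: str
    source_ids: Tuple[str, ...] = ()


@dataclass(frozen=True)
class PresentationPlan:
    warmth_level: str = "plain"
    lead_style: str = "direct"
    context_note: Optional[ContextNote] = None
    suggested_questions: Tuple[SuggestedQuestion, ...] = ()
    forbidden_changes_detected: Tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Dataclasses do not enforce Literal types, so check the choices here.
        if self.warmth_level not in get_args(WarmthLevel):
            raise ValueError(f"unsupported warmth_level: {self.warmth_level!r}")
        if self.lead_style not in get_args(LeadStyle):
            raise ValueError(f"unsupported lead_style: {self.lead_style!r}")
```

Keeping the plan to enumerated choices and ID lists, with free text only in `context_note.text` and `question`, limits how much model-written prose can reach the user.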
Pydantic AI fit
Structured output is the safety rail.
Pydantic AI's current best practices match the shape of this problem: typed dependencies for the approved packet, structured outputs for the plan, output validators with retry/fallback, TestModel and FunctionModel tests, and Logfire spans for request/response observability.
Primary docs: structured output, dependencies, testing, Logfire.
What remains untouchable
The strings are the guardrails.
A. The finalizer cannot change route, metric, district, filter, sort, limit, year, row order, or values.
B. Every source mention must point to an existing citation, policy guidance ID, or publication ID.
C. Suggestions must be framed as next questions, not hidden recommendations.
D. If validation fails or latency is high, the deterministic response ships unchanged.
E. Coverage phrases, source labels, and artifact parity are deterministic surface-contract fields.
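Rules A and D above can be sketched as a comparison over locked fields followed by a fallback decision. The dictionary keys, the `rows` representation, and the 800 ms latency budget are hypothetical placeholders, not values taken from the Compass codebase.

```python
# Fields the finalizer must never alter. Key names are assumed for illustration.
LOCKED_FIELDS = ("route", "metric", "district", "filter", "sort", "limit", "year", "rows")


def changed_locked_fields(before: dict, after: dict) -> list:
    """List locked fields whose value differs between the two responses.

    Comparing `rows` as an ordered list covers both row order and cell values.
    """
    return [name for name in LOCKED_FIELDS if before.get(name) != after.get(name)]


def choose_response(deterministic: dict, candidate: dict,
                    latency_ms: float, budget_ms: float = 800.0) -> dict:
    """Ship the candidate only if it is on time and left locked fields alone."""
    if latency_ms > budget_ms:
        return deterministic
    if changed_locked_fields(deterministic, candidate):
        return deterministic
    return candidate
```

The fallback is the unmodified deterministic response object, so a failed or slow finalizer run is invisible to the user.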
A real boy, but still grounded in the floorboards.
That means source-backed warmth, not free-floating personality.
The Notion warning label
Data first. Research second. Stance last.
The punchlist shows both failure modes: Compass sometimes omitted NCTQ research context entirely, and sometimes overreached by presenting NCTQ's national stance as a district's policy. The finalizer has to keep the hierarchy visible.
1. Lead with the validated district or metric result when the user asked for data.
2. Add research context only from admitted guidance or publication IDs.
3. Label NCTQ stance as stance, never as a district-specific fact.
Voice is useful only when the chain of custody is still visible.
Concrete bans from Notion: no invented nctq.org URLs, no plausible publication titles, no generic "NCTQ recommends" language unless the source and claim type are admitted.
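Two of these bans lend themselves to simple text checks. The sketch below flags any URL that is not in the admitted set and flags "NCTQ recommends" phrasing when no recommendation claim type is admitted. The claim-type label `"recommendation"` and both function names are assumptions, and the example URLs in the usage note are placeholders. Detecting an invented publication title would need a lookup against admitted titles and is not covered here.

```python
import re

# Matches http(s) URLs up to the next whitespace, bracket, or parenthesis.
URL_PATTERN = re.compile(r"https?://[^\s<>()]+")


def unadmitted_urls(text: str, admitted_urls: set) -> list:
    """Return URLs in the text that are not in the admitted source set."""
    found = (url.rstrip(".,;:!?") for url in URL_PATTERN.findall(text))
    return [url for url in found if url not in admitted_urls]


def stance_phrase_allowed(text: str, admitted_claim_types: set) -> bool:
    """Allow 'NCTQ recommends' only when a recommendation claim is admitted."""
    if "nctq recommends" not in text.lower():
        return True
    return "recommendation" in admitted_claim_types
```

For example, `unadmitted_urls("See https://example.org/made-up", {"https://example.org/report-a"})` returns the made-up URL, which would be grounds to reject the context note.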
How this would feel in product
Three small moves make the idea tangible.
Salary lookup
Now: rank current salary rows, explain who was not ranked, cite source documents.
Future context: If the packet admits "Far from home", add one sentence that housing costs have outpaced beginning salaries in large urban districts. Then suggest: "Want to compare salary with health-benefit costs?"
Observation counts
Now: show districts, observation counts, year, and coverage caveats.
Future context: If the packet admits "Districts are facing hard choices", explain why multiple evaluation measures matter. Then suggest: "Which districts use student growth in evaluations?"
Leave and benefits
Now: answer whether districts offer paid parental leave or health coverage, with exact sources.
Future context: If the packet admits "Does paid parental leave pay off?", note that 43% of large districts offered paid parental leave in 2024. Then suggest: "Show districts with the strongest parental-leave policies."
User: Show me Texas districts with the lowest formal observation counts.
Before: I found current reviewed numeric data for 8 of 11 cells. The ranking below includes only districts with current numeric values.
After: I found the lowest current reviewed observation counts in the Texas districts Compass covers. Because observation policy is part of NCTQ's broader teacher-evaluation work, I am keeping districts without current numeric values visible in the availability note rather than ranking them as zero.
What changed
The facts did not move. The frame did.
The improved version makes the coverage caveat legible, names the relevant policy context, and invites a next question without smuggling in unsupported judgment.
1. Same rows and rank order.
2. Same citation markers.
3. Better explanation of why missing data is handled carefully.
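The first two points can be checked mechanically before a reframed answer ships. This sketch assumes citation markers look like `[1]` and that rows are available as ordered sequences; both are assumptions about the rendered surface, and the function names are hypothetical.

```python
import re

# Assumed marker format: a number in square brackets, e.g. [1].
CITATION_MARKER = re.compile(r"\[\d+\]")


def same_citations(before_text: str, after_text: str) -> bool:
    """True when both texts carry the same markers in the same order."""
    return CITATION_MARKER.findall(before_text) == CITATION_MARKER.findall(after_text)


def same_rows(before_rows, after_rows) -> bool:
    """True when rows match in content and rank order."""
    return list(before_rows) == list(after_rows)


def facts_unchanged(before_rows, after_rows, before_text: str, after_text: str) -> bool:
    """The frame may change; rows and citation markers may not."""
    return same_rows(before_rows, after_rows) and same_citations(before_text, after_text)
```

A new lead sentence placed ahead of the table passes this check; a dropped or reordered marker does not.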
From static examples to contextual next steps
Suggested questions are where personality can be useful and safe.
The current first screen is data-forward: highest starting pay, Dallas ISD starting pay, Chicago vs. Denver starting pay. Notion also shows a reviewed wait-time copy surface with status lines and "Did you know" cards. A soul layer should extend that pattern by generating next questions from the actual route, topic, selected metrics, and approved source IDs.
After salary ranking
"How do these salaries look once health-benefit costs are considered?"
Route hint: execute. Optional source: "Affording to stay healthy".
After hard-to-staff pay
"Which districts use differentiated pay for special education?"
Route hint: execute or policy_guidance. Optional source: "Beyond one-size-fits-all".
After leave data
"Show me districts with strong paid parental leave."
Route hint: policy_guidance. Optional source: "Does paid parental leave pay off?"
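One way to keep generated suggestions safe is to filter them against what the current turn actually admits. In this sketch a suggestion survives only if its route hint is allowed and every source it cites is already admitted. The `Suggestion` type, the function name, and the ID strings in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Suggestion:
    question: str
    route_hint: str
    source_ids: Tuple[str, ...] = ()


def admissible_suggestions(suggestions, allowed_routes: set, admitted_ids: set) -> list:
    """Keep suggestions with an allowed route and only admitted sources."""
    return [
        s for s in suggestions
        if s.route_hint in allowed_routes and set(s.source_ids) <= admitted_ids
    ]
```

A suggestion with no `source_ids` passes the source check, which matches the "Optional source" framing in the cards above. A suggestion citing an ID such as `"pub-unknown"` is dropped when that ID is not in the admitted set.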
Four slices
Ship the soul in layers.
1. Deterministic prose pass
Improve lead lines, coverage language, methodology copy, and policy-guidance section order.
2. Guidance and publication resolver
Admit Pathfinder guidance, rationales, exemplars, and `for_chatbot` publications through stable source IDs.
3. Evidence-bundle validators
Prove claim type, source tier, citation support, and surface parity before any polish is applied.
4. Shadow, then user-visible finalizer behind a flag
Run Pydantic AI after the deterministic response is built, at first as telemetry only. Once the flag is enabled, allow only structured presentation plans: deterministic code applies them, revalidates the manifest, and falls back to today's renderer on any mismatch.
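The fourth slice can be sketched as a three-state flag. In shadow mode the candidate is built, validated, and logged, but the deterministic text always ships. The mode names, the callable parameters, and the log event shape are assumptions for illustration.

```python
from enum import Enum
from typing import Callable


class FinalizerMode(Enum):
    OFF = "off"
    SHADOW = "shadow"    # run and log, never show
    VISIBLE = "visible"  # show only when validation passes


def finalize(deterministic_text: str,
             build_candidate: Callable[[str], str],
             validate: Callable[[str, str], bool],
             mode: FinalizerMode,
             log: Callable[[dict], None]) -> str:
    """Return the text to ship, falling back to the deterministic text."""
    if mode is FinalizerMode.OFF:
        return deterministic_text
    try:
        candidate = build_candidate(deterministic_text)
        valid = validate(deterministic_text, candidate)
    except Exception as exc:  # any finalizer failure means fallback
        log({"event": "finalizer_error", "mode": mode.value, "error": repr(exc)})
        return deterministic_text
    log({"event": "finalizer_result", "mode": mode.value, "valid": valid})
    if mode is FinalizerMode.VISIBLE and valid:
        return candidate
    return deterministic_text
```

Because shadow mode records the same validation result the visible mode would act on, the logs show how often the finalizer would have been accepted before any user sees its output.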
Measure the invisible part
Voice quality needs a ledger too.
The Accuracy Framework says every issue needs a layer owner: conversation memory, current-turn contract, data execution, canonical result, validation, or rendering. The soul layer should be evaluated as a rendering and next-turn product layer sitting on top of those existing dimensions.
Final frame
Compass becomes more human by becoming more explicit about where judgment comes from.
The soul is not an unbounded model. It is the original Policy Advisor promise made concrete: fast trustworthy tables, exact source trails, clear limits, NCTQ context, and useful next steps. The model can help choose presentation moves. The evidence still decides what can be said.
Evidence used