Compass
Research deck / May 22, 2026

Operating model

From Pinocchio to a real boy

How Compass can regain warmth, judgment, and useful follow-up motion without giving up the evidence chain that makes it trustworthy.

Thesis

Plain language answer

A soul layer should be a values layer, not a vibes layer.

The goal is not to let a model free-write over Compass results. The goal is to give the model a bounded packet of approved evidence and ask for structured presentation choices: what context helps, what follow-up questions are useful, and how to sound like NCTQ without inventing NCTQ.

Current system

Why the wooden version exists

Compass became mechanical on purpose.

The active architecture has two LLM stages up front, then deterministic execution, rendering, and quality. That design gave us a clean chain of custody for data values, citations, coverage states, and snapshots.

Preplanner (LLM): Classifies intent and thinking need. Not authoritative.
Planner (LLM): Emits typed route and query shape.
Executor (Python): Fetches approved rows and sources.
Renderer (Python): Formats deterministic markdown and tables.
Quality (Python + AI): Writes post-response verdicts.
Trace (Ledger): Persists snapshots, manifests, and verdicts.

Repo anchor: docs/ARCHITECTURE.md defines the five-stage pipeline and says everything after the Planner is deterministic today.

Product gap

The trust gap

Correct rows are not the same as a satisfying answer.

Compass can be right and still feel wooden: over-caveated, procedural, or blind to why a policy question matters. That matters because NCTQ is not only a database. It is a research and advocacy organization with a point of view.

The data should stay deterministic. The conversation should not sound like an audit log.

Prior QA repeatedly found answers that were basically correct but needed clearer prose, cleaner caveats, and less internal terminology.

Notion synthesis

The soul was already in the acceptance criteria

The punchlist says warmth has to be operational.

The strongest Notion thread is not "make Compass chatty." It is: make Compass fast, trustworthy, table-first, cited, transparent about school-year scope, brief first, deeper on request, and useful for the next turn.

Original charter: Plain-language Q&A, verified calculations, citations, publications, glossary context, and role-aware tone for district leaders, HR teams, researchers, journalists, and advocates.
Response surface contract: New prose feedback should map to answer voice, coverage wording, universe/limit disclosure, source parity, export parity, or table/prose contradiction.
Coverage rules: The system picks coverage wording. The model explains it. Canonical phrases are typed fields, not vibes.
Reviewed copy surfaces: Status lines, "Did you know" cards, and suggested follow-ups are already part of the public personality layer.

Public NCTQ context

What the voice should sound like

NCTQ's public voice is evidence-forward and practical.

The public site frames NCTQ as a nonpartisan research and advocacy organization that roots recommendations in research, creates accountability, shares promising practices, and champions equity. The policy pages translate that into concrete language district leaders can use.

Mission: Every child should have access to an effective teacher; every teacher should have the opportunity to be effective.
Method: Collect unique data, compare it to research and best practice, and develop actionable recommendations.
Compensation: Pay systems can help attract, retain, and strategically place effective teachers.
Evaluation: Strong systems support growth, recognize excellence, and inform workforce decisions.

Approved source tiers

Source control for personality

NCTQ context has tiers. The finalizer has to know which tier it is using.

The Notion guidance page draws the clean source map: TCD data says what districts do; Pathfinder content says what NCTQ recommends and why; the Compass cache is the runtime copy; publications provide research context; punchlist and eval pages define behavior rules, not answer facts.

| Tier | What it can say | Runtime use | Risk if mixed |
| --- | --- | --- | --- |
| TCD / Compass rows | District facts, values, years, citations. | Answer tables and exports. | None if validated. |
| Pathfinder guidance | NCTQ recommendations, rationales, exemplars. | Policy guidance sections. | Stance may impersonate district fact. |
| Publications | Research context, titles, URLs, summaries. | Evidence bundles and next questions. | Invented links or overbroad claims. |
| Punchlist / evals | Behavior rules and acceptance criteria. | Validators and product design. | Should not become answer content. |

Approved content source

Airtable to Compass

The publications source is big enough to matter.

747 rows are currently marked `for_chatbot` in staging `compass.nctq_publications`.
698 approved rows have summaries and key points available for bounded context.
655 rows have data highlights, useful for context notes and suggested follow-ups.

Airtable owns title, author, URL, published date, tags, and inclusion. Compass preserves backfilled summaries, key points, recommendations, data highlights, and AI tags. The Notion publication-bundle epic is clear about the constraint: use this as governed context, not generic RAG. The current chat path does not yet surface those publications.

Verified against the staging `compass` schema on May 22, 2026. Local anchors: sync_publications_airtable.py and docs/ARCHITECTURE.md.

What the source can add

Texture, not authority drift

Publications supply connective tissue.

The high-volume tags in staging align to Compass questions: Teacher Prep, Pay, Evaluation, Retention, Hiring, Teacher Diversity, Observations, Differentiated Pay, and Performance Pay. This is NCTQ-curated institutional memory.

  1. Use publication summaries to explain why a metric matters.
  2. Use recommendations only when the source row was selected and cited.
  3. Use data highlights to suggest next questions, not to replace answer rows.

| Publication | Useful signal | Bounded use |
| --- | --- | --- |
|  | Teacher salaries rose 24% since 2019; rents rose 51%. | Context after salary results. |
|  | Evaluation design matters for staffing decisions. | Next questions after observation tables. |
|  | 10% special-ed pay increase reduced turnover 4 points. | Guidance for hard-to-staff pay. |

Key decision

Second-opinion synthesis

The finalizer should not be a second writer.

If an AI step directly rewrites the final answer, it can soften caveats, move citations, or hide validation failures. The safer pattern is a structured presentation plan that deterministic code applies, with a shadow mode first.

| | Deterministic renderer only | AI writes final prose | AI suggests, renderer applies |
| --- | --- | --- | --- |
| Warmth | Medium | High | High |
| Data safety | High | Lower | High |
| Observability | Clear | Harder | Clear |
| Recommendation | Good first slice | Avoid for launch | Best target architecture |

Evidence packet

What the model sees

Give AI a packet, not a database.

The finalizer should receive the already validated response manifest plus a tiny set of approved context records. It should not be able to search arbitrary tables, call the web, or inspect raw user history beyond the current turn context.

CompassResponsePacket
  user_question
  route
  deterministic_body
  artifact_id
  result_snapshot
  coverage_speech_frame
  coverage_disclosures
  citation_refs[]
  selected_policy_guidance_ids[]
  selected_publication_refs[]
  claim_type_constraints[]
  allowed_followup_intents[]

The model chooses the shelf; it does not write the book. Any output references only IDs already in this packet.
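
As a rough illustration of the "only IDs already in this packet" rule, the sketch below models a subset of the packet fields with a stdlib dataclass. The field types and the `admitted_ids` and `unknown_ids` helpers are assumptions made for this example, not names from the Compass codebase.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CompassResponsePacket:
    """Subset of the packet fields listed above; the types are assumed."""

    user_question: str
    route: str
    deterministic_body: str
    citation_refs: tuple[str, ...] = ()
    selected_policy_guidance_ids: tuple[str, ...] = ()
    selected_publication_refs: tuple[str, ...] = ()

    def admitted_ids(self) -> frozenset[str]:
        # Hypothetical helper: every ID the finalizer is allowed to reference.
        return (
            frozenset(self.citation_refs)
            | frozenset(self.selected_policy_guidance_ids)
            | frozenset(self.selected_publication_refs)
        )


def unknown_ids(packet: CompassResponsePacket, referenced_ids: list[str]) -> list[str]:
    """Return the referenced IDs that are not in the packet, in input order."""
    admitted = packet.admitted_ids()
    return [ref for ref in referenced_ids if ref not in admitted]
```

A non-empty result from `unknown_ids` would be a reason to discard the model output and keep the deterministic response.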

Pydantic AI shape
PresentationPlan
  warmth_level: Literal["plain", "friendly"]
  lead_style: Literal["direct", "contextual"]
  context_note:
    text
    source_tier
    source_ids[]
  suggested_questions[]:
    question
    route_hint
    source_ids[]
  forbidden_changes_detected[]

Pydantic AI fit

Structured output is the safety rail.

Pydantic AI's current best practices match the shape of this problem: typed dependencies for the approved packet, structured outputs for the plan, output validators with retry/fallback, TestModel and FunctionModel tests, and Logfire spans for request/response observability.

Primary docs: structured output, dependencies, testing, Logfire.
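
To make the "validate, else fall back" idea concrete without depending on any library API, here is a minimal stdlib sketch of the PresentationPlan shape. In a real build a Pydantic model would presumably play this role; `parse_plan` is a hypothetical helper, and the rule that a context note must carry at least one source ID is an assumption drawn from the source-ID requirement, not a documented constraint.

```python
from __future__ import annotations

from dataclasses import dataclass

WARMTH_LEVELS = {"plain", "friendly"}
LEAD_STYLES = {"direct", "contextual"}


@dataclass(frozen=True)
class ContextNote:
    text: str
    source_tier: str
    source_ids: tuple[str, ...]


@dataclass(frozen=True)
class SuggestedQuestion:
    question: str
    route_hint: str
    source_ids: tuple[str, ...] = ()


@dataclass(frozen=True)
class PresentationPlan:
    warmth_level: str
    lead_style: str
    context_note: ContextNote | None = None
    suggested_questions: tuple[SuggestedQuestion, ...] = ()
    forbidden_changes_detected: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if self.warmth_level not in WARMTH_LEVELS:
            raise ValueError(f"unknown warmth_level: {self.warmth_level!r}")
        if self.lead_style not in LEAD_STYLES:
            raise ValueError(f"unknown lead_style: {self.lead_style!r}")
        # Assumption: a context note without source IDs is never acceptable.
        if self.context_note is not None and not self.context_note.source_ids:
            raise ValueError("context_note must cite at least one source ID")


def parse_plan(raw: dict) -> PresentationPlan | None:
    """Build a plan from model output; return None so the caller can fall back."""
    try:
        note = raw.get("context_note")
        return PresentationPlan(
            warmth_level=raw["warmth_level"],
            lead_style=raw["lead_style"],
            context_note=None if note is None else ContextNote(
                text=note["text"],
                source_tier=note["source_tier"],
                source_ids=tuple(note["source_ids"]),
            ),
            suggested_questions=tuple(
                SuggestedQuestion(
                    question=q["question"],
                    route_hint=q["route_hint"],
                    source_ids=tuple(q.get("source_ids", ())),
                )
                for q in raw.get("suggested_questions", ())
            ),
            forbidden_changes_detected=tuple(raw.get("forbidden_changes_detected", ())),
        )
    except (KeyError, TypeError, ValueError):
        return None
```

Returning `None` instead of raising keeps the failure path boring: the caller ships the deterministic body and logs the rejected plan.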

Rules of the soul

What remains untouchable

The strings are the guardrails.

  A. The finalizer cannot change route, metric, district, filter, sort, limit, year, row order, or values.
  B. Every source mention must point to an existing citation, policy guidance ID, or publication ID.
  C. Suggestions must be framed as next questions, not hidden recommendations.
  D. If validation fails or latency is high, the deterministic response ships unchanged.
  E. Coverage phrases, source labels, and artifact parity are deterministic surface-contract fields.

A real boy, but still grounded in the floorboards.

That means source-backed warmth, not free-floating personality.

NCTQ stance without drift

The Notion warning label

Data first. Research second. Stance last.

The punchlist shows both failure modes: Compass sometimes omitted NCTQ research context entirely, and sometimes overreached by presenting NCTQ's national stance as a district's policy. The finalizer has to keep the hierarchy visible.

  1. Lead with the validated district or metric result when the user asked for data.
  2. Add research context only from admitted guidance or publication IDs.
  3. Label NCTQ stance as stance, never as a district-specific fact.
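
One way to keep the hierarchy checkable is a small lookup from source tier to the claim types that tier may support, following the source-tier table earlier in the deck. The tier keys and claim-type labels below are hypothetical names chosen for this sketch.

```python
# Which claim types each source tier may back; labels are assumed, not canonical.
ALLOWED_CLAIMS = {
    "tcd_rows": {"district_fact"},
    "pathfinder_guidance": {"nctq_recommendation"},
    "publications": {"research_context"},
    "punchlist_evals": set(),  # behavior rules only, never answer content
}


def claim_allowed(source_tier: str, claim_type: str) -> bool:
    """Return True only when the tier is known and admits this claim type."""
    return claim_type in ALLOWED_CLAIMS.get(source_tier, set())
```

Under this mapping, a district fact backed only by Pathfinder guidance is rejected, which is the overreach failure mode the punchlist describes.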

Voice is useful only when the chain of custody is still visible.

Concrete bans from Notion: no invented nctq.org URLs, no plausible publication titles, no generic "NCTQ recommends" language unless the source and claim type are admitted.
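
The URL ban lends itself to a deterministic check. The sketch below extracts every URL from a piece of text and reports any that are not in the admitted set; the regular expression is deliberately simple, the function name is hypothetical, and the example.org addresses in the usage note are placeholders.

```python
import re

# Simple URL matcher: stops at whitespace and common closing delimiters.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def unadmitted_urls(text: str, admitted_urls: list[str]) -> list[str]:
    """Return URLs found in text that are not in the admitted list."""
    admitted = {url.rstrip("/") for url in admitted_urls}
    found = [
        # Drop trailing sentence punctuation, then a trailing slash.
        match.rstrip(".,;").rstrip("/")
        for match in URL_PATTERN.findall(text)
    ]
    return [url for url in found if url not in admitted]
```

For example, with `["https://example.org/report-a/"]` admitted, the text `"See https://example.org/report-a and (https://example.org/made-up)."` yields one violation, the made-up link.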

Concrete examples

How this would feel in product

Three small moves make the idea tangible.

Salary lookup

Now: rank current salary rows, explain who was not ranked, cite source documents.

Future context: If the packet admits Far from home, add one sentence that housing costs have outpaced beginning salaries in large urban districts. Then suggest: "Want to compare salary with health-benefit costs?"

Observation counts

Now: show districts, observation counts, year, and coverage caveats.

Future context: If the packet admits Districts are facing hard choices, explain why multiple evaluation measures matter. Then suggest: "Which districts use student growth in evaluations?"

Leave and benefits

Now: answer whether districts offer paid parental leave or health coverage, with exact sources.

Future context: If the packet admits Does paid parental leave pay off?, note that 43% of large districts offered paid parental leave in 2024. Then suggest: "Show districts with the strongest parental-leave policies."

User-facing before / after
User: Show me Texas districts with the lowest formal observation counts.

Wooden answer: I found current reviewed numeric data for 8 of 11 cells. The ranking below includes only districts with current numeric values.

With bounded soul: I found the lowest current reviewed observation counts in the Texas districts Compass covers. Because observation policy is part of NCTQ's broader teacher-evaluation work, I am keeping districts without current numeric values visible in the availability note rather than ranking them as zero.

What changed

The facts did not move. The frame did.

The improved version makes the coverage caveat legible, names the relevant policy context, and invites a next question without smuggling in unsupported judgment.

  1. Same rows and rank order.
  2. Same citation markers.
  3. Better explanation of why missing data is handled carefully.

Conversation lubricant

From static examples to contextual next steps

Suggested questions are where personality can be useful and safe.

The current first screen is data-forward: highest starting pay, Dallas ISD starting pay, Chicago vs. Denver starting pay. Notion also shows a reviewed wait-time copy surface with status lines and "Did you know" cards. A soul layer should extend that pattern by generating next questions from the actual route, topic, selected metrics, and approved source IDs.

After salary ranking

"How do these salaries look once health-benefit costs are considered?"

Route hint: execute. Optional source: Affording to stay healthy.

After hard-to-staff pay

"Which districts use differentiated pay for special education?"

Route hint: execute or policy_guidance. Optional source: Beyond one-size-fits-all.

After leave data

"Show me districts with strong paid parental leave."

Route hint: policy_guidance. Optional source: Does paid parental leave pay off?
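
The three examples above can be read as rows in a small reviewed catalog that is filtered per turn. In this sketch the topic keys, the `FollowUp` type, and the `suggest` function are hypothetical; the questions, route hints, and source titles are the ones listed above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FollowUp:
    after_topic: str  # hypothetical key for the answer that just shipped
    question: str
    route_hints: tuple[str, ...]
    optional_source: str


CATALOG = (
    FollowUp("salary_ranking",
             "How do these salaries look once health-benefit costs are considered?",
             ("execute",), "Affording to stay healthy"),
    FollowUp("hard_to_staff_pay",
             "Which districts use differentiated pay for special education?",
             ("execute", "policy_guidance"), "Beyond one-size-fits-all"),
    FollowUp("leave_data",
             "Show me districts with strong paid parental leave.",
             ("policy_guidance",), "Does paid parental leave pay off?"),
)


def suggest(topic: str, allowed_routes: set[str], admitted_sources: set[str]) -> list[dict]:
    """Return follow-ups for this topic whose route is allowed on this turn."""
    suggestions = []
    for item in CATALOG:
        if item.after_topic != topic:
            continue
        routes = [route for route in item.route_hints if route in allowed_routes]
        if not routes:
            continue
        # Attach the source only if the packet admitted it; otherwise omit it.
        source = item.optional_source if item.optional_source in admitted_sources else None
        suggestions.append(
            {"question": item.question, "route_hint": routes[0], "source": source}
        )
    return suggestions
```

The question still appears when its optional source was not admitted; only the source attribution is dropped.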

Implementation path

Four slices

Ship the soul in layers.

1. Deterministic prose pass

Improve lead lines, coverage language, methodology copy, and policy-guidance section order.

2. Guidance and publication resolver

Admit Pathfinder guidance, rationales, exemplars, and `for_chatbot` publications through stable source IDs.

3. Evidence-bundle validators

Prove claim type, source tier, citation support, and surface parity before any polish is applied.

4. Shadow, then user-visible finalizer behind a flag

Run Pydantic AI after response as telemetry first. When enabled, allow only structured presentation plans; deterministic code applies them, revalidates the manifest, and falls back to today's renderer on any mismatch.
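
A minimal sketch of slice 4, assuming three modes ("off", "shadow", "visible") and two injected callables. The mode names, the `respond` function, and the telemetry list are illustrative stand-ins, not the actual flag system or logging setup.

```python
from typing import Callable


def respond(
    deterministic_body: str,
    apply_plan: Callable[[str], str],
    revalidate: Callable[[str], bool],
    mode: str,
    telemetry: list,
) -> str:
    """Return the body to show the user; only 'visible' mode can change it."""
    if mode == "off":
        return deterministic_body
    try:
        candidate = apply_plan(deterministic_body)
        ok = revalidate(candidate)
    except Exception as exc:
        # Any failure in the finalizer path is recorded and never user-visible.
        telemetry.append({"mode": mode, "ok": False, "error": repr(exc)})
        return deterministic_body
    telemetry.append({"mode": mode, "ok": ok})
    if mode == "visible" and ok:
        return candidate
    # Shadow mode, or a failed revalidation: today's renderer output ships.
    return deterministic_body
```

In shadow mode the candidate is computed and scored but discarded, so the telemetry shows how often the finalizer would have passed before any user sees it.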

How to know it worked

Measure the invisible part

Voice quality needs a ledger too.

The Accuracy Framework says every issue needs a layer owner: conversation memory, current-turn contract, data execution, canonical result, validation, or rendering. The soul layer should be evaluated as a rendering and next-turn product layer sitting on top of those existing dimensions.

| Criterion | Deterministic | LLM judge | Human spot check |
| --- | --- | --- | --- |
| Did values/citations change? | Yes | No | Sample |
| Are coverage phrases canonical? | Yes | No | Sample |
| Is the source tier correct? | ID check | Yes | Yes |
| Are next questions relevant? | Route check | Yes | Yes |
| Is NCTQ voice overreaching? | Source ID check | Yes | Yes |

Takeaway

Final frame

Compass becomes more human by becoming more explicit about where judgment comes from.

The soul is not an unbounded model. It is the original Policy Advisor promise made concrete: fast trustworthy tables, exact source trails, clear limits, NCTQ context, and useful next steps. The model can help choose presentation moves. The evidence still decides what can be said.

Research sources

Evidence used

Source notes

Publication pipeline: sync_publications_airtable.py, staging `compass.nctq_publications` schema/counts, and compass_schema.sql
Frontend citation contract: docs/frontend-integration.md and sources.js
Second opinions: Read-only sub-agent review on publication data flow, architecture risk, and presentation narrative. No files edited by sub-agents.