Early access — MustCheck v0.5

Code automations you can trust, benchmark, and optimize.

Build code agents for tasks like bug fixing, code review, and security scans on a set of powerful primitives. Designed for benchmarking and optimization, with rigorous guardrails that let you deploy smaller models reliably.

workflows / triage-bug-from-error.mc (running)

  • repository: monorepo/api
  • error: OOM in worker-3
  • infrastructure: k8s · worker-3 node
  • standard_agent: Form root cause hypothesis (sandbox · lsp · observability)
  • for_each hypothesis
  • for_each_agent: Hypothesis validation (sandbox · lsp)
  • sub_agent: Investigate source
  • sub_agent: Investigate logs
  • sub_agent: Investigate deployment
  • aggregation: Compose incident report
  • engineering_ticket: Open INC-4821
  • slack: Alert #incidents

must-check · resampling 2/5
  • claim ⇒ log / metric citation
  • root cause ⇒ reproduced
  • consistent with timeline

The problem

Agent workflows for code are easy to demo and hard to trust. Output quality is unmeasured, regressions are invisible, and you're locked into whichever frontier model was best the week you built it.

How it works

Three pillars

01

Must-check properties

Every agent primitive carries assertions on its output, which act as typed semantic contracts. An LLM-as-judge evaluates each output, and failing outputs are resampled with adapted prompts until the properties hold or the resampling budget is exhausted.

output · code_review · attempt 2 / 5
  • claims ⇒ backed by source-code citation
  • diff applies cleanly
  • all tests pass
failed → resampling
Adapting prompt with judge feedback…
  • claims ⇒ backed by source-code citation
  • diff applies cleanly
  • all tests pass
contract satisfied
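
The check-and-resample loop described above can be sketched in a few lines of Python. This is a minimal illustration, not MustCheck's actual API: `run_with_must_checks`, `Check`, `Result`, the stub model, and the stub judge are all hypothetical names, and a real judge would be an LLM call rather than a string test.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical names throughout; this is not MustCheck's actual API.
# A check returns None when the property holds, else feedback for the next attempt.
Check = Callable[[str], Optional[str]]


@dataclass
class Result:
    output: str
    attempts: int
    satisfied: bool


def run_with_must_checks(
    generate: Callable[[str], str],
    checks: List[Check],
    prompt: str,
    budget: int = 5,
) -> Result:
    """Resample until every check passes or the attempt budget is spent."""
    output = ""
    for attempt in range(1, budget + 1):
        output = generate(prompt)
        feedback = []
        for check in checks:
            message = check(output)
            if message is not None:
                feedback.append(message)
        if not feedback:
            return Result(output, attempt, True)
        # Adapt the prompt with the judge's feedback before sampling again.
        prompt = prompt + "\nFix: " + "; ".join(feedback)
    return Result(output, budget, False)


def fake_model(prompt: str) -> str:
    # Stand-in for a model call: it only adds a citation once told to.
    if "Fix:" in prompt:
        return "Leak in cache [src/cache.py:42]"
    return "Leak in cache"


def has_citation(output: str) -> Optional[str]:
    # Stand-in for an LLM-as-judge check on one property.
    if "[src/" in output:
        return None
    return "back each claim with a source-code citation"


result = run_with_must_checks(fake_model, [has_citation], "Review this diff.")
print(result.attempts, result.satisfied)
```

Returning the last output with `satisfied=False` when the budget runs out, rather than raising, is one possible design choice; it leaves the caller to decide whether an unsatisfied contract should block the workflow.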
02

Native benchmarking

Measure the must-check satisfaction rate alongside your own evaluators to compare agent and model changes objectively. Every primitive is designed with benchmarking in mind, so you only need to specify the properties of interest.

[Chart: must-check satisfaction rate (y, %) per workflow version (x), across models]
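
As a rough sketch of the metric, assume the satisfaction rate is the fraction of runs whose contract ended up satisfied, grouped by model and workflow version. The record shape and the `satisfaction_rates` helper are assumptions made for illustration, and the sample runs are invented toy values, not benchmark results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical run record: (model, workflow_version, contract_satisfied)
Run = Tuple[str, str, bool]


def satisfaction_rates(runs: Iterable[Run]) -> Dict[Tuple[str, str], float]:
    """Fraction of satisfied runs for each (model, workflow version) pair."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    passed: Dict[Tuple[str, str], int] = defaultdict(int)
    for model, version, ok in runs:
        key = (model, version)
        totals[key] += 1
        passed[key] += int(ok)
    return {key: passed[key] / totals[key] for key in totals}


# Toy data for illustration only.
runs = [
    ("open-weight-8b", "v1", True),
    ("open-weight-8b", "v1", False),
    ("open-weight-8b", "v2", True),
    ("open-weight-8b", "v2", True),
    ("frontier-api", "v1", True),
]
rates = satisfaction_rates(runs)
for (model, version), rate in sorted(rates.items()):
    print(f"{model} {version}: {rate:.0%}")
```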

Smaller, cheaper models are often capable enough but lack the reliability of frontier models. MustCheck's guardrails are designed to close that gap.

03

Online prompt optimization

When an environment is too complex to replicate (e.g. SRE tasks) or outputs are hard to score (e.g. bug finding), prompts are improved online using a text-gradient approach driven by both human and must-check feedback.

prompt · investigate_root_cause
v17  You are an SRE. Investigate the incident.
     Cite logs you actually read.
v18  + Before concluding, attempt to reproduce
     + with a synthetic load matching the
     + traffic shape seen at t-30m.
satisfaction: 0.71 → 0.88
human · must-check · text-gradient · step 18/40
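
One way to picture a single optimization step: feedback from humans and must-checks is turned into a proposed prompt edit, and the edit is kept only if the observed score does not drop. The sketch below assumes that acceptance rule; `optimize_step`, `PromptVersion`, and both stub callables are hypothetical, and in practice the edit would come from a model and the score from live satisfaction data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PromptVersion:
    version: int
    text: str


def optimize_step(
    current: PromptVersion,
    feedback: List[str],
    propose_edit: Callable[[str, List[str]], str],
    score: Callable[[str], float],
) -> PromptVersion:
    """One text-gradient step: apply feedback as an edit, keep it if the score holds."""
    if not feedback:
        return current
    candidate = propose_edit(current.text, feedback)
    if score(candidate) >= score(current.text):
        return PromptVersion(current.version + 1, candidate)
    return current


def append_feedback(text: str, feedback: List[str]) -> str:
    # Stand-in for a model that rewrites the prompt from textual feedback.
    return text + "\n" + "\n".join(feedback)


def toy_score(text: str) -> float:
    # Stand-in for a measured satisfaction rate; values are illustrative only.
    return 1.0 if "reproduce" in text else 0.0


v17 = PromptVersion(17, "You are an SRE. Investigate the incident.\nCite logs you actually read.")
v18 = optimize_step(v17, ["Before concluding, attempt to reproduce."], append_feedback, toy_score)
print(v18.version)
```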

Model independence

Drop-in model replacement, with the benchmark to back it.

Every primitive is model-agnostic and auto-optimizes its prompts per model. Pair guardrails with cheaper open-weight or local models to cut automation cost without giving up reliability, and use the must-check rate to verify that reliability holds.

frontier-api · open-weight-8b · open-weight-70b · self-hosted · local-gguf
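
With a must-check satisfaction rate per model in hand, swapping models can become a simple selection problem: take the cheapest model that still clears a reliability bar. The helper below is a sketch under that assumption; `cheapest_reliable_model` is a hypothetical name, and the rates and relative costs are placeholder numbers, not measurements.

```python
from typing import Dict, Optional


def cheapest_reliable_model(
    rates: Dict[str, float],
    cost_per_run: Dict[str, float],
    min_rate: float,
) -> Optional[str]:
    """Return the lowest-cost model whose satisfaction rate meets the threshold."""
    eligible = [
        model
        for model, rate in rates.items()
        if rate >= min_rate and model in cost_per_run
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda model: cost_per_run[model])


# Placeholder values for illustration only.
model_rates = {"frontier-api": 0.95, "open-weight-70b": 0.91, "open-weight-8b": 0.78}
model_costs = {"frontier-api": 1.00, "open-weight-70b": 0.20, "open-weight-8b": 0.05}

choice = cheapest_reliable_model(model_rates, model_costs, min_rate=0.90)
print(choice)
```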

Primitives

Composable, typed, and inspectable.

primitive
Code agent

Reasoning loop with tool access, sandbox, and LSP.

primitive
Sub-agents

Fan out scoped subtasks with their own assertions.

primitive
Hierarchical deduplication

Collapse overlapping findings for search tasks.

primitive
For Each

Run a set of actions for each item in a collection.

primitive
Aggregate

Combine a collection of items under contracts.

primitive
Persistent Store

Persist typed code-domain entities across runs.

primitive
RAG retrieval

Citation-aware retrieval scoped to the repository.

code-domain types
Repository · Bug · Error · Log · Support Ticket · Engineering Ticket · Tests · Comments · Review...
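
To make the composition idea concrete, here is a minimal sketch of a For Each step feeding an Aggregate step, each with its own check. The function names and signatures are assumptions for illustration, not MustCheck's API, and the sketch simply drops failing items where a real primitive would presumably resample them.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def for_each(
    items: Iterable[T],
    action: Callable[[T], R],
    check: Callable[[R], bool],
) -> List[R]:
    """Run `action` on each item and keep results that pass the per-item check."""
    results: List[R] = []
    for item in items:
        result = action(item)
        if check(result):
            results.append(result)
    return results


def aggregate(
    results: List[R],
    combine: Callable[[List[R]], str],
    contract: Callable[[str], bool],
) -> str:
    """Combine results and raise if the combined output breaks its contract."""
    combined = combine(results)
    if not contract(combined):
        raise ValueError("aggregate output violated its contract")
    return combined


# Toy stand-ins for sub-agent calls and judge checks.
hypotheses = ["memory leak in cache", "unbounded queue"]
findings = for_each(
    hypotheses,
    action=lambda h: f"{h}: checked",
    check=lambda finding: finding.endswith("checked"),
)
report = aggregate(findings, "\n".join, contract=lambda text: len(text) > 0)
print(report)
```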

Get early access

Tell us what you'd build.

Early access is granted based on signup and fit. Specifics help us prioritize teams whose use case matches what's stable today.

  • No frontier-model lock-in
  • Inspect every assertion and resample
  • Self-hosted available

FAQ

Questions a skeptical engineer would ask.

It's code-native: primitives know about repositories, errors, logs, tests, and tickets, and every step is verified. The output of each primitive is checked against its assertions before it leaves the node, instead of hoping the next LLM call cleans up the mess.