Early access — MustCheck v0.5

Code automations you can trust, benchmark, and optimize.

Build code agents for tasks like bug fixing, code review, and security scans on a set of powerful primitives. Designed for benchmarking and optimization, with rigorous guardrails that let you deploy smaller models reliably.


The problem

Agent workflows for code are easy to demo and hard to trust. Output quality is unmeasured, regressions are invisible, and you're locked into whichever frontier model was best the week you built it.

How it works

Three pillars

Must-check properties

Every agent primitive carries assertions on its output. Treat them as typed semantic contracts. Outputs are evaluated by an LLM-as-judge and resampled with adapted prompts until the properties hold or the budget is exhausted.

output · code_review · attempt 2 / 5

claims ⇒ backed by source-code citation

diff applies cleanly

all tests pass

failed → resampling

Adapting prompt with judge feedback…

claims ⇒ backed by source-code citation

diff applies cleanly

all tests pass

contract satisfied
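
As a rough illustration of the resampling loop described above, here is a minimal Python sketch. It is not MustCheck's API: `run_with_must_checks`, `Check`, `Result`, and the stand-in `fake_model` are hypothetical names, and a simple string test stands in for the LLM-as-judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical types for illustration only; not MustCheck's actual API.
# A check returns None when the property holds, otherwise feedback text.
Check = Callable[[str], Optional[str]]


@dataclass
class Result:
    output: str
    attempts: int
    satisfied: bool


def run_with_must_checks(
    generate: Callable[[str], str],
    checks: List[Check],
    prompt: str,
    budget: int = 5,
) -> Result:
    """Resample until every check passes or the attempt budget runs out."""
    output = ""
    for attempt in range(1, budget + 1):
        output = generate(prompt)
        feedback = [msg for msg in (check(output) for check in checks) if msg]
        if not feedback:
            return Result(output, attempt, True)
        # Adapt the prompt with the collected feedback before the next sample.
        prompt = prompt + "\nFix the following: " + "; ".join(feedback)
    return Result(output, budget, False)


# Stand-in model: it only cites a source once the prompt asks for it.
def fake_model(prompt: str) -> str:
    if "cite" in prompt.lower():
        return "Null deref in parse() [src/parser.py:42]"
    return "Null deref in parse()"


# Stand-in for the judge: a claim counts as backed if it carries a bracketed citation.
def has_citation(output: str) -> Optional[str]:
    if "[" in output and "]" in output:
        return None
    return "cite the source line for each claim"


result = run_with_must_checks(fake_model, [has_citation], "Review this diff.")
print(result)
```

Having each failed check return feedback text means the same check produces both the verdict and the material used to adapt the next prompt.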

Native benchmarking

Measure the must-check satisfaction rate alongside your own evaluators, so you can compare agent and model changes objectively. Every primitive is designed with benchmarking in mind, so you only need to specify the properties of interest.

Chart: must-check satisfaction rate (%) by workflow version, across models.

Smaller, cheaper models are often capable enough for the task, but they lack the reliability of frontier models. MustCheck's guardrails allow them to shine.
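
To make the metric concrete, the sketch below computes a must-check satisfaction rate per model and workflow version from a list of run records. The record layout, the `satisfaction_rates` helper, and the sample runs are assumptions made up for this example, not MustCheck output.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record is (model, workflow_version, contract_satisfied).
# This layout is an assumption made for the example.
Run = Tuple[str, str, bool]


def satisfaction_rates(runs: Iterable[Run]) -> Dict[Tuple[str, str], float]:
    """Fraction of runs whose must-check contract held, per (model, version)."""
    passed: Dict[Tuple[str, str], int] = defaultdict(int)
    total: Dict[Tuple[str, str], int] = defaultdict(int)
    for model, version, satisfied in runs:
        key = (model, version)
        total[key] += 1
        passed[key] += int(satisfied)
    return {key: passed[key] / total[key] for key in total}


# Illustrative records only; not real benchmark results.
runs = [
    ("frontier-api", "v1", True),
    ("frontier-api", "v1", True),
    ("open-weight-8b", "v1", True),
    ("open-weight-8b", "v1", False),
    ("open-weight-8b", "v2", True),
    ("open-weight-8b", "v2", True),
]
rates = satisfaction_rates(runs)
for (model, version), rate in sorted(rates.items()):
    print(f"{model} {version}: {rate:.0%}")
```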

Online prompt optimization

When an environment is too complex to replicate (e.g. SRE tasks) or outputs are hard to score (e.g. bug finding), prompts are improved online using a text-gradient approach driven by both human and must-check feedback.

prompt · investigate_root_cause

v17  You are an SRE. Investigate the incident.
     Cite logs you actually read.
v18  + Before concluding, attempt to reproduce
     + with a synthetic load matching the
     + traffic shape seen at t-30m.

satisfaction 0.71 → 0.88

human · must-check · text-gradient · step 18/40
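
One way to picture a single text-gradient step is sketched below. It is a simplification under stated assumptions: `text_gradient_step`, `append_feedback`, and `toy_score` are hypothetical names, the editor and the scorer are plain functions standing in for LLM calls and live measurements, and the "keep the revision only if the score does not drop" rule is a choice made for this example rather than a description of MustCheck's optimizer.

```python
from typing import Callable, List


def text_gradient_step(
    prompt: str,
    feedback: List[str],
    propose_edit: Callable[[str, List[str]], str],
    score: Callable[[str], float],
) -> str:
    """One step: revise the prompt from textual feedback, and keep the
    revision only if the measured score does not drop."""
    candidate = propose_edit(prompt, feedback)
    return candidate if score(candidate) >= score(prompt) else prompt


# Stand-in for an LLM editor: append each feedback item as a new instruction.
def append_feedback(prompt: str, feedback: List[str]) -> str:
    return prompt + "".join(f"\n{item}" for item in feedback)


# Stand-in for a measured satisfaction rate; the values are toy numbers.
def toy_score(prompt: str) -> float:
    return 0.9 if "reproduce" in prompt else 0.7


v17 = "You are an SRE. Investigate the incident.\nCite logs you actually read."
feedback = ["Before concluding, attempt to reproduce the failure."]
v18 = text_gradient_step(v17, feedback, append_feedback, toy_score)
print(v18)
```

In this sketch, human comments and failed must-check messages would both arrive as entries in `feedback`.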

Model independence

Drop-in model replacement, with the benchmark to back it.

Every primitive is model-agnostic and auto-optimizes its prompts per model. Pair guardrails with cheaper open-weight or local models to cut automation cost without giving up reliability, and use the must-check rate to verify it.

frontier-api · open-weight-8b · open-weight-70b · self-hosted · local-gguf
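
A benchmark-backed swap can be as simple as picking the cheapest model that still clears a satisfaction threshold. The sketch below assumes a hypothetical `ModelReport` record and `cheapest_reliable` helper; the costs and rates are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelReport:
    name: str
    cost_per_run: float       # hypothetical unit cost
    satisfaction_rate: float  # must-check rate from a benchmark run


def cheapest_reliable(reports: List[ModelReport], min_rate: float) -> Optional[ModelReport]:
    """Return the cheapest model whose satisfaction rate meets the threshold."""
    eligible = [r for r in reports if r.satisfaction_rate >= min_rate]
    return min(eligible, key=lambda r: r.cost_per_run, default=None)


# Invented numbers, for illustration only.
reports = [
    ModelReport("frontier-api", 1.00, 0.95),
    ModelReport("open-weight-70b", 0.20, 0.91),
    ModelReport("open-weight-8b", 0.05, 0.78),
]
choice = cheapest_reliable(reports, min_rate=0.90)
print(choice.name if choice else "no model meets the threshold")
```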

Primitives

Composable, typed, and inspectable.

Code agent

Reasoning loop with tool access, sandbox, and LSP.

Sub-agents

Fan out scoped subtasks with their own assertions.

Hierarchical deduplication

Collapse overlapping findings for search tasks.

For Each

Run a set of actions for each item in a collection.

Aggregate

Combine a collection of items under contracts.

Persistent Store

Persist typed code-domain entities across runs.

RAG retrieval

Citation-aware retrieval scoped to the repository.
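
To show how two of these primitives might compose, here is a small sketch of For Each feeding Aggregate, with a contract checked on the combined result. The function names and signatures (`for_each`, `aggregate`, `scan_file`, `merge_unique`, `no_duplicates`) are hypothetical and chosen for this example; they are not MustCheck's API, and `scan_file` returns canned findings in place of an agent call.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")
V = TypeVar("V")


def for_each(items: Iterable[T], action: Callable[[T], U]) -> List[U]:
    """Run one action per item and collect the results in order."""
    return [action(item) for item in items]


def aggregate(
    items: List[U],
    combine: Callable[[List[U]], V],
    contract: Callable[[V], bool],
) -> V:
    """Combine the items, then refuse to return a result that breaks the contract."""
    result = combine(items)
    if not contract(result):
        raise ValueError("aggregate contract violated")
    return result


# Stand-in for an agent call; returns canned findings per file.
def scan_file(path: str) -> List[str]:
    canned = {
        "a.py": ["unused import os"],
        "b.py": ["unused import os", "bare except"],
    }
    return canned.get(path, [])


# Flatten the per-file findings, keeping the first occurrence of each.
def merge_unique(groups: List[List[str]]) -> List[str]:
    seen: List[str] = []
    for group in groups:
        for finding in group:
            if finding not in seen:
                seen.append(finding)
    return seen


def no_duplicates(findings: List[str]) -> bool:
    return len(findings) == len(set(findings))


findings = aggregate(for_each(["a.py", "b.py"], scan_file), merge_unique, no_duplicates)
print(findings)
```

Raising on a violated contract keeps a bad aggregate from reaching the next step; a fuller version would likely resample instead of failing outright.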

code-domain types

Repository · Bug · Error · Log · Support Ticket · Engineering Ticket · Tests · Comments · Review...

Get early access

Tell us what you'd build.

Early access is granted based on signup and fit. Specifics help us prioritize teams whose use case matches what's stable today.

No frontier-model lock-in
Inspect every assertion and resample
Self-hosted available

FAQ

Questions a skeptical engineer would ask.

It's code-native: primitives know about repositories, errors, logs, tests, and tickets, and every step is verified. The output of each primitive is checked against assertions before it leaves the node, rather than hoping the next LLM call cleans up the mess.