Code automations you can trust, benchmark, and optimize.
Build code agents for tasks like bug fixing, code review, and security scans on a set of powerful primitives. Designed for benchmarking and optimization, with rigorous guardrails that let you deploy smaller models reliably.
The problem
Agent workflows for code are easy to demo and hard to trust. Output quality is unmeasured, regressions are invisible, and you're locked into whichever frontier model was best the week you built it.
How it works
Three pillars
Must-check properties
Every agent primitive carries assertions on its output. Treat them as typed semantic contracts. Outputs are evaluated by an LLM-as-judge and resampled with adapted prompts until the properties hold or the resampling budget is exhausted.
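To make the loop concrete, here is a minimal sketch of check-and-resample, written under assumptions. `MustCheck`, `Result`, and `run_with_must_checks` are hypothetical names for illustration, not the product's API, and the `judge` callable stands in for an LLM-as-judge call.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class MustCheck:
    """A property the output must satisfy (hypothetical type)."""
    name: str
    description: str              # natural-language property shown to the model
    judge: Callable[[str], bool]  # stand-in for an LLM-as-judge call


@dataclass
class Result:
    output: str
    failed: List[str]  # names of checks that still fail; empty on success
    attempts: int


def run_with_must_checks(
    generate: Callable[[str], str],
    prompt: str,
    checks: Sequence[MustCheck],
    budget: int = 3,
) -> Result:
    """Sample until every check holds or the attempt budget is used up."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    current_prompt = prompt
    output = ""
    failed: List[MustCheck] = []
    for attempt in range(1, budget + 1):
        output = generate(current_prompt)
        failed = [c for c in checks if not c.judge(output)]
        if not failed:
            return Result(output, [], attempt)
        # Adapt the prompt: tell the model which properties it violated.
        violated = "\n".join(f"- {c.description}" for c in failed)
        current_prompt = f"{prompt}\n\nYour previous answer violated:\n{violated}"
    # Budget exhausted: return the last output with the failures attached.
    return Result(output, [c.name for c in failed], budget)
```

Returning the failing check names instead of raising keeps every assertion inspectable when the budget runs out, which is one possible design choice among several.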
Native benchmarking
Measure the must-check satisfaction rate alongside your own evaluators so you can compare agent and model changes objectively. All primitives are designed with benchmarking in mind, so you only need to specify the properties of interest.
Smaller, cheaper models often have the intelligence but lack the reliability of frontier models. MustCheck's guardrails let them shine.
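As a rough illustration, the must-check satisfaction rate can be read as the fraction of benchmark runs in which every property held, grouped by model. The sketch below assumes that definition; `satisfaction_rates` and the record shape are hypothetical, not part of the product.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def satisfaction_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the per-model share of runs where all must-checks held.

    Each record is (model_name, all_checks_held) for one benchmark run.
    """
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for model, ok in records:
        total[model] += 1
        passed[model] += int(ok)
    return {model: passed[model] / total[model] for model in total}
```

Comparing two models on the same task set then reduces to comparing two numbers, alongside whatever custom evaluators you add.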
Online prompt optimization
When an environment is too complex to replicate (e.g. SRE tasks) or outputs are hard to score (e.g. bug finding), prompts are improved online using a text-gradient approach driven by both human and must-check feedback.
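One way to picture a text-gradient update: feedback arrives as natural-language critiques, and a rewriting step folds them into the next prompt version. The sketch below is an assumption about the general shape, not the actual optimizer. `PromptVersion` and `text_gradient_step` are hypothetical names, and `rewrite` stands in for the LLM call that would apply the critiques.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PromptVersion:
    version: int
    text: str


def text_gradient_step(
    prompt: PromptVersion,
    feedback: List[str],
    rewrite: Callable[[str, List[str]], str],
) -> PromptVersion:
    """Produce the next prompt version from human and must-check feedback.

    With no feedback there is nothing to update, so the prompt is
    returned unchanged.
    """
    if not feedback:
        return prompt
    return PromptVersion(prompt.version + 1, rewrite(prompt.text, feedback))
```

Keeping each version immutable makes it straightforward to diff, audit, or roll back prompt revisions.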
v17 You are an SRE. Investigate the incident.
Cite logs you actually read.
v18 + Before concluding, attempt to reproduce
+ with a synthetic load matching the
+ traffic shape seen at t-30m.
Model independence
Drop-in model replacement, with the benchmark to back it.
Every primitive is model-agnostic and auto-optimizes its prompts per model. Pair guardrails with cheaper open-weight or local models to cut automation cost without giving up reliability, and use the must-check rate to verify it.
Primitives
Composable, typed, and inspectable.
Reasoning loop with tool access, sandbox, and LSP.
Fan out scoped subtasks with their own assertions.
Collapse overlapping findings for search tasks.
Run a set of actions for each item in a collection.
Combine a collection of items under contracts.
Persist typed code-domain entities across runs.
Citation-aware retrieval scoped to the repository.
Get early access
Tell us what you'd build.
Early access is granted based on signup and fit. Specifics help us prioritize teams whose use case matches what's stable today.
- No frontier-model lock-in
- Inspect every assertion and resample
- Self-hosted available
FAQ