A new open-source project called Statewright is tackling the biggest pain point of AI coding agents, reliability, by constraining not what the model thinks but what tools it can reach at each phase of a workflow. The approach is strikingly simple: replace the open-ended tool free-for-all with a state machine that dictates exactly which tools the agent can see and use at each step.
The results speak for themselves. On a 5-task SWE-bench subset, two local models each went from 2 of 10 passing attempts to 10 of 10, a fivefold improvement, with zero changes to the models themselves.
The Problem: 40 Tools, One Agent, Zero Structure
Anyone who has watched an AI coding agent at work has seen the failure pattern. Give Claude Code or Codex a bug report and 40+ tools, and the agent often:
- Re-reads the same file five times
- Calls edit tools during its analysis phase
- Deploys code before tests pass
- Gets stuck in “read-loop death spirals”
- Calls destructive shell operations without guardrails
The conventional response is to use a bigger model or write a longer system prompt. This helps at the margins but doesn’t fix the root cause: the model is being asked to self-regulate its own tool use, and it’s not good at it. The Forge guardrails framework addresses the same problem from a different angle.
Observability tools tell you what went wrong after the fact. They don’t prevent it.
The State Machine Approach: Structure Beats Reasoning
Statewright’s insight is elegant: instead of making the model bigger, make the problem space smaller.
“Agents are suggestions, states are laws.” — Statewright README
The system defines a workflow as a state machine. Each state has:
- Allowed tools — the model can only see and call tools permitted in the current phase
- Tool restrictions — fine-grained limits on edits, command execution, and file access
- Transitions — conditions that move the workflow to the next state
- Guards — programmatic checks (e.g., “test_result eq pass”) that gate transitions
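A guard such as “test_result eq pass” can be read as three parts: a context key, an operator, and an expected value. The sketch below illustrates that reading in Python. It is not taken from Statewright's engine: the function name `check_guard`, the operators other than `eq` and `gt`, and the numeric coercion rule are all assumptions made for illustration.

```python
import operator

# Only "eq" and "gt" appear in the article; "ne" and "lt" are assumed here.
OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "lt": operator.lt,
}


def _coerce(text):
    """Treat numeric-looking values as numbers, everything else as strings."""
    try:
        return float(text)
    except ValueError:
        return text


def check_guard(expr, context):
    """Evaluate a guard like 'coverage gt 80' against workflow context data."""
    key, op_name, raw = expr.split(maxsplit=2)
    if key not in context or op_name not in OPS:
        return False  # unknown keys or operators never satisfy a guard
    actual, expected = context[key], _coerce(raw)
    # Only compare like with like, so "gt" never sees a string and a number.
    if isinstance(actual, str) != isinstance(expected, str):
        return False
    return OPS[op_name](actual, expected)
```

Because the check is a plain function over context data, the same inputs always give the same answer, which is what lets a guard gate a transition without consulting the model.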
A typical bugfix workflow looks like this:
| Phase | Allowed Tools | What Happens |
|---|---|---|
| Planning | Read, Grep, Glob | Agent analyzes code, identifies the bug |
| Implementing | Read, Edit, Write | Agent fixes the code with edit guards |
| Testing | Read, Bash (pytest only) | Agent runs tests; passes → completed, fails → back to implementing |
| Completed | — | Workflow ends |
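The table above maps naturally onto a transition table. The following Python sketch is one hypothetical encoding: the field names (`tools`, `on`) and the event names (`plan_ready`, `edits_done`, `tests_passed`, `tests_failed`) are invented for illustration and should not be read as Statewright's actual workflow schema.

```python
# Hypothetical encoding of the bugfix workflow; not Statewright's schema.
WORKFLOW = {
    "planning": {
        "tools": {"Read", "Grep", "Glob"},
        "on": {"plan_ready": "implementing"},
    },
    "implementing": {
        "tools": {"Read", "Edit", "Write"},
        "on": {"edits_done": "testing"},
    },
    "testing": {
        "tools": {"Read", "Bash"},
        "on": {"tests_passed": "completed", "tests_failed": "implementing"},
    },
    "completed": {"tools": set(), "on": {}},
}


def advance(state, event):
    """Return the next state, or raise if the event is not valid here."""
    transitions = WORKFLOW[state]["on"]
    if event not in transitions:
        raise ValueError(f"{event!r} is not a valid transition from {state!r}")
    return transitions[event]


def run(events, state="planning"):
    """Replay a sequence of events and return the final state."""
    for event in events:
        state = advance(state, event)
    return state
```

A run in which the first test attempt fails would replay as `run(["plan_ready", "edits_done", "tests_failed", "edits_done", "tests_passed"])`, looping back through implementing once before reaching completed.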
Call a tool that’s not in the current phase and the agent gets rejected with a message explaining what IS available and how to transition. This is hard enforcement — the tool call is intercepted at the hook layer before execution.
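A rough sketch of that interception step, in Python, might look like the following. The hook name `pre_tool_hook`, the `Decision` type, and the wording of the rejection message are assumptions; the article only states that the call is blocked before execution and that the message names the available tools and the way forward.

```python
from dataclasses import dataclass

# Hypothetical phase data mirroring the bugfix workflow; not Statewright's schema.
PHASES = {
    "planning": {"tools": ["Read", "Grep", "Glob"], "next": "implementing"},
    "implementing": {"tools": ["Read", "Edit", "Write"], "next": "testing"},
    "testing": {"tools": ["Read", "Bash"], "next": "completed"},
}


@dataclass
class Decision:
    allowed: bool
    message: str = ""


def pre_tool_hook(state, tool):
    """Decide whether a tool call may run, before anything executes."""
    phase = PHASES[state]
    if tool in phase["tools"]:
        return Decision(allowed=True)
    return Decision(
        allowed=False,
        message=(
            f"{tool} is not available in the {state} phase. "
            f"Available tools: {', '.join(phase['tools'])}. "
            f"To use other tools, transition to {phase['next']}."
        ),
    )
```

Returning a message rather than silently dropping the call gives the agent something actionable to read, which matters for smaller models that would otherwise retry the same rejected call.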
Under the Hood: A Rust Engine with MCP Integration
The core of Statewright is a deterministic Rust engine that evaluates state machine definitions. There is no LLM in the loop: it’s pure rules. The project is part of a growing ecosystem of agent reliability tooling.
On top of this sits a plugin layer that integrates with coding agents via MCP (Model Context Protocol). When a workflow is activated, hooks enforce tool restrictions per state. The model goes from seeing 40 tools to seeing 5 — and gets clear instructions about its current phase and how to progress.
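To make the narrowing concrete, here is a small Python sketch of filtering a tool catalog down to what the current state exposes. The catalog contents, the per-state allowlists, and the function names are hypothetical, and the counts are illustrative only (this example narrows 40 tools to 3 in the planning state, not the 5 quoted above).

```python
# Hypothetical catalog: six named tools plus 34 placeholders, 40 in total.
CATALOG = ["Read", "Grep", "Glob", "Edit", "Write", "Bash"] + [
    f"other_tool_{i}" for i in range(34)
]

# Assumed per-state allowlists, following the bugfix workflow table.
STATE_TOOLS = {
    "planning": {"Read", "Grep", "Glob"},
    "implementing": {"Read", "Edit", "Write"},
    "testing": {"Read", "Bash"},
}


def visible_tools(state):
    """Return only the catalog entries the state exposes, in catalog order."""
    allowed = STATE_TOOLS.get(state, set())
    return [tool for tool in CATALOG if tool in allowed]


def phase_instructions(state):
    """Build the short status line shown to the model for its current phase."""
    tools = ", ".join(visible_tools(state)) or "none"
    return f"Current phase: {state}. Tools available: {tools}."
```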
The supported agents and their enforcement levels:
| Agent | Enforcement |
|---|---|
| Claude Code | Hard (hooks + MCP) |
| Codex | Hard (hooks + MCP) |
| Pi | Hard (with tool recovery for local models) |
| opencode | Hard (alpha) |
| Cursor | Advisory (MCP + rules) |
The guardrails go beyond simple tool allow/block. Among the most notable:
- Bash discernment: blocks `echo > file`, `rm -rf`, `sed -i`, and scripting interpreters even when Bash itself is permitted
- Edit guards: rejects diffs exceeding configurable line limits, caps files edited per state
- Conditional transitions: programmatic guards on context data like `test_result eq pass`, `coverage gt 80`
- Approval gates: pauses for human review at critical decision points
- Interrupts: editing a file matching a glob pattern triggers auto-transition to a validation state
- Fork/join: run branches sequentially or in parallel
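The first two guardrails in this list can be approximated with a few deny rules and counters. The Python sketch below is an illustration only: the regular expressions are assumptions derived from the examples named in the article, the limits `max_lines` and `max_files` are made-up defaults, and a production check would need a real shell parser rather than pattern matching.

```python
import re

# Assumed deny rules based on the article's examples. They are deliberately
# conservative: for instance, any ">" is treated as an output redirect.
DENY_RULES = [
    (re.compile(r">"), "output redirection"),
    (re.compile(r"\brm\s+-[a-zA-Z]*[rf]"), "recursive or forced rm"),
    (re.compile(r"\bsed\s+(?:\S+\s+)*?-i"), "in-place sed"),
    (re.compile(r"(?:^|[;&|])\s*(?:python3?|perl|ruby|node|sh|bash)\b"),
     "scripting interpreter"),
]


def check_bash(command):
    """Return a rejection reason, or None if no deny rule matches."""
    for pattern, reason in DENY_RULES:
        if pattern.search(command):
            return reason
    return None


def check_edit(path, changed_lines, edited_files, max_lines=40, max_files=3):
    """Return a rejection reason for an edit, or None if it is within limits.

    max_lines and max_files are hypothetical defaults for illustration.
    """
    if changed_lines > max_lines:
        return f"diff touches {changed_lines} lines, limit is {max_lines}"
    if path not in edited_files and len(edited_files) >= max_files:
        return f"already edited {len(edited_files)} files, limit is {max_files}"
    return None
```

With these rules, `check_bash("pytest -q tests/")` returns `None`, while commands that write files through the shell are refused even in a state where Bash is otherwise allowed.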
The Research Results: 20% → 100% on SWE-bench
Statewright’s team ran a 5-task SWE-bench subset against several local models:
| Model | Size | Without Statewright | With Statewright |
|---|---|---|---|
| gpt-oss:20b | 13.8GB | 2/10 | 10/10 |
| gemma4:31b | 19.9GB | 2/10 | 10/10 |
| llama3.3 | 42.5GB | 2/2 | 2/2 |
The 2-of-10 baseline on the unconstrained runs is not unusual — local models frequently get stuck in loops, call the wrong tools, or lose context after too many failed edits. With Statewright’s constraints, the same models on the same hardware achieved perfect scores.
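The headline percentages follow directly from the table. As a quick check, this Python snippet recomputes the pass rates and the improvement ratio from the reported counts; the helper names are illustrative, and the numbers are copied from the table rather than independently measured.

```python
# (passed, attempts) without and with Statewright, copied from the table.
RESULTS = {
    "gpt-oss:20b": ((2, 10), (10, 10)),
    "gemma4:31b": ((2, 10), (10, 10)),
    "llama3.3": ((2, 2), (2, 2)),
}


def pass_rate(passed, attempts):
    """Pass rate as a percentage."""
    return 100.0 * passed / attempts


def improvement(before, after):
    """Ratio of the pass rate after to the pass rate before."""
    return pass_rate(*after) / pass_rate(*before)


for model, (before, after) in RESULTS.items():
    print(
        f"{model}: {pass_rate(*before):.0f}% -> {pass_rate(*after):.0f}% "
        f"({improvement(before, after):.1f}x)"
    )
```

For the two smaller models this gives 20% and 100%, a ratio of 5. The llama3.3 row has only two attempts in each condition and shows no change.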
The team also identified a floor around 13GB model size: below that, models can identify bugs correctly but can’t serialize surgical edits (they rewrite entire files). That’s a model limitation, not a guardrail limitation.
How to Use It
Getting started takes two commands in Claude Code:
/plugin marketplace add statewright/statewright
/plugin install statewright
This opens the visual workflow editor at statewright.ai, where users can drag states, draw transitions, and assign tools per phase — no JSON editing required unless desired.
The managed cloud at statewright.ai handles workflow storage, run history, and the MCP gateway. Pricing starts at free (3 workflows, 200 transitions/month), with Pro at $29/month and Team at $99/month. The Rust engine is Apache 2.0 and embeddable with no runtime dependencies.
Why This Matters: The Agent Reliability Crisis
Statewright arrives at a moment when the industry is grappling with a fundamental question: how do you make AI agents trustworthy enough to run autonomously?
The current answers fall into three camps:
- Better models — always works, but expensive and not available on local hardware
- Better prompts — fragile, model-specific, easy to jailbreak
- Observability — tells you what broke, doesn’t prevent it
Statewright proposes a fourth path: structural constraints enforced at the tool-call level, independent of the model’s reasoning ability. It’s closer to how real engineering works — you don’t trust the developer to remember not to rm -rf /; you configure the permissions system.
For teams running AI coding agents in production — especially with local models where reliability gaps are widest — this is a genuinely new category of solution. Not a prompt hack. Not a bigger model. A guardrail system that makes the agent follow the process, whether it wants to or not.
Sources: Statewright GitHub, statewright.ai, Hacker News discussion