Statewright: Visual State Machines That Finally Make AI Agents Reliable — No Prompt Engineering Required

“Agents are suggestions, states are laws.”

That’s the tagline of Statewright, and it might be the most important design philosophy to hit the agent reliability space this year. With 91 points on Hacker News and 175 GitHub stars on its first day, Statewright tackles the fundamental problem with AI agents: they have too many tools — a challenge explored across our AI agent glossary, too much freedom, and not enough structure.

The solution? State machines. Not prompt engineering, not bigger models, not observability dashboards that tell you what went wrong after the fact. Structural guarantees, enforced at the MCP protocol layer, that constrain which tools an agent can use in each phase of a workflow.

The Problem: Death By Forty Tools

If you’ve ever watched an AI coding agent flail — re-reading the same file five times without editing, calling list_directory on paths it already explored, or trying to git push in the planning phase — you’ve seen Statewright’s raison d’être.

The conventional wisdom says: bigger models + better prompts = fewer errors. But that’s scaling a flawed premise. Give a model 40+ tools and an open-ended problem, and even frontier models waste tokens navigating the tool space. The common fixes — longer system prompts, few-shot examples, post-hoc observability — all treat symptoms, not the root cause.

Statewright’s insight is brutally simple: instead of making the problem bigger, make it smaller.

How Statewright Works

Statewright lets you define a workflow as a state machine — a visual graph of phases (planning, implementation, testing, review) with explicit transitions between them. Each state has a whitelist of allowed tools. If the agent calls a tool that’s not in the current phase, it gets rejected with a message telling it what is available and how to transition.

The key design decisions:

Feature	What It Does
Per-state tool enforcement	Planning gets read-only tools. Implementation unlocks edit tools. Testing gets only test commands.
Decision checkpoints	Require explicit transitions between phases. The agent can’t skip from planning straight to production.
Edit guards	Write-via-redirect and destructive ops are blocked even when Bash is allowed.
Command guards	Only designated commands execute in each phase.
Transition audit log	Every state change is logged with the agent’s stated reason.

The engine is written in Rust and exposed via MCP (Model Context Protocol), making it compatible with any MCP-speaking client — Claude Code, Codex, opencode, Pi, Cursor, and others. For more on the MCP ecosystem, see our guide to the protocol — Claude Code, Codex, opencode, Pi, Cursor, and others.

The Research Results That Matter

Statewright’s creators validated the approach on local models — where the effect is most measurable — using a 5-task SWE-bench subset and a 26-line bug-fix benchmark. The results are striking:

Model	Size	Without Statewright	With Statewright
gemma3	3.3GB	FAIL	FAIL (floor)
gemma4:e2b	7.2GB	2/10	10/10
gemma4:31b	19.9GB	2/10	10/10
llama3.3	42.5GB	—	PASS (2/2 tasks)

Two models (7.2GB and 19.9GB) went from 2/10 to 10/10 with Statewright constraints. Same tasks, same hardware, same model weights. The only change was the structural guardrails.

The researchers note that below 13GB, models can produce tool calls but can’t retain enough file content to produce accurate edits — that’s the hardware floor, not a Statewright limitation.

“Frontier models with default system prompts handle the obvious catastrophic actions (database deletion, credential leaks) most of the time. The structural win is bigger: breaking read-loop death spirals where models re-read the same file 5+ times without ever editing, and keeping the tool space small enough that the model actually reasons instead of flailing.” — Statewright research brief

Visual Workflow Builder

Unlike traditional state machine frameworks that require YAML or code configuration, Statewright ships with a visual workflow editor. You drag states, draw transitions, and assign tools per phase — no coding required. The visual model is then enforced at the MCP level against any connected agent.

The workflow editor shows:

Current phase (highlighted)
Available transitions (context-sensitive)
Tools permitted in the current phase
Audit log of all state transitions

This makes Statewright accessible to non-engineers who need to define agent guardrails — a security architect could design a deployment workflow without writing code.

Quick Start: Two Commands

Getting started with Claude Code takes literally two commands:

/plugin marketplace add statewright/statewright
/plugin install statewright

Then start a workflow:

❯ start the bugfix workflow — fix the failing tests in calc.py

◆ statewright — statewright_start (workflow: bugfix)
◆ [statewright] Workflow activated: bugfix

◆ statewright — statewright_get_state (MCP)
◆ Current phase: planning. Let me read the code first.

  Read 2 files

  [statewright] planning => implementing

◆ statewright — statewright_transition (READY)

  Edit calc.py: 1 line changed

  [statewright] implementing => testing

◆ statewright — statewright_transition (DONE)

  Bash: pytest -x — 7 passed

  [statewright] testing => completed
◆ [statewright] Workflow complete. 46 seconds.

The agent’s actions are structurally constrained at each step — no need to trust that the prompt will keep it in line.

Pricing and Business Model

Statewright operates on a freemium model with the core engine open-source under Apache 2.0:

Tier	Transitions/month	Price
Free	200	$0
Pro	2,500	$29/month

The hosted service provides the visual workflow editor, audit logs, and team collaboration. The Rust engine and MCP plugin remain fully open-source.

Why This Matters for the Agent Ecosystem

Statewright represents a philosophical shift in how we think about agent reliability. The dominant approach has been:

Bigger models (scale up)
Better prompts (scale sideways)
Post-hoc monitoring (observe failures)

Statewright argues for structural enforcement — a fourth path that addresses agent behavior at the system level rather than the model level. This is closer to how safety-critical software has been built for decades: you don’t ask the pilot to “please not crash”; you build control surfaces with mechanical limits.

For the agent ecosystem, this means:

Smaller models become viable for complex tasks when their tool space is constrained — a 7.2GB model matches frontier performance with Statewright
Deterministic guarantees replace probabilistic hopes — the state machine enforces transitions, the model operates within bounds
Auditability by design — every transition is logged with the agent’s stated reason, creating a chain of accountability
Cross-platform — the same workflow works on Claude Code, Codex, Cursor, and Pi

The Bigger Picture

Statewright (by Ben Cochran) arrives at a moment when the agent community is grappling with reliability as the real barrier to deployment. The May 9 article in The Agent Report covered the broader safety tooling landscape — from kernel sandboxes to SQL proxies — but Statewright occupies a unique niche: it prevents errors by constraining the problem space, not by detecting bad behavior after the fact.

As agents move from demos to production workloads, the tools that provide structural guarantees — not just better prompts — will define the next generation of the agent stack. Statewright is the first framework to make that argument stick with published benchmark data.

The message for agent developers is clear: structure beats reasoning, and state machines beat system prompts.

Sources: Statewright on GitHub, Statewright Website, Hacker News Discussion (91 pts), Research Brief