Compatibility infrastructure for AI agents
The tests
didn't ship
with the
agents.
Dynobox is BrowserStack for agents. Run the same skill, MCP server,
and workflow across every harness. Catch the break before your users do.
The problem
Skills ship to prod.
Their tests do not.
A model update ships. Your skill silently breaks. Nobody finds out until a user does.
Why now
Five harnesses. Weekly model updates.
No shared test layer.
01 Skills are portable across Claude Code, Codex, Cursor, Gemini CLI & more.
02 Frontier models ship updates on a weekly cadence.
03 Each harness has different tools, permission models, and failure modes.
The wedge
Cross-harness
skill testing.
Anthropic tests in Claude Code. OpenAI tests in Codex.
Neither has any incentive to make the other's harness look good. Skills today, MCP servers and agent workflows next.
The product
A pass-rate
matrix.
| Skill | Claude Code | Codex | Gemini |
| skills/commit | ✓ | ✓ | ✕ |
| skills/data-pipeline | ✓ | ✕ | ✓ |
| skills/refactor | ✓ | ✓ | ✓ |
Skills run in disposable sandboxes across every harness. The compatibility
matrix shows where your agent workflow breaks, and why.
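As a rough illustration of how a pass-rate matrix like the one above could be derived from raw runs, here is a minimal sketch. The `RunResult` shape, the `buildMatrix` helper, and the sample data are all hypothetical, invented for this example; they are not Dynobox's actual reporter format.

```typescript
// Hypothetical result shape; Dynobox's real reporter output may differ.
interface RunResult {
  skill: string;    // e.g. "skills/commit"
  harness: string;  // e.g. "claude-code"
  passed: boolean;  // did every assertion in the scenario pass?
}

// skill -> harness -> fraction of runs that passed (0 to 1)
type PassRateMatrix = Record<string, Record<string, number>>;

function buildMatrix(results: RunResult[]): PassRateMatrix {
  // First pass: count passes and totals per (skill, harness) cell.
  const counts: Record<string, Record<string, { pass: number; total: number }>> = {};
  for (const r of results) {
    if (!counts[r.skill]) counts[r.skill] = {};
    const row = counts[r.skill];
    if (!row[r.harness]) row[r.harness] = { pass: 0, total: 0 };
    const cell = row[r.harness];
    cell.total += 1;
    if (r.passed) cell.pass += 1;
  }

  // Second pass: turn the counts into pass rates.
  const matrix: PassRateMatrix = {};
  for (const skill of Object.keys(counts)) {
    matrix[skill] = {};
    for (const harness of Object.keys(counts[skill])) {
      const { pass, total } = counts[skill][harness];
      matrix[skill][harness] = pass / total;
    }
  }
  return matrix;
}

// Illustrative data only: one skill run once per harness,
// and one skill run twice on the same harness.
const sample: RunResult[] = [
  { skill: "skills/commit", harness: "claude-code", passed: true },
  { skill: "skills/commit", harness: "codex", passed: true },
  { skill: "skills/commit", harness: "gemini", passed: false },
  { skill: "skills/data-pipeline", harness: "codex", passed: false },
  { skill: "skills/data-pipeline", harness: "codex", passed: true },
];

const matrix = buildMatrix(sample);
```

Storing a rate rather than a boolean keeps flaky cells visible: a skill that passes one run in two shows up as 0.5 instead of being rounded to a check or a cross.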
The category
Evals test outputs.
We test execution.
Most AI testing scores what the model said. Dynobox verifies what actually
happened: which skills loaded, which tools were called, which files
changed, which APIs were hit. The same workflow, audited across every runtime.
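To make the difference between scoring output and verifying execution concrete, here is a small sketch of checking tool-call assertions against a captured trace. The `ToolCall` and `Assertion` types and the `check` function are assumptions made for illustration, loosely modeled on the `kind`, `toolKind`, and `matcher.includes` fields in the `commit.dyno.yaml` example in this document; Dynobox's real trace format and assertion engine may look quite different.

```typescript
// Hypothetical trace entry: one tool invocation recorded during a run.
interface ToolCall {
  toolKind: string; // e.g. "shell"
  input: string;    // e.g. the shell command that was executed
}

// Hypothetical assertion shapes, mirroring the YAML field names.
type Assertion =
  | { kind: "tool.called"; toolKind: string; matcher: { includes: string } }
  | { kind: "tool.notCalled"; toolKind: string; matcher: { includes: string } };

// True when a recorded call has the right tool kind and its input
// contains the matcher substring.
function matches(call: ToolCall, a: Assertion): boolean {
  return call.toolKind === a.toolKind && call.input.includes(a.matcher.includes);
}

// tool.called passes if any call matches; tool.notCalled passes if none do.
function check(trace: ToolCall[], a: Assertion): boolean {
  const hit = trace.some((call) => matches(call, a));
  return a.kind === "tool.called" ? hit : !hit;
}

// Made-up trace of an agent that commits but does not push.
const trace: ToolCall[] = [
  { toolKind: "shell", input: "git add README.md" },
  { toolKind: "shell", input: 'git commit -m "docs: update README"' },
];

const committed = check(trace, {
  kind: "tool.called",
  toolKind: "shell",
  matcher: { includes: "git commit" },
});

const didNotPush = check(trace, {
  kind: "tool.notCalled",
  toolKind: "shell",
  matcher: { includes: "git push" },
});
```

Neither check looks at what the model said. An agent that claims "committed, not pushed" but actually ran `git push` would fail the second assertion regardless of its final message.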
How it works
One config.
Every harness.
define flow + assertions
↓
spin disposable sandboxes
↓
capture native traces
↓
pass-rate matrix + regression history
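The final step, regression history, can be pictured as a diff between two runs of the same matrix. The sketch below assumes a hypothetical `Snapshot` shape (skill to harness to pass/fail) and a hypothetical `findRegressions` helper; it shows the idea only and is not how Dynobox stores or compares history.

```typescript
// Hypothetical snapshot of one matrix run: skill -> harness -> passed?
type Snapshot = Record<string, Record<string, boolean>>;

interface Regression {
  skill: string;
  harness: string;
}

// Report every cell that passed in the baseline and fails in the current run.
// Cells missing from the current run are skipped rather than counted as failures.
function findRegressions(baseline: Snapshot, current: Snapshot): Regression[] {
  const out: Regression[] = [];
  for (const skill of Object.keys(baseline)) {
    for (const harness of Object.keys(baseline[skill])) {
      const before = baseline[skill][harness];
      const after = current[skill] ? current[skill][harness] : undefined;
      if (before === true && after === false) {
        out.push({ skill, harness });
      }
    }
  }
  return out;
}

// Made-up snapshots taken before and after a model update.
const beforeUpdate: Snapshot = {
  "skills/commit": { "claude-code": true, codex: true },
  "skills/refactor": { "claude-code": true, codex: true },
};

const afterUpdate: Snapshot = {
  "skills/commit": { "claude-code": true, codex: false },
  "skills/refactor": { "claude-code": true, codex: true },
};

const regressions = findRegressions(beforeUpdate, afterUpdate);
```

Keyed this way, each new run only needs the previous snapshot to flag what a model release broke, and the stored snapshots accumulate into the regression history.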
Current state · 0.1.0 shipped
Tests the
skills you
actually ship.
commit.dyno.yaml
# real test from .agents/skills/commit/
name: commit-skill
harnesses: [claude-code, codex]
scenarios:
  - name: safe commit workflow
    prompt: "use the commit skill to commit README. don't push."
    assertions:
      - kind: skill.invoked
        skill: commit
      - kind: tool.called
        toolKind: shell
        matcher: { includes: git commit }
      - kind: tool.notCalled
        toolKind: shell
        matcher: { includes: git push }
dynobox run
dynobox 0.1.0
✓ safe commit workflow   claude-code   14.2s
    ✓ 3 of 3 assertions passed
✗ safe commit workflow   codex         11.8s
    ✗ skill.invoked(commit)
        no read of SKILL.md
    ✓ tool.called(shell, git commit)
    ✓ tool.notCalled(shell, git push)
─────────────────────────────
1 passed   1 failed   26.0s
Real test from the repo. Dynobox tests its own skills with this exact pattern.
Status
Shipping today · OSS
dynobox@0.1.0 on npm. Claude Code and Codex harnesses. TS and YAML
authoring. Tool, artifact, transcript, sequence, skill, and HTTP assertions.
JSON reporter and GitHub Actions recipe for CI.
What's coming
The hosted runner. Disposable cloud sandboxes, parallel matrix at scale,
regression history across model releases, team dashboards. Each run compounds into a regression baseline customers can't
recreate.
npm i -g dynobox · v0.1.0 · Apache-2.0 · 4 packages · claude-code · codex · ts + yaml
Pre-customer. Product shipped. Design partners next.
Model
Consumption,
not seats.
You pay for runs: sandbox-minutes against the harnesses you
test. Aligned with value, scales with the customer's own skill library, no
per-seat friction on the team that's already paying for models.
Why me
Founder-market fit
AWS · prior
Built matrix test infra for ML data labeling. Same playbook, now pointed at agents instead of annotators.
Stripe · today
Hands-on with the same harnesses Dynobox tests, inside a system where every action moves real money. I feel the gap daily.
Solo · last 12 mo
agentutils.dev, diagram.bhk.dev, Forecast. Shipped end to end, solo.
The test infra
for the next
way software
gets built.
$ npm install dynobox@latest
dynobox.xyz · github.com/dynobox · [email protected]