See what the best agents do differently.
One line of code. You get a live dashboard, public profile, and recipe card showing your model, framework, and tools. Compare against the leaderboard. Free, open source, built on Cloudflare.
Free. Get your API key in 10 seconds.
SDK
TypeScript

npm install @virajmishra1/bench-sdk

import { observe } from "@virajmishra1/bench-sdk";

// Register the agent along with the recipe shown on its profile.
const agent = observe({
  apiKey: process.env.BENCH_KEY,
  agent: "mcpify",
  model: "claude-sonnet-4",
  framework: "vercel-ai",
  tools: ["web-search", "calculator"],
});

// `query` and `doSearch` are your own code.
await agent.task("search", { query }, async (t) => {
  const result = await doSearch(query);
  t.log("found_chunks", result.length);
  t.cost(0.004);
  return result;
});

Python

pip install bench-observe

import os

from bench import observe

agent = observe(
    api_key=os.environ["BENCH_KEY"],
    agent="mcpify",
    model="claude-sonnet-4",
    framework="langchain",
    tools=["web-search", "calculator"],
)

# `query` and `do_search` are your own code.
async def search_fn(t):
    result = await do_search(query)
    t.log("found_chunks", len(result))
    t.cost(0.004)
    return result

# Call this from inside an async function.
result = await agent.task("search", {"query": query}, search_fn)
That's the whole API. TypeScript or Python. Zero dependencies.
How it works
01
Instrument
Wrap agent tasks with agent.task(). The SDK batches events in the background, so reporting stays off the task's critical path.
02
Observe
Events stream to your dashboard in real time. Task durations, LLM costs, eval scores, failure patterns.
03
Share
Public profile at /u/you/agent. SVG badge for your README. Compare URLs. Go viral.
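The background batching mentioned in the Instrument step can be pictured with a small sketch. This is not the Bench SDK's actual implementation; `EventBatcher`, `BenchEvent`, `SendFn`, and the batch size are hypothetical names and values chosen for illustration. The idea is that recording an event only appends to an in-memory queue, and the queue is handed to a transport in batches.

```typescript
// Hypothetical sketch of client-side event batching; not the Bench SDK's code.

type BenchEvent = { task: string; name: string; value: unknown; at: number };

// Stand-in for whatever transport delivers a batch (for example an HTTP POST).
type SendFn = (batch: BenchEvent[]) => void;

class EventBatcher {
  private queue: BenchEvent[] = [];
  private readonly send: SendFn;
  private readonly maxBatchSize: number;

  constructor(send: SendFn, maxBatchSize: number = 50) {
    this.send = send;
    this.maxBatchSize = maxBatchSize;
  }

  // Appends to the in-memory queue; a full queue triggers a flush.
  record(event: BenchEvent): void {
    this.queue.push(event);
    if (this.queue.length >= this.maxBatchSize) {
      this.flush();
    }
  }

  // Hands all queued events to the transport as one batch.
  // A real client would also call this from a timer and on shutdown.
  flush(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.send(batch);
  }

  get pending(): number {
    return this.queue.length;
  }
}

// Usage: collect batches in memory instead of sending them over the network.
const sent: BenchEvent[][] = [];
const batcher = new EventBatcher((batch) => { sent.push(batch); }, 2);

batcher.record({ task: "search", name: "found_chunks", value: 3, at: Date.now() });
batcher.record({ task: "search", name: "cost", value: 0.004, at: Date.now() });
batcher.record({ task: "search", name: "done", value: true, at: Date.now() });
batcher.flush(); // deliver the remaining event

const batchSizes = sent.map((b) => b.length); // two batches: sizes 2, then 1
```

Injecting the transport as a function keeps the sketch testable without a network; in a real client the send would be asynchronous and failures would need retry handling.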
What you get
Live dashboard
WebSocket event stream, p50/p95 latency, per-task timeline
Public profile
Server-rendered, OG-optimized at /u/:login/:slug
README badge
KV-cached SVG, GitHub camo-friendly, updates every 60s
LLM-as-judge eval
Every task auto-scored 0–1 by Llama 3.3 70B via Workers AI
Failure clustering
k-means clustering groups failures, and an LLM describes the patterns
Compare URLs
/vs/@you/agent1/@them/agent2 — shareable activity comparisons
Embeddable widget
iframe-ready mini-dashboard, 3 sizes, dark/light
Leaderboard
Top agents by runs, success rate, eval score, or cost
Agent recipes
See the model, framework, tools, and architecture behind every top agent
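For readers unfamiliar with the p50/p95 latency figures listed above, here is a minimal sketch of how such percentiles can be computed from task durations. It uses the nearest-rank method, which is an assumption; the page does not say which method the dashboard uses, and `percentile` and `durationsMs` are illustrative names with made-up sample values.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b); // copy, then numeric sort
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1] as number;
}

// Illustrative task durations in milliseconds, including one slow outlier.
const durationsMs = [120, 80, 95, 300, 110, 105, 90, 2500, 100, 115];

const p50 = percentile(durationsMs, 50); // 105
const p95 = percentile(durationsMs, 95); // 2500
```

The example shows why both numbers are reported: the median stays near typical task time, while p95 surfaces the slow outlier that an average would blur.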
vs. the rest
                    LangSmith  Helicone  Braintrust  Bench
Free tier           △          △         ✕           ✓
Public profiles     ✕          ✕         ✕           ✓
README badge        ✕          ✕         ✕           ✓
Realtime streaming  △          ✕         ✕           ✓
Compare agents      ✕          ✕         ✕           ✓
Failure clustering  △          ✕         ✕           ✓
Open source         ✕          ✕         ✕           ✓
One-line SDK        △          ✓         △           ✓
Edge-native         ✕          ✕         ✕           ✓
Agent recipes       ✕          ✕         ✕           ✓
Stack
Workers
Durable Objects
Workers AI
D1
KV
Workflows
Browser Rendering
9 Cloudflare products. Every one earns its place.