bench

See what the best agents do differently.

One line of code. You get a live dashboard, public profile, and recipe card showing your model, framework, and tools. Compare against the leaderboard. Free, open source, built on Cloudflare.

Free. Get your API key in 10 seconds.


SDK

npm install @virajmishra1/bench-sdk

import { observe } from "@virajmishra1/bench-sdk";

const agent = observe({
  apiKey: process.env.BENCH_KEY,
  agent: "mcpify",
  model: "claude-sonnet-4",
  framework: "vercel-ai",
  tools: ["web-search", "calculator"],
});

// doSearch and query come from your own application code.
await agent.task("search", { query }, async (t) => {
  const result = await doSearch(query);
  t.log("found_chunks", result.length);
  t.cost(0.004);
  return result;
});

pip install bench-observe

import os

from bench import observe

agent = observe(
    api_key=os.environ["BENCH_KEY"],
    agent="mcpify",
    model="claude-sonnet-4",
    framework="langchain",
    tools=["web-search", "calculator"],
)

# Define the task body before passing it to agent.task().
# do_search and query come from your own application code.
async def search_fn(t):
    result = await do_search(query)
    t.log("found_chunks", len(result))
    t.cost(0.004)
    return result

# Call this from inside an async function.
result = await agent.task("search", {"query": query}, search_fn)

That's the whole API. TypeScript or Python. Zero dependencies.


How it works

Instrument

Wrap agent tasks with agent.task(). The SDK batches events in the background, so reporting stays off your task's critical path.
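To make the background batching idea concrete, here is a minimal sketch of one way a client could buffer events and ship them from a worker thread. It is not the SDK's actual implementation: EventBatcher, send, batch_size, and flush_interval are hypothetical names chosen for illustration.

```python
import queue
import threading
import time


class EventBatcher:
    """Hypothetical batcher: buffers events and flushes them on a worker thread."""

    def __init__(self, send, batch_size=50, flush_interval=1.0):
        self._send = send                      # callable that ships a list of events
        self._batch_size = batch_size          # flush when this many events are buffered
        self._flush_interval = flush_interval  # or when this many seconds have passed
        self._queue = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def emit(self, event):
        # The caller only enqueues; network work happens on the worker thread.
        self._queue.put(event)

    def _run(self):
        batch = []
        deadline = time.monotonic() + self._flush_interval
        while not (self._stop.is_set() and self._queue.empty()):
            timeout = max(0.0, deadline - time.monotonic())
            try:
                batch.append(self._queue.get(timeout=timeout))
            except queue.Empty:
                pass
            if len(batch) >= self._batch_size or time.monotonic() >= deadline:
                if batch:
                    self._send(batch)
                    batch = []
                deadline = time.monotonic() + self._flush_interval
        if batch:
            # Final flush for anything still buffered at shutdown.
            self._send(batch)

    def close(self):
        # Signal the worker to drain the queue, then wait for it to finish.
        self._stop.set()
        self._worker.join()
```

In this sketch the task code pays only for a queue insert; the trade-off is that events can arrive up to flush_interval seconds late, and close() should be called before the process exits so the last batch is not lost.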

Observe

Events stream to your dashboard in real time: task durations, LLM costs, eval scores, and failure patterns.
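As an illustration of how task durations can be turned into the p50/p95 latency figures a dashboard shows, the sketch below uses the nearest-rank percentile method. The choice of method and the names percentile and summarize are assumptions for this example; the dashboard may compute its figures differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Rank is 1-based; clamp to 1 so pct=0 returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations_ms):
    """Summarize a list of task durations (milliseconds) into headline latency figures."""
    return {
        "count": len(durations_ms),
        "p50": percentile(durations_ms, 50),
        "p95": percentile(durations_ms, 95),
    }
```

The p95 figure is the one to watch for agents: a single slow tool call barely moves the median but shows up immediately in the tail.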

Share

Public profile at /u/you/agent. SVG badge for your README. Shareable compare URLs. Go viral.
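The profile and compare URLs follow the patterns shown on this page (/u/:login/:slug and /vs/@you/agent1/@them/agent2). A small sketch of building them is below; the helper names are hypothetical, and percent-encoding each segment is an assumption rather than documented behavior.

```python
from urllib.parse import quote


def profile_path(login, slug):
    """Build a public profile path of the form /u/:login/:slug."""
    return f"/u/{quote(login, safe='')}/{quote(slug, safe='')}"


def compare_path(login_a, slug_a, login_b, slug_b):
    """Build a compare path of the form /vs/@a/agentA/@b/agentB."""
    left = f"@{quote(login_a, safe='')}/{quote(slug_a, safe='')}"
    right = f"@{quote(login_b, safe='')}/{quote(slug_b, safe='')}"
    return f"/vs/{left}/{right}"
```

For example, profile_path("you", "agent") gives /u/you/agent, and compare_path("you", "agent1", "them", "agent2") gives /vs/@you/agent1/@them/agent2.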

What you get

Live dashboard: WebSocket event stream, p50/p95 latency, per-task timeline

Public profile: Server-rendered, OG-optimized at /u/:login/:slug

README badge: KV-cached SVG, GitHub camo-friendly, updates every 60s

LLM-as-judge eval: Every task auto-scored 0–1 by Llama 3.3 70B via Workers AI

Failure clustering: k-means clustering, with an LLM describing your failure patterns

Compare URLs: /vs/@you/agent1/@them/agent2, for shareable activity comparisons

Embeddable widget: iframe-ready mini-dashboard, 3 sizes, dark/light

Leaderboard: Top agents by runs, success rate, eval score, or cost

Agent recipes: See the model, framework, tools, and architecture behind every top agent
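The README badge is described as a cached SVG that updates every 60 seconds. The sketch below shows the general time-to-live caching pattern with an in-memory dictionary standing in for the cache. It does not use the Cloudflare KV API, and TTLCache, get_or_render, and the render callback are hypothetical names.

```python
import time


class TTLCache:
    """In-memory stand-in for a key-value cache with a fixed time-to-live."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # key -> (expires_at, value)

    def get_or_render(self, key, render):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Still fresh: serve the cached value without re-rendering.
            return entry[1]
        # Missing or expired: render once and cache for the next ttl seconds.
        value = render()
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a 60-second lifetime, a badge embedded in a popular README is rendered at most once per minute per agent, no matter how many times the image is requested.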

vs. the rest

Feature	LangSmith	Helicone	Braintrust	Bench
Free tier	△	△	✕	✓
Public profiles	✕	✕	✕	✓
README badge	✕	✕	✕	✓
Realtime streaming	△	✕	✕	✓
Compare agents	✕	✕	✕	✓
Failure clustering	△	✕	✕	✓
Open source	✕	✕	✕	✓
One-line SDK	△	✓	△	✓
Edge-native	✕	✕	✕	✓
Agent recipes	✕	✕	✕	✓

Stack

Workers, Durable Objects, Workers AI, D1, KV, Workflows, Browser Rendering

9 Cloudflare products. Every one earns its place.
