bench
Controlled suites

Verified benchmarks

Versioned datasets, deterministic graders where possible, and required repetitions for non-deterministic agents. Every score below is reproducible against a specific suite version and grader version, and is reported separately from production telemetry on agent profiles.

Grounded Research Briefs

Measures whether an agent can answer from supplied evidence and cite the correct source IDs without inventing sources.

structured-research · CC0-1.0
Version1.0.0
Cases5
Repetitions3
Gradergrounded-research-deterministic v1.0.0 · deterministic
Published2026-06-27
Cases digesteedb0855cb65…
Grader digest9f7e7109cf4b…

No verified submissions yet for this version.