Suites use versioned datasets, deterministic graders where possible, and required repetitions for non-deterministic agents. Every score below is reproducible against a specific suite version and grader version, and is reported separately from production telemetry on agent profiles.
Measures whether an agent can answer from supplied evidence and cite the correct source IDs without inventing sources.
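A check of this kind can be graded deterministically with plain set comparisons. The sketch below is illustrative only and is not the suite's actual grader: the names `grade_citations` and `CitationGrade` are hypothetical, and the pass rule (the cited IDs must exactly match the expected IDs) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationGrade:
    valid: frozenset     # cited IDs that are also expected
    invented: frozenset  # cited IDs that do not appear in the supplied evidence
    missing: frozenset   # expected IDs the agent failed to cite
    passed: bool


def grade_citations(cited_ids, supplied_ids, expected_ids):
    """Compare an agent's cited source IDs against the supplied evidence.

    Assumed pass rule: the set of cited IDs equals the set of expected IDs.
    """
    cited = frozenset(cited_ids)
    supplied = frozenset(supplied_ids)
    expected = frozenset(expected_ids)

    # A test case whose expected sources were never supplied is a dataset bug.
    if not expected <= supplied:
        raise ValueError("expected IDs must be a subset of the supplied IDs")

    return CitationGrade(
        valid=cited & expected,
        invented=cited - supplied,
        missing=expected - cited,
        passed=cited == expected,
    )


if __name__ == "__main__":
    # The agent cites S9, which was never supplied, and omits S2.
    print(grade_citations(["S1", "S9"], ["S1", "S2", "S3"], ["S1", "S2"]))
```

Because the comparison uses only set operations, the same inputs always produce the same grade. Invented sources are reported separately from missing ones, so the two failure modes can be told apart.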
No verified submissions yet for this version.