ScientistOne autonomously generates research papers with verifiable evidence chains—every claim traces to code, data, or literature—while matching or exceeding human expert performance on frontier algorithm discovery tasks.
21 papers and their solver code, autonomously generated by ScientistOne across three benchmarks.
ScientistOne is an end-to-end autonomous research system whose pipeline—Problem Investigator, Discovery Engine, and Paper Writer with Claim Verifier—is designed to satisfy Chain-of-Evidence natively. The Problem Investigator reads up to 100 full-text PDFs per topic, producing grounded experiment briefs. The Discovery Engine uses a parallel explore-exploit search tree to discover high-performing algorithms. The Claim Verifier checks every claim in the draft against its declared evidence source before the final paper is produced.
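The explore-exploit idea behind the Discovery Engine can be illustrated with a toy search tree. This is only a minimal sketch under stated assumptions, not the system's actual implementation: the UCB1-style selection rule, the `Node` class, and the `evaluate` and `mutate` helpers are all hypothetical stand-ins, and the "candidate" is a single float rather than solver code.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate in the search tree (here just a float parameter)."""
    candidate: float
    score: float
    visits: int = 1
    children: list = field(default_factory=list)


def evaluate(candidate: float) -> float:
    # Hypothetical evaluator: toy objective with its maximum at candidate == 3.0.
    return -(candidate - 3.0) ** 2


def mutate(candidate: float, rng: random.Random) -> float:
    # Hypothetical proposal step; a real system would generate new solver code.
    return candidate + rng.uniform(-1.0, 1.0)


def select(nodes: list, total_visits: int, c: float = 1.0) -> Node:
    # UCB1-style rule (an assumption): favor high scores (exploit)
    # and rarely expanded nodes (explore).
    return max(
        nodes,
        key=lambda n: n.score + c * math.sqrt(math.log(total_visits + 1) / n.visits),
    )


def search(rounds: int = 30, width: int = 4, seed: int = 0) -> Node:
    rng = random.Random(seed)
    root = Node(candidate=0.0, score=evaluate(0.0))
    nodes = [root]
    with ThreadPoolExecutor(max_workers=width) as pool:
        for step in range(rounds):
            parent = select(nodes, total_visits=step + 1)
            parent.visits += 1
            proposals = [mutate(parent.candidate, rng) for _ in range(width)]
            # Evaluate all proposals from this expansion in parallel.
            scores = list(pool.map(evaluate, proposals))
            for cand, score in zip(proposals, scores):
                child = Node(candidate=cand, score=score)
                parent.children.append(child)
                nodes.append(child)
    # The best node seen anywhere in the tree is the search result.
    return max(nodes, key=lambda n: n.score)


best = search()
print(f"best candidate {best.candidate:.3f} with score {best.score:.4f}")
```

Each round expands one node into `width` children and scores them concurrently, so the tree grows toward promising regions while the visit-count penalty keeps untried branches in play.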
CoE (Chain-of-Evidence) Audit is a post-hoc audit that checks whether the claims in a completed paper are supported by the underlying artifacts: code, evaluator outputs, and bibliography. It comprises four integrity checks (Score Verification, Specification Violation, Reference Verification, and Method-Code Alignment), each targeting a specific way a claim can lose its grounding.
We apply CoE Audit to 75 papers from five autonomous research systems across five frontier systems-research tasks. Every baseline exhibits at least one systematic integrity failure: hallucinated-reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne is the only system that is best or tied for best on all four checks, a direct consequence of maintaining evidence chains throughout the pipeline rather than retrofitting them at paper-writing time.
| System | Score Verif. (papers passing) ↑ | Spec. Violation (papers flagged) ↓ | Ref. Verif. (hallucinated refs) ↓ | Method-Code Align. (papers aligned) ↑ |
|---|---|---|---|---|
| Sakana ASv2 | 5/12 (42%) | 10/15† | 0/159 (0%) | 5/15 (33%)† |
| AutoResearchClaw | 5/12 (42%) | 0/15 | 3/196 (1.5%) | 3/15 (20%) |
| DeepScientist | 11/12 (92%) | 0/15 | 42/201 (20.9%) | 5/15 (33%) |
| AI-Researcher | 9/12 (75%) | 1/15 | 21/222 (9.5%) | 12/15 (80%) |
| ScientistOne | 12/12 (100%) | 0/15 | 0/337 (0%) | 14/15 (93%) |
† Sakana's audited solution code contains non-solver scaffolding by system design rather than a standalone solver, which inflates its Specification Violation (I2) and Method-Code Alignment (I4) counts. Cross-system comparisons on these two checks should exclude Sakana. See the paper for details.
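The percentages in the audit table follow directly from the raw counts. A minimal sketch that recomputes them and finds which systems are best or tied for best on every check; the short key names (`score`, `spec`, `ref`, `align`) and helper functions are illustrative shorthand, not names from the paper.

```python
from fractions import Fraction

# (count, total) pairs transcribed from the audit table above.
# Key names ("score", "spec", "ref", "align") are illustrative shorthand.
# Sakana's "spec" and "align" entries are kept for completeness, although
# the footnote advises excluding them from cross-system comparison.
AUDIT = {
    "Sakana ASv2":      {"score": (5, 12),  "spec": (10, 15), "ref": (0, 159),  "align": (5, 15)},
    "AutoResearchClaw": {"score": (5, 12),  "spec": (0, 15),  "ref": (3, 196),  "align": (3, 15)},
    "DeepScientist":    {"score": (11, 12), "spec": (0, 15),  "ref": (42, 201), "align": (5, 15)},
    "AI-Researcher":    {"score": (9, 12),  "spec": (1, 15),  "ref": (21, 222), "align": (12, 15)},
    "ScientistOne":     {"score": (12, 12), "spec": (0, 15),  "ref": (0, 337),  "align": (14, 15)},
}
# Direction of each check, matching the arrows in the table header.
HIGHER_IS_BETTER = {"score": True, "spec": False, "ref": False, "align": True}


def rate(system: str, check: str) -> Fraction:
    """Exact rate for one system on one check."""
    count, total = AUDIT[system][check]
    return Fraction(count, total)


def leaders(check: str) -> set:
    """Systems that are best, or tied for best, on one check."""
    pick = max if HIGHER_IS_BETTER[check] else min
    best = pick(rate(s, check) for s in AUDIT)
    return {s for s in AUDIT if rate(s, check) == best}


lead_everywhere = set.intersection(*(leaders(c) for c in HIGHER_IS_BETTER))
for system in AUDIT:
    pct = {c: f"{100 * float(rate(system, c)):.1f}%" for c in HIGHER_IS_BETTER}
    print(system, pct)
print("best or tied on all four checks:", sorted(lead_everywhere))
```

Using `Fraction` keeps the tie comparisons exact, which matters because several systems share a rate of zero on the lower-is-better checks.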
On the ADRS benchmark's five frontier systems-research tasks, most systems match or exceed the human expert baselines (the exceptions in the table below are four Cloudcast results and AutoResearchClaw on LLM-SQL), consistent with prior observations that LLM-based agents rapidly converge to similar solution quality. ScientistOne achieves the best overall scores on Cloudcast and EPLB, and is on par with specialized algorithm-discovery systems (AdaEvolve, EvoX) that do not produce research papers. Critically, this performance comes with no integrity tradeoff: ScientistOne is the only system that pairs competitive solver scores with full evidence-chain verifiability.
| Task | Dir. | Human | AdaEvo* | EvoX* | Sakana | ARC | AIR | DS | ScientistOne |
|---|---|---|---|---|---|---|---|---|---|
| Prism | ↑ | 21.89 | 26.26 | 26.26 | 26.26 | 26.25 | 26.26 | 26.26 | 26.26 |
| Cloudcast | ↓ | 626.24 | 637.10 | 623.69 | 627.11 | 690.37 | 734.28 | 620.09 | 618.08 |
| EPLB | ↑ | 0.1265 | 0.1450 | 0.1453 | 0.1270 | 0.1266 | 0.1449 | 0.1284 | 0.1459 |
| LLM-SQL | ↑ | 0.6920 | 0.7520 | 0.7300 | 0.7320 | 0.6757 | 0.7148 | 0.7307 | 0.7222 |
| TXN | ↑ | 2724.8 | 4310 | 4310 | 4184 | 3247 | 4311 | 4286 | 3906 |
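* Uses Gemini-3.0-Pro; all other systems use Gemini-3.1-Pro. Abbreviations: AdaEvo = AdaEvolve, ARC = AutoResearchClaw, AIR = AI-Researcher, DS = DeepScientist. Scores for Sakana, ARC, AIR, DS, and ScientistOne are from independent re-runs of the canonical evaluator. Human, AdaEvolve, and EvoX scores are from the original publications.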
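Because Cloudcast is lower-is-better while the other tasks are higher-is-better, comparisons against the human baseline need to account for the direction column. A minimal sketch of a direction-aware relative gain, using the Human and ScientistOne columns from the table above; the `relative_gain` helper and the `"up"`/`"down"` labels are illustrative choices, not part of the benchmark.

```python
# (direction, human baseline, ScientistOne score) transcribed from the table above.
ADRS = {
    "Prism":     ("up",   21.89,  26.26),
    "Cloudcast": ("down", 626.24, 618.08),
    "EPLB":      ("up",   0.1265, 0.1459),
    "LLM-SQL":   ("up",   0.6920, 0.7222),
    "TXN":       ("up",   2724.8, 3906.0),
}


def relative_gain(direction: str, baseline: float, score: float) -> float:
    """Fractional improvement over the baseline; positive means better."""
    if direction not in ("up", "down"):
        raise ValueError(f"unknown direction: {direction!r}")
    # Flip the sign for lower-is-better tasks so that "better" is always positive.
    delta = score - baseline if direction == "up" else baseline - score
    return delta / baseline


gains = {task: relative_gain(*row) for task, row in ADRS.items()}
for task, gain in gains.items():
    print(f"{task:<10} {100 * gain:+.1f}%")
```

The same helper can be applied to any other column of the table by swapping in that system's scores.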
To test whether the discovery loop transfers beyond ADRS, we evaluate ScientistOne unmodified on six tasks spanning medical imaging, fine-grained recognition, 3D perception, and parameter-constrained language modeling. Five tasks come from MLE-Bench (Kaggle competitions in the Medium and High difficulty tiers); the sixth is Parameter Golf, a live competition requiring participants to train the highest-performing language model under strict size and performance constraints. The two systems compared here, DeepScientist and ScientistOne, are both provided with a knowledge base of official leaderboard solutions up to a cutoff date of April 27, 2026.
| Task | Dir. | DeepScientist Score | DeepScientist Highlight | ScientistOne Score | ScientistOne Highlight |
|---|---|---|---|---|---|
| 3D Object Detection | ↑ | 0.0000 | Below Median | 0.1763 | Gold Medal |
| AI4Code | ↑ | 0.6964 | Below Median | 0.8356 | Above Median |
| iMet 2020 FGVC7 | ↑ | 0.6804 | Silver Medal | 0.6791 | Silver Medal |
| RSNA Brain Tumor | ↑ | 0.6377 | Gold Medal | 0.6518 | Gold Medal |
| iNaturalist 2019 FGVC6 | ↓ | 0.2158 | Silver Medal | 0.2445 | Silver Medal |
| Parameter Golf | ↓ | Invalid | Size limit exceeded | 1.0600 | SOTA¹ |
¹ As of leaderboard cutoff date April 27, 2026. The leaderboard has since been updated with newer results.
@article{meng2026scientistone,
title = {ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence},
author = {Meng, Rui and Dalvi Mishra, Bhavana and Chen, Jiefeng and Li, Chun-Liang and Goyal, Palash and Parmar, Mihir and Song, Yiwen and Song, Yale and Sinha, Rajarishi and Ranganathan, Parthasarathy and Gokturk, Burak and Yoon, Jinsung and Pfister, Tomas},
journal = {arXiv preprint},
year = {2026}
}