ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Rui Meng, Bhavana Dalvi Mishra*, Jiefeng Chen*, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister
* Equal contribution
Google Cloud AI Research

ScientistOne autonomously generates research papers with verifiable evidence chains—every claim traces to code, data, or literature—while matching or exceeding human expert performance on frontier algorithm discovery tasks.

Generated Papers & Code

21 papers and their solver code, autonomously generated by ScientistOne across three benchmarks.

Key Results

  • Hallucinated References: 0 / 337. All references verified against real publications.
  • Score Verification: 12 / 12. Every claimed result reproduces under re-evaluation.
  • Method-Code Alignment: 14 / 15. Paper descriptions match the submitted code.
  • Numerical CPR (Claim Provenance Rate): 98%. Quantitative claims traceable to experiment logs.
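
As a rough illustration of how a rate such as the Numerical CPR can be computed, the sketch below divides the number of quantitative claims that carry a pointer to an experiment log by the total number of quantitative claims. The `Claim` record, its `log_source` field, and the function name are hypothetical and chosen only for this example; the page does not describe how ScientistOne stores claims internally.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Claim:
    """A quantitative claim extracted from a paper (hypothetical record)."""
    text: str
    # Hypothetical field: identifier of the experiment log backing the claim,
    # or None when no supporting log entry could be found.
    log_source: Optional[str]


def claim_provenance_rate(claims: Sequence[Claim]) -> float:
    """Fraction of claims that are traceable to an experiment log."""
    if not claims:
        raise ValueError("cannot compute a provenance rate over zero claims")
    traceable = sum(1 for claim in claims if claim.log_source is not None)
    return traceable / len(claims)
```

Under this definition, a paper with 50 quantitative claims of which 49 cite a log entry would score 0.98.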

System Overview

ScientistOne is an end-to-end autonomous research system whose pipeline—Problem Investigator, Discovery Engine, and Paper Writer with Claim Verifier—is designed to satisfy Chain-of-Evidence natively. The Problem Investigator reads up to 100 full-text PDFs per topic, producing grounded experiment briefs. The Discovery Engine uses a parallel explore-exploit search tree to discover high-performing algorithms. The Claim Verifier checks every claim in the draft against its declared evidence source before the final paper is produced.

ScientistOne system overview
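
To make the idea of a parallel explore-exploit search tree more concrete, here is a minimal sketch of one way such a selection step could look. It is an assumption for illustration only: the page does not say which selection rule the Discovery Engine uses, so the sketch swaps in the standard UCB1 bandit score, and the `Node`, `ucb1`, and `select_batch` names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One candidate algorithm in the search tree (hypothetical structure)."""
    name: str
    visits: int = 0
    total_score: float = 0.0
    children: List["Node"] = field(default_factory=list)

    def mean_score(self) -> float:
        return self.total_score / self.visits if self.visits else 0.0


def ucb1(node: Node, parent_visits: int, c: float = 1.4) -> float:
    """UCB1 score: exploitation (mean) plus an exploration bonus."""
    if node.visits == 0:
        return math.inf  # unvisited candidates are always tried first
    bonus = c * math.sqrt(math.log(max(parent_visits, 1)) / node.visits)
    return node.mean_score() + bonus


def select_batch(parent: Node, k: int, c: float = 1.4) -> List[Node]:
    """Pick the k most promising children to evaluate in parallel."""
    ranked = sorted(
        parent.children,
        key=lambda child: ucb1(child, parent.visits, c),
        reverse=True,
    )
    return ranked[:k]
```

Raising `c` shifts the batch toward rarely visited candidates (exploration); lowering it favors candidates with the best mean score so far (exploitation).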

CoE Audit

CoE Audit is a post-hoc audit that checks whether claims in a completed paper are supported by the underlying artifacts—code, evaluator outputs, and bibliography. It comprises four integrity checks—Score Verification, Specification Violation, Reference Verification, and Method-Code Alignment—each targeting a specific way a claim can lose its grounding.

CoE Audit overview
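
The four checks can be pictured as fields of a single per-paper audit record. The sketch below is a hypothetical illustration, not the CoE Audit implementation: the `AuditReport` class, the `verify_score` helper, and the relative tolerance are all assumptions. It treats Score Verification as a comparison between the score claimed in the paper and the score obtained by re-running the evaluator.

```python
import math
from dataclasses import dataclass
from typing import List


def verify_score(claimed: float, reevaluated: float, rel_tol: float = 1e-3) -> bool:
    """Score Verification sketch: the claimed score must reproduce on re-run.

    The tolerance is an assumed value, not one taken from the paper.
    """
    return math.isclose(claimed, reevaluated, rel_tol=rel_tol)


@dataclass(frozen=True)
class AuditReport:
    """Outcome of the four integrity checks for one paper (hypothetical)."""
    score_verified: bool        # Score Verification
    spec_violation: bool        # Specification Violation
    hallucinated_refs: int      # Reference Verification
    method_code_aligned: bool   # Method-Code Alignment

    def failed_checks(self) -> List[str]:
        """Names of the checks on which the paper loses its grounding."""
        failed = []
        if not self.score_verified:
            failed.append("Score Verification")
        if self.spec_violation:
            failed.append("Specification Violation")
        if self.hallucinated_refs > 0:
            failed.append("Reference Verification")
        if not self.method_code_aligned:
            failed.append("Method-Code Alignment")
        return failed
```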

Integrity Audit Results

We apply CoE Audit to 75 papers from five autonomous research systems across five frontier systems-research tasks. Every baseline exhibits at least one systematic integrity failure: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method–code alignment ranges from 20% to 80%. ScientistOne is the only system to lead on all four checks—a direct consequence of maintaining evidence chains throughout the pipeline rather than retrofitting them at paper-writing time.

| System | Score Verif. ↑ | Spec. Violation ↓ | Ref. Verif. ↓ | Method-Code Align. ↑ |
| --- | --- | --- | --- | --- |
| Sakana ASv2 | 5/12 (42%) | 10/15† | 0/159 (0%) | 5/15 (33%)† |
| AutoResearchClaw | 5/12 (42%) | 0/15 | 3/196 (1.5%) | 3/15 (20%) |
| DeepScientist | 11/12 (92%) | 0/15 | 42/201 (20.9%) | 5/15 (33%) |
| AI-Researcher | 9/12 (75%) | 1/15 | 21/222 (9.5%) | 12/15 (80%) |
| ScientistOne | 12/12 (100%) | 0/15 | 0/337 (0%) | 14/15 (93%) |

† By system design, Sakana's solution code submitted for audit contains non-solver scaffolding rather than a standalone solver, inflating its I2 (Specification Violation) and I4 (Method-Code Alignment) counts. Cross-system comparisons on these two checks should exclude Sakana. See paper for details.
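
The percentages and the "leads on all four checks" statement can be re-derived from the raw counts. The sketch below copies the counts from the table, excludes Sakana from the Specification Violation and Method-Code Alignment comparisons as the footnote recommends, and treats ties as shared leadership. The dictionary keys and function names are hypothetical.

```python
from fractions import Fraction

# Counts copied from the audit table as (numerator, denominator).
AUDIT = {
    "Sakana ASv2":      {"score": (5, 12),  "spec": (10, 15), "refs": (0, 159),  "align": (5, 15)},
    "AutoResearchClaw": {"score": (5, 12),  "spec": (0, 15),  "refs": (3, 196),  "align": (3, 15)},
    "DeepScientist":    {"score": (11, 12), "spec": (0, 15),  "refs": (42, 201), "align": (5, 15)},
    "AI-Researcher":    {"score": (9, 12),  "spec": (1, 15),  "refs": (21, 222), "align": (12, 15)},
    "ScientistOne":     {"score": (12, 12), "spec": (0, 15),  "refs": (0, 337),  "align": (14, 15)},
}

# Direction of each check, matching the arrows in the table header.
HIGHER_IS_BETTER = {"score": True, "spec": False, "refs": False, "align": True}

# Per the footnote, Sakana is left out of these two comparisons.
EXCLUDED = {"spec": {"Sakana ASv2"}, "align": {"Sakana ASv2"}}


def percent(system: str, check: str) -> float:
    """Rate for one table cell, rounded to one decimal place."""
    num, den = AUDIT[system][check]
    return round(100 * num / den, 1)


def leaders(check: str) -> set:
    """Systems with the best rate on a check (ties share the lead)."""
    rates = {
        system: Fraction(*row[check])
        for system, row in AUDIT.items()
        if system not in EXCLUDED.get(check, set())
    }
    best = max(rates.values()) if HIGHER_IS_BETTER[check] else min(rates.values())
    return {system for system, rate in rates.items() if rate == best}


def leads_every_check() -> set:
    """Systems that lead, or tie for the lead, on all four checks."""
    result = set(AUDIT)
    for check in HIGHER_IS_BETTER:
        result &= leaders(check)
    return result
```

Using exact fractions for the comparison avoids treating two rates as different because of floating-point rounding.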

Solution Discovery Performance

On the ADRS benchmark's five frontier systems-research tasks, most systems match or exceed the human expert baselines on most tasks, consistent with prior observations that LLM-based agents rapidly converge to similar solution quality. ScientistOne achieves the best overall scores on Cloudcast and EPLB, and is on par with specialized algorithm-discovery systems (AdaEvolve, EvoX) that do not produce research papers. Critically, this performance comes with no integrity tradeoff: ScientistOne is the only system that pairs competitive solver scores with full evidence-chain verifiability.

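| Task | Dir. | Human | AdaEvolve* | EvoX* | Sakana | ARC | AIR | DS | ScientistOne |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prism | ↑ | 21.89 | 26.26 | 26.26 | 26.26 | 26.25 | 26.26 | 26.26 | 26.26 |
| Cloudcast | ↓ | 626.24 | 637.10 | 623.69 | 627.11 | 690.37 | 734.28 | 620.09 | 618.08 |
| EPLB | ↑ | 0.1265 | 0.1450 | 0.1453 | 0.1270 | 0.1266 | 0.1449 | 0.1284 | 0.1459 |
| LLM-SQL | ↑ | 0.6920 | 0.7520 | 0.7300 | 0.7320 | 0.6757 | 0.7148 | 0.7307 | 0.7222 |
| TXN | ↑ | 2724.8 | 4310 | 4310 | 4184 | 3247 | 4311 | 4286 | 3906 |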

* Gemini-3.0-Pro; all other systems use Gemini-3.1-Pro. ARC = AutoResearchClaw, AIR = AI-Researcher, DS = DeepScientist. Scores for Sakana, ARC, AIR, DS, and ScientistOne come from independent re-runs of the canonical evaluator. Human, AdaEvolve, and EvoX scores are taken from the original publications.
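
A direction-aware comparison against the human baseline can be sketched as follows. The scores are copied from the table. The optimization direction is an assumption inferred from the text: Cloudcast is treated as lower-is-better because ScientistOne's best-overall score there is the lowest value in the row, and the other four tasks are treated as higher-is-better. Function and variable names are hypothetical.

```python
# Assumption: only Cloudcast is minimized; all other tasks are maximized.
LOWER_IS_BETTER = {"Cloudcast"}

SYSTEMS = ["Human", "AdaEvolve", "EvoX", "Sakana", "ARC", "AIR", "DS", "ScientistOne"]

# Scores copied from the table, in the same column order as SYSTEMS.
SCORES = {
    "Prism":     [21.89, 26.26, 26.26, 26.26, 26.25, 26.26, 26.26, 26.26],
    "Cloudcast": [626.24, 637.10, 623.69, 627.11, 690.37, 734.28, 620.09, 618.08],
    "EPLB":      [0.1265, 0.1450, 0.1453, 0.1270, 0.1266, 0.1449, 0.1284, 0.1459],
    "LLM-SQL":   [0.6920, 0.7520, 0.7300, 0.7320, 0.6757, 0.7148, 0.7307, 0.7222],
    "TXN":       [2724.8, 4310, 4310, 4184, 3247, 4311, 4286, 3906],
}


def at_least_as_good(task: str, score: float, reference: float) -> bool:
    """True if `score` matches or beats `reference` in the task's direction."""
    if task in LOWER_IS_BETTER:
        return score <= reference
    return score >= reference


def tasks_matching_human(system: str) -> list:
    """Tasks on which a system matches or exceeds the human baseline."""
    column = SYSTEMS.index(system)
    return [
        task
        for task, row in SCORES.items()
        if at_least_as_good(task, row[column], row[0])
    ]


def best_systems(task: str) -> list:
    """Non-human systems holding the best score on a task (ties included)."""
    row = SCORES[task][1:]
    best = min(row) if task in LOWER_IS_BETTER else max(row)
    return [system for system, value in zip(SYSTEMS[1:], row) if value == best]
```

Calling `tasks_matching_human` for each system shows where a system falls short of the human baseline, and `best_systems` identifies the per-task leader under the assumed directions.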

Generalizability: MLE-Bench & Parameter Golf

To test whether the discovery loop transfers beyond ADRS, we evaluate ScientistOne unmodified on six tasks spanning medical imaging, fine-grained recognition, 3D perception, and parameter-constrained language modeling. Five tasks come from MLE-Bench (Kaggle competitions in the Medium and High difficulty tiers); the sixth is Parameter Golf, a live competition requiring participants to train the highest-performing language model under strict size and performance constraints. Both ScientistOne and DeepScientist, the comparison system, are provided with a knowledge base of official leaderboard solutions up to a cutoff date of April 27, 2026.

| Task | DeepScientist Score | DeepScientist Highlight | ScientistOne Score | ScientistOne Highlight |
| --- | --- | --- | --- | --- |
| 3D Object Detection | 0.0000 | Below Median | 0.1763 | Gold Medal |
| AI4Code | 0.6964 | Below Median | 0.8356 | Above Median |
| iMet 2020 FGVC7 | 0.6804 | Silver Medal | 0.6791 | Silver Medal |
| RSNA Brain Tumor | 0.6377 | Gold Medal | 0.6518 | Gold Medal |
| iNaturalist 2019 FGVC6 | 0.2158 | Silver Medal | 0.2445 | Silver Medal |
| Parameter Golf | Invalid | Size limit exceeded | 1.0600 | SOTA¹ |

¹ As of leaderboard cutoff date April 27, 2026. The leaderboard has since been updated with newer results.

Highlights

  • Gold on 3D Object Detection: ScientistOne earns a Gold Medal (0.1763) on this High-difficulty task, where DeepScientist scores 0.0000 and fails entirely.
  • SOTA on Parameter Golf: On this live LLM training competition, ScientistOne meets all constraints and achieves top-1 leaderboard performance (1.0600) as of the leaderboard cutoff date (April 27, 2026). DeepScientist exceeds the 16MB artifact size limit and produces an invalid submission.
  • Genuine algorithmic novelty: On Parameter Golf, ScientistOne introduces Hessian-diagonal-weighted SVD initialization and an alternating-least-squares (ALS) refinement loop with GPTQ, techniques novel to the leaderboard. Internal ablations isolate the ALS loop as the primary driver. DeepScientist introduces no algorithmic changes; its modifications are limited to environment and portability adjustments.
  • Consistent across domains: Two Gold Medals, two Silver Medals, and one Above Median on MLE-Bench; SOTA on Parameter Golf. The same pipeline generalizes from systems optimization to medical imaging, fine-grained recognition, 3D perception, and LLM training without modification.

BibTeX

@article{meng2026scientistone,
  title     = {ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence},
  author    = {Meng, Rui and Dalvi Mishra, Bhavana and Chen, Jiefeng and Li, Chun-Liang and Goyal, Palash and Parmar, Mihir and Song, Yiwen and Song, Yale and Sinha, Rajarishi and Ranganathan, Parthasarathy and Gokturk, Burak and Yoon, Jinsung and Pfister, Tomas},
  journal   = {arXiv preprint},
  year      = {2026}
}