ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

ScientistOne autonomously generates research papers with verifiable evidence chains—every claim traces to code, data, or literature—while matching or exceeding human expert performance on frontier algorithm discovery tasks.

Generated Papers & Code

21 papers and their solver code, autonomously generated by ScientistOne across three benchmarks.

Key Results

0 / 337
Hallucinated References
All references verified against real publications

12 / 12
Score Verification
Every claimed result reproduces under re-evaluation

14 / 15
Method-Code Alignment
Paper descriptions match the submitted code

98%
Numerical CPR
Claim Provenance Rate: quantitative claims traceable to experiment logs

System Overview

ScientistOne is an end-to-end autonomous research system whose pipeline—Problem Investigator, Discovery Engine, and Paper Writer with Claim Verifier—is designed to satisfy Chain-of-Evidence natively. The Problem Investigator reads up to 100 full-text PDFs per topic, producing grounded experiment briefs. The Discovery Engine uses a parallel explore-exploit search tree to discover high-performing algorithms. The Claim Verifier checks every claim in the draft against its declared evidence source before the final paper is produced.
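
The Claim Verifier's check of a claim against its declared evidence source can be sketched as follows. This is a minimal illustration, not ScientistOne's actual interface: the `Claim` record, the `verify_claims` and `provenance_rate` helpers, the flat log dictionary, and the tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """A quantitative claim from a draft (hypothetical structure)."""
    text: str      # sentence as it appears in the draft
    source: str    # declared evidence source, here an experiment-log key
    value: float   # number quoted in the draft


def verify_claims(claims, logs, tol=1e-6):
    """Split claims into (supported, unsupported) against experiment logs.

    A claim counts as supported only if its declared source exists in
    `logs` and the logged value matches the quoted value within `tol`.
    """
    supported, unsupported = [], []
    for claim in claims:
        logged = logs.get(claim.source)
        if logged is not None and abs(logged - claim.value) <= tol:
            supported.append(claim)
        else:
            unsupported.append(claim)
    return supported, unsupported


def provenance_rate(claims, logs):
    """Fraction of claims backed by a matching log entry (a CPR-style ratio)."""
    if not claims:
        return 1.0  # assumption: an empty draft has nothing unsupported
    supported, _ = verify_claims(claims, logs)
    return len(supported) / len(claims)
```

In a pipeline of this shape, any claim returned in the unsupported list would be sent back for revision or removal before the final paper is produced.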

CoE Audit

CoE Audit is a post-hoc audit that checks whether claims in a completed paper are supported by the underlying artifacts—code, evaluator outputs, and bibliography. It comprises four integrity checks—Score Verification, Specification Violation, Reference Verification, and Method-Code Alignment—each targeting a specific way a claim can lose its grounding.
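
One way to picture the four checks is as a per-paper report with one field per check. The sketch below is an assumption-laden illustration rather than the audit's real implementation: `AuditReport`, `check_score`, `find_hallucinated`, the relative tolerance, and title matching by normalized string are all hypothetical. Specification Violation and Method-Code Alignment are shown only as recorded outcomes, since the page does not say how they are decided.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditReport:
    """Outcome of the four integrity checks for one paper (hypothetical)."""
    score_verified: bool       # Score Verification
    spec_violation: bool       # Specification Violation
    hallucinated_refs: tuple   # Reference Verification failures
    method_code_aligned: bool  # Method-Code Alignment

    @property
    def clean(self):
        """True only when no check reports a problem."""
        return (self.score_verified
                and not self.spec_violation
                and not self.hallucinated_refs
                and self.method_code_aligned)


def check_score(claimed, reevaluated, rel_tol=1e-3):
    """Score Verification: does the claimed score reproduce on re-evaluation?"""
    return math.isclose(claimed, reevaluated, rel_tol=rel_tol)


def find_hallucinated(cited_titles, index_titles):
    """Reference Verification: cited titles absent from a trusted index."""
    known = {title.strip().lower() for title in index_titles}
    return tuple(t for t in cited_titles if t.strip().lower() not in known)
```

Keeping the checks separate in the report makes it possible to see which kind of grounding a paper lost, instead of collapsing everything into one pass/fail flag.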

Integrity Audit Results

We apply CoE Audit to 75 papers from five autonomous research systems across five frontier systems-research tasks. Every baseline exhibits at least one systematic integrity failure: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method–code alignment ranges from 20% to 80%. ScientistOne is the only system to lead on all four checks—a direct consequence of maintaining evidence chains throughout the pipeline rather than retrofitting them at paper-writing time.

System	Score Verif. ↑	Spec. Violation ↓	Halluc. Refs. ↓	Method-Code Align. ↑
Sakana ASv2	5/12 (42%)	10/15†	0/159 (0%)	5/15 (33%)†
AutoResearchClaw	5/12 (42%)	0/15	3/196 (1.5%)	3/15 (20%)
DeepScientist	11/12 (92%)	0/15	42/201 (20.9%)	5/15 (33%)
AI-Researcher	9/12 (75%)	1/15	21/222 (9.5%)	12/15 (80%)
ScientistOne	12/12 (100%)	0/15	0/337 (0%)	14/15 (93%)

† By system design, the Sakana solution code submitted for audit contains non-solver scaffolding rather than a standalone solver, inflating its counts on the two daggered checks, Specification Violation (I2) and Method-Code Alignment (I4). Cross-system comparisons on these two checks should exclude Sakana. See the paper for details.
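
The percentages in the table follow directly from the raw counts. As a small worked example, the snippet below recomputes the hallucinated-reference rates from the counts listed above; the `rate` helper and the dictionary name are illustrative only.

```python
def rate(numerator, denominator, digits=1):
    """Percentage of `numerator` out of `denominator`, rounded to `digits`."""
    return round(100 * numerator / denominator, digits)


# (hallucinated references, total references), copied from the table above.
HALLUCINATED_REFS = {
    "Sakana ASv2": (0, 159),
    "AutoResearchClaw": (3, 196),
    "DeepScientist": (42, 201),
    "AI-Researcher": (21, 222),
    "ScientistOne": (0, 337),
}

for system, (bad, total) in HALLUCINATED_REFS.items():
    print(f"{system}: {bad}/{total} = {rate(bad, total)}%")
```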

Solution Discovery Performance

On the ADRS benchmark's five frontier systems-research tasks, most systems match or exceed the human expert baseline on most tasks (the exceptions fall on Cloudcast and LLM-SQL), consistent with prior observations that LLM-based agents rapidly converge to similar solution quality. ScientistOne achieves the best scores on Cloudcast and EPLB, ties for the best score on Prism, and is on par with specialized algorithm-discovery systems (AdaEvolve, EvoX) that do not produce research papers. Critically, this performance comes with no integrity tradeoff: ScientistOne is the only system that pairs competitive solver scores with full evidence-chain verifiability.

Task	Dir.	Human	AdaEvo*	EvoX*	Sakana	ARC	AIR	DS	ScientistOne
Prism	↑	21.89	26.26	26.26	26.26	26.25	26.26	26.26	26.26
Cloudcast	↓	626.24	637.10	623.69	627.11	690.37	734.28	620.09	618.08
EPLB	↑	0.1265	0.1450	0.1453	0.1270	0.1266	0.1449	0.1284	0.1459
LLM-SQL	↑	0.6920	0.7520	0.7300	0.7320	0.6757	0.7148	0.7307	0.7222
TXN	↑	2724.8	4310	4310	4184	3247	4311	4286	3906

* AdaEvolve (AdaEvo) and EvoX use Gemini-3.0-Pro; all other systems use Gemini-3.1-Pro. Scores for Sakana, AutoResearchClaw (ARC), AI-Researcher (AIR), DeepScientist (DS), and ScientistOne are from independent canonical evaluator re-runs. Human, AdaEvolve, and EvoX scores are from the original publications.
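
Because the optimization direction differs by task, reading the table takes a little care. The sketch below, with scores copied from the table and hypothetical helper names (`leaders`, `below_human`), finds the leading systems per task and flags any system that falls short of the human baseline.

```python
SYSTEMS = ["Human", "AdaEvo", "EvoX", "Sakana", "ARC", "AIR", "DS", "ScientistOne"]

# task -> (direction, scores in SYSTEMS order); "up" = higher is better.
RESULTS = {
    "Prism": ("up", [21.89, 26.26, 26.26, 26.26, 26.25, 26.26, 26.26, 26.26]),
    "Cloudcast": ("down", [626.24, 637.10, 623.69, 627.11, 690.37, 734.28, 620.09, 618.08]),
    "EPLB": ("up", [0.1265, 0.1450, 0.1453, 0.1270, 0.1266, 0.1449, 0.1284, 0.1459]),
    "LLM-SQL": ("up", [0.6920, 0.7520, 0.7300, 0.7320, 0.6757, 0.7148, 0.7307, 0.7222]),
    "TXN": ("up", [2724.8, 4310, 4310, 4184, 3247, 4311, 4286, 3906]),
}


def leaders(direction, scores):
    """Names of all entries tied for the best score on one task."""
    best = max(scores) if direction == "up" else min(scores)
    return [name for name, s in zip(SYSTEMS, scores) if s == best]


def below_human(direction, scores):
    """Systems that are strictly worse than the human baseline on one task."""
    human = scores[0]
    if direction == "up":
        return [n for n, s in zip(SYSTEMS[1:], scores[1:]) if s < human]
    return [n for n, s in zip(SYSTEMS[1:], scores[1:]) if s > human]


for task, (direction, scores) in RESULTS.items():
    print(task, leaders(direction, scores), below_human(direction, scores))
```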

Generalizability: MLE-Bench & Parameter Golf

To test whether the discovery loop transfers beyond ADRS, we evaluate ScientistOne unmodified on six tasks spanning medical imaging, fine-grained recognition, 3D perception, and parameter-constrained language modeling, with DeepScientist as the baseline. Five tasks come from MLE-Bench (Kaggle competitions in the Medium and High difficulty tiers); the sixth is Parameter Golf, a live competition requiring participants to train the highest-performing language model under strict size and performance constraints. Both systems are provided with a knowledge base of official leaderboard solutions up to a cutoff date of April 27, 2026.

Task	Dir.	DeepScientist Score	DeepScientist Highlight	ScientistOne Score	ScientistOne Highlight
3D Object Detection	↑	0.0000	Below Median	0.1763	Gold Medal
AI4Code	↑	0.6964	Below Median	0.8356	Above Median
iMet 2020 FGVC7	↑	0.6804	Silver Medal	0.6791	Silver Medal
RSNA Brain Tumor	↑	0.6377	Gold Medal	0.6518	Gold Medal
iNaturalist 2019 FGVC6	↓	0.2158	Silver Medal	0.2445	Silver Medal
Parameter Golf	↓	Invalid	Size limit exceeded	1.0600	SOTA¹

¹ As of leaderboard cutoff date April 27, 2026. The leaderboard has since been updated with newer results.

Highlights

Gold on 3D Object Detection: ScientistOne earns a Gold Medal (0.1763) on this High-difficulty task, where DeepScientist scores 0.0000 and fails entirely.
SOTA on Parameter Golf: On this live LLM training competition, ScientistOne meets all constraints and achieves top-1 leaderboard performance (1.0600) as of the leaderboard cutoff (April 27, 2026). DeepScientist exceeds the 16 MB artifact size limit and produces an invalid submission.
Genuine algorithmic novelty: On Parameter Golf, ScientistOne introduces Hessian-diagonal-weighted SVD initialization and an alternating-least-squares (ALS) refinement loop with GPTQ, both techniques novel to the leaderboard. Internal ablations isolate the ALS loop as the primary driver. DeepScientist introduces no algorithmic changes; its modifications are limited to environment and portability adjustments.
Consistent across domains: Two Gold Medals, two Silver Medals, and one Above Median on MLE-Bench, plus SOTA on Parameter Golf. The same pipeline generalizes from systems optimization to medical imaging, fine-grained recognition, 3D perception, and LLM training without modification.

BibTeX

@article{meng2026scientistone,
  title     = {ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence},
  author    = {Meng, Rui and Dalvi Mishra, Bhavana and Chen, Jiefeng and Li, Chun-Liang and Goyal, Palash and Parmar, Mihir and Song, Yiwen and Song, Yale and Sinha, Rajarishi and Ranganathan, Parthasarathy and Gokturk, Burak and Yoon, Jinsung and Pfister, Tomas},
  journal   = {arXiv preprint},
  year      = {2026}
}