AutoScientists replaces the central planner with a decentralized “team” of AI agents. These digital scientists look at the shared data and self-organize into specialized groups around the most promising hypotheses. Before they spend compute on an experiment, they critique each other’s proposals and filter out the weak ones. They also keep a collective log of both successes and failures, so the whole system avoids redundant work.
Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision, often requiring researchers to explore multiple competing directions as evidence accumulates and priorities shift. LLM agents can automate parts of this process, but existing agents either concentrate reasoning within a single research thread or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration across research directions or reorganize as promising and unproductive directions emerge over time.
We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Rather than following decisions from a central orchestrator, agents independently interpret a shared experimental state, self-organize into teams around research directions, critique and filter proposals in a discussion phase before committing experimental compute, and exchange both successful and failed findings across teams to avoid redundant exploration.
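The loop described above can be illustrated with a toy sketch. This is not the AutoScientists implementation: the names `SharedState`, `Agent`, and `run_round`, the per-agent `preferences` table, and the simple majority vote used for the discussion phase are all hypothetical choices made for illustration. It shows only the shape of the idea: no central planner assigns work, proposals are filtered before any experiment runs, and every outcome (accepted, failed, or filtered) is logged so that no agent revisits it.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Shared experimental state: best score so far plus a log of all findings."""
    best_score: float = 0.0
    # Each finding is (direction, score or None, status); status is one of
    # "accepted", "failed", or "filtered" (rejected before running).
    findings: list = field(default_factory=list)

    def tried(self, direction):
        return any(d == direction for d, _, _ in self.findings)


@dataclass
class Agent:
    name: str
    preferences: dict  # hypothetical: direction -> this agent's expected gain

    def choose_direction(self, state):
        # Each agent reads the shared state itself and skips logged directions.
        open_dirs = {d: g for d, g in self.preferences.items() if not state.tried(d)}
        return max(open_dirs, key=open_dirs.get) if open_dirs else None

    def approves(self, direction):
        # Toy critique (an assumption): approve only if a positive gain is expected.
        return self.preferences.get(direction, 0.0) > 0.0


def run_round(agents, state, run_experiment):
    """One round: form teams, discuss, run approved experiments, log everything."""
    # 1. Self-organize: agents that picked the same direction form a team.
    teams = {}
    for agent in agents:
        direction = agent.choose_direction(state)
        if direction is not None:
            teams.setdefault(direction, []).append(agent)

    for direction in sorted(teams):
        # 2. Discussion: a strict majority of all agents must approve
        #    before any compute is spent on the proposal.
        votes = sum(agent.approves(direction) for agent in agents)
        if votes * 2 <= len(agents):
            state.findings.append((direction, None, "filtered"))
            continue
        # 3. Run the experiment and log the result, whether it helped or not.
        score = run_experiment(direction)
        if score > state.best_score:
            state.best_score = score
            state.findings.append((direction, score, "accepted"))
        else:
            state.findings.append((direction, score, "failed"))
    return teams
```

Calling `run_round` repeatedly with the same `SharedState` lets teams re-form each round around whatever directions remain open. Logging failed and filtered proposals alongside accepted ones is what keeps a later team from spending compute on the same idea.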
Under matched experimental budgets, AutoScientists outperforms prior agentic systems across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest prior biomedical agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9× faster than autoresearch and continues discovering improvements from a stronger starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2–Spike binding that improves over the current state-of-the-art model by +12.5% Spearman correlation. Applied without modification to all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% in Spearman correlation.
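For readers unfamiliar with the ProteinGym metric, the sketch below shows how a Spearman rank correlation between measured fitness values and model scores can be computed. It is a generic, dependency-free illustration (Pearson correlation applied to tie-averaged ranks), not the benchmark's evaluation code, and the function names `average_ranks` and `spearman` are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant (undefined correlation).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks enter the calculation, a predictor is rewarded for ordering variants correctly even when its raw scores are on a different scale from the measured fitness.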







