They Built an AI Scientist: What Its First Accepted Paper Really Means

13 May 2026
A new AI system claims to automate the entire machine learning research process—from idea generation to writing papers—for just $15 per project. One of its papers even passed peer review at a major AI conference workshop, but a closer look shows why human scientists aren’t obsolete just yet.

The system in question promises to act as a full-blown "AI scientist": designing experiments, running them, and writing up the results as complete research papers for about $15 apiece. When one of those auto-generated papers passed peer review at a workshop attached to a major machine learning conference, headlines quickly asked about the future of scientific work and whether researchers are now replaceable.

The reality is more complicated—and a lot less scary—than the hype suggests.

What This "AI Scientist" Actually Is

The system comes from Sakana AI, a Tokyo-based startup co-founded by former Google researchers David Ha and Llion Jones. Their work, described in a paper titled "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery," aims to automate the entire machine learning research pipeline using a swarm of AI agents.

Instead of a single model doing everything, the system is made up of multiple specialized agents that collaborate. In broad strokes, it works like this (with a rough code sketch after the list):

1. Idea generation: Agents scan existing literature and propose new research ideas they claim are novel.

2. Experiment design and execution: Other agents turn those ideas into concrete experiments, write code, run training jobs, and collect results.

3. Paper writing: Finally, a writing agent assembles the whole process into a research paper, complete with abstract, introduction, methods, experiments, and conclusions.
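
To make that division of labor concrete, here is a minimal sketch of the loop in Python. Every name in it (Idea, propose_ideas, run_experiment, write_paper) is a hypothetical stand-in for the published description, not Sakana's actual code or API:

```python
# Minimal sketch of the three-stage agent pipeline described above.
# All names are hypothetical illustrations, not Sakana's real code.
from dataclasses import dataclass

@dataclass
class Idea:
    title: str
    hypothesis: str

def propose_ideas(literature: list[str], n: int = 3) -> list[Idea]:
    """Idea agent: scan prior work and propose supposedly novel directions."""
    return [Idea(f"Idea {i}", "placeholder hypothesis") for i in range(n)]

def run_experiment(idea: Idea) -> dict:
    """Experiment agent: write code, launch training runs, collect metrics."""
    return {"val_accuracy": 0.0}  # placeholder metrics

def write_paper(idea: Idea, results: dict) -> str:
    """Writing agent: assemble abstract, methods, and results into a draft."""
    return f"Title: {idea.title}\nHypothesis: {idea.hypothesis}\nResults: {results}"

def research_pipeline(literature: list[str]) -> list[str]:
    """Each idea flows through experiments and into a paper draft."""
    return [write_paper(idea, run_experiment(idea))
            for idea in propose_ideas(literature)]

drafts = research_pipeline(["prior paper 1", "prior paper 2"])
```

The point of the architecture is that each stage can fail independently, which, as we'll see, is exactly what the independent evaluation found.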

The bold claim from Sakana: these AI-generated papers can exceed acceptance thresholds at top machine learning conferences, at least as judged by automated reviewers—and in one case, by real human reviewers at a workshop.

The Big Claim: An AI Paper Passed Peer Review

Sakana submitted three AI-generated papers to a workshop at one of the world’s top machine learning conferences. One of them was accepted after peer review, which sounds like a huge milestone for automated science.

But there are some important details that change how impressive this really is:

It was a workshop, not the main conference. Workshops sit on the informal end of the scientific publishing spectrum. They are often designed to encourage discussion, share work-in-progress, and highlight failures or negative results. Acceptance rates can be around 60–70%, compared with 20–30% (or less) for the main conference tracks.

The specific workshop focused on failures and negative results. The accepted AI paper went to a workshop with the tongue-in-cheek title "I Can't Believe It's Not Better," essentially a venue for talking about why applied deep learning often underperforms expectations. The bar is more about being interesting or thought-provoking than delivering flawless, groundbreaking science.

Reviewers were likely more junior and the review lighter-touch. Newer workshops are often run and reviewed by junior researchers who are still learning the ropes of peer review. That doesn't mean the work is meaningless, but it does mean the process is far less rigorous than review at a main conference track, let alone a top-tier journal like Nature or Science.

So yes, an AI-generated paper did get past human reviewers—but in a context designed to be exploratory, forgiving, and even a bit playful.

Independent Evaluation: "Like an Unmotivated Undergraduate"

To really understand how capable this AI scientist is, an independent team of researchers evaluated its output. Their verdict was harsh.

One researcher compared the system’s work to "an unmotivated undergraduate student rushing to meet a deadline." When they dug into the details, they found serious problems at every stage of the research pipeline.

Weak Literature Review and Fake Novelty

The system’s literature review was described as inadequate. It often:

• Misjudged novelty: It labeled ideas as "novel" even when similar work already existed. In other words, it didn’t really understand the frontier of the field—it just declared things new because that sounds impressive in a paper.

• Used outdated citations: Most papers included only about five citations, many of which were old or not the most relevant. For a modern machine learning paper, that’s a red flag.

Broken Experiments and Contradictory Results

The system also struggled badly with actually running experiments:

• Coding failures: In one evaluation, 5 out of 12 proposed experiments failed outright due to coding errors. The AI could write code, but not reliably enough for serious research.

• Logical flaws: Some experiments produced misleading or logically flawed results. In one case, a method that was supposed to optimize energy efficiency actually used more compute, directly contradicting its stated goal.

These aren’t small mistakes—they’re the kind of issues that would get a human researcher’s paper rejected quickly at a serious venue.

Messy, Hallucinated Writing

Even on the writing side, where large language models usually shine, the AI scientist stumbled:

• Placeholder text left in: Some manuscripts literally contained phrases like "Conclusions here" that were never replaced with real content—classic rushed-draft behavior.

• Hallucinated numbers and results: Several papers invented numerical results that were never actually produced by experiments. This is especially dangerous in science, where numbers are supposed to be verifiable.

• Embarrassing references: The system sometimes hallucinated citations or produced obviously incorrect references, something Sakana themselves later described as embarrassing in their own blog posts.

Put simply: as a junior assistant that can brainstorm and draft, the system is interesting. As a standalone scientist, it’s nowhere near ready.

How Much Is Really Automated?

Another subtle but important point: in practice, the process was not truly automated end to end.

Sakana generated a range of AI-written papers, then humans stepped in and selected the three "best" ones to submit to the workshop. That human filtering step is crucial. It means:

• The system doesn’t know which of its outputs are good. It can generate lots of content, but it still relies on humans to judge quality and weed out the worst failures.

• The headline claim of fully automated research is overstated. If humans are curating the outputs, we're not at true end-to-end automation yet; we're at human+AI collaboration (see the sketch below).
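
As a rough illustration, here is a toy sketch of that generate-then-curate loop. The names (generate_drafts, auto_review_score) are hypothetical stand-ins, not anything from Sakana's real pipeline:

```python
# Toy sketch of "generate many, humans pick the best."
# All names here are hypothetical stand-ins for illustration.

def generate_drafts(n: int) -> list[str]:
    """Stand-in for the automated pipeline emitting n candidate papers."""
    return [f"draft_{i}" for i in range(n)]

def auto_review_score(draft: str) -> float:
    """Stand-in for an LLM reviewer's quality score (imperfect by design)."""
    return 0.0  # placeholder score

drafts = generate_drafts(20)
ranked = sorted(drafts, key=auto_review_score, reverse=True)

# The step that breaks "end-to-end": humans read the top candidates and
# hand-pick the few worth submitting, because the system itself cannot
# reliably tell its good outputs from its failures.
submissions = ranked[:3]
```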

Even critics of the project point out that this actually reinforces a more realistic takeaway: combining human oversight with AI tools can be effective, but the AI alone is not ready to replace researchers.

Limitations the Creators Admit

To their credit, Sakana AI is fairly open about the system’s weaknesses. In their own documentation, they list several key limitations:

• Naive or undeveloped ideas: The system often proposes shallow or unoriginal research directions.

• Weak methodological rigor: It struggles with deep, careful reasoning about methods and experimental design—one of the hardest and most important parts of real science.

• Susceptibility to hallucinations and obvious mistakes: From fake numbers to incorrect citations, the system still makes the classic errors we see in today’s large language models.

So while the Nature news coverage and the workshop acceptance sound dramatic, even the creators aren't claiming this AI is ready to run a lab on its own.

The Bigger Risk: Narrowing What Science Works On

Beyond whether an AI can pass peer review, there’s a deeper question: what happens to science if we start relying heavily on tools like this?

A recent paper in Nature, titled "Artificial intelligence tools expand scientists’ impact, but contract science’s focus," argues that AI can help researchers reach more people and produce more work—but at the cost of narrowing what gets studied.

Here’s why that matters:

• AI tends to optimize for what’s already popular. If models are trained on existing literature and rewarded for generating "publishable" ideas, they’ll mostly remix what’s already fashionable, not propose truly weird, risky, or disruptive directions.

• Science could become more conservative. If AI tools steer researchers toward safe, incremental work that looks like past successes, we risk losing the kind of creative leaps that come from humans exploring odd questions for their own sake.

• Errors can scale fast. When AI systems hallucinate results or misjudge novelty, and those outputs are trusted or reused, bad information can spread quickly through the literature.

These concerns echo broader debates about powerful frontier models and their impact on research and society.

So… Are Scientists Replaceable Yet?

Right now, no. This AI scientist is impressive as a proof of concept, but its actual performance is closer to a rushed, error-prone student than a careful expert.

What it does show is where things are heading:

• AI will increasingly handle grunt work. Drafting sections of papers, suggesting baselines, writing boilerplate code, and summarizing prior work are all tasks that AI can already help with, and tools like Claude and other advanced models are already being used this way in research workflows.

• Human judgment becomes even more important. The real value of human scientists will increasingly lie in asking good questions, spotting subtle flaws, designing robust experiments, and deciding which AI-generated ideas are worth pursuing.

• This is the worst these systems will ever be. Today’s AI scientist is clumsy and unreliable, but the trajectory is clear. Over the next few years, the quality of automated research assistance is likely to improve dramatically.

For now, the takeaway is less "AI has replaced scientists" and more "AI is becoming a powerful, if unreliable, research assistant." Used carefully, it could speed up parts of the scientific process. Used blindly, it could flood the field with "endless scientific slop."

The challenge for the research community will be to harness these tools without letting them quietly reshape science into something narrower, noisier, and less trustworthy.
