GPT‑5.6 rumors, new benchmarks, and a big week for AI models and agents

17 Jun 2026 06:37 55,407 views

GPT‑5.6 leaks, Microsoft’s new frontier models, Alibaba’s Qwen 3.7 Plus, Hermes Agent Desktop, and fresh Claude and Codex updates made this a huge week for AI. Here’s what’s coming, why it matters, and how it could change the tools you use every day.

AI is moving fast, and this week delivered a wave of leaks, launches, and quiet but important upgrades. From a rumored GPT‑5.6 release to Microsoft’s new reasoning model, Alibaba’s latest Qwen variant, and powerful agent tools, the next generation of AI is starting to take shape.

GPT‑5.6: leaks, tests, and what to expect

All signs point to GPT‑5.6 being OpenAI’s next major model release, with growing hints that it could arrive very soon. A key OpenAI product lead recently replied “soon” to a discussion about whether OpenAI or Anthropic will pull ahead, suggesting the company believes its next model could shift the balance.

Over the past week, users have reported extensive A/B testing inside ChatGPT, including experimental text and image models and new behavior in the Canvas feature. Some reports suggest Canvas may already be routing to early GPT‑5.6 checkpoints for UI and app generation tasks.

Early whispers around GPT‑5.6 point to:

• Performance on par with Anthropic’s Mythos preview on many benchmarks, while being more token‑efficient and cheaper to run.
• Stronger coding and game generation, with demos showing full playable games (physics, collectibles, scoring, UI) generated from a single prompt rather than simple toy examples.
• Noticeably better UI generation, with cleaner layouts and fewer “UI slop” issues, similar to the jump seen between recent Claude Sonnet releases.

Nothing is confirmed yet, but between June release rumors, increased testing, and OpenAI’s internal confidence, GPT‑5.6 is shaping up to be a significant step rather than a minor tweak. For more context on how GPT‑5.x has been evolving, you can also look at earlier coverage of GPT‑5.5 and next‑gen image models.

Codex grows beyond coding with plugins and hosted apps

OpenAI also rolled out a major update to Codex, and the direction is clear: this is no longer just a tool for developers.

Non‑developers already make up around 20% of Codex’s users, and that segment is growing over three times faster than traditional devs. To support that, OpenAI is adding role‑specific plugins for analysts, marketers, designers, sales teams, investors, and more.

These new plugins let Codex connect directly to the tools people already use and then generate:

• Reports and dashboards
• Presentations and prototypes
• Creative assets and business workflows

OpenAI is also previewing a new “Sites” feature. This lets Codex generate and host interactive apps, dashboards, project hubs, planners, and websites that can be shared via a simple URL—no separate hosting or deployment required.

Long term, OpenAI is clearly aiming for a unified workspace where plugins, workflows, and hosted apps live across both Codex and ChatGPT. The boundary between “developer tool” and “AI assistant” is starting to blur.

A new benchmark and “vibe coding” platform

Alongside the big vendor announcements, a new independent benchmarking platform has launched with a focus on practical, real‑world evaluation rather than marketing claims.

The platform introduces what it calls the first “vibe coding” benchmark and evaluation system. The core idea is to help you figure out which model actually works best for your specific use case, instead of relying on cherry‑picked scores.

Key features include:

• Side‑by‑side model comparisons across multiple domains (reasoning, coding, design, creativity, and more).
• A library of nearly 4,000 prompts used for testing, so you can see exactly how outputs are generated.
• An AI judge system that scores outputs on functionality, design, code quality, creativity, and other criteria.
• Support for connecting your own model endpoints, with detailed feedback on why your model performed a certain way and how to improve prompts, context, or model choice.

Leaderboards and many insights are available for free, making it a useful resource if you’re trying to choose between GPT, Claude, Qwen, and other models for production work.

Microsoft re‑enters the frontier race with new models

At Microsoft Build 2026, Microsoft unveiled seven new AI models spanning reasoning, coding, image generation, image editing, speech‑to‑text, and text‑to‑speech. For the first time in a while, it feels like Microsoft is not just integrating other people’s models—it’s building serious frontier systems of its own.

MAI Thinking 1: a compact but powerful reasoning model

The standout is MAI Thinking 1, Microsoft’s new reasoning model. According to the company, it was trained entirely from scratch, without distillation from third‑party models. That matters because there has been ongoing debate about how independent many top‑tier reasoning models really are.

MAI Thinking 1 is a 35‑billion active parameter mixture‑of‑experts model, yet Microsoft claims:

• It performs toe‑to‑toe with Claude Opus 4.6 on software engineering benchmarks.
• In blind human evaluations, it was actually preferred over Claude Sonnet 4.6 for some tasks.
• It scores competitively on math and GPQA benchmarks versus other proprietary models.

If these claims hold up under independent testing, MAI Thinking 1 could become a strong alternative for reasoning‑heavy coding and analysis tasks.

MAI Code 1 Flash: a new coding workhorse

Microsoft also launched MAI Code 1 Flash, a coding‑focused model rolling out inside GitHub Copilot. On Microsoft’s internal benchmarks, it reportedly outperforms Claude Haiku 4.5 on several coding tests, including SwayBench Verified and BenchPro.

Highlights include:

• Strong performance on multilingual and terminal‑style coding benchmarks.
• Significant token savings—up to 60% fewer tokens in some scenarios—while maintaining quality.

While it may not be the absolute best coding model on the market, it looks like a credible, efficient option for Copilot users who want fast, low‑latency assistance without burning through tokens.

Beyond these, Microsoft also introduced new image generation and editing models, voice models, and a transcription model, rounding out a full multimodal stack.

Leaked compute estimates for Claude Mythos

During the same event, Microsoft appeared to accidentally reveal an estimate of the compute used to train Anthropic’s Claude Mythos. A slide listed Mythos at roughly 6.1 × 10²⁷ FLOPs (floating‑point operations), a measure of total compute used during training.

Many researchers believe that exact number is likely off, but even revised estimates still point to an enormous training run: potentially trillions of parameters trained on hundreds of trillions of tokens. That would place Mythos among the largest and most ambitious AI models ever trained.

This aligns with Anthropic’s positioning of Mythos as more than a small upgrade—closer to a major leap in reasoning, coding, and agentic capabilities. For more background on Anthropic’s trajectory and previous Claude releases, you can revisit our earlier roundup on new Claude features and other model launches.

Hermes Agent gets a native desktop app

Hermes Agent, one of the most capable open‑source agent platforms, now has an official desktop application. Instead of running only in the browser, Hermes can run natively on your machine, offering a smoother, more integrated experience.

Hermes Agent already supports:

• Multi‑agent workflows
• MCP (Model Context Protocol) integration
• Computer use and automation
• Image generation
• Memory systems and advanced orchestration

The new desktop app brings these capabilities closer to the polished feel of commercial AI products, while keeping the flexibility and openness that made Hermes popular. It’s available across major operating systems, including Linux.

Qwen 3.7 Plus: Alibaba’s multimodal, agent‑ready model

Alibaba has introduced Qwen 3.7 Plus, a new variant in the Qwen 3.7 family. Unlike Qwen 3.7 Max, which is primarily text‑focused, Qwen 3.7 Plus is fully multimodal and designed with agents in mind.

Qwen 3.7 Plus combines vision and language in a single model, allowing it to:

• See and analyze images
• Perform visual reasoning and grounding (tying answers directly to what it sees)
• Write and reason about code
• Act as an agent across GUI interactions and command‑line tasks

Alibaba is positioning it as both a coding agent and a productivity assistant, capable of handling mixed visual and text workflows while remaining more efficient than Qwen 3.7 Max. A deeper benchmark breakdown is expected soon, building on earlier analysis of Qwen 3.7 Max as a flagship agent model.

New Claude Code features: /fork and a powerful CLI

Anthropic quietly shipped useful upgrades to Claude Code that make it more agent‑like and easier to integrate into developer workflows.

The new /fork command now launches a background agent that inherits your exact context: tools, model settings, chat history, and even prompt cache. It runs in the background and returns results to your current session, effectively letting you spin up parallel work without leaving the main conversation.

The previous behavior of /fork—creating a separate session you can continue manually—has been renamed to /branch.

Anthropic also released a new CLI for the Claude platform. From the terminal, developers can now:

• Call the Messages API
• Launch and manage Claude agents
• Pipe outputs directly into shell workflows

The CLI is designed to work smoothly with coding agents like Claude Code, making it easier to build end‑to‑end AI workflows without leaving your development environment.

Google NotebookLM experiments with smarter video planning

On the Google side, NotebookLM appears to be testing a new “planning mode” for video overviews. Early sightings suggest this mode gives users more control over how video summaries and presentations are structured.

This may also hint that Google is upgrading NotebookLM’s video features to use the Gemini Omni model. If so, users could see:

• Better visual understanding and narration
• More coherent, customizable video summaries
• Higher‑quality, presentation‑ready outputs directly from NotebookLM

It’s a small but meaningful step toward turning NotebookLM into a true multimodal research and presentation hub.

Agent‑native hardware from Microsoft

In addition to new models, Microsoft also previewed handheld and desktop devices designed specifically around interacting with AI agents.

Instead of treating agents as just another app, these devices aim to provide dedicated hardware for:

• Delegating tasks to agents
• Monitoring ongoing workflows
• Interacting with AI systems throughout the day

This is very similar to what many expected from rumored AI hardware projects elsewhere: hardware that’s built from the ground up for an agent‑first world. It’s still early, but it signals that “agent‑native” devices may become a real category.

Hyperrealistic humanoid robots push the uncanny valley

To top off an already wild week, the World Intelligence Expo in China showcased hyperrealistic humanoid robots that can blink, nod, make eye contact, and mimic human expressions with unsettling accuracy.

These robots combine motion capture, synthetic skin, realistic hair, and highly detailed facial movements. The result is machines that look and move much closer to real humans than previous generations of humanoid robots.

The engineering is impressive and could have real applications in healthcare, education, research, and customer service. At the same time, it raises familiar questions about the uncanny valley and how society will adapt as it becomes harder to instantly tell humans and machines apart.

Overall, this week’s developments—from GPT‑5.6 rumors and new benchmarks to frontier‑level models, agent platforms, and lifelike robots—show just how quickly AI is evolving across software, hardware, and the physical world.