Qwen 3.7 Max: Alibaba’s new flagship model for agents and long-horizon coding

05 Jun 2026 06:37

Alibaba’s Qwen 3.7 Max is a new flagship AI model built for agents, long-horizon workflows, and complex coding tasks. It posts frontier-level benchmark scores, strong multilingual reasoning, and impressive real-world demos like a macOS-style desktop, 3D scenes, and a Minecraft-like sandbox.

Alibaba’s Qwen line just took a big step forward. Qwen 3.7 Max is a new flagship model built specifically for the “agent era” – long-running, tool-using AI that can plan, code, debug, and ship complex projects with minimal hand-holding.

What Qwen 3.7 Max is designed for

Qwen 3.7 Max is positioned as a versatile foundation model with a strong focus on real-world development and automation. It’s built to handle:

• Advanced coding and debugging, including complex refactors
• Front-end prototyping and UI-heavy tasks
• Office workflow automation (documents, spreadsheets, routine tasks)
• Multi-agent orchestration, where several AI agents collaborate
• Long-horizon autonomous execution, where a single task can run for hours with many tool calls

Unlike some frontier models, Qwen 3.7 Max is text-only for now. It doesn’t support images, audio, or video input, but it focuses heavily on reasoning, code quality, and long-running workflows.

Benchmark performance: now in frontier territory

On paper, Qwen 3.7 Max is the closest Alibaba has come to the frontier race so far. It performs strongly on a range of benchmarks, including:

• Terminal-Bench 2.0 (coding and command-line tasks)
• SWE-bench, where it scores 60.6
• Multiple reasoning, coding, and Asian-language benchmarks

On the Artificial Analysis Intelligence Index, Qwen 3.7 Max scores 56.6, a 4.8-point jump over Qwen 3.6 Max Preview. The biggest gains are in scientific reasoning, coding, and agentic (tool-using, planning) capabilities.

In many tests it lands in the same performance band as models like Opus 4.6 Max and Google’s latest Gemini-class models, and in some cases it even surpasses them. Among Chinese models, it currently looks like one of the strongest options available.

For more context on how Qwen has evolved, compare it with the earlier Qwen 3.6 Max Preview, which already showed strong coding and agent abilities.

Real-world agent test: long-horizon coding

One of the most interesting results comes from a long-horizon, agentic coding benchmark. Models were asked to iteratively improve a self-training Tetris bot over 10 autonomous loops. The goal: better performance at the lowest cost.

Results from that test:

• Qwen 3.7 Max: 56% improvement, about $1.30 in API cost
• Claude Opus 4.7: 28% improvement, about $12.15
• GPT-5.5: 7% improvement, about $2.85

Qwen 3.7 Max not only achieved the largest performance gain, it also did so at the lowest cost. That combination of capability and efficiency is exactly what matters for serious agent workflows.
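One way to put capability and cost on a single scale is dollars spent per percentage point of improvement. The sketch below uses only the three figures quoted in the results list; the metric itself and the function name `cost_per_point` are illustrative choices, not part of the benchmark.

```python
# Figures from the results list: (improvement in %, approximate API cost in USD).
results = {
    "Qwen 3.7 Max": (56, 1.30),
    "Claude Opus 4.7": (28, 12.15),
    "GPT-5.5": (7, 2.85),
}


def cost_per_point(improvement_pct: float, cost_usd: float) -> float:
    """USD spent per percentage point of improvement (lower is better)."""
    return cost_usd / improvement_pct


# Sort the models from most to least cost-efficient under this metric.
ranking = sorted(results, key=lambda name: cost_per_point(*results[name]))
for name in ranking:
    print(f"{name}: ${cost_per_point(*results[name]):.3f} per point")
```

By this measure Qwen 3.7 Max spends roughly 2 cents per point of improvement, against roughly 41 and 43 cents for GPT-5.5 and Claude Opus 4.7 respectively.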

Long-horizon planning and tool use

Qwen 3.7 Max is optimized for long, uninterrupted workflows where the model must remember context, call tools, and refine its own work over time.

In one stress test, it sustained coherent reasoning across a 35-hour autonomous execution. During that run, it made around 1,200 tool calls while:

• Debugging and profiling code
• Rewriting and improving implementations
• Maintaining context without drifting off-task

This kind of stability is critical if you’re building serious AI agents that run for hours, manage multiple tools, or coordinate complex pipelines.
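For scale, about 1,200 tool calls over 35 hours averages out to roughly 34 calls per hour. A harness for runs like this usually needs a hard tool-call budget and a bounded context window. The following is a minimal sketch of such a loop, not the harness used in the stress test: `run_agent`, `scripted_policy`, and the tool names are all hypothetical, and the "model" is a scripted stand-in.

```python
from collections import deque
from typing import Callable

# Hypothetical types: a tool maps an argument string to a result string,
# and a policy maps the recent history to the next (tool_name, argument).
Tool = Callable[[str], str]
Policy = Callable[[list], tuple]


def run_agent(policy: Policy, tools: dict, max_tool_calls: int = 1200,
              history_window: int = 50) -> list:
    """Call tools until the policy returns 'done' or the budget is spent."""
    history = deque(maxlen=history_window)  # rolling window of recent steps
    log = []                                # full record of every step
    for _ in range(max_tool_calls):
        tool_name, argument = policy(list(history))
        if tool_name == "done":
            break
        if tool_name in tools:
            result = tools[tool_name](argument)
        else:
            result = f"error: unknown tool {tool_name!r}"
        entry = f"{tool_name}({argument}) -> {result}"
        history.append(entry)
        log.append(entry)
    return log


def scripted_policy(history: list) -> tuple:
    """Stand-in for a model: run tests, then profile, then stop."""
    steps = [("run_tests", "tetris_bot"), ("profile", "tetris_bot"), ("done", "")]
    return steps[min(len(history), len(steps) - 1)]


tools = {
    "run_tests": lambda target: "2 failures",
    "profile": lambda target: "hot loop in evaluate()",
}
log = run_agent(scripted_policy, tools, max_tool_calls=10)
print(log)
```

Keeping the full log separate from the bounded history is one simple way to audit a long run without feeding every past step back to the model.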

Pricing and access

Qwen 3.7 Max is available both via a web chat interface and an API.

Pricing (at launch):

• $2.50 per 1 million input tokens
• $7.50 per 1 million output tokens
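As a rough budgeting aid, the launch prices translate into a cost estimate as sketched below. The function name and the example workload are illustrative assumptions; actual billing may include details (caching, tiers, mode-specific rates) that the figures above don't cover.

```python
INPUT_USD_PER_MILLION = 2.50   # launch price per 1M input tokens
OUTPUT_USD_PER_MILLION = 7.50  # launch price per 1M output tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate API cost in USD from token counts at the launch prices."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Hypothetical workload: 1M tokens read, 200k tokens generated.
print(f"${estimate_cost_usd(1_000_000, 200_000):.2f}")  # prints $4.00
```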

You can use it through a browser-based chat interface for free (with account signup), where you can switch between “thinking” and “fast” modes. For developers, the API gives full programmatic access for agents, apps, and back-end workflows.

Front-end generation: macOS-style desktop and more

One of the standout demos is a macOS-style desktop clone generated from a single prompt. Qwen 3.7 Max produced:

• A top menu bar with working controls like brightness adjustment, Spotlight, and Launchpad
• A dock with multiple apps, each with its own SVG icon
• Functional apps including Finder, App Store, System Preferences, Terminal, Calculator, Text Editor, a simple Paint app, a Snake game, Weather, Clock, and Preview

Not every app was perfect (for example, Safari wasn’t fully implemented), but the breadth and coherence of the entire desktop environment from one prompt is impressive.

Overall, for front-end tasks, Qwen 3.7 Max is “good but not best-in-class.” It tends to produce usable, if sometimes slightly tacky, UIs. When given detailed instructions about layout, typography, animations, and libraries, quality improves significantly.

Instruction following and UI cloning

Qwen 3.7 Max does well at following detailed instructions, especially for front-end work. When you clearly specify components, interactions, and styling, it can generate:

• Scroll-triggered animations
• Thoughtful typography choices
• Structured layouts with the right hierarchy

When given a screenshot as a reference, it can also clone UIs quite effectively. In one test, it recreated an Airbnb-style layout from an image reference, capturing most of the structural and visual details.

In another example, it generated an editorial SaaS-style landing page with typography and color styling that strongly resembled outputs typically associated with top-tier models like Claude. This raises interesting questions about training data sources, but from a user perspective, the key point is that Qwen 3.7 Max can now produce modern, polished marketing-style UIs.

3D, spatial reasoning, and creative coding

Qwen 3.7 Max shines in 3D and spatial reasoning tasks, especially when using libraries like Three.js.

Voxel pelican and low-poly landscapes

In one prompt, the model was asked to create a voxel pelican riding a bicycle. It produced a functional Three.js scene with:

• Coherent voxel-style geometry
• Proper camera setup and lighting
• A visually interesting, creative composition

In another test, it generated a Zelda-inspired low-poly landscape. The scene included terrain, atmosphere, and environmental elements that captured the requested mood and style, even if not every detail was perfect.

Realistic aquarium simulation

A more demanding benchmark asked Qwen 3.7 Max to build a realistic aquarium simulation in Three.js. The model handled:

• Multiple fish with individually moving fins
• A UI control panel for settings
• A rendering system with real-time optimization
• An interactive “feeding mode,” where clicking on the water surface drops food and fish swim up to eat it

All of this was generated from a single prompt. This test highlights the model’s ability to combine animation logic, physics-like behavior, UI, and rendering in one coherent codebase.
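The generated code isn't reproduced here, but the core of a "feeding mode" is simple: each fish picks the nearest pellet and moves toward it at a fixed speed. Here is a minimal, language-agnostic sketch of that logic in Python. The function names and the constant-speed model are assumptions for illustration, not what the model actually produced.

```python
import math


def nearest_food(fish, pellets):
    """Return the pellet closest to the fish; positions are (x, y, z) tuples."""
    return min(pellets, key=lambda pellet: math.dist(fish, pellet))


def step_toward(fish, food, speed, dt):
    """Move the fish toward the food at constant speed.

    Returns (new_position, eaten), where eaten is True once the fish
    can reach the food within this time step.
    """
    distance = math.dist(fish, food)
    max_step = speed * dt
    if distance <= max_step:
        return food, True
    scale = max_step / distance
    new_position = tuple(f + (t - f) * scale for f, t in zip(fish, food))
    return new_position, False


# A click on the water surface drops two pellets; the fish heads for the closer one.
fish = (0.0, 0.0, 0.0)
pellets = [(0.0, 3.0, 4.0), (10.0, 10.0, 10.0)]
target = nearest_food(fish, pellets)
position, eaten = step_toward(fish, target, speed=1.0, dt=1.0)
print(position, eaten)
```

In a real scene this update would run once per rendered frame, with `dt` taken from the frame clock.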

SVG generation and animations

Qwen 3.7 Max is particularly strong at SVG-based tasks, which many models still struggle with.

While it didn’t perform well on a very strict “SVG world map” benchmark, it did an excellent job on more creative SVG prompts, including:

• A stylized pelican illustration
• An animated New York City skyline
• Animated infographics
• SVG painting-style scenes
• A detailed butterfly illustration

In these cases, it produced clean SVG code, often with animations and thoughtful composition that matched the prompts closely.

3D solar system and physics-style scenes

When asked to build a 3D solar system, Qwen 3.7 Max generated a scene where:

• Each planet had distinct attributes (e.g., Saturn’s rings, Jupiter’s Great Red Spot)
• Lighting was physically plausible, with dark sides facing away from the sun
• An asteroid belt was included for extra realism

Some details, like the exact motion of Saturn’s rings, may not be scientifically perfect, but the overall architecture of the scene shows strong spatial reasoning and 3D thinking.

Minecraft-style sandbox clone

Another ambitious test asked Qwen 3.7 Max to build a Minecraft-like sandbox game.

The resulting project included:

• A voxel terrain with different block types
• A day/night (time-of-day) cycle
• The ability to place and break blocks
• Cave systems generated within the terrain

Water rendering and physics were not fully accurate – you could walk straight through water, and underwater visibility didn’t match the real game – but for a single-prompt generation, the level of functionality is still notable.

How Qwen 3.7 Max compares in the model landscape

Across a broad benchmark suite covering front-end, gaming, 3D graphics, SVG, and more, Qwen 3.7 Max ranks near the top, sitting around eighth overall on one independent leaderboard. It doesn’t win every category, but it consistently delivers strong, usable outputs across many domains.

Compared to other recent frontier-leaning models like DeepSeek V4 and GPT-5.5, Qwen 3.7 Max looks especially compelling for long-horizon coding and agent workflows. For a deeper look at how these models stack up, see our coverage of GPT-5.5 vs DeepSeek V4, which explores the broader compute and cost dynamics in this new wave of models.

Who should consider using Qwen 3.7 Max?

Qwen 3.7 Max is particularly well-suited for:

• Developers building AI agents that run for hours and use many tools
• Teams needing strong coding and debugging with good cost efficiency
• Front-end and creative coders working with Three.js, SVG, and interactive UIs
• Users who need solid multilingual reasoning, especially across Asian languages

If you can provide clear, detailed prompts and you care about long-horizon reliability and price-performance, Qwen 3.7 Max is one of the most interesting new models to experiment with right now.