Can a 35B local model really beat Claude Sonnet 3.5?

04 Jun 2026 13:07
Alibaba’s Qwen 3.5 35B model scores higher than Claude Sonnet 3.5 on many benchmarks, but how does it actually perform in real-world coding and front-end tasks? This article walks through hands-on tests across web design, 3D, physics simulations, and game logic to see where Qwen shines, where it fails, and whether it can really compete with larger cloud models.

Alibaba’s Qwen 3.5 series made waves when benchmarks suggested the 35B parameter model could outperform Claude Sonnet 3.5 in many categories, despite being dramatically smaller. But benchmarks don’t always match real-world use. How good is Qwen 3.5 35B when you actually ask it to build things—like landing pages, 3D sites, physics sandboxes, and simple games?

Test setup: running Qwen 3.5 35B locally

The tests were run with Qwen 3.5 35B in a 4-bit quantized configuration (Q4_K_M) on an RTX 4060 Ti with 16 GB of VRAM. Qwen’s team claims the 3.5 series maintains near-lossless accuracy even at 4-bit and with a quantized KV cache, making it attractive for local use on consumer GPUs.
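As a rough sanity check on the hardware, the weights-only footprint can be estimated from the parameter count and the bits stored per weight. The sketch below assumes exactly 4 bits per weight (real 4-bit formats add some overhead for scaling factors) and treats the card's 16 GB as GiB. The function name is illustrative and not taken from any tool used in the tests.

```python
def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# 35B parameters at exactly 4 bits each: a lower bound for real 4-bit formats.
weights_gib = weight_size_gib(35e9, 4.0)
vram_gib = 16.0  # RTX 4060 Ti 16 GB, treating the advertised figure as GiB

print(f"weights ~{weights_gib:.1f} GiB vs {vram_gib:.0f} GiB of VRAM")
print("fits entirely in VRAM:", weights_gib < vram_gib)
```

Under these assumptions the weights alone come to roughly 16.3 GiB, slightly more than the card holds before any KV cache or activations are counted. A setup like this would typically keep part of the model in system RAM. How memory was actually split in these tests is not stated.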

For sampling, the tests used Qwen’s own recommended settings for coding tasks, taken from the model card. All prompts were run inside an existing boilerplate Next.js project to test not just code generation, but also how well the model understands and works within an existing project structure.

Simple landing page: early warning signs

The first test was straightforward: generate a simple landing page with some animation and GSAP ScrollTrigger text effects. The prompt also specified visual details like an animated blob and image backgrounds.

Qwen 3.5 35B’s result:

The model struggled more than expected for such a basic task:

  • GSAP animations were misconfigured or simply wrong across the page.

  • Explicit instructions—like the animated blob and image backgrounds—were ignored.

  • The visual design felt generic and uninspired.

Claude Sonnet 3.5’s result:

Sonnet handled this test better overall:

  • Typography and layout composition were noticeably stronger.

  • It even added a parallax scrolling effect on its own initiative.

  • However, it also struggled with ScrollTrigger animations, failing to fully nail the behavior.

Interestingly, when the same prompt was tested on the dense Qwen 3.5 27B model, the ScrollTrigger issues disappeared, though it still missed the image background requirement. This suggests the 35B MoE-style model may behave differently from its dense sibling on certain front-end tasks.

Complex landing page with custom animations

The second test ramped up the difficulty: a more detailed landing page with multiple custom animations and specific design instructions. This included elements like a spinning DNA helix and scanning laser effects.

Qwen 3.5 35B’s result:

This time, the model performed surprisingly well:

  • It followed the prompt closely and implemented nearly all requested features.

  • The generated page ran without major bugs.

  • Some custom animations (like the DNA helix and laser) looked a bit odd, but were functionally present.

Overall, the result was solid and much better than its performance on the simpler first test.

Claude Sonnet 3.5’s result:

Sonnet delivered a strong page as well, with some trade-offs:

  • There were visual glitches in the hero section.

  • Custom animations and overall page design—especially a vertical card carousel—were more polished than Qwen’s.

This test highlighted one of Qwen 3.5 35B’s main traits: it can be very capable, but also inconsistent from prompt to prompt.

Adding 3D with Three.js

Next, the models were asked to build a simple online portfolio site using Three.js, with 3D text, lighting effects, and particles. This is a good stress test for combining front-end structure with 3D scene setup.

Qwen 3.5 35B’s result:

Qwen handled the 3D part reasonably well:

  • It set up a working Three.js scene with 3D text and effects.

  • However, it added extra UI elements on top of the 3D canvas that weren’t requested and visually broke the composition.

Claude Sonnet 3.5’s result:

Sonnet made a similar mistake by introducing unnecessary elements on top of the 3D scene, but the overall layout and composition looked slightly more refined.

For comparison, a larger flagship model (Claude Opus 4.6) produced a more ambitious and coherent result—arguably what Qwen 3.5 35B and Sonnet 3.5 were “trying” to do but didn’t fully achieve.

If you’re specifically interested in front-end and agentic workflows with newer Qwen models, it’s worth checking out our look at Qwen 3.6 Max for coding and front-end apps.

Everything at once: 3D, particles, custom animations, and horizontal scroll

The final design test combined almost everything into a single landing page prompt:

  • Three.js 3D scene

  • Lighting and particle systems

  • Custom animations

  • Horizontal scrolling sections

  • Strong aesthetic design and typography

Qwen 3.5 35B’s result:

Here, the model essentially broke down. The output didn’t meaningfully satisfy the complex set of requirements, and the page fell apart under the combined constraints.

Claude Sonnet 3.5’s result:

Sonnet handled this demanding prompt impressively well:

  • The resulting page looked close to production-ready.

  • It missed some minor details, but the overall structure, visuals, and interactions were coherent.

On this test, Sonnet was the clear winner.

Logic and performance: building a physics simulator

Design is one thing; logic-heavy apps are another. The next test asked the models to build a simple physics sandbox where users can place elements like sand, water, wood, and fire into a grid-based simulation.

This tests:

  • Complex state management

  • Performance considerations

  • Correct physical behavior (e.g., sand falls, water flows, fire spreads realistically)
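To make the expected behavior concrete, here is a minimal sketch of one simulation tick for a grid of this kind. It is not the code either model produced. It assumes a simple cellular-automaton rule set: sand falls straight down or diagonally, water does the same and can also spread sideways, and wood never moves. Fire and sand sinking through water are left out, and the cell symbols and function name are hypothetical.

```python
EMPTY, SAND, WATER, WOOD = ".", "s", "w", "#"


def step(grid: list[list[str]]) -> list[list[str]]:
    """Advance the sandbox by one tick and return a new grid."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    moved = set()  # destinations already filled this tick

    def free(r: int, c: int) -> bool:
        return 0 <= r < rows and 0 <= c < cols and new[r][c] == EMPTY

    # Scan bottom-up so a falling particle is not processed twice per tick.
    for r in range(rows - 1, -1, -1):
        for c in range(cols):
            cell = new[r][c]
            if (r, c) in moved or cell not in (SAND, WATER):
                continue
            # Candidate targets in priority order: down, then the diagonals.
            targets = [(r + 1, c), (r + 1, c - 1), (r + 1, c + 1)]
            if cell == WATER:
                # Water may also spread sideways when it cannot fall.
                targets += [(r, c - 1), (r, c + 1)]
            for r2, c2 in targets:
                if free(r2, c2):
                    new[r2][c2], new[r][c] = cell, EMPTY
                    moved.add((r2, c2))
                    break
    return new


falling = step([[EMPTY, SAND, EMPTY],
                [EMPTY, EMPTY, EMPTY],
                [WOOD, WOOD, WOOD]])
spreading = step([[WOOD, WATER, EMPTY]])
```

This sketch always tries left before right, which keeps it deterministic but gives water a directional bias. A real sandbox would usually randomize that choice. The `moved` set is what stops a water cell that slid sideways from being moved again in the same tick.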

Qwen 3.5 35B’s result:

  • On the first attempt, the app didn’t respond to clicks at all.

  • Even after prompting the model to review and fix its own code, the issue persisted.

  • A manual fix was required just to get the simulation running.

  • Sand and wood behavior were acceptable, but water physics were clearly wrong and looked unnatural.

Claude Sonnet 3.5’s result:

Sonnet performed better overall:

  • The simulation worked out of the box.

  • Water physics still had some glitches, but were closer to the intended behavior.

This test exposed one of Qwen 3.5 35B’s weak spots: debugging its own non-trivial logic can be hit-or-miss, and it may require human intervention to get complex apps fully working.

Game logic: a simple Mario-style platformer

The final test asked the models to write a simple Mario-like game in Python with procedurally generated levels. The key requirement: the game should stay playable, with no impossible jumps, until the player loses by hitting an enemy or leaving the screen.

Qwen 3.5 35B’s result:

  • The game did run, but the player spawned above empty space and fell to their death immediately.

  • After being asked to fix the issue, Qwen added a platform under the player, but the platform then disappeared.

  • Other positioning and level generation bugs remained, making the game effectively unplayable.

Claude Sonnet 3.5’s result:

Sonnet produced an almost playable game:

  • The core loop and controls worked.

  • The main flaw was that the procedural generator didn’t fully account for the player’s jump distance, occasionally creating impossible gaps.

Both models struggled with robust procedural design, but Sonnet got closer to a usable prototype.
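One way a generator can rule out impossible gaps is to derive the player's horizontal reach from the movement constants and cap every gap below it. The sketch below is a hypothetical illustration, not code from either model. It assumes all platforms sit at the same height, constant gravity, and made-up values for jump speed, gravity, and run speed.

```python
import random


def max_jump_distance(jump_speed: float, gravity: float, run_speed: float) -> float:
    """Horizontal distance covered by a jump that lands at take-off height."""
    air_time = 2 * jump_speed / gravity  # time up plus time down
    return run_speed * air_time


def generate_platforms(count: int, max_gap: float, rng: random.Random,
                       min_width: float = 80.0, max_width: float = 200.0):
    """Return (x, width) pairs for same-height platforms with gaps <= max_gap."""
    platforms = []
    x = 0.0  # the first platform starts at x = 0, so the spawn point has ground
    for _ in range(count):
        width = rng.uniform(min_width, max_width)
        platforms.append((x, width))
        x += width + rng.uniform(0.3 * max_gap, max_gap)
    return platforms


# Hypothetical movement constants in pixels and seconds.
reach = max_jump_distance(jump_speed=600, gravity=1800, run_speed=300)
safe_gap = 0.8 * reach  # leave a margin for imperfect take-off timing
level = generate_platforms(10, safe_gap, random.Random(42))
```

With these constants the reach works out to 200 pixels, so gaps are capped at 160. Platforms at different heights change the air time, so a fuller generator would have to compute the reach for each pair of platforms.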

So, is Qwen 3.5 35B really better than Sonnet 3.5?

Looking beyond benchmarks, the picture is more nuanced.

Where Qwen 3.5 35B impresses:

  • For its size, it’s remarkably capable—especially considering it can run locally on a consumer GPU.

  • On some complex front-end prompts, it can follow instructions closely and produce working code without major bugs.

  • Its performance is strong enough that it can sometimes approach the quality of much larger cloud models.

Where it falls short:

  • Inconsistency: it can fail on simple tasks and then succeed on harder ones.

  • Reliability: logic-heavy apps, simulations, and games often need manual debugging.

  • Design sense: compared to Sonnet, its visual and interaction design choices are often less polished.

Across these tests, Claude Sonnet 3.5 emerged as the more reliable, consistent model, even though Qwen 3.5 35B may look stronger on paper in some benchmark suites.

Why this still matters for local and agentic workflows

Despite its flaws, Qwen 3.5 35B represents a major step forward for locally hosted models. Having a 35B model that can:

  • Run on a single consumer GPU, and

  • Attempt complex front-end, 3D, and logic-heavy tasks with results comparable at times to Claude Sonnet 3.5

is a big deal for developers who want privacy, control, or offline capability.

It also fits neatly into the growing ecosystem of local coding and agent setups. For example, Qwen models already power local, Claude-style coding assistants; our guide on running Claude-like code assistance locally with Qwen 3.5 and Ollama shows how.

The takeaway: Qwen 3.5 35B is likely the best model in its size class right now, especially for local deployment. Just don’t expect it to consistently beat flagship cloud models in real-world workflows, no matter what the benchmark charts say.
