Self‑improving AI, Opus 4.8, Nvidia’s new models, and robots that juggle
AI development isn’t slowing down, and this week’s releases prove it. We’ve got a new flagship model from Anthropic, powerful open-source tools from Nvidia, game-ready 3D generators, self-organizing research agents, and humanoid robots that can juggle in real time. Here’s a breakdown of the most interesting launches and research.
Nvidia’s new vision and image tools
Nvidia dropped several open-source projects focused on vision, image quality, and interactive worlds.
Locate Anything: fast, accurate object grounding
Locate Anything is a vision-language grounding model that can find and segment almost anything in an image or video. You give it a prompt like “all the dogs” or “the red car,” and it returns precise bounding boxes—even in crowded scenes or when there are many instances of the same object.
Instead of predicting box coordinates one number at a time (which is slow and error-prone), it uses parallel box decoding to predict entire bounding boxes in a single step. This makes it faster and more geometrically consistent than typical approaches.
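To get a feel for why that matters, here is a toy sketch of the step-count difference. It is not Nvidia's code: the function names are made up, and it assumes one decoder step per coordinate token in the sequential case versus one step per box in the parallel case.

```python
COORDS_PER_BOX = 4  # x1, y1, x2, y2


def sequential_steps(num_boxes: int) -> int:
    """Decoder steps when every coordinate is emitted as its own token."""
    return num_boxes * COORDS_PER_BOX


def parallel_steps(num_boxes: int) -> int:
    """Decoder steps when all four coordinates of a box come out together."""
    return num_boxes


def is_valid_box(box: tuple[float, float, float, float]) -> bool:
    """A box is geometrically consistent when its corners are ordered."""
    x1, y1, x2, y2 = box
    return x1 < x2 and y1 < y2


if __name__ == "__main__":
    dogs = 50  # hypothetical crowded scene
    print(sequential_steps(dogs), "steps one number at a time")
    print(parallel_steps(dogs), "steps one box at a time")
```

Under these assumptions a scene with 50 dogs drops from 200 decoding steps to 50, and because all four numbers of a box are produced together, a check like `is_valid_box` has fewer chances to fail halfway through a box.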
The model is relatively small at around 3 billion parameters (about 7.8 GB), trained on a huge dataset of 103 million language queries and 785 million bounding boxes across object detection, UI elements, OCR, layout understanding, and more. Nvidia has released the code, training scripts, and instructions to run it locally.
Control Light: AI-powered image relighting
Control Light is an AI tool for fixing dark images and editing lighting in a much more natural way than a basic brightness slider. Instead of simply boosting exposure (which often adds noise and washes out details), it generatively brightens the scene while preserving structure and texture.
You can adjust how bright the scene is and keep details intact, making it useful for surveillance footage, night scenes, or restoring underexposed photos. It’s built on Flux 2 and is fully open source, including code, training scripts, and the dataset.
PID: ultra-fast, high-quality image upscaling
Nvidia also released PID, a pixel diffusion decoder designed to generate or upscale images directly in high resolution. Instead of decoding to a small image and then running a separate upscaler, PID outputs high-res images in one shot.
It can upscale a 512×512 image to 2K resolution in under a second, which makes it almost six times faster than leading upscalers like SeedVR2 while winning most head-to-head quality comparisons. It already supports Flux 2, Z-image, SD3, and more, with additional model support planned.
Gamma World: multi-agent world simulation
Gamma World is Nvidia’s new world model that simulates multiple agents in the same environment. Most interactive video models assume a single player or camera; Gamma World is built for two to four agents moving and acting at once while sharing a consistent world.
It uses a simplex rotary agent encoding to give each agent a distinct identity without hard-coding a fixed number of players. The model can generate real-time video at 24 FPS and generalizes from two-player to four-player scenes. Nvidia plans to release code, training scripts, and dataset tools.
3D and game-ready content: from rooms to robots
Triclat: fast, simulation-ready 3D reconstruction
Triclat is a 3D reconstruction model that turns sets of images into full 3D scenes that are immediately usable in simulations. Unlike many methods that rely on Gaussian splats (great for visuals but awkward for physics and collisions), Triclat represents scenes directly as triangle primitives.
That means you get mesh-like output from the start, suitable for robotics, physics engines, and game engines without extra conversion steps. It’s also significantly faster than splatting-based pipelines, and the models are only about 4.4 GB, which makes them practical for consumer GPUs.
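Part of the reason triangles are so convenient for physics is that collision queries against them are simple and exact. As an illustration only (this is the textbook Möller-Trumbore ray-triangle test, not code from Triclat), here is a minimal sketch in plain Python:

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def ray_hits_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return the distance along the ray to the triangle, or None on a miss."""
    edge1 = _sub(v1, v0)
    edge2 = _sub(v2, v0)
    pvec = _cross(direction, edge2)
    det = _dot(edge1, pvec)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle's plane
    inv_det = 1.0 / det
    tvec = _sub(origin, v0)
    u = _dot(tvec, pvec) * inv_det  # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    qvec = _cross(tvec, edge1)
    v = _dot(direction, qvec) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = _dot(edge2, qvec) * inv_det
    return t if t > eps else None  # ignore hits behind the ray origin


if __name__ == "__main__":
    floor = ((0, 0, 0), (1, 0, 0), (0, 1, 0))
    print(ray_hits_triangle((0.25, 0.25, 1.0), (0, 0, -1), *floor))
```

A ray dropped straight down from one unit above the triangle reports a hit at distance 1.0. A Gaussian splat has no hard surface to intersect in this way, which is why splat-based scenes usually need a conversion step before a physics engine can use them.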
GenRecon: 3D rooms from casual phone videos
GenRecon focuses on indoor scenes. It can take a casual smartphone video or a set of photos of a room and reconstruct a PBR-ready 3D mesh, complete with materials. Once reconstructed, you can relight the room, change surface colors, or edit objects.
Under the hood, GenRecon builds the scene in chunks to keep everything consistent and uses a model called Trellis 2 as a generative shape prior, giving it a strong sense of realistic 3D geometry beyond what’s directly visible. A technical paper is available, and open-source code is listed as “coming soon.”
Scope: AI-generated FPS worlds you can play
Scope is a generative world model for first-person shooter games. You feed it a starting frame and controller inputs (move, aim, fire, reload, switch weapons, interact), and it generates video that responds to those actions in real time.
The visuals are still imperfect (objects and weapons can warp over time), but Scope is one of the first systems to handle such a wide range of actions in a single model. It was trained on nearly 70,000 clips from seven FPS games with 10 different controller signals, and it outperforms other game generators like Matrix Game 3 and Hunyuan World 5 on visual and motion quality. Both the model and dataset are open source, though you’ll need a high-end GPU (around 30 GB of VRAM) to run it.
PhysX Omni: 3D models that actually work in physics
PhysX Omni tackles a big gap in 3D generation: most models create objects that look good but don’t behave correctly in simulations. This system generates “simulation-ready” assets with proper geometry, scale, material properties, and joint structure.
Cars come with moving wheels, articulated robots have joints in the right places, and objects can be dragged, rotated, and animated realistically. Unlike previous methods that focus on just rigid bodies or a narrow class of objects, PhysX Omni unifies multiple object types in one framework and beats earlier models like Articulate Anything and PhysX Gen on standard benchmarks. Code, training scripts, and datasets are fully open source.
CubePart: text-to-3D with editable parts
CubePart is a text-to-3D model that generates objects as multiple separate parts instead of one solid mesh. Ask for a car, and you get distinct meshes for wheels, body, doors, and more; ask for a robot, and you get separate arms, legs, and torso.
This makes the outputs immediately usable in games and simulations, where parts need to move independently. CubePart uses a two-stage process: first it creates the overall shape, then it decomposes it into part-level meshes using a diffusion transformer with cross-part attention so all parts fit together cleanly. You can even choose how many parts you want. It outperforms other part-aware 3D generators and is lightweight (under 10 GB) for local use.
New image and video generation breakthroughs
Instruct AV2AV: edit video and audio with a prompt
Instruct AV2AV is a system for editing both the audio and visuals of a video from a simple text instruction. You can change what a person says, alter their voice (for example, from female to male), and adjust lip sync to match the new speech.
It can also keep the original words but swap the speaker’s apparent gender or voice characteristics. The project page lists code and model weights, but the GitHub repository is not yet populated.
Relightable holoported characters
Google introduced a method for capturing a moving person with multiple cameras and then inserting that person into new scenes with realistic lighting. After capture, you get a full-body avatar that can be relit to match different environments—changing light direction, warmth, and brightness while remaining consistent with the new background.
The system currently requires four cameras and is aimed at realistic telepresence and mixed reality. Code is marked as “coming soon.”
Pantheon 360: 360° video for digital twins
Pantheon 360 is a 360° video generation model designed for digital twins and robotics. It takes several 360° images plus a camera path and produces a panoramic video that stays consistent as the virtual camera moves through the scene.
Instead of treating this like a standard video problem, Pantheon 360 reconstructs a 3D point cloud from the input images and uses that geometry to keep the video stable and grounded. This is especially useful for autonomous driving, robotics training, and any application where the environment must remain coherent from all angles. Open-source code is planned.
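To see how a 3D point cloud can anchor a panoramic frame, it helps to look at the standard equirectangular projection that maps a 3D point to a pixel in a 360° image. The sketch below is a generic illustration, not Pantheon 360's code. The axis convention (x right, y up, z forward, camera at the origin) and the function name are assumptions.

```python
import math


def project_to_panorama(point, width, height):
    """Map a 3D point (camera at the origin) to equirectangular pixel coords."""
    x, y, z = point
    radius = math.sqrt(x * x + y * y + z * z)
    if radius == 0.0:
        raise ValueError("point coincides with the camera centre")
    longitude = math.atan2(x, z)  # -pi..pi, 0 means straight ahead
    latitude = math.asin(max(-1.0, min(1.0, y / radius)))  # -pi/2..pi/2
    u = ((longitude / (2.0 * math.pi) + 0.5) * width) % width
    v = (0.5 - latitude / math.pi) * height
    return u, v


if __name__ == "__main__":
    print(project_to_panorama((0.0, 0.0, 1.0), 2048, 1024))  # straight ahead
    print(project_to_panorama((1.0, 0.0, 0.0), 2048, 1024))  # to the right
```

If the virtual camera moves, you subtract its new position from every point and project again. The same physical point then lands where geometry says it should, which is the kind of consistency a purely frame-by-frame video model has to learn the hard way.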
Pixel Relights: intuitive relighting from a single photo
Pixel Relights lets you take a single 2D image and control lighting interactively. You can drag a cursor to set light direction, adjust how hard or soft the light is, and see the entire scene relit—tables, chairs, and other objects included.
The system first estimates a rough 3D structure of the scene, pushes that into Blender for relighting, and then uses the result as guidance to generate the final relit image. The code is available, so you can run it locally and experiment with complex relighting from just one photo.
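For intuition about what "light direction" and "hard or soft" mean once you have rough 3D structure, here is a minimal diffuse-shading sketch. It is only the classic Lambertian term with a simple "wrap" softness knob, not the project's Blender pipeline. The `softness` parameter and function names are illustrative assumptions.

```python
import math


def normalize(vector):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in vector))
    if length == 0.0:
        raise ValueError("zero-length vector has no direction")
    return tuple(c / length for c in vector)


def diffuse_intensity(normal, light_dir, softness=0.0):
    """Brightness of a surface point under one directional light.

    light_dir points from the surface toward the light. softness=0 gives
    hard Lambertian shading; larger values let light wrap around edges.
    """
    n = normalize(normal)
    l = normalize(light_dir)
    cos_angle = sum(a * b for a, b in zip(n, l))
    return max(0.0, (cos_angle + softness) / (1.0 + softness))


if __name__ == "__main__":
    up = (0.0, 0.0, 1.0)
    print(diffuse_intensity(up, (0.0, 0.0, 1.0)))       # light overhead
    print(diffuse_intensity(up, (1.0, 0.0, 0.0)))       # grazing, hard light
    print(diffuse_intensity(up, (1.0, 0.0, 0.0), 0.5))  # grazing, soft light
```

Dragging the light in a tool like this amounts to changing `light_dir` for every surface point at once, which is why a rough 3D estimate is needed before any relighting can happen.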
Sega: ultra high-resolution image generation
Sega is a method for generating very high-resolution images—4K and beyond—with impressive sharpness. It works with models like Flux and Qwen Image, producing large images (for example, 6144×6144 pixels) that remain detailed even when zoomed in.
Compared to other upscalers such as DYP, Sega tends to produce more consistent details and fewer artifacts, making it one of the strongest options right now for high-res image generation. The code is open source and supports multiple base models.
Bonsai Image: Flux 2 on your phone
Bonsai Image brings high-quality image generation to mobile devices. It compresses Flux 2 Klein—originally around 8 GB—down to roughly 1 GB using 1-bit and ternary weight quantization.
The result is a model that can run fully offline on devices like an iPhone 17 Pro Max, generating 512×512 images in about 9.4 seconds. It’s not a brand-new architecture, but a clever compression of one of the best open-source image models, making local, private image generation much more accessible.
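If ternary quantization is new to you, the core idea fits in a few lines. The sketch below uses the absmean scheme popularized by BitNet-style models and is only an illustration. The article does not say how Bonsai Image quantizes or packs its weights. The 4-billion parameter count and the 2 bits of storage per ternary weight are assumptions chosen because they reproduce the roughly 8 GB and 1 GB figures.

```python
def ternary_quantize(weights):
    """Map floats to codes in {-1, 0, +1} plus one shared scale (absmean)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Rebuild approximate weights from ternary codes."""
    return [c * scale for c in codes]


def storage_gb(num_params, bits_per_weight):
    """Raw weight storage in gigabytes, ignoring any metadata overhead."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    codes, scale = ternary_quantize([0.9, -0.05, -1.1, 0.4])
    print(codes, round(scale, 4))
    print(storage_gb(4e9, 16), "GB at 16 bits (assumed 4B parameters)")
    print(storage_gb(4e9, 2), "GB at 2 bits per ternary weight")
```

Each weight keeps only its sign (or becomes zero), and a single scale per group restores the rough magnitude. Quality depends on how the model is trained or fine-tuned around that constraint, which this sketch does not cover.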
Language models, coding agents, and self-improvement
Anthropic Opus 4.8: strong, honest, but not a huge jump
Anthropic released Opus 4.8, their latest flagship model. According to Anthropic’s own benchmarks, it slightly edges out both Opus 4.7 and OpenAI’s GPT 5.5 on agentic coding, reasoning, computer use, knowledge, and financial analysis, though GPT 5.5 still leads on terminal-based coding.
One of the biggest claimed improvements is honesty: Opus 4.8 is more likely to flag uncertainty, less likely to make unsupported claims, and four times less likely to let flaws in its code pass unflagged. It also pushes back more on weak plans during multi-step agent workflows.
Independent leaderboards paint a more mixed picture. On some evaluations, like Artificial Analysis’s leaderboard, Opus 4.8 Max is only one point ahead of GPT 5.5, and open-source models like Kimi K2.6 aren’t far behind. On other benchmarks like LiveBench, GPT 5.5 and Gemini 3.1 Pro still lead, especially in coding and math. If you want a deeper dive into how Opus 4.8 behaves in practice, we’ve covered it in more detail in our hands-on Opus 4.8 review.
MiniCPM 5 1B: a surprisingly strong 1B local model
MiniCPM 5 1B from OpenBMB is a tiny 1-billion-parameter dense model designed for local deployment. Despite its size (about 2 GB), it outperforms other similarly small models across general knowledge, coding, math, reasoning, and basic agentic tasks.
Because of its footprint, it can run on most laptops and potentially even on phones, making it a good candidate for lightweight, privacy-preserving assistants.
DeepSuite: a tougher benchmark for coding agents
DeepSuite is a new benchmark designed to test coding agents on realistic software engineering work, not just cherry-picked GitHub issues that models may have already seen during training.
Tasks are written from scratch across 91 active open-source repositories in languages like TypeScript, Go, Python, JavaScript, and Rust. Prompts are short and natural, like what a real developer might type, and agents must explore the repo, decide where to change code, implement it, and avoid breaking existing functionality.
Solutions are evaluated with handwritten behavioral verifiers that check whether the final software behaves correctly, not whether a specific implementation was reproduced. On DeepSuite, GPT 5.5 currently leads, followed by Anthropic’s Claude models and then Gemini 3.5 Flash, while open-source models like Kimi K2.6 and GLM lag significantly—highlighting the gap between lab benchmarks and real-world engineering tasks.
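The difference between checking behavior and checking implementation is easy to show with a toy example. Everything below is hypothetical and not taken from DeepSuite: the task ("add a slugify helper"), both candidate solutions, and the verifier are made up to illustrate the idea.

```python
import re


def slugify_regex(title: str) -> str:
    """One possible solution: collapse non-alphanumeric runs with a regex."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def slugify_loop(title: str) -> str:
    """A very different solution: build words character by character."""
    words, current = [], []
    for ch in title.lower():
        if ch.isascii() and ch.isalnum():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return "-".join(words)


def verify_slugify(candidate) -> bool:
    """Behavioral verifier: only the observable results matter."""
    cases = {
        "Hello, World!": "hello-world",
        "  Week 12  recap ": "week-12-recap",
        "": "",
    }
    return all(candidate(text) == expected for text, expected in cases.items())


if __name__ == "__main__":
    print(verify_slugify(slugify_regex), verify_slugify(slugify_loop))
    print(verify_slugify(str.lower))  # wrong behavior, so it fails
```

Both solutions pass even though they share no code, while a candidate that merely lowercases the title fails. A diff-based check against a reference patch would have rejected one of the two correct answers.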
BEES: bidirectional evolutionary search for better reasoning
The “self-improving language models with bidirectional evolutionary search” (BEES) project explores a new way to help models solve complex problems. The “self-improving” label is a bit overstated, but the search method is interesting.
Instead of sampling many full solutions and hoping one works, BEES runs a bidirectional search. Forward search builds candidate solutions step by step, recombining partial attempts like evolutionary algorithms. Backward search breaks the overall goal into subgoals, guiding the model from the top down. This provides richer feedback than a simple “right or wrong” signal at the end and leads to better performance on hard reasoning and post-training tasks where other methods stalled. The code is available for experimentation.
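A heavily simplified toy makes the two directions easier to picture. This is not the BEES algorithm: the string-based "subgoals", the scoring rule, and the way `recombine` consults the subgoal list (standing in for a verifier) are all assumptions for illustration.

```python
def split_into_subgoals(goal: str) -> list[str]:
    """Backward step: break the overall goal into ordered subgoals."""
    return goal.split()


def score(candidate: list[str], subgoals: list[str]) -> int:
    """Partial credit: how many subgoals this attempt already satisfies."""
    return sum(1 for got, want in zip(candidate, subgoals) if got == want)


def recombine(parent_a: list[str], parent_b: list[str], subgoals: list[str]) -> list[str]:
    """Forward step: per subgoal, keep whichever parent's piece is correct."""
    child = []
    for a, b, want in zip(parent_a, parent_b, subgoals):
        child.append(b if b == want and a != want else a)
    return child


if __name__ == "__main__":
    subgoals = split_into_subgoals("parse validate store")
    attempt_a = ["parse", "validate", "???"]  # gets the first two steps right
    attempt_b = ["???", "check", "store"]     # only gets the last step right
    child = recombine(attempt_a, attempt_b, subgoals)
    print(score(attempt_a, subgoals), score(attempt_b, subgoals), score(child, subgoals))
```

With a pass/fail signal, both attempts would simply count as failures. With subgoal-level scoring they earn 2 and 1 points, and recombining them yields a child that satisfies all 3 subgoals.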
Agentic systems for science and productivity
Autoscientist: AI research teams for biomedical ML
Autoscientist is a new agentic framework for automating scientific research, especially in biomedical machine learning. Instead of a single agent running one experiment at a time, it organizes multiple agents into a virtual research team that explores ideas in parallel over long runs.
All agents share a global state that includes the current best solution, experiment logs, a discussion forum, and a registry of dead ends (ideas that didn’t work). This prevents the system from repeatedly chasing the same failed directions.
Some agents act as analysts, reading past experiments and discussions, writing hypotheses, and tracking dead ends. Others act as experimenters, proposing new experiments, applying code changes, training models, and reporting results back to the shared state. On the BioML benchmark—24 biomedical ML tasks across imaging, drug discovery, protein engineering, and more—Autoscientist outperforms other agentic frameworks. The code is open source, so researchers can adapt it to their own domains.
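The shared state described above can be pictured as a small data structure that every agent reads and writes. The sketch below is a guess at the shape, not Autoscientist's actual code. The class name, fields, and methods are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SharedState:
    """Global state visible to every agent on the virtual research team."""
    best_score: float = float("-inf")
    best_idea: Optional[str] = None
    experiment_log: list = field(default_factory=list)  # (idea, score) pairs
    forum: list = field(default_factory=list)           # free-text discussion
    dead_ends: set = field(default_factory=set)         # ideas that failed

    def should_try(self, idea: str) -> bool:
        """Experimenters check the registry before spending compute."""
        return idea not in self.dead_ends

    def report(self, idea: str, score: float) -> None:
        """Record a finished experiment and update the current best."""
        self.experiment_log.append((idea, score))
        if score > self.best_score:
            self.best_score, self.best_idea = score, idea

    def mark_dead_end(self, idea: str, reason: str) -> None:
        """Analysts register failed directions so nobody repeats them."""
        self.dead_ends.add(idea)
        self.forum.append(f"dead end: {idea} ({reason})")


if __name__ == "__main__":
    state = SharedState()
    state.report("baseline", 0.71)
    state.mark_dead_end("larger learning rate", "training diverged")
    print(state.should_try("larger learning rate"), state.best_idea)
```

The dead-end registry is the interesting part: it turns a failed run from wasted compute into information every other agent can use.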
Robotics: from home helpers to juggling humanoids
Astrobot T1: a relatively affordable home humanoid
Astrobot unveiled the T1, a humanoid-style robot aimed at home and industrial use. It can assist in the kitchen, load and unload washing machines, iron clothes, act as a bartender, play with kids, and handle warehouse-style tasks.
The T1 uses a wheeled base instead of walking on two legs, so it’s limited to flat surfaces and can’t handle stairs. The payoff is cost: it’s rumored to be around $13,000, which is low compared to many humanoid robots with similar capabilities. It’s an early glimpse of what practical, semi-affordable home robots might look like.
Athena Zero: a robot that learns to juggle in minutes
The Rye Institute showcased Athena Zero, a humanoid robot that can juggle three balls in multiple complex patterns. According to the team, it learned these patterns in under 10 minutes of real-world interaction.
Juggling is a classic robotics benchmark because it demands tight coordination between hardware and software: tracking multiple moving objects, predicting their trajectories, and adjusting to imperfect throws in real time. Athena Zero doesn’t just repeat one routine; it switches between five juggling styles on the fly, demonstrating adaptability rather than memorization.
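The trajectory-prediction part comes down to basic projectile physics, at least to a first approximation. The sketch below ignores air drag and spin and says nothing about how Athena Zero actually works. It only shows the numbers a catching hand would need, with made-up function names.

```python
G = 9.81  # gravitational acceleration in m/s^2


def flight_time(vertical_speed: float) -> float:
    """Seconds until a ball thrown upward returns to its release height."""
    if vertical_speed <= 0.0:
        raise ValueError("the ball must be thrown upward")
    return 2.0 * vertical_speed / G


def landing_offset(horizontal_speed: float, vertical_speed: float) -> float:
    """Horizontal distance travelled before the ball is back at hand height."""
    return horizontal_speed * flight_time(vertical_speed)


def peak_height(vertical_speed: float) -> float:
    """Maximum rise above the release point, in metres."""
    return vertical_speed ** 2 / (2.0 * G)


if __name__ == "__main__":
    up, sideways = 4.905, 0.3  # m/s, a hypothetical throw
    print(flight_time(up), landing_offset(sideways, up), peak_height(up))
```

A throw released at 4.905 m/s upward stays in the air for about one second. With three balls in flight, the robot has only a fraction of that window to notice an imperfect throw and move the catching hand to the new landing point.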
Game and creator tools: Roblox and beyond
Roblox’s open-source 3D asset generator
Roblox released an open-source 3D model generator that creates game-ready assets from text prompts. You describe what you want—a chair, a vehicle, a prop—and the system outputs a 3D model that can be dropped directly into Roblox experiences.
This fits into a broader trend of tools that turn natural language into production-ready 3D assets. If you’re interested in similar systems outside Roblox, we’ve also covered how other tools can turn images into production-ready 3D models in minutes.
Local, multimodal, and agent-focused models
Step 3.7 Flash: an efficient multimodal agent model
Step 3.7 Flash is a multimodal model built specifically for real-world agent use. It can read text, images, documents, charts, and interfaces, then act on them—like analyzing screenshots, automating browser workflows, or working with office-style tools over long sessions.
Despite being a “flash” (efficiency-focused) model, it performs close to GPT 5.5 and Opus 4.7 on benchmarks like SWE-Bench Pro and does well on multimodal and agentic tasks. The downside is size: as a unified multimodal model, it’s around 400 GB and requires multiple GPUs or a DGX-class machine. The upside is that it’s open source, giving teams full control over deployment.
What this week’s AI news tells us
Across all these releases, a few patterns stand out. First, more models are being designed from day one for real-world use: simulation-ready 3D assets, physics-aware objects, multi-agent world models, and agent frameworks that can manage messy research workflows.
Second, open source continues to be a major theme. Nvidia, academic labs, and startups are all releasing not just models but training scripts and datasets, lowering the barrier for others to build on their work.
Finally, we’re seeing rapid progress at both ends of the spectrum: massive multimodal models for complex agents, and aggressively compressed models that run offline on phones and tiny devices. Together, they point toward a near future where AI is both more capable and more embedded in everyday tools, games, and hardware.