How to Run Ernie Image: The New Best Local AI Image Generator

13 May 2026 17:00

Ernie Image is a powerful new open-source AI image generator that rivals top closed models, especially for text, posters, comics, and detailed scenes. Here’s how it compares to Zimage and how you can run it locally for free using ComfyUI—even on low-VRAM GPUs.

There’s a new open-source image model in town, and it might be the best local AI image generator you can run on your own PC today. It’s called Ernie Image, and it stands out for its prompt understanding, realistic style, and surprisingly strong text rendering inside images.

If you’ve been relying on models like Zimage or Flux for local generation, Ernie Image is absolutely worth a look—especially if you care about posters, infographics, comics, and detailed, multi-element scenes.

What Makes Ernie Image Special?

Ernie Image is an open-source text-to-image model focused on high prompt adherence and flexible style control. It’s particularly strong at:

1. Complex prompt understanding
Ernie handles long, detailed prompts with many objects, relationships, and constraints. Examples include:

Vintage-style photos with recursive elements (e.g., an artist painting an image that appears on a screen in the same scene)
City tableaus with specific landmarks, like Kyoto maps with torii gate rows, cherry blossoms, rickshaws, and kimono-clad pedestrians
Scenes with multiple characters and props (ballerinas, rabbits, elephants, pianos, and more in one frame)

In many of these tests, Ernie captured more of the requested details and kept sizes, composition, and relationships more consistent than Zimage.

2. Realistic and natural-looking photos
Compared to some earlier open-source models that can look plasticky or over-smoothed, Ernie’s photorealistic outputs tend to feel more natural, with small imperfections that work in their favor. Skin, lighting, and textures often look closer to real photography than to CGI.

3. Strong text rendering inside images
One of Ernie’s biggest advantages is generating legible text on posters, signs, infographics, and comics. It’s not perfect, but it often:

Spells most words correctly
Handles longer sentences better than many open models
Keeps layout and typography close to what you describe

In diary-style tests and bakery window prompts with specific prices, fonts, and labels, Ernie produced more accurate and readable text than Zimage.

4. Versatile artistic styles
Ernie can switch between:

Photorealistic portraits and lifestyle shots
Comics and manga (including panel layouts and speech bubbles)
Impressionist paintings (e.g., Manet-inspired brushwork)
Minimalist Chinese watercolor scenes
Abstract dot-based illustrations and stylized posters

While some style tests (like true Manet-level abstraction) aren’t perfect, Ernie still delivers convincing stylistic variations for most creative use cases.

Ernie Image vs Zimage: Head-to-Head Comparison

Ernie Image was tested directly against Zimage on a wide range of prompts. Here’s how they stack up in key areas.

Photorealism and Scene Complexity

On complex, realistic scenes, Ernie often comes out ahead:

Recursive artist scene: Ernie captured the recursive painting concept with richer detail and more believable lighting than Zimage.
Kyoto map on a table: Ernie generated more accurate torii gate rows, trees, and character proportions. Zimage struggled with consistent scale and some landmark details.
Ballet studio with animals: Both models made small mistakes, but Ernie followed the prompt more closely overall (correct placement of most elements, fewer major composition errors).

Text, Posters, and Infographics

This is where Ernie really shines.

Diary text test: Ernie reproduced a long sentence with only a couple of small errors (missing a word, one misspelling). Zimage produced far more gibberish.
Bakery window sign: Ernie’s scene looked more realistic and visually pleasing, though it repeated the word “workshop” and mis-rendered one pastry. Zimage nailed more literal details but looked more plasticky.
Holiday cookie swap poster: Here Zimage actually did better, correctly including more of the requested layout details (like sponsor logos and overflowing cookie tin) that Ernie missed.
Machine learning pipeline infographic: Ernie produced a clean, readable UI-style infographic with correct section titles and icons. Zimage devolved into repeated labels and gibberish text.

For marketing graphics, UI diagrams, and educational visuals, Ernie’s text and layout capabilities make it especially useful. If you’re into AI-generated visuals for presentations, you might also like tools covered in guides like running ACE 1.5 XL locally for AI music, which pairs nicely with visual content workflows.

Comics and Manga

Ernie performed impressively on structured comic page prompts:

Correct panel layout (right-to-left manga format)
Accurate placement of specific characters in specific panels
Legible, correctly positioned speech bubbles and large bottom text

The main flaw was some unintended color in what was supposed to be a pure black-and-white manga page. Zimage, by contrast, struggled more with panel consistency, character duplication, and text quality.

Where Zimage Still Wins

Ernie isn’t perfect. Zimage still has advantages in some areas:

Anatomy: On poses like a King Pigeon yoga pose, Zimage produced more anatomically correct bodies. Ernie’s attempt was noticeably distorted.
Special reflection logic: In a bathroom mirror prompt where only the reflection should be glitchy and pixelated, Zimage handled the concept better. Ernie pixelated both the reflection and the real person.
Certain layout-specific scenes: In a Taj Mahal split-view prompt, Zimage aligned the architecture and water more accurately, though its text labels were gibberish. Ernie’s labels were better, but composition was slightly off.

Both models also failed on a notoriously hard prompt: “11:15 on a clock and a wine glass filled to the top.” Neither got the time or the glass level correct.

Benchmarks: How Good Is Ernie Image Really?

According to published benchmark results, Ernie Image ranks as the top open-source image model in overall image generation quality, ahead of:

Zimage
Qwen Image
Flux 2 Klein

It even comes close to some leading closed-source models like Nano Banana 2 in overall scores, especially when it comes to prompt following and visual detail.

There are two main variants:

Ernie Image (base): Slightly better quality, but needs many more sampling steps, so it’s slower.
Ernie Image Turbo: Optimized for speed, needing only around 8 steps. Quality is very close to the base model, and generation is much faster.

For most users, the Turbo variant is the best balance of speed and quality.

How to Run Ernie Image Locally with ComfyUI

Ernie Image is designed to run locally, but it’s a large model. You can use it with ComfyUI, a popular node-based interface for running open-source image and video models on your own machine. If you’re already experimenting with local video models, it fits nicely into workflows like those used for running local AI video generators.

Hardware and Model Sizes

For the full models:

Ernie Image / Ernie Image Turbo model file: ~16 GB each
Text encoder (Mistral 3B): ~7.5 GB
Flux 2 VAE: ~300 MB

These files add up to roughly 24 GB (16 + 7.5 + 0.3), so plan for well over 20 GB of combined VRAM and system memory to run the original (non-compressed) models comfortably. ComfyUI helps by dynamically managing VRAM/RAM, but you still need enough capacity to load everything.
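As a quick sanity check, the file sizes listed above can be added up in a few lines. This is only a rough sketch: the sizes are the approximate figures from this article, while the `fits_in_memory` helper and its `headroom_gb` default are illustrative assumptions, not measurements.

```python
# Approximate file sizes in GB, taken from the list above.
COMPONENT_SIZES_GB = {
    "diffusion_model": 16.0,  # Ernie Image or Ernie Image Turbo
    "text_encoder": 7.5,      # Mistral 3B
    "vae": 0.3,               # Flux 2 VAE (~300 MB)
}


def total_size_gb(sizes):
    """Sum the approximate size of all model files."""
    return sum(sizes.values())


def fits_in_memory(sizes, vram_gb, ram_gb, headroom_gb=2.0):
    """Rough check: VRAM + RAM must hold every file plus some headroom.

    headroom_gb is an assumed allowance for activations and other
    software, not a measured value.
    """
    return total_size_gb(sizes) + headroom_gb <= vram_gb + ram_gb


print(f"Total model files: {total_size_gb(COMPONENT_SIZES_GB):.1f} GB")
print(fits_in_memory(COMPONENT_SIZES_GB, vram_gb=24, ram_gb=32))
```

Real memory use also depends on resolution, precision, and how ComfyUI offloads models, so treat the result as a lower bound rather than a guarantee.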

Step 1: Update ComfyUI

Assuming you already have ComfyUI installed:

Open your ComfyUI folder.
Go into the update folder and run update_comfyui.bat (the name used in the Windows portable build; use the equivalent script on your OS).
Wait for the update to complete.

Step 2: Launch ComfyUI and Load the Ernie Workflow

Run run_nvidia_gpu.bat (the Windows portable launcher; otherwise use your platform’s launch script) to start ComfyUI.
In the left sidebar, click Templates.
Search for "Ernie".
You’ll see two templates: one for Ernie Image and one for Ernie Image Turbo.
Select the Ernie Image Turbo workflow for faster generation.

The workflow will appear with some nodes outlined in red, indicating missing models.

Step 3: Download the Required Models

ComfyUI can download the models for you directly from within the workflow:

Click See errors to view missing models.
Use the built-in Download buttons where available.

Or download them manually as indicated in the nodes:

Ernie Image Turbo model → place in ComfyUI/models/diffusion_models
Mistral 3B text encoder → place in ComfyUI/models/text_encoders
Flux 2 VAE → place in ComfyUI/models/vae
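If you download the files manually, a small script can confirm that each folder contains a model file before you refresh ComfyUI. This is a minimal sketch, assuming the three subfolders named above and that the files end in `.safetensors` or `.gguf`; the `"ComfyUI"` root path and the function names are placeholders, so adjust them to your install.

```python
from pathlib import Path

# Subfolders named in the steps above, relative to the ComfyUI root.
EXPECTED_FOLDERS = {
    "diffusion model": "models/diffusion_models",
    "text encoder": "models/text_encoders",
    "VAE": "models/vae",
}

# Assumed model file extensions; extend this set if your files differ.
MODEL_SUFFIXES = {".safetensors", ".gguf"}


def find_model_files(comfy_root):
    """Return {component: [file names]} for each expected model folder."""
    root = Path(comfy_root)
    found = {}
    for component, subfolder in EXPECTED_FOLDERS.items():
        folder = root / subfolder
        if folder.is_dir():
            found[component] = sorted(
                p.name
                for p in folder.iterdir()
                if p.is_file() and p.suffix in MODEL_SUFFIXES
            )
        else:
            found[component] = []
    return found


def missing_components(comfy_root):
    """List the components whose folder has no model file yet."""
    files = find_model_files(comfy_root)
    return [name for name, names in files.items() if not names]


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; point it at your own install folder.
    for name in missing_components("ComfyUI"):
        print(f"No model file found for: {name}")
```

The check only looks for the presence of a file in each folder. It does not verify that the file is the right model or that the download completed.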

After downloading:

Press R in ComfyUI to refresh the model list.
Select Ernie Image Turbo as the diffusion model.
Select Mistral 3B as the text encoder.
Select Flux 2 VAE as the VAE.

If the red outlines remain, refresh the page. Once all three models are selected, the workflow is ready.

Step 4: Basic Settings and Prompt Enhancement

The main controls you’ll use are:

Prompt: Your text description of the image.
Width & Height: Output resolution.
Prompt enhancement toggle: If enabled, an AI helper rewrites your prompt to be more detailed and Ernie-friendly. This can improve results but uses more VRAM and time.

If you expand the workflow, you’ll see more advanced settings:

Steps (KSampler): For Turbo, ~8 steps is recommended.
CFG (Classifier-Free Guidance): Controls how strictly the model follows your prompt. Around 1.0 is a good default; try 0.8–1.2 if you want to experiment.
Sampler type: The specific sampling algorithm. The default usually works well.
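To see what the CFG value does, it helps to look at the standard classifier-free guidance formula, which blends a prompt-conditioned prediction with an unconditioned one. The sketch below uses short Python lists as stand-ins for those predictions. The numbers are made up, and this is the textbook formula rather than a confirmed description of Ernie’s internals.

```python
def apply_cfg(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [0.8, -0.2, 0.5]    # toy prompt-conditioned prediction
uncond = [0.5, 0.0, 0.5]   # toy unconditioned prediction

for scale in (0.8, 1.0, 1.2):
    print(scale, apply_cfg(cond, uncond, scale))
```

At a scale of 1.0 the result is simply the conditioned prediction. Values above 1.0 push the output further away from the unconditioned result, and values below 1.0 pull it back toward it, which is why small steps around 1.0 are a reasonable range to experiment with.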

Once set, collapse the workflow back to the main view and click Run. ComfyUI will load the models and generate your image, usually in under 10 seconds with Turbo on a capable GPU. Outputs are saved automatically in your ComfyUI output folder.
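If you would rather queue generations from a script than click Run, ComfyUI also accepts workflows over HTTP. The sketch below assumes a workflow exported in ComfyUI’s API format (a dictionary of node ids mapping to `class_type` and `inputs`), a server on the default local address, and that the sampler node is a plain `KSampler`. The Ernie template may use different node classes, so check your exported file. The helper names are illustrative.

```python
import json
import urllib.request


def set_sampler_settings(workflow, steps=8, cfg=1.0):
    """Return a copy of an API-format workflow with KSampler steps/cfg set."""
    patched = json.loads(json.dumps(workflow))  # deep copy via JSON
    for node in patched.values():
        if node.get("class_type") == "KSampler":
            node["inputs"]["steps"] = steps
            node["inputs"]["cfg"] = cfg
    return patched


def queue_prompt(workflow, server="http://127.0.0.1:8188"):
    """POST the workflow to a running ComfyUI server's /prompt endpoint."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Toy fragment standing in for a real exported workflow.
example = {"3": {"class_type": "KSampler", "inputs": {"steps": 20, "cfg": 4.0}}}
print(set_sampler_settings(example)["3"]["inputs"])
```

Calling `queue_prompt` only works while ComfyUI is running, and only with a complete exported workflow. The toy fragment above is there to show the patching step, not to be submitted.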

Running Ernie Image on Low-VRAM GPUs (GGUF)

If you don’t have 20+ GB of VRAM, you can still use Ernie Image thanks to compressed GGUF versions provided by the community.

Step 1: Choose a GGUF Variant

A community maintainer has released multiple quantized Ernie Image Turbo GGUF files with different sizes:

Smallest (e.g., Q2_K): ~3.2 GB – lowest VRAM usage, but most quality loss.
Larger variants (e.g., Q6_K): ~6–7 GB – better quality, still much lighter than the full 16 GB model.

Pick the largest file that fits comfortably within your GPU VRAM. For example, with an 8 GB GPU, a ~6–7 GB GGUF is a good starting point.
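The selection rule can be written down as a tiny helper. This is a sketch under assumptions: the two sizes are the approximate figures quoted above (with the ~6–7 GB variant entered as 6.5), and the 1 GB of headroom is an illustrative guess at what the rest of the pipeline needs, not a measured value.

```python
# Approximate file sizes in GB from the list above.
GGUF_SIZES_GB = {
    "smallest": 3.2,  # lowest VRAM usage, most quality loss
    "larger": 6.5,    # midpoint of the ~6-7 GB range
}


def pick_variant(sizes_gb, vram_gb, headroom_gb=1.0):
    """Pick the largest variant that still leaves headroom_gb of VRAM free.

    Returns None when no variant fits.
    """
    fitting = {
        name: size
        for name, size in sizes_gb.items()
        if size + headroom_gb <= vram_gb
    }
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for vram in (4, 6, 8):
    print(vram, "GB VRAM ->", pick_variant(GGUF_SIZES_GB, vram))
```

Add the other quantizations from the download page to the dictionary to get a finer-grained recommendation for your card.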

Download your chosen GGUF and place it in:

ComfyUI/models/unet

Step 2: Install the GGUF Loader Extension

To use GGUF models in ComfyUI:

Open the Extensions tab in ComfyUI.
Search for "ComfyUI-GGUF" by city96.
Install or update the extension.
Apply changes and restart ComfyUI if prompted.

Step 3: Swap the Diffusion Model Node

In the Ernie Image Turbo workflow:

Expand the workflow to see all nodes.
Find the node that loads the diffusion model.
Double-click on the canvas and search for "Unet Loader (GGUF)" (the exact label may vary by extension version).
Add the Unet Loader (GGUF) node.
Connect its output to the model input of the KSampler node.
Disable or bypass the original diffusion model loader (Ctrl + B).
In the GGUF loader node, select your downloaded Ernie Image Turbo GGUF file.
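The same swap can be expressed on a workflow exported in ComfyUI’s API format, where each node is a dictionary entry with a `class_type` and `inputs`. The sketch below replaces the loader in place, so existing connections to that node id keep working. The class names `UNETLoader` and `UnetLoaderGGUF`, the `unet_name` input, and both file names are assumptions for illustration, so verify them against your own exported workflow and installed extension.

```python
import copy


def swap_to_gguf_loader(workflow, gguf_file,
                        old_class="UNETLoader", new_class="UnetLoaderGGUF"):
    """Return a copy of the workflow with the diffusion model loader
    replaced by a GGUF loader that keeps the same node id."""
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        if node.get("class_type") == old_class:
            node["class_type"] = new_class
            # Assumed input name for the GGUF loader node.
            node["inputs"] = {"unet_name": gguf_file}
    return patched


# Toy two-node fragment; file names are hypothetical placeholders.
example = {
    "1": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "full-model.safetensors",
                     "weight_dtype": "default"}},
    "2": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "steps": 8}},
}
patched = swap_to_gguf_loader(example, "turbo-example.gguf")
print(patched["1"])
print(patched["2"]["inputs"]["model"])
```

Because the node id is unchanged, the sampler’s `model` input still points at node "1", which mirrors reconnecting the new loader’s output to the KSampler in the graph view.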

Now click Run. The model will generate images using the compressed GGUF version. Quality will be lower than the full model—especially with the smallest quantizations—but still usable, and it makes Ernie Image accessible on much more modest hardware.

Should You Switch to Ernie Image?

Ernie Image isn’t flawless, especially in anatomy and some tricky logical setups, but it’s currently one of the strongest open-source image generators you can run locally. It’s particularly compelling if you:

Create posters, infographics, or marketing visuals with lots of text.
Need comics or manga-style layouts with multiple panels and speech bubbles.
Care about prompt adherence and detailed, multi-element scenes.
Prefer natural-looking photorealism over plasticky renders.

If you rely heavily on perfect anatomy or very precise physical logic, you may still want to keep Zimage in your toolkit. But for many everyday creative, marketing, and design tasks, Ernie Image is a new top contender for local, free, and unlimited AI image generation.

As open-source models keep improving and new variants roll out (including planned image editing capabilities for Ernie Image), local generation is quickly catching up to closed, cloud-only tools—while giving you more control over your own hardware and data.