Never Use ChatGPT Again? How to Run Powerful AI Locally Instead
Most people think using AI means logging into ChatGPT, Claude, or Gemini and handing over their data. But you can get surprisingly powerful AI on your own machine—no cloud, no tracking, and no monthly subscription.
This guide walks you through how to run local AI models, hook them up to your own search engine, and even let them control apps on your desktop, all while keeping your data in your hands.
Why You Might Want to Ditch Cloud AI
Modern AI assistants are getting more powerful, but they’re also getting more invasive. Desktop apps from big AI companies increasingly ask for deep access to your computer so they can “click around” for you, open files, browse the web, and automate tasks.
That convenience comes with a cost: you’re effectively giving a large tech company a remote control for your machine, plus a detailed log of what you search, write, and work on. If you care about privacy, that’s a huge trade-off.
The good news is that you don’t have to choose between using AI and giving up your data. With today’s hardware and open models, you can run capable AI systems locally and keep everything on your own device.
Running AI Models Locally: The Basics
Local AI means the model runs on your own hardware—your CPU and GPU—rather than on a company’s servers. You download a model once, load it into memory, and interact with it through a simple app or API.
Two popular ways to get started are:
Option 1: Ollama (local AI server)
Ollama is a lightweight AI server you run on your machine. Once installed, you can download and run different models with simple commands.
How it works in practice:
• You install Ollama via a one-line script in your terminal (on macOS or Linux).
• Then you choose a model (for example, Qwen 3.6) from their catalog and follow the instructions to download it.
• Each model lists its size, context length (how much text it can consider at once), and what inputs it supports (text, images, etc.).
Ollama runs as a local service, so other tools on your machine can talk to it like they would talk to an online AI API—except everything stays on your computer.
Option 2: LM Studio (user-friendly desktop app)
LM Studio is a desktop app that makes local models much easier to use, especially if you’re not comfortable with the command line.
Key points:
• It’s not open source, but it’s one of the most user-friendly options.
• You download the app for your OS (Windows, macOS, Linux). On Linux, it’s a single AppImage file you can run directly.
• Inside LM Studio, you can browse, download, and run models with a graphical interface.
LM Studio also exposes a local API endpoint that mimics the OpenAI API, which is important later when you want to plug it into agents and other tools.
Understanding Model Sizes, VRAM, and Performance
When you browse models, you’ll see names like “Qwen 3.5 9B” or “Qwen 27B”. That “B” stands for billion parameters—roughly the “size of the brain” the model is using.
In simple terms:
• Smaller models (e.g., 7B–9B parameters) are faster and lighter, but less capable.
• Larger models (e.g., 27B+ parameters) are smarter and more detailed, but need more GPU memory (VRAM).
To run these models well, you need enough VRAM on your GPU:
• A common gaming GPU like an RTX 3060 often has 12 GB of VRAM.
• Many users still have 8 GB of VRAM, which limits you to smaller or more heavily compressed models.
• Bigger models plus large context windows can easily use 12+ GB of VRAM.
You can check your VRAM using tools like GPU-Z on Windows or your system monitor on Linux/macOS. Once you know your VRAM, you can pick models that fit your hardware.
For example, a 9B model might use around 7 GB of VRAM at a small context size, but if you increase the context window (so it can read more text at once), usage can jump to 10–12 GB. A 27B model will use significantly more, and you’ll see your GPU fans spin up like you’re playing a heavy game.
Where Local Models Come From: Hugging Face
Most open models live on Hugging Face, a huge repository of AI models and datasets. Think of it as the “download hub” for AI.
On Hugging Face you’ll find:
• Text-only models (chat, coding, writing)
• Text-to-image and video models
• Multimodal models that handle text and images
• Variants of the same base model with different sizes or quantizations (compressed versions)
LM Studio and other tools can search Hugging Face directly, so you don’t usually have to download models manually. You just search inside the app (e.g., for “Qwen 3.5 9B”), click download, and then load the model.
Adding Private Web Search with SearXNG
Running a local model is powerful, but it’s limited to whatever was in its training data. To make it truly useful for research, you want it to browse the current web.
Instead of giving your AI direct access to Google or Bing, you can run your own self-hosted search engine with SearXNG:
• SearXNG is a meta-search engine you can run locally (for example, in Docker).
• It queries multiple search providers on your behalf and returns results in a clean, ad-free format.
• You can configure it to return results as JSON, which makes it easy for AI tools to consume.
Once SearXNG is running on your machine (e.g., at http://127.0.0.1:4000), you can treat it as the “eyes” of your local AI. The model doesn’t call Google directly; it calls your private SearXNG instance, which then fetches results and sends them back.
This setup is similar in spirit to what big AI assistants do when they “browse the web” for you—but here, you control the search engine and the data never leaves your machine beyond the actual web requests.
If you’re interested in how different AI frontends and wrappers shape your experience, it’s worth comparing this approach to cloud tools in something like ChatGPT vs Claude vs Gemini: The Hidden “Wrapper” That Actually Matters.
Connecting Local Models to SearXNG in LM Studio
LM Studio includes a “Developer” section where you can define tools and endpoints that the model is allowed to use. This is where you plug in your SearXNG instance.
The high-level steps:
1. Enable JSON output in your SearXNG configuration so it can return machine-readable results.
2. In LM Studio’s developer tab, configure a tool (via an mcp.json or similar configuration) that points to your SearXNG URL.
3. Give the model permission to call this tool when it needs to search the web.
Once set up, you can ask your local model things like:
• “Look up everything you can about the recent Vercel hacking incident.”
The model will:
• Call your SearXNG endpoint
• Fetch relevant links and snippets
• Summarize and structure the information into a readable report
Even a 9B model can do a decent job here, though a 27B model will usually be more thorough and less likely to hallucinate.
9B vs 27B: What You Actually Notice
When you compare a smaller model (like Qwen 3.5 9B) with a larger one (like a 27B variant), you’ll notice a few things:
• Speed: 9B is noticeably faster. Tokens appear quickly, and responses feel snappy.
• Detail: 27B tends to be more verbose and structured, often adding sections like “Why this matters” or more nuanced analysis.
• Tool use: Larger models are better at reliably using tools (like search or file operations) without getting confused.
However, the bigger model also:
• Takes longer to load into VRAM
• Consumes much more GPU memory
• Runs your GPU hotter and louder
There’s always a trade-off between capability and resource usage, and for many everyday tasks, a well-tuned 9B model is “good enough.”
From Chatbot to Agent: Letting AI Control Your Computer
Search and chat are just the beginning. The next step is agentic AI—systems that can plan tasks and take actions on your computer, like opening apps, editing files, and generating reports while you’re away.
One example of this is Hermes Agent, an open-source tool that lets you connect an AI model to a set of tools on your system, including:
• Web browsing via SearXNG
• File operations (read, write, save)
• Running shell commands (if you allow it)
How Hermes Agent Works
At a high level:
1. You install Hermes Agent via a script in your terminal.
2. During setup, you choose which model endpoint to use.
3. Instead of pointing it at a cloud provider, you select a custom endpoint and paste in the local API URL from LM Studio (usually something like http://127.0.0.1:1234/v1).
Hermes then detects the models available from LM Studio and lets you pick one (for example, your 9B or 27B Qwen-based model).
Once configured, you can give it higher-level instructions like:
• “Research the Vercel hacking incident, summarize everything you find, and save the report as a text file on my desktop.”
The agent will:
• Call your local model
• Use SearXNG to search the web
• Read results, compile a summary
• Create a file and save it where you asked
This is the same kind of “AI does the work while you take a break” workflow that cloud tools advertise—but now it’s entirely local.
Safety and Limits of Agents
Giving an AI the ability to run commands or modify files is powerful but risky. Even with safety layers, smaller models can misunderstand instructions or hallucinate dangerous commands.
For example:
• If you ask an agent how to delete everything on your system, it might list destructive commands like rm -rf /home or low-level disk wipes.
• But a reasonably aligned model will refuse to actually run them and warn you instead.
Uncensored or poorly aligned models, however, might not refuse. That’s why you should:
• Be very careful which tools you expose to the agent (especially shell access).
• Review commands before allowing them to run.
• Treat agents as helpful but fallible assistants—not autonomous admins.
Real Example: Generating a Full Website Automatically
To see how far this can go, imagine asking your local agent to not just research an incident, but also build a website about it:
• “Look up everything you can about the Vercel hack, research best practices for clean, Apple-style web design, then compile your findings into a single HTML file that looks like Apple’s site and save it on my desktop.”
With a strong local model (like a compressed 27B Qwen-based variant) and SearXNG browsing enabled, the agent can:
• Search for up-to-date information on the incident
• Read documentation and examples of Apple-like design
• Generate HTML, CSS, and basic layout
• Save a complete, working web page to your desktop
The result won’t be pixel-perfect Apple, but it will be a clean, structured site with sections, headings, and styling—created in a few minutes while your GPU fans spin like a game is running.
This is exactly the kind of workflow companies are selling with cloud-based AI agents. The difference here is that you’re not streaming your entire project, file system, and browsing history to a third party.
The Bigger Picture: Why Local AI Matters
Cloud AI isn’t going away. In fact, many companies are moving toward:
• Tiered access to “intelligence” based on how much you pay
• Increasingly strict verification and identity requirements
• Locking advanced capabilities behind expensive subscriptions or enterprise deals
All of this is built on top of massive amounts of data—often scraped from the open web without clear consent. As these systems get more capable, there’s a real risk that powerful AI becomes something only large organizations truly control.
Local AI pushes in the opposite direction:
• You own the hardware and the models you download.
• You decide what data the model sees.
• You can still get strong capabilities—research, coding help, automation—without being surveilled.
For power users and teams, combining local models, self-hosted search (like SearXNG), and agent frameworks (like Hermes) can get you surprisingly close to what the big players offer, especially for focused workflows. If you’re curious how top companies are already using AI in practice, it’s worth comparing this local-first approach to the strategies in how the best companies really use AI.
Should You Stop Using ChatGPT Completely?
You don’t have to swear off cloud AI forever. There are still good reasons to use ChatGPT, Claude, Gemini, or similar tools:
• They’re extremely capable for complex reasoning and coding.
• They run on huge clusters you can’t realistically replicate at home.
• They’re convenient when you’re on the go or using low-powered devices.
But you also don’t need to give them everything:
• For private projects, sensitive notes, or local automation, a self-hosted setup is safer.
• For repeatable workflows (reports, research, code scaffolding), local agents can be cheaper long-term once you own the hardware.
• For experimentation and learning, local models give you full control.
AI isn’t going away. The real question is whether you want to be entirely dependent on a few giant companies for access to it—or whether you’re willing to take a bit of time to set up your own stack and keep some of that power for yourself.
If you’re already a gamer or power user with a decent GPU, you probably have enough hardware to start. From there, it’s just a matter of installing a local server (Ollama or LM Studio), wiring in SearXNG, and experimenting with agents like Hermes to see how far you can push local AI.
Comments
No comments yet. Be the first to share your thoughts!