Run LLMs on Your Own Hardware with Ollama
How to run open-source LLMs locally — Ollama setup, llama.cpp, GPU acceleration, model picks, and practical use cases.

Every time you use ChatGPT or Claude, the thought crosses your mind: "Could I just run this on my own machine?" No API costs, no internet dependency, no data leaving your network. With open-source LLMs reaching practical quality in 2026, local execution is a real option now.
Don't expect GPT-5 or Claude Opus 4.6 performance. But for code completion, quick summaries, and draft generation, local models hold their own. 2025 was the year open-source models closed most of the gap with commercial ones; in 2026 they have reached parity on many routine tasks.
Why Run Locally
Cost — No per-token billing. The hardware costs something upfront, but under heavy usage it pays for itself.
Privacy — Sensitive code and internal docs never leave your machine. In security-conscious environments, this might be the only acceptable approach.
Offline access — Works on a plane, on a train, in a basement with no wifi.
Customization — Fine-tune models, set custom system prompts, build RAG pipelines from scratch.
The downsides are real too. You need decent hardware, there's a performance gap on complex tasks, and setup takes some effort.
Hardware Requirements
The bottleneck is VRAM (GPU memory). Model size determines how much you need.
| Model Size | VRAM Needed (4-bit quantization) | Suggested GPU |
|---|---|---|
| 7–8B | 4–6GB | RTX 3060 12GB, RTX 4060 |
| 13–14B | 8–10GB | RTX 3080, RTX 4070 |
| 30–34B | 18–20GB | RTX 3090, RTX 4090 |
| 70B | 36–40GB | A100, or CPU offloading |
No GPU? CPU inference works — slow, but functional. With 16GB+ RAM, a 7B model runs at usable speeds on CPU alone. Apple Silicon Macs are surprisingly good for local LLMs since their unified memory doubles as GPU memory.
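The table's numbers follow from simple arithmetic: each parameter costs roughly bits-per-weight ÷ 8 bytes. A quick sketch of that rule of thumb — the 4.5 bits/weight figure is an approximation for 4-bit GGUF quantization, and the KV cache and activations need extra headroom on top of this:

```python
def quantized_weight_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights alone, in GB.
    KV cache and activations need additional VRAM beyond this."""
    return round(params_billions * bits_per_weight / 8, 1)

# Sanity-check against the table above (4-bit quantization):
for size in (7, 13, 34, 70):
    print(f"{size}B -> ~{quantized_weight_gb(size)} GB of weights")
```

The estimates line up with the table's lower bounds; the gap up to the upper bounds is the context-dependent KV cache.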
Ollama — Up and Running in 5 Minutes
The fastest path to a local LLM. No Docker, no Python environment required.
Installation
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows — download the installer from the official site
Running a Model
# download and run (auto-downloads on first use)
ollama run qwen3.5
# other models
ollama run llama4:scout
ollama run qwen3.5:9b # specify the 9B variant
ollama run mistral
ollama run gemma
That's it. One command and you get a model download plus an interactive chat interface. Under the hood it's llama.cpp, so performance is solid. Ollama also supports cloud models now — add a :cloud tag and it connects to the cloud instead of downloading.
Using It as an API Server
Ollama exposes a REST API on localhost:11434 by default.
curl http://localhost:11434/api/generate -d '{
"model": "qwen3.5",
"prompt": "Write a quicksort implementation in Python",
"stream": false
}'
It also supports an OpenAI-compatible endpoint, which means you can point existing OpenAI SDK code at your local model with just a base URL change.
# OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions -d '{
"model": "qwen3.5",
"messages": [{"role": "user", "content": "Hello"}]
}'
If you've built anything against the OpenAI API, switching to a local model is essentially a config change.
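To show how small that config change is, here is a minimal sketch using only the Python standard library, assuming an Ollama server on the default port. (With the official `openai` SDK, the equivalent change is setting `base_url="http://localhost:11434/v1"` on the client.)

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint on the default port
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Payload in the OpenAI chat-completions shape that Ollama accepts."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(model: str, prompt: str) -> str:
    """Send one chat turn to the local server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# usage (requires a running Ollama server):
#   print(chat("qwen3.5", "Hello"))
```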
llama.cpp — When You Want Full Control
Ollama is the "install and go" tool. llama.cpp is the "build it yourself and tune every knob" tool. Written in C/C++, lightweight and fast.
Building
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# CPU only
cmake -B build
cmake --build build --config Release
# NVIDIA GPU acceleration (CUDA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# Apple Silicon — Metal is enabled by default, no extra flag needed
Download and Run
Models use the GGUF format. Grab quantized models from Hugging Face.
# run in server mode
./llama-server -m models/llama-4-scout-q4_k_m.gguf \
--host 0.0.0.0 --port 8080 \
-ngl 99 # GPU layers (more = more VRAM used, faster)
The -ngl (number of GPU layers) flag is key. Enough VRAM? Load all layers onto the GPU for maximum speed. Short on VRAM? Load some layers on GPU, rest on CPU — hybrid mode works fine, just slower.
More setup than Ollama, but you get granular control over quantization, context size, batch size, and other parameters.
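Picking a value for -ngl is just division: model size over layer count gives a per-layer cost, and your free VRAM divides by that. A rough sketch, assuming layers are close to uniform in size (real layers vary, and the numbers here are illustrative, not measured):

```python
def layers_that_fit(total_layers: int, model_gb: float, free_vram_gb: float,
                    reserve_gb: float = 1.0) -> int:
    """Estimate how many transformer layers fit on the GPU, reserving
    some VRAM for KV cache and scratch buffers. Crude rule of thumb --
    assumes all layers are roughly the same size."""
    per_layer_gb = model_gb / total_layers
    usable = max(free_vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable / per_layer_gb))

# e.g. a ~19GB quantized model with 48 layers on a 12GB card:
# only part of the model fits, so the rest runs on CPU (hybrid mode)
print(layers_that_fit(48, 19.0, 12.0))
```

If the result equals the layer count, pass a large -ngl value (like the 99 above) to put everything on the GPU.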
Recommended Models (2026)
Model selection is the confusing part, so here's a breakdown by use case.
Code assistance — Qwen 3.5 reshuffled the landscape when it dropped in early 2026. The lineup ranges from tiny (0.8B, 2B, 4B, 9B) through mid-size (27B, 35B-A3B, 122B-A10B) to flagship (397B-A17B). The 9B model punches well above its weight, competing with significantly larger models. The 35B-A3B variant (3B active parameters) outperforms the previous-gen Qwen 3 235B-A22B in efficiency. For code completion, 4B or 9B is enough. Complex code generation benefits from the 27B model.
General chat and summarization — Qwen 3.5 performs well as a general-purpose model too. Llama 4 is another solid option — ollama run llama4:scout gets you going quickly. Qwen 3.5 supports 201 languages and has built-in tool calling, which makes it viable for agent workflows.
Multilingual needs — Qwen 3.5's 201-language coverage handles most languages well. For specialized language needs, look for fine-tuned variants on Hugging Face.
For quantization level, Q4_K_M is the widely recommended balance between quality and file size. Q8 is higher quality but nearly doubles the size. Q2 degrades too much.
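The size tradeoff is easy to see in numbers. The bits-per-weight figures below are rough averages for GGUF quantization types (actual values vary slightly by model architecture):

```python
APPROX_BITS_PER_WEIGHT = {  # rough averages; exact values vary by model
    "Q2_K": 2.6,
    "Q4_K_M": 4.5,
    "Q8_0": 8.5,
}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Approximate GGUF file size for a model at a given quantization."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return round(params_billions * bits / 8, 1)

# A 9B model at each level:
for q in ("Q2_K", "Q4_K_M", "Q8_0"):
    print(q, gguf_size_gb(9, q), "GB")
```

Q8_0 comes out close to double the Q4_K_M size, matching the tradeoff described above.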
Practical Use Cases
Code Completion in VS Code
Install the Continue extension. It connects to Ollama models for inline completions. Configure the model in ~/.continue/config.json and you're set.
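A minimal config sketch, assuming Continue's config.json format — the exact schema varies between Continue versions, so treat the field names as illustrative and check the extension's docs:

```json
{
  "models": [
    { "title": "Qwen local", "provider": "ollama", "model": "qwen3.5" }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen autocomplete", "provider": "ollama", "model": "qwen3.5:4b"
  }
}
```

Using a smaller variant for tab autocomplete keeps inline completions fast while a larger model handles chat.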
Local RAG
Build a system that answers questions using your own documents. The pipeline: convert docs to embeddings, store in a vector DB, retrieve relevant chunks on query, pass them as context to the LLM. All of this can run locally.
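The retrieval step can be sketched in a few lines. This toy version uses bag-of-words overlap as a stand-in for real embeddings — in an actual pipeline you would get vectors from a local embedding model and store them in a vector DB, but the retrieve-then-prompt shape is the same:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts. A real pipeline would
    call a local embedding model instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Ollama serves models on localhost port 11434",
    "GGUF is the quantized model format used by llama.cpp",
    "Unified memory lets Apple Silicon share RAM with the GPU",
]
context = retrieve("what port does ollama listen on", docs, k=1)[0]
prompt = f"Answer using this context:\n{context}\n\nQuestion: what port does ollama listen on?"
# `prompt` then goes to the local LLM via the Ollama API
```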
Team API Server
Set up Ollama on a machine with a good GPU and let the team hit it over the network. By default Ollama only listens on localhost, so set the OLLAMA_HOST environment variable (e.g. OLLAMA_HOST=0.0.0.0) before starting the server, then point clients at the machine's address. One GPU server serving an entire team is cost-effective.
Setting Realistic Expectations
Local LLMs aren't a replacement for top-tier commercial models. Compared to GPT-5 or Claude Opus 4.6, they lag in multi-step reasoning, instruction following, and long-context coherence. Creative writing and complex problem-solving still show a noticeable gap.
But for repetitive, structured tasks — code completion, log analysis, text classification, summarization — local models are genuinely practical. If API costs are a concern or you can't send data to external servers, they're worth serious consideration.
Try ollama run qwen3.5. Five minutes. That's all it takes to see if local inference fits your workflow. From there, dial in the model and configuration that works for you.