AI — Running AI Locally

AI — Running AI Locally

AI — Running AI Locally

Setup guide for running LLM models locally, from a basic laptop through a multi-user team server.

See also: AI — Foundational Concepts, AI — Learning Resources & Roadmap


Why Run Locally

  • Privacy — prompts stay on your machine; no third-party data policy exposure
  • Cost — no per-token billing; one-time hardware cost amortises over heavy use
  • Offline — no internet dependency, no rate limits, no service outages
  • It works now — Qwen3, Llama 4, DeepSeek, Gemma 3 handle most everyday tasks on consumer hardware

The Critical Hardware Metric: VRAM

VRAM (GPU memory) is what matters — not CPU or total RAM. The model must fit in VRAM during inference. If it overflows to system RAM, speed drops from ~35 tokens/sec to ~3 tokens/sec.

Quick check: visit CanIRun.ai — detects your GPU/CPU/RAM via browser and shows which models your machine can run. No sign-up.


Tier 1: Laptop (8–16GB RAM, no dedicated GPU)

Models that fit:

  • 8GB RAM: 3B–7B parameter models (Phi-4 Mini, Mistral 7B — ~4–6GB when loaded)
  • 16GB RAM: up to 12B (Gemma 3 12B handles general chat well)

Speed: CPU-only runs at 3–8 tokens/sec — usable for drafting, slow for chat.

Tool: LM Studio (lmstudio.ai)

  • GUI, works on Windows / Mac / Linux
  • Searches and downloads models from HuggingFace
  • Auto-detects hardware; falls back to CPU if no GPU
  • Easiest starting point for non-technical users

Alternative: GPT4All — even simpler, fewer model options.


Tier 2: Desktop with Dedicated GPU

GPUVRAMModel rangeSpeed
RTX 3060 12GB12GB7B–8B comfortably30–50 tokens/sec
RTX 4060 Ti 16GB16GB7B–13B30–50 tokens/sec
RTX 3090 / 409024GB30B models; compressed 70B40–60 tokens/sec

Warning: avoid the RTX 4060 Ti 8GB — the 8GB VRAM fills up once model + context grows. The 16GB version is worth the premium.

Tool: Ollama (CLI)

1
ollama run llama3.3    # pulls model if not local, starts chat
  • Single install, works immediately
  • Runs a local API server on port 11434 (OpenAI-compatible)
  • Any app that talks to OpenAI’s API can point at localhost:11434 instead

Optional UI on top of Ollama: Open WebUI — browser-based chat interface, feels like a proper chat app.


Tier 3: Advanced / Team Use

Apple Silicon Macs (unified memory)

  • GPU and CPU share the same memory pool — no VRAM ceiling
  • Mac Mini M4 (16GB): handles 14B models smoothly
  • Mac Studio M3 Ultra (96GB): multiple large models in memory simultaneously
  • Best consumer option for large models without a discrete GPU

vLLM (for serving a team)

1
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-8B
  • Handles concurrent requests (Ollama doesn’t)
  • OpenAI-compatible API — drops in as a replacement
  • More setup work; right choice when more than one person needs access

Model Format: GGUF

GGUF is the standard format for quantised models on consumer hardware (available on Hugging Face).

QuantisationSize vs quality trade-off
Q4_K_MGood balance — recommended starting point
Q5_K_MBetter quality, ~25% more memory
Q8_0Near-original quality, ~2× Q4 memory

For most local use: Q4 or Q5. Q8 if you have VRAM to spare and want maximum fidelity.


Tool Comparison

ToolInterfaceBest for
LM StudioGUILaptop / beginner / visual model browser
OllamaCLI + APIGPU desktop / developers / app integration
Open WebUIBrowserChat UI layer on top of Ollama
GPT4AllGUISimplest possible local setup
vLLMServerTeam/multi-user deployment

Known Limitations

  • Quality ceiling — local models are behind GPT-4o and Claude on complex reasoning and very long documents; the gap narrows quickly each generation
  • Speed on CPU — 3–8 tokens/sec on a CPU-only laptop; acceptable for drafting, not for interactive chat
  • Manual updates — new model versions don’t auto-install; you pull and switch manually
  • Context window — most local models cap at 8K–32K tokens vs 200K+ for cloud models

Trending Tags