The Hardware Reality: What GPU You Actually Need for Local AI

This is the article that saves you from buying the wrong GPU. The local AI community has a hardware fetish — every other post on r/LocalLLaMA is someone asking "will my [GPU] run [model]" or showing off their multi-GPU server build. The actual answer to "what hardware do I need" depends on what model sizes you want to run, at what speed, and how much you're willing to spend. The math is less complicated than the forums make it seem, and VRAM is the only number that really matters.

The Only Number That Matters: VRAM

Local AI inference is a VRAM problem. Not a CPU problem, not a system RAM problem, not a storage speed problem — a VRAM problem. VRAM (video memory on your GPU, or unified memory on Apple Silicon) determines the largest model you can run entirely on your GPU. When a model fits in VRAM, inference is fast. When it doesn't fit, parts of the model spill to system RAM ("offloading"), and inference gets dramatically slower — we're talking 5-10x slower, sometimes more.

The rule of thumb is simple: a quantized model needs roughly 0.5-1GB of VRAM per billion parameters, depending on quantization level. A 7B parameter model at Q4 quantization needs about 4-5GB. A 13B model needs 8-10GB. A 30B model needs 16-20GB. A 70B model needs 35-45GB. These are rough numbers — context length, quantization choice, and runtime overhead affect the actual requirement — but they're close enough to guide purchasing decisions.
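
That rule of thumb is easy to turn into a quick calculator. This is a rough sketch, not a measured formula: the bits-per-weight figure and the fixed allowance for context and runtime buffers are approximations, and real requirements vary with context length.

```python
def vram_needed_gb(params_b: float, bits_per_weight: float = 4.5,
                   overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate for a quantized model.

    params_b: parameter count in billions (7, 13, 30, 70, ...)
    bits_per_weight: ~4.5 for Q4-class quants, ~8.5 for Q8
    overhead_gb: rough allowance for KV cache and runtime buffers
    (all figures are approximations, not measured values)
    """
    weights_gb = params_b * bits_per_weight / 8  # GB for the weights alone
    return weights_gb + overhead_gb

for size in (7, 13, 30, 70):
    print(f"{size:>2}B at ~Q4: roughly {vram_needed_gb(size):.1f} GB VRAM")
```

Running this reproduces the ranges above: about 5GB for 7B, 9GB for 13B, 18GB for 30B, and 41GB for 70B at Q4-class quantization.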

Here's what that means in practice: your VRAM ceiling determines your maximum model size, and your maximum model size determines the quality ceiling for your local AI setup. A 7B model is useful for quick tasks, code completion, and simple chat. A 13B model is noticeably smarter for reasoning and complex instructions. A 30B model competes with cloud APIs on many tasks. A 70B model is the current frontier for local inference and approaches GPT-4-class performance on many benchmarks [VERIFY]. Every step up in model size is a real quality improvement, and every step requires more VRAM.

Apple Silicon: The Accidental AI Platform

Apple didn't design their M-series chips for AI inference. They designed them for battery life and general-purpose performance. The fact that they're excellent for local AI is a happy accident of their unified memory architecture — the CPU and GPU share the same memory pool, which means your "VRAM" is whatever unified memory your Mac has.

M1/M2/M3 base (8GB unified memory): You can run 7B models at Q4 quantization. This is the bare minimum for useful local AI. Responses at maybe 15-25 tokens per second [VERIFY]. It works, it's not fast, and you'll be limited to smaller models. Don't buy an 8GB Mac for local AI in 2026 — it's the floor.

M1/M2/M3 Pro (16-18GB unified memory): Comfortable for 7B, usable for 13B models. The 16-18GB Pro configurations hit a sweet spot where 13B Q4 models fit with some room for context. This is where local AI starts feeling practical rather than experimental. Expect 20-35 tokens per second on 7B models [VERIFY].

M2/M3/M4 Max (32-64GB unified memory): The local AI sweet spot for Mac users. 32GB runs 30B models comfortably. 64GB handles 70B models at Q4 quantization — this is genuinely impressive, because a year ago running a 70B model required an NVIDIA A100 or multiple consumer GPUs. The Max chips with 64GB are the single most convenient path to running frontier-class open models today. Token speeds for 30B models land around 15-25 tokens/second, and 70B models around 8-15 tokens/second [VERIFY] — slower than an NVIDIA GPU but fast enough for interactive chat.

M2/M3/M4 Ultra (128-192GB unified memory): Overkill for most local AI use cases, but if you're running multiple large models simultaneously or working with 70B+ models at high quantization levels (Q8, FP16), the Ultra has headroom nothing else in the consumer space matches. The Mac Studio with M4 Ultra and 192GB unified memory is — absurdly — one of the most capable single-box AI inference machines available at any price, because you simply cannot get 192GB of unified GPU-accessible memory any other way without going to datacenter hardware [VERIFY on M4 Ultra availability and specs].

The Apple Silicon catch: Metal acceleration is well-supported in llama.cpp, Ollama, and LM Studio. Performance is good. But it's not as fast per-VRAM-dollar as NVIDIA CUDA. Apple's advantage is unified memory capacity at consumer prices — not raw inference speed. If you already have a Mac, you're in great shape. If you're buying hardware specifically for AI, the math gets more interesting.

NVIDIA: The Performance Standard

NVIDIA GPUs with CUDA are the reference platform for AI inference. The software stack is the most mature, the optimization is the deepest, and the community support is the broadest. If raw inference speed per dollar is your metric, NVIDIA wins.

RTX 3060 12GB (~$200-250 used): The budget entry point. 12GB of VRAM runs 7B models easily and 13B models at Q4 with some room. This is the card the r/LocalLLaMA community recommends most often for beginners, because the 12GB VRAM variant (not the 8GB version — avoid that) offers the best VRAM-per-dollar of any card on the used market. Token speeds around 35-50 tokens/second on 7B models [VERIFY]. The card is old and slow by gaming standards, but VRAM capacity matters more than compute speed for inference.

RTX 3090 24GB (~$700 used): The serious hobbyist tier. 24GB of VRAM runs 13B models comfortably at Q5 or Q8 quantization and can squeeze in 30B models at aggressive Q4 quantization. The used market price makes the 3090 remarkable value: if you're buying one GPU for local AI and you're willing to buy used, it's the most pragmatic choice available.

RTX 4090 24GB ($1,600-2,000 new): The single-card ceiling for consumer NVIDIA hardware, and significantly faster than the 3090 for inference (maybe 40-60% more tokens per second [VERIFY]): expect 100-140 tokens/second on 7B models and 50-80 on 13B [VERIFY]. It has the same 24GB of VRAM as the 3090, so the model size ceiling is identical; you're paying for speed, not capacity. Worth it if inference speed is critical to your workflow; overkill if you're chatting interactively and 50 tokens/second would be fine.

Multi-GPU and datacenter (A100, H100, RTX 6000 Ada): For running 70B+ models on NVIDIA, you need more than 24GB of VRAM. Options: two 3090s or 4090s with model parallelism [VERIFY compatibility], an NVIDIA A100 80GB ($8,000-15,000 used), or cloud GPU rental. This is where local AI stops being a hobby project and starts being a capital investment. Most people should rent cloud GPUs at this tier rather than buying.

AMD: The Complicated Third Option

AMD GPUs are technically supported for local AI through ROCm, AMD's compute platform. In practice, the experience ranges from "it works fine" to "you'll spend a weekend debugging driver issues."

The ROCm support situation in 2026 is better than it was in 2024, but still uneven. Newer AMD cards — the RX 7900 XTX (24GB), RX 7900 XT (20GB) — have the best ROCm support and offer competitive VRAM per dollar. llama.cpp and Ollama both support ROCm, and when it works, performance is respectable — maybe 70-85% of equivalent NVIDIA cards.

The catch: when it doesn't work, troubleshooting AMD GPU issues for AI inference is a significantly worse experience than NVIDIA. Fewer community members run AMD, fewer guides exist, fewer bugs get caught and fixed quickly. If you already own an AMD GPU, it's worth trying — Ollama's ROCm support has gotten much more reliable. If you're buying a GPU specifically for local AI, NVIDIA remains the safer bet unless you're comfortable debugging compute stack issues.

CPU-Only Inference: It Works, It's Slow

No GPU at all? You can still run local AI. llama.cpp — and therefore Ollama and LM Studio — supports CPU-only inference. A 7B Q4 model on a modern CPU (Intel 12th gen or AMD Ryzen 5000 series or newer) generates around 3-8 tokens per second [VERIFY]. That's slow enough to be painful for interactive chat but fast enough to be usable for background tasks, batch processing, or just proving to yourself that local AI works before investing in hardware.

CPU inference uses system RAM instead of VRAM, which means you can technically run larger models — a machine with 32GB of RAM can load a 30B Q4 model. It'll generate tokens at walking speed (1-3 tokens/second), but it'll generate them. For people who want to experiment before buying a GPU, this is a valid starting point.

The performance hierarchy is clear: GPU with the model fully in VRAM > GPU with partial offloading > CPU only. There's no trick or optimization that closes these gaps. If you want fast local inference, you need a GPU with enough VRAM. If you just want to try it, your CPU will get you started.
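
That three-tier hierarchy can be expressed as a quick fit check. This is a sketch built on the rule of thumb from earlier in the article (roughly 0.6GB per billion parameters at Q4 plus fixed overhead, an assumed figure, not a measured one); real runtimes decide the exact layer split themselves.

```python
def inference_tier(params_b: float, vram_gb: float, ram_gb: float) -> str:
    """Classify where a Q4 model lands: full GPU, partial offload, or CPU.

    Assumes ~0.6 GB per billion parameters at Q4 plus ~1.5 GB of
    overhead -- a rough rule of thumb, not a measured value.
    """
    model_gb = params_b * 0.6 + 1.5
    if model_gb <= vram_gb:
        return "GPU, model fully in VRAM (fast)"
    if vram_gb > 0 and model_gb <= vram_gb + ram_gb:
        return "GPU with partial offloading (5-10x slower)"
    if model_gb <= ram_gb:
        return "CPU only (walking speed)"
    return "does not fit"

print(inference_tier(7, 12, 32))   # 3060-class box: fully in VRAM
print(inference_tier(30, 12, 32))  # 30B spills: partial offload
print(inference_tier(30, 0, 32))   # no GPU, enough RAM: CPU only
```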

Quantization: The Quality-VRAM Trade-off

Quantization is how you shrink a model to fit in less VRAM. The original model weights are in FP16 (16-bit floating point) or BF16 — a 7B model in FP16 needs about 14GB. Quantization reduces the precision of those weights, making the model smaller at the cost of some quality.

The common quantization levels for GGUF models:

  • Q4_K_M: The default recommendation. Roughly 4 bits per weight. A 7B model drops from ~14GB to ~4.5GB. Quality loss is minimal for most tasks — most users can't tell the difference from FP16 in blind tests. This is the quantization level you should start with.
  • Q5_K_M: Slightly better quality, slightly larger. A 7B model is about 5.5GB. The quality improvement over Q4 is measurable on benchmarks but hard to notice in practice for casual use.
  • Q8_0: Near-original quality. A 7B model is about 7.5GB. Use this if you have the VRAM headroom and want the best quality the model can offer at a given size.
  • Q3_K_M and below: Aggressive compression. Noticeable quality degradation. Use this only if you're trying to fit a model that's too big for your VRAM at Q4 — running a 13B at Q3 in full VRAM is sometimes better than running a 7B at Q8.
  • FP16/BF16: Full precision. Double the VRAM of Q8. Rarely worth it for inference — the quality difference from Q8 is negligible, and you're paying a steep VRAM tax for it.

The practical advice: start with Q4_K_M. If you have spare VRAM, try Q5_K_M or Q8. Don't go below Q4 unless you have to. The biggest quality improvement comes from running a larger model at Q4 rather than a smaller model at Q8 — a 13B Q4 model is smarter than a 7B Q8 model for virtually every task.
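
The size arithmetic behind those bullets can be sketched directly. The bits-per-weight values below are approximate averages (K-quants mix block precisions internally, so real GGUF file sizes vary a little):

```python
# Approximate average bits per weight for common GGUF quantizations.
# K-quants mix block precisions, so real file sizes vary somewhat.
BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0":   8.5,
    "FP16":   16.0,
}

def model_size_gb(params_b: float, quant: str) -> float:
    """Approximate size of the weights alone (no KV cache or overhead)."""
    return params_b * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: roughly {model_size_gb(7, quant):.1f} GB")
```

Running the comparison from the paragraph above: a 13B Q4_K_M is about 7.8GB of weights, slightly larger than a 7B Q8_0 at about 7.4GB, yet it is the smarter model of the two.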

The Honest Budget Guide

$0 — CPU only: Use whatever computer you have. Ollama runs on anything. You'll get 3-8 tokens/second on a 7B model. Enough to test, experiment, and decide if local AI is worth investing in. Not enough for daily use unless you're patient.

$200-350 — Used RTX 3060 12GB: The minimum viable GPU investment. Runs 7B models fast and 13B models adequately. If you have a desktop PC with a free PCIe x16 slot and a power supply with enough wattage (the 3060 draws around 170W; NVIDIA recommends a 550W supply), this is the best dollar-per-capability entry point.

$700-1,000 — Used RTX 3090 24GB: The hobbyist sweet spot. 24GB of VRAM handles everything up to 30B models. Inference is fast. The used market price has dropped significantly as miners and gamers move to newer cards [VERIFY current used pricing]. This is the card you buy when you know local AI is something you'll use regularly.

$1,500-2,000 — RTX 4090 or Mac Mini M4 Pro: Two different philosophies at the same price point. The 4090 gives you the fastest single-card NVIDIA inference. The Mac Mini M4 Pro with 24GB unified memory gives you a silent, power-efficient, always-on inference box that doubles as a general-purpose computer. The Mac has less VRAM-equivalent memory but zero noise and minimal electricity cost. Pick based on whether you value raw speed (4090) or convenience and efficiency (Mac).

$3,000-5,000 — Mac Studio or multi-GPU: The Mac Studio with M4 Max and 64GB or M4 Ultra with 128-192GB unified memory is the easiest path to running 70B models at home. The alternative is a multi-GPU NVIDIA setup, which offers more raw speed but requires more power, cooling, noise management, and troubleshooting. At this price tier, you're running models that compete with cloud APIs on quality.

$5,000+ — Datacenter hardware or cloud: Beyond this point, buying hardware only makes financial sense if you're running inference hours per day, every day, for months. For everyone else, cloud GPU rental is the right answer.

When Renting Makes More Sense Than Buying

Cloud GPU rental — Lambda Labs, Vast.ai, RunPod, and others — charges by the hour for GPU access. An A100 80GB rents for roughly $1-2/hour [VERIFY current pricing]. An H100 is $2-4/hour [VERIFY]. These prices fluctuate, and spot pricing can be significantly cheaper.

The break-even math is straightforward. If you'd spend $2,000 on a 4090 and you rent an equivalent GPU for $0.50/hour, you break even at 4,000 hours of use — roughly 5.5 hours per day for two years. If you're running inference 8 hours a day for work, buying makes sense. If you're running it 2 hours a day for personal projects, renting is cheaper for any GPU you can't justify as a general-purpose purchase.
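
The break-even arithmetic is trivial to parameterize. The prices below are just the example numbers from this paragraph; the optional electricity term accounts for the fact that an owned card still costs money per hour to run, while rental prices include power:

```python
def breakeven_hours(purchase_usd: float, rent_per_hour: float,
                    electricity_per_hour: float = 0.0) -> float:
    """Hours of use at which buying a GPU beats renting an equivalent one.

    Ignores resale value. Owning still costs electricity_per_hour to run,
    so buying pays off when purchase + elec*h < rent*h.
    """
    return purchase_usd / (rent_per_hour - electricity_per_hour)

hours = breakeven_hours(2000, 0.50)  # the $2,000-card example from the text
print(f"break-even: {hours:.0f} hours "
      f"(about {hours / 730:.1f} hours/day over two years)")
```

Factoring in even $0.10/hour of electricity pushes the break-even point from 4,000 to 5,000 hours, which is why the buy-vs-rent decision tilts further toward renting for light users.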

Cloud GPUs also let you access hardware tiers — A100s, H100s, multi-GPU clusters — that don't make sense to buy. Running a 70B model at full speed for a weekend project? Rent an A100 for $20 and return it. That math works out much better than a $10,000 hardware purchase you use occasionally.

The hybrid approach is the most practical for most people: own a modest local setup (Mac with decent unified memory, or a PC with a 3060/3090) for daily use and privacy-sensitive work, and rent cloud GPUs for the occasional large-model run or heavy workload. This gives you the benefits of local inference — privacy, no per-token costs, always available — without requiring a hardware investment that matches your peak demand.

The Verdict

The hardware you need depends on the models you want to run, and the models you want to run depend on what you're trying to do. Here's the decision tree:

If you're experimenting and don't want to spend anything: use your current hardware with Ollama. CPU inference on 7B models is the free on-ramp.

If you want local AI to be a usable daily tool: you need a GPU with at least 12GB of VRAM (RTX 3060) or a Mac with at least 16GB of unified memory (M-series Pro). This is the minimum for a good experience with 7B-13B models.

If you want local AI to compete with cloud API quality: you need 24GB+ of VRAM (RTX 3090/4090) or 32-64GB of unified memory (M-series Max). This gets you into 30B territory, where model quality becomes genuinely impressive.

If you want to run 70B models locally: Mac Studio with 64-128GB unified memory is the path of least resistance. Multi-GPU NVIDIA is the path of most performance. Cloud rental is the path of least cost for occasional use.

The one piece of advice that applies regardless of budget: buy VRAM, not compute speed. A slower GPU with more VRAM will run better models than a faster GPU with less VRAM, and model size is the single biggest determinant of output quality. The 3060 12GB exists for a reason — it's not a fast card, but 12GB of VRAM at that price is unbeatable.


This is part of CustomClanker's Open Source & Local AI series — reality checks on running AI yourself.