Hardware

What hardware runs what models — a quick VRAM guide plus hands-on notes and reviews.

VRAM → Model-size guide

GPU / Platform	Memory	What it runs (4-bit)
RTX 3060 / 4060	8–12 GB	Up to ~8B in 4-bit (Llama-3 8B, Gemma 7B, Phi-3)
RTX 4070 / 4080	12–16 GB	~13B comfortably, ~30B with offloading
RTX 4090 / 3090	24 GB	~30B in 4-bit (Gemma-3 27B, Qwen 32B); 70B needs offload
A100 / H100	40–80 GB	70B in 4-bit on one card; larger with multi-GPU
Apple Silicon (unified)	32–192 GB	Memory is the headroom: 64 GB runs 70B; an M5 Max runs 100B+ on one machine via MLX/Ollama (bandwidth ~550 GB/s, well under an RTX 5090's ~1800 GB/s)

Rule of thumb for 4-bit quantization: ~0.6 GB of memory per billion parameters, plus context overhead.
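
As a rough illustration, the rule of thumb can be turned into a small estimator. This is a sketch, not a measurement: the function names and the 2 GB `context_overhead_gb` default are assumptions made for the example, and real overhead depends on context length, batch size and the runtime.

```python
GB_PER_BILLION_PARAMS_4BIT = 0.6  # rule of thumb from this guide


def estimate_4bit_memory_gb(params_billion, context_overhead_gb=2.0):
    """Approximate memory for a 4-bit model: weights plus context overhead.

    context_overhead_gb is an assumed placeholder; real KV-cache usage
    depends on context length, batch size and the runtime.
    """
    return params_billion * GB_PER_BILLION_PARAMS_4BIT + context_overhead_gb


def fits(params_billion, memory_gb, context_overhead_gb=2.0):
    """True if the estimate fits in the given memory without offloading."""
    return estimate_4bit_memory_gb(params_billion, context_overhead_gb) <= memory_gb


if __name__ == "__main__":
    for params in (8, 13, 30, 70):
        need = estimate_4bit_memory_gb(params)
        print(f"{params}B: ~{need:.1f} GB, fits a 24 GB card: {fits(params, 24)}")
```

With these assumptions a 30B model lands around 20 GB and fits a 24 GB card, while a 70B model lands around 44 GB and does not, which lines up with the table above.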

Notes & Reviews

Apple M5 Max vs RTX 5090 for local AI — capacity vs bandwidth, the same trade-off again

Jul 9, 2026

The recurring split in local-inference hardware shows up cleanly in an M5 Max Mac Studio against a 32GB RTX 5090. The 5090 wins on raw memory bandwidth, roughly 1800 GB/s versus ~550 GB/s on the Mac, so for a model small enough to fit its 32GB, the NVIDIA card generates tokens faster and is the better pick for speed-sensitive work like fine-tuning, image generation and real-time serving. But 32GB is a hard wall: reaching ~70B parameters (about 42GB at 4-bit by the rule of thumb above) means multi-GPU rigs, which get expensive and awkward fast.

Apple's unified memory is the opposite bet. There's no separate VRAM pool, so a single M5 Max machine can hold models north of 100B parameters, the thing the 5090 physically can't do without more cards. It's slower per token, but it fits what won't fit elsewhere, on one quiet, low-power box. For a solo developer or a small team whose constraint is 'will the model even load,' that capacity plus lower running cost and stronger resale usually wins.

Software still tilts NVIDIA's way at scale: CUDA and its library ecosystem remain the default for serious training and fine-tuning. Apple's side has closed a lot of ground for inference: MLX, Ollama and vLLM-MLX make running large open-weight models locally straightforward.

So the honest rule is workload-shaped: pick the 5090 when speed on a model that fits is the point, pick Apple Silicon when the model is too big to fit anything cheaper. It is the same capacity-vs-bandwidth trade-off we've noted on the DGX Spark and the Ryzen AI Max+, just the consumer-desktop version of it.

https://www.geeky-gadgets.com/apple-silicon-vs-rtx-5090/
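
The workload-shaped rule can be sketched as a tiny selection function: among the machines whose memory can hold the 4-bit weights, prefer the one with the most bandwidth. This is a minimal sketch under stated assumptions. The bandwidth figures are the approximate ones quoted in this note, the 128 GB Mac configuration is an assumption chosen for illustration, and `Machine` and `pick_machine` are hypothetical names.

```python
from dataclasses import dataclass

GB_PER_BILLION_PARAMS_4BIT = 0.6  # rule of thumb from the VRAM guide


@dataclass(frozen=True)
class Machine:
    name: str
    memory_gb: float
    bandwidth_gb_s: float


# Bandwidth figures are the approximate ones quoted in this note.
# The 128 GB Mac configuration is an assumption for illustration.
RTX_5090 = Machine("RTX 5090", memory_gb=32, bandwidth_gb_s=1800)
M5_MAX = Machine("M5 Max (assumed 128 GB)", memory_gb=128, bandwidth_gb_s=550)


def pick_machine(params_billion, machines=(RTX_5090, M5_MAX)):
    """Return the highest-bandwidth machine that can hold the 4-bit weights.

    Ignores context overhead, price and software support; returns None
    if the weights fit on none of the machines.
    """
    needed_gb = params_billion * GB_PER_BILLION_PARAMS_4BIT
    candidates = [m for m in machines if m.memory_gb >= needed_gb]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.bandwidth_gb_s)
```

Under these assumptions `pick_machine(30)` returns the 5090, since ~18 GB fits and bandwidth decides, while `pick_machine(100)` returns the Mac, since ~60 GB only fits in unified memory.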

Tags: apple, m5-max, nvidia, rtx-5090, unified-memory, memory-bandwidth, local-inference

AMD Ryzen AI Max+ 395 — 128GB unified memory mini PC runs 235B-parameter models locally

AMD's Ryzen AI Max+ 395 (codenamed Strix Halo) is an APU where the CPU and GPU share a single 128GB LPDDR5X-8000 pool, with no separate VRAM. On Linux, 110GB of that pool is addressable by the GPU; Windows gets up to 96GB via AMD Variable Graphics Memory. That ceiling is what matters: an RTX 5090 caps out at 32GB, a 4090 at 24GB. This chip gives a mini PC more than triple either, enough to load a full 235B-parameter open-weight model from a single unified address space.

The demo model is Qwen3-235B-A22B, a Mixture-of-Experts: only ~22B parameters activate per token, so each token touches only a fraction of the weights, which is why the math works. Loading it requires Q3 quantization (~101GB). Real-world token generation on the GMKtec EVO-X2 (the most widely available implementation) lands around 11 tokens/second: usable for local development, not cloud-competitive. The bandwidth ceiling is the honest constraint: 256 GB/s versus ~1,008 GB/s on a desktop RTX 4090 or ~800 GB/s on an Apple M2 Ultra. Token generation speed scales roughly with memory bandwidth.

AMD's own first-party device (Ryzen AI Halo mini PC) opened pre-orders in June 2026 at $3,999. The GMKtec EVO-X2 uses the same chip, starting around $1,499 (64GB) and reaching about $1,800 (128GB). The "$1,499 kills the $4,000 Nvidia box" headline circulating online conflates the two: Lisa Su's CES 2026 demo used AMD's own unit, while the accessible price point lives in the third-party market.

One common misconception: the chip's 50-TOPS NPU rating does not apply to LLM inference. As of mid-2026, Ollama, llama.cpp, and LM Studio route LLM workloads to the iGPU (Radeon 8060S, 40 RDNA 3.5 CUs), not the NPU. The NPU handles fixed-function tasks like video upscaling and image classification. Don't choose a config based on NPU TOPS for language model work.

Worth considering if you need 70B–235B models running locally for privacy, compliance, or offline use, and you're not running high-concurrency inference where memory bandwidth becomes the hard limit. At 11 tok/s it's a developer and low-traffic deployment tool, not a production serving node.

What actually fits on a 128GB unit (110GB usable on Linux): Qwen3-235B-A22B Q3 (~101GB, ~11 tok/s, MoE) is the headline; Llama 3.3 70B Q8 (~75GB, ~6–8 tok/s) is likely the best quality/speed tradeoff; Qwen2.5 72B Q8, Mistral Large 2 123B Q4 (~73GB), DeepSeek-R1-Distill 70B Q8, Mixtral 8x22B Q4 (~88GB) and Command R+ 104B all fit; anything ≤34B runs fast (15–30 tok/s).

What does not fit: full DeepSeek V3/V4 and R1 (671B MoE, ~168GB even at Q2), Llama 3.1 405B at any useful quant, and the big cluster models (Kimi K2.6, Qwen3.5 397B, GLM-5.1), which need 500GB+.

AMD's CES 2026 stage demo claimed roughly 3× an RTX 5080 on DeepSeek R1. That figure is AMD's own benchmark and has not been independently replicated at scale. Everything else here is corroborated across multiple independent teardowns.

https://www.amd.com/en/products/processors/laptop/ryzen-ai-max/395.html
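
To make the Mixture-of-Experts point concrete, here is a first-order, bandwidth-bound estimate of decode speed. It is a sketch under simplifying assumptions, not a benchmark: it assumes each token reads only the active share of the quantized weights once, that quantization is uniform across the model, and that KV-cache traffic, always-active shared layers and compute time can be ignored. The function name is hypothetical; the inputs are the figures quoted in the note.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, model_size_gb,
                         total_params_billion, active_params_billion):
    """Upper bound on tokens/second if decoding is purely bandwidth-bound.

    Assumes each token reads only the active share of the quantized
    weights once, and ignores KV-cache traffic and compute time.
    """
    gb_read_per_token = model_size_gb * active_params_billion / total_params_billion
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Figures quoted in the note: 256 GB/s, Qwen3-235B-A22B at Q3 (~101 GB).
    moe = decode_ceiling_tok_s(256, 101, 235, 22)
    all_weights = decode_ceiling_tok_s(256, 101, 235, 235)
    print(f"MoE ceiling: ~{moe:.0f} tok/s")
    print(f"If all 235B were read per token: ~{all_weights:.1f} tok/s")
```

Under these assumptions the ceiling comes out near 27 tok/s, so the ~11 tok/s reported sits well inside it, whereas reading all ~101GB for every token would cap out near 2.5 tok/s. That gap is the practical meaning of "only ~22B parameters activate per token."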
