What hardware runs what models — a quick VRAM guide plus hands-on notes and reviews.
| GPU / Platform | Memory | What it runs (4-bit) |
|---|---|---|
| RTX 3060 / 4060 | 8–12 GB | Up to ~8B in 4-bit (Llama-3 8B, Gemma 7B, Phi-3) |
| RTX 4070 / 4080 | 12–16 GB | ~13B comfortably, ~30B with offloading |
| RTX 4090 / 3090 | 24 GB | ~30B in 4-bit (Gemma-3 27B, Qwen 32B); 70B needs offload |
| A100 / H100 | 40–80 GB | 70B in 4-bit on one 80 GB card (too tight for 40 GB without offload); larger with multi-GPU |
| Apple Silicon (unified) | 32–192 GB | More memory means more headroom: 64 GB runs 70B in 4-bit via MLX/Ollama |

Rule of thumb for 4-bit quantization: ~0.6 GB of memory per billion parameters, plus context (KV cache) overhead. A 70B model, for example, needs roughly 42 GB before any context is allocated.
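
The rule of thumb above can be turned into a quick fit check. This is a minimal sketch, not a precise sizing tool: the function names and the 2 GB default for context overhead are assumptions made for illustration, and real overhead depends on context length and runtime.

```python
# Rough 4-bit memory estimate based on the ~0.6 GB per billion parameters rule.
GB_PER_BILLION_PARAMS_4BIT = 0.6  # rule of thumb from this guide


def estimate_4bit_memory_gb(params_billions: float,
                            context_overhead_gb: float = 0.0) -> float:
    """Approximate memory (GB) needed to hold a 4-bit quantized model."""
    return params_billions * GB_PER_BILLION_PARAMS_4BIT + context_overhead_gb


def fits(params_billions: float, memory_gb: float,
         context_overhead_gb: float = 2.0) -> bool:
    """True if the estimate fits in memory_gb.

    The 2 GB default overhead is a placeholder assumption, not a measurement.
    """
    needed = estimate_4bit_memory_gb(params_billions, context_overhead_gb)
    return needed <= memory_gb


if __name__ == "__main__":
    for size_b, mem_gb in [(8, 8), (32, 24), (70, 24), (70, 64)]:
        need = estimate_4bit_memory_gb(size_b)
        verdict = "fits" if fits(size_b, mem_gb) else "needs offload or more memory"
        print(f"{size_b}B in 4-bit: ~{need:.1f} GB weights, {mem_gb} GB available: {verdict}")
```

Treat the result as a first filter only. A model that "fits" with a gigabyte to spare may still run out of memory at long context lengths.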

The DGX Spark is NVIDIA's desktop-class local-inference machine: a GB10 Grace Blackwell superchip with 128 GB of unified CPU/GPU memory, running an Ubuntu-based DGX OS. Pricing starts around $3,000 for the base unit and ~$4,000 for the Founders Edition, with OEM configs (e.g. Dell) closer to $4,750.

The spec that actually governs its behavior is memory bandwidth: ~273 GB/s. Token generation is bandwidth-bound, and 273 GB/s sits only marginally above a high-end consumer integrated chip and roughly an order of magnitude below a datacenter GPU (an H100 is ~3.35 TB/s). So the Spark is not where you go for fast tokens. Its real advantage is that the 128 GB of unified memory lets you hold models that won't fit on a typical discrete GPU. It's a capacity play, not a throughput play.

NVFP4 is the precision that makes that capacity usable, and NVFP4 requires the Blackwell generation. Rough capacity: one Spark handles local inference up to ~200B parameters; two units linked over the built-in ConnectX interconnect reach ~405B. A post-launch software update (TensorRT-LLM optimizations plus speculative decoding) claimed up to ~2.5x throughput gains over the launch firmware.

Reach for it for development and prototyping against large open-weight models in a CUDA-native environment, or for privacy-bound local inference where fitting the model matters more than raw speed. Skip it for anything latency-sensitive at production scale: a single datacenter GPU's bandwidth and batching win decisively there.

https://www.nvidia.com/en-us/products/workstations/dgx-spark/
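
The bandwidth argument in the DGX Spark notes can be put into rough numbers. The sketch below assumes a dense model decoded at batch size 1, where every generated token requires reading all weights from memory once, so bandwidth divided by model size gives an upper bound on tokens per second. It ignores KV cache reads, compute limits, batching, and speculative decoding, and the function name is hypothetical. The 42 GB figure is a 70B model at ~0.6 GB per billion parameters.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Assumption: each token reads every weight once, so the ceiling is
    memory bandwidth divided by the size of the weights in memory.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    model_gb = 42.0  # 70B parameters at ~0.6 GB per billion (4-bit rule of thumb)
    platforms = {
        "DGX Spark (~273 GB/s)": 273.0,
        "H100 (~3.35 TB/s)": 3350.0,
    }
    for name, bandwidth in platforms.items():
        ceiling = max_decode_tokens_per_s(bandwidth, model_gb)
        print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

Real throughput lands below these ceilings, but the ratio between the two platforms (about 12x) is what the "capacity play, not a throughput play" framing rests on.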

Unified memory lets M-series Macs with 64 GB or more run 70B models in 4-bit comfortably via Ollama or MLX. Generation is slower than on a 4090, but the memory headroom is far larger.

A 24 GB card (RTX 4090 / 3090) runs most ~30B models in 4-bit quantization (gemma-3-27b, qwen-32b), which makes it the sweet spot for local inference. 70B needs offloading or two cards.