K8sCalc

AI Model VRAM Calculator

Calculate the GPU VRAM required to run any LLM locally or on a cloud GPU. Supports all quantization levels: FP32, FP16, INT8, INT4, and GGUF variants.

GPU VRAM Requirements for LLM Inference

Running large language models requires GPUs with enough VRAM to hold the model weights, the KV cache, and runtime overhead. VRAM is usually the limiting constraint when deploying LLMs.

VRAM Formula

Total VRAM = Model weights + KV cache + Runtime overhead

Model weights (GB) = Parameters (B) × bytes per parameter

  • FP32: 4 bytes → 7B model = 28 GB
  • FP16/BF16: 2 bytes → 7B model = 14 GB
  • INT8: 1 byte → 7B model = 7 GB
  • INT4/GGUF-Q4: 0.5 bytes → 7B model = 3.5 GB
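The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the names `BYTES_PER_PARAM`, `weight_gb`, and `total_vram_gb` are hypothetical, and 1 GB is taken as 10^9 bytes, which matches the figures in the list.

```python
# Bytes per parameter for each precision, as listed above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weights-only size in GB: parameters (in billions) x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision.lower()]


def total_vram_gb(params_billion: float, precision: str,
                  kv_cache_gb: float = 0.0, overhead_gb: float = 0.0) -> float:
    """Total VRAM = model weights + KV cache + runtime overhead."""
    return weight_gb(params_billion, precision) + kv_cache_gb + overhead_gb


if __name__ == "__main__":
    for precision in ("fp32", "fp16", "int8", "int4"):
        print(f"7B @ {precision}: {weight_gb(7, precision):.1f} GB")
```

The KV cache and overhead terms are left as inputs here because they depend on the model architecture, context length, and serving stack rather than on parameter count alone.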

GPU VRAM Reference

GPU            VRAM    Max Model (FP16)   Max Model (INT4)
RTX 4060 Ti    16 GB   ~7B                ~30B
RTX 4090       24 GB   ~12B               ~46B
A100 40 GB     40 GB   ~20B               ~70B
A100 80 GB     80 GB   ~40B               ~140B
H100 80 GB     80 GB   ~40B               ~140B
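The "max model" figures in the reference table can be approximated by inverting the weights formula: divide the available VRAM by the bytes per parameter. The sketch below assumes a hypothetical `reserve_gb` parameter for memory set aside for KV cache and overhead; the table's rounded values appear to leave a small margin in some rows, so exact agreement is not expected everywhere.

```python
def max_params_billion(vram_gb: float, bytes_per_param: float,
                       reserve_gb: float = 0.0) -> float:
    """Largest model (in billions of parameters) whose weights fit in VRAM.

    reserve_gb is an assumed allowance for KV cache and runtime overhead;
    with the default of 0 the result is a weights-only upper bound.
    """
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return usable_gb / bytes_per_param


if __name__ == "__main__":
    # Weights-only bound for an 80 GB card at FP16 (2 bytes per parameter).
    print(max_params_billion(80, 2.0))
    # 24 GB card at INT4 (0.5 bytes per parameter) with 1 GB held back.
    print(max_params_billion(24, 0.5, reserve_gb=1.0))
```

In practice the reserve should grow with context length and batch size, so the real ceiling is lower than the weights-only bound.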

KV Cache Scaling

KV cache grows linearly with both context length and batch size. At 128K context, a 70B model needs 40+ GB for the KV cache of a single sequence. This is why inference servers like vLLM use PagedAttention, which allocates the KV cache in fixed-size blocks to reduce fragmentation and wasted memory.
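A common way to estimate KV cache size is to count one key and one value vector per layer, per KV head, per token. The sketch below uses that estimate; the function name and the architecture numbers (80 layers, 8 KV heads, head dimension 128, FP16 cache) are assumptions chosen to resemble a 70B model with grouped-query attention, not values taken from this page.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimated KV cache size in bytes.

    The leading factor of 2 accounts for storing both a key and a value
    tensor in every layer. bytes_per_value=2 assumes an FP16/BF16 cache.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


if __name__ == "__main__":
    # Assumed 70B-class configuration with grouped-query attention.
    size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                          context_tokens=128 * 1024)
    print(f"{size / 1e9:.1f} GB")  # about 42.9 GB for one sequence
```

Doubling either the context length or the batch size doubles the result, which is the linear growth described above.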

Recommended Deployment Paths

  • ≤8 GB VRAM: GGUF Q4 models via Ollama (7B class, personal use)
  • 16–24 GB: 7B FP16, 13B INT8 models, small production deployments
  • 40–80 GB: 70B INT4, 34B FP16, serious inference workloads
  • Multi-GPU: 70B+ models in FP16, distributed inference

Frequently Asked Questions

How much VRAM does Llama 3 70B need?

At FP16, Llama 3 70B requires ~140 GB of VRAM for weights, which means 2× A100 80 GB, 2× H100 80 GB, or 4× A100 40 GB. At INT4 quantization (GGUF Q4), it fits in ~35–40 GB, so 2× RTX 4090 is workable; a single A100 40 GB also works but leaves little room for the KV cache.
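GPU counts like these follow from dividing the required memory by the per-GPU VRAM and rounding up. The sketch below is a rough estimate only: `gpus_needed` and its `usable_fraction` parameter are hypothetical, and the calculation ignores KV cache and any per-GPU overhead from distributed inference unless those are folded into `required_gb`.

```python
import math


def gpus_needed(required_gb: float, gpu_vram_gb: float,
                usable_fraction: float = 1.0) -> int:
    """Minimum number of identical GPUs whose combined VRAM covers required_gb.

    usable_fraction is an assumed share of each GPU's VRAM that is actually
    available to the model (1.0 means all of it).
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return math.ceil(required_gb / (gpu_vram_gb * usable_fraction))


if __name__ == "__main__":
    print(gpus_needed(140, 80))  # 70B at FP16 on 80 GB cards
    print(gpus_needed(140, 40))  # 70B at FP16 on 40 GB cards
    print(gpus_needed(40, 24))   # 70B at INT4 on 24 GB cards
```

Setting `usable_fraction` below 1.0 is a simple way to leave headroom; for example, 0.8 on 80 GB cards raises the FP16 estimate from two GPUs to three.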

What is the minimum GPU to run Llama 3 8B?

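Llama 3 8B at FP16 requires ~16 GB of VRAM for weights alone, so it fits comfortably on a 24 GB card such as the RTX 3090 or RTX 4090; 16 GB cards such as the RTX 4080 or A4000 leave no room for the KV cache. At INT4/GGUF Q4, it requires only ~4–5 GB and runs on consumer GPUs like the RTX 3060 12 GB.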

How does quantization affect VRAM requirements?

FP16 uses 2 bytes per parameter. INT8 uses 1 byte (2× reduction). INT4 uses 0.5 bytes (4× reduction). A 7B model needs ~14 GB at FP16 but only ~3.5 GB at INT4, at the cost of slight accuracy degradation.

How does context length affect VRAM?

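Longer context windows increase KV (key-value) cache memory, which grows linearly with the number of tokens. For a 7B model, expect roughly ~1 GB of KV cache at 4K context, ~8 GB at 32K, and ~30+ GB at 128K; exact figures depend on the architecture (layers, KV heads, head dimension) and the cache precision. At long contexts the KV cache can exceed the size of the quantized weights themselves.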
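Because the growth is linear in tokens, a known data point can be scaled to other context lengths. The sketch below takes the ~1 GB at 4K figure from this answer as its baseline; the function name and parameters are hypothetical, and the result is only as good as the baseline for a given model.

```python
def scale_kv_cache_gb(context_tokens: int, base_gb: float = 1.0,
                      base_context: int = 4096, batch_size: int = 1) -> float:
    """Scale a known KV cache size linearly to another context length.

    base_gb at base_context is the reference point (here ~1 GB at 4K,
    the 7B figure quoted above); batch_size multiplies the result.
    """
    return base_gb * (context_tokens / base_context) * batch_size


if __name__ == "__main__":
    for ctx in (4096, 32768, 131072):
        print(f"{ctx // 1024}K context: ~{scale_kv_cache_gb(ctx):.0f} GB")
```

With this baseline the 32K and 128K estimates come out at 8 GB and 32 GB, in line with the figures quoted in the answer.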
