How Much VRAM Do You Need to Run LLMs? A Practical Guide

GPU VRAM is the first constraint you hit when deploying large language models. Not compute, not bandwidth: memory. If the weights and KV cache don't fit in VRAM, the model either fails to load or falls back to much slower CPU offloading.

Use the AI Model VRAM Calculator to get exact requirements for your model and quantization.

The Core Formula

VRAM (GB) = Parameters (B) × bytes_per_param + KV cache + overhead

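Here bytes_per_param depends on the quantization format, and overhead covers activations, the CUDA context, and framework buffers. The table lists weight memory only, before KV cache and overhead: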

Quantization	Bytes/param	7B model	13B model	70B model
FP32	4.0	28 GB	52 GB	280 GB
FP16 / BF16	2.0	14 GB	26 GB	140 GB
INT8	1.0	7 GB	13 GB	70 GB
INT4 / GGUF Q4	0.5	3.5 GB	6.5 GB	35 GB
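
As a minimal sketch, the core formula and the table above can be reproduced in a few lines of Python. The function names and the `kv_cache_gb` / `overhead_gb` parameters are illustrative assumptions, not part of any calculator's API.

```python
# Bytes per parameter for each format, taken from the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billions: float, quant: str) -> float:
    """Weight memory in GB: billions of parameters times bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[quant]


def total_vram_gb(params_billions: float, quant: str,
                  kv_cache_gb: float = 0.0, overhead_gb: float = 0.0) -> float:
    """Core formula: weights + KV cache + overhead, all in GB."""
    return weights_gb(params_billions, quant) + kv_cache_gb + overhead_gb


# Rebuild the table rows for the three model sizes.
for size in (7, 13, 70):
    row = ", ".join(f"{q}: {weights_gb(size, q):g} GB"
                    for q in ("fp32", "fp16", "int8", "int4"))
    print(f"{size}B -> {row}")
```

Passing a KV cache estimate and a few GB of overhead to `total_vram_gb` gives a more realistic figure than the weights alone.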

The KV Cache Problem

The KV (key-value) cache is often the biggest surprise. With long context windows, especially combined with large batches or quantized weights, it can exceed the model weights themselves.

KV cache size scales with:

›Context length: 4K tokens vs 128K tokens is a 32× difference
›Batch size: each concurrent request needs its own KV cache
›Model architecture: more layers and more key-value heads mean more cached data per token

For a 70B model with grouped-query attention (as in Llama 3 70B), serving a single request at 128K context with FP16:

›Model weights: ~140 GB
›KV cache: ~40 GB
›Total: ~180 GB — requiring 3× A100 80GB

At 4K context (the same model):

›KV cache: only ~1.3 GB
›Total: ~141 GB — 2× A100 80GB is enough
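
The KV cache figures above can be checked with a short calculation. This is a sketch under stated assumptions: the architecture values (80 layers, 8 key-value heads, head dimension 128) are those of Llama 3 70B and are not given in the text, and the function name is hypothetical.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: float = 2.0,
                   batch_size: int = 1) -> float:
    """Keys and values (the factor 2) for every layer, KV head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


# Assumed architecture: 80 layers, 8 KV heads (grouped-query attention), head dim 128.
LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

long_ctx_gb = kv_cache_bytes(128 * 1024, **LLAMA_70B) / 1e9
short_ctx_gb = kv_cache_bytes(4 * 1024, **LLAMA_70B) / 1e9

print(f"128K context: {long_ctx_gb:.1f} GB")   # 42.9 GB, i.e. 40 GiB
print(f"4K context:   {short_ctx_gb:.2f} GB")  # 1.34 GB
```

Because the result scales linearly with `batch_size`, eight concurrent 128K requests would need roughly eight times the single-request figure, which is where the cache starts to dwarf the weights.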

Which GPU For Which Model

Model	Quantization	Fits on
Llama 3 8B	FP16	RTX 4090 (24GB)
Llama 3 8B	INT4	RTX 3060 12GB
Llama 3 70B	FP16	2× A100 80GB
Llama 3 70B	INT4	A100 40GB
Mistral 7B	FP16	RTX 4080 16GB
Gemma 27B	INT4	RTX 4090 24GB
CodeLlama 34B	INT4	A10G 24GB
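
A quick way to sanity-check a row of this table is to subtract the weight memory from the available VRAM. The sketch below uses the nominal parameter counts from the model names and the bytes/param values from the core formula; `headroom_gb` is a hypothetical helper, and real checkpoints may be slightly larger than their nominal size.

```python
def headroom_gb(params_billions: float, bytes_per_param: float,
                gpu_gb: float, n_gpus: int = 1) -> float:
    """VRAM left after loading weights; negative means the weights alone do not fit."""
    return gpu_gb * n_gpus - params_billions * bytes_per_param


# (label, params in billions, bytes/param, GB per GPU, GPU count), from the table above.
TABLE = [
    ("Llama 3 8B FP16 on RTX 4090", 8, 2.0, 24, 1),
    ("Llama 3 8B INT4 on RTX 3060", 8, 0.5, 12, 1),
    ("Llama 3 70B FP16 on 2x A100 80GB", 70, 2.0, 80, 2),
    ("Llama 3 70B INT4 on A100 40GB", 70, 0.5, 40, 1),
    ("Mistral 7B FP16 on RTX 4080", 7, 2.0, 16, 1),
    ("Gemma 27B INT4 on RTX 4090", 27, 0.5, 24, 1),
    ("CodeLlama 34B INT4 on A10G", 34, 0.5, 24, 1),
]

for label, params, bpp, gpu, n in TABLE:
    left = headroom_gb(params, bpp, gpu, n)
    print(f"{label}: {left:.1f} GB left for KV cache and overhead")
```

Rows with only a few GB of headroom, such as Mistral 7B FP16 on a 16 GB card, fit the weights but leave little room for long contexts or batching.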

Quantization Quality Tradeoffs

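INT4 quantization cuts weight memory by 4× relative to FP16, with surprisingly small quality degradation for most tasks. GGUF Q4_K_M (the most common Ollama format) is the sweet spot, though it averages slightly more than 4 bits per weight, so files run somewhat larger than the 0.5 bytes/param estimate: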

›Instruction following: minimal degradation
›Code generation: minor degradation (INT8 preferred)
›Math/reasoning: noticeable degradation (FP16 preferred)

For production inference where accuracy matters, use BF16 or INT8. For local experimentation and chat, GGUF Q4 is fine.
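
To see why fewer bits cost accuracy, here is a toy round-trip through symmetric round-to-nearest quantization. It is only an illustration: it uses a single scale for all values and synthetic Gaussian "weights", which is an assumption and not how GGUF K-quants or production INT8 schemes actually work.

```python
import random


def quantize_roundtrip(values, bits):
    """Quantize to signed integers with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic stand-in for a weight tensor (hypothetical data).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

err8 = mean_abs_error(weights, quantize_roundtrip(weights, 8))
err4 = mean_abs_error(weights, quantize_roundtrip(weights, 4))
print(f"INT8 mean abs error: {err8:.6f}")
print(f"INT4 mean abs error: {err4:.6f}")
```

The 4-bit grid has far fewer levels, so its rounding error is much larger; real formats reduce this with per-block scales, which is part of why Q4_K_M holds up as well as it does.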

Cloud GPU Costs

Once you know your VRAM requirement, the GPU Hosting Cost Calculator shows you what it costs to rent that GPU across RunPod, Lambda Labs, and Vast.ai.

Rough guide:

›RTX 4090 (24GB): ~$0.44/hr on RunPod, good for 7B FP16 or 13B INT4
›A100 40GB: ~$1.89/hr, good for 70B INT4
›A100 80GB: ~$2.49/hr, good for 70B INT4 with headroom for long context and batching (70B FP16 needs two of these)
›H100 80GB: ~$3.99/hr, fastest inference for production serving
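
Hourly rates are easier to compare as a monthly bill. The sketch below uses the rough rates quoted above; the 730-hour month and the `utilization` parameter (fraction of the month the GPU is actually rented) are assumptions, and real prices vary by provider and over time.

```python
# Rough hourly rates in USD, copied from the list above.
HOURLY_USD = {
    "RTX 4090 24GB": 0.44,
    "A100 40GB": 1.89,
    "A100 80GB": 2.49,
    "H100 80GB": 3.99,
}
HOURS_PER_MONTH = 730  # average month: 24 * 365 / 12


def monthly_cost(gpu: str, n_gpus: int = 1, utilization: float = 1.0) -> float:
    """Estimated monthly rental cost in USD for n_gpus of the given type."""
    return HOURLY_USD[gpu] * n_gpus * HOURS_PER_MONTH * utilization


# 70B FP16 on two A100 80GB cards versus 70B INT4 on one A100 40GB.
print(f"2x A100 80GB, always on: ${monthly_cost('A100 80GB', 2):,.2f}")
print(f"1x A100 40GB, always on: ${monthly_cost('A100 40GB'):,.2f}")
```

On these rates, dropping from FP16 on two 80 GB cards to INT4 on one 40 GB card cuts the always-on bill by more than half.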

Running Multiple Models

If you're serving multiple models simultaneously (model routing), add their VRAM requirements together, including a KV cache budget for each. For models too large for a single GPU, vLLM supports tensor parallelism, which shards a model across multiple GPUs.

For a production inference server handling mixed traffic (7B and 70B models):

›1× A100 80GB: can hold 70B INT4 + 7B FP16 simultaneously
›2× A100 40GB: better for redundancy, fits 70B INT4 with room for batching
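
The mixed-traffic case above comes down to adding up weights and seeing what is left. A minimal sketch, where `remaining_vram_gb` and the per-model `kv_budget_gb` are hypothetical names introduced for illustration:

```python
def remaining_vram_gb(gpu_gb: float, models, kv_budget_gb: float = 0.0) -> float:
    """VRAM left after loading every model.

    `models` is a list of (params_billions, bytes_per_param) pairs;
    `kv_budget_gb` is reserved once per model for its KV cache.
    """
    weights = sum(params * bpp for params, bpp in models)
    return gpu_gb - weights - kv_budget_gb * len(models)


# 70B INT4 plus 7B FP16 on a single A100 80GB.
mixed = [(70, 0.5), (7, 2.0)]
left = remaining_vram_gb(80, mixed)
print(f"Weights use {80 - left:g} GB, leaving {left:g} GB")  # 49 GB used, 31 GB left
```

The 31 GB that remains has to cover both models' KV caches and runtime overhead, so the practical batch size depends on how that budget is split.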

Optimizing Inference

Beyond quantization, several tools improve inference efficiency:

›vLLM: PagedAttention for KV cache management, with a claimed 2–5× throughput improvement over serving without paged KV memory
›llama.cpp: CPU+GPU mixed inference (offload some layers to system RAM when the model exceeds VRAM)
›TGI (Text Generation Inference): Hugging Face's production server
›Ollama: simplest setup for running GGUF models locally
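
For the llama.cpp mixed-inference case, a rough way to pick how many layers to keep on the GPU (the value passed to its `--n-gpu-layers` option) is to divide usable VRAM by the per-layer weight size. This sketch assumes all layers are the same size and reserves a fixed amount for KV cache and overhead; `gpu_layers` and `reserve_gb` are hypothetical names, and the 80-layer count is an assumption for a 70B model.

```python
import math


def gpu_layers(weights_gb: float, n_layers: int, gpu_gb: float,
               reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a fixed reserve."""
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(gpu_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# 70B INT4 (~35 GB of weights, assumed 80 layers) on different cards.
for gpu_gb in (12, 24, 80):
    n = gpu_layers(35, 80, gpu_gb)
    print(f"{gpu_gb} GB GPU: keep {n} of 80 layers on the GPU")
```

Layers that do not fit stay in system RAM and run on the CPU, so throughput drops as the offloaded share grows.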