K8sCalc
ai-gpu · 12 May 2026

How Much VRAM Do You Need to Run LLMs? A Practical Guide

Calculate exactly how much GPU memory you need to run Llama 3, Mistral, Gemma, and other LLMs at different quantization levels. FP16, INT4, GGUF explained.

GPU VRAM is the #1 constraint when deploying large language models. Not compute, not bandwidth: memory. If your model doesn't fit in VRAM, it either fails to load or falls back to much slower CPU offloading.

Use the AI Model VRAM Calculator to get exact requirements for your model and quantization.

The Core Formula

VRAM (GB) = Parameters (B) × bytes_per_param + KV cache + overhead

Where bytes_per_param depends on quantization:

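| Quantization   | Bytes/param | 7B model | 13B model | 70B model |
|----------------|-------------|----------|-----------|-----------|
| FP32           | 4.0         | 28 GB    | 52 GB     | 280 GB    |
| FP16 / BF16    | 2.0         | 14 GB    | 26 GB     | 140 GB    |
| INT8           | 1.0         | 7 GB     | 13 GB     | 70 GB     |
| INT4 / GGUF Q4 | 0.5         | 3.5 GB   | 6.5 GB    | 35 GB     |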
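As a rough illustration, the formula and table above can be expressed in a few lines of Python. This is a minimal sketch: the function names and the `BYTES_PER_PARAM` dictionary are made up for this example, and 1 GB is taken as 10^9 bytes so that billions of parameters map directly to gigabytes.

```python
# Bytes per parameter for each quantization level, taken from the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billions: float, quant: str) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes); no KV cache or overhead."""
    return params_billions * BYTES_PER_PARAM[quant]


def total_vram_gb(params_billions: float, quant: str,
                  kv_cache_gb: float = 0.0, overhead_gb: float = 0.0) -> float:
    """VRAM = weights + KV cache + overhead, as in the core formula."""
    return weight_vram_gb(params_billions, quant) + kv_cache_gb + overhead_gb


if __name__ == "__main__":
    for size in (7, 13, 70):
        row = {q: weight_vram_gb(size, q) for q in ("fp32", "fp16", "int8", "int4")}
        print(f"{size}B:", row)
```

The table values are weights only, so pass your own KV cache and overhead estimates to `total_vram_gb` for a full budget.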

The KV Cache Problem

The KV (key-value) cache is often the biggest surprise. For long context windows, it can exceed the model weights themselves.

KV cache size scales with:

  • Context length — 4K tokens vs 128K tokens is a 32× difference
  • Batch size — each concurrent request needs its own KV cache
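  • Model architecture: the cache grows with the number of layers, the number of KV heads, and the head dimension, so larger models generally store more per token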

For a 70B model at 128K context with FP16:

  • Model weights: ~140 GB
  • KV cache: ~40 GB
  • Total: ~180 GB — requiring 3× A100 80GB

At 4K context (the same model):

  • KV cache: only ~1.3 GB
  • Total: ~141 GB — 2× A100 80GB is enough
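The figures above can be roughly reproduced with the standard per-token KV cache estimate: two tensors (keys and values) per layer, each of size KV heads × head dimension. The sketch below assumes a Llama-3-70B-like shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128). Those shape values are an assumption for illustration and are not stated in this article, so check your model's config before relying on them.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes.

    The leading 2 accounts for storing one key and one value tensor per layer.
    bytes_per_value=2 corresponds to an FP16/BF16 cache.
    """
    return (2 * n_layers * n_kv_heads * head_dim * bytes_per_value
            * context_tokens * batch_size)


# Assumed shape for a 70B model with grouped-query attention (not from the article).
ASSUMED_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

kv_128k = kv_cache_bytes(**ASSUMED_70B, context_tokens=128 * 1024)
kv_4k = kv_cache_bytes(**ASSUMED_70B, context_tokens=4 * 1024)

print(kv_128k / 2**30)  # 40.0  (GiB)
print(kv_4k / 2**30)    # 1.25  (GiB)
```

Under these assumptions the 128K cache is exactly 32× the 4K cache, and a batch of N concurrent requests at full context multiplies the result by N.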

Which GPU For Which Model

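| Model         | Quantization | Fits on         |
|---------------|--------------|-----------------|
| Llama 3 8B    | FP16         | RTX 4090 (24GB) |
| Llama 3 8B    | INT4         | RTX 3060 12GB   |
| Llama 3 70B   | FP16         | 2× A100 80GB    |
| Llama 3 70B   | INT4         | A100 40GB       |
| Mistral 7B    | FP16         | RTX 4080 16GB   |
| Gemma 27B     | INT4         | RTX 4090 24GB   |
| CodeLlama 34B | INT4         | A10G 24GB       |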
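To sanity-check a pairing like the ones in the table, one option is to divide the total requirement by the usable memory per card. The helper below is a hypothetical sketch: `usable_fraction` is an assumed headroom factor (here 90%) for the runtime's own allocations, not a value taken from any particular framework.

```python
import math


def gpus_needed(required_gb: float, gpu_gb: float,
                usable_fraction: float = 0.9) -> int:
    """Smallest number of identical GPUs whose usable memory covers required_gb.

    usable_fraction is an assumed safety margin; tune it for your runtime.
    """
    usable_per_gpu = gpu_gb * usable_fraction
    return math.ceil(required_gb / usable_per_gpu)


if __name__ == "__main__":
    print(gpus_needed(140, 80))  # 70B FP16 weights on A100 80GB cards
    print(gpus_needed(35, 40))   # 70B INT4 weights on an A100 40GB
    print(gpus_needed(16, 24))   # 8B FP16 weights on an RTX 4090
```

This only counts memory. Splitting a model across cards also requires a runtime that supports sharding, and some pairings in the table leave very little room for KV cache.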

Quantization Quality Tradeoffs

INT4 quantization cuts weight memory by 4× relative to FP16, with surprisingly small quality degradation for most tasks. GGUF Q4_K_M (a common format for Ollama models) is the sweet spot:

  • Instruction following: minimal degradation
  • Code generation: minor degradation (INT8 preferred)
  • Math/reasoning: noticeable degradation (FP16 preferred)

For production inference where accuracy matters, use BF16 or INT8. For local experimentation and chat, GGUF Q4 is fine.

Cloud GPU Costs

Once you know your VRAM requirement, the GPU Hosting Cost Calculator shows you what it costs to rent that GPU across RunPod, Lambda Labs, and Vast.ai.

Rough guide:

  • RTX 4090 (24GB): ~$0.44/hr on RunPod — good for 7B FP16 or 13B INT4
  • A100 40GB: ~$1.89/hr — good for 70B INT4
  • A100 80GB: ~$2.49/hr, good for 70B INT4 with headroom for long context and batching; 70B FP16 (~140 GB) needs two of them
  • H100 80GB: ~$3.99/hr — fastest inference for production serving
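To turn the hourly rates above into a budget figure, multiply by GPU count and running hours. The snippet below is a small sketch that uses the rough prices from this list; real prices vary by provider and change often, and the function and dictionary names are illustrative.

```python
# Approximate hourly rates from the rough guide above (USD per GPU-hour).
HOURLY_USD = {
    "RTX 4090 24GB": 0.44,
    "A100 40GB": 1.89,
    "A100 80GB": 2.49,
    "H100 80GB": 3.99,
}


def monthly_cost_usd(gpu: str, count: int = 1,
                     hours_per_day: float = 24.0, days: int = 30) -> float:
    """Rental cost for `count` GPUs running `hours_per_day` for `days` days."""
    return HOURLY_USD[gpu] * count * hours_per_day * days


if __name__ == "__main__":
    # 70B FP16 on 2x A100 80GB, running around the clock for 30 days.
    print(f"{monthly_cost_usd('A100 80GB', count=2):.2f}")  # 3585.60
    # 70B INT4 on a single A100 40GB, same schedule.
    print(f"{monthly_cost_usd('A100 40GB'):.2f}")           # 1360.80
```

Comparing the two lines shows how much quantization can change the monthly bill for the same model size.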

Running Multiple Models

If you're serving multiple models simultaneously (model routing), add their VRAM requirements together, including each model's KV cache. For models too large for a single GPU, vLLM supports tensor parallelism, which shards one model across multiple GPUs.

For a production inference server handling mixed traffic (7B and 70B models):

  • 1× A100 80GB: can hold 70B INT4 + 7B FP16 simultaneously
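  • 2× A100 40GB: fits 70B INT4 sharded across both cards with room for batching, or one replica per card for redundancy, at the cost of only ~5 GB of KV cache headroom per card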
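The "add requirements together" rule is easy to check in code. This sketch sums weights-only sizes for models sharing one card and reports what is left for KV cache and runtime overhead; `placement_report` is a hypothetical helper, and the sizes come from the quantization table earlier in this guide.

```python
def placement_report(gpu_gb: float, model_weights_gb: dict[str, float]) -> dict:
    """Sum the weights of co-located models and report remaining VRAM."""
    used = sum(model_weights_gb.values())
    free = gpu_gb - used
    return {"used_gb": used, "free_gb": free, "fits": free >= 0}


# 70B INT4 (~35 GB) plus 7B FP16 (~14 GB) on a single A100 80GB.
report = placement_report(80, {"70B INT4": 35.0, "7B FP16": 14.0})
print(report)  # {'used_gb': 49.0, 'free_gb': 31.0, 'fits': True}
```

The remaining ~31 GB has to cover the KV caches of both models plus framework overhead, so the practical limit on concurrent requests depends on context length.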

Optimizing Inference

Beyond quantization, these tools improve inference efficiency:

  • vLLM — PagedAttention for KV cache management, 2–5× throughput improvement
  • llama.cpp — CPU+GPU mixed inference (offload some layers to RAM)
  • TGI (Text Generation Inference) — Hugging Face's production server
  • Ollama — simplest setup for GGUF models locally
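For the llama.cpp mixed CPU+GPU approach, a back-of-the-envelope way to pick how many layers to offload is to divide the VRAM budget by an approximate per-layer size. The sketch below assumes all layers are the same size and ignores embeddings and KV cache beyond a fixed reserve, so treat the result as a starting point for llama.cpp's `--n-gpu-layers` (`-ngl`) option rather than an exact answer. The function name and `reserve_gb` parameter are illustrative.

```python
def layers_that_fit(model_gb: float, n_layers: int, free_vram_gb: float,
                    reserve_gb: float = 1.0) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes equal-size layers; reserve_gb is held back for KV cache and
    runtime buffers (an assumed margin, adjust for your context length).
    """
    per_layer_gb = model_gb / n_layers
    budget_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(budget_gb // per_layer_gb))


if __name__ == "__main__":
    # 70B at INT4 (~35 GB), assumed 80 layers, on a 24 GB card with 2 GB reserved.
    print(layers_that_fit(35, 80, 24, reserve_gb=2))  # 50
```

In this example roughly 50 of the assumed 80 layers would sit on the GPU, with the rest served from system RAM at lower speed.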