GPU VRAM is the first constraint you hit when deploying large language models. Compute and bandwidth determine how fast a model runs; memory determines whether it loads at all. If your model doesn't fit in VRAM, it doesn't run on that GPU.
Use the AI Model VRAM Calculator to get exact requirements for your model and quantization.
The Core Formula
VRAM (GB) = Parameters (B) × bytes_per_param + KV cache + overhead
Where bytes_per_param depends on quantization. The table lists weight memory only, before KV cache and overhead:
| Quantization | Bytes/param | 7B model | 13B model | 70B model |
|---|---|---|---|---|
| FP32 | 4.0 | 28 GB | 52 GB | 280 GB |
| FP16 / BF16 | 2.0 | 14 GB | 26 GB | 140 GB |
| INT8 | 1.0 | 7 GB | 13 GB | 70 GB |
| INT4 / GGUF Q4 | 0.5 | 3.5 GB | 6.5 GB | 35 GB |
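The core formula and the table above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's implementation: the names `BYTES_PER_PARAM`, `weights_gb`, and `total_vram_gb` are hypothetical, and the KV cache and overhead terms are passed in as plain numbers rather than derived.

```python
# Bytes per parameter for each format, copied from the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billions: float, quant: str) -> float:
    """Weight memory in GB: billions of parameters times bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[quant.lower()]


def total_vram_gb(params_billions: float, quant: str,
                  kv_cache_gb: float = 0.0, overhead_gb: float = 0.0) -> float:
    """Core formula: weights + KV cache + overhead, all in GB."""
    return weights_gb(params_billions, quant) + kv_cache_gb + overhead_gb


if __name__ == "__main__":
    # Reproduce the table rows for 7B, 13B, and 70B models.
    for size in (7, 13, 70):
        row = [weights_gb(size, q) for q in ("fp32", "fp16", "int8", "int4")]
        print(f"{size}B:", row)
```

Real checkpoints rarely have exactly 7 or 70 billion parameters, so treat the results as estimates.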
The KV Cache Problem
The KV (key-value) cache is often the biggest surprise. For long context windows, it can exceed the model weights themselves.
KV cache size scales with:
- Context length: 4K tokens vs 128K tokens is a 32× difference
- Batch size: each concurrent request needs its own KV cache
- Model shape: more layers, more key-value heads, and larger head dimensions all store more per token
For a 70B model at 128K context in FP16 (one sequence, assuming grouped-query attention as in Llama 3 70B):
- Model weights: ~140 GB
- KV cache: ~40 GB
- Total: ~180 GB, which needs at least 3× A100 80GB by capacity
At 4K context (the same model):
- KV cache: only ~1.3 GB
- Total: ~141 GB, so 2× A100 80GB is enough
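The KV cache figures above follow from a simple per-token calculation. The sketch below assumes a Llama-3-70B-like shape (80 layers, 8 key-value heads with grouped-query attention, head dimension 128) and FP16 cache values. The function name `kv_cache_bytes` is hypothetical, and real servers may allocate the cache differently.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer.

    The leading 2 accounts for storing both a key and a value per token.
    bytes_per_value=2 corresponds to an FP16 or BF16 cache.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


# Assumed Llama-3-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128.
LLAMA3_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

GIB = 1024 ** 3
long_ctx = kv_cache_bytes(context_tokens=128 * 1024, **LLAMA3_70B)
short_ctx = kv_cache_bytes(context_tokens=4 * 1024, **LLAMA3_70B)

print(f"128K context: {long_ctx / GIB:.2f} GiB")
print(f"4K context:   {short_ctx / GIB:.2f} GiB")
```

Under these assumptions the script reports 40.00 GiB and 1.25 GiB, which is about 43 GB and 1.3 GB in decimal units. A model without grouped-query attention would cache every attention head, so its KV cache would be several times larger at the same context length.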
Which GPU For Which Model
| Model | Quantization | Fits on |
|---|---|---|
| Llama 3 8B | FP16 | RTX 4090 (24GB) |
| Llama 3 8B | INT4 | RTX 3060 12GB |
| Llama 3 70B | FP16 | 2× A100 80GB |
| Llama 3 70B | INT4 | A100 40GB |
| Mistral 7B | FP16 | RTX 4080 16GB |
| Gemma 27B | INT4 | RTX 4090 24GB |
| CodeLlama 34B | INT4 | A10G 24GB |
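A quick way to sanity-check rows like these is to compare weight memory plus some headroom against the card's VRAM. The sketch below is a rough check under stated assumptions: the `fits` helper is hypothetical, the 10% headroom is an assumed allowance for KV cache and runtime overhead rather than a measured figure, and parameter counts are rounded to the model names.

```python
def fits(params_billions, bytes_per_param, gpu_gb, n_gpus=1, headroom=0.10):
    """True if weights plus a fractional headroom fit in total VRAM.

    headroom is an assumed allowance for KV cache and runtime overhead.
    """
    needed_gb = params_billions * bytes_per_param * (1 + headroom)
    return needed_gb <= gpu_gb * n_gpus


# (label, params in billions, bytes per param, VRAM per GPU in GB, GPU count)
rows = [
    ("Llama 3 8B FP16 on RTX 4090", 8, 2.0, 24, 1),
    ("Llama 3 70B FP16 on 2x A100 80GB", 70, 2.0, 80, 2),
    ("Llama 3 70B INT4 on A100 40GB", 70, 0.5, 40, 1),
    ("Mistral 7B FP16 on RTX 4080 16GB", 7, 2.0, 16, 1),
    ("CodeLlama 34B INT4 on A10G 24GB", 34, 0.5, 24, 1),
]

for label, params, bpp, gpu_gb, n_gpus in rows:
    verdict = "fits" if fits(params, bpp, gpu_gb, n_gpus) else "does not fit"
    print(f"{label}: {verdict}")
```

Tight pairings such as a 7B FP16 model on a 16 GB card pass this check with little to spare, so long contexts or larger batches can still run out of memory.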
Quantization Quality Tradeoffs
INT4 quantization reduces VRAM by 4× relative to FP16, with surprisingly small quality degradation for most tasks. GGUF Q4_K_M (the most common Ollama format) is the sweet spot:
- Instruction following: minimal degradation
- Code generation: minor degradation (INT8 preferred)
- Math/reasoning: noticeable degradation (FP16 preferred)
For production inference where accuracy matters, use BF16 or INT8. For local experimentation and chat, GGUF Q4 is fine.
Cloud GPU Costs
Once you know your VRAM requirement, the GPU Hosting Cost Calculator shows you what it costs to rent that GPU across RunPod, Lambda Labs, and Vast.ai.
Rough guide:
- RTX 4090 (24GB): ~$0.44/hr on RunPod, good for 7B FP16 or 13B INT4
- A100 40GB: ~$1.89/hr, good for 70B INT4
- A100 80GB: ~$2.49/hr, good for 70B INT4 with plenty of room for batching; 70B FP16 (~140 GB of weights) needs two of these
- H100 80GB: ~$3.99/hr, fastest inference for production serving
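Hourly rates are easier to compare once converted to a monthly bill. The sketch below uses the rough rates listed above, which change often and vary by provider. The `monthly_cost` helper and the 30-day, always-on assumption are illustrative only.

```python
# Rough hourly rates in USD, copied from the list above; real prices vary.
HOURLY_USD = {
    "RTX 4090 24GB": 0.44,
    "A100 40GB": 1.89,
    "A100 80GB": 2.49,
    "H100 80GB": 3.99,
}


def monthly_cost(hourly_usd, n_gpus=1, hours_per_day=24, days=30):
    """Cost of renting n_gpus for the given hours per day over `days` days."""
    return round(hourly_usd * n_gpus * hours_per_day * days, 2)


for gpu, rate in HOURLY_USD.items():
    print(f"{gpu}: ${monthly_cost(rate):,.2f} per 30 days, always on")

# 70B FP16 needs two A100 80GB cards, so the rate doubles.
print(f"2x A100 80GB: ${monthly_cost(HOURLY_USD['A100 80GB'], n_gpus=2):,.2f}")
```

Lowering `hours_per_day` shows how much an on-demand or scheduled deployment saves compared with an always-on server.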
Running Multiple Models
If you're serving multiple models simultaneously (model routing), add their VRAM requirements together, including a KV cache budget for each. For models too large for one card, vLLM supports tensor parallelism to shard a model across multiple GPUs.
For a production inference server handling mixed traffic (7B and 70B models):
- 1× A100 80GB: can hold 70B INT4 + 7B FP16 simultaneously
- 2× A100 40GB: better for redundancy, fits 70B INT4 with room for batching
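The single-card option above is plain addition of weight memory. A minimal sketch, assuming round parameter counts and the bytes-per-parameter values from the quantization table (the `colocated_gb` helper is hypothetical):

```python
def colocated_gb(models):
    """Sum weight memory in GB for models sharing one GPU.

    models is a list of (params_in_billions, bytes_per_param) pairs.
    """
    return sum(params * bpp for params, bpp in models)


mixed = [(70, 0.5), (7, 2.0)]  # 70B INT4 + 7B FP16
used = colocated_gb(mixed)
remaining = 80 - used  # what is left on an 80 GB card for KV cache and overhead

print(f"weights: {used} GB, remaining on an 80 GB card: {remaining} GB")
```

The remaining memory is shared by both models' KV caches, so it bounds the combined context length and batch size the server can handle.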
Optimizing Inference
Beyond quantization, these tools improve inference efficiency:
- vLLM: PagedAttention for KV cache management, 2–5× throughput improvement
- llama.cpp: CPU+GPU mixed inference (offload some layers to RAM)
- TGI (Text Generation Inference): Hugging Face's production server
- Ollama: simplest setup for GGUF models locally
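For the mixed CPU+GPU case, a rough estimate of how many layers to keep on the GPU can be derived from the weight size. The sketch below assumes all layers are the same size and ignores embeddings. The `gpu_layers` helper and the 2 GB reserve for KV cache and buffers are assumptions, not llama.cpp behavior.

```python
import math


def gpu_layers(weights_gb, n_layers, gpu_gb, reserve_gb=2.0):
    """Estimate how many layers fit on the GPU, assuming equal-size layers.

    reserve_gb is an assumed allowance for KV cache and runtime buffers.
    """
    per_layer_gb = weights_gb / n_layers
    budget_gb = max(gpu_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# 70B INT4 (~35 GB of weights, 80 layers) on a 24 GB card.
print(gpu_layers(35.0, 80, 24))
```

The result is a starting value for llama.cpp's `--n-gpu-layers` option; the layers that don't fit stay in system RAM and run on the CPU, which is slower.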