AI Model Size / VRAM Calculator
Calculate how much VRAM you need to run any LLM locally. Supports popular models, GGUF quantization levels, and custom architectures.
About AI Model Size / VRAM Calculator
Running large language models locally requires knowing exactly how much GPU memory you need. This VRAM calculator takes a model's parameter count, quantization level, and context length, then estimates the total memory required so you can figure out which GPU will handle it.
How It Works
Select a popular model preset like Llama 3.1, Mistral 7B, or Gemma 2, or enter custom architecture details. The calculator computes the raw weight size based on your quantization choice, adds the KV cache for your desired context window, and applies a 1.2x overhead multiplier for the inference runtime. You get a clear breakdown of weight memory, KV cache, and total VRAM needed.
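The estimate described above can be sketched in a few lines of Python. The function and parameter names here are illustrative, not the calculator's actual code, and the example architecture values are assumptions:

```python
def estimate_vram_gb(params_b, bytes_per_param, layers, kv_heads, head_dim,
                     context_len, overhead=1.2):
    """Rough total VRAM in GiB: quantized weights x runtime overhead,
    plus an FP16 KV cache for the requested context window."""
    weights = params_b * 1e9 * bytes_per_param            # raw weight bytes
    kv_cache = 2 * layers * kv_heads * head_dim * 2 * context_len  # K + V, FP16
    return (weights * overhead + kv_cache) / 1024**3

# Example: an 8B-parameter model at Q4_K_M (~0.57 bytes/param) with an
# assumed shape of 32 layers, 8 KV heads, head dim 128, 8192-token context
print(round(estimate_vram_gb(8, 0.57, 32, 8, 128, 8192), 1))  # ~6.1 GiB
```

Note that the weight term dominates at short contexts, while the KV cache grows linearly with context length.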
Quantization Levels Explained
Quantization reduces model precision to save memory. FP32 uses 4 bytes per parameter while FP16 uses 2. GGUF formats like Q4_K_M compress further to around 0.57 bytes per parameter. The quantization comparison table lets you see the size difference across all levels at a glance, making it easy to pick the right trade-off between quality and memory.
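To make the trade-off concrete, here is a small sketch comparing weight-only sizes for a 7B-parameter model across quantization levels. The bytes-per-parameter figures are approximations of common GGUF averages, not exact values for any specific model:

```python
# Approximate average bytes per parameter for common formats (assumption:
# real GGUF files vary slightly by model and tensor mix)
BYTES_PER_PARAM = {
    "FP32": 4.0, "FP16": 2.0, "Q8_0": 1.06,
    "Q5_K_M": 0.69, "Q4_K_M": 0.57, "Q2_K": 0.35,
}

def weight_size_gb(params_b, fmt):
    """Weight-only size in GiB, before KV cache and runtime overhead."""
    return params_b * 1e9 * BYTES_PER_PARAM[fmt] / 1024**3

for fmt, _ in BYTES_PER_PARAM.items():
    print(f"{fmt:>7}: {weight_size_gb(7, fmt):5.1f} GiB")
```

Running this shows why a 7B model that needs roughly 13 GiB at FP16 fits comfortably on an 8 GB card once quantized to Q4_K_M.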
GPU Compatibility at a Glance
The GPU compatibility section shows whether your model fits on popular cards from the RTX 3060 to the H100. Each GPU gets a color-coded bar and badge so you can instantly see if it will fit easily, be a tight squeeze, or not fit at all. If you need help with other technical details, the subnet calculator or screen resolution checker might also be useful.
All calculations run entirely in your browser. No data is sent anywhere.
Frequently Asked Questions
How is the VRAM estimate calculated?
The calculator multiplies the model's parameter count by the bytes per parameter for your chosen quantization level. It applies a 1.2x overhead multiplier for CUDA context, activations, and framework memory, then adds the KV cache needed for your context length. The KV cache is calculated as: 2 (keys and values) × layers × KV heads × head dimension × 2 bytes (FP16) × context length.
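The KV cache formula can be written directly in Python. The function name is illustrative, and the example uses a Llama-3.1-8B-style shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, dtype_bytes=2):
    """KV cache size in bytes: one key tensor plus one value tensor
    (hence the leading 2) per layer, stored at FP16 by default."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * context_len

# 32 layers, 8 KV heads, head dim 128, 8192-token context
print(kv_cache_bytes(32, 8, 128, 8192) / 1024**3)  # exactly 1.0 GiB
```

Because every factor is multiplicative, doubling the context length exactly doubles the cache, which is why long-context runs can dwarf the weights themselves.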
What quantization level should I choose?
It depends on your hardware and quality needs. FP16 gives full precision but uses the most memory. Q4_K_M is a popular middle ground that keeps most of the model's quality while cutting memory roughly 3.5x. Q2_K saves the most space but noticeably reduces output quality. For most home setups, Q5_K_M or Q4_K_M are good starting points.
Why does the total VRAM include a 1.2x overhead?
Running a model requires more memory than just its weights. The CUDA runtime context, activation tensors during inference, and the inference framework itself all consume GPU memory. The 1.2x multiplier is a conservative estimate that accounts for this extra usage.
Can I run a model that barely fits in my VRAM?
Technically yes, but performance suffers. When VRAM usage is near 100%, the system may need to swap memory, which slows inference dramatically. If a model is a tight fit, try a more aggressive quantization level or reduce the context length to free up headroom.
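Reducing the context length is often the cheapest way to reclaim headroom, since the KV cache scales linearly with it. A quick sketch, using the same assumed 32-layer / 8-KV-head / head-dim-128 shape as above:

```python
def kv_cache_gib(layers, kv_heads, head_dim, context_len):
    """FP16 KV cache in GiB for a given model shape and context window."""
    return 2 * layers * kv_heads * head_dim * 2 * context_len / 1024**3

full = kv_cache_gib(32, 8, 128, 131072)  # 128K-token context
short = kv_cache_gib(32, 8, 128, 8192)   # 8K-token context
print(full, short)  # 16.0 GiB vs 1.0 GiB
```

Dropping from a 128K to an 8K window frees 15 GiB in this example without touching the weights at all.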
Does this account for Apple Silicon unified memory?
Apple Silicon Macs share RAM between the CPU and GPU, so most of the system RAM is available for model loading (macOS reserves a portion for the system, so not quite all of it can go to the GPU). The estimates here still apply, but you can use your total system RAM, minus that reserve, as your available memory rather than a dedicated GPU VRAM figure.