Blog • March 2026
By Cemhan Biricik — Founder of ZSky AI
VRAM is the most precious resource in AI inference. More than compute, more than bandwidth, more than CPU cycles. When you are running multiple AI models on consumer GPUs like the RTX 5090, every megabyte of VRAM matters. Here are the techniques I use to squeeze maximum utility out of every GPU at ZSky AI.
If you are serving AI models at FP16 or FP32, you are wasting half or more of your VRAM. FP8 quantization halves weight memory relative to FP16 (and quarters it relative to FP32) with minimal perceptible quality loss for inference. The key word is "perceptible": there are measurable differences in benchmark metrics, but users cannot tell the difference in the final output.
I run every production model at FP8. A 14-billion-parameter model whose weights need 28GB at FP16 needs about 14GB at FP8, so it fits in 16GB with some room left for activations. This is the single most impactful optimization in my entire stack.
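The arithmetic behind those numbers is simple enough to check by hand. The sketch below counts weight bytes only and uses decimal gigabytes; the function name and the table are illustrative, and real usage will be higher once activations and framework overhead are included.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate weight-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Weight footprint of a 14-billion-parameter model at each precision.
for dtype in ("fp32", "fp16", "fp8"):
    print(f"{dtype}: {weight_memory_gb(14_000_000_000, dtype):.0f} GB")
```

The gap between the FP8 figure and the card's capacity is what is left for activations, caches, and everything else.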
Rather than time-sharing GPUs between models, I dedicate specific GPUs to specific models. GPUs 0-1 might handle image generation, GPUs 2-3 video, GPU 4 upscaling, and GPUs 5-6 serve as flexible overflow. This eliminates the catastrophic overhead of model loading and unloading, which can take 10-30 seconds for large models.
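A static assignment like this can be written down as a small lookup table. The following is a minimal sketch, not the production configuration: the pool names, the round-robin choice within a pool, and the rule that unknown workloads go to the overflow pool are all assumptions made for illustration.

```python
import itertools

# Hypothetical pool layout mirroring the example split above.
GPU_POOLS = {
    "image": [0, 1],
    "video": [2, 3],
    "upscale": [4],
    "overflow": [5, 6],
}

# One endless round-robin iterator per pool.
_round_robin = {name: itertools.cycle(ids) for name, ids in GPU_POOLS.items()}


def device_for(workload: str) -> str:
    """Return the next device string for a workload's dedicated pool.

    Workloads without a dedicated pool fall back to "overflow"
    (an assumption of this sketch).
    """
    pool = workload if workload in GPU_POOLS else "overflow"
    return f"cuda:{next(_round_robin[pool])}"
```

Because each model stays resident on its own GPUs, picking a device is a dictionary lookup rather than a 10-30 second load.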
PyTorch's CUDA caching allocator holds on to memory by default. It requests memory from the driver eagerly but returns it reluctantly: freed blocks stay in PyTorch's cache, and that cache can fragment into pieces too small to satisfy new allocations. After every inference call, I force explicit garbage collection and clear the CUDA cache. This adds maybe 50 milliseconds of latency but prevents the memory fragmentation that causes out-of-memory errors under load.
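A cleanup step of this kind can be sketched as a small wrapper around the inference call. `gc.collect()` and `torch.cuda.empty_cache()` are real calls; the helper names `release_cached_vram` and `run_inference` are hypothetical, and the guarded import exists only so the sketch also runs on a machine without PyTorch or a GPU.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def release_cached_vram() -> None:
    """Drop unreachable Python objects, then release unused cached CUDA blocks."""
    gc.collect()  # frees tensors that are only kept alive by reference cycles
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # returns unused cached blocks to the driver


def run_inference(model_fn, *args, **kwargs):
    """Call model_fn and always clean up afterwards, even if it raises."""
    try:
        return model_fn(*args, **kwargs)
    finally:
        release_cached_vram()
```

Putting the cleanup in a `finally` block means a failed request still releases its memory, which matters most exactly when the GPU is under pressure.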
For high-resolution image generation, attention computation can consume more VRAM than the model weights themselves. Attention slicing breaks the attention computation into smaller chunks that fit in available memory. VAE tiling processes the final decode step in tiles rather than all at once. Both techniques trade compute time for memory efficiency — usually a worthwhile trade.
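To make the trade concrete, here is a toy, pure-Python sketch of attention slicing. It is not how a GPU kernel is written; it only shows that processing the queries in chunks produces the same output while never holding more than `slice_size` rows of the score matrix at once. All function names are illustrative. In practice, libraries usually expose this as a switch; Hugging Face diffusers, for example, has `enable_attention_slicing()` on its pipelines.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Full attention: materializes the whole len(q) x len(k) score matrix."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    dims = range(len(v[0]))
    return [[sum(w * vj[d] for w, vj in zip(row, v)) for d in dims] for row in weights]


def sliced_attention(q, k, v, slice_size):
    """Same result, but only slice_size query rows of scores exist at a time."""
    out = []
    for start in range(0, len(q), slice_size):
        out.extend(attention(q[start:start + slice_size], k, v))
    return out
```

Each query row is independent of the others, which is why slicing along the query axis changes peak memory but not the result. VAE tiling applies the same idea spatially, although tiles interact at their borders, so real implementations typically overlap and blend them.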
My queue system is not just about ordering requests. It is a memory-aware scheduler. Each model's VRAM requirements are known. The queue checks GPU memory status before dispatching each request. If a GPU is too loaded, the request waits for the next available GPU rather than causing an OOM crash. This reliability is worth more than raw throughput.
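The dispatch check can be sketched in a few lines. This is a simplified illustration under stated assumptions: the per-request VRAM figures, the `GpuState` type, the `pick_gpu` helper, and the 1GB headroom are all hypothetical, and the free-memory numbers are passed in rather than measured.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical per-request VRAM budgets in MB; real values come from profiling.
REQUEST_VRAM_MB = {"image": 8_000, "video": 20_000, "upscale": 4_000}


@dataclass
class GpuState:
    index: int
    free_mb: int  # could be read with torch.cuda.mem_get_info(), which returns bytes


def pick_gpu(model: str, gpus: Sequence[GpuState], headroom_mb: int = 1024) -> Optional[int]:
    """Return the index of the GPU with the most free memory that fits the
    request plus headroom, or None if the request should stay queued."""
    needed = REQUEST_VRAM_MB[model] + headroom_mb
    candidates = [gpu for gpu in gpus if gpu.free_mb >= needed]
    if not candidates:
        return None  # wait for the next available GPU instead of risking an OOM
    return max(candidates, key=lambda gpu: gpu.free_mb).index
```

A `None` result is the important case: the request stays in the queue and is retried when memory frees up, rather than being dispatched and crashing the worker.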
VRAM management is not glamorous. There are no TED talks about garbage collection timing. But it is the difference between an AI service that crashes under load and one that runs smoothly. Every trick in this post was learned from a production outage that taught me the hard way.