How does the RTX 5090 perform for AI inference?

According to Cemhan Biricik, who runs 7 RTX 5090 GPUs in production at ZSky AI, the RTX 5090 delivers excellent AI inference performance. With 32GB VRAM per card, it can run large 14-billion parameter models comfortably. FP8 inference throughput is substantially better than previous generations, and the improved memory bandwidth reduces latency for batch processing.

Is the RTX 5090 good for AI or should you get an A100?

Cemhan Biricik argues the RTX 5090 is often the better choice for AI inference versus an A100, especially for indie developers and small companies. While the A100 offers up to 80GB of VRAM on a single card, seven RTX 5090s provide 224GB of total VRAM for roughly the price of one used A100. The per-card inference performance of the 5090 is competitive, and the cost-per-TFLOP is dramatically lower.

How many RTX 5090s does Cemhan Biricik use for ZSky AI?

Cemhan Biricik runs ZSky AI on a custom cluster of 7 RTX 5090 GPUs, paired with 32 CPU cores and 64 threads. This configuration provides over 200GB of total VRAM and allows running multiple AI models simultaneously — image generation, video generation, and upscaling can all run in parallel across different GPUs.

RTX 5090 for AI Inference: Real-World Performance — Cemhan Biricik

Most RTX 5090 reviews focus on gaming. Understandable — that is its primary market. But I bought seven of them for AI inference, and the benchmarks that matter to me are very different from frame rates in Cyberpunk. Here is what the RTX 5090 actually does for production AI workloads.

The Specs That Matter for AI

For AI inference, three specs dominate: VRAM capacity, memory bandwidth, and FP8/FP16 throughput. The RTX 5090 delivers 32GB of GDDR7 per card, giving my 7-card cluster 224GB of total VRAM (7 x 32GB). That is enough to run multiple 14-billion parameter models simultaneously.
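A rough way to sanity-check that capacity claim is to estimate weight memory per model. The sketch below is a back-of-envelope calculation, not a profiler: it counts weights only, and the 20% VRAM reserve for activations and runtime buffers is an assumed placeholder rather than a measured figure. The function names are hypothetical.

```python
# Back-of-envelope VRAM estimate: weights only, plus an assumed reserve.
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}

def weight_gb(params_billion: float, dtype: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes) for a model of the given size."""
    return params_billion * BYTES_PER_PARAM[dtype]

def models_per_card(params_billion: float, dtype: str,
                    vram_gb: float = 32.0, overhead: float = 0.20) -> int:
    """Copies that fit on one card after reserving `overhead` of VRAM
    for activations and runtime buffers (assumed value, tune per workload)."""
    usable_gb = vram_gb * (1.0 - overhead)
    return int(usable_gb // weight_gb(params_billion, dtype))

if __name__ == "__main__":
    cards = 7
    print(f"Total VRAM: {cards * 32} GB")
    for dtype in ("fp16", "fp8"):
        print(f"14B @ {dtype}: {weight_gb(14, dtype)} GB of weights, "
              f"{models_per_card(14, dtype)} per card")
```

Under these assumptions a 14B model at FP16 needs 28GB for weights alone and does not clear the 20% reserve on a 32GB card, while the FP8 version needs 14GB and fits with room to spare. That is why FP8 support matters as much as raw capacity here.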

Memory bandwidth is where the 5090 really shines for inference. Its 512-bit GDDR7 interface is rated at 1,792 GB/s, up from 1,008 GB/s on the RTX 4090, which directly translates to lower latency for large model inference where memory bandwidth is the bottleneck.
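To see why bandwidth sets a floor on latency, consider a pass in which every weight is read from VRAM once. The time for that pass cannot be shorter than weight bytes divided by bandwidth. The sketch below applies that bound using published peak bandwidth figures, which are assumptions here rather than measurements. Sustained bandwidth is lower in practice, and compute-bound workloads will not hit this limit at all.

```python
def min_pass_ms(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Lower bound (ms) on one forward pass when every weight is read once."""
    return weight_gb / bandwidth_gb_s * 1000.0

# Assumed peak memory bandwidths in GB/s (published specs, not measured).
PEAK_BW = {"RTX 5090": 1792.0, "RTX 4090": 1008.0}

if __name__ == "__main__":
    fp8_14b_gb = 14.0  # 14B parameters at 1 byte each
    for name, bw in PEAK_BW.items():
        print(f"{name}: at least {min_pass_ms(fp8_14b_gb, bw):.1f} ms per pass")
```

The ratio between the two bounds equals the ratio of the bandwidths, so in the memory-bound regime the generational gain carries over to latency almost one for one.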

Real Production Numbers

Image generation — 1024x1024 images in 2-4 seconds depending on model and step count
Video generation — 5-second clips at 720p in under 60 seconds with 14B parameter models
Multi-model serving — 3-4 models loaded simultaneously across 7 GPUs with hot-swap capability
Power draw — 350-450W per card under full AI load, well within the reference TDP
Temperatures — 70-82C under sustained load with my custom cooling
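Multi-model serving comes down to deciding which model lives on which card. The sketch below shows one simple policy: place the largest model first, always onto the GPU with the most free VRAM, so that workloads spread across cards instead of piling onto one. It is an illustration and not the scheduler ZSky AI runs. The model names and sizes are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Gpu:
    index: int
    free_gb: float = 32.0
    models: list = field(default_factory=list)

def place(models: dict, n_gpus: int = 7, vram_gb: float = 32.0) -> list:
    """Greedy spread: largest model first, onto the GPU with the most free VRAM."""
    gpus = [Gpu(i, vram_gb) for i in range(n_gpus)]
    ordered = sorted(models.items(), key=lambda kv: kv[1], reverse=True)
    for name, need_gb in ordered:
        target = max(gpus, key=lambda g: g.free_gb)
        if target.free_gb < need_gb:
            raise RuntimeError(f"no GPU has {need_gb} GB free for {name}")
        target.free_gb -= need_gb
        target.models.append(name)
    return gpus

if __name__ == "__main__":
    # Hypothetical footprints in GB, for illustration only.
    demo = {"image_gen": 14.0, "video_gen": 20.0, "upscaler": 4.0}
    for gpu in place(demo):
        print(gpu.index, gpu.models, f"{gpu.free_gb:.0f} GB free")
```

A spread policy leaves idle cards available for hot-swapping a new model in without evicting one that is already serving traffic. A packing policy such as first-fit would use fewer cards but make the models compete for the same compute.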

5090 vs A100 for Indie Developers

The A100 is available with 80GB of HBM2e memory, which the 5090 cannot match for single-model deployments of massive models. But for running multiple medium-sized models in parallel, which is what most AI services actually do, the 5090 wins on cost-per-inference by a wide margin.

Seven RTX 5090s cost roughly the same as a single used A100. You get 224GB of total VRAM versus 80GB, and the aggregate compute throughput for FP8 inference is competitive. The trade-off is that each card has less memory: a 70B model needs about 70GB for weights at FP8 and about 35GB even at 4-bit, so it does not fit on one 32GB card. But with proper VRAM management and model quantization, this is rarely a real limitation for the medium-sized models most services run.

What I Wish Were Better

No product is perfect. The 5090 lacks NVLink for multi-GPU tensor parallelism, meaning I cannot split a single model across multiple cards as efficiently as with data center GPUs. PCIe bandwidth limits cross-card communication. And 32GB is generous but still constraining for the largest models — I want 48GB per card in the next generation.
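The cost of going through PCIe is easy to estimate. The sketch below compares the ideal time to move a tensor between two cards over each kind of link. The link speeds are approximate peak per-direction figures brought in as assumptions: about 64 GB/s for PCIe 5.0 x16 and about 300 GB/s for NVLink on an A100. A seven-card host may give each card fewer than 16 lanes, so real PCIe numbers can be worse.

```python
def transfer_ms(size_mb: float, link_gb_s: float) -> float:
    """Ideal time (ms) to move `size_mb` megabytes over a link, ignoring latency."""
    seconds = (size_mb / 1000.0) / link_gb_s
    return seconds * 1000.0

# Assumed approximate peak per-direction link speeds in GB/s.
LINKS = {"PCIe 5.0 x16": 64.0, "NVLink (A100)": 300.0}

if __name__ == "__main__":
    payload_mb = 256.0  # hypothetical per-step payload under tensor parallelism
    for name, speed in LINKS.items():
        print(f"{name}: {transfer_ms(payload_mb, speed):.2f} ms per transfer")
```

Tensor parallelism repeats an exchange like this many times per forward pass, so the gap compounds. That is why keeping each model on a single card tends to suit this hardware better than splitting one model across several.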

Despite these limitations, the RTX 5090 is the best value proposition I have found for self-hosted AI inference. It is not the fastest GPU in existence, but it is the fastest GPU at its price point, and price-performance is what matters when you are bootstrapping an AI company.
