By Cemhan Biricik — Founder of ZSky AI
Running seven NVIDIA RTX 5090 GPUs around the clock for AI inference is not the same as having a powerful gaming PC. The engineering challenges are real, the failure modes are surprising, and the lessons are hard-won. Here is what I have learned building the infrastructure that powers ZSky AI.
A single RTX 5090 under full load generates significant heat: NVIDIA rates the card at 575 W of total graphics power, and nearly all of that ends up as heat. Seven of them in a single system, roughly 4 kW at rated power, generate enough thermal energy to heat a small apartment. The first lesson I learned was that airflow is not optional. It is the single most important engineering decision you make.
Consumer GPU coolers are designed for gaming workloads: intermittent bursts of high activity with cooling breaks between sessions. AI inference is a continuous, sustained load. The cooling solution must handle 100% utilization, 24 hours a day, 365 days a year. I designed custom airflow paths, placed fans strategically, and built monitoring systems that alert on thermal anomalies before they become failures.
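To make "alert on thermal anomalies" concrete, here is a minimal Python sketch of one way such a check could work. It is an illustration, not the monitoring stack that runs ZSky AI. It reads temperatures through `nvidia-smi --query-gpu`, then flags any card that crosses an absolute limit or runs much hotter than its neighbors. The thresholds `TEMP_LIMIT_C` and `OUTLIER_MARGIN_C` and all function names are hypothetical.

```python
import subprocess

# Hypothetical thresholds; tune them for your own cards and chassis.
TEMP_LIMIT_C = 85       # absolute ceiling for any single card
OUTLIER_MARGIN_C = 10   # degrees above the median that count as an anomaly

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temps(csv_text):
    """Turn 'index, temp' lines from nvidia-smi into {gpu_index: temp_c}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def find_anomalies(temps):
    """Return (gpu_index, reason) pairs for cards that look unhealthy."""
    ordered = sorted(temps.values())
    median = ordered[len(ordered) // 2]  # middle value (upper median if even)
    alerts = []
    for index, temp in sorted(temps.items()):
        if temp >= TEMP_LIMIT_C:
            alerts.append((index, f"{temp} C is at or above the {TEMP_LIMIT_C} C limit"))
        elif temp - median >= OUTLIER_MARGIN_C:
            alerts.append((index, f"{temp} C is {temp - median} C above the median"))
    return alerts


def read_temps():
    """Query the driver once; requires nvidia-smi on PATH."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temps(result.stdout)
```

In practice something like `find_anomalies(read_temps())` would run on a timer and feed whatever alerting channel you use. Comparing each card against the median catches a single blocked fan or a dead airflow path well before the absolute limit is reached.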
Seven high-end GPUs plus a 32-core CPU draw substantial power. This is not a "plug it into a power strip" situation. Dedicated circuits, proper power supplies with sufficient headroom, and UPS protection are baseline requirements. I have learned to budget 20% additional power capacity beyond peak theoretical draw, because real-world power spikes during model loading and batch processing can exceed steady-state predictions.
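The 20% rule is simple arithmetic, and a short Python sketch shows how the numbers add up. Every wattage here is an illustrative assumption rather than a measurement from this system: 575 W is NVIDIA's published total graphics power for the RTX 5090, while the CPU and "everything else" figures are hypothetical placeholders.

```python
# All wattages are illustrative assumptions, not measurements.
GPU_COUNT = 7
GPU_WATTS = 575     # NVIDIA's published total graphics power for the RTX 5090
CPU_WATTS = 350     # hypothetical figure for a 32-core CPU
OTHER_WATTS = 150   # hypothetical allowance for fans, drives, motherboard
HEADROOM = 0.20     # the 20% margin described above


def power_budget(gpu_count, gpu_watts, cpu_watts, other_watts, headroom):
    """Return (peak theoretical draw, capacity to provision) in watts."""
    peak = gpu_count * gpu_watts + cpu_watts + other_watts
    return peak, peak * (1 + headroom)


peak, budget = power_budget(GPU_COUNT, GPU_WATTS, CPU_WATTS, OTHER_WATTS, HEADROOM)
print(f"peak theoretical draw: {peak} W")
print(f"capacity to provision: {budget:.0f} W")
```

With these placeholder figures the peak is 4,525 W and the provisioned capacity about 5,430 W, which is already well beyond what a single ordinary household outlet can deliver.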
Every GPU in the cluster reports temperature, utilization, memory usage, and error rates in real time. Every generation request is logged with timing data. Every failure is captured with full context. This monitoring infrastructure took significant effort to build, but it is what allows me to maintain 24/7 uptime as a single operator.
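As one possible shape for "every request is logged with timing data, every failure is captured with full context", here is a minimal Python sketch built on the standard `logging` module. The helper name `logged_request`, the field names, and the logger name are all hypothetical; the real system may look nothing like this.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("inference")  # hypothetical logger name


@contextmanager
def logged_request(request_id, gpu_index, **context):
    """Emit one JSON log record per request, with duration and any failure."""
    # Status stays "aborted" unless the block finishes or raises an Exception.
    record = {"request_id": request_id, "gpu": gpu_index, "status": "aborted", **context}
    start = time.perf_counter()
    try:
        yield record  # the caller may attach extra fields while working
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = "error"
        record["error"] = f"{type(exc).__name__}: {exc}"
        raise  # logging must never swallow the failure
    finally:
        record["duration_s"] = round(time.perf_counter() - start, 4)
        level = logging.INFO if record["status"] == "ok" else logging.ERROR
        logger.log(level, json.dumps(record, default=str))
```

A caller would wrap each generation in `with logged_request("req-123", gpu_index=2, model="demo") as rec:` and do the work inside the block. Writing one structured record per request keeps the logs easy to search, which matters when a single operator has to reconstruct what happened after the fact.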
If you are considering self-hosted AI infrastructure, here is my advice: start smaller than you think you need, monitor everything from day one, budget for cooling before you budget for compute, and never underestimate the value of physical access to your hardware. When a GPU fails at 3 AM, the ability to walk to the machine and swap a card is worth more than any cloud provider's SLA.