How do you cool 7 GPUs in a single system?

Cemhan Biricik uses a combination of strategic GPU spacing, high-static-pressure case fans, positive pressure airflow, and aggressive fan curves. The key insight is that airflow direction and volume matter more than ambient temperature. Each GPU needs its own intake and exhaust path. He also monitors temperatures continuously with automated alerts that trigger load balancing when any GPU exceeds safe thresholds.

What temperature should GPUs run at for AI inference?

Cemhan Biricik aims to keep his RTX 5090 GPUs below 80C during sustained inference, with 65-75C as the sweet spot for performance and longevity. Above 80C, GPUs begin thermal throttling, which reduces inference speed. He uses automated monitoring to shift workloads away from hot GPUs and has found that staying in that range gives consistent performance without throttling.

Does GPU temperature affect AI generation quality?

Not directly — the mathematical operations produce identical results regardless of temperature. However, Cemhan Biricik notes that thermal throttling reduces clock speeds, which slows generation time. In extreme cases, thermal instability can cause computation errors that produce visual artifacts. Keeping GPUs cool ensures consistent speed and reliable output quality.

GPU Thermal Management: Keeping 7 RTX 5090s Cool

Here is a problem they do not teach you in computer science: what happens when you put seven flagship GPUs in a single system and run AI inference on all of them 24 hours a day? The answer is heat. A lot of heat. Managing thermals on the ZSky AI GPU cluster has been one of the most practical engineering challenges of building this platform, and I have learned things that no spec sheet will tell you.

The Physics of the Problem

Each RTX 5090 has a TDP of around 575 watts. Seven of them under load means up to 4,025 watts of heat from the GPUs alone. Add the CPU, memory, storage, and other components, and the system can draw over 5,000 watts at peak. That is the equivalent of running five space heaters simultaneously. In a room. Where you also work.
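
The arithmetic above is easy to rerun for a different GPU count, or to convert into the BTU/hr figure that air conditioners are rated in. Here is a minimal sketch in Python using the 575 W and seven-GPU figures from this section. The function names are made up for illustration, and 3.412 BTU/hr per watt is the standard conversion factor.

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # 1 W of continuous heat is about 3.412 BTU/hr


def gpu_heat_watts(gpu_count: int, watts_per_gpu: float) -> float:
    """Worst-case GPU heat load, treating all electrical power as heat."""
    return gpu_count * watts_per_gpu


def to_btu_per_hour(watts: float) -> float:
    """Convert a continuous heat load in watts to BTU/hr."""
    return watts * WATTS_TO_BTU_PER_HOUR


gpus_only = gpu_heat_watts(7, 575)
print(f"GPUs alone: {gpus_only:.0f} W, about {to_btu_per_hour(gpus_only):.0f} BTU/hr")
print(f"Whole system at 5,000 W: about {to_btu_per_hour(5000):.0f} BTU/hr")
```

By that estimate, the 5,000 W peak works out to roughly 17,000 BTU/hr of heat that the room ventilation or AC has to remove.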

The first time I ran all seven GPUs at full load, the ambient room temperature rose by several degrees within an hour. The GPUs hit thermal throttle within twenty minutes. This was not a software problem — it was a physics problem, and it needed a physics solution.

Airflow Is Everything

The single most important factor in GPU thermal management is not the cooler on the GPU. It is the airflow through the system and the room. I learned this the hard way. Fancy GPU coolers mean nothing if the hot air they exhaust has nowhere to go and just gets recycled back into the intake.

My approach: positive pressure airflow through the case with high-static-pressure fans at the front pulling cool air in, and exhaust fans at the top and rear pushing hot air out. The room itself has dedicated ventilation that pulls warm air from ceiling level and replaces it with cooler air from lower intakes. In summer, supplemental cooling is necessary — a dedicated AC unit for the server space.

GPU Spacing and Slot Configuration

This is where most multi-GPU builds fail. If you stack GPUs in adjacent PCIe slots, the bottom GPU starves the one above it of airflow, and the GPU on top takes in the pre-heated exhaust from the one below. Temperatures on the top GPU can run 15-20 degrees Celsius hotter than on the bottom one.

The solution is spacing. Give each GPU at least one empty slot between it and the next. Use a motherboard with sufficient PCIe slot spacing, or use riser cables to position GPUs with gaps between them. This single change — proper spacing — dropped my hottest GPU by 12 degrees.

Fan Curves and Power Limits

I run custom fan curves on every GPU. The default curves that ship with consumer GPUs prioritize noise over cooling — they ramp fans slowly and gently. For a production AI system, noise is irrelevant. I want the fans running aggressively to keep temperatures in the 65-75 degree sweet spot.
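
To make "aggressive" concrete: a fan curve is a piecewise-linear map from temperature to fan duty. The sketch below is illustrative only. The breakpoints are hypothetical values chosen around the 65-75 degree target, the function name is invented, and how a curve gets applied to real hardware depends on the vendor tool you use.

```python
from bisect import bisect_right

# Hypothetical (temperature in C, fan duty in %) breakpoints, for illustration.
AGGRESSIVE_CURVE = [(40, 40), (60, 70), (70, 90), (75, 100)]


def fan_duty(temp_c: float, curve=AGGRESSIVE_CURVE) -> float:
    """Linearly interpolate fan duty (%) for a temperature on the curve."""
    temps = [t for t, _ in curve]
    if temp_c <= temps[0]:
        return float(curve[0][1])  # clamp below the first breakpoint
    if temp_c >= temps[-1]:
        return float(curve[-1][1])  # clamp at or above the last breakpoint
    i = bisect_right(temps, temp_c)  # index of first breakpoint above temp_c
    (t0, d0), (t1, d1) = curve[i - 1], curve[i]
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)


for temp in (50, 65, 72, 80):
    print(f"{temp} C -> {fan_duty(temp):.0f}% fan duty")
```

A quieter default-style curve would simply use lower duty values at the same temperatures; the interpolation stays the same.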

I also adjust power limits strategically. A GPU capped at 90% of its power limit typically delivers about 95% of the performance. Because nearly every watt a GPU draws ends up as heat, that cap also cuts heat output by roughly 10%. This is a good trade-off for AI inference, where you are running sustained workloads: the small performance decrease is invisible to users, and every watt not drawn is a watt of heat the cooling system does not have to remove.
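
On NVIDIA cards the power cap can be set with nvidia-smi's -pl (--power-limit) option, which takes a value in watts. Here is a minimal sketch that works out the wattage for a given percentage and builds the command. The helper names are hypothetical, and the 575 W default is the TDP figure quoted earlier. Actually running the command needs administrator rights and a value inside the range the card reports as supported.

```python
import subprocess


def power_limit_watts(tdp_watts: float, fraction: float) -> int:
    """Power cap in whole watts for a fraction of TDP, rounded down."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(tdp_watts * fraction)


def power_limit_command(gpu_index: int, watts: int) -> list[str]:
    """Build the nvidia-smi call that caps one GPU's power draw."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(gpu_index: int, tdp_watts: float = 575,
                      fraction: float = 0.90) -> None:
    """Apply the cap; requires root/admin and an NVIDIA driver install."""
    watts = power_limit_watts(tdp_watts, fraction)
    subprocess.run(power_limit_command(gpu_index, watts), check=True)


# Show the command for GPU 0 at 90% without executing it.
print(" ".join(power_limit_command(0, power_limit_watts(575, 0.90))))
```

Building the command separately from running it makes it easy to log or dry-run the change before touching the hardware.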

Monitoring and Automated Response

I monitor every GPU's temperature continuously. When any GPU crosses 78 degrees, the queue system automatically reduces its load and shifts work to cooler GPUs. At 85 degrees, the GPU is pulled from the inference pool entirely until it cools down. This has never actually triggered a full shutdown, but having the safety net means I sleep at night without worrying about thermal damage.
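
The two-threshold policy above can be sketched in a few lines. This is an illustration, not the actual ZSky AI queue code: the function names and the three state labels are hypothetical, and treating "crosses 78" as "at or above 78" is an assumption. Temperatures are read with nvidia-smi's --query-gpu interface, which prints one CSV line per GPU.

```python
import subprocess

REDUCE_LOAD_AT_C = 78  # at or above this, shift work to cooler GPUs
PULL_FROM_POOL_AT_C = 85  # at or above this, remove the GPU from the pool


def parse_temperatures(csv_text: str) -> dict[int, int]:
    """Parse 'index, temperature' CSV lines into {gpu_index: temp_c}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def read_temperatures() -> dict[int, int]:
    """Query every GPU's core temperature via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_temperatures(out)


def thermal_state(temp_c: int) -> str:
    """Map a temperature to a hypothetical scheduler state."""
    if temp_c >= PULL_FROM_POOL_AT_C:
        return "pulled"
    if temp_c >= REDUCE_LOAD_AT_C:
        return "reduced"
    return "normal"


# Example with canned output instead of a live nvidia-smi call.
sample = "0, 71\n1, 79\n2, 86\n"
for gpu, temp in parse_temperatures(sample).items():
    print(gpu, temp, thermal_state(temp))
```

A real scheduler would also want a lower re-admit temperature so that a GPU hovering around a threshold does not flap in and out of the pool. That hysteresis value is not specified here.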

Thermal Management Checklist for Multi-GPU AI Systems
Space GPUs apart: minimum one empty slot between each GPU, more if possible
Positive pressure airflow: more intake than exhaust to prevent hot spots
Aggressive fan curves: noise does not matter for production systems, temperature does
Room ventilation: the room is part of the cooling system. Ensure warm air exhausts and cool air enters
Power limit tuning: 90% power for about 95% performance and roughly 10% less heat is almost always the right trade
Continuous monitoring: automated alerts and load balancing based on real-time temperatures
Summer planning: have supplemental cooling ready before you need it. Heatwaves do not send warnings

Thermal management is not glamorous work. Nobody starts an AI company excited about airflow dynamics and fan curves. But when you are running your own infrastructure — when every GPU is hardware you own and maintain — keeping it cool is the difference between reliable service and unexpected downtime. It is infrastructure work that directly translates to user experience, and I take it as seriously as any code I write.
