What metrics should you monitor on a GPU cluster?

Cemhan Biricik monitors GPU temperature, GPU utilization percentage, VRAM usage, power draw, fan speed, inference latency (p50, p95, p99), queue depth, successful versus failed generations, and system-level metrics such as CPU, RAM, disk I/O, and network throughput. Temperature and inference latency are the two most critical metrics for catching problems early.

How does Cemhan Biricik get alerts about GPU problems?

Cemhan Biricik uses a custom alerting system that sends notifications when any GPU exceeds temperature thresholds, when inference latency spikes above normal ranges, when VRAM usage indicates a memory leak, or when a GPU stops responding to health checks. Alerts are tiered: informational, warning, and critical, with different notification methods for each level.

Can you monitor a GPU cluster without expensive tools?

Yes. Cemhan Biricik built his monitoring stack using open-source tools and custom scripts. The GPU telemetry comes from nvidia-smi and the NVML library. The dashboard is a custom web interface. Alerting uses simple webhook-based notifications. He argues that expensive monitoring platforms are unnecessary for small-to-medium GPU clusters, because custom scripts that query nvidia-smi give you everything you need.
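
As a rough illustration of what such a script can look like, here is a minimal Python sketch that polls nvidia-smi through its CSV query mode and parses the result. The choice of fields and the function names parse_gpu_csv and read_gpus are assumptions for the example, not the actual ZSky AI code.

```python
import csv
import io
import subprocess

# Field names accepted by `nvidia-smi --query-gpu`; this selection is an assumption.
FIELDS = ["index", "temperature.gpu", "utilization.gpu",
          "memory.used", "memory.total", "power.draw", "fan.speed"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        row = {}
        for name, raw in zip(FIELDS, record):
            try:
                row[name] = float(raw.strip())
            except ValueError:
                row[name] = None  # e.g. "[N/A]" when a sensor is not exposed
        row["index"] = int(row["index"])
        rows.append(row)
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed rows (needs an NVIDIA driver)."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True,
                         check=True, timeout=10)
    return parse_gpu_csv(out.stdout)
```

Calling read_gpus() on a timer and appending the rows to a file or small database is enough to feed both a dashboard and an alert check. Keeping the parser separate from the subprocess call makes it testable on a machine without a GPU.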

Monitoring a GPU Cluster: My Dashboard and Alerts

Running a 7-GPU cluster for production AI inference is like being responsible for seven expensive, heat-generating, occasionally temperamental machines that your entire business depends on. Without monitoring, you are flying blind. With good monitoring, you see problems before users do. Here is the monitoring stack I built for ZSky AI and why each piece matters.

The Metrics That Actually Matter

When I first set up monitoring, I tracked everything. GPU temperature, utilization, memory, power draw, fan speed, clock speed, memory clock, PCIe bandwidth — dozens of metrics per GPU, sampled every second. The dashboard looked impressive and was completely useless. Too much data is as bad as too little.

After months of refining, I narrowed down to the metrics that actually predict problems:

GPU temperature — the single most important metric. Temperature trends predict throttling and hardware issues before they happen
Inference latency (p95) — the 95th percentile generation time tells you when something is degrading. Average latency hides problems; p95 reveals them
VRAM usage trend — not the current value but the trend over time. A slowly increasing VRAM usage indicates a memory leak that will eventually crash the inference process
Queue depth — how many requests are waiting. This tells you whether demand exceeds capacity and whether you need to optimize or add hardware
Error rate — failed generations as a percentage of total. Any sustained increase in error rate demands immediate investigation
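
To make the p95 point concrete, here is a small Python sketch that computes a nearest-rank percentile and an error rate. The sample latencies are invented for illustration, and nearest-rank is only one of several common percentile definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. pct is a whole number from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(failed, total):
    """Failed generations as a fraction of all generations."""
    return failed / total if total else 0.0


# Hypothetical generation times in seconds: 18 normal requests, 2 very slow ones.
latencies = [4.0] * 18 + [30.0, 32.0]
mean_latency = sum(latencies) / len(latencies)
p95_latency = percentile(latencies, 95)
```

With these made-up numbers the mean comes out at 6.7 seconds, which looks only mildly elevated, while the p95 is 30 seconds and shows that some users are waiting far longer than normal.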

The Dashboard

My dashboard is a custom web page that I can access from my phone. It shows real-time status for all seven GPUs in a layout I can parse in under two seconds. Each GPU gets a card showing temperature (color-coded green/yellow/red), current workload, VRAM usage bar, and inference count for the last hour. At the top, aggregate metrics: total generations today, average latency, error rate, and queue depth.

I built this myself rather than using Grafana or Datadog because I wanted exactly the information I need in exactly the layout I want. No authentication delays, no loading spinners, no unused panels. When I wake up at 3 AM because an alert fired, I want to see the system status in the time it takes the page to load.

The Alert System

Alerts are tiered into three levels. Informational alerts go to a log file: things like a GPU approaching its temperature threshold or VRAM usage exceeding 85%. Warning alerts send a notification: a GPU has started thermal throttling, or latency has spiked to more than twice its normal level. Critical alerts trigger immediate notifications with sound: a GPU has stopped responding, the error rate has exceeded 5%, or the system is unable to process any requests.
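
The tiering can be expressed as a single classification function. The sketch below is a Python illustration under stated assumptions: the Snapshot fields, the 80 degree informational temperature threshold, and the reading of the latency rule as "more than twice a baseline" are all choices made for the example. Only the 85% VRAM and 5% error-rate limits come from the description above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    OK = 0
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Where each level is delivered, following the three tiers described above.
ROUTES = {
    Level.INFO: "log file",
    Level.WARNING: "notification",
    Level.CRITICAL: "notification with sound",
}

TEMP_INFO_C = 80.0  # assumed threshold; the right value depends on the card


@dataclass
class Snapshot:  # hypothetical per-GPU reading
    responding: bool
    throttling: bool
    temp_c: float
    vram_fraction: float       # used / total, between 0 and 1
    p95_latency_s: float
    baseline_latency_s: float  # what "normal" latency looks like
    error_rate: float          # failed / total, between 0 and 1


def classify(s):
    """Return the most severe level that applies to this snapshot."""
    if not s.responding or s.error_rate > 0.05:
        return Level.CRITICAL
    if s.throttling or s.p95_latency_s > 2 * s.baseline_latency_s:
        return Level.WARNING
    if s.temp_c >= TEMP_INFO_C or s.vram_fraction > 0.85:
        return Level.INFO
    return Level.OK
```

Checking the most severe conditions first means a snapshot produces one alert at its highest level rather than a burst of lower-level ones, which fits the goal of keeping alerts few and actionable.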

The most important rule for alerting: never alert on things that do not require action. Alert fatigue kills monitoring systems. If I get an alert, it means I need to do something. If it does not require action, it is a log entry, not an alert.

What Monitoring Taught Me About My Infrastructure

Building this monitoring system revealed patterns I never would have noticed otherwise. I discovered that GPU 4 consistently runs 3 degrees hotter than the others due to its position in the case. I found that inference latency increases by 8% between 2 PM and 6 PM when the room is warmest. I learned that one specific model configuration causes a slow VRAM leak that manifests after exactly 847 generations.

These are the kinds of insights that transform infrastructure management from reactive firefighting to proactive optimization. You cannot fix what you cannot see, and monitoring makes the invisible visible.

GPU Monitoring Essentials for Self-Hosted AI
Track trends, not just values — a temperature of 75C is fine. A temperature rising 2 degrees per hour is a problem
Use p95 latency, not average — average latency lies. Percentile latency tells the truth about user experience
Keep the dashboard simple — if you cannot assess system health in under 5 seconds, the dashboard has too much information
Alert only on actionable conditions — every alert should have a clear response. No response needed means no alert
Mobile-accessible — you will check your monitoring from your phone at 3 AM. Design for that experience
Historical data — keep at least 30 days of metrics. Pattern recognition requires history
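
One simple way to track a trend rather than a value is to fit a straight line through recent samples and look at its slope. The Python sketch below uses an ordinary least-squares fit. The function name, the sample data, and the 2 degrees per hour limit used in the example are illustrative assumptions, and the same calculation works for VRAM usage.

```python
def slope_per_hour(samples):
    """Least-squares slope of (unix_seconds, value) samples, in units per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den * 3600  # per-second slope converted to per-hour


# Hypothetical readings every 30 minutes: 70, 71, 72, 73, 74 degrees.
readings = [(i * 1800, 70.0 + i) for i in range(5)]
rising = slope_per_hour(readings) >= 2.0  # assumed limit of 2 degrees per hour
```

Every individual reading in this example would pass a fixed 75 degree threshold, but the fitted slope of about 2 degrees per hour flags the drift. Fitting over a window of samples instead of comparing two points keeps a single noisy reading from triggering the check.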

Monitoring is not the exciting part of running an AI platform. Nobody signs up for ZSky AI because of my monitoring dashboard. But every time the system catches a problem before it affects users, every time I optimize based on data rather than guessing, every time I sleep through the night because I trust my alerts — the monitoring has paid for itself many times over.
