Blog • March 2026

Monitoring a GPU Cluster: My Dashboard and Alerts

By Cemhan Biricik — Founder of ZSky AI

Running a 7-GPU cluster for production AI inference is like being responsible for seven expensive, heat-generating, occasionally temperamental machines that your entire business depends on. Without monitoring, you are flying blind. With good monitoring, you see problems before users do. Here is the monitoring stack I built for ZSky AI and why each piece matters.

The Metrics That Actually Matter

When I first set up monitoring, I tracked everything. GPU temperature, utilization, memory, power draw, fan speed, clock speed, memory clock, PCIe bandwidth — dozens of metrics per GPU, sampled every second. The dashboard looked impressive and was completely useless. Too much data is as bad as too little.

After months of refining, I narrowed the list down to the metrics that actually predict problems.

The Dashboard

My dashboard is a custom web page that I can access from my phone. It shows real-time status for all seven GPUs in a layout I can parse in under two seconds. Each GPU gets a card showing temperature (color-coded green/yellow/red), current workload, VRAM usage bar, and inference count for the last hour. At the top, aggregate metrics: total generations today, average latency, error rate, and queue depth.
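The per-GPU card described above can be sketched as a few small functions. This is a minimal illustration, not the actual ZSky AI dashboard code: the names GpuSample, temperature_color, vram_bar and render_card are hypothetical, and the yellow/red temperature cutoffs are assumed values, since the post does not state the real ones. How samples are collected (for example via nvidia-smi or NVML) is left out.

```python
from dataclasses import dataclass

# Assumed cutoffs in degrees Celsius; the real thresholds are not given in the post.
YELLOW_AT_C = 75.0
RED_AT_C = 85.0


@dataclass
class GpuSample:
    """One reading for one GPU (hypothetical shape)."""
    index: int
    temperature_c: float
    vram_used_mib: int
    vram_total_mib: int
    workload: str
    inferences_last_hour: int


def temperature_color(temperature_c: float) -> str:
    """Map a temperature to the green/yellow/red code shown on the card."""
    if temperature_c >= RED_AT_C:
        return "red"
    if temperature_c >= YELLOW_AT_C:
        return "yellow"
    return "green"


def vram_bar(used_mib: int, total_mib: int, width: int = 20) -> str:
    """Render VRAM usage as a fixed-width text bar."""
    fraction = used_mib / total_mib if total_mib else 0.0
    fraction = min(max(fraction, 0.0), 1.0)  # clamp bad readings into [0, 1]
    filled = round(fraction * width)
    return "#" * filled + "-" * (width - filled)


def render_card(sample: GpuSample) -> dict:
    """Collect the four fields each GPU card shows."""
    return {
        "gpu": sample.index,
        "temperature_color": temperature_color(sample.temperature_c),
        "workload": sample.workload,
        "vram_bar": vram_bar(sample.vram_used_mib, sample.vram_total_mib),
        "inferences_last_hour": sample.inferences_last_hour,
    }
```

Keeping the card as a plain dictionary means the same data can feed an HTML template or a JSON endpoint without changing the classification logic.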

I built this myself rather than using Grafana or Datadog because I wanted exactly the information I need in exactly the layout I want. No authentication delays, no loading spinners, no unused panels. When I wake up at 3 AM because an alert fired, I want to see the system status in the time it takes the page to load.

The Alert System

Alerts are tiered into three levels. Informational alerts go to a log file: things like a GPU approaching its temperature threshold or VRAM usage exceeding 85%. Warning alerts send a notification: a GPU has started thermal throttling, or latency has spiked to more than twice its normal level. Critical alerts trigger immediate notifications with sound: a GPU has stopped responding, the error rate has exceeded 5%, or the system is unable to process any requests.
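The three tiers can be expressed as a single classification function. The sketch below is an illustration under stated assumptions rather than the production alerting code: Snapshot, Tier and classify are hypothetical names, "normal" latency is assumed to be a single baseline value, and the temperature threshold and the margin that counts as "approaching" it are invented for the example. The 85% VRAM, 2x latency and 5% error-rate limits come from the description above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    NONE = 0      # nothing to record
    INFO = 1      # write to the log file
    WARNING = 2   # send a notification
    CRITICAL = 3  # immediate notification with sound


@dataclass
class Snapshot:
    """One system reading (hypothetical shape)."""
    gpu_responding: bool
    serving_requests: bool
    thermal_throttling: bool
    temperature_c: float
    vram_fraction: float       # 0.0 to 1.0
    latency_s: float
    baseline_latency_s: float  # assumed definition of "normal" latency
    error_rate: float          # 0.0 to 1.0


# Assumed values; the post does not give the temperature threshold.
TEMP_THRESHOLD_C = 85.0
TEMP_MARGIN_C = 5.0


def classify(s: Snapshot) -> Tier:
    """Return the highest tier that the snapshot triggers."""
    if not s.gpu_responding or not s.serving_requests or s.error_rate > 0.05:
        return Tier.CRITICAL
    if s.thermal_throttling or s.latency_s > 2 * s.baseline_latency_s:
        return Tier.WARNING
    near_temp_limit = s.temperature_c >= TEMP_THRESHOLD_C - TEMP_MARGIN_C
    if s.vram_fraction > 0.85 or near_temp_limit:
        return Tier.INFO
    return Tier.NONE
```

Checking the most severe conditions first means a snapshot that trips several rules is reported once, at its highest tier, which keeps the number of notifications down.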

The most important rule for alerting: never alert on things that do not require action. Alert fatigue kills monitoring systems. If I get an alert, it means I need to do something. If it does not require action, it is a log entry, not an alert.

What Monitoring Taught Me About My Infrastructure

Building this monitoring system revealed patterns I never would have noticed otherwise. I discovered that GPU 4 consistently runs 3 degrees hotter than the others due to its position in the case. I found that inference latency increases by 8% between 2 PM and 6 PM when the room is warmest. I learned that one specific model configuration causes a slow VRAM leak that manifests after exactly 847 generations.
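A slow VRAM leak like the one mentioned above shows up as a steady upward trend in memory use as generations accumulate. One way to spot such a trend, sketched here as an assumption and not as the method actually used at ZSky AI, is to fit a least-squares line to (generation count, VRAM used) samples and flag a slope above some cutoff. The function names and the default cutoff are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (generation_count, vram_used_mib)


def vram_slope_mib_per_generation(samples: List[Sample]) -> float:
    """Least-squares slope of VRAM use against generation count."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return 0.0  # all samples taken at the same generation count
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return cov_xy / var_x


def looks_like_leak(samples: List[Sample], min_slope: float = 0.5) -> bool:
    """Flag a leak when VRAM grows faster than min_slope MiB per generation.

    The 0.5 default is an arbitrary illustrative cutoff.
    """
    return vram_slope_mib_per_generation(samples) > min_slope


if __name__ == "__main__":
    # Synthetic data: VRAM grows by 2 MiB per generation.
    leaking = [(g, 10_000 + 2 * g) for g in range(0, 1000, 100)]
    print(vram_slope_mib_per_generation(leaking), looks_like_leak(leaking))
```

A fitted slope is less sensitive to momentary spikes than comparing the first and last readings, which matters when VRAM use fluctuates from one request to the next.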

These are the kinds of insights that transform infrastructure management from reactive firefighting to proactive optimization. You cannot fix what you cannot see, and monitoring makes the invisible visible.

GPU Monitoring Essentials for Self-Hosted AI

Monitoring is not the exciting part of running an AI platform. Nobody signs up for ZSky AI because of my monitoring dashboard. But every time the system catches a problem before it affects users, every time I optimize based on data rather than guesswork, and every time I sleep through the night because I trust my alerts, the monitoring pays for itself many times over.