Blog • February 2026

Building a GPU Queue System from Scratch

By Cemhan Biricik — Founder of ZSky AI

When you have seven GPUs and hundreds of concurrent users, you need a system that decides which GPU handles which request. This sounds like a solved problem — job queues have existed for decades. But GPU-aware job scheduling for AI inference has specific requirements that generic queue systems do not address. So I built my own.

Why Not Use an Existing Solution

I tried Celery with Redis. I tried custom workers with RabbitMQ. I tried simple round-robin with a load balancer. Every existing solution treated GPUs as interchangeable workers. They are not. At any given moment, each GPU in my 7-GPU cluster has a different temperature, a different amount of available VRAM, a different queue depth, and a different model loaded. A queue system that ignores these variables leaves performance on the table.

The generic solutions also did not handle GPU-specific failure modes well. A GPU can run out of VRAM mid-inference. It can thermal throttle and slow down by 30%. It can throw a CUDA error and need a process restart. The queue system needs to detect all of these conditions and respond intelligently, not just retry the job on the same broken worker.
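To make the "respond intelligently" part concrete, here is a minimal sketch of a failure-aware recovery policy. The `Failure` categories come from the paragraph above. The `Recovery` fields, the `plan_recovery` name, and the specific policy for each failure type are my illustrative assumptions, not the production logic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Failure(Enum):
    OUT_OF_MEMORY = "out_of_memory"
    THERMAL_THROTTLE = "thermal_throttle"
    CUDA_ERROR = "cuda_error"


@dataclass
class Recovery:
    retry_on: Optional[int]  # GPU index to retry the job on, or None if no other GPU exists
    restart_worker: bool     # restart the worker process on the failed GPU
    pause_gpu: bool          # stop routing new jobs to the failed GPU for now


def plan_recovery(failed_gpu: int, failure: Failure, candidates: List[int]) -> Recovery:
    """Decide how to react to a failure (hypothetical policy, for illustration)."""
    # Never retry on the GPU that just failed.
    others = [gpu for gpu in candidates if gpu != failed_gpu]
    retry_on = others[0] if others else None

    if failure is Failure.CUDA_ERROR:
        # Assumed policy: the process state is suspect, so restart it and pause routing.
        return Recovery(retry_on, restart_worker=True, pause_gpu=True)
    if failure is Failure.THERMAL_THROTTLE:
        # Assumed policy: let the GPU cool down, no restart needed.
        return Recovery(retry_on, restart_worker=False, pause_gpu=True)
    # Out of memory: the GPU may still handle smaller jobs, so keep it in rotation.
    return Recovery(retry_on, restart_worker=False, pause_gpu=False)
```

The point of the sketch is that the retry target is chosen separately from what happens to the failed GPU, which is the step a generic retry-on-the-same-worker queue skips.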

The Architecture

The system I built has three components. First, a request intake layer that accepts generation requests from the web API and normalizes them into a standard job format. Second, a GPU state tracker that polls every GPU multiple times per second for temperature, VRAM usage, current workload, and health status. Third, a scheduler that matches incoming jobs to available GPUs using a scoring algorithm.
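The first two components can be sketched roughly as follows. This is a simplified illustration, not the actual ZSky AI code: the `Job` and `GpuState` field names, the `normalize_request` helper, and the `probe` callable (standing in for whatever reads the real hardware) are all assumptions.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Job:
    """Standard job format produced by the intake layer (assumed fields)."""
    model: str
    prompt: str
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    submitted_at: float = field(default_factory=time.time)


@dataclass
class GpuState:
    """One polled snapshot of a GPU (assumed fields)."""
    index: int
    temperature_c: float
    free_vram_mb: int
    queue_depth: int
    loaded_model: str
    healthy: bool = True


def normalize_request(raw: dict) -> Job:
    """Turn a raw API payload into the standard job format."""
    return Job(
        model=str(raw["model"]).strip().lower(),
        prompt=str(raw.get("prompt", "")).strip(),
    )


class GpuStateTracker:
    """Keeps the latest known state for each GPU."""

    def __init__(self, probe: Callable[[int], GpuState], gpu_count: int):
        self._probe = probe  # hypothetical function that reads one GPU
        self._gpu_count = gpu_count
        self.states: Dict[int, GpuState] = {}

    def poll_once(self) -> None:
        for index in range(self._gpu_count):
            try:
                self.states[index] = self._probe(index)
            except Exception:
                # A failed probe marks the GPU unhealthy instead of crashing the tracker.
                self.states[index] = GpuState(index, 0.0, 0, 0, "", healthy=False)
```

To poll multiple times per second, `poll_once` would run in a background loop with a short sleep between passes. The exact interval is a tuning choice I am not specifying here.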

The scoring algorithm considers: current GPU temperature (lower is better), available VRAM (more is better), queue depth (shorter is better), whether the required model is already loaded on that GPU (avoids model switching), and estimated time to completion for the current job. The GPU with the highest composite score gets the next job.
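A composite score along these lines might look like the sketch below. The five factors are the ones listed above. The weights, the 90 °C ceiling, the normalization of each term, and all names (`GpuSnapshot`, `score`, `pick_gpu`) are illustrative assumptions, not the values used in production.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuSnapshot:
    index: int
    temperature_c: float
    free_vram_mb: float
    total_vram_mb: float
    queue_depth: int
    loaded_model: str
    seconds_until_free: float  # estimated time to finish the current job


def score(gpu: GpuSnapshot, required_model: str) -> float:
    """Combine the five factors into one number; higher is better."""
    # Each term is scaled to roughly 0..1 so the weights are comparable.
    coolness = max(0.0, 1.0 - gpu.temperature_c / 90.0)    # lower temperature is better
    vram = gpu.free_vram_mb / gpu.total_vram_mb             # more free VRAM is better
    short_queue = 1.0 / (1.0 + gpu.queue_depth)             # shorter queue is better
    model_hit = 1.0 if gpu.loaded_model == required_model else 0.0  # avoids a model switch
    free_soon = 1.0 / (1.0 + gpu.seconds_until_free)        # sooner is better
    # Assumed weights: a loaded model counts most because switching models is expensive.
    return (0.2 * coolness + 0.2 * vram + 0.2 * short_queue
            + 0.3 * model_hit + 0.1 * free_soon)


def pick_gpu(gpus: List[GpuSnapshot], required_model: str) -> Optional[GpuSnapshot]:
    """Return the GPU with the highest composite score, or None if there are no GPUs."""
    if not gpus:
        return None
    return max(gpus, key=lambda gpu: score(gpu, required_model))
```

With these weights, a warmer and busier GPU that already holds the required model can still beat an idle GPU that would need a model switch, which is the trade-off the scheduler is meant to capture.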

Handling the Edge Cases

The fun part of building a queue system is all the ways it can go wrong. Here are the edge cases I had to solve:

What I Learned Building It

The most important lesson: simplicity beats cleverness in production systems. My first version of the scheduler used a complex weighted algorithm with dynamic parameter tuning. It was beautiful engineering and a nightmare to debug. The current version uses simpler heuristics that are easier to reason about and produce results that are within 5% of the "optimal" complex algorithm.

The second lesson: monitoring is not optional, it is the product. The queue system generates detailed logs for every routing decision. When a user reports slow generation, I can trace their exact request through the system — which GPU it was assigned to, why, how long each step took. This observability has been invaluable for optimization work.
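As an illustration of what one routing log entry could contain, here is a minimal sketch that emits a single JSON line per decision using Python's standard `logging` and `json` modules. The logger name, the field names, and the `log_routing_decision` helper are assumptions for the example, not the actual log schema.

```python
import json
import logging
import time
from typing import Dict

logger = logging.getLogger("gpu_queue.routing")  # hypothetical logger name


def log_routing_decision(
    job_id: str,
    chosen_gpu: int,
    scores: Dict[int, float],
    reason: str,
    timings_ms: Dict[str, float],
) -> str:
    """Record which GPU got the job, why, and how long each step took."""
    record = {
        "event": "routing_decision",
        "job_id": job_id,
        "chosen_gpu": chosen_gpu,
        "scores": scores,          # score per GPU index at decision time
        "reason": reason,          # short human-readable explanation
        "timings_ms": timings_ms,  # duration of each step, keyed by step name
        "timestamp": time.time(),
    }
    # One JSON object per line keeps the log easy to filter by job_id later.
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Tracing a slow request then comes down to filtering the log for a single `job_id` and reading its entries in timestamp order.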

Key Design Principles for GPU Queue Systems

Building the queue system was one of the most satisfying engineering projects in the entire ZSky AI stack. It is invisible to users — they just see fast, reliable generation. But behind every sub-3-second image generation is a routing decision that considered temperature, memory, load, and model state across seven GPUs. That invisible infrastructure is what makes the visible experience possible.