Blog • March 2026

Inference Optimization: How We Hit Sub-3-Second Generation

By Cemhan Biricik — Founder of ZSky AI

Speed matters more than most AI platform builders realize. When I started building ZSky AI, initial generation times were around 15-20 seconds. That felt fast compared to some competitors, but I knew it was not fast enough. Users do not want to wait. Every second of generation time is a second where they might close the tab, get distracted, or decide the experience is not worth it. Getting to sub-3-second generation was a months-long optimization journey, and I want to share what worked.

FP8 Quantization: The Biggest Single Win

The largest performance improvement came from model quantization. Running models in full FP32 precision is wasteful for inference: the model does not need 32 bits of precision to produce a good image. I run FP8 (8-bit floating point) quantized models, which store each weight in one byte instead of four. That cuts weight memory by 4x and dramatically increases throughput.

The quality impact is minimal. In blind testing, users could not distinguish FP8 outputs from FP32 outputs; the models are surprisingly tolerant of reduced precision during inference. This single optimization took generation from 15 seconds to about 6 seconds, a 60% reduction in latency with no visible quality loss.
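To make the 4x memory figure and the precision trade-off concrete, here is a minimal sketch in plain Python. It assumes the common E4M3 layout for FP8 (4 exponent bits, 3 mantissa bits, largest finite value 448) and deliberately ignores subnormals and NaN handling. The function names and the parameter count are hypothetical, chosen only for illustration; this is not the quantization code that runs in production.

```python
import math

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory taken by the weights alone, in GiB (activations not included)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def round_to_e4m3(x: float) -> float:
    """Round x to 4 significant bits (1 implicit + 3 stored) and clamp to +/-448.

    Simplified model of FP8 E4M3: subnormals and NaN are ignored.
    """
    if x == 0.0:
        return 0.0
    # frexp gives x = mantissa * 2**exponent with 0.5 <= |mantissa| < 1.
    mantissa, exponent = math.frexp(x)
    rounded = round(mantissa * 16) / 16  # keep 4 significant bits
    value = math.ldexp(rounded, exponent)
    return max(-448.0, min(448.0, value))


# Hypothetical model size, for illustration only.
params = 2_600_000_000
print(f"fp32 weights: {weight_memory_gib(params, 'fp32'):.2f} GiB")
print(f"fp8 weights:  {weight_memory_gib(params, 'fp8'):.2f} GiB")

# Each normal-range value moves by at most 1/16 (6.25%) of its magnitude.
for w in (0.3, -1.7, 0.011):
    q = round_to_e4m3(w)
    print(f"{w:+.4f} -> {q:+.6f} (relative error {abs(q - w) / abs(w):.2%})")
```

Real FP8 pipelines usually add per-tensor or per-channel scaling so that weights land in the well-represented range before rounding, which is a large part of why the visible quality loss stays small.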

Optimized Step Counts

Most diffusion models default to 20-50 denoising steps. More steps generally mean better quality, but the relationship is logarithmic, not linear. The difference between 20 and 50 steps is visible. The difference between 15 and 20 steps is subtle. The difference between 10 and 15 steps is barely perceptible for most prompts.

I spent weeks testing different step counts across thousands of prompts. The sweet spot for our models is around 10-15 steps with the right scheduler configuration. This is not just about reducing steps — it is about choosing the right noise schedule so that fewer steps produce results as good as more steps with a default schedule.
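As an illustration of what "the right noise schedule" can mean, the sketch below compares evenly spaced noise levels with the schedule popularized by Karras et al., which spends more of a small step budget at low noise levels where fine detail is resolved. This is one well-known option, not necessarily the scheduler ZSky AI uses, and the defaults (`sigma_min=0.002`, `sigma_max=80.0`, `rho=7.0`) are that paper's common values rather than tuned settings from this post.

```python
def karras_sigmas(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from sigma_max down to sigma_min, denser near sigma_min."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    min_inv = sigma_min ** (1.0 / rho)
    max_inv = sigma_max ** (1.0 / rho)
    sigmas = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # 0.0 at the first step, 1.0 at the last
        sigmas.append((max_inv + t * (min_inv - max_inv)) ** rho)
    return sigmas


def linear_sigmas(num_steps, sigma_min=0.002, sigma_max=80.0):
    """Baseline for comparison: evenly spaced noise levels."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (sigma_max - sigma_min) / (num_steps - 1)
    return [sigma_max - i * step for i in range(num_steps)]


def count_below(sigmas, threshold):
    """How many steps are spent below a given noise level."""
    return sum(1 for s in sigmas if s < threshold)


steps = 12
karras = karras_sigmas(steps)
linear = linear_sigmas(steps)
print("karras:", [round(s, 3) for s in karras])
print("linear:", [round(s, 3) for s in linear])
print("steps below sigma=1.0:", count_below(karras, 1.0), "vs", count_below(linear, 1.0))
```

With only 12 steps, the evenly spaced schedule spends almost its whole budget at high noise, while the Karras-style schedule keeps several steps for the low-noise end. That is the kind of reallocation that lets a short run hold up against a longer default one.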

Model Caching and Warm Starts

Loading a model from disk takes several seconds. If you load the model for every generation request, you are wasting time on I/O that could be spent on actual inference. I keep models loaded in GPU memory permanently. The 7-GPU cluster has enough VRAM to keep multiple model variants loaded simultaneously, so there is zero model loading time for any request.

This sounds obvious, but many platforms still load and unload models per-request because they are trying to serve multiple model types on limited GPU resources. Having seven GPUs with 32GB VRAM each means I can dedicate GPUs to specific models and never unload them.
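The idea of loading once and never unloading can be sketched as a small pool that pins each model to a GPU at startup. Everything here is hypothetical: `WarmModelPool`, the model names, and `fake_loader` are stand-ins for illustration, and a real loader would move weights into VRAM instead of returning a string.

```python
from typing import Callable, Dict, List, Tuple


class WarmModelPool:
    """Loads every model once at startup and keeps it resident afterwards."""

    def __init__(self, loader: Callable[[str, int], object],
                 assignments: Dict[str, int]) -> None:
        # assignments maps model name -> GPU index that model is pinned to.
        self._models: Dict[str, Tuple[int, object]] = {}
        for name, gpu_index in assignments.items():
            self._models[name] = (gpu_index, loader(name, gpu_index))

    def get(self, name: str) -> Tuple[int, object]:
        """Return (gpu_index, model) without touching disk."""
        return self._models[name]


# Stand-in loader that records each call so the load count can be checked.
load_log: List[Tuple[str, int]] = []


def fake_loader(name: str, gpu_index: int) -> object:
    load_log.append((name, gpu_index))
    return f"{name}@gpu{gpu_index}"


pool = WarmModelPool(fake_loader, {"image-base": 0, "image-fast": 1})

# Simulate 100 requests: none of them triggers another load.
for _ in range(100):
    gpu_index, model = pool.get("image-fast")

print("loads performed:", len(load_log))
```

The trade-off is that VRAM is committed up front, which is only practical when the cluster has enough memory to hold every variant at once.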

Intelligent Queue Distribution

Not all GPUs are equal at any given moment. One might be slightly warmer and therefore slightly slower. Another might be finishing a previous request. The queue system I built tracks the state of every GPU in real time and routes incoming requests to the GPU that will produce the fastest result.

This is not simple round-robin load balancing. It considers current GPU temperature, VRAM utilization, queue depth, and estimated completion time. The result is that no GPU sits idle while another is overloaded, and the user always gets routed to the fastest available resource.
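A minimal sketch of state-aware routing, using the four signals named above. The scoring formula, the field names, and every threshold (a 2-second job, 4 GB per job, a 15% slowdown above 80 °C) are assumptions made up for this example, not the production heuristics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuState:
    index: int
    temperature_c: float
    vram_used_gb: float
    vram_total_gb: float
    queued_jobs: int
    seconds_left_on_current_job: float


def estimated_wait(gpu: GpuState, job_seconds: float = 2.0, job_vram_gb: float = 4.0,
                   throttle_temp_c: float = 80.0, throttle_penalty: float = 1.15) -> float:
    """Estimated seconds until a new job would finish on this GPU."""
    if gpu.vram_total_gb - gpu.vram_used_gb < job_vram_gb:
        return float("inf")  # not enough free VRAM to accept the job
    # Assume a hot GPU runs somewhat slower because of thermal throttling.
    slowdown = throttle_penalty if gpu.temperature_c >= throttle_temp_c else 1.0
    # Work ahead of us: the running job, the queued jobs, then our own job.
    work = gpu.seconds_left_on_current_job + (gpu.queued_jobs + 1) * job_seconds
    return work * slowdown


def pick_gpu(gpus: List[GpuState]) -> GpuState:
    """Route to the GPU with the lowest estimated completion time."""
    best = min(gpus, key=estimated_wait)
    if estimated_wait(best) == float("inf"):
        raise RuntimeError("no GPU has enough free VRAM")
    return best


gpus = [
    GpuState(0, 65.0, 20.0, 32.0, queued_jobs=1, seconds_left_on_current_job=1.0),
    GpuState(1, 84.0, 18.0, 32.0, queued_jobs=0, seconds_left_on_current_job=0.0),
    GpuState(2, 60.0, 30.0, 32.0, queued_jobs=0, seconds_left_on_current_job=0.0),
    GpuState(3, 62.0, 16.0, 32.0, queued_jobs=0, seconds_left_on_current_job=0.2),
]
print("route to GPU", pick_gpu(gpus).index)
```

In this example GPU 1 is idle but running hot, so the cooler GPU 3, which is about to finish its current job, wins by a small margin. Round-robin would not see that difference.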

Pipeline Optimization

Beyond the model itself, there are optimizations in the surrounding pipeline that matter. Text encoding (converting the user's prompt into model input) can be done in parallel with other preprocessing steps. Image encoding and post-processing can be pipelined so that the GPU is never waiting for the CPU. These are small wins individually — 100 milliseconds here, 200 milliseconds there — but they compound.
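The overlap between text encoding and other preprocessing can be sketched with a thread pool. The two stage functions below are hypothetical stand-ins that sleep for 50 ms each; in a real service they would call into native or GPU code that releases Python's global interpreter lock, which is the case where threads actually overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def encode_prompt(prompt: str) -> list:
    """Stand-in for text encoding: sleeps, then returns a toy 'embedding'."""
    time.sleep(0.05)
    return [len(word) for word in prompt.split()]


def prepare_latents(seed: int, size: int) -> dict:
    """Stand-in for the preprocessing that does not depend on the prompt."""
    time.sleep(0.05)
    return {"seed": seed, "size": size}


def preprocess(prompt: str, seed: int, size: int, executor: ThreadPoolExecutor):
    """Run both independent stages concurrently and wait for both results."""
    text_future = executor.submit(encode_prompt, prompt)
    latent_future = executor.submit(prepare_latents, seed, size)
    return text_future.result(), latent_future.result()


with ThreadPoolExecutor(max_workers=2) as executor:
    start = time.perf_counter()
    embedding, latents = preprocess("a red fox", seed=7, size=512, executor=executor)
    elapsed_ms = (time.perf_counter() - start) * 1000

print(embedding, latents, f"{elapsed_ms:.0f} ms")
```

Run one after the other, the two stages would take roughly the sum of their durations. Run concurrently, the wall time is closer to the slower of the two, which is where the 100 to 200 millisecond savings come from.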

The Optimization Stack, Ranked by Impact

FP8 quantization was the biggest single win, taking generation from 15 seconds to about 6, while the pipeline changes were individually small at 100 to 200 milliseconds each. The result of all these optimizations stacked together: most image generations on ZSky AI complete in 1.5-3 seconds. That is fast enough that the experience feels interactive rather than batch-processed. Users describe it as "instant" even though there is still measurable latency. Getting below that perceptual threshold was worth every hour of optimization work.