Blog • March 2026
By Cemhan Biricik — Founder of ZSky AI
The AI model landscape moves fast. A model that is state of the art today is last generation in three months. At ZSky AI, I need to upgrade models continuously without users ever noticing downtime. Here is how I do it with 7 GPUs and no ops team.
I never switch all GPUs to a new model simultaneously. The process is always rolling: load the new model on one GPU, route a small percentage of traffic to it, monitor quality and performance metrics, then gradually shift more traffic as confidence builds. The old model stays loaded until the new one is fully validated.
With 7 GPUs, this means I typically have 5-6 running the production model and 1-2 running the candidate. If the candidate fails validation, the production pool loses no capacity: candidate traffic routes back to the production GPUs instantly.
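The rolling process described above can be sketched as a weighted router. This is a minimal illustration, not the actual ZSky AI implementation: the class name `RollingRouter`, the GPU labels, and the ramp percentages in `STAGES` are all assumptions made for the example.

```python
import random


class RollingRouter:
    """Splits traffic between a production pool and a candidate pool."""

    # Hypothetical ramp: share of requests sent to the candidate per stage.
    STAGES = [0.01, 0.05, 0.25, 0.50, 1.0]

    def __init__(self, production_gpus, candidate_gpus, rng=None):
        self.production = list(production_gpus)
        self.candidate = list(candidate_gpus)
        self.stage = 0
        self.rolled_back = False
        self.rng = rng or random.Random()

    @property
    def candidate_share(self):
        # After a rollback the candidate receives no traffic at all.
        return 0.0 if self.rolled_back else self.STAGES[self.stage]

    def pick_gpu(self):
        """Choose a GPU for one request according to the current split."""
        if self.candidate and self.rng.random() < self.candidate_share:
            return self.rng.choice(self.candidate)
        return self.rng.choice(self.production)

    def promote(self):
        """Move to the next ramp stage once metrics look healthy."""
        if not self.rolled_back and self.stage < len(self.STAGES) - 1:
            self.stage += 1

    def roll_back(self):
        """Send all traffic back to the production pool immediately."""
        self.rolled_back = True


router = RollingRouter(["gpu0", "gpu1", "gpu2", "gpu3", "gpu4", "gpu5"], ["gpu6"])
router.promote()    # candidate share goes from 1% to 5%
router.roll_back()  # candidate share drops to 0
```

The sketch leaves out moving GPUs from the production pool to the candidate pool as the share grows, which a real rollout would need before the later stages.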
Before any new model touches real user traffic, it must pass automated quality checks:
Every model version gets a unique identifier tied to its weights, quantization level, and configuration. This allows instant rollback to any previous version. I keep the last 3 versions of each model on fast storage, ready to load within minutes. Older versions archive to slower storage but remain accessible.
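One way to build such an identifier is to hash the weights file together with the quantization level and the configuration. The sketch below assumes the weights live in a single file and the configuration is a JSON-serializable dict; the function name `model_version_id` and the 16-character truncation are illustrative choices, not details from this setup.

```python
import hashlib
import json


def model_version_id(weights_path, quantization, config):
    """Derive a stable ID from weights, quantization level, and config."""
    digest = hashlib.sha256()
    # Stream the weights in 1 MiB chunks so large files never sit in memory.
    with open(weights_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    # Null-byte separators keep the three fields from running together.
    digest.update(b"\0" + quantization.encode("utf-8"))
    # sort_keys makes the ID independent of dict key order.
    digest.update(b"\0" + json.dumps(config, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:16]
```

Because the ID depends only on content, reloading an archived version produces the same identifier it had when it was first deployed, which makes a rollback target unambiguous.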
Early on, I upgraded a model without proper validation and the new version produced subtly worse outputs. Not broken — just slightly less coherent. Users noticed before my metrics did. That taught me that automated quality checks must include perceptual quality metrics, not just error rates and latency.
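A gate that checks perceptual quality alongside error rate and latency might look like the sketch below. Everything specific here is an assumption: the metric names, the threshold values, and the idea that quality is summarized as one `quality_score` per model (from whatever perceptual metric is in use, where higher is better).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Hypothetical limits; tune these for the model and workload.
    max_error_rate: float = 0.01     # absolute ceiling on failed requests
    max_latency_ratio: float = 1.10  # candidate p95 vs. production p95
    min_quality_ratio: float = 0.98  # candidate quality vs. production


def gate_failures(production, candidate, thresholds=GateThresholds()):
    """Return the names of the metrics on which the candidate fails."""
    failures = []
    if candidate["error_rate"] > thresholds.max_error_rate:
        failures.append("error_rate")
    if candidate["p95_latency_s"] > production["p95_latency_s"] * thresholds.max_latency_ratio:
        failures.append("p95_latency_s")
    # A model can be fast and error-free yet still produce worse outputs.
    if candidate["quality_score"] < production["quality_score"] * thresholds.min_quality_ratio:
        failures.append("quality_score")
    return failures


production = {"error_rate": 0.002, "p95_latency_s": 2.0, "quality_score": 0.80}
candidate = {"error_rate": 0.002, "p95_latency_s": 2.1, "quality_score": 0.75}
print(gate_failures(production, candidate))  # ['quality_score']
```

In this example the candidate looks fine on error rate and latency and is rejected only by the quality comparison, which is the case that error-and-latency monitoring alone would miss.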
Another time, a model passed all quality gates but consumed 20% more VRAM under concurrent load than during single-request testing. Now my burn-in tests simulate concurrent requests, not just sequential ones. Every failure teaches you something the documentation never mentions.
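A concurrent burn-in can be sketched with a thread pool that fires requests in parallel while a background thread samples VRAM. The callables `send_request` and `read_vram_mb` are hypothetical hooks standing in for the real inference client and GPU memory query, and the budget is whatever headroom the card allows.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def burn_in(send_request, read_vram_mb, prompts, concurrency,
            vram_budget_mb, poll_s=0.05):
    """Run prompts concurrently; return (peak_vram_mb, within_budget)."""
    peak = read_vram_mb()
    done = threading.Event()

    def sample():
        # Poll VRAM until the load finishes, tracking the highest reading.
        nonlocal peak
        while not done.is_set():
            peak = max(peak, read_vram_mb())
            done.wait(poll_s)

    sampler = threading.Thread(target=sample, daemon=True)
    sampler.start()
    try:
        # Up to `concurrency` requests are in flight at the same time.
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            list(pool.map(send_request, prompts))
    finally:
        done.set()
        sampler.join()
    return peak, peak <= vram_budget_mb
```

Running the same prompts once with `concurrency=1` and once at the expected production concurrency exposes the kind of gap described above, where memory use under parallel load exceeds what sequential testing shows. Polling can miss very short spikes, so a smaller `poll_s` gives a more reliable peak.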