Chinese tech giant Alibaba has published a paper detailing the scheduling tech it used to achieve impressive utilization improvements across the GPU fleet that powers its inferencing workloads – which is nice, but not a breakthrough that will worry AI investors.
Titled “Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market”, the paper [PDF] opens by pointing out that model-mart Hugging Face lists over a million AI models, although customers mostly run just a few of them. Alibaba Cloud nonetheless offers many models but found it had to dedicate 17.7 percent of its GPU fleet to serving just 1.35 percent of customer requests.
The reason for that discrepancy is that service providers typically configure their GPUs to run only two or three models, which is all that a GPU's memory can hold at once. The arithmetic is sketched below.
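For a sense of why two or three models is the ceiling, here's a rough back-of-the-envelope sketch. All figures are illustrative assumptions, not numbers from Alibaba's paper: at FP16 precision, weights alone cost two bytes per parameter, and each loaded model also needs headroom for the KV cache its in-flight requests generate.

```python
# Rough sketch (illustrative assumptions, not figures from the Aegaeon paper):
# how many models' weights fit in one GPU's memory at FP16 precision?

GPU_MEMORY_GB = 96          # assumed HBM capacity of a 96 GB-class accelerator
BYTES_PER_PARAM = 2         # FP16/BF16 weights: two bytes per parameter
KV_CACHE_RESERVE_GB = 20    # assumed headroom for KV cache and activations

def weight_footprint_gb(params_billion: float) -> float:
    """Approximate memory needed just to hold a model's weights."""
    return params_billion * 1e9 * BYTES_PER_PARAM / 1e9

usable = GPU_MEMORY_GB - KV_CACHE_RESERVE_GB
for params in (7, 14, 32):
    fits = int(usable // weight_footprint_gb(params))
    print(f"{params}B model: ~{weight_footprint_gb(params):.0f} GB of weights, "
          f"~{fits} fit after cache headroom")
```

On those assumed numbers, a 14B-parameter model's weights alone occupy roughly 28 GB, so two copies all but exhaust a 96 GB card once cache headroom is set aside – about the two-or-three-model ceiling described above.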