1. The Latency Challenge
P99 latency, the response time under which 99% of requests complete, is a critical metric for API services. A high P99 means the slowest 1% of requests are slow enough to degrade user experience and application reliability, even when median latency looks healthy.
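To make the definition concrete, here is a minimal sketch of computing P50 and P99 from a sample of request durations using the nearest-rank method. The values and the `percentile` helper are illustrative, not from our production pipeline.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative latencies in milliseconds; two slow outliers dominate P99.
latencies_ms = [120, 135, 110, 900, 140, 125, 118, 130, 122, 3100]
p50 = percentile(latencies_ms, 50)   # 125 ms
p99 = percentile(latencies_ms, 99)   # 3100 ms
```

Note how a handful of outliers leave the median nearly untouched while blowing up P99, which is exactly why the tail deserves its own metric.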
2. Smart Routing Architecture
Our smart routing system uses real-time provider performance data to make routing decisions at the request level. By maintaining a continuously updated performance profile for each provider and model combination, we can route requests to the provider most likely to deliver the fastest response.
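The core routing decision can be sketched as follows. This is a simplified illustration, not our production code: `ProviderProfile` and `choose_provider` are hypothetical names, and a real profile would carry per-model statistics rather than a single number.

```python
from dataclasses import dataclass

@dataclass
class ProviderProfile:
    """Rolling performance estimate for one provider/model combination."""
    name: str
    expected_latency_ms: float  # refreshed continuously by monitoring

def choose_provider(profiles):
    """Route the request to the provider with the lowest expected latency."""
    return min(profiles, key=lambda p: p.expected_latency_ms)

profiles = [
    ProviderProfile("provider-a", 420.0),
    ProviderProfile("provider-b", 310.0),
    ProviderProfile("provider-c", 505.0),
]
best = choose_provider(profiles)  # picks provider-b here
```

Because the decision is a pure function of the current profiles, it runs in microseconds per request; all the heavy lifting happens asynchronously when the profiles are refreshed.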
Predictive Modeling
We built a lightweight predictive model that estimates expected latency based on request characteristics including prompt length, model type, expected output length, and current provider load. This model is updated every 30 seconds with fresh performance data from our global monitoring network.
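One simple way to realize such a model, sketched below under assumptions of our own (the actual model and its features are not specified in detail): keep an exponentially weighted moving average of observed latency per provider/model pair as a base term, plus a per-token term for prompt and expected output length. The 30-second refresh corresponds to calling `update` with fresh observations from the monitoring network.

```python
class LatencyEstimator:
    """Illustrative latency predictor: EWMA base + per-token cost."""

    def __init__(self, alpha=0.3, ms_per_token=2.0):
        self.alpha = alpha              # EWMA smoothing factor
        self.ms_per_token = ms_per_token
        self.base_ms = {}               # (provider, model) -> EWMA base latency

    def update(self, provider, model, observed_ms):
        """Fold a fresh latency observation into the rolling base estimate."""
        key = (provider, model)
        prev = self.base_ms.get(key, observed_ms)
        self.base_ms[key] = self.alpha * observed_ms + (1 - self.alpha) * prev

    def predict(self, provider, model, prompt_tokens, expected_output_tokens):
        """Expected latency for a request with the given token counts."""
        base = self.base_ms.get((provider, model), 1000.0)  # prior if no data
        return base + self.ms_per_token * (prompt_tokens + expected_output_tokens)
```

A production model would also fold in current provider load and would likely be learned rather than hand-tuned, but the shape is the same: cheap to evaluate per request, updated out of band.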
3. Impact and Results
After deploying smart routing, we observed a 40% reduction in P99 latency and a 25% reduction in median latency. These improvements were achieved without any changes required from our customers — the optimization is applied transparently at the routing layer.