Assaf Pinhasi
Nov 9, 2020

--

I am not sure I understand the scenario.

Do you mean: what happens when your inference server receives a huge load of requests and they start to queue up?

If you can afford to work in an async pattern (i.e. you consume requests from a queue and return results to a queue), then the queue itself is your back-pressure mechanism — the inference consumer pulls work from the queue at its own pace.

This setup also lets you batch requests and get much higher inference throughput, especially with neural networks, on both CPU and GPU.
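Here is a minimal sketch of that pattern, using Python's standard library queue as a stand-in for a real message broker (e.g. SQS, Kafka, RabbitMQ). The `model.predict` call and the request/result dict shapes are hypothetical placeholders, not a specific framework's API.

```python
import queue
import time

request_queue = queue.Queue(maxsize=1000)   # bounded queue = built-in back-pressure
result_queue = queue.Queue()

MAX_BATCH_SIZE = 32       # cap on requests per batched inference call
MAX_WAIT_SECONDS = 0.05   # how long to wait to fill a batch before running it

def consume_and_infer(model):
    while True:
        batch = [request_queue.get()]          # block until at least one request arrives
        deadline = time.time() + MAX_WAIT_SECONDS
        # Collect more requests until the batch is full or the wait budget is spent
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        # One batched call to the model instead of one call per request
        predictions = model.predict([r["input"] for r in batch])
        for req, pred in zip(batch, predictions):
            result_queue.put({"id": req["id"], "prediction": pred})
```

The consumer only pulls as fast as it can run inference, so the queue depth (rather than server memory or open connections) absorbs traffic spikes.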

If your integration needs to be synchronous (request/response), most servers can build up a backlog of requests (some handle this better than others), but they will all end up dropping requests if they are truly overwhelmed by incoming traffic.
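For the synchronous case, one common mitigation is to cap the backlog explicitly and shed excess load up front rather than letting requests time out deep in the queue. The sketch below is illustrative only: `serve_request`, `model.predict` and the 503-style response are placeholders, and real servers expose equivalent knobs as worker counts and listen backlog / queue-size settings.

```python
import threading

MAX_IN_FLIGHT = 64                               # assumed capacity; tune per deployment
in_flight = threading.Semaphore(MAX_IN_FLIGHT)

def serve_request(request, model):
    # Reject immediately if the backlog is full instead of queuing indefinitely
    if not in_flight.acquire(blocking=False):
        return {"status": 503, "error": "server overloaded, retry later"}
    try:
        return {"status": 200,
                "prediction": model.predict([request["input"]])[0]}
    finally:
        in_flight.release()
```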

Hope this helps.
