Assaf Pinhasi
Nov 9, 2020

--

I am not sure I understand the scenario.

Do you mean: what happens when your inference server receives a huge load of requests and they start to queue up?

If you can afford to work in an async pattern (i.e. you consume requests from a queue and return results to a queue), then the queue itself is your back-pressure mechanism — the inference consumer pulls work from the queue at its own pace.

This setup also lets you batch requests and get much higher inference throughput, especially with neural networks, on both CPU and GPU.
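Here is a minimal sketch of that pattern, using Python's standard library queue as a stand-in for a real message broker (e.g. SQS, Kafka, RabbitMQ). The `model.predict` call and the request/result dict shapes are hypothetical placeholders, not a specific framework's API.

```python
import queue
import time

request_queue = queue.Queue(maxsize=1000)   # bounded queue = built-in back-pressure
result_queue = queue.Queue()

MAX_BATCH_SIZE = 32       # cap on requests per batched inference call
MAX_WAIT_SECONDS = 0.05   # how long to wait to fill a batch before running it

def consume_and_infer(model):
    while True:
        batch = [request_queue.get()]          # block until at least one request arrives
        deadline = time.time() + MAX_WAIT_SECONDS
        # Collect more requests until the batch is full or the wait budget is spent
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        # One batched call to the model instead of one call per request
        predictions = model.predict([r["input"] for r in batch])
        for req, pred in zip(batch, predictions):
            result_queue.put({"id": req["id"], "prediction": pred})
```

The consumer only pulls as fast as it can run inference, so the queue depth (rather than server memory or open connections) absorbs traffic spikes.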

If your integration needs to be synchronous (request/response), most servers can build up a backlog of requests (some handle this better than others), but they will all end up dropping requests if they are truly overwhelmed by incoming traffic.
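For the synchronous case, one common mitigation is to cap the backlog explicitly and shed excess load up front rather than letting requests time out deep in the queue. The sketch below is illustrative only: `serve_request`, `model.predict` and the 503-style response are placeholders, and real servers expose equivalent knobs as worker counts and listen backlog / queue-size settings.

```python
import threading

MAX_IN_FLIGHT = 64                               # assumed capacity; tune per deployment
in_flight = threading.Semaphore(MAX_IN_FLIGHT)

def serve_request(request, model):
    # Reject immediately if the backlog is full instead of queuing indefinitely
    if not in_flight.acquire(blocking=False):
        return {"status": 503, "error": "server overloaded, retry later"}
    try:
        return {"status": 200,
                "prediction": model.predict([request["input"]])[0]}
    finally:
        in_flight.release()
```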

Hope this helps.
