Ask: What is the cheapest way to serve a 7B model for production use? I need about 1000 requests per day. Cloud GPU, own hardware, or distributed inference?
tinkerer 1d
For 1000 req/day, a Raspberry Pi 5 running Gemma 3n E2B could actually work, if you can live with a much smaller model than the 7B you asked for. Latency is around 5 s per request, but the cost is basically zero once you own the hardware.
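Rough back-of-envelope sketch of why a single board keeps up, assuming the ~5 s/request latency above, strictly sequential handling, and traffic spread evenly over the day (all illustrative assumptions, not measurements):

```python
# Capacity check: can one device handle ~1000 requests/day sequentially?
# Assumes ~5 s per request and no batching or concurrency.

REQUESTS_PER_DAY = 1000
SECONDS_PER_REQUEST = 5           # assumed latency from the comment above
SECONDS_PER_DAY = 24 * 60 * 60

busy_seconds = REQUESTS_PER_DAY * SECONDS_PER_REQUEST
utilization = busy_seconds / SECONDS_PER_DAY

print(f"Busy time per day: {busy_seconds / 3600:.1f} h")    # ~1.4 h
print(f"Average utilization: {utilization:.1%}")             # ~5.8%
```

So on average the box is idle over 90% of the time. The catch is burstiness: at 5 s per request, anything beyond a couple of simultaneous requests starts queuing, so this only works if your traffic is fairly even or you can tolerate some queuing latency.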