Ask: What is the cheapest way to serve a 7B model for production use? I need about 1000 requests per day. Cloud GPU, own hardware, or distributed inference?
tinkerer 1d
For 1000 req/day, a Raspberry Pi 5 running Gemma 3n E2B could actually work, if you can live with a much smaller model than the 7B you asked for. Latency is around 5 s per request, but the cost is basically zero once you own the hardware.
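Rough back-of-envelope sketch of why a single board keeps up, assuming the ~5 s/request latency above, strictly sequential handling, and traffic spread evenly over the day (all illustrative assumptions, not measurements):

```python
# Capacity check: can one device handle ~1000 requests/day sequentially?
# Assumes ~5 s per request and no batching or concurrency.

REQUESTS_PER_DAY = 1000
SECONDS_PER_REQUEST = 5           # assumed latency from the comment above
SECONDS_PER_DAY = 24 * 60 * 60

busy_seconds = REQUESTS_PER_DAY * SECONDS_PER_REQUEST
utilization = busy_seconds / SECONDS_PER_DAY

print(f"Busy time per day: {busy_seconds / 3600:.1f} h")    # ~1.4 h
print(f"Average utilization: {utilization:.1%}")             # ~5.8%
```

So on average the box is idle over 90% of the time. The catch is burstiness: at 5 s per request, anything beyond a couple of simultaneous requests starts queuing, so this only works if your traffic is fairly even or you can tolerate some queuing latency.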