A Rust and gRPC server for large language model text generation inference.
## Load Tests for BLOOM

See `k6/load_test.js`. We send the default examples with a 1 second delay between requests.

## Features

- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- [Dynamic batching of incoming requests](https://github.com/huggingface/text-generation-inference/blob/main/router/src/batcher.rs#L88) for increased total throughput
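The dynamic batching feature is implemented in the router's `batcher.rs`, linked above. The sketch below is only a rough illustration of the general idea, written with the standard library alone. It groups queued requests into a batch that closes when it reaches a size cap or when a short wait expires. This is a simplified size-or-timeout batcher, not the router's actual logic. `Request`, `next_batch`, `max_batch_size`, and `max_wait` are hypothetical names introduced here and are not part of the server's API.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical inference request (illustration only; carries just an id).
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub id: u64,
}

/// Hypothetical batcher: blocks for the first request, then keeps adding
/// queued requests until `max_batch_size` is reached or `max_wait` has
/// elapsed since the first one arrived. Returns `None` once the channel
/// is closed and empty.
pub fn next_batch(
    rx: &Receiver<Request>,
    max_batch_size: usize,
    max_wait: Duration,
) -> Option<Vec<Request>> {
    // Wait (blocking) until at least one request is available.
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    let deadline = Instant::now() + max_wait;

    while batch.len() < max_batch_size {
        let now = Instant::now();
        if now >= deadline {
            break; // waited long enough; run what we have
        }
        match rx.recv_timeout(deadline - now) {
            Ok(req) => batch.push(req),
            // No new request in time, or all senders are gone.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Enqueue 10 requests, then drop the sender so the channel closes.
    let producer = thread::spawn(move || {
        for id in 0..10 {
            tx.send(Request { id }).unwrap();
        }
    });
    producer.join().unwrap();

    let mut sizes = Vec::new();
    while let Some(batch) = next_batch(&rx, 4, Duration::from_millis(50)) {
        println!("running batch of {} requests", batch.len());
        sizes.push(batch.len());
    }
    // 10 queued requests with a cap of 4 yield batches of 4, 4 and 2.
    assert_eq!(sizes, vec![4, 4, 2]);
}
```

Running one forward pass per batch rather than per request amortizes model execution across requests, which is where the throughput gain comes from. The `max_wait` bound limits how much latency a request can pick up while it waits for the batch to fill.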