- [Dynamic batching of incoming requests](https://github.com/huggingface/text-generation-inference/blob/main/router/src/batcher.rs#L88) for increased total throughput
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
To use GPUs, you will need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
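On a Debian/Ubuntu host, a minimal setup might look like the sketch below; the package name and base image tag come from NVIDIA's public instructions, so double-check the linked guide for your distribution:

```shell
# Install the toolkit and restart Docker so it picks up the NVIDIA runtime
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Sanity check: GPUs should be visible from inside a container
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```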
### API documentation
You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).
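As an illustration, once a server is running you can hit the documented `generate` route directly; the port and payload below are assumptions, so check `/docs` for the exact schema of your deployment:

```shell
# Send a single generation request to a locally running server (port assumed to be 8080)
curl -s http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'
```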
### Local install
You can also opt to install `text-generation-inference` locally. You will need to have cargo and Python installed on your machine:
```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformers fork with CUDA kernels
make run-bloom-560m # Launch a server with the BLOOM 560m model
```
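If cargo or Python are missing, the standard installers are enough; the virtual environment below is optional and only keeps the Python dependencies isolated:

```shell
# Install Rust, which provides cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Optional: isolate the Python dependencies in a virtual environment
python3 -m venv .venv
source .venv/bin/activate
```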
### CUDA Kernels
The custom CUDA kernels are only tested on NVIDIA A100s. If you have any installation or runtime issues, you can
disable them by setting the `BUILD_EXTENSIONS=False` environment variable.
Be aware that the official Docker image has them enabled by default.
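For example, to build without the custom kernels when installing locally:

```shell
# Skip compiling the custom CUDA kernels during the local install
BUILD_EXTENSIONS=False make install
```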
## Run BLOOM
### Download
First you need to download the weights:
...
...
```shell
make download-bloom
```
### Run
```shell
make run-bloom # Requires 8xA100 80GB
```
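Loading BLOOM shards the model across all eight GPUs and can take a while; a quick, generic way to confirm they are being used is to watch them from another terminal:

```shell
# Refresh GPU memory and utilization every second while the shards load
watch -n 1 nvidia-smi
```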
### Quantization
You can also quantize the weights with bitsandbytes to reduce the VRAM requirement: