<div align="center">

![image](https://github.com/huggingface/text-generation-inference/assets/3841370/38ba1531-ea0d-4851-b31a-a6d4ddc944b0)

# Text Generation Inference

<a href="https://github.com/huggingface/text-generation-inference">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/huggingface/text-generation-inference?style=social">
</a>
<a href="https://github.com/huggingface/text-generation-inference/blob/main/LICENSE">
  <img alt="License" src="https://img.shields.io/github/license/huggingface/text-generation-inference">
</a>
<a href="https://huggingface.github.io/text-generation-inference">
  <img alt="Swagger API documentation" src="https://img.shields.io/badge/API-Swagger-informational">
</a>
</div>

A Rust, Python, and gRPC server for text generation inference. Used in production at [Hugging Face](https://huggingface.co)
to power the api-inference widgets for LLMs.

## Table of contents

- [Features](#features)
- [Optimized Architectures](#optimized-architectures)
- [Get Started](#get-started)
  - [Docker](#docker)
  - [API Documentation](#api-documentation)
  - [Using on private models or gated models](#using-on-private-models-or-gated-models)
  - [Distributed Tracing](#distributed-tracing)
  - [A note on Shared Memory](#a-note-on-shared-memory-shm)
  - [Local Install](#local-install)
  - [CUDA Kernels](#cuda-kernels)
- [Run BLOOM](#run-bloom)
  - [Download](#download)
  - [Run](#run)
  - [Quantization](#quantization)
- [Develop](#develop)
- [Testing](#testing)

## Features

- Serve the most popular Large Language Models with a simple launcher
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
- Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPTQ](https://arxiv.org/abs/2210.17323)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warpers (temperature scaling, top-p, top-k, repetition penalty; for more details, see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
- Stop sequences
- Log probabilities
- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
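
To make the logits-warper bullet more concrete, here is a minimal pure-Python sketch of how temperature scaling, top-k and top-p filtering are commonly combined before sampling. It is illustrative only and is not the code used by `text-generation-inference`; the function name `warp_logits` and its defaults are hypothetical.

```python
import math


def warp_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return a {token: probability} dict after temperature, top-k and top-p filtering.

    `logits` maps token -> raw score. Hypothetical helper, for illustration only.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling: values < 1 sharpen the distribution, values > 1 flatten it.
    scaled = {tok: score / temperature for tok, score in logits.items()}

    # Softmax, shifted by the maximum score for numerical stability.
    max_score = max(scaled.values())
    exps = {tok: math.exp(score - max_score) for tok, score in scaled.items()}
    total = sum(exps.values())

    # Rank tokens from most to least likely.
    ranked = sorted(
        ((tok, value / total) for tok, value in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    # Top-k: keep only the k most likely tokens (0 disables the filter), then renormalize.
    if top_k > 0:
        ranked = ranked[:top_k]
        kept_total = sum(prob for _, prob in ranked)
        ranked = [(tok, prob / kept_total) for tok, prob in ranked]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so their probabilities sum to 1.
    kept_total = sum(prob for _, prob in kept)
    return {tok: prob / kept_total for tok, prob in kept}
```

A sampler would then draw the next token from the returned distribution, for example with `random.choices(list(probs), weights=list(probs.values()))`.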

## Optimized architectures

- [BLOOM](https://huggingface.co/bigscience/bloom)
- [FLAN-T5](https://huggingface.co/google/flan-t5-xxl)
- [Galactica](https://huggingface.co/facebook/galactica-120b)
- [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b)
- [Llama](https://github.com/facebookresearch/llama)
- [OPT](https://huggingface.co/facebook/opt-66b)
- [SantaCoder](https://huggingface.co/bigcode/santacoder)
- [StarCoder](https://huggingface.co/bigcode/starcoder)
- [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b)
- [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)
- [MPT](https://huggingface.co/mosaicml/mpt-30b)
- [Llama V2](https://huggingface.co/meta-llama)

Other architectures are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`
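
As a rough sketch of this fallback path, the snippet below picks between the two `transformers` auto classes depending on whether the model is an encoder-decoder. It assumes `transformers` and `accelerate` (which `device_map="auto"` relies on) are installed; `pick_auto_class_name` and `load_fallback_model` are hypothetical helper names, and the server itself may make this choice differently.

```python
def pick_auto_class_name(is_encoder_decoder: bool) -> str:
    """Map a model's encoder-decoder flag to the matching auto class name."""
    return "AutoModelForSeq2SeqLM" if is_encoder_decoder else "AutoModelForCausalLM"


def load_fallback_model(model_id: str):
    """Load `model_id` through the generic transformers path (downloads weights)."""
    # Imported lazily so the helper above can be used without transformers installed.
    import transformers

    # The model config records whether the architecture is an encoder-decoder.
    config = transformers.AutoConfig.from_pretrained(model_id)
    auto_class = getattr(transformers, pick_auto_class_name(config.is_encoder_decoder))
    return auto_class.from_pretrained(model_id, device_map="auto")
```

With this sketch, a FLAN-T5 checkpoint would go through `AutoModelForSeq2SeqLM`, while a decoder-only checkpoint would go through `AutoModelForCausalLM`.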

## Get started

### Docker

The easiest way to get started is to use the official Docker container:

```shell
model=bigscience/bloom-560m
num_shard=2
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id $model --num-shard $num_shard
```

**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.

To see all the options for serving your models, look in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the CLI:

```shell
text-generation-launcher --help
```

You can then query the model using either the `/generate` or `/generate_stream` routes:

```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
    -H 'Content-Type: application/json'
```

```shell
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
    -H 'Content-Type: application/json'
```

or from Python:

```shell
pip install text-generation
```

```python
from text_generation import Client

client = Client("http://127.0.0.1:8080")

# Single-shot generation: print the generated text once the request completes
print(client.generate("What is Deep Learning?", max_new_tokens=17).generated_text)

# Streaming generation: accumulate tokens as they arrive, skipping special tokens
text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=17):
    if not response.token.special:
        text += response.token.text
print(text)
```
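
If you would rather not install the `text-generation` package, the same two routes can be reached with the Python standard library. The sketch below builds the POST request used by the `curl` examples and extracts token text from Server-Sent Events lines. The `data:` line framing is standard SSE; the payload shape (`token.text`, `token.special`) is assumed from the fields the Python client exposes, and all helper names are hypothetical.

```python
import json
import urllib.request


def build_request(base_url, route, prompt, max_new_tokens=17):
    """Build a POST request for the `/generate` or `/generate_stream` route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        base_url.rstrip("/") + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_text_from_sse_line(line):
    """Return the token text carried by one SSE line, or None if it carries none."""
    if isinstance(line, bytes):
        line = line.decode("utf-8")
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, or other SSE fields
    event = json.loads(line[len("data:"):])
    token = event.get("token") or {}
    if token.get("special"):
        return None  # skip special tokens, as the client example does
    return token.get("text")


def stream_text(base_url, prompt, max_new_tokens=17):
    """Concatenate streamed token texts; requires a running server."""
    request = build_request(base_url, "/generate_stream", prompt, max_new_tokens)
    pieces = []
    with urllib.request.urlopen(request) as response:
        for raw_line in response:  # the response is iterated line by line
            text = token_text_from_sse_line(raw_line)
            if text is not None:
                pieces.append(text)
    return "".join(pieces)
```

With the container from the Docker section running, `stream_text("http://127.0.0.1:8080", "What is Deep Learning?")` should return the concatenated token texts.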

### API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).

### Using on private models or gated models

You can use the `HUGGING_FACE_HUB_TOKEN` environment variable to set the token that `text-generation-inference` uses to access protected resources.
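
For example, when running the Docker container, the variable can be forwarded with Docker's `-e` flag. The sketch below only assembles the argument list and does not run Docker; `docker_command` is a hypothetical helper, and the image tag and other flags are copied from the command in the Docker section.

```python
def docker_command(model, num_shard, volume, forward_token=False):
    """Assemble the `docker run` argv from the Docker section, optionally forwarding the Hub token."""
    args = ["docker", "run", "--gpus", "all", "--shm-size", "1g", "-p", "8080:80"]
    if forward_token:
        # `-e NAME` without a value copies NAME from the host environment,
        # which keeps the token itself out of the command line.
        args += ["-e", "HUGGING_FACE_HUB_TOKEN"]
    args += [
        "-v", f"{volume}:/data",
        "ghcr.io/huggingface/text-generation-inference:0.9",
        "--model-id", model,
        "--num-shard", str(num_shard),
    ]
    return args
```

The resulting list can be passed to `subprocess.run(..., check=True)` once `HUGGING_FACE_HUB_TOKEN` is exported in the calling shell.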

### Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address to an OTLP collector with the `--otlp-endpoint` argument.

### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` makes
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of a `NCCL` group, `NCCL` might fall back to using the host memory if
peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` to the command above.

If you are running `text-generation-inference` inside `Kubernetes`, you can also add Shared Memory to the container by
creating a volume with:

```yaml
- name: shm
  emptyDir:
   medium: Memory
   sizeLimit: 1Gi
```

and mounting it to `/dev/shm`.
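
To check from inside the container that the shared-memory mount is large enough, a small standard-library probe can help. This is only a sketch: `shm_is_sufficient` is a hypothetical helper, and the 1 GiB default simply mirrors the `1g` / `1Gi` sizes used above.

```python
import shutil

ONE_GIB = 1024 ** 3


def shm_is_sufficient(path="/dev/shm", required_bytes=ONE_GIB):
    """Return True if the filesystem at `path` reports a total size of at least `required_bytes`."""
    usage = shutil.disk_usage(path)  # named tuple with total, used and free bytes
    return usage.total >= required_bytes
```

Calling `shm_is_sufficient()` before starting the launcher gives an early warning if the container was started without enough shared memory.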

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.

### Local install

You can also opt to install `text-generation-inference` locally.

First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda`:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9 
conda activate text-generation-inference
```

You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On macOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
make run-bloom-560m
```

**Note:** on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

### CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100s. If you have any installation or runtime issues, you can remove 
the kernels by using the `BUILD_EXTENSIONS=False` environment variable.

Be aware that the official Docker image has them enabled by default.

## Run BLOOM

### Download

It is advised to download the weights ahead of time with the following command:

```shell
make download-bloom
```

### Run

```shell
make run-bloom # Requires 8xA100 80GB
```

### Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
make run-bloom-quantize # Requires 8xA100 40GB
```

## Develop

```shell
make server-dev
make router-dev
```

## Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```