<div align="center">

![image](https://github.com/huggingface/text-generation-inference/assets/3841370/38ba1531-ea0d-4851-b31a-a6d4ddc944b0)

# Text Generation Inference

<a href="https://github.com/huggingface/text-generation-inference">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/huggingface/text-generation-inference?style=social">
</a>
<a href="https://github.com/huggingface/text-generation-inference/blob/main/LICENSE">
  <img alt="License" src="https://img.shields.io/github/license/huggingface/text-generation-inference">
</a>
<a href="https://huggingface.github.io/text-generation-inference">
  <img alt="Swagger API documentation" src="https://img.shields.io/badge/API-Swagger-informational">
</a>
</div>

A Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co)
to power the LLM api-inference widgets.

## Table of contents

- [Features](#features)
- [Optimized Architectures](#optimized-architectures)
- [Get Started](#get-started)
  - [Docker](#docker)
  - [API Documentation](#api-documentation)
  - [Distributed Tracing](#distributed-tracing)
  - [A note on Shared Memory](#a-note-on-shared-memory-shm)
  - [Local Install](#local-install)
  - [CUDA Kernels](#cuda-kernels)
- [Run BLOOM](#run-bloom)
  - [Download](#download)
  - [Run](#run)
  - [Quantization](#quantization)
- [Develop](#develop)
- [Testing](#testing)

## Features

- Serve the most popular Large Language Models with a simple launcher
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
- Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor) for more details)
- Stop sequences
- Log probabilities
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)

## Optimized architectures

- [BLOOM](https://huggingface.co/bigscience/bloom)
- [FLAN-T5](https://huggingface.co/google/flan-t5-xxl)
- [Galactica](https://huggingface.co/facebook/galactica-120b)
- [GPT-Neox](https://huggingface.co/EleutherAI/gpt-neox-20b)
- [Llama](https://github.com/facebookresearch/llama)
- [OPT](https://huggingface.co/facebook/opt-66b)
- [SantaCoder](https://huggingface.co/bigcode/santacoder)
- [Starcoder](https://huggingface.co/bigcode/starcoder)
- [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b)
- [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)

Other architectures are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`

## Get started

### Docker

The easiest way of getting started is using the official Docker container:

```shell
model=bigscience/bloom-560m
num_shard=2
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id $model --num-shard $num_shard
```
**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.

To see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the CLI):
```shell
text-generation-launcher --help
```

You can then query the model using either the `/generate` or `/generate_stream` routes:

```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
    -H 'Content-Type: application/json'
```

```shell
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
    -H 'Content-Type: application/json'
```
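
Generation parameters such as the logits warpers and stop sequences listed under [Features](#features) can also be set per request. A sketch of such a call (the field names below are assumed to match the OpenAPI schema served at `/docs`; check it for the exact parameters supported by your version):

```shell
# The sampling fields and "stop" below are assumed to follow the /docs schema
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":64,"do_sample":true,"temperature":0.7,"top_p":0.9,"repetition_penalty":1.1,"stop":["\n\n"]}}' \
    -H 'Content-Type: application/json'
```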

or from Python:

```shell
pip install text-generation
```

```python
from text_generation import Client

client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=17).generated_text)

text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=17):
    if not response.token.special:
        text += response.token.text
print(text)
```

### API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).

### Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address of an OTLP collector with the `--otlp-endpoint` argument.
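
For example, assuming a collector reachable from the container at `otel-collector:4317` (a placeholder address, to be replaced with your own deployment's):

```shell
# "otel-collector:4317" is a placeholder OTLP collector address
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:0.9 \
    --model-id $model --num-shard $num_shard \
    --otlp-endpoint otel-collector:4317
```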

### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` makes
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of an `NCCL` group, `NCCL` might fall back to using the host memory if
peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of shared memory and support SHM sharing, we add `--shm-size 1g` to the command above.

If you are running `text-generation-inference` inside `Kubernetes`, you can also add shared memory to the container by
creating a volume with:

```yaml
- name: shm
  emptyDir:
   medium: Memory
   sizeLimit: 1Gi
```

and mounting it to `/dev/shm`.

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.
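
For example, a sketch of the Docker command above with SHM sharing disabled instead of an enlarged `/dev/shm`:

```shell
# Disable NCCL SHM sharing through the environment instead of passing --shm-size 1g
docker run --gpus all -e NCCL_SHM_DISABLE=1 -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:0.9 \
    --model-id $model --num-shard $num_shard
```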

### Local install

You can also opt to install `text-generation-inference` locally.

First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda`:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9 
conda activate text-generation-inference
```

You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On MacOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
make run-bloom-560m
```

**Note:** on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

### CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100s. If you have any installation or runtime issues, you can remove 
the kernels by using the `BUILD_EXTENSIONS=False` environment variable.
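
For example:

```shell
# Build and install without the custom CUDA kernels
BUILD_EXTENSIONS=False make install
```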

Be aware that the official Docker image has them enabled by default.

## Run BLOOM

### Download

It is advised to download the weights ahead of time with the following command:

```shell
make download-bloom
```

### Run

```shell
make run-bloom # Requires 8xA100 80GB
```

### Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
make run-bloom-quantize # Requires 8xA100 40GB
```
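
When serving other models with the launcher or the Docker image directly, quantization can also be requested with the `--quantize` flag. A sketch (run `text-generation-launcher --help` to confirm the values supported by your version):

```shell
# "bitsandbytes" is assumed to be an accepted value for --quantize in your version
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:0.9 \
    --model-id $model --num-shard $num_shard \
    --quantize bitsandbytes
```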

## Develop

```shell
make server-dev
make router-dev
```

## Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```