<div align="center">

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/jlMAX2Oaht0?si=mh7STo7c83mIL9Q_" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>

# Text Generation Inference

<a href="https://github.com/huggingface/text-generation-inference">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/huggingface/text-generation-inference?style=social">
</a>
<a href="https://huggingface.github.io/text-generation-inference">
  <img alt="Swagger API documentation" src="https://img.shields.io/badge/API-Swagger-informational">
</a>

A Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co)
to power Hugging Chat, the Inference API and Inference Endpoints.

</div>

## Table of contents

- [Get Started](#get-started)
  - [API Documentation](#api-documentation)
  - [Using a private or gated model](#using-a-private-or-gated-model)
  - [A note on Shared Memory](#a-note-on-shared-memory-shm)
  - [Distributed Tracing](#distributed-tracing)
  - [Local Install](#local-install)
  - [CUDA Kernels](#cuda-kernels)
- [Optimized architectures](#optimized-architectures)
- [Run Falcon](#run-falcon)
  - [Run](#run)
  - [Quantization](#quantization)
- [Develop](#develop)
- [Testing](#testing)

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and [more](https://huggingface.co/docs/text-generation-inference/supported_models). TGI implements many features, such as:

- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor) for more details)
- Stop sequences
- Log probabilities
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
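
The sampling features above (the logits warpers) can be sketched in plain Python. This is an illustrative re-implementation, not TGI's actual code; the repetition penalty is omitted for brevity, and `warp_logits` is a hypothetical helper name:

```python
import math

def warp_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Illustrative logits warper: temperature scaling, then top-k, then
    top-p (nucleus) filtering. Filtered positions are set to -inf so they
    receive zero probability after a softmax."""
    logits = [l / temperature for l in logits]
    if top_k > 0:
        # Keep only the top_k highest logits.
        kth = sorted(logits, reverse=True)[min(top_k, len(logits)) - 1]
        logits = [l if l >= kth else -math.inf for l in logits]
    if top_p < 1.0:
        # Keep the smallest set of tokens whose cumulative probability
        # exceeds top_p, walking tokens in descending probability.
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        keep, cum = set(), 0.0
        for i in order:
            keep.add(i)
            cum += exps[i] / total
            if cum > top_p:
                break
        logits = [l if i in keep else -math.inf for i, l in enumerate(logits)]
    return logits

# Only the two most likely tokens survive top_k=2; the rest become -inf.
filtered = warp_logits([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=2)
```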


## Get Started

### Docker

For a detailed starting guide, please see the [Quick Tour](https://huggingface.co/docs/text-generation-inference/quicktour). The easiest way of getting started is using the official Docker container:

```shell
model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id $model
```

You can then make requests such as:

```bash
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
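
The same request can also be made from Python using only the standard library. This is a minimal sketch, assuming the server started above is listening on `127.0.0.1:8080`; `build_generate_request` and `generate` are hypothetical helper names:

```python
import json
from urllib import request

def build_generate_request(inputs, max_new_tokens=20, **parameters):
    """Build the JSON body expected by the /generate route (hypothetical helper)."""
    parameters["max_new_tokens"] = max_new_tokens
    return json.dumps({"inputs": inputs, "parameters": parameters})

def generate(base_url, inputs, **parameters):
    """POST to /generate and return the decoded JSON response."""
    body = build_generate_request(inputs, **parameters).encode("utf-8")
    req = request.Request(
        base_url + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires the running server from the Docker command above):
#   generate("http://127.0.0.1:8080", "What is Deep Learning?", max_new_tokens=20)
```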

**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher. To run the Docker container on a machine without GPUs or CUDA support, remove the `--gpus all` flag and add `--disable-custom-kernels`. Note that CPU is not the intended platform for this project, so performance might be subpar.

To see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the CLI):
```shell
text-generation-launcher --help
```

### API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).
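
The API also includes a streaming route that emits tokens as Server-Sent Events (SSE). Below is a minimal sketch of decoding such a stream; the exact event layout (here a `token` object with `text` and `special` fields) should be checked against the OpenAPI documentation for your version:

```python
import json

def parse_sse_tokens(raw):
    """Collect token texts from an SSE body where each event is a 'data: {json}' line."""
    texts = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            event = json.loads(line[len("data:"):].strip())
            token = event.get("token", {})
            if not token.get("special", False):  # skip special tokens such as EOS
                texts.append(token.get("text", ""))
    return texts

# Illustrative SSE body; real events carry more fields (logprob, generated_text, ...).
sample = (
    'data: {"token": {"id": 1, "text": "Deep", "special": false}}\n'
    'data: {"token": {"id": 2, "text": " learning", "special": false}}\n'
)
print("".join(parse_sse_tokens(sample)))  # -> Deep learning
```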

### Using a private or gated model

You can use the `HUGGING_FACE_HUB_TOKEN` environment variable to configure the token used by
`text-generation-inference`, giving you access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

1. Go to https://huggingface.co/settings/tokens
2. Copy your CLI READ token
3. Export `HUGGING_FACE_HUB_TOKEN=<your CLI READ token>`

or with Docker:

```shell
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your CLI READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id $model
```

### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` makes
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of a `NCCL` group, `NCCL` might fall back to using the host memory if
peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` to the command above.

If you are running `text-generation-inference` inside `Kubernetes`, you can also add Shared Memory to the container by
creating a volume with:

```yaml
- name: shm
  emptyDir:
   medium: Memory
   sizeLimit: 1Gi
```

and mounting it to `/dev/shm`.

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.

### Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address to an OTLP collector with the `--otlp-endpoint` argument.

### Architecture

![image](https://github.com/huggingface/text-generation-inference/assets/3841370/38ba1531-ea0d-4851-b31a-a6d4ddc944b0)

### Local install

You can also opt to install `text-generation-inference` locally.

First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda`:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9
conda activate text-generation-inference
```

You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On macOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
make run-falcon-7b-instruct
```

**Note:** On some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

### CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100s. If you have any installation or runtime issues, you can remove
the kernels by using the `DISABLE_CUSTOM_KERNELS=True` environment variable.

Be aware that the official Docker image has them enabled by default.

## Optimized architectures

TGI works out of the box to serve optimized models in [this list](https://huggingface.co/docs/text-generation-inference/supported_models).

Other architectures are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`



## Run Falcon

### Run

```shell
make run-falcon-7b-instruct
```

### Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
make run-falcon-7b-instruct-quantize
```

4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
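
To illustrate what happens during quantization, here is a toy blockwise absmax scheme in pure Python. This is a simplified uniform-grid sketch, not bitsandbytes' actual NF4/FP4 codebooks (which use non-uniform quantization levels):

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one weight block to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate float weights from the quantized block."""
    return [v * scale for v in q]

# Round-trip a small block: each restored weight is within scale/2 of the original.
weights = [0.9, -0.35, 0.07, -0.7]
q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)
```

Real schemes store only the integers plus one scale per block, which is where the VRAM savings come from.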

## Develop

```shell
make server-dev
make router-dev
```

## Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```