<div align="center">

<a href="https://www.youtube.com/watch?v=jlMAX2Oaht0">
  <img width=560 height=315 alt="Making TGI deployment optimal" src="https://huggingface.co/datasets/Narsil/tgi_assets/resolve/main/thumbnail.png">
</a>

# Text Generation Inference

<a href="https://github.com/huggingface/text-generation-inference">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/huggingface/text-generation-inference?style=social">
</a>
<a href="https://huggingface.github.io/text-generation-inference">
  <img alt="Swagger API documentation" src="https://img.shields.io/badge/API-Swagger-informational">
</a>

A Rust, Python and gRPC server for text generation inference. Used in production at [Hugging Face](https://huggingface.co)
to power Hugging Chat, the Inference API and Inference Endpoints.

</div>

## Table of contents

  - [Get Started](#get-started)
    - [Docker](#docker)
    - [API documentation](#api-documentation)
    - [Using a private or gated model](#using-a-private-or-gated-model)
    - [A note on Shared Memory (shm)](#a-note-on-shared-memory-shm)
    - [Distributed Tracing](#distributed-tracing)
    - [Architecture](#architecture)
    - [Local install](#local-install)
    - [Local install (Nix)](#local-install-nix)
  - [Optimized architectures](#optimized-architectures)
  - [Run locally](#run-locally)
    - [Run](#run)
    - [Quantization](#quantization)
  - [Develop](#develop)
  - [Testing](#testing)

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and [more](https://huggingface.co/docs/text-generation-inference/supported_models). TGI implements many features, such as:

- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) compatible with the OpenAI Chat Completions API
- Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with:
  - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
  - [GPT-Q](https://arxiv.org/abs/2210.17323)
  - [EETQ](https://github.com/NetEase-FuXi/EETQ)
  - [AWQ](https://github.com/casper-hansen/AutoAWQ)
  - [Marlin](https://github.com/IST-DASLab/marlin)
  - [fp8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
- Stop sequences
- Log probabilities
- [Speculation](https://huggingface.co/docs/text-generation-inference/conceptual/speculation) for ~2x lower latency
- [Guidance/JSON](https://huggingface.co/docs/text-generation-inference/conceptual/guidance). Specify an output format to speed up inference and ensure the output is valid according to a given specification.
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
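
As an illustration of how the logits warpers above combine, here is a plain-Python sketch (not TGI's actual implementation, which lives in the server) of temperature scaling followed by top-k and top-p (nucleus) filtering:

```python
import math

def warp_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Illustrative logits warper: temperature scaling, then top-k, then top-p.

    Returns a renormalised probability distribution; filtered tokens get 0.0.
    This is a teaching sketch, not TGI's production code path.
    """
    # Temperature scaling: divide each logit by the temperature.
    scaled = [l / temperature for l in logits]

    # Softmax to probabilities (subtract max for numerical stability).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k highest-probability tokens (k <= 0 keeps all).
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep = set(order if top_k <= 0 else order[:top_k])

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass >= top_p.
    if top_p < 1.0:
        cum, nucleus = 0.0, set()
        for i in order:
            if i in keep:
                nucleus.add(i)
                cum += probs[i]
                if cum >= top_p:
                    break
        keep = nucleus

    # Renormalise over the surviving tokens.
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

# Keep the two most likely tokens and renormalise their probabilities.
p = warp_logits([2.0, 1.0, 0.1], top_k=2)
```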

### Hardware support

- [Nvidia](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference)
- [AMD](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference) (`-rocm` image variants)
- [Inferentia](https://github.com/huggingface/optimum-neuron/tree/main/text-generation-inference)
- [Intel GPU](https://github.com/huggingface/text-generation-inference/pull/1475)
- [Gaudi](https://github.com/huggingface/tgi-gaudi)
- [Google TPU](https://huggingface.co/docs/optimum-tpu/howto/serving)


## Get Started

### Docker

For a detailed starting guide, please see the [Quick Tour](https://huggingface.co/docs/text-generation-inference/quicktour). The easiest way of getting started is using the official Docker container:

```shell
model=HuggingFaceH4/zephyr-7b-beta
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.4.1 --model-id $model
```

You can then make requests like this:

```bash
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
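
The `/generate_stream` route replies with Server-Sent Events, one `data:` line per generated token. As an illustration, here is a minimal standard-library sketch for parsing such a line; the example payload is hypothetical and the exact event schema is best checked against the API documentation:

```python
import json

def parse_sse_line(line: str):
    """Parse one Server-Sent Events line into a Python dict.

    Returns None for non-data lines (comments, blank keep-alives).
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())

# Hypothetical event resembling a TGI token event (fields for illustration only):
event = parse_sse_line('data: {"token": {"id": 42, "text": " deep"}}')
```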

You can also use [TGI's Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) to obtain OpenAI Chat Completions API compatible responses.

```bash
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```
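
The same request can be made from Python. This is a sketch using only the standard library; the URL and payload mirror the curl call above, and the actual network call is commented out because it requires a running TGI server:

```python
import json
from urllib import request

# Payload mirroring the curl example above.
payload = {
    "model": "tgi",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    "stream": False,  # set True for SSE streaming, as in the curl example
    "max_tokens": 20,
}

req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = json.load(request.urlopen(req))  # needs a running TGI server
```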

**Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the `--gpus all` flag and add `--disable-custom-kernels`. Please note that CPU is not the intended platform for this project, so performance might be subpar.

**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/supported_models#supported-hardware). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.4.1-rocm --model-id $model` instead of the command above.

To see all options for serving your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the CLI):
```shell
text-generation-launcher --help
```

### API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).

### Using a private or gated model

You can use the `HF_TOKEN` environment variable to configure the token employed by
`text-generation-inference`. This allows you to gain access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

1. Go to https://huggingface.co/settings/tokens
2. Copy your CLI READ token
3. Export `HF_TOKEN=<your CLI READ token>`

or with Docker:

```shell
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your CLI READ token>

docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.4.1 --model-id $model
```

### A note on Shared Memory (shm)

[`NCCL`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html) is a communication framework used by
`PyTorch` to do distributed training/inference. `text-generation-inference` makes
use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of an `NCCL` group, `NCCL` might fall back to using the host memory if
peer-to-peer communication using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` to the above command.

If you are running `text-generation-inference` inside `Kubernetes`, you can also add Shared Memory to the container by
creating a volume with:

```yaml
- name: shm
  emptyDir:
   medium: Memory
   sizeLimit: 1Gi
```

and mounting it to `/dev/shm`.

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that
this will impact performance.

### Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address to an OTLP collector with the `--otlp-endpoint` argument. The default service name can be
overridden with the `--otlp-service-name` argument.

### Architecture

![TGI architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/TGI.png)

Detailed blog post by Adyen on TGI's inner workings: [LLM inference at scale with TGI (Martin Iglesias Goyanes - Adyen, 2024)](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)

### Local install

You can also opt to install `text-generation-inference` locally.

First [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using `conda`:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.11
conda activate text-generation-inference
```

You may also need to install Protoc.

On Linux:

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```

On macOS, using Homebrew:

```shell
brew install protobuf
```

Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

**Note:** on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

### Local install (Nix)

Another option is to install `text-generation-inference` locally using [Nix](https://nixos.org). Currently,
we only support Nix on x86_64 Linux with CUDA GPUs. When using Nix, all dependencies can
be pulled from a binary cache, removing the need to build them locally.

First follow the instructions to [install Cachix and enable the TGI cache](https://app.cachix.org/cache/text-generation-inference).
Setting up the cache is important, otherwise Nix will build many of the dependencies
locally, which can take hours.

After that you can run TGI with `nix run`:

```shell
nix run . -- --model-id meta-llama/Llama-3.1-8B-Instruct
```

**Note:** when you are using Nix on a non-NixOS system, you have to [make some symlinks](https://danieldk.eu/Nix-CUDA-on-non-NixOS-systems#make-runopengl-driverlib-and-symlink-the-driver-library)
to make the CUDA driver libraries visible to Nix packages.

For TGI development, you can use the `impure` dev shell:

```shell
nix develop .#impure

# Only needed the first time the devshell is started or after updating the protobuf.
(
cd server
mkdir text_generation_server/pb || true
python -m grpc_tools.protoc -I../proto/v3 --python_out=text_generation_server/pb \
       --grpc_python_out=text_generation_server/pb --mypy_out=text_generation_server/pb ../proto/v3/generate.proto
find text_generation_server/pb/ -type f -name "*.py" -print0 -exec sed -i -e 's/^\(import.*pb2\)/from . \1/g' {} \;
touch text_generation_server/pb/__init__.py
)
```

All development dependencies (cargo, Python, Torch, etc.) are available in this
dev shell.

## Optimized architectures

TGI works out of the box to serve optimized models for all modern architectures. They can be found in [this list](https://huggingface.co/docs/text-generation-inference/supported_models).

Other architectures are supported on a best-effort basis using:

`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`

or

`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`



## Run locally

### Run

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

### Quantization

You can also run pre-quantized weights (AWQ, GPTQ, Marlin) or quantize weights on the fly with bitsandbytes, EETQ, or fp8 to reduce the VRAM requirement:

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes
```

4-bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
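
To illustrate the idea behind 4-bit quantization, here is a simplified absmax scheme in plain Python (not bitsandbytes' actual NF4/FP4 algorithm): each group of weights is scaled so that the largest magnitude maps to the edge of the signed 4-bit range, stored as small integers, and multiplied back at dequantization time:

```python
def quantize_4bit(weights):
    """Simplified absmax 4-bit quantization sketch: returns (int levels, scale).

    Assumes at least one non-zero weight; real schemes quantize per block
    and use non-uniform code points (e.g. NF4).
    """
    scale = max(abs(w) for w in weights) / 7  # signed 4-bit range ~ [-7, 7]
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_4bit(q, scale):
    """Map the stored integer levels back to approximate float weights."""
    return [v * scale for v in q]

q, s = quantize_4bit([0.9, -0.35, 0.07, 0.63])
restored = dequantize_4bit(q, s)
```

The round trip loses at most half a quantization step per weight, which is the VRAM-for-precision trade-off the launcher flags expose.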

Read more about quantization in the [Quantization documentation](https://huggingface.co/docs/text-generation-inference/en/conceptual/quantization).

## Develop

```shell
make server-dev
make router-dev
```

## Testing

```shell
# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```