# Quick Tour

The easiest way to get started is with the official Docker container. Install Docker following [their installation instructions](https://docs.docker.com/get-docker/).
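
To confirm the installation works, you can run Docker's standard smoke test (`hello-world` is Docker's official test image):

```bash
# Prints a "Hello from Docker!" message if the installation is functional
docker run hello-world
```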

Let's say you want to deploy the [Falcon-7B Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) model with TGI. Here is how to do that:

```bash
model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.2 --model-id $model
```

<Tip warning={true}>

To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher. You can verify GPU access with the check shown right after this tip.

</Tip>
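
Before launching TGI, you can check that Docker can actually see your GPUs. A minimal sketch, assuming a CUDA base image is available (the exact tag below is illustrative; any recent `nvidia/cuda` tag works):

```bash
# Should print the same GPU table as running nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```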

TGI also supports ROCm-enabled AMD GPUs (only MI210 and MI250 are tested); details are available in the [Supported Hardware section](./supported_models#supported-hardware) and the [AMD documentation](https://rocm.docs.amd.com/en/latest/deploy/docker.html). To launch TGI on ROCm GPUs, use the following command instead:

```bash
docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.2-rocm --model-id $model
```

Once TGI is running, you can query the `generate` endpoint by sending requests. To learn more about how to query the endpoints, check the [Consuming TGI](./basic_tutorials/consuming_tgi) section, where we show examples with utility libraries and UIs.
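
Before sending your first request, you may want to confirm that the server has finished downloading and loading the model. One way is TGI's `health` route (the port matches the `-p 8080:80` mapping used above):

```bash
# Returns HTTP 200 once the server is ready to accept requests
curl -i 127.0.0.1:8080/health
```

Below you can see a simple snippet to query the endpoint.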

<inferencesnippet>
<python>

```python
import requests

headers = {
    "Content-Type": "application/json",
}

data = {
    'inputs': 'What is Deep Learning?',
    'parameters': {
        'max_new_tokens': 20,
    },
}

response = requests.post('http://127.0.0.1:8080/generate', headers=headers, json=data)
print(response.json())
# {'generated_text': '\n\nDeep Learning is a subset of Machine Learning that is concerned with the development of algorithms that can'}
```
</python>
<js>

```js
async function query() {
    const response = await fetch(
        'http://127.0.0.1:8080/generate', 
        {
            method: 'POST',
            headers: { 'Content-Type': 'application/json'},
            body: JSON.stringify({
                'inputs': 'What is Deep Learning?',
                'parameters': {
                    'max_new_tokens': 20
                }
            })
        }
    );
    // Return the parsed JSON so the caller can log it
    return await response.json();
}

query().then((response) => {
    console.log(JSON.stringify(response));
});
// {"generated_text":"\n\nDeep Learning is a subset of Machine Learning that is concerned with the development of algorithms that can"}
```

</js>
<curl>

```curl
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```

</curl>
</inferencesnippet>
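
If you prefer to stream tokens as they are generated, TGI also exposes a `generate_stream` route that returns server-sent events. A minimal sketch mirroring the request above:

```bash
# Streams partial results as server-sent events instead of one final JSON object
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```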

<Tip>

To see all possible deployment flags and options, you can use the `--help` flag. It's possible to configure the number of shards, quantization, generation parameters, and more; an example follows this tip.

```bash
docker run ghcr.io/huggingface/text-generation-inference:1.2 --help
```

</Tip>
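
For example, here is a sketch of launching the same model sharded across two GPUs with bitsandbytes quantization (`--num-shard` and `--quantize` are among the flags listed by `--help`; the values below are illustrative):

```bash
# Reuses the $model and $volume variables defined earlier
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.2 \
    --model-id $model --num-shard 2 --quantize bitsandbytes
```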