- [Get Started](#get-started)
  - [Docker](#docker)
  - [API Documentation](#api-documentation)
  - [Using a private or gated model](#using-a-private-or-gated-model)
  - [A note on Shared Memory](#a-note-on-shared-memory-shm)
  - [Distributed Tracing](#distributed-tracing)
  - [Local Install](#local-install)
  - [CUDA Kernels](#cuda-kernels)
- [Run Falcon](#run-falcon)
  - [Run](#run)
  - [Quantization](#quantization)
- [Develop](#develop)

...

The easiest way of getting started is using the official Docker container:

```shell
model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.3 --model-id $model
```

**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.

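If the container does not see your GPUs, a quick sanity check is to run `nvidia-smi` through Docker first; the CUDA base image tag below is only an example:

```shell
# Should print the same GPU table as running nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```
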
...

You can then query the model using either the `/generate` or `/generate_stream` routes:

```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```

```shell
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```

...

or from Python, after installing the client library (`pip install text-generation`):

```python
from text_generation import Client

client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)

text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
    if not response.token.special:
        text += response.token.text
print(text)
```

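The `text-generation` package also ships an asynchronous client. A minimal sketch, assuming `AsyncClient` mirrors the synchronous API used above:

```python
import asyncio

from text_generation import AsyncClient


async def main():
    client = AsyncClient("http://127.0.0.1:8080")

    # One-shot generation
    response = await client.generate("What is Deep Learning?", max_new_tokens=20)
    print(response.generated_text)

    # Token-by-token streaming, mirroring the synchronous example
    text = ""
    async for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
        if not response.token.special:
            text += response.token.text
    print(text)


asyncio.run(main())
```
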
...

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route.
The Swagger UI is also available at: [https://huggingface.github.io/text-generation-inference](https://huggingface.github.io/text-generation-inference).

### Using a private or gated model

You can use the `HUGGING_FACE_HUB_TOKEN` environment variable to configure the token used by `text-generation-inference`, giving it access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

1. Go to https://huggingface.co/settings/tokens

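With a read token copied from that page, one way to wire it through is to pass the environment variable to the Docker container. A minimal sketch, reusing the quickstart variables and the gated `meta-llama/Llama-2-7b-chat-hf` repository as an example:

```shell
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data
token=<your cli READ token> # placeholder: paste a Hugging Face read token here

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.3 --model-id $model
```
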
...

### Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
by setting the address to an OTLP collector with the `--otlp-endpoint` argument.

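For example, with a local install you could point the launcher at a collector on the default OTLP gRPC port; the collector address below is an assumption about your tracing setup:

```shell
# Assumes an OTLP collector listening locally on the default gRPC port 4317
text-generation-launcher --model-id tiiuae/falcon-7b-instruct --otlp-endpoint 127.0.0.1:4317
```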