"description":"A guideline to be used in the chat_template",
"default":"null",
"example":"null",
"nullable":true
},
"logit_bias":{
"type":"array",
"items":{
...
...
@@ -854,7 +991,8 @@
"model":{
"type":"string",
"description":"[UNUSED] ID of the model to use. See the model endpoint compatibility table for details on which models work with the Chat API.",
"example":"mistralai/Mistral-7B-Instruct-v0.2"
"example":"mistralai/Mistral-7B-Instruct-v0.2",
"nullable":true
},
"n":{
"type":"integer",
...
...
@@ -899,6 +1037,14 @@
"stream":{
"type":"boolean"
},
"stream_options":{
"allOf":[
{
"$ref":"#/components/schemas/StreamOptions"
}
],
"nullable":true
},
"temperature":{
"type":"number",
"format":"float",
...
...
@@ -909,7 +1055,7 @@
"tool_choice":{
"allOf":[
{
"$ref":"#/components/schemas/ToolType"
"$ref":"#/components/schemas/ToolChoice"
}
],
"nullable":true
...
...
@@ -917,7 +1063,7 @@
"tool_prompt":{
"type":"string",
"description":"A prompt to be appended before the tools",
"example":"\"You will be presented with a JSON schema representing a set of tools.\nIf the user request lacks of sufficient information to make a precise tool selection: Do not invent any tool's properties, instead notify with an error message.\n\nJSON Schema:\n\"",
"example":"Given the functions available, please respond with a JSON for a function call with its proper arguments that best answers the given prompt. Respond in the format {name: function name, parameters: dictionary of argument name and its value}.Do not use variables.",
"nullable":true
},
"tools":{
...
...
@@ -1116,7 +1262,6 @@
"CompletionRequest":{
"type":"object",
"required":[
"model",
"prompt"
],
"properties":{
...
...
@@ -1138,7 +1283,8 @@
"model":{
"type":"string",
"description":"UNUSED\nID of the model to use. See the model endpoint compatibility table for details on which models work with the Chat API.",
"description":"If set, an additional chunk will be streamed before the data: [DONE] message. The usage field on this chunk shows the token usage statistics for the entire request, and the choices field will always be an empty array. All other chunks will also include a usage field, but with a null value.",
"example":"true"
}
}
},
"StreamResponse":{
"type":"object",
"required":[
...
...
@@ -1834,6 +2110,23 @@
}
}
},
"TextMessage":{
"type":"object",
"required":[
"role",
"content"
],
"properties":{
"content":{
"type":"string",
"example":"My name is David and I"
},
"role":{
"type":"string",
"example":"user"
}
}
},
"Token":{
"type":"object",
"required":[
...
...
@@ -1906,15 +2199,64 @@
}
}
},
"ToolCallDelta":{
"type":"object",
"required":[
"role",
"tool_calls"
],
"properties":{
"role":{
"type":"string",
"example":"assistant"
},
"tool_calls":{
"$ref":"#/components/schemas/DeltaToolCall"
}
}
},
"ToolCallMessage":{
"type":"object",
"required":[
"role",
"tool_calls"
],
"properties":{
"role":{
"type":"string",
"example":"assistant"
},
"tool_calls":{
"type":"array",
"items":{
"$ref":"#/components/schemas/ToolCall"
}
}
}
},
"ToolChoice":{
"allOf":[
{
"$ref":"#/components/schemas/ToolType"
}
],
"nullable":true
},
"ToolType":{
"oneOf":[
{
"type":"object",
"default":null,
"nullable":true
"type":"string",
"description":"Means the model can pick between generating a message or calling one or more tools.",
"enum":[
"auto"
]
},
{
"type":"string"
"type":"string",
"description":"Means the model will not call any tool and instead generates a message.",
"enum":[
"none"
]
},
{
"type":"object",
...
...
@@ -1927,7 +2269,20 @@
}
}
}
]
],
"description":"Controls which (if any) tool is called by the model.",
@@ -10,7 +10,7 @@ This diagram shows well there are these separate components:
- **The router**, also named `webserver`, that receives the client requests, buffers them, creates some batches, and prepares gRPC calls to a model server.
- **The model server**, responsible for receiving the gRPC requests and running inference on the model. If the model is sharded across multiple accelerators (e.g., multiple GPUs), the model server shards might be synchronized via NCCL or equivalent.
- **The launcher** is a helper thar will be able to launch one or several model servers (if model is sharded), and it launches the router with the compatible arguments.
- **The launcher** is a helper that will be able to launch one or several model servers (if model is sharded), and it launches the router with the compatible arguments.
The router and the model server can be two different machines; they do not need to be deployed together.
...
...
@@ -103,6 +103,7 @@ Several variants of the model server exist that are actively supported by Huggin
- By default, the model server will attempt building [a server optimized for Nvidia GPUs with CUDA](https://huggingface.co/docs/text-generation-inference/installation_nvidia). The code for this version is hosted in the [main TGI repository](https://github.com/huggingface/text-generation-inference).
- A [version optimized for AMD with ROCm](https://huggingface.co/docs/text-generation-inference/installation_amd) is hosted in the main TGI repository. Some model features differ.
- A [version optimized for Intel GPUs](https://huggingface.co/docs/text-generation-inference/installation_intel) is hosted in the main TGI repository. Some model features differ.
- The [version for Intel Gaudi](https://huggingface.co/docs/text-generation-inference/installation_gaudi) is maintained on a forked repository, often resynchronized with the main [TGI repository](https://github.com/huggingface/tgi-gaudi).
- A [version for Neuron (AWS Inferentia2)](https://huggingface.co/docs/text-generation-inference/installation_inferentia) is maintained as part of [Optimum Neuron](https://github.com/huggingface/optimum-neuron/tree/main/text-generation-inference).
- A version for Google TPUs is maintained as part of [Optimum TPU](https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference).
There are many ways you can consume Text Generation Inference server in your applications. After launching, you can use the `/generate` route and make a `POST` request to get results from the server. You can also use the `/generate_stream` route if you want TGI to return a stream of tokens. You can make the requests using the tool of your preference, such as curl, Python or TypeScrpt. For a final end-to-end experience, we also open-sourced ChatUI, a chat interface for open-source models.
There are many ways to consume the Text Generation Inference (TGI) server in your applications. After launching the server, you can use the [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) `/v1/chat/completions` route and make a `POST` request to get results from the server. You can also pass `"stream": true` to the call if you want TGI to return a stream of tokens.
For more information on the API, consult the OpenAPI documentation of `text-generation-inference` available [here](https://huggingface.github.io/text-generation-inference).
You can make the requests using any tool of your preference, such as curl, Python, or TypeScript. For an end-to-end experience, we've open-sourced [ChatUI](https://github.com/huggingface/chat-ui), a chat interface for open-access models.
## curl
After the launch, you can query the model using either the `/generate` or `/generate_stream` routes:
After a successful server launch, you can query the model using the `v1/chat/completions` route to get responses that comply with the OpenAI Chat Completion spec:
```bash
curl 127.0.0.1:8080/generate \
curl localhost:8080/v1/chat/completions \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-d '{
"model": "tgi",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is deep learning?"
}
],
"stream": true,
"max_tokens": 20
}' \
-H 'Content-Type: application/json'
```
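The same request can be assembled in Python with only the standard library. This is a minimal sketch, not part of the TGI docs: the helper name `build_chat_request` is hypothetical, and it assumes the server listens on `localhost:8080` as in the curl example. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, max_tokens=20, stream=False):
    """Build (but do not send) a POST request for the /v1/chat/completions route.

    `build_chat_request` is a hypothetical helper written for this example.
    """
    payload = {
        "model": "tgi",
        "messages": messages,
        "stream": stream,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8080",  # assumed address, same as the curl example
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
)

# Sending requires a running TGI server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

The sketch sets `stream` to `False` by default so that the commented-out call would receive a single JSON body rather than a stream of events.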
## Inference Client
[`huggingface-hub`](https://huggingface.co/docs/huggingface_hub/main/en/index) is a Python library to interact with the Hugging Face Hub, including its endpoints. It provides a nice high-level class, [`~huggingface_hub.InferenceClient`], which makes it easy to make calls to a TGI endpoint. `InferenceClient` also takes care of parameter validation and provides a simple to-use interface.
You can simply install `huggingface-hub` package with pip.
For non-chat use-cases, you can also use the `/generate` and `/generate_stream` routes.
```bash
pip install huggingface-hub
```
Once you start the TGI server, instantiate `InferenceClient()` with the URL to the endpoint serving the model. You can then call `text_generation()` to hit the endpoint through Python.
client.text_generation(prompt="Write a code for snake game")
curl 127.0.0.1:8080/generate \
-X POST \
-d '{
"inputs":"What is Deep Learning?",
"parameters":{
"max_new_tokens":20
}
}' \
-H 'Content-Type: application/json'
```
You can do streaming with `InferenceClient` by passing `stream=True`. Streaming will return tokens as they are being generated in the server. To use streaming, you can do as follows:
## Python
```python
for token in client.text_generation("How do you make cheese?", max_new_tokens=12, stream=True):
    print(token)
```
### Inference Client
Another parameter you can use with TGI backend is `details`. You can get more details on generation (tokens, probabilities, etc.) by setting `details` to `True`. When it's specified, TGI will return a `TextGenerationResponse` or `TextGenerationStreamResponse` rather than a string or stream.
[`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/main/en/index) is a Python library to interact with the Hugging Face Hub, including its endpoints. It provides a high-level class, [`huggingface_hub.InferenceClient`](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.InferenceClient), which makes it easy to make calls to TGI's Messages API. `InferenceClient` also takes care of parameter validation and provides a simple-to-use interface.
```python
output=client.text_generation(prompt="Meaning of life is",details=True)
print(output)
Install the `huggingface_hub` package via pip.
# TextGenerationResponse(generated_text=' a complex concept that is not always clear to the individual. It is a concept that is not always', details=Details(finish_reason=<FinishReason.Length: 'length'>, generated_tokens=20, seed=None, prefill=[], tokens=[Token(id=267, text=' a', logprob=-2.0723474, special=False), Token(id=11235, text=' complex', logprob=-3.1272552, special=False), Token(id=17908, text=' concept', logprob=-1.3632495, special=False),..))
```bash
pip install huggingface_hub
```
You can see how to stream below.
You can now use `InferenceClient` the exact same way you would use the `OpenAI` client in Python:
```python
output=client.text_generation(prompt="Meaning of life is",stream=True,details=True)
{"role":"system","content":"You are a helpful assistant."},
{"role":"user","content":"Count to 10"},
],
stream=True,
max_tokens=1024,
)
for chunk in output:
    print(chunk.choices[0].delta.content)
```
You can check out the details of the function [here](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_generation). There is also an async version of the client, `AsyncInferenceClient`, based on `asyncio` and `aiohttp`. You can find docs for it [here](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.AsyncInferenceClient)
You can check out more details about OpenAI compatibility [here](https://huggingface.co/docs/huggingface_hub/en/guides/inference#openai-compatibility).
There is also an async version of the client, `AsyncInferenceClient`, based on `asyncio` and `aiohttp`. You can find docs for it [here](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.AsyncInferenceClient).
## ChatUI
### OpenAI Client
ChatUI is an open-source interface built for LLM serving. It offers many customization options, such as web search with SERP API and more. ChatUI can automatically consume the TGI server and even provides an option to switch between different TGI endpoints. You can try it out at [Hugging Chat](https://huggingface.co/chat/), or use the [ChatUI Docker Space](https://huggingface.co/new-space?template=huggingchat/chat-ui-template) to deploy your own Hugging Chat to Spaces.
You can directly use the OpenAI [Python](https://github.com/openai/openai-python) or [JS](https://github.com/openai/openai-node) clients to interact with TGI.
To serve both ChatUI and TGI in same environment, simply add your own endpoints to the `MODELS` variable in `.env.local` file inside the `chat-ui` repository. Provide the endpoints pointing to where TGI is served.
Gradio is a Python library that helps you build web applications for your machine learning models with a few lines of code. It has a `ChatInterface` wrapper that helps create neat UIs for chatbots. Let's take a look at how to create a chatbot with streaming mode using TGI and Gradio. Let's install Gradio and Hub Python library first.
...
...
@@ -89,19 +133,28 @@ Assume you are serving your model on port 8080, we will query through [Inference
You can read more about how to customize a `ChatInterface` [here](https://www.gradio.app/guides/creating-a-chatbot-fast).
[ChatUI](https://github.com/huggingface/chat-ui) is an open-source interface built for consuming LLMs. It offers many customization options, such as web search with SERP API and more. ChatUI can automatically consume the TGI server and even provides an option to switch between different TGI endpoints. You can try it out at [Hugging Chat](https://huggingface.co/chat/), or use the [ChatUI Docker Space](https://huggingface.co/new-space?template=huggingchat/chat-ui-template) to deploy your own Hugging Chat to Spaces.
## API documentation
To serve both ChatUI and TGI in the same environment, simply add your own endpoints to the `MODELS` variable in the `.env.local` file inside the `chat-ui` repository. Provide the endpoints pointing to where TGI is served.
You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route. The Swagger UI is also available [here](https://huggingface.github.io/text-generation-inference).
@@ -4,7 +4,7 @@ Text Generation Inference improves the model in several aspects.
## Quantization
TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes), [GPT-Q](https://arxiv.org/abs/2210.17323) and[AWQ](https://arxiv.org/abs/2306.00978) quantization. To speed up inference with quantization, simply set `quantize` flag to `bitsandbytes`, `gptq` or`awq` depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the models [here](https://huggingface.co/models?search=gptq) when using AWQ quantization, you need to point to one of the models[here](https://huggingface.co/models?search=awq). To get more information about quantization, please refer to [quantization guide](./../conceptual/quantization)
TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes), [GPT-Q](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [Marlin](https://github.com/IST-DASLab/marlin), [EETQ](https://github.com/NetEase-FuXi/EETQ), [EXL2](https://github.com/turboderp/exllamav2), and [fp8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/) quantization. To speed up inference with quantization, simply set the `quantize` flag to `bitsandbytes`, `gptq`, `awq`, `marlin`, `exl2`, `eetq` or `fp8` depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the models [here](https://huggingface.co/models?search=gptq). Similarly, when using AWQ quantization, you need to point to one of [these models](https://huggingface.co/models?search=awq). To get more information about quantization, please refer to the [quantization guide](./../conceptual/quantization).
@@ -4,7 +4,7 @@ Text Generation Inference (TGI) now supports [JSON and regex grammars](#grammar-
These features are available starting from version `1.4.3`. They are accessible via the [`huggingface_hub`](https://pypi.org/project/huggingface-hub/) library. The tool support is compatible with OpenAI's client libraries. The following guide will walk you through the new features and how to use them!
_note: guidance is supported as grammar in the `/generate` endpoint and as tools in the `/chat/completions` endpoint._
_note: guidance is supported as grammar in the `/generate` endpoint and as tools in the `v1/chat/completions` endpoint._
## How it works
...
...
@@ -157,7 +157,12 @@ from huggingface_hub import InferenceClient
To use a locally stored LoRA adapter, use `adapter-name=/path/to/adapter`, as seen below. When you want to use this adapter, set `"parameters": {"adapter_id": "adapter-name"}`.
In the server logs, you will see the following message:
```txt
...
...
@@ -60,6 +78,22 @@ curl 127.0.0.1:3000/generate \
}'
```
If you are using a lora adapter stored locally that was set in the following manner: `LORA_ADAPTERS=myadapter=/some/path/to/adapter`, here is an example payload:
```json
curl 127.0.0.1:3000/generate \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"inputs":"Hello who are you?",
"parameters":{
"max_new_tokens":40,
"adapter_id":"myadapter"
}
}'
```
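The payload above can also be built programmatically. The following is a minimal sketch, assuming the adapter was registered under the name `myadapter` as in the example; `build_generate_payload` is a hypothetical helper, not a TGI or `huggingface_hub` function.

```python
import json


def build_generate_payload(prompt, max_new_tokens=40, adapter_id=None):
    """Return the JSON body for the /generate route.

    Hypothetical helper for illustration. `adapter_id` is only added when
    given, so the same function covers requests to the base model.
    """
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}


payload = build_generate_payload("Hello who are you?", adapter_id="myadapter")
print(json.dumps(payload, indent=2))
```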
> **Note:** The Lora feature is new and still being improved. If you encounter any issues or have any feedback, please let us know by opening an issue on the [GitHub repository](https://github.com/huggingface/text-generation-inference/issues/new/choose). Additionally documentation and an improved client library will be published soon.
An updated tutorial with detailed examples will be published soon. Stay tuned!
TGI offers GPTQ and bits-and-bytes quantization to quantize large language models.
TGI offers many quantization schemes to run LLMs effectively and fast based on your use-case. TGI supports GPTQ, AWQ, bits-and-bytes, EETQ, Marlin, EXL2 and fp8 quantization.
To leverage GPTQ, AWQ, Marlin and EXL2 quants, you must provide pre-quantized weights. Whereas for bits-and-bytes, EETQ and fp8, weights are quantized by TGI on the fly.
We recommend using the official quantization scripts for creating your quants:
For on-the-fly quantization you simply need to pass one of the supported quantization types and TGI takes care of the rest.
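The split between pre-quantized and on-the-fly schemes can be summarized in a small lookup. This is an illustrative sketch only: the two sets simply restate the text above, and the helper names are hypothetical rather than part of TGI.

```python
# Schemes that need pre-quantized weights, per the paragraph above.
PRE_QUANTIZED = {"gptq", "awq", "marlin", "exl2"}
# Schemes that TGI quantizes on the fly when loading the model.
ON_THE_FLY = {"bitsandbytes", "bitsandbytes-nf4", "bitsandbytes-fp4", "eetq", "fp8"}


def needs_prequantized_weights(scheme):
    """Return True if the model weights must already be quantized."""
    if scheme in PRE_QUANTIZED:
        return True
    if scheme in ON_THE_FLY:
        return False
    raise ValueError(f"unknown quantization scheme: {scheme!r}")


def quantize_args(scheme):
    """Return the launcher arguments that select `scheme` (hypothetical helper)."""
    needs_prequantized_weights(scheme)  # validates the scheme name
    return ["--quantize", scheme]


for scheme in ("gptq", "eetq"):
    kind = "pre-quantized weights" if needs_prequantized_weights(scheme) else "on the fly"
    print(scheme, "->", kind, quantize_args(scheme))
```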
## Quantization with bitsandbytes, EETQ & fp8
bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn't require a calibration dataset or any post-processing – weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.
8-bit quantization enables multi-billion parameter scale models to fit in smaller hardware without degrading performance too much.
In TGI, you can use 8-bit quantization by adding `--quantize bitsandbytes` like below 👇
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.4.0 --model-id $model --quantize bitsandbytes
```
4-bit quantization is also possible with bitsandbytes. You can choose one of the following 4-bit data types: 4-bit float (`fp4`), or 4-bit `NormalFloat` (`nf4`). These data types were introduced in the context of parameter-efficient fine-tuning, but you can apply them for inference by automatically converting the model weights on load.
In TGI, you can use 4-bit quantization by adding `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` like below 👇
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.4.0 --model-id $model --quantize bitsandbytes-nf4
```
You can get more information about 8-bit quantization by reading this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), and 4-bit quantization by reading [this blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
Similarly, you can pass `--quantize eetq` or `--quantize fp8` for the respective quantization schemes.
In addition to this, TGI allows creating GPTQ quants directly by passing the model weights and a calibration dataset.
TGI allows you to either run an already GPTQ-quantized model (see available models [here](https://huggingface.co/models?search=gptq)) or quantize a model of your choice using the quantization script. You can run a quantized model by simply passing `--quantize` like below 👇
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize gptq
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.4.0 --model-id $model --quantize gptq
```
Note that TGI's GPTQ implementation doesn't use [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) under the hood. However, models quantized using AutoGPTQ or Optimum can still be served by TGI.
...
...
@@ -36,24 +70,3 @@ You can learn more about the quantization options by running `text-generation-se
If you wish to do more with GPTQ models (e.g. train an adapter on top), you can read about transformers GPTQ integration [here](https://huggingface.co/blog/gptq-integration).
You can learn more about GPTQ from the [paper](https://arxiv.org/pdf/2210.17323.pdf).
## Quantization with bitsandbytes
bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn't require a calibration dataset or any post-processing – weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.
8-bit quantization enables multi-billion parameter scale models to fit in smaller hardware without degrading performance too much.
In TGI, you can use 8-bit quantization by adding `--quantize bitsandbytes` like below 👇
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes
```
4-bit quantization is also possible with bitsandbytes. You can choose one of the following 4-bit data types: 4-bit float (`fp4`), or 4-bit `NormalFloat` (`nf4`). These data types were introduced in the context of parameter-efficient fine-tuning, but you can apply them for inference by automatically converting the model weights on load.
In TGI, you can use 4-bit quantization by adding `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` like below 👇
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes-nf4
```
You can get more information about 8-bit quantization by reading this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), and 4-bit quantization by reading [this blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
Token streaming is the mode in which the server returns the tokens one by one as the model generates them. This enables showing progressive generations to the user rather than waiting for the whole generation. Streaming is an essential aspect of the end-user experience as it reduces latency, one of the most critical aspects of a smooth experience.
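To make the mechanics concrete, here is a minimal sketch of how a client might reassemble text from a streamed chat completion. It assumes the server uses OpenAI-style server-sent events, that is, `data: {...}` lines terminated by `data: [DONE]`. The sample lines are hand-written for illustration, not captured server output, and `iter_stream_content` is a hypothetical helper.

```python
import json


def iter_stream_content(lines):
    """Yield the text pieces carried by OpenAI-style streaming chunks.

    Assumes each event is a line of the form `data: <json>` and that the
    stream ends with `data: [DONE]`.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and anything that is not an event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a usage-only chunk with an empty `choices` array
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


# Hand-written sample events; field values are illustrative only.
sample = [
    'data: {"choices": [{"delta": {"content": "Deep"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " learning"}}]}',
    "",
    'data: {"choices": [], "usage": {"total_tokens": 12}}',
    "data: [DONE]",
]

print("".join(iter_stream_content(sample)))
```

Because the helper is a generator, a UI can render each piece as soon as it arrives instead of waiting for the full generation.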
...
...
@@ -48,34 +49,29 @@ To stream tokens with `InferenceClient`, simply pass `stream=True` and iterate o
```python
from huggingface_hub import InferenceClient
client = InferenceClient("http://127.0.0.1:8080")
for token in client.text_generation("How do you make cheese?", max_new_tokens=12, stream=True):
    print(token)
# To
# make
# cheese
#,
# you
# need
# to
# start
# with
# milk
#.
```
If you want additional details, you can add `details=True`. In this case, you get a `TextGenerationStreamResponse` which contains additional information such as the probabilities and the tokens. For the final response in the stream, it also returns the full generated text.
```python
for details in client.text_generation("How do you make cheese?", max_new_tokens=12, details=True, stream=True):
#TextGenerationStreamResponse(token=Token(id=25, text='.', logprob=-1.3408203, special=False), generated_text='\nTo make cheese, you need to start with milk.', details=StreamDetails(finish_reason=<FinishReason.Length: 'length'>, generated_tokens=12, seed=None))
messages=[{"role":"user","content":"Say this is a test"}],
stream=True,
)
async for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
asyncio.run(main())
# This
# is
# a
# test
#.
```
### Streaming with cURL
To use the `generate_stream` endpoint with curl, you can add the `-N` flag, which disables curl default buffering and shows data as it arrives from the server
To use the OpenAI Chat Completions compatible Messages API `v1/chat/completions` endpoint with curl, you can add the `-N` flag, which disables curl's default buffering and shows data as it arrives from the server:
```curl
curl -N 127.0.0.1:8080/generate_stream \
curl localhost:8080/v1/chat/completions \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
@@ -31,6 +31,12 @@ Two implementations of Flash Attention are available for ROCm, the first is [ROC
By default, the Composable Kernel implementation is used. However, the Triton implementation has slightly lower latency on MI250 and MI300, but requires a warmup which can be prohibitive as it needs to be done again for each new prompt length. If needed, the FA Triton implementation can be enabled with `--env ROCM_USE_FLASH_ATTN_V2_TRITON="0"` when launching TGI's docker container.
## Custom PagedAttention
For better performance on ROCm, a custom Paged Attention kernel is available and is enabled by default. To disable it and fall back to the PagedAttention v2 kernel, set the environment variable `ROCM_USE_CUSTOM_PAGED_ATTN=0`.
The custom kernel supports bf16 and fp16 data types, block size of 16, head size of 128, a maximum context length of 16k, and GQA ratios between 1 and 16. For other configurations, we use the PagedAttention v2 kernel.
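The constraints listed above can be read as a simple eligibility check. The sketch below is not TGI's actual dispatch logic: the function name and parameters are hypothetical, "16k" is assumed to mean 16,384 tokens, and the GQA ratio is assumed to be the number of query heads divided by the number of key/value heads.

```python
MAX_CUSTOM_KERNEL_CONTEXT = 16 * 1024  # assumption: "16k" means 16,384 tokens


def custom_paged_attention_eligible(
    dtype, block_size, head_size, max_context_len, num_query_heads, num_kv_heads
):
    """Return True if a configuration falls within the custom kernel's limits.

    Hypothetical helper that only restates the documented constraints.
    """
    if num_kv_heads <= 0 or num_query_heads % num_kv_heads != 0:
        return False
    gqa_ratio = num_query_heads // num_kv_heads
    return (
        dtype in {"bf16", "fp16"}
        and block_size == 16
        and head_size == 128
        and max_context_len <= MAX_CUSTOM_KERNEL_CONTEXT
        and 1 <= gqa_ratio <= 16
    )


# 32 query heads sharing 8 key/value heads gives a GQA ratio of 4.
print(custom_paged_attention_eligible("bf16", 16, 128, 8192, 32, 8))
# A 32k context exceeds the assumed limit, so PagedAttention v2 would be used.
print(custom_paged_attention_eligible("bf16", 16, 128, 32768, 32, 8))
```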
## Unsupported features
The following features are currently not supported in the ROCm version of TGI, and support may be extended in the future:
TGI-optimized models are supported on Intel Data Center GPU [Max1100](https://www.intel.com/content/www/us/en/products/sku/232876/intel-data-center-gpu-max-1100/specifications.html) and [Max1550](https://www.intel.com/content/www/us/en/products/sku/232873/intel-data-center-gpu-max-1550/specifications.html). The recommended usage is through Docker.
On a server powered by Intel GPUs, TGI can be launched with the following command:
```bash
model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
Intel® Extension for PyTorch (IPEX) also provides further optimizations for Intel CPUs. IPEX provides optimized operations such as flash attention, paged attention, Add + LayerNorm, ROPE and more.
On a server powered by Intel CPU, TGI can be launched with the following command:
```bash
model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
TGI supports various hardware. Make sure to check the [Using TGI with Nvidia GPUs](./installation_nvidia), [Using TGI with AMD GPUs](./installation_amd), [Using TGI with Gaudi](./installation_gaudi), [Using TGI with Inferentia](./installation_inferentia) guides depending on which hardware you would like to deploy TGI on.
TGI supports various hardware. Make sure to check the [Using TGI with Nvidia GPUs](./installation_nvidia), [Using TGI with AMD GPUs](./installation_amd), [Using TGI with Intel GPUs](./installation_intel), [Using TGI with Gaudi](./installation_gaudi), [Using TGI with Inferentia](./installation_inferentia) guides depending on which hardware you would like to deploy TGI on.
## Consuming TGI
Once TGI is running, you can use the `generate` endpoint by doing requests. To learn more about how to query the endpoints, check the [Consuming TGI](./basic_tutorials/consuming_tgi) section, where we show examples with utility libraries and UIs. Below you can see a simple snippet to query the endpoint.
Once TGI is running, you can use the `generate` endpoint or the OpenAI Chat Completion API compatible [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) by making requests. To learn more about how to query the endpoints, check the [Consuming TGI](./basic_tutorials/consuming_tgi) section, where we show examples with utility libraries and UIs. Below you can see a simple snippet to query the endpoint.
<inferencesnippet>
<python>
...
...
@@ -88,7 +96,7 @@ curl 127.0.0.1:8080/generate \
To see all possible deploy flags and options, you can use the `--help` flag. It's possible to configure the number of shards, quantization, generation parameters, and more.
```bash
docker run ghcr.io/huggingface/text-generation-inference:2.1.0 --help
docker run ghcr.io/huggingface/text-generation-inference:2.4.0 --help
Text Generation Inference (TGI) now supports the Messages API, which is fully compatible with the OpenAI Chat Completion API. This feature is available starting from version 1.4.0. You can use OpenAI's client libraries or third-party libraries expecting OpenAI schema to interact with TGI's Messages API. Below are some examples of how to utilize this compatibility.
#### Table of Contents
> **Note:** The Messages API is supported from TGI version 1.4.0 and above. Ensure you are using a compatible version to access this feature.
- [Hugging Face Inference Endpoints](#hugging-face-inference-endpoints)
- [Cloud Providers](#cloud-providers)
- [Amazon SageMaker](#amazon-sagemaker)
#### Table of Contents
The HTTP API is a RESTful API that allows you to interact with the text-generation-inference component. Two endpoints are available:
* Text Generation Inference [custom API](https://huggingface.github.io/text-generation-inference/)
* OpenAI's [Messages API](#openai-messages-api)
## Text Generation Inference custom API
- [Making a Request](#making-a-request)
- [Streaming](#streaming)
- [Synchronous](#synchronous)
- [Hugging Face Inference Endpoints](#hugging-face-inference-endpoints)
- [Cloud Providers](#cloud-providers)
- [Amazon SageMaker](#amazon-sagemaker)
Check the [API documentation](https://huggingface.github.io/text-generation-inference/) for more information on how to interact with the Text Generation Inference API.
## OpenAI Messages API
Text Generation Inference (TGI) now supports the Messages API, which is fully compatible with the OpenAI Chat Completion API. This feature is available starting from version 1.4.0. You can use OpenAI's client libraries or third-party libraries expecting OpenAI schema to interact with TGI's Messages API. Below are some examples of how to utilize this compatibility.
> **Note:** The Messages API is supported from TGI version 1.4.0 and above. Ensure you are using a compatible version to access this feature.
## Making a Request
...
...
@@ -128,9 +141,7 @@ TGI can be deployed on various cloud providers for scalable and robust text gene
## Amazon SageMaker
To enable the Messages API in Amazon SageMaker you need to set the environment variable `MESSAGES_API_ENABLED=true`.
This will modify the `/invocations` route to accept Messages dictonaries consisting out of role and content. See the example below on how to deploy Llama with the new Messages API.
Amazon SageMaker natively supports the Messages API: