Commit dd304cf1 authored by Omar Sanseviero, committed by GitHub

Remove some content from the README in favour of the documentation (#958)

parent 00b8f36f
@@ -18,71 +18,43 @@ to power Hugging Chat, the Inference API and Inference Endpoint.
## Table of contents

- [Get Started](#get-started)
- [Docker](#docker)
- [API Documentation](#api-documentation)
- [Using a private or gated model](#using-a-private-or-gated-model)
- [A note on Shared Memory](#a-note-on-shared-memory-shm)
- [Distributed Tracing](#distributed-tracing)
- [Local Install](#local-install)
- [CUDA Kernels](#cuda-kernels)
- [Optimized architectures](#optimized-architectures)
- [Run Falcon](#run-falcon)
- [Run](#run)
- [Quantization](#quantization)
- [Develop](#develop)
- [Testing](#testing)
- [Other supported hardware](#other-supported-hardware)
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and [more](https://huggingface.co/docs/text-generation-inference/supported_models). TGI implements many features, such as:
- Simple launcher to serve the most popular LLMs
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
- Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor) for more details); an example request using several of these options follows this list
- Stop sequences
- Log probabilities
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
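For instance, several of these decoding options can be combined in a single request. The call below is a minimal sketch, assuming a TGI server is already listening on 127.0.0.1:8080 (see the Docker instructions below); parameter names follow the `/generate` request schema, so double-check them against the API documentation for your version:

```shell
# Sketch: combine sampling, nucleus filtering, a repetition penalty and a stop sequence
# in one request against a locally running TGI server.
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
          "inputs": "What is Deep Learning?",
          "parameters": {
            "max_new_tokens": 50,
            "do_sample": true,
            "temperature": 0.7,
            "top_p": 0.9,
            "repetition_penalty": 1.2,
            "stop": ["\n\n"]
          }
        }'
```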
## Get Started
### Docker
For a detailed starting guide, please see the [Quick Tour](https://huggingface.co/docs/text-generation-inference/quicktour). The easiest way to get started is with the official Docker container:
```shell
model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id $model
```
**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 11.8 or higher. To run the Docker container on a machine without GPUs or CUDA support, remove the `--gpus all` flag and add `--disable-custom-kernels`. Note that CPU is not the intended platform for this project, so performance might be subpar.
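For example, a CPU-only invocation would look roughly like this (a sketch reusing the `model` and `volume` variables from above; expect much slower generation):

```shell
# Sketch: same container without GPUs; the custom CUDA kernels are disabled explicitly.
docker run --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.1.0 \
    --model-id $model --disable-custom-kernels
```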
To see all the options available to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the CLI):
```
text-generation-launcher --help
```
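Several of the features listed earlier map directly to launcher flags. As a hedged sketch (flag names as reported by `--help` for the 1.1.0 launcher; verify against your version), tensor parallelism and quantization can be enabled when starting the container:

```shell
# Sketch: shard the model across 2 GPUs and load it quantized with bitsandbytes.
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.1.0 \
    --model-id $model --num-shard 2 --quantize bitsandbytes
```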
You can then query the model using either the `/generate` or `/generate_stream` routes:

```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
```shell
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
or from Python:
```shell
pip install text-generation
```
```python
from text_generation import Client
client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)
text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
if not response.token.special:
text += response.token.text
print(text)
```
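The `text-generation` package also ships an asynchronous client; a minimal sketch, assuming the same local server:

```python
# Sketch: query the server with the asynchronous client instead of the blocking one.
import asyncio

from text_generation import AsyncClient


async def main():
    client = AsyncClient("http://127.0.0.1:8080")
    response = await client.generate("What is Deep Learning?", max_new_tokens=20)
    print(response.generated_text)


asyncio.run(main())
```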
### API documentation
@@ -241,6 +188,20 @@ the kernels by using the `DISABLE_CUSTOM_KERNELS=True` environment variable.
Be aware that the official Docker image has them enabled by default.
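For a local install this is a plain environment variable on the launcher invocation; a minimal sketch, assuming `text-generation-launcher` is already installed and on your PATH:

```shell
# Sketch: start the launcher with the custom CUDA kernels disabled.
DISABLE_CUSTOM_KERNELS=True text-generation-launcher --model-id tiiuae/falcon-7b-instruct
```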
## Optimized architectures
TGI works out of the box to serve optimized models in [this list](https://huggingface.co/docs/text-generation-inference/supported_models).
Other architectures are supported on a best-effort basis using:
`AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`
or
`AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")`
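As a minimal sketch of that best-effort path (the model id below is only a placeholder; substitute the checkpoint you want to serve):

```python
# Sketch: the transformers fallback call named above, used for architectures
# that TGI has not specifically optimized.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"  # placeholder: any causal LM checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("What is Deep Learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```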
## Run Falcon

### Run
@@ -279,10 +240,3 @@ make rust-tests
# integration tests
make integration-tests
```
## Other supported hardware
TGI is also supported on the following AI hardware accelerators:
- *Habana first-gen Gaudi and Gaudi2:* check out [this example](https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference) of how to serve models with TGI on Gaudi and Gaudi2 with [Optimum Habana](https://huggingface.co/docs/optimum/habana/index)
@@ -18,7 +18,8 @@ Text Generation Inference implements many optimizations and features, such as:
- Logits warper (temperature scaling, top-p, top-k, repetition penalty)
- Stop sequences
- Log probabilities
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output.
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance.
Text Generation Inference is used in production by multiple projects, such as:
@@ -45,4 +45,3 @@ TGI is also supported on the following AI hardware accelerators:
- *Habana first-gen Gaudi and Gaudi2:* check out this [example](https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference) of how to serve models with TGI on Gaudi and Gaudi2 with [Optimum Habana](https://huggingface.co/docs/optimum/habana/index)