Unverified Commit 9946165e authored by OlivierDehaene, committed by GitHub

chore: add pre-commit (#1569)

parent 142cdabe
......@@ -2,29 +2,29 @@
You can use the TGI command-line interface (CLI) to download weights, serve and quantize models, or get information on serving parameters. To install the CLI, please refer to [the installation section](../installation#install-cli).
`text-generation-server` lets you download the model with the `download-weights` command like below 👇
```bash
text-generation-server download-weights MODEL_HUB_ID
```
You can also use it to quantize models like below 👇
```bash
text-generation-server quantize MODEL_HUB_ID OUTPUT_DIR
```
You can use `text-generation-launcher` to serve models.
```bash
text-generation-launcher --model-id MODEL_HUB_ID --port 8080
```
There are many options and parameters you can pass to `text-generation-launcher`. The documentation for the CLI is kept minimal and relies on self-generated documentation, which can be found by running
```bash
text-generation-launcher --help
```
You can also find it hosted in this [Swagger UI](https://huggingface.github.io/text-generation-inference/).
......
# Flash Attention
Scaling the transformer architecture is heavily bottlenecked by the self-attention mechanism, which has quadratic time and memory complexity. Recent developments in accelerator hardware mainly focus on enhancing compute capacity rather than memory or data transfer between hardware components. This leaves the attention operation with a memory bottleneck. **Flash Attention** is an attention algorithm that reduces this problem and scales transformer-based models more efficiently, enabling faster training and inference.
The standard attention mechanism uses High Bandwidth Memory (HBM) to store, read and write keys, queries and values. HBM has large capacity but is slow to access, while on-chip SRAM is smaller but much faster. In the standard attention implementation, the cost of loading and writing keys, queries, and values from HBM is high: it loads them from HBM to GPU on-chip SRAM, performs a single step of the attention mechanism, writes the result back to HBM, and repeats this for every attention step. Instead, Flash Attention loads keys, queries, and values once, fuses the operations of the attention mechanism, and writes the result back.
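As a rough illustration of the idea (an illustrative PyTorch sketch, not TGI's internal implementation; shapes and tolerances are made up), the snippet below contrasts naive attention, which materializes the full attention score matrix in memory, with PyTorch's fused `scaled_dot_product_attention`, which can dispatch to a Flash Attention kernel on supported GPUs:
```python
# Illustrative sketch only: naive attention vs. a fused attention kernel.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(1, 8, 1024, 64, device=device)   # (batch, heads, seq_len, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

# Naive attention: the (seq_len x seq_len) score matrix is fully materialized,
# written to memory, then re-read for the softmax and the final matmul.
scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused attention: the same computation in one kernel that keeps tiles of q, k, v
# in fast on-chip memory and never materializes the full score matrix.
fused_out = F.scaled_dot_product_attention(q, k, v)

torch.testing.assert_close(naive_out, fused_out, rtol=1e-3, atol=1e-3)
```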
![Flash Attention](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/flash-attn.png)
It is implemented for supported models. You can check out the complete list of models that support Flash Attention [here](https://github.com/huggingface/text-generation-inference/tree/main/server/text_generation_server/models); look for the models prefixed with `flash`.
You can learn more about Flash Attention by reading the [Flash Attention paper](https://arxiv.org/abs/2205.14135).
......@@ -4,20 +4,20 @@ TGI offers GPTQ and bits-and-bytes quantization to quantize large language model
## Quantization with GPTQ
GPTQ is a post-training quantization method that makes the model smaller. It quantizes each layer by finding a compressed version of its weights that yields the minimum mean squared error, as shown below 👇
Given a layer \\(l\\) with weight matrix \\(W_{l}\\) and layer input \\(X_{l}\\), find quantized weight \\(\\hat{W}_{l}\\):
$$\hat{W}_{l}^{*} = \underset{\hat{W}_{l}}{\mathrm{argmin}} \; \lVert W_{l}X_{l} - \hat{W}_{l}X_{l} \rVert_{2}^{2}$$
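For intuition only (a toy sketch of the objective above, not the actual GPTQ solver; the shapes, the 4-bit grid, and the round-to-nearest baseline are arbitrary assumptions), the snippet below quantizes a weight matrix naively and measures the layer-output reconstruction error that GPTQ tries to minimize:
```python
# Toy illustration of the GPTQ objective: measure ||W_l X_l - Ŵ_l X_l||² for a
# naively quantized weight. GPTQ itself searches for a Ŵ_l that minimizes this error.
import torch

torch.manual_seed(0)
W = torch.randn(16, 16)                 # layer weight W_l
X = torch.randn(16, 64)                 # calibration inputs X_l

scale = W.abs().max() / 7               # symmetric 4-bit grid: integer levels in [-7, 7]
W_hat = torch.round(W / scale).clamp(-7, 7) * scale   # round-to-nearest baseline

error = torch.linalg.norm(W @ X - W_hat @ X) ** 2     # squared reconstruction error
print(f"reconstruction error: {error.item():.4f}")
```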
TGI allows you to either run an already GPTQ-quantized model (see available models [here](https://huggingface.co/models?search=gptq)) or quantize a model of your choice using the quantization script. You can run a quantized model by simply passing `--quantize` like below 👇
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize gptq
```
Note that TGI's GPTQ implementation doesn't use [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) under the hood. However, models quantized using AutoGPTQ or Optimum can still be served by TGI.
To quantize a given model using GPTQ with a calibration dataset, simply run
......@@ -41,7 +41,7 @@ You can learn more about GPTQ from the [paper](https://arxiv.org/pdf/2210.17323.
bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn't require a calibration dataset or any post-processing – weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.
8-bit quantization enables multi-billion parameter scale models to fit in smaller hardware without degrading performance too much.
In TGI, you can use 8-bit quantization by adding `--quantize bitsandbytes` like below 👇
```bash
......@@ -50,7 +50,7 @@ docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingf
4-bit quantization is also possible with bitsandbytes. You can choose one of the following 4-bit data types: 4-bit float (`fp4`), or 4-bit `NormalFloat` (`nf4`). These data types were introduced in the context of parameter-efficient fine-tuning, but you can apply them for inference by automatically converting the model weights on load.
In TGI, you can use 4-bit quantization by adding `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` like below 👇
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes-nf4
......
# Safetensors
Safetensors is a model serialization format for deep learning models. It is [faster](https://huggingface.co/docs/safetensors/speed) and safer compared to other serialization formats like pickle (which is used under the hood in many deep learning libraries).
TGI depends on the safetensors format mainly to enable [tensor parallelism sharding](./tensor_parallelism). For a given model repository during serving, TGI looks for safetensors weights. If there are no safetensors weights, TGI converts the PyTorch weights to the safetensors format.
You can learn more about safetensors by reading the [safetensors documentation](https://huggingface.co/docs/safetensors/index).
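As a quick, hedged illustration of the format itself (assuming the `safetensors` and `torch` packages are installed; the tensor names and file name are arbitrary), the snippet below saves a dictionary of tensors and loads it back:
```python
# Minimal sketch of saving and loading tensors with safetensors (not TGI-specific code).
import torch
from safetensors.torch import save_file, load_file

weights = {"embedding.weight": torch.randn(10, 4), "lm_head.weight": torch.randn(4, 10)}
save_file(weights, "model.safetensors")    # serialize to the safetensors format
reloaded = load_file("model.safetensors")  # load without executing arbitrary code, unlike pickle
print(reloaded["embedding.weight"].shape)  # torch.Size([10, 4])
```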
......@@ -5,12 +5,12 @@
Token streaming is the mode in which the server returns the tokens one by one as the model generates them. This enables showing progressive generations to the user rather than waiting for the whole generation. Streaming is an essential aspect of the end-user experience as it reduces latency, one of the most critical aspects of a smooth experience.
<div class="flex justify-center">
<img
class="block dark:hidden"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual_360.gif"
/>
<img
class="hidden dark:block"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual-dark_360.gif"
/>
</div>
......@@ -25,14 +25,14 @@ With token streaming, the server can start returning the tokens one by one befor
For example, a system can generate 100 tokens per second. If the system generates 1000 tokens, with the non-streaming setup, users need to wait 10 seconds to get results. On the other hand, with the streaming setup, users get initial results immediately, and although end-to-end latency will be the same, they can see half of the generation after five seconds. Below you can see an interactive demo that shows non-streaming vs streaming side-by-side. Click **generate** below.
<div class="block dark:hidden">
<iframe
src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=light"
width="850"
height="350"
></iframe>
</div>
<div class="hidden dark:block">
<iframe
src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=dark"
width="850"
height="350"
......@@ -43,7 +43,7 @@ For example, a system can generate 100 tokens per second. If the system generate
### Streaming with Python
To stream tokens with `InferenceClient`, simply pass `stream=True` and iterate over the response.
```python
from huggingface_hub import InferenceClient
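# (Sketch continuing this example; it assumes a TGI server is reachable at this URL.)
client = InferenceClient(model="http://127.0.0.1:8080")
for token in client.text_generation("What is Deep Learning?", max_new_tokens=20, stream=True):
    # each iteration yields a newly generated token as soon as the server emits it
    print(token, end="")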
......@@ -116,7 +116,7 @@ curl -N 127.0.0.1:8080/generate_stream \
First, we need to install the `@huggingface/inference` library.
`npm install @huggingface/inference`
If you're using the free Inference API, you can use `HfInference`. If you're using Inference Endpoints, you can use `HfInferenceEndpoint`.
We can create an `HfInferenceEndpoint` by providing our endpoint URL and credentials.
......@@ -129,7 +129,7 @@ const hf = new HfInferenceEndpoint('https://YOUR_ENDPOINT.endpoints.huggingface.
const prompt = 'What can you do in Nuremberg, Germany? Give me 3 Tips'
const stream = hf.textGenerationStream({ inputs: prompt })
for await (const r of stream) {
// yield the generated token
process.stdout.write(r.token.text)
}
......
# Tensor Parallelism
Tensor parallelism is a technique used to fit a large model across multiple GPUs. For example, when multiplying the input tensors with the first weight tensor, the matrix multiplication is equivalent to splitting the weight tensor column-wise, multiplying each column with the input separately, and then concatenating the separate outputs. These outputs are then transferred from the GPUs and concatenated together to get the final result, like below 👇
![Image courtesy of Anton Lozkhov](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/TP.png)
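For intuition (a toy PyTorch sketch with made-up shapes, not TGI's sharding code), the snippet below splits a weight matrix column-wise into two shards, multiplies each shard separately, and checks that concatenating the partial outputs reproduces the full matrix multiplication:
```python
# Toy illustration of column-wise tensor parallelism.
import torch

x = torch.randn(2, 8)          # input activations: (batch, hidden)
w = torch.randn(8, 6)          # full weight matrix: (hidden, out)

# Split the weight column-wise into two shards, one per "GPU".
w0, w1 = w.chunk(2, dim=1)     # each shard has shape (8, 3)

# Each device computes a partial output with its own shard...
y0, y1 = x @ w0, x @ w1

# ...then the partial outputs are gathered and concatenated to form the full result.
y_parallel = torch.cat([y0, y1], dim=1)
torch.testing.assert_close(y_parallel, x @ w)
```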
......
......@@ -4,7 +4,7 @@ This section explains how to install the CLI tool as well as installing TGI from
## Install CLI
You can use the TGI command-line interface (CLI) to download weights, serve and quantize models, or get information on serving parameters.
To install the CLI, you need to first clone the TGI repository and then run `make`.
......@@ -23,7 +23,7 @@ BUILD_EXTENSIONS=True make install
Before you start, you will need to set up your environment and install Text Generation Inference. Text Generation Inference is tested on **Python 3.9+**.
Text Generation Inference is available on PyPI, Conda and GitHub.
To install and launch locally, first [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using conda:
......
......@@ -92,7 +92,7 @@ print(chat_completion)
## Hugging Face Inference Endpoints
The Messages API is integrated with [Inference Endpoints](https://huggingface.co/inference-endpoints/dedicated).
Every endpoint that uses "Text Generation Inference" with an LLM that has a chat template can now be used. Below is an example of how to use Inference Endpoints with TGI through OpenAI's Python client library:
> **Note:** Make sure to replace `base_url` with your endpoint URL and to include `v1/` at the end of the URL. The `api_key` should be replaced with your Hugging Face API key.
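As a hedged sketch of what such a call can look like (the endpoint URL, API key, and prompt below are placeholders; `model="tgi"` is used here only as a stand-in name, since the endpoint serves a single model):
```python
# Sketch: querying a TGI-backed Inference Endpoint through OpenAI's Python client.
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR_ENDPOINT.endpoints.huggingface.cloud/v1/",  # note the trailing v1/
    api_key="YOUR_HF_API_KEY",
)

chat_completion = client.chat.completions.create(
    model="tgi",  # the endpoint serves a single model, so this name is a placeholder
    messages=[{"role": "user", "content": "Why is open-source software important?"}],
    stream=True,
)

for chunk in chat_completion:
    # print tokens as they stream back from the endpoint
    print(chunk.choices[0].delta.content or "", end="")
```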
......
......@@ -53,7 +53,7 @@ print(response.json())
```js
async function query() {
const response = await fetch(
'http://127.0.0.1:8080/generate',
{
method: 'POST',
headers: { 'Content-Type': 'application/json'},
......
......@@ -54,7 +54,9 @@ async def test_mamba_all_params(fused_kernel_mamba, response_snapshot):
@pytest.mark.asyncio
@pytest.mark.private
async def test_mamba_load(
fused_kernel_mamba, generate_load, generous_response_snapshot
):
responses = await generate_load(
fused_kernel_mamba, "What is Deep Learning?", max_new_tokens=10, n=4
)
......
......@@ -2,4 +2,4 @@
addopts = --snapshot-warn-unused
asyncio_mode = auto
markers =
    private: marks tests as requiring an admin hf token (deselect with '-m "not private"')
......@@ -57,7 +57,7 @@ export function run(host, generate_payload, max_new_tokens) {
const duration = res.timings.duration;
if (res.status === 200) {
const body = res.json();
const n_tokens = body.details.tokens.length;
const latency_ms_per_token = duration / n_tokens;
timePerToken.add(latency_ms_per_token);
......
......@@ -60,4 +60,4 @@ export default function () {
inferenceTime.add(res.headers["X-Inference-Time"]);
timePerToken.add(res.headers["X-Time-Per-Token"]);
}
}
import { get_options, run } from "./common.js";
const reference_latency_ms = 70;
const host = __ENV.HOST || '127.0.0.1:8000';
const max_new_tokens = 50;
......
import { get_options, run } from "./common.js";
const reference_latency_ms = 22;
const host = __ENV.HOST || '127.0.0.1:8000';
const max_new_tokens = 50;
......
......@@ -28,7 +28,7 @@ this is controlled by the client, and therefore the amount of batching is decide
beforehand.
For text generation, and for LLMs which are memory bound, we can try to be much more
efficient with the available compute by having the client send us single queries,
and letting the router mix and match queries into or out of batches to make the most
efficient use of the compute. This is possible because for LLMs the total compute
for running the model is much bigger than the cost of mixing and matching the batches themselves.
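For intuition only (a toy sketch of the idea, not TGI's router; the queue contents, batch size, and token budgets are made up), the snippet below lets requests join the running batch between decoding steps and leave it as soon as they finish:
```python
# Toy sketch of continuous batching: the batch is re-filled from the queue at every step.
from collections import deque

queue = deque(["req-A", "req-B", "req-C"])   # waiting requests
batch, remaining = {}, {}                    # active requests and their remaining token budget

for step in range(6):
    # admit waiting requests into the running batch, up to a max batch size of 2
    while queue and len(batch) < 2:
        req = queue.popleft()
        batch[req], remaining[req] = [], 3   # pretend each request wants 3 more tokens

    # one decode step produces a token for every request currently in the batch
    for req in list(batch):
        batch[req].append(f"tok{step}")
        remaining[req] -= 1
        if remaining[req] == 0:              # finished requests leave immediately,
            print(req, "done:", batch.pop(req))  # freeing a slot for queued requests
```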
......@@ -89,5 +89,5 @@ most critical perceived quality of an LLM API.
With token streaming, the server can start answering after the first `prefill` pass
directly, without waiting for all the generation to be done. For extremely long queries
this means clients can start to see something happening orders of magnitude before
the work is done. Seeing something in progress allows them to cut the generation short if it's not
what they wanted, and it also simply "feels" better.
*.rs
......@@ -27,6 +27,7 @@ pub struct Validation {
}
impl Validation {
#[allow(clippy::too_many_arguments)]
pub(crate) fn new(
workers: usize,
tokenizer: Option<Tokenizer>,
......
......@@ -3,4 +3,4 @@
# Branched from master on: 10 November, 2023
# https://releases.rs/docs/1.75.0/
channel = "1.75.0"
components = ["rustfmt", "clippy"]
\ No newline at end of file
components = ["rustfmt", "clippy"]
......@@ -2,7 +2,7 @@
# to make cuda graphs work.
awq_commit := bd1dc2d5254345cc76ab71894651fb821275bdd4
awq:
rm -rf llm-awq
git clone https://github.com/huggingface/llm-awq
......