Unverified Commit 9946165e authored by OlivierDehaene, committed by GitHub

chore: add pre-commit (#1569)

parent 142cdabe
......@@ -5,14 +5,14 @@ body:
id: system-info
attributes:
label: System Info
description: |
Please share your system info with us (`text-generation-launcher --env` if installed locally).
The full command line used that causes issues:
OS version:
Rust version (if self-compiling, `cargo version`):
Model being used (`curl 127.0.0.1:8080/info | jq`):
If using a local model, please specify the kind of model and/or equivalents.
Hardware used (GPUs, how many, on which cloud) (`nvidia-smi`):
Deployment specifics (Kubernetes, EKS, AKS, any particular deployments):
The current version being used:
......@@ -52,11 +52,11 @@ body:
placeholder: |
Steps to reproduce the behavior:
1.
2.
3.
- type: textarea
id: expected-behavior
......
......@@ -19,7 +19,7 @@ body:
label: Motivation
description: |
Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too.
- type: textarea
id: contribution
......
......@@ -6,15 +6,15 @@ on:
jobs:
update_docs:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v2
- name: Install Launcher
id: install-launcher
run: cargo install --git https://github.com/${{ github.repository }} --branch ${{ github.head_ref }} text-generation-launcher
- name: Check launcher Docs are up-to-date
run: |
echo text-generation-launcher --help
......
......@@ -16,4 +16,4 @@ jobs:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}
package: text-generation-inference
additional_args: --not_python_module
......@@ -71,12 +71,11 @@ jobs:
pip install pytest
export HUGGING_FACE_HUB_TOKEN=${{ secrets.HUGGING_FACE_HUB_TOKEN }}
pytest -s -vv server/tests
-    - name: Run Rust fmt
+    - name: Pre-commit checks
      run: |
-        cargo fmt --check
-    - name: Run Rust clippy
-      run: |
-        cargo clippy
+        pip install pre-commit
+        pre-commit install
+        pre-commit run --all-files
- name: Run Rust tests
run: |
cargo test
......
......@@ -13,4 +13,4 @@ jobs:
package_name: text-generation-inference
secrets:
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
......@@ -11,4 +11,3 @@ server/exllama_kernels/exllama_kernels/hip_func/
*_hip.cuh
server/exllama_kernels/exllama_kernels/hip_buffers.cuh
server/exllama_kernels/exllama_kernels/exllama_ext_hip.cpp
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: check-yaml
- id: end-of-file-fixer
- id: trailing-whitespace
exclude: docs/source/basic_tutorials/launcher.md
- repo: https://github.com/psf/black
rev: 24.2.0
hooks:
- id: black
- repo: https://github.com/doublify/pre-commit-rust
rev: v1.0
hooks:
- id: fmt
- id: cargo-check
- id: clippy
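With the hooks above in place, the same checks that CI runs can be exercised locally before pushing. A minimal sketch of the workflow (the commands mirror the CI step earlier in this diff):

```shell
# Install the pre-commit framework and register the git hook once per clone.
pip install pre-commit
pre-commit install

# Run every configured hook (check-yaml, end-of-file-fixer, trailing-whitespace,
# black, fmt, cargo-check, clippy) against the whole repository.
pre-commit run --all-files
```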
<div align="center">
<a href="https://www.youtube.com/watch?v=jlMAX2Oaht0">
<img width=560 height=315 alt="Making TGI deployment optimal" src="https://huggingface.co/datasets/Narsil/tgi_assets/resolve/main/thumbnail.png">
</a>
......@@ -228,7 +228,7 @@ text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:
```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes
```
4-bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
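For example, a sketch of combining these flags (the model id is simply the one used above; actual memory savings depend on the model):

```shell
# Load the same model with 4-bit NF4 quantization from bitsandbytes.
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes-nf4
```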
......
......@@ -29,4 +29,3 @@ tui = {package = "ratatui", version = "0.23", default-features = false, features
tracing = "0.1.37"
tracing-subscriber = { version = "0.3.17", features = ["json", "env-filter"] }
hf-hub = "0.3.1"
......@@ -6,12 +6,12 @@
</div>
A lightweight benchmarking tool inspired by [oha](https://github.com/hatoo/oha)
and powered by [tui](https://github.com/tui-rs-revival/ratatui).
## Install

```shell
make install-benchmark
```
......@@ -27,4 +27,4 @@ Then run the benchmarking tool:
```shell
text-generation-benchmark --tokenizer-name bigscience/bloom-560m
```
......@@ -155,4 +155,4 @@ dmypy.json
cython_debug/
transformers
safetensors
......@@ -3,4 +3,4 @@ unit-tests:
install:
pip install pip --upgrade
pip install -e .
......@@ -141,7 +141,7 @@ class Parameters:
# Get decoder input token logprobs and ids
decoder_input_details: bool
# Return the N most likely tokens at each step
top_n_tokens: Optional[int]
# Decoder input tokens
class InputToken:
......@@ -192,7 +192,7 @@ class BestOfSequence:
# Generated tokens
tokens: List[Token]
# Most likely tokens
top_tokens: Optional[List[List[Token]]]
# `generate` details
......@@ -236,7 +236,7 @@ class StreamResponse:
# Generated token
token: Token
# Most likely tokens
top_tokens: Optional[List[Token]]
# Complete generated text
# Only available when the generation is finished
generated_text: Optional[str]
......@@ -248,4 +248,4 @@ class StreamResponse:
class DeployedModel:
model_id: str
sha: str
```
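The fields listed above surface as arguments and attributes of the Python client. A minimal sketch, assuming a TGI instance is already serving on port 8080 (the URL, prompt, and parameter values are illustrative):

```python
from text_generation import Client

# Hypothetical local endpoint; point this at wherever TGI is serving.
client = Client("http://127.0.0.1:8080")

# Ask for the 5 most likely tokens at each step (the `top_n_tokens` field above).
response = client.generate("What is Deep Learning?", max_new_tokens=20, top_n_tokens=5)
print(response.generated_text)
print(response.details.top_tokens)  # most likely tokens per step, as described above
```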
......@@ -134,6 +134,7 @@ class Parameters(BaseModel):
raise ValidationError("`value` cannot be empty for `json` grammar")
return v
class Request(BaseModel):
# Prompt
inputs: str
......
......@@ -27,4 +27,4 @@
}
</script>
</body>
</html>
......@@ -1290,4 +1290,4 @@
"description": "Hugging Face Text Generation Inference API"
}
]
}
......@@ -23,7 +23,7 @@ You can simply install `huggingface-hub` package with pip.
pip install huggingface-hub
```
Once you start the TGI server, instantiate `InferenceClient()` with the URL to the endpoint serving the model. You can then call `text_generation()` to hit the endpoint through Python.
```python
from huggingface_hub import InferenceClient
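# Illustrative continuation, not part of the original snippet: the endpoint URL
# and prompt are assumptions. Point the client at the running TGI server and
# call `text_generation()` as described above.
client = InferenceClient(model="http://127.0.0.1:8080")
print(client.text_generation("What is Deep Learning?", max_new_tokens=20))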
......@@ -83,8 +83,8 @@ Gradio is a Python library that helps you build web applications for your machin
pip install huggingface-hub gradio
```
Assuming you are serving your model on port 8080, we will query it through [InferenceClient](consuming_tgi#inference-client).
```python
import gradio as gr
from huggingface_hub import InferenceClient
......@@ -110,30 +110,30 @@ gr.ChatInterface(
).queue().launch()
```
The UI looks like this 👇
<div class="flex justify-center">
<img
class="block dark:hidden"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/gradio-tgi.png"
/>
<img
class="hidden dark:block"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/gradio-tgi-dark.png"
/>
</div>
You can try the demo directly here 👇
<div class="block dark:hidden">
<iframe
src="https://merve-gradio-tgi-2.hf.space?__theme=light"
width="850"
height="750"
></iframe>
</div>
<div class="hidden dark:block">
<iframe
src="https://merve-gradio-tgi-2.hf.space?__theme=dark"
width="850"
height="750"
......@@ -152,4 +152,4 @@ You can read more about how to customize a `ChatInterface` [here](https://www.gr
## API documentation
You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route. The Swagger UI is also available [here](https://huggingface.github.io/text-generation-inference).
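Beyond browsing the Swagger UI, the documented routes can be exercised directly. A minimal sketch, assuming TGI is serving on port 8080 (URL, prompt, and parameters are illustrative):

```shell
# Query the /generate route described in the OpenAPI specification.
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}' \
    -H 'Content-Type: application/json'
```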
......@@ -2,19 +2,19 @@
TGI supports various LLM architectures (see full list [here](../supported_models)). If you wish to serve a model that is not one of the supported models, TGI will fall back to the `transformers` implementation of that model. This means you will be unable to use some of the features introduced by TGI, such as tensor-parallel sharding or flash attention. However, you can still get many benefits of TGI, such as continuous batching or streaming outputs.
You can serve these models using the same Docker command-line invocation as with fully supported models 👇
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
```
If the model you wish to serve is a custom transformers model, and its weights and implementation are available in the Hub, you can still serve the model by passing the `--trust-remote-code` flag to the `docker run` command like below 👇
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id <CUSTOM_MODEL_ID> --trust-remote-code
```
Finally, if the model is not on the Hugging Face Hub but available locally, you can pass the path to the folder that contains your model like below 👇
```bash
# Make sure your model is in the $volume directory
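# Illustrative continuation (the repository's actual example may differ): mount
# the local folder into /data and point --model-id at it, mirroring the Docker
# invocations shown above.
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /data/<PATH-TO-MODEL-FOLDER>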
......
# Preparing the Model
Text Generation Inference improves the model in several aspects.
## Quantization
......@@ -9,7 +9,7 @@ TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsan
## RoPE Scaling
RoPE scaling can be used to increase the sequence length of the model at inference time without necessarily fine-tuning it. To enable RoPE scaling, pass the `--rope-scaling`, `--max-input-length` and `--rope-factor` flags when running through the CLI. `--rope-scaling` can take the values `linear` or `dynamic`. If your model is not fine-tuned to a longer sequence length, use `dynamic`. `--rope-factor` is the ratio between the intended max sequence length and the model's original max sequence length. Make sure to pass `--max-input-length` to provide the maximum input length for extension.
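For instance, a sketch of extending a RoPE-based model whose original maximum sequence length is 4096 tokens to 8192 (the model id and values are illustrative; `--rope-factor` is the target length divided by the original maximum):

```shell
# 8192 (target) / 4096 (original) = 2.0
text-generation-launcher --model-id <MODEL-ID> \
    --rope-scaling dynamic \
    --rope-factor 2.0 \
    --max-input-length 8192
```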
<Tip>
......@@ -19,4 +19,4 @@ We recommend using `dynamic` RoPE scaling.
## Safetensors
[Safetensors](https://github.com/huggingface/safetensors) is a fast and safe persistence format for deep learning models, and is required for tensor parallelism. TGI supports `safetensors` model loading under the hood. By default, given a repository with both `safetensors` and `pytorch` weights, TGI will always load the `safetensors` weights. If there are no `safetensors` weights, TGI will convert the `pytorch` weights to the `safetensors` format.