Unverified commit 9946165e, authored by OlivierDehaene and committed by GitHub

chore: add pre-commit (#1569)

parent 142cdabe
@@ -5,14 +5,14 @@ body:
  id: system-info
  attributes:
    label: System Info
    description: |
      Please share your system info with us (`text-generation-launcher --env` if installed locally).
      The full command line used that causes issues:
      OS version:
      Rust version (if self-compiling, `cargo version`):
      Model being used (`curl 127.0.0.1:8080/info | jq`):
      If local model please explicit the kind of model and/or equivalents.
      Hardware used (GPUs, how many, on which cloud) (`nvidia-smi`):
      Deployment specificities (Kubernetes, EKS, AKS, any particular deployments):
      The current version being used:
@@ -52,11 +52,11 @@ body:
    placeholder: |
      Steps to reproduce the behavior:
      1.
      2.
      3.
  - type: textarea
    id: expected-behavior
...
@@ -19,7 +19,7 @@ body:
    label: Motivation
    description: |
      Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too.
  - type: textarea
    id: contribution
...
@@ -6,15 +6,15 @@ on:
jobs:
  update_docs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Install Launcher
        id: install-launcher
        run: cargo install --git https://github.com/${{ github.repository }} --branch ${{ github.head_ref }} text-generation-launcher
      - name: Check launcher Docs are up-to-date
        run: |
          echo text-generation-launcher --help
...
@@ -16,4 +16,4 @@ jobs:
    commit_sha: ${{ github.event.pull_request.head.sha }}
    pr_number: ${{ github.event.number }}
    package: text-generation-inference
    additional_args: --not_python_module
@@ -71,12 +71,11 @@ jobs:
          pip install pytest
          export HUGGING_FACE_HUB_TOKEN=${{ secrets.HUGGING_FACE_HUB_TOKEN }}
          pytest -s -vv server/tests
-     - name: Run Rust fmt
+     - name: Pre-commit checks
        run: |
-         cargo fmt --check
-     - name: Run Rust clippy
-       run: |
-         cargo clippy
+         pip install pre-commit
+         pre-commit install
+         pre-commit run --all-files
      - name: Run Rust tests
        run: |
          cargo test
...
@@ -13,4 +13,4 @@ jobs:
    package_name: text-generation-inference
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
-     comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
\ No newline at end of file
+     comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
@@ -11,4 +11,3 @@ server/exllama_kernels/exllama_kernels/hip_func/
*_hip.cuh
server/exllama_kernels/exllama_kernels/hip_buffers.cuh
server/exllama_kernels/exllama_kernels/exllama_ext_hip.cpp
-
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
        exclude: docs/source/basic_tutorials/launcher.md
  - repo: https://github.com/psf/black
    rev: 24.2.0
    hooks:
      - id: black
  - repo: https://github.com/doublify/pre-commit-rust
    rev: v1.0
    hooks:
      - id: fmt
      - id: cargo-check
      - id: clippy
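To run the hooks defined by this new configuration locally (mirroring the CI step above), the standard pre-commit workflow applies; a minimal sketch, assuming Python and pip are available on the machine:

```shell
# Install pre-commit and register it as a git hook for this repository
pip install pre-commit
pre-commit install

# Run every configured hook (YAML checks, whitespace/EOF fixers, black, rustfmt, cargo check, clippy)
# across the whole tree, as the CI job does
pre-commit run --all-files
```

After `pre-commit install`, the hooks also run automatically on each `git commit` for the files being committed.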
<div align="center"> <div align="center">
<a href="https://www.youtube.com/watch?v=jlMAX2Oaht0"> <a href="https://www.youtube.com/watch?v=jlMAX2Oaht0">
<img width=560 width=315 alt="Making TGI deployment optimal" src="https://huggingface.co/datasets/Narsil/tgi_assets/resolve/main/thumbnail.png"> <img width=560 width=315 alt="Making TGI deployment optimal" src="https://huggingface.co/datasets/Narsil/tgi_assets/resolve/main/thumbnail.png">
</a> </a>
@@ -228,7 +228,7 @@ text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:
```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize
```
4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
...
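For the 4-bit path mentioned just above, the launcher takes one of the two bitsandbytes data types via the same flag; a minimal sketch (the model id is the one already used in this README, and choosing NF4 over FP4 here is purely illustrative):

```shell
# 4-bit NF4 quantization; swap in --quantize bitsandbytes-fp4 for the FP4 data type
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes-nf4
```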
@@ -29,4 +29,4 @@ tui = {package = "ratatui", version = "0.23", default-features = false, features
tracing = "0.1.37"
tracing-subscriber = { version = "0.3.17", features = ["json", "env-filter"] }
hf-hub = "0.3.1"
@@ -6,12 +6,12 @@
</div>

A lightweight benchmarking tool based inspired by [oha](https://github.com/hatoo/oha)
and powered by [tui](https://github.com/tui-rs-revival/ratatui).

## Install

```shell
make install-benchmark
```
@@ -27,4 +27,4 @@ Then run the benchmarking tool:
```shell
text-generation-benchmark --tokenizer-name bigscience/bloom-560m
-```
\ No newline at end of file
+```
@@ -155,4 +155,4 @@ dmypy.json
cython_debug/
transformers
-safetensors
\ No newline at end of file
+safetensors
@@ -3,4 +3,4 @@ unit-tests:
install:
	pip install pip --upgrade
-	pip install -e .
\ No newline at end of file
+	pip install -e .
@@ -141,7 +141,7 @@ class Parameters:
    # Get decoder input token logprobs and ids
    decoder_input_details: bool
    # Return the N most likely tokens at each step
    top_n_tokens: Optional[int]

# Decoder input tokens
class InputToken:
@@ -192,7 +192,7 @@ class BestOfSequence:
    # Generated tokens
    tokens: List[Token]
    # Most likely tokens
    top_tokens: Optional[List[List[Token]]]

# `generate` details
@@ -236,7 +236,7 @@ class StreamResponse:
    # Generated token
    token: Token
    # Most likely tokens
    top_tokens: Optional[List[Token]]
    # Complete generated text
    # Only available when the generation is finished
    generated_text: Optional[str]
@@ -248,4 +248,4 @@ class StreamResponse:
class DeployedModel:
    model_id: str
    sha: str
-```
\ No newline at end of file
+```
@@ -134,6 +134,7 @@ class Parameters(BaseModel):
            raise ValidationError("`value` cannot be empty for `json` grammar")
        return v

+
class Request(BaseModel):
    # Prompt
    inputs: str
...
@@ -27,4 +27,4 @@
    }
    </script>
</body>
-</html>
\ No newline at end of file
+</html>
@@ -1290,4 +1290,4 @@
            "description": "Hugging Face Text Generation Inference API"
        }
    ]
-}
\ No newline at end of file
+}
@@ -23,7 +23,7 @@ You can simply install `huggingface-hub` package with pip.
pip install huggingface-hub
```

Once you start the TGI server, instantiate `InferenceClient()` with the URL to the endpoint serving the model. You can then call `text_generation()` to hit the endpoint through Python.

```python
from huggingface_hub import InferenceClient
@@ -83,8 +83,8 @@ Gradio is a Python library that helps you build web applications for your machin
pip install huggingface-hub gradio
```

Assume you are serving your model on port 8080, we will query through [InferenceClient](consuming_tgi#inference-client).

```python
import gradio as gr
from huggingface_hub import InferenceClient
@@ -110,30 +110,30 @@ gr.ChatInterface(
).queue().launch()
```

The UI looks like this 👇

<div class="flex justify-center">
    <img
        class="block dark:hidden"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/gradio-tgi.png"
    />
    <img
        class="hidden dark:block"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/gradio-tgi-dark.png"
    />
</div>

You can try the demo directly here 👇

<div class="block dark:hidden">
    <iframe
        src="https://merve-gradio-tgi-2.hf.space?__theme=light"
        width="850"
        height="750"
    ></iframe>
</div>

<div class="hidden dark:block">
    <iframe
        src="https://merve-gradio-tgi-2.hf.space?__theme=dark"
        width="850"
        height="750"
@@ -152,4 +152,4 @@ You can read more about how to customize a `ChatInterface` [here](https://www.gr
## API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route. The Swagger UI is also available [here](https://huggingface.github.io/text-generation-inference).
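As a small illustration of the route mentioned above, assuming a TGI server listening locally on port 8080 as in the earlier examples:

```shell
# The interactive OpenAPI (Swagger UI) documentation is served on the /docs route
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8080/docs   # expect 200
# or simply open http://127.0.0.1:8080/docs in a browser
```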
@@ -2,19 +2,19 @@
TGI supports various LLM architectures (see full list [here](../supported_models)). If you wish to serve a model that is not one of the supported models, TGI will fallback to the `transformers` implementation of that model. This means you will be unable to use some of the features introduced by TGI, such as tensor-parallel sharding or flash attention. However, you can still get many benefits of TGI, such as continuous batching or streaming outputs.

You can serve these models using the same Docker command-line invocation as with fully supported models 👇

```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
```

If the model you wish to serve is a custom transformers model, and its weights and implementation are available in the Hub, you can still serve the model by passing the `--trust-remote-code` flag to the `docker run` command like below 👇

```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id <CUSTOM_MODEL_ID> --trust-remote-code
```

Finally, if the model is not on Hugging Face Hub but on your local, you can pass the path to the folder that contains your model like below 👇

```bash
# Make sure your model is in the $volume directory
...
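For reference, a hedged sketch of the local-model invocation that the truncated block above leads into; the folder name `my-local-model` is a hypothetical placeholder, and the rest of the command mirrors the docker invocations shown earlier on this page:

```bash
# Make sure your model is in the $volume directory, e.g. $volume/my-local-model,
# then pass its path as seen from inside the container (mounted at /data)
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /data/my-local-model
```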
# Preparing the Model # Preparing the Model
Text Generation Inference improves the model in several aspects. Text Generation Inference improves the model in several aspects.
## Quantization ## Quantization
@@ -9,7 +9,7 @@ TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsan
## RoPE Scaling

RoPE scaling can be used to increase the sequence length of the model during the inference time without necessarily fine-tuning it. To enable RoPE scaling, simply pass `--rope-scaling`, `--max-input-length` and `--rope-factors` flags when running through CLI. `--rope-scaling` can take the values `linear` or `dynamic`. If your model is not fine-tuned to a longer sequence length, use `dynamic`. `--rope-factor` is the ratio between the intended max sequence length and the model's original max sequence length. Make sure to pass `--max-input-length` to provide maximum input length for extension.

<Tip>
@@ -19,4 +19,4 @@ We recommend using `dynamic` RoPE scaling.
## Safetensors

[Safetensors](https://github.com/huggingface/safetensors) is a fast and safe persistence format for deep learning models, and is required for tensor parallelism. TGI supports `safetensors` model loading under the hood. By default, given a repository with `safetensors` and `pytorch` weights, TGI will always load `safetensors`. If there's no `pytorch` weights, TGI will convert the weights to `safetensors` format.
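As a hedged illustration of the RoPE scaling flags described in the documentation above, assuming a model served through the launcher; the model id, factor, and input length below are made-up examples, not recommendations:

```shell
# Dynamic RoPE scaling: --rope-factor is the ratio between the intended and the
# original max sequence length; the concrete values here are illustrative only
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 \
    --rope-scaling dynamic \
    --rope-factor 2.0 \
    --max-input-length 8192
```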