Unverified Commit 9946165e authored by OlivierDehaene, committed by GitHub

chore: add pre-commit (#1569)

parent 142cdabe
@@ -2,29 +2,29 @@
You can use the TGI command-line interface (CLI) to download weights, serve and quantize models, or get information on serving parameters. To install the CLI, please refer to [the installation section](../installation#install-cli).
`text-generation-server` lets you download the model with the `download-weights` command like below 👇
```bash
text-generation-server download-weights MODEL_HUB_ID
```
You can also use it to quantize models like below 👇
```bash
text-generation-server quantize MODEL_HUB_ID OUTPUT_DIR
```
You can use `text-generation-launcher` to serve models.
```bash
text-generation-launcher --model-id MODEL_HUB_ID --port 8080
```
There are many options and parameters you can pass to `text-generation-launcher`. The CLI documentation is kept minimal and relies on self-generated documentation, which can be found by running
```bash
text-generation-launcher --help
```
You can also find it hosted in this [Swagger UI](https://huggingface.github.io/text-generation-inference/).
......
# Flash Attention
Scaling the transformer architecture is heavily bottlenecked by the self-attention mechanism, which has quadratic time and memory complexity. Recent developments in accelerator hardware mainly focus on enhancing compute capacity rather than memory capacity or data transfer between hardware, which leaves the attention operation with a memory bottleneck. **Flash Attention** is an attention algorithm that reduces this problem and scales transformer-based models more efficiently, enabling faster training and inference.
The standard attention mechanism uses High Bandwidth Memory (HBM) to store, read and write keys, queries and values. HBM is large in capacity but slow to access, whereas SRAM is smaller but much faster. In the standard attention implementation, the cost of loading and writing keys, queries, and values from HBM is high: it loads them from HBM into GPU on-chip SRAM, performs a single step of the attention mechanism, writes the result back to HBM, and repeats this for every attention step. Instead, Flash Attention loads keys, queries, and values once, fuses the operations of the attention mechanism, and writes the result back.
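To make the fused, block-wise computation concrete, here is a minimal single-query sketch of the online-softmax idea behind Flash Attention, written in plain NumPy rather than TGI's CUDA kernels; the shapes and block size are arbitrary. Only running statistics are kept, so the full score vector is never materialized.

```python
import numpy as np

def streaming_attention(q, K, V, block_size=128):
    """Attention for a single query vector, reading K/V block by block.

    Only a running max, a running softmax denominator and a running weighted
    sum of values are kept, so all scores are never held at once.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    running_max, denom = -np.inf, 0.0
    acc = np.zeros(V.shape[-1])
    for start in range(0, K.shape[0], block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = (k_blk @ q) * scale
        new_max = max(running_max, scores.max())
        correction = np.exp(running_max - new_max)   # rescale old statistics
        probs = np.exp(scores - new_max)
        denom = denom * correction + probs.sum()
        acc = acc * correction + probs @ v_blk
        running_max = new_max
    return acc / denom

# Sanity check against a naive implementation that materializes all scores.
q, K, V = np.random.randn(64), np.random.randn(1024, 64), np.random.randn(1024, 64)
scores = (K @ q) / np.sqrt(64)
weights = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
assert np.allclose(streaming_attention(q, K, V), weights @ V)
```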
![Flash Attention](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/flash-attn.png)
It is implemented for supported models. You can check out the complete list of models that support Flash Attention [here](https://github.com/huggingface/text-generation-inference/tree/main/server/text_generation_server/models); they are the models with the `flash` prefix.
You can learn more about Flash Attention by reading the [paper](https://arxiv.org/abs/2205.14135).
@@ -4,20 +4,20 @@ TGI offers GPTQ and bits-and-bytes quantization to quantize large language model
## Quantization with GPTQ
GPTQ is a post-training quantization method to make the model smaller. It quantizes each layer by finding a compressed version of its weights that yields the minimum mean squared error, like below 👇
Given a layer \\(l\\) with weight matrix \\(W_{l}\\) and layer input \\(X_{l}\\), find the quantized weight \\(\\hat{W}_{l}\\):
$$\hat{W}_{l}^{*} = \operatorname{argmin}_{\hat{W}_{l}} \lVert W_{l}X_{l}-\hat{W}_{l}X_{l} \rVert^{2}_{2}$$
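As a toy illustration of this objective (not the GPTQ algorithm itself, which searches for better solutions than naive rounding), the sketch below measures the reconstruction error of a round-to-nearest 4-bit baseline; the layer shapes and the calibration input `X` are made up for the example.

```python
import numpy as np

def layer_reconstruction_error(W, W_hat, X):
    # The per-layer objective above: || W X - W_hat X ||_2^2
    return float(np.linalg.norm(W @ X - W_hat @ X) ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))    # original layer weights (toy size)
X = rng.normal(size=(16, 128))   # calibration inputs seen by this layer

# Naive symmetric round-to-nearest 4-bit quantization as a baseline;
# GPTQ looks for quantized weights with a lower value of this same objective.
scale = np.abs(W).max() / 7
W_rtn = np.clip(np.round(W / scale), -8, 7) * scale
print(layer_reconstruction_error(W, W_rtn, X))
```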
TGI allows you to both run an already GPTQ-quantized model (see available models [here](https://huggingface.co/models?search=gptq)) and quantize a model of your choice using the quantization script. You can run a quantized model by simply passing `--quantize` like below 👇
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize gptq
```
Note that TGI's GPTQ implementation doesn't use [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) under the hood. However, models quantized using AutoGPTQ or Optimum can still be served by TGI.
To quantize a given model using GPTQ with a calibration dataset, simply run
@@ -41,7 +41,7 @@ You can learn more about GPTQ from the [paper](https://arxiv.org/pdf/2210.17323.
bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn't require a calibration dataset or any post-processing – weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.
8-bit quantization enables multi-billion parameter scale models to fit in smaller hardware without degrading performance too much.
In TGI, you can use 8-bit quantization by adding `--quantize bitsandbytes` like below 👇
```bash
@@ -50,7 +50,7 @@ docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingf
4-bit quantization is also possible with bitsandbytes. You can choose one of the following 4-bit data types: 4-bit float (`fp4`), or 4-bit `NormalFloat` (`nf4`). These data types were introduced in the context of parameter-efficient fine-tuning, but you can apply them for inference by automatically converting the model weights on load.
In TGI, you can use 4-bit quantization by adding `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` like below 👇
```bash
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes-nf4
......
# Safetensors
Safetensors is a model serialization format for deep learning models. It is [faster](https://huggingface.co/docs/safetensors/speed) and safer compared to other serialization formats like pickle (which is used under the hood in many deep learning libraries).
TGI depends on the safetensors format mainly to enable [tensor parallelism sharding](./tensor_parallelism). For a given model repository during serving, TGI looks for safetensors weights. If there are no safetensors weights, TGI converts the PyTorch weights to safetensors format.
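As a rough sketch of what the format looks like from user code (the tensor names here are made up), weights are stored as a flat dictionary of named tensors and can be written and read with the `safetensors` library:

```python
import torch
from safetensors.torch import save_file, load_file

# Weights are stored as a flat dictionary of named tensors.
weights = {
    "embed.weight": torch.zeros(1024, 768),
    "lm_head.weight": torch.zeros(768, 1024),
}
save_file(weights, "model.safetensors")

# Loading only reads tensor data, so unlike pickle it cannot execute
# arbitrary code embedded in the checkpoint.
state_dict = load_file("model.safetensors")
print(state_dict["embed.weight"].shape)
```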
You can learn more about safetensors by reading the [safetensors documentation](https://huggingface.co/docs/safetensors/index).
\ No newline at end of file
@@ -5,12 +5,12 @@
Token streaming is the mode in which the server returns the tokens one by one as the model generates them. This enables showing progressive generations to the user rather than waiting for the whole generation. Streaming is an essential aspect of the end-user experience as it reduces latency, one of the most critical aspects of a smooth experience.
<div class="flex justify-center"> <div class="flex justify-center">
<img <img
class="block dark:hidden" class="block dark:hidden"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual_360.gif" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual_360.gif"
/> />
<img <img
class="hidden dark:block" class="hidden dark:block"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual-dark_360.gif" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual-dark_360.gif"
/> />
</div> </div>
@@ -25,14 +25,14 @@ With token streaming, the server can start returning the tokens one by one befor
For example, a system can generate 100 tokens per second. If the system generates 1000 tokens, with the non-streaming setup, users need to wait 10 seconds to get results. On the other hand, with the streaming setup, users get initial results immediately, and although end-to-end latency will be the same, they can see half of the generation after five seconds. Below you can see an interactive demo that shows non-streaming vs streaming side-by-side. Click **generate** below.
<div class="block dark:hidden"> <div class="block dark:hidden">
<iframe <iframe
src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=light" src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=light"
width="850" width="850"
height="350" height="350"
></iframe> ></iframe>
</div> </div>
<div class="hidden dark:block"> <div class="hidden dark:block">
<iframe <iframe
src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=dark" src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=dark"
width="850" width="850"
height="350" height="350"
@@ -43,7 +43,7 @@ For example, a system can generate 100 tokens per second. If the system generate
### Streaming with Python
To stream tokens with `InferenceClient`, simply pass `stream=True` and iterate over the response.
```python
from huggingface_hub import InferenceClient
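# A minimal sketch of how such a call can look, assuming a TGI server is
# listening locally on 127.0.0.1:8080; the prompt and max_new_tokens are
# arbitrary. With stream=True the call yields tokens as they are generated.
client = InferenceClient("http://127.0.0.1:8080")
for token in client.text_generation("What is Deep Learning?", max_new_tokens=20, stream=True):
    print(token, end="", flush=True)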
@@ -116,7 +116,7 @@ curl -N 127.0.0.1:8080/generate_stream \
First, we need to install the `@huggingface/inference` library.
`npm install @huggingface/inference`
If you're using the free Inference API, you can use `HfInference`. If you're using Inference Endpoints, you can use `HfInferenceEndpoint`.
We can create a `HfInferenceEndpoint` providing our endpoint URL and credential.
@@ -129,7 +129,7 @@ const hf = new HfInferenceEndpoint('https://YOUR_ENDPOINT.endpoints.huggingface.
const prompt = 'What can you do in Nuremberg, Germany? Give me 3 Tips'
const stream = hf.textGenerationStream({ inputs: prompt })
for await (const r of stream) {
  // yield the generated token
  process.stdout.write(r.token.text)
}
......
# Tensor Parallelism
Tensor parallelism is a technique used to fit a large model in multiple GPUs. For example, when multiplying the input tensors with the first weight tensor, the matrix multiplication is equivalent to splitting the weight tensor column-wise, multiplying each column with the input separately, and then concatenating the separate outputs. These outputs are then transferred from the GPUs and concatenated together to get the final result, like below 👇
![Image courtesy of Anton Lozkhov](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/TP.png)
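A tiny sketch of the equivalence described above, using plain PyTorch on a single device; the two shards stand in for two GPUs, whereas in a real deployment each shard would live on its own device and the concatenation would be a cross-GPU gather:

```python
import torch

X = torch.randn(4, 8)   # input activations
W = torch.randn(8, 6)   # weight matrix of one linear layer

# Column-wise split across two shards; each shard computes its slice of the output.
W0, W1 = W[:, :3], W[:, 3:]
Y0, Y1 = X @ W0, X @ W1

# Gathering (concatenating) the shard outputs reproduces the full matmul.
assert torch.allclose(torch.cat([Y0, Y1], dim=1), X @ W)
```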
......
@@ -4,7 +4,7 @@ This section explains how to install the CLI tool as well as installing TGI from
## Install CLI
You can use the TGI command-line interface (CLI) to download weights, serve and quantize models, or get information on serving parameters.
To install the CLI, first clone the TGI repository and then run `make`.
@@ -23,7 +23,7 @@ BUILD_EXTENSIONS=True make install
Before you start, you will need to set up your environment and install Text Generation Inference. Text Generation Inference is tested on **Python 3.9+**.
Text Generation Inference is available on PyPI, Conda and GitHub.
To install and launch locally, first [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
Python 3.9, e.g. using conda:
......
@@ -92,7 +92,7 @@ print(chat_completion)
## Hugging Face Inference Endpoints
The Messages API is integrated with [Inference Endpoints](https://huggingface.co/inference-endpoints/dedicated).
Every endpoint that uses "Text Generation Inference" with an LLM that has a chat template can now be used. Below is an example of how to use Inference Endpoints with TGI using OpenAI's Python client library:
> **Note:** Make sure to replace `base_url` with your endpoint URL and to include `v1/` at the end of the URL. The `api_key` should be replaced with your Hugging Face API key.
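A minimal sketch of such a call with the `openai` Python client is shown below; the endpoint URL and API key are placeholders, and the `model` value is a placeholder name since the endpoint serves a single model.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR_ENDPOINT.endpoints.huggingface.cloud/v1/",  # note the trailing v1/
    api_key="YOUR_HF_API_KEY",
)

chat_completion = client.chat.completions.create(
    model="tgi",  # placeholder: the endpoint serves a single model
    messages=[{"role": "user", "content": "What is deep learning?"}],
    stream=False,
)
print(chat_completion.choices[0].message.content)
```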
......
@@ -53,7 +53,7 @@ print(response.json())
```js
async function query() {
    const response = await fetch(
        'http://127.0.0.1:8080/generate',
        {
            method: 'POST',
            headers: { 'Content-Type': 'application/json'},
......
@@ -54,7 +54,9 @@ async def test_mamba_all_params(fused_kernel_mamba, response_snapshot):
@pytest.mark.asyncio
@pytest.mark.private
async def test_mamba_load(
    fused_kernel_mamba, generate_load, generous_response_snapshot
):
    responses = await generate_load(
        fused_kernel_mamba, "What is Deep Learning?", max_new_tokens=10, n=4
    )
......
@@ -2,4 +2,4 @@
addopts = --snapshot-warn-unused
asyncio_mode = auto
markers =
    private: marks tests as requiring an admin hf token (deselect with '-m "not private"')
\ No newline at end of file
@@ -57,7 +57,7 @@ export function run(host, generate_payload, max_new_tokens) {
  const duration = res.timings.duration;
  if (res.status === 200) {
    const body = res.json();
    const n_tokens = body.details.tokens.length;
    const latency_ms_per_token = duration / n_tokens;
    timePerToken.add(latency_ms_per_token);
......
@@ -60,4 +60,4 @@ export default function () {
    inferenceTime.add(res.headers["X-Inference-Time"]);
    timePerToken.add(res.headers["X-Time-Per-Token"]);
  }
}
\ No newline at end of file
import { get_options, run } from "./common.js";
const reference_latency_ms = 70;
const host = __ENV.HOST || '127.0.0.1:8000';
const max_new_tokens = 50;
......
import { get_options, run } from "./common.js";
const reference_latency_ms = 22;
const host = __ENV.HOST || '127.0.0.1:8000';
const max_new_tokens = 50;
......
@@ -28,7 +28,7 @@ this is controlled by the client, and therefore the amount of batching is decide
beforehand.
For text generation, and for LLMs which are memory bound, we can try to be much more
efficient with the available compute by having clients send us single queries
and letting the router mix and match queries into or out of batches to make the most
efficient use of the compute. This is possible because for LLMs the total compute
for running the model is much bigger than the cost of mixing and matching the batches themselves.
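As a toy sketch of this routing idea (not TGI's actual router), the loop below admits queued requests into the running batch as soon as slots free up, instead of waiting for a whole batch to drain; request lengths and batch size are made up.

```python
from collections import deque

# Each request needs a different number of decode steps (tokens) to finish.
queue = deque({"id": i, "remaining": n} for i, n in enumerate([3, 5, 2, 4]))
batch, max_batch_size = [], 2

steps = 0
while queue or batch:
    # Admit waiting requests as soon as there is room in the batch.
    while queue and len(batch) < max_batch_size:
        batch.append(queue.popleft())
    # One decode step generates one token for every request in the batch.
    for request in batch:
        request["remaining"] -= 1
    batch = [r for r in batch if r["remaining"] > 0]
    steps += 1

print(f"all requests finished after {steps} decode steps")
```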
@@ -89,5 +89,5 @@ most critical perceived quality of an LLM API.
With token streaming, the server can start answering after the first `prefill` pass
directly, without waiting for all the generation to be done. For extremely long queries
this means clients can start to see something happening orders of magnitude before
the work is done. Seeing something in progress allows them to cut short if it's not
what they want, but it also "feels" better.
*.rs
\ No newline at end of file
@@ -27,6 +27,7 @@ pub struct Validation {
}
impl Validation {
    #[allow(clippy::too_many_arguments)]
    pub(crate) fn new(
        workers: usize,
        tokenizer: Option<Tokenizer>,
......
@@ -3,4 +3,4 @@
# Branched from master on: 10 November, 2023
# https://releases.rs/docs/1.75.0/
channel = "1.75.0"
components = ["rustfmt", "clippy"]
\ No newline at end of file
@@ -2,7 +2,7 @@
# to make cuda graphs work.
awq_commit := bd1dc2d5254345cc76ab71894651fb821275bdd4
awq:
	rm -rf llm-awq
	git clone https://github.com/huggingface/llm-awq
......