@@ -43,8 +43,8 @@ to power LLMs api-inference widgets.
 - Tensor Parallelism for faster inference on multiple GPUs
 - Token streaming using Server-Sent Events (SSE)
 - [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
-- Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
-- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
+- Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
+- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323)
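
For reference, the token-streaming feature listed above can be exercised from a plain HTTP client. The snippet below is a minimal sketch, assuming a running text-generation-inference server that exposes the `/generate_stream` SSE endpoint on `http://127.0.0.1:8080` and accepts an `inputs` string plus a `parameters.max_new_tokens` option; the host, port, and parameter values here are illustrative and should be adjusted to match your deployment.

```python
# Minimal sketch of consuming the SSE token stream.
# Assumptions (not part of the diff above): the server listens on
# http://127.0.0.1:8080 and exposes /generate_stream as documented
# in the text-generation-inference repository.
import json
import requests

def stream_tokens(prompt: str, url: str = "http://127.0.0.1:8080/generate_stream"):
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 20}}
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Each SSE frame looks like: b'data:{"token": {...}, ...}'
            if line.startswith(b"data:"):
                event = json.loads(line[len(b"data:"):])
                yield event["token"]["text"]

if __name__ == "__main__":
    for token_text in stream_tokens("What is Deep Learning?"):
        print(token_text, end="", flush=True)
```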