👋 Join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click [here](./docs/en/supported_models/codellama.md) for the deployment guide
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation 🚀. Check [this](./docs/en/w4a16.md) guide for details
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features (a minimal usage sketch follows the list):
- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine, TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.
- **Interactive Inference Mode**: By caching the attention k/v across multi-round dialogues, the engine remembers dialogue history and avoids re-processing historical sessions.
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated across different model scales.
- **Persistent Batch Inference**: Further optimization of model execution efficiency by continuously batching incoming requests.
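
The sketch below illustrates what using these features can look like from Python. It is a minimal example under stated assumptions: it uses the `pipeline` API that LMDeploy exposes in later releases (the entry points available at the time of this README may differ; see the linked docs), and `internlm/internlm-chat-7b` is only an illustrative model id.

```python
# Minimal sketch, assuming LMDeploy's `pipeline` API from later releases.
from lmdeploy import pipeline

# Build a TurboMind-backed pipeline for a LLaMA-variant chat model.
# The model id is illustrative; any supported chat model works the same way.
pipe = pipeline("internlm/internlm-chat-7b")

# Persistent batching schedules these prompts together on the engine, and the
# attention k/v cache keeps dialogue history from being re-processed.
responses = pipe([
    "Introduce yourself in one sentence.",
    "Explain tensor parallelism briefly.",
])
for r in responses:
    print(r.text)
```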
**Case I**: output token throughput with fixed input and output token numbers (1 input token, 2048 output tokens)
**Case II**: request throughput with real conversation data
Test setting: LLaMA-7B, NVIDIA A100 (80GB)
In Case I, the output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5%-15% higher than DeepSpeed overall and up to 2.3x that of Hugging Face Transformers.
In Case II, the request throughput of TurboMind is 30% higher than that of vLLM.
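
For context on how Case I-style numbers can be produced, here is an illustrative measurement sketch. It is not the benchmark script behind the figures above; it assumes the `pipeline` and `GenerationConfig` APIs from later LMDeploy releases, and the model id is again only an example.

```python
# Illustrative throughput sketch (not the official benchmark), assuming the
# `pipeline` / `GenerationConfig` APIs from later LMDeploy releases.
import time
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline("internlm/internlm-chat-7b")  # example model id
# Force a fixed output length of 2048 tokens, ignoring EOS, to mimic Case I.
gen_cfg = GenerationConfig(max_new_tokens=2048, ignore_eos=True)

start = time.perf_counter()
# A single very short prompt approximates the "1 input token" setting.
outputs = pipe(["Hi"], gen_config=gen_cfg)
elapsed = time.perf_counter() - start

total_output_tokens = sum(o.generate_token_len for o in outputs)
print(f"output token throughput: {total_output_tokens / elapsed:.1f} tokens/s")
```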