👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
- \[2023/11\] TurboMind supports loading HF models directly. Click [here](./docs/en/load_hf.md) for details.
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and Python specialist. Click [here](./docs/en/supported_models/codellama.md) for the deployment guide
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation 🚀. Check [this](./docs/en/w4a16.md) guide for detailed info
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variant models on NVIDIA GPUs.
- **Interactive Inference Mode**: By caching the attention k/v across multi-round dialogues, the engine remembers dialogue history and avoids reprocessing historical sessions.
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated at different model scales.
- **Persistent Batch Inference**: Further optimization of model execution efficiency.
**Case I**: output token throughput with fixed input and output token numbers (1 input token, 2048 output tokens)
**Case II**: request throughput with real conversation data
Test setting: LLaMA-7B, NVIDIA A100 (80G)
The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall, and outperforms Hugging Face Transformers by up to 2.3x.
The request throughput of TurboMind is 30% higher than vLLM's.
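To reproduce numbers of this kind, the upstream repository ships profiling scripts under `benchmark/`. The script names below are an assumption based on the upstream repo, and their flags vary by version, so query `--help` before running:

```shell
# Case I: fixed-length generation throughput (script name assumed; see the repo)
python benchmark/profile_generation.py --help
# Case II: request throughput on real conversation data (script name assumed)
python benchmark/profile_throughput.py --help
```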
Install lmdeploy with pip (Python 3.8+) or [from source](./docs/en/build.md):
```shell
pip install lmdeploy
```
> **Note**<br />
> `pip install lmdeploy` installs only the packages required at runtime. To run code from modules such as `lmdeploy.lite` and `lmdeploy.serve`, install the corresponding extra dependencies.
> For instance, running `pip install lmdeploy[lite]` installs the extra dependencies for the `lmdeploy.lite` module. The supported extras are:
>
> - `all`: Install lmdeploy with all dependencies in `requirements.txt`
> - `lite`: Install lmdeploy with extra dependencies in `requirements/lite.txt`
> - `serve`: Install lmdeploy with dependencies in `requirements/serve.txt`
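As a quick reference, the extras above map onto pip commands like so (a minimal sketch; the extra names are taken from the list above):

```shell
pip install lmdeploy          # runtime packages only
pip install lmdeploy[lite]    # + quantization dependencies (lmdeploy.lite)
pip install lmdeploy[serve]   # + serving dependencies (lmdeploy.serve)
pip install lmdeploy[all]     # all dependencies in requirements.txt
```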
### Deploy InternLM
To use the TurboMind inference engine, you first need to convert the model into the TurboMind format. Currently, we support online conversion and offline conversion: with online conversion, TurboMind loads the Huggingface model directly, while with offline conversion, you save the converted model first and then load it.
The following uses [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) as an example to show how to use TurboMind with online conversion, as sketched below. You can refer to [load_hf.md](docs/en/load_hf.md) for other methods.
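A minimal sketch, assuming the `lmdeploy chat turbomind` CLI entry point described in [load_hf.md](docs/en/load_hf.md); consult that guide for the authoritative flags:

```shell
# TurboMind downloads and converts the HF model on the fly, then starts an
# interactive chat session. The --model-name flag selecting the built-in chat
# template is an assumption based on docs/en/load_hf.md of this release.
lmdeploy chat turbomind internlm/internlm-chat-7b --model-name internlm-chat-7b
```

With online conversion there is no separate workspace directory to manage; the converted weights live only for the lifetime of the process.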