Unverified commit 4c303b17, authored by tpoisonooo, committed by GitHub

docs(README): fix (#50)

parent 0d19a95d
@@ -35,7 +35,7 @@ English | [简体中文](README_zh-CN.md)
 LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
-- **Efficient Inference Engine TurboMind**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine - TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.
+- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine - TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.
 - **Interactive Inference Mode**: By caching the k/v of attention during multi-round dialogue processes, it remembers dialogue history, thus avoiding repetitive processing of historical sessions.
@@ -79,7 +79,7 @@ Weights for the LLaMA models can be obtained by filling out [this form](htt
 Run one of the following commands to serve a LLaMA model on NVIDIA GPU server:
-<details open>
+<details close>
 <summary><b>7B</b></summary>
 ```shell
@@ -90,7 +90,7 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/turb
 </details>
-<details open>
+<details close>
 <summary><b>13B</b></summary>
 ```shell
...
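The shell commands inside the collapsed `<details>` blocks are elided from this view ("..."), but the `@@ -90` hunk header quotes the serving script they end with. As a hedged sketch only: the invocation below assumes the truncated backend path ends in `turbomind` (the hunk header cuts off at `turb`) and that a converted model workspace already exists under `./workspace`; it is not the verbatim README content.

```shell
# Hedged sketch, not the verbatim (elided) README commands.
# Assumption: the truncated ".../backends/turb" path in the hunk header
# continues as ".../backends/turbomind".
# Assumption: ./workspace was produced by an earlier model-conversion step.
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/turbomind
```

Per the diff, the 7B and 13B sections each wrap such a block, and this commit collapses both by switching `<details open>` to `<details close>`.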
@@ -44,7 +44,7 @@ LMDeploy is developed by the [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](ht
 <img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true"/>
 </div>
-- **Multi-GPU deployment and quantization**: We provide comprehensive support for model deployment and quantization, validated on models from 7B to 100B.
+- **Multi-GPU deployment and quantization**: We provide comprehensive support for model deployment and quantization, validated across models of different scales.
 - **persistent batch inference**: further optimizes model execution efficiency.
@@ -79,7 +79,7 @@ make -j$(nproc) && make install
 Run the following commands to deploy a LLaMA model to an NVIDIA GPU server:
-<details open>
+<details close>
 <summary><b>7B</b></summary>
 ```shell
@@ -90,7 +90,7 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/turb
 </details>
-<details open>
+<details close>
 <summary><b>13B</b></summary>
 ```shell
...
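For orientation, the `@@ -79` hunk header above quotes the build step that precedes the serving instructions (`make -j$(nproc) && make install`). Below is a minimal sketch of the kind of out-of-source CMake build that would produce the `build/install` tree referenced in the `--lib-dir` path; the directory layout and flags are assumptions, not the verbatim README:

```shell
# Hedged sketch of a typical CMake out-of-source build; only the final line
# is quoted from the hunk header, everything else is an assumption.
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$(pwd)/install  # assumed prefix, consistent with the build/install/... --lib-dir path
make -j$(nproc) && make install
```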