# LLM Offline Inference Pipeline

In this tutorial, we will walk through a series of examples that introduce the usage of `lmdeploy.pipeline`.

You can find the detailed pipeline API in [this](https://lmdeploy.readthedocs.io/en/latest/api/pipeline.html) guide.

## Usage

- **An example using default parameters:**

  ```python
  from lmdeploy import pipeline

  pipe = pipeline('internlm/internlm2-chat-7b')
  response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
  print(response)
  ```

  In this example, the pipeline by default allocates a predetermined percentage of GPU memory for storing the k/v cache. The ratio is controlled by the parameter `TurbomindEngineConfig.cache_max_entry_count`.

  The strategy for setting the k/v cache ratio has changed over the course of LMDeploy's development. The change history is as follows:

  1. `v0.2.0 <= lmdeploy <= v0.2.1`

     `TurbomindEngineConfig.cache_max_entry_count` defaults to 0.5, meaning 50% of the GPU's **total memory** is allocated for the k/v cache. Out-of-memory (OOM) errors may occur if a 7B model is deployed on a GPU with less than 40 GB of memory. If you encounter an OOM error, please decrease the ratio of the k/v cache occupation as follows:

     ```python
     from lmdeploy import pipeline, TurbomindEngineConfig

     # decrease the ratio of the k/v cache occupation to 20%
     backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)

     pipe = pipeline('internlm/internlm2-chat-7b',
                     backend_config=backend_config)
     response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
     print(response)
     ```

  2. `lmdeploy > v0.2.1`

     The allocation strategy was changed to reserve the k/v cache proportionally from the **free GPU memory**, and the default value of `TurbomindEngineConfig.cache_max_entry_count` was adjusted to 0.8. If an OOM error occurs, reduce the ratio value to lower the memory usage of the k/v cache, as shown in the sketch below.
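     A minimal sketch of lowering the ratio under this newer free-memory-based strategy; the value 0.5 here is only an illustrative choice, not a recommendation:

     ```python
     from lmdeploy import pipeline, TurbomindEngineConfig

     # reserve 50% of the *free* GPU memory for the k/v cache
     # instead of the default 80% (illustrative value)
     backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)

     pipe = pipeline('internlm/internlm2-chat-7b',
                     backend_config=backend_config)
     response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
     print(response)
     ```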
- **An example showing how to set the tensor parallel number**:

  ```python
  from lmdeploy import pipeline, TurbomindEngineConfig

  backend_config = TurbomindEngineConfig(tp=2)
  pipe = pipeline('internlm/internlm2-chat-7b',
                  backend_config=backend_config)
  response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
  print(response)
  ```

- **An example for setting sampling parameters:**

  ```python
  from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

  backend_config = TurbomindEngineConfig(tp=2)
  gen_config = GenerationConfig(top_p=0.8,
                                top_k=40,
                                temperature=0.8,
                                max_new_tokens=1024)
  pipe = pipeline('internlm/internlm2-chat-7b',
                  backend_config=backend_config)
  response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
                  gen_config=gen_config)
  print(response)
  ```

- **An example for OpenAI format prompt input:**

  ```python
  from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

  backend_config = TurbomindEngineConfig(tp=2)
  gen_config = GenerationConfig(top_p=0.8,
                                top_k=40,
                                temperature=0.8,
                                max_new_tokens=1024)
  pipe = pipeline('internlm/internlm2-chat-7b',
                  backend_config=backend_config)
  prompts = [[{
      'role': 'user',
      'content': 'Hi, pls intro yourself'
  }], [{
      'role': 'user',
      'content': 'Shanghai is'
  }]]
  response = pipe(prompts, gen_config=gen_config)
  print(response)
  ```

- **An example for streaming mode:**

  ```python
  from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

  backend_config = TurbomindEngineConfig(tp=2)
  gen_config = GenerationConfig(top_p=0.8,
                                top_k=40,
                                temperature=0.8,
                                max_new_tokens=1024)
  pipe = pipeline('internlm/internlm2-chat-7b',
                  backend_config=backend_config)
  prompts = [[{
      'role': 'user',
      'content': 'Hi, pls intro yourself'
  }], [{
      'role': 'user',
      'content': 'Shanghai is'
  }]]
  for item in pipe.stream_infer(prompts, gen_config=gen_config):
      print(item)
  ```

- **An example for the PyTorch backend. Please install triton first:**

  ```shell
  pip install "triton>=2.1.0"
  ```

  ```python
  from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

  backend_config = PytorchEngineConfig(session_len=2048)
  gen_config = GenerationConfig(top_p=0.8,
                                top_k=40,
                                temperature=0.8,
                                max_new_tokens=1024)
  pipe = pipeline('internlm/internlm-chat-7b',
                  backend_config=backend_config)
  prompts = [[{
      'role': 'user',
      'content': 'Hi, pls intro yourself'
  }], [{
      'role': 'user',
      'content': 'Shanghai is'
  }]]
  response = pipe(prompts, gen_config=gen_config)
  print(response)
  ```

## FAQs

- **RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase**.

  If you hit this error when using `tp>1` with the PyTorch backend, please make sure your Python script guards its entry point:

  ```python
  if __name__ == '__main__':
  ```

  Generally, in multi-threaded or multi-processed contexts, initialization code may need to run only once. The `if __name__ == '__main__':` guard ensures that this initialization is executed only in the main program and is not repeated in each newly spawned process or thread. A complete sketch is provided at the end of this page.

- To customize a chat template, please refer to [chat_template.md](../advance/chat_template.md).
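For reference, here is a minimal sketch of a script with the entry-point guard described in the FAQ above. It assumes the PyTorch backend on two GPUs, and that `PytorchEngineConfig` accepts a `tp` argument analogous to the TurboMind config shown earlier; adjust the model and parallelism degree to your own setup.

```python
from lmdeploy import pipeline, PytorchEngineConfig


def main():
    # engine initialization happens only in the main process,
    # not in the worker processes spawned for tensor parallelism
    backend_config = PytorchEngineConfig(session_len=2048, tp=2)
    pipe = pipeline('internlm/internlm-chat-7b',
                    backend_config=backend_config)
    response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
    print(response)


if __name__ == '__main__':
    main()
```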