Commit fe851fbc authored by zhouxiang

Add the new files for the 0.2.6 release

parent e2d98ddc
# autotest case
We provide an autotest case set for regression testing.
## Preparation before testing
To speed up test execution, the Hugging Face model files are downloaded to a fixed location in advance so that the test cases can use them directly. This location is defined by the `model_path` parameter in the `autotest/config.yaml` file.
Because the test cases convert the HF models with `lmdeploy convert`, the storage path for converted models is defined by the `dst_path` parameter in `autotest/config.yaml`.
`autotest/config.yaml` also defines the supported models and their categories (the `turbomind_model` and `pytorch_model` lists), as well as the log path `log_path` used during test case execution.
If you want to create a test environment, prepare the items above and modify `config.yaml` as needed; a minimal sketch of the file is shown below.
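The snippet below is an illustrative sketch of `autotest/config.yaml` only: the paths are placeholders for your own environment and the model lists are abbreviated (the complete lists are in the `config.yaml` included in this commit).
```yaml
model_path: /path/to/hf_models        # where the pre-downloaded HF models live
dst_path: /path/to/converted_models   # output directory for `lmdeploy convert`
log_path: /path/to/test_logs          # logs written while the test cases run
tp_config:                            # models that need more than one GPU (tp > 1)
  internlm2-chat-20b: 2
turbomind_model:                      # models covered by the turbomind backend
  - internlm/internlm2-chat-20b
pytorch_model:                        # models covered by the pytorch backend
  - internlm/internlm2-chat-20b
```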
## How to run testcases
Install required dependencies using the following command line:
```bash
python3 -m pip install -r requirements/test.txt
```
Run pytest with case filtering via the `-m` flag or by folder, e.g. `-m convert` to select the cases related to convert, or `autotest/tools/convert` to run the cases in that folder. The corresponding results will be stored in the `allure-results` directory.
```bash
pytest autotest -m convert --clean-alluredir --alluredir=allure-results
pytest autotest/tools/convert --clean-alluredir --alluredir=allure-results
```
If you need to generate and browse reports, install allure according to the [allure installation documentation](https://allurereport.org/docs/gettingstarted-installation/#install-via-the-system-package-manager-for-linux). You can also install it directly using the following commands:
```bash
wget https://github.com/allure-framework/allure2/releases/download/2.24.1/allure_2.24.1-1_all.deb
sudo apt-get install -y openjdk-8-jre-headless
sudo dpkg -i ./allure_2.24.1-1_all.deb
```
Then generate the test report and open the corresponding HTML page using the following commands. The generated report will be stored in `allure-reports`.
```bash
allure generate -c -o allure-reports
allure open ./allure-reports
```
## Test case functionality coverage
The test cases cover the following modules:
- tools module: basic cases that correspond to the tutorials
- interface module: interface function cases for the pipeline, restful api and triton server api
The relationship between functionalities and test cases is as follows:
| case module | Function | Test Case File |
| :--------: | :------------------------------: | :--------------------------------------------------: |
| tools | quantization - w4a16 | tools/quantization/test_quantization_w4a16.py |
| tools | quantization - w8a8 | tools/quantization/test_quantization_w8a8.py |
| tools | quantization - kv int8 | tools/quantization/test_quantization_kvint8.py |
| tools | quantization - kv int8 and w4a16 | tools/quantization/test_quantization_kvint8_w4a16.py |
| tools | convert | tools/convert/test_convert.py |
| tools | pipeline chat - turbomind | tools/pipeline/test_pipeline_chat_turbomind.py |
| tools | pipeline chat - pytorch | tools/pipeline/test_pipeline_chat_pytorch.py |
| tools | restful_api chat - turbomind | tools/pipeline/test_restful_chat_turbomind.py |
| tools | restful_api chat - pytorch | tools/pipeline/test_restful_chat_pytorch.py |
| tools | command chat - workspace | tools/chat/test_command_chat_workspace.py |
| tools | command chat - hf turbomind | tools/chat/test_command_chat_hf_turbomind.py |
| tools | command chat - hf pytorch | tools/chat/test_command_chat_hf_pytorch.py |
| interface | command chat - hf pytorch | tools/chat/test_command_chat_hf_pytorch.py |
The modules and models currently covered by the turbomind and pytorch backends are listed in `autotest/config.yaml` under the `turbomind_model` and `pytorch_model` keys.
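The helpers imported by the test files (for example `get_turbomind_model_list` and `get_torch_model_list` from `utils.config_utils`) are not included in the files shown here, so the following is only an assumed sketch of how the `turbomind_model` list and `tp_config` could be combined to select the models for a given tensor-parallel size; the real helper may differ.
```python
# Assumed sketch: selecting models from autotest/config.yaml for a given tp size.
# The actual implementation lives in utils/config_utils.py and is not shown above.
import yaml


def get_turbomind_model_list(tp_num=1, config_path='autotest/config.yaml'):
    with open(config_path) as f:
        config = yaml.safe_load(f)
    tp_config = config.get('tp_config', {})
    # Models without a tp_config entry are assumed to run on a single GPU.
    return [
        model for model in config.get('turbomind_model', [])
        if tp_config.get(model.split('/')[-1], 1) == tp_num
    ]


if __name__ == '__main__':
    # e.g. the models exercised by the gpu_num_2 test variants
    print(get_turbomind_model_list(tp_num=2))
```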
## How to add a testcase
If you want to add a new model to the tools test cases, prepare the model on your machine as described in [Preparation before testing](#preparation-before-testing), then add it to `autotest/config.yaml`, as sketched below.
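The entry below is purely illustrative: `my-org/my-chat-7b` is a hypothetical model id, its files are expected under `<model_path>/my-org/my-chat-7b`, and the `tp_config` entry is only needed if the model requires more than one GPU.
```yaml
turbomind_model:
  - internlm/internlm2-chat-20b   # existing entry
  - my-org/my-chat-7b             # hypothetical new model
tp_config:
  internlm2-chat-20b: 2
  my-chat-7b: 2                   # add only if the new model needs tp=2
```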
## How to add a chat case template
We provide some basic dialogue cases in YAML files.
`chat_prompt_case.yaml` is used for CLI command testing, while `prompt_case.yaml` is used for pipeline chat, restful api and gradio testing.
If you want to add a dialogue case, modify the corresponding YAML file.
The structure and logic of the YAML files are as follows:
```yaml
# case name: give the case a descriptive, function-oriented name. For example, this
# case tests whether the model remembers information from previous rounds in a
# multi-round dialogue.
memory_test:
  - please introduce some attractions in Chengdu: # Round 1 prompt
      # output assert rule list; all rules must be satisfied for the case to pass
      - contain: # the output needs to contain at least one of the following items
          - chengdu
      - contain:
          - 熊猫
          - panda
      # For a specific model that requires different assert logic, the key is the
      # model type and the value is a list of assert rules. This is an example for
      # the llama2 model; when it matches, the other assert rules are ignored.
      - llama2:
          - len_g:
              10
  - please introduce some delicious foods: # Round 2 prompt
      # output assert rule list
      - contain:
          - chengdu
      - len_g: # the output's length should be larger than 10
          10
  - XXX: # Round 3 prompt
```
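The assertion helpers that consume these rules live under `autotest/utils` and are not included in the files shown here. Assuming the semantics described in the comments above (`contain` passes when any listed item appears, `not_contain` when none appears, `len_g` when the response is longer than the given number, and a matching model-specific key replaces the general rules), a minimal evaluator could look like the sketch below; the real helpers may differ.
```python
# Illustrative sketch only: evaluating the YAML assert rules described above.
# The project's real assertion helpers (under autotest/utils) are not shown in
# this commit and may support more rules or different matching logic.
def _rule_passes(name, value, response):
    text = response.lower()
    if name == 'contain':      # pass if ANY listed item appears in the response
        return any(str(item).lower() in text for item in value)
    if name == 'not_contain':  # pass if NONE of the listed items appear
        return all(str(item).lower() not in text for item in value)
    if name == 'len_g':        # pass if the response is longer than the threshold
        return len(response) > int(value)
    return True                # other keys (e.g. model names) are ignored here


def check_response(rules, response, model_name):
    """Check one prompt's rule list against a model's response."""
    rules = rules or []
    # A model-specific key overrides the general rules for that model.
    for rule in rules:
        name, value = next(iter(rule.items()))
        if name.lower() in model_name.lower():
            rules = value
            break
    return all(
        _rule_passes(name, value, response)
        for rule in rules for name, value in rule.items())
```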
common_case:
  - 你好,你叫什么名字#hi, what's your name:
  - 介绍成都的景点#please introduce attractions in Chengdu:
      - contain:
          - chengdu
          - 成都
      - codellama:
          - contain:
              - chengdu
              - 成都
              - llama
      - internlm2-1_8b:
          - contain:
              - chengdu
              - 成都
              - 你好
      - end:
  - 介绍相应美食#please introduce some delicious foods:
      - not_contain:
          - 成都
          - chengdu
memory_test:
  - 介绍成都的景点#please introduce attractions in Chengdu:
      - contain:
          - chengdu
          - 成都
      - contain:
          - 熊猫
          - panda
          - 宽窄巷子
          - jinli
          - leshan
          - 历史悠久
      - falcon:
          - contain:
              - chengdu
              - 成都
      - internlm2-1_8b:
          - contain:
              - chengdu
              - 成都
      - internlm2-20b:
          - contain:
              - chengdu
              - 成都
  - 介绍相应美食#please introduce some delicious foods:
      - contain:
          - 成都
          - chengdu
      - contain:
          - 火锅
          - hotpot
          - hot pot
          - 四川
      - falcon:
          - len_g:
              10
      - internlm2-1_8b:
          - contain:
              - chengdu
              - 成都
      - internlm2-20b:
          - contain:
              - chengdu
              - 成都
model_path: /mnt/bigdisk/qa_test_models
dst_path: /nvme/qa_test_models/autotest_model
log_path: /nvme/qa_test_models/autotest_model/log
dataset_path: /nvme/qa_test_models/...dataset
tp_config:
  internlm-chat-20b: 2
  internlm2-chat-20b: 2
  Baichuan2-13B-Chat: 2
  Mixtral-8x7B-Instruct-v0.1: 2
  internlm2-20b: 2
turbomind_model:
- meta-llama/Llama-2-7b-chat
- internlm/internlm2-chat-1_8b
- internlm/internlm-chat-7b
- internlm/internlm-chat-20b
- internlm/internlm2-chat-7b
- internlm/internlm2-chat-20b
- internlm/internlm2-chat-7b-4bits
- internlm/internlm2-chat-20b-4bits
- Qwen/Qwen-7B-Chat
- Qwen/Qwen-14B-Chat
- lmdeploy/llama2-chat-7b-w4
- baichuan-inc/Baichuan2-7B-Chat
- 01-ai/Yi-6B-Chat
- internlm/internlm2-1_8b
- internlm/internlm2-20b
- codellama/CodeLlama-7b-Instruct-hf
pytorch_model:
- meta-llama/Llama-2-7b-chat
- internlm/internlm-chat-7b
- internlm/internlm-chat-20b
- internlm/internlm2-chat-7b
- internlm/internlm2-chat-20b
- baichuan-inc/Baichuan2-7B-Chat
- baichuan-inc/Baichuan2-13B-Chat
- THUDM/chatglm2-6b
- tiiuae/falcon-7b
- 01-ai/Yi-6B-Chat
- internlm/internlm2-1_8b
- internlm/internlm2-20b
- Qwen/Qwen1.5-7B-Chat
- mistralai/Mistral-7B-Instruct-v0.1
- mistralai/Mixtral-8x7B-Instruct-v0.1
- google/gemma-7b-it
- deepseek-ai/deepseek-moe-16b-chat
quatization_case_config:
  w4a16:
    - meta-llama/Llama-2-7b-chat
    - internlm/internlm-chat-20b
    - Qwen/Qwen-7B-Chat
    - Qwen/Qwen-14B-Chat
    - internlm/internlm2-chat-20b
    - baichuan-inc/Baichuan2-7B-Chat
    - internlm/internlm2-20b
  kvint8: # more models support kvint8 quantization, but their chat responses were poor, so they were removed
    - meta-llama/Llama-2-7b-chat
    - internlm/internlm-chat-20b
    - internlm/internlm2-chat-20b
  kvint8_w4a16:
    - meta-llama/Llama-2-7b-chat
    - internlm/internlm-chat-20b
    - internlm/internlm2-chat-20b
    - internlm/internlm2-20b
    - Qwen/Qwen-7B-Chat
    - Qwen/Qwen-14B-Chat
    - baichuan-inc/Baichuan2-7B-Chat
  w8a8:
    - meta-llama/Llama-2-7b-chat
    - internlm/internlm-chat-20b
    - internlm/internlm2-chat-20b
    - internlm/internlm2-chat-7b
    - 01-ai/Yi-6B-Chat
    - internlm/internlm2-20b
import os
import pytest
import yaml
cli_prompt_case_file = 'autotest/chat_prompt_case.yaml'
common_prompt_case_file = 'autotest/prompt_case.yaml'
config_file = 'autotest/config.yaml'
@pytest.fixture(scope='session')
def config():
config_path = os.path.join(config_file)
with open(config_path) as f:
env_config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return env_config
@pytest.fixture(scope='session')
def cli_case_config():
case_path = os.path.join(cli_prompt_case_file)
with open(case_path) as f:
case_config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return case_config
@pytest.fixture(scope='class', autouse=True)
def common_case_config():
case_path = os.path.join(common_prompt_case_file)
with open(case_path) as f:
case_config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return case_config
def _init_cli_case_list():
case_path = os.path.join(cli_prompt_case_file)
with open(case_path) as f:
case_config = yaml.load(f.read(), Loader=yaml.SafeLoader)
global global_cli_case_List
global_cli_case_List = list(case_config.keys())
def _init_common_case_list():
case_path = os.path.join(common_prompt_case_file)
with open(case_path) as f:
case_config = yaml.load(f.read(), Loader=yaml.SafeLoader)
global global_common_case_List
global_common_case_List = list(case_config.keys())
import pytest
from pytest import assume
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline
@pytest.mark.order(8)
@pytest.mark.pipeline_turbomind_func
@pytest.mark.timeout(240)
@pytest.mark.flaky(reruns=0)
class TestPipelineTurbomindFuncRegression:
@pytest.mark.parametrize('model', ['internlm/internlm2-chat-20b'])
def test_backend_config_tp(self, config, model):
with pytest.raises(AssertionError, match='tp should be 2\\^n'):
model_path = '/'.join([config.get('model_path'), model])
backend_config = TurbomindEngineConfig(tp=100)
pipe = pipeline(model_path, backend_config=backend_config)
del pipe
@pytest.mark.parametrize('model', ['internlm/internlm2-chat-20b'])
def test_backend_config_session_len(self, config, model):
model_path = '/'.join([config.get('model_path'), model])
backend_config = TurbomindEngineConfig(session_len=10)
pipe = pipeline(model_path, backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
del pipe
for i in range(2):
assert response[i].finish_reason == 'length', str(response[i])
assert response[i].generate_token_len == 0, str(response[i])
@pytest.mark.parametrize('model', ['internlm/internlm2-chat-20b'])
def test_gen_config_test(self, config, model):
model_path = '/'.join([config.get('model_path'), model])
pipe = pipeline(model_path)
# test min_new_tokens
gen_config = GenerationConfig(min_new_tokens=200, ignore_eos=True)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
gen_config=gen_config)
for i in range(2):
with assume:
assert response[i].finish_reason == 'length', str(response[i])
with assume:
assert response[i].session_id == i
# test stop_words
gen_config = GenerationConfig(stop_words=[' and', '浦', ' to'],
random_seed=1,
temperature=0.01)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
gen_config=gen_config)
with assume:
assert '浦' not in response[0].text and response[
0].finish_reason == 'stop' and response[
0].generate_token_len < 20, str(response[0])
with assume:
assert ' and' not in response[1].text and ' to ' not in response[
1].text and response[1].finish_reason == 'stop' and response[
1].generate_token_len < 20, str(response[1])
# test bad_words
gen_config = GenerationConfig(bad_words=[' and', '浦', ' to'],
temperature=0.01,
random_seed=1)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
gen_config=gen_config)
with assume:
assert '浦' not in response[0].text and '蒲' in response[
0].text, str(response[0])
with assume:
assert ' and' not in response[1].text and ' to ' not in response[
1].text, str(response[1])
# test special_words
gen_config = GenerationConfig(skip_special_tokens=False)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
gen_config=gen_config)
for i in range(2):
with assume:
assert response[i].finish_reason == 'length' or response[
i].finish_reason == 'stop', str(response[i])
# test max_new_tokens
gen_config = GenerationConfig(max_new_tokens=5)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
gen_config=gen_config)
for i in range(2):
with assume:
assert response[i].finish_reason == 'length', str(response[i])
with assume:
assert response[i].generate_token_len == 6, str(response[i])
# test max_new_tokens with ignore_eos
gen_config = GenerationConfig(ignore_eos=True, max_new_tokens=1024)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
gen_config=gen_config)
for i in range(2):
with assume:
assert response[i].finish_reason == 'length', str(response[i])
with assume:
assert response[i].generate_token_len == 1025, str(response[i])
# test repetition_penalty
gen_config = GenerationConfig(repetition_penalty=0.1, random_seed=1)
response = pipe('Shanghai is', gen_config=gen_config)
with assume:
assert response.finish_reason == 'length', str(response)
with assume:
assert 'a 上海 is a 上海, ' * 10 in response.text, str(response)
del pipe
@pytest.mark.parametrize('model', ['internlm/internlm2-chat-20b'])
def future_test_backend_config_cache_max_entry_count(self, config, model):
model_path = '/'.join([config.get('model_path'), model])
backend_config = TurbomindEngineConfig(cache_max_entry_count=-1)
pipe = pipeline(model_path, backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
del pipe
for i in range(2):
with assume:
assert response[i].finish_reason == 'length', str(response[i])
@pytest.mark.parametrize('model', ['internlm/internlm2-chat-20b'])
def test_backend_config_max_batch_size2(self, config, model):
model_path = '/'.join([config.get('model_path'), model])
backend_config = TurbomindEngineConfig(max_batch_size=-1)
pipe = pipeline(model_path, backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
del pipe
for i in range(2):
with assume:
assert response[i].finish_reason is None, str(response[i])
with assume:
assert response[i].input_token_len == 0, str(response[i])
with assume:
assert response[i].generate_token_len == 0, str(response[i])
with assume:
assert response[i].text == '', str(response[i])
@pytest.mark.parametrize('model', ['internlm/internlm2-chat-20b'])
def test_pipeline_batch_infer(self, config, model):
model_path = '/'.join([config.get('model_path'), model])
pipe = pipeline(model_path)
response = pipe.batch_infer(['Hi, pls intro yourself'] * 10)
del pipe
assert len(response) == 10
for i in range(10):
with assume:
assert response[i].finish_reason is not None, str(response[i])
with assume:
assert response[i].input_token_len > 0, str(response[i])
with assume:
assert response[i].generate_token_len > 0, str(response[i])
with assume:
assert len(response[i].text) > 0, str(response[i])
with assume:
assert response[i].session_id == i
@pytest.mark.parametrize('model', ['internlm/internlm2-chat-20b'])
def test_pipeline_stream_infer(self, config, model):
model_path = '/'.join([config.get('model_path'), model])
pipe = pipeline(model_path)
for outputs in pipe.stream_infer(['Hi, pls intro yourself'] * 3):
with assume:
assert outputs.generate_token_len > 0, str(outputs)
with assume:
assert outputs.input_token_len > 50, str(outputs)
with assume:
assert outputs.session_id in (0, 1, 2), str(outputs)
with assume:
assert outputs.finish_reason in (None, 'stop',
'length'), str(outputs)
continue
with assume:
assert outputs.generate_token_len > 0, str(outputs)
with assume:
assert outputs.finish_reason in ('stop', 'length'), str(outputs)
i = 0
outputs_list = []
for outputs in pipe.stream_infer('Hi, pls intro yourself'):
i += 1
if outputs.finish_reason is None:
with assume:
assert outputs.generate_token_len == i, str(outputs)
else:
with assume:
assert outputs.generate_token_len == i - 1, str(outputs)
with assume:
assert outputs.input_token_len > 50, str(outputs)
with assume:
assert outputs.session_id == 0, str(outputs)
with assume:
assert outputs.finish_reason in (None, 'stop',
'length'), str(outputs)
outputs_list.append(outputs)
continue
for output in outputs_list[0:-1]:
with assume:
assert output.finish_reason is None, str(output)
with assume:
assert outputs_list[-1].finish_reason is not None, str(output)
@pytest.mark.parametrize('model', ['internlm/internlm2-chat-20b'])
def test_pipeline_stream_infer2(self, config, model):
model_path = '/'.join([config.get('model_path'), model])
pipe = pipeline(model_path)
prompts = [{
'role': 'user',
'content': '介绍成都的景点'
}, {
'role': 'user',
'content': '美食呢?'
}]
for outputs in pipe.stream_infer([prompts]):
with assume:
assert outputs.generate_token_len > 0, str(outputs)
with assume:
assert outputs.input_token_len > 50, str(outputs)
with assume:
assert outputs.session_id in (0, 1, 2), str(outputs)
with assume:
assert outputs.finish_reason in (None, 'stop',
'length'), str(outputs)
continue
with assume:
assert outputs.generate_token_len > 0, str(outputs)
with assume:
assert outputs.finish_reason in ('stop', 'length'), str(outputs)
i = 0
outputs_list = []
final_response = ''
for outputs in pipe.stream_infer([prompts]):
i += 1
final_response += outputs.text
if outputs.finish_reason is None:
with assume:
assert outputs.generate_token_len == i, str(outputs)
else:
with assume:
assert outputs.generate_token_len == i - 1, str(outputs)
with assume:
assert outputs.input_token_len > 50, str(outputs)
with assume:
assert outputs.session_id == 0, str(outputs)
with assume:
assert outputs.finish_reason in (None, 'stop',
'length'), str(outputs)
outputs_list.append(outputs)
continue
print(final_response)
for output in outputs_list[0:-1]:
with assume:
assert output.finish_reason is None, str(output)
with assume:
assert outputs_list[-1].finish_reason is not None, str(output)
with assume:
assert '成都' in final_response.lower(), str(output)
del pipe
import pytest
from utils.get_run_config import get_tp_num
from lmdeploy import TurbomindEngineConfig, pipeline
@pytest.mark.order(8)
@pytest.mark.pipeline_func
@pytest.mark.timeout(600)
class TestPipelineLongtextFunc:
def test_long_test_chat_7b(self, config):
model = 'internlm/internlm2-chat-7b'
tp_config = get_tp_num(config, model)
model_path = '/'.join([config.get('model_path'), model])
backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0,
session_len=210000,
tp=tp_config)
pipe = pipeline(model_path, backend_config=backend_config)
prompt = '今 天 心 ' * int(200000 / 6)
# batch infer
pipe(prompt)
# stream infer
for outputs in pipe.stream_infer(prompt):
continue
prompts = ['今 天 心 ' * int(200000 / 6)] * 2
# batch infer
pipe(prompts)
# stream infer
for outputs in pipe.stream_infer(prompts):
continue
def test_long_test_chat_20b(self, config):
model = 'internlm/internlm2-chat-20b'
tp_config = get_tp_num(config, model)
model_path = '/'.join([config.get('model_path'), model])
backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0,
session_len=210000,
tp=tp_config)
pipe = pipeline(model_path, backend_config=backend_config)
prompt = '今 天 心 ' * int(200000 / 6)
# batch infer
pipe(prompt)
# stream infer
for outputs in pipe.stream_infer(prompt):
continue
prompts = ['今 天 心 ' * int(200000 / 6)] * 2
# batch infer
pipe(prompts)
# stream infer
for outputs in pipe.stream_infer(prompts):
continue
def test_long_test_20b(self, config):
model = 'internlm/internlm2-20b'
tp_config = get_tp_num(config, model)
model_path = '/'.join([config.get('model_path'), model])
backend_config = TurbomindEngineConfig(rope_scaling_factor=2.0,
session_len=210000,
tp=tp_config)
pipe = pipeline(model_path, backend_config=backend_config)
prompt = '今 天 心 ' * int(200000 / 6)
# batch infer
pipe(prompt)
# stream infer
for outputs in pipe.stream_infer(prompt):
continue
prompts = ['今 天 心 ' * int(200000 / 6)] * 2
# batch infer
pipe(prompts)
# stream infer
for outputs in pipe.stream_infer(prompts):
continue
import pytest
from utils.restful_return_check import (assert_chat_completions_batch_return,
assert_chat_completions_stream_return,
assert_chat_interactive_batch_return,
assert_chat_interactive_stream_return)
from lmdeploy.serve.openai.api_client import APIClient
BASE_HTTP_URL = 'http://localhost'
DEFAULT_PORT = 23333
MODEL = 'internlm/internlm2-chat-20b'
MODEL_NAME = 'internlm2-chat-20b'
BASE_URL = ':'.join([BASE_HTTP_URL, str(DEFAULT_PORT)])
@pytest.mark.order(8)
@pytest.mark.pytorch
@pytest.mark.flaky(reruns=2)
class TestRestfulInterfaceChatCompletions:
def test_chat_completions_ignore_eos_batch(self):
api_client = APIClient(BASE_URL)
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, what is your name?',
ignore_eos=True,
max_tokens=100,
temperature=0.01):
continue
assert_chat_completions_batch_return(output, MODEL_NAME)
assert output.get('usage').get(
'completion_tokens') == 101 or output.get('usage').get(
'completion_tokens') == 100
assert output.get('choices')[0].get('finish_reason') == 'length'
def test_chat_completions_ignore_eos_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, what is your name?',
ignore_eos=True,
stream=True,
max_tokens=100,
temperature=0.01):
outputList.append(output)
assert_chat_completions_stream_return(outputList[0], MODEL_NAME, True,
False)
assert_chat_completions_stream_return(outputList[-1], MODEL_NAME,
False, True)
for index in range(1, len(outputList) - 1):
assert_chat_completions_stream_return(outputList[index],
MODEL_NAME)
assert outputList[-1].get('choices')[0].get(
'finish_reason') == 'length'
assert len(outputList) == 102
def test_chat_completions_max_tokens_batch(self):
api_client = APIClient(BASE_URL)
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, pls intro yourself',
max_tokens=5,
temperature=0.01):
continue
assert_chat_completions_batch_return(output, MODEL_NAME)
assert output.get('choices')[0].get('finish_reason') == 'length'
assert output.get('usage').get('completion_tokens') == 6 or output.get(
'usage').get('completion_tokens') == 5
def test_chat_completions_max_tokens_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, pls intro yourself',
stream=True,
max_tokens=5,
temperature=0.01):
outputList.append(output)
assert_chat_completions_stream_return(outputList[0], MODEL_NAME, True,
False)
assert_chat_completions_stream_return(outputList[-1], MODEL_NAME,
False, True)
for index in range(1, len(outputList) - 1):
assert_chat_completions_stream_return(outputList[index],
MODEL_NAME)
assert outputList[-1].get('choices')[0].get(
'finish_reason') == 'length'
assert len(outputList) == 7
def test_chat_completions_repetition_penalty_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
response = ''
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, pls intro yourself',
stream=True,
repetition_penalty=0.1,
temperature=0.01,
max_tokens=200):
outputList.append(output)
assert_chat_completions_stream_return(outputList[0], MODEL_NAME, True,
False)
assert_chat_completions_stream_return(outputList[-1], MODEL_NAME,
False, True)
for index in range(1, len(outputList) - 1):
assert_chat_completions_stream_return(outputList[index],
MODEL_NAME)
response += outputList[index].get('choices')[0].get('delta').get(
'content')
assert 'pls pls ' * 5 in response or \
'Hi, pls intro yourself\n' * 5 in response
def test_chat_completions_topp_min_batch(self):
api_client = APIClient(BASE_URL)
outputList = []
for i in range(3):
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Shanghai is',
top_p=0.1,
temperature=0.01):
outputList.append(output)
assert_chat_completions_batch_return(output, MODEL_NAME)
print(output)
assert outputList[0].get('choices')[0].get('message').get(
'content') == outputList[1].get('choices')[0].get('message').get(
'content')
assert outputList[1].get('choices')[0].get('message').get(
'content') == outputList[2].get('choices')[0].get('message').get(
'content')
def test_chat_completions_topp_min_stream(self):
api_client = APIClient(BASE_URL)
responseList = []
for i in range(3):
outputList = []
response = ''
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, pls intro yourself',
stream=True,
top_p=0.1,
temperature=0.01):
outputList.append(output)
assert_chat_completions_stream_return(outputList[0], MODEL_NAME,
True, False)
assert_chat_completions_stream_return(outputList[-1], MODEL_NAME,
False, True)
for index in range(1, len(outputList) - 1):
assert_chat_completions_stream_return(outputList[index],
MODEL_NAME)
response += outputList[index].get('choices')[0].get(
'delta').get('content')
responseList.append(response)
assert responseList[0] == responseList[1]
assert responseList[1] == responseList[2]
def test_chat_completions_longinput_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, pls intro yourself' * 10000,
stream=True,
temperature=0.01):
outputList.append(output)
assert_chat_completions_stream_return(outputList[0], MODEL_NAME, True,
False)
assert_chat_completions_stream_return(outputList[0], MODEL_NAME, True,
False)
for index in range(1, len(outputList) - 1):
assert_chat_completions_stream_return(outputList[index],
MODEL_NAME)
assert outputList[1].get('choices')[0].get('finish_reason') == 'length'
assert outputList[1].get('choices')[0].get('delta').get(
'content') == ''
assert len(outputList) == 2
@pytest.mark.order(8)
@pytest.mark.pytorch
@pytest.mark.flaky(reruns=2)
class TestRestfulInterfaceChatInteractive:
def test_chat_interactive_ignore_eos_batch(self):
api_client = APIClient(BASE_URL)
for output in api_client.chat_interactive_v1(
prompt='Hi, what is your name?',
ignore_eos=True,
request_output_len=100,
temperature=0.01):
continue
assert_chat_interactive_batch_return(output)
assert output.get('tokens') == 100
assert output.get('finish_reason') == 'length'
def test_chat_interactive_ignore_eos_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
for output in api_client.chat_interactive_v1(
prompt='Hi, what is your name?',
ignore_eos=True,
stream=True,
request_output_len=100,
temperature=0.01):
outputList.append(output)
print(output)
assert_chat_interactive_stream_return(outputList[-1],
True,
index=len(outputList) - 2)
for index in range(0, len(outputList) - 1):
assert_chat_interactive_stream_return(outputList[index],
index=index)
assert output.get('finish_reason') == 'length'
assert len(outputList) == 101
def test_chat_interactive_max_tokens_batch(self):
api_client = APIClient(BASE_URL)
for output in api_client.chat_interactive_v1(
prompt='Hi, pls intro yourself',
request_output_len=5,
temperature=0.01):
continue
assert_chat_interactive_batch_return(output)
assert output.get('finish_reason') == 'length'
assert output.get('tokens') == 5
def test_chat_interactive_max_tokens_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
for output in api_client.chat_interactive_v1(
prompt='Hi, pls intro yourself',
stream=True,
request_output_len=5,
temperature=0.01):
outputList.append(output)
assert_chat_interactive_stream_return(outputList[-1],
True,
index=len(outputList) - 2)
for index in range(0, len(outputList) - 1):
assert_chat_interactive_stream_return(outputList[index],
index=index)
assert output.get('finish_reason') == 'length'
assert len(outputList) == 6
def test_chat_interactive_topp_min_batch(self):
api_client = APIClient(BASE_URL)
outputList = []
for i in range(3):
for output in api_client.chat_interactive_v1(prompt='Shanghai is',
top_p=0.01,
temperature=0.01):
continue
assert_chat_interactive_batch_return(output)
outputList.append(output)
print(output)
assert outputList[0] == outputList[1]
assert outputList[1] == outputList[2]
def test_chat_interactive_topp_min_stream(self):
api_client = APIClient(BASE_URL)
responseList = []
for i in range(3):
outputList = []
response = ''
for output in api_client.chat_interactive_v1(
model=MODEL_NAME,
prompt='Hi, pls intro yourself',
stream=True,
top_p=0.01,
temperature=0.01):
outputList.append(output)
assert_chat_interactive_stream_return(outputList[-1],
True,
index=len(outputList) - 2)
for index in range(0, len(outputList) - 1):
assert_chat_interactive_stream_return(outputList[index],
index=index)
response += outputList[index].get('text')
responseList.append(response)
assert responseList[0] == responseList[1]
assert responseList[1] == responseList[2]
import pytest
from utils.restful_return_check import (assert_chat_completions_batch_return,
assert_chat_completions_stream_return,
assert_chat_interactive_batch_return,
assert_chat_interactive_stream_return)
from lmdeploy.serve.openai.api_client import APIClient
BASE_HTTP_URL = 'http://localhost'
DEFAULT_PORT = 23333
MODEL = 'internlm/internlm2-chat-20b'
MODEL_NAME = 'internlm2-chat-20b'
BASE_URL = ':'.join([BASE_HTTP_URL, str(DEFAULT_PORT)])
@pytest.mark.order(8)
@pytest.mark.turbomind
@pytest.mark.flaky(reruns=2)
class TestRestfulInterfaceChatCompletions:
def test_chat_completions_ignore_eos_batch(self):
api_client = APIClient(BASE_URL)
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, what is your name?',
ignore_eos=True,
max_tokens=100,
temperature=0.01):
continue
assert_chat_completions_batch_return(output, MODEL_NAME)
assert output.get('usage').get('completion_tokens') == 101
assert output.get('choices')[0].get('finish_reason') == 'length'
def test_chat_completions_ignore_eos_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, what is your name?',
ignore_eos=True,
stream=True,
max_tokens=100,
temperature=0.01):
outputList.append(output)
assert_chat_completions_stream_return(outputList[0], MODEL_NAME, True,
False)
assert_chat_completions_stream_return(outputList[-1], MODEL_NAME,
False, True)
for index in range(1, len(outputList) - 1):
assert_chat_completions_stream_return(outputList[index],
MODEL_NAME)
assert outputList[-1].get('choices')[0].get(
'finish_reason') == 'length'
assert len(outputList) == 103
def test_chat_completions_max_tokens_batch(self):
api_client = APIClient(BASE_URL)
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, pls intro yourself',
max_tokens=5,
temperature=0.01):
continue
assert_chat_completions_batch_return(output, MODEL_NAME)
assert output.get('choices')[0].get('finish_reason') == 'length'
assert output.get('usage').get('completion_tokens') == 6
def test_chat_completions_max_tokens_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, pls intro yourself',
stream=True,
max_tokens=5,
temperature=0.01):
outputList.append(output)
assert_chat_completions_stream_return(outputList[0], MODEL_NAME, True,
False)
assert_chat_completions_stream_return(outputList[-1], MODEL_NAME,
False, True)
for index in range(1, len(outputList) - 1):
assert_chat_completions_stream_return(outputList[index],
MODEL_NAME)
assert outputList[-1].get('choices')[0].get(
'finish_reason') == 'length'
assert len(outputList) == 8
def test_chat_completions_repetition_penalty_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
response = ''
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, pls intro yourself',
stream=True,
repetition_penalty=0.1,
temperature=0.01,
max_tokens=200):
outputList.append(output)
assert_chat_completions_stream_return(outputList[0], MODEL_NAME, True,
False)
assert_chat_completions_stream_return(outputList[-1], MODEL_NAME,
False, True)
for index in range(1, len(outputList) - 1):
assert_chat_completions_stream_return(outputList[index],
MODEL_NAME)
response += outputList[index].get('choices')[0].get('delta').get(
'content')
assert 'pls pls ' * 5 in response or \
'Hi, pls intro yourself\n' * 5 in response
def test_chat_completions_topp_min_batch(self):
api_client = APIClient(BASE_URL)
outputList = []
for i in range(3):
for output in api_client.chat_completions_v1(
model=MODEL_NAME, messages='Shanghai is', top_p=0.1):
outputList.append(output)
assert_chat_completions_batch_return(output, MODEL_NAME)
assert outputList[0].get('choices')[0].get('message').get(
'content') == outputList[1].get('choices')[0].get('message').get(
'content')
assert outputList[1].get('choices')[0].get('message').get(
'content') == outputList[2].get('choices')[0].get('message').get(
'content')
def test_chat_completions_topp_min_stream(self):
api_client = APIClient(BASE_URL)
responseList = []
for i in range(3):
outputList = []
response = ''
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, pls intro yourself',
stream=True,
top_p=0.1):
outputList.append(output)
assert_chat_completions_stream_return(outputList[0], MODEL_NAME,
True, False)
assert_chat_completions_stream_return(outputList[-1], MODEL_NAME,
False, True)
for index in range(1, len(outputList) - 1):
assert_chat_completions_stream_return(outputList[index],
MODEL_NAME)
response += outputList[index].get('choices')[0].get(
'delta').get('content')
responseList.append(response)
assert responseList[0] == responseList[1]
assert responseList[1] == responseList[2]
def test_chat_completions_longinput_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
for output in api_client.chat_completions_v1(
model=MODEL_NAME,
messages='Hi, pls intro yourself' * 10000,
stream=True,
temperature=0.01):
outputList.append(output)
assert_chat_completions_stream_return(outputList[0], MODEL_NAME, True,
False)
assert outputList[1].get('choices')[0].get('finish_reason') == 'length'
assert outputList[1].get('choices')[0].get('delta').get(
'content') == ''
assert len(outputList) == 2
@pytest.mark.order(8)
@pytest.mark.turbomind
@pytest.mark.flaky(reruns=2)
class TestRestfulInterfaceChatInteractive:
def test_chat_interactive_ignore_eos_batch(self):
api_client = APIClient(BASE_URL)
for output in api_client.chat_interactive_v1(
prompt='Hi, what is your name?',
ignore_eos=True,
request_output_len=100,
temperature=0.01):
continue
assert_chat_interactive_batch_return(output)
assert output.get('tokens') == 101
assert output.get('finish_reason') == 'length'
def test_chat_interactive_ignore_eos_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
for output in api_client.chat_interactive_v1(
prompt='Hi, what is your name?',
ignore_eos=True,
stream=True,
request_output_len=100,
temperature=0.01):
outputList.append(output)
assert_chat_interactive_stream_return(outputList[-1],
True,
index=len(outputList) - 2)
for index in range(0, len(outputList) - 1):
assert_chat_interactive_stream_return(outputList[index],
index=index)
assert output.get('finish_reason') == 'length'
assert len(outputList) == 102
def test_chat_interactive_max_tokens_batch(self):
api_client = APIClient(BASE_URL)
for output in api_client.chat_interactive_v1(
prompt='Hi, pls intro yourself',
request_output_len=5,
temperature=0.01):
continue
assert_chat_interactive_batch_return(output)
assert output.get('finish_reason') == 'length'
assert output.get('tokens') == 6
def test_chat_interactive_max_tokens_stream(self):
api_client = APIClient(BASE_URL)
outputList = []
for output in api_client.chat_interactive_v1(
prompt='Hi, pls intro yourself',
stream=True,
request_output_len=5,
temperature=0.01):
outputList.append(output)
assert_chat_interactive_stream_return(outputList[-1],
True,
index=len(outputList) - 2)
for index in range(0, len(outputList) - 1):
assert_chat_interactive_stream_return(outputList[index],
index=index)
assert output.get('finish_reason') == 'length'
assert len(outputList) == 7
def test_chat_interactive_topp_min_batch(self):
api_client = APIClient(BASE_URL)
outputList = []
for i in range(3):
for output in api_client.chat_interactive_v1(prompt='Shanghai is',
top_p=0.01):
continue
assert_chat_interactive_batch_return(output)
outputList.append(output)
assert outputList[0] == outputList[1]
assert outputList[1] == outputList[2]
def test_chat_interactive_topp_min_stream(self):
api_client = APIClient(BASE_URL)
responseList = []
for i in range(3):
outputList = []
response = ''
for output in api_client.chat_interactive_v1(
model=MODEL_NAME,
prompt='Hi, pls intro yourself',
stream=True,
top_p=0.01):
outputList.append(output)
assert_chat_interactive_stream_return(outputList[-1],
True,
index=len(outputList) - 2)
for index in range(0, len(outputList) - 1):
assert_chat_interactive_stream_return(outputList[index],
index=index)
response += outputList[index].get('text')
responseList.append(response)
assert responseList[0] == responseList[1]
assert responseList[1] == responseList[2]
common_case:
  - 你好,你叫什么名字#hi, what's your name:
  - 介绍相应美食#please introduce some delicious foods:
      - not_contain:
          - 成都
          - chengdu
      - internlm2-1_8b:
          - len_g:
              10
memory_test:
  - 介绍成都的景点#please introduce attractions in Chengdu:
      - contain:
          - chengdu
          - 成都
      - contain:
          - 熊猫
          - panda
          - 宽窄巷子
          - jinli
          - leshan
          - 历史悠久
      - falcon:
          - contain:
              - chengdu
              - 成都
      - internlm2-1_8b:
          - contain:
              - chengdu
              - 成都
      - internlm2-20b:
          - contain:
              - chengdu
              - 成都
  - 介绍相应美食#please introduce some delicious foods:
      - contain:
          - 成都
          - chengdu
          - 四川
      - contain:
          - 火锅
          - hotpot
          - hot pot
          - 夫妻肺片
      - falcon:
          - len_g:
              10
      - internlm2-1_8b:
          - contain:
              - chengdu
              - 成都
      - internlm2-20b:
          - contain:
              - chengdu
              - 成都
chinese_poem_case:
  - 给我一首中文打油诗,需要添加标点符号。和,请用中文回答Give me a Chinese poem in Chinese:
      - contain:
          - ","
          - "。"
      - len_g:
          5
      - llama-2:
          - contain:
              - poem
              - poetry
          - len_g:
              5
      - codellama:
          - contain:
              - poem
              - poetry
          - len_g:
              5
      - internlm2-1_8b:
          - len_g:
              5
      - internlm2-20b:
          - len_g:
              5
      - falcon:
          - len_g:
              5
english_poem_case:
  - write a romantic English poem:
      - contain:
          - " "
      - contain:
          - "."
          - ","
      - contain:
          - love
          - poem
      - len_g:
          100
      - internlm2-1_8b:
          - len_g:
              100
      - internlm2-20b:
          - len_g:
              100
      - falcon:
          - len_g:
              1
emoji_case:
  - 请输出👍赞的emoji#print output the emoji of good👍:
      - contain:
          - 👍
          - 😊
      - baichuan2-7b:
          - contain:
              - 👍
              - 😊
              - \u2714
              -
              - emoji
              - '!'
traditional_chinese_case:
  - 使用繁體介紹香港維多利亞港:
      - contain:
          - victoria
          - 維多利亞港
          - 维多利亚港
      - codellama:
          - contain:
              - victoria
              - 維多利亞港
              - 维多利亚港
              - hong kong
      - internlm2-20b:
          - contain:
              - victoria
              - 維多利亞港
              - 维多利亚港
              - hong kong
              - 香港
      - llama-2:
          - contain:
              - victoria
              - 維多利亞港
              - 维多利亚港
              - apologize
      - falcon:
          - len_g:
              1
[pytest]
python_files = test*_*.py # test file
python_classes = Test* # test class
python_functions = test_* # test function
pytest_runtest_call.tryfirst = True
filterwarnings = ignore::UserWarning
reruns = 2
reruns_delay = 10
import allure
import conftest
import pytest
from utils.config_utils import (get_cuda_prefix_by_workerid,
get_torch_model_list)
from utils.run_client_chat import hf_command_line_test
conftest._init_cli_case_list()
case_list = conftest.global_cli_case_List
def getCaseList():
return case_list
@pytest.mark.order(10)
@pytest.mark.usefixtures('cli_case_config')
@pytest.mark.hf_pytorch_chat
@pytest.mark.gpu_num_1
@pytest.mark.parametrize('usercase', getCaseList())
@pytest.mark.parametrize('model', get_torch_model_list(tp_num=1))
def test_hf_pytorch_chat_tp1(config, model, cli_case_config, usercase,
worker_id):
result, chat_log, msg = hf_command_line_test(
config,
usercase,
cli_case_config.get(usercase),
model,
'torch',
cuda_prefix=get_cuda_prefix_by_workerid(worker_id))
if chat_log is not None:
allure.attach.file(chat_log,
attachment_type=allure.attachment_type.TEXT)
assert result, msg
@pytest.mark.order(10)
@pytest.mark.usefixtures('cli_case_config')
@pytest.mark.hf_pytorch_chat
@pytest.mark.gpu_num_2
@pytest.mark.parametrize('usercase', getCaseList())
@pytest.mark.parametrize('model', get_torch_model_list(tp_num=2))
def test_hf_pytorch_chat_tp2(config, model, cli_case_config, usercase,
worker_id):
result, chat_log, msg = hf_command_line_test(
config,
usercase,
cli_case_config.get(usercase),
model,
'torch',
cuda_prefix=get_cuda_prefix_by_workerid(worker_id, tp_num=2))
if chat_log is not None:
allure.attach.file(chat_log,
attachment_type=allure.attachment_type.TEXT)
assert result, msg
@pytest.mark.order(10)
@pytest.mark.usefixtures('cli_case_config')
@pytest.mark.hf_pytorch_chat
@pytest.mark.pr_test
@pytest.mark.xdist_group(name='pr_test')
@pytest.mark.parametrize('usercase', getCaseList())
@pytest.mark.parametrize('model', ['internlm/internlm2-chat-20b'])
def test_hf_pytorch_chat_pr(config, model, cli_case_config, usercase):
result, chat_log, msg = hf_command_line_test(
config,
usercase,
cli_case_config.get(usercase),
model,
'torch',
cuda_prefix='CUDA_VISIBLE_DEVICES=5,6')
if chat_log is not None:
allure.attach.file(chat_log,
attachment_type=allure.attachment_type.TEXT)
assert result, msg
import allure
import conftest
import pytest
from utils.config_utils import (get_cuda_prefix_by_workerid,
get_turbomind_model_list)
from utils.run_client_chat import hf_command_line_test
conftest._init_cli_case_list()
case_list = conftest.global_cli_case_List
def getCaseList():
return case_list
@pytest.mark.order(10)
@pytest.mark.usefixtures('cli_case_config')
@pytest.mark.hf_turbomind_chat
@pytest.mark.gpu_num_1
@pytest.mark.parametrize('usercase', getCaseList())
@pytest.mark.parametrize('model', get_turbomind_model_list(tp_num=1))
def test_hf_turbomind_chat_tp1(config, model, cli_case_config, usercase,
worker_id):
result, chat_log, msg = hf_command_line_test(
config,
usercase,
cli_case_config.get(usercase),
model,
'turbomind',
cuda_prefix=get_cuda_prefix_by_workerid(worker_id))
if chat_log is not None:
allure.attach.file(chat_log,
attachment_type=allure.attachment_type.TEXT)
assert result, msg
@pytest.mark.order(10)
@pytest.mark.usefixtures('cli_case_config')
@pytest.mark.hf_turbomind_chat
@pytest.mark.gpu_num_2
@pytest.mark.parametrize('usercase', getCaseList())
@pytest.mark.parametrize('model', get_turbomind_model_list(tp_num=2))
def test_hf_turbomind_chat_tp2(config, model, cli_case_config, usercase,
worker_id):
result, chat_log, msg = hf_command_line_test(
config,
usercase,
cli_case_config.get(usercase),
model,
'turbomind',
cuda_prefix=get_cuda_prefix_by_workerid(worker_id, tp_num=2))
if chat_log is not None:
allure.attach.file(chat_log,
attachment_type=allure.attachment_type.TEXT)
assert result, msg
@pytest.mark.order(10)
@pytest.mark.usefixtures('cli_case_config')
@pytest.mark.hf_turbomind_chat
@pytest.mark.pr_test
@pytest.mark.xdist_group(name='pr_test')
@pytest.mark.parametrize('usercase', getCaseList())
@pytest.mark.parametrize(
'model',
['internlm/internlm2-chat-20b', 'internlm/internlm2-chat-20b-inner-w4a16'])
def test_hf_turbomind_chat_pr(config, model, cli_case_config, usercase):
result, chat_log, msg = hf_command_line_test(
config,
usercase,
cli_case_config.get(usercase),
model,
'turbomind',
cuda_prefix='CUDA_VISIBLE_DEVICES=5,6')
if chat_log is not None:
allure.attach.file(chat_log,
attachment_type=allure.attachment_type.TEXT)
assert result, msg
import allure
import conftest
import pytest
from utils.config_utils import (get_cuda_prefix_by_workerid,
get_turbomind_model_list)
from utils.run_client_chat import command_line_test
conftest._init_cli_case_list()
prompt_list = conftest.global_cli_case_List
def getPromptCaseList():
return prompt_list
def getModelList(tp_num):
return [
item for item in get_turbomind_model_list(tp_num)
if 'kvint8' not in item.lower()
]
@pytest.mark.order(10)
@pytest.mark.usefixtures('cli_case_config')
@pytest.mark.command_chat
@pytest.mark.gpu_num_1
@pytest.mark.parametrize('usercase', getPromptCaseList())
@pytest.mark.parametrize('model', getModelList(tp_num=1))
def test_workspace_chat_tp1(config, cli_case_config, usercase, model,
worker_id):
result, chat_log, msg = command_line_test(
config,
usercase,
cli_case_config.get(usercase),
model,
'turbomind',
cuda_prefix=get_cuda_prefix_by_workerid(worker_id))
if chat_log is not None:
allure.attach.file(chat_log,
attachment_type=allure.attachment_type.TEXT)
assert result, msg
@pytest.mark.order(10)
@pytest.mark.usefixtures('cli_case_config')
@pytest.mark.command_chat
@pytest.mark.gpu_num_2
@pytest.mark.parametrize('usercase', getPromptCaseList())
@pytest.mark.parametrize('model', getModelList(tp_num=2))
def test_workspace_chat_tp2(config, cli_case_config, usercase, model,
worker_id):
result, chat_log, msg = command_line_test(
config,
usercase,
cli_case_config.get(usercase),
model,
'turbomind',
cuda_prefix=get_cuda_prefix_by_workerid(worker_id, tp_num=2))
if chat_log is not None:
allure.attach.file(chat_log,
attachment_type=allure.attachment_type.TEXT)
assert result, msg
@pytest.mark.order(10)
@pytest.mark.usefixtures('cli_case_config')
@pytest.mark.command_chat
@pytest.mark.pr_test
@pytest.mark.parametrize('usercase', getPromptCaseList())
@pytest.mark.parametrize(
'model',
['internlm/internlm2-chat-20b', 'internlm/internlm2-chat-20b-inner-w4a16'])
def test_workspace_chat_pr(config, cli_case_config, usercase, model):
result, chat_log, msg = command_line_test(
config,
usercase,
cli_case_config.get(usercase),
model,
'turbomind',
None,
cuda_prefix='CUDA_VISIBLE_DEVICES=5,6')
if chat_log is not None:
allure.attach.file(chat_log,
attachment_type=allure.attachment_type.TEXT)
assert result, msg
import os
import subprocess
from subprocess import PIPE
import allure
import pytest
from utils.config_utils import (get_cuda_prefix_by_workerid,
get_turbomind_model_list)
from utils.get_run_config import get_command_with_extra, get_model_name
@pytest.mark.order(5)
@pytest.mark.convert
@pytest.mark.parametrize('model', get_turbomind_model_list())
def test_convert(config, model, worker_id):
convert(config, model, get_cuda_prefix_by_workerid(worker_id))
@pytest.mark.order(5)
@pytest.mark.convert
@pytest.mark.pr_test
@pytest.mark.xdist_group(name='pr_test')
@pytest.mark.parametrize(
'model',
['internlm/internlm2-chat-20b', 'internlm/internlm2-chat-20b-inner-w4a16'])
def test_convert_pr(config, model):
convert(config, model, 'CUDA_VISIBLE_DEVICES=5')
def convert(config, model_case, cuda_prefix):
origin_model_path = config.get('model_path') + '/' + model_case
dst_path = config.get('dst_path') + '/workspace_' + model_case
log_path = config.get('log_path')
model_name = get_model_name(model_case)
if 'w4' in model_case or '4bits' in model_case:
cmd = get_command_with_extra(' '.join([
'lmdeploy convert', model_name, origin_model_path, '--dst-path',
dst_path, '--model-format awq --group-size 128'
]),
config,
model_name,
True,
cuda_prefix=cuda_prefix)
else:
cmd = get_command_with_extra(' '.join([
'lmdeploy convert', model_name, origin_model_path, '--dst-path',
dst_path
]),
config,
model_name,
True,
cuda_prefix=cuda_prefix)
convert_log = os.path.join(log_path,
'convert_' + model_case.split('/')[1] + '.log')
print('reproduce command convert: ' + cmd + '\n')
with open(convert_log, 'w') as f:
# remove existing workspace
subprocess.run([' '.join(['rm -rf', dst_path])],
stdout=f,
stderr=f,
shell=True,
text=True,
encoding='utf-8')
f.writelines('reproduce command convert: ' + cmd + '\n')
# convert
convertRes = subprocess.run([cmd],
stdout=f,
stderr=PIPE,
shell=True,
text=True,
encoding='utf-8')
f.writelines(convertRes.stderr)
# check result
result = convertRes.returncode == 0
allure.attach.file(convert_log,
attachment_type=allure.attachment_type.TEXT)
assert result, convertRes.stderr
import os
import fire
import yaml
from lmdeploy import pipeline
from lmdeploy.messages import (GenerationConfig, PytorchEngineConfig,
TurbomindEngineConfig)
cli_prompt_case_file = 'autotest/chat_prompt_case.yaml'
common_prompt_case_file = 'autotest/prompt_case.yaml'
config_file = 'autotest/config.yaml'
def main(type: str, model, tp: int = 1):
config_path = os.path.join(config_file)
with open(config_path) as f:
env_config = yaml.load(f.read(), Loader=yaml.SafeLoader)
case_path = os.path.join(common_prompt_case_file)
with open(case_path) as f:
case_config = yaml.load(f.read(), Loader=yaml.SafeLoader)
run_pipeline_chat_test(env_config, case_config, model, tp, type)
def run_pipeline_chat_test(config, cases_info, model_case, tp, type):
model_path = config.get('model_path')
hf_path = model_path + '/' + model_case
if 'pytorch' == type:
backend_config = PytorchEngineConfig(tp=tp)
else:
if 'kvint8' in model_case and ('w4' in model_case
or '4bits' in model_case):
backend_config = TurbomindEngineConfig(tp=tp,
model_format='awq',
quant_policy=4)
elif 'kvint8' in model_case:
backend_config = TurbomindEngineConfig(tp=tp,
model_format='hf',
quant_policy=4)
elif 'w4' in model_case or '4bits' in model_case:
backend_config = TurbomindEngineConfig(tp=tp, model_format='awq')
else:
backend_config = TurbomindEngineConfig(tp=tp)
pipe = pipeline(hf_path, backend_config=backend_config)
# run testcases
gen_config = GenerationConfig(temperature=0.01)
for case in cases_info.keys():
if (case == 'memory_test'
or case == 'emoji_case') and 'chat' not in model_case.lower():
continue
case_info = cases_info.get(case)
print('case:' + case)
prompts = []
for prompt_detail in case_info:
prompt = list(prompt_detail.keys())[0]
if 'chat' not in model_case.lower(): # base model
prompts.append(prompt)
else: # chat model
prompts.append({'role': 'user', 'content': prompt})
print('prompt:' + prompt)
if 'chat' not in model_case.lower(): # base model
response = pipe(prompts, gen_config=gen_config)[-1].text
else: # chat model
response = pipe([prompts], gen_config=gen_config)[0].text
if 'chat' in model_case.lower():
prompts.append({'role': 'assistant', 'content': response})
print('output:' + response)
if __name__ == '__main__':
fire.Fire(main)
import os
from multiprocessing import Process
import pytest
from utils.config_utils import get_cuda_id_by_workerid, get_torch_model_list
from utils.pipeline_chat import (assert_pipeline_chat_log,
run_pipeline_chat_test)
def getModelList(tp_num):
return [
item for item in get_torch_model_list(tp_num)
if 'falcon' not in item.lower() and 'chatglm2' not in item.lower()
]
@pytest.mark.order(6)
@pytest.mark.usefixtures('common_case_config')
@pytest.mark.pipeline_chat_pytorch
@pytest.mark.gpu_num_1
@pytest.mark.flaky(reruns=0)
@pytest.mark.parametrize('model', getModelList(tp_num=1))
def test_pipeline_chat_pytorch_tp1(config, common_case_config, model,
worker_id):
if 'gw' in worker_id:
os.environ['CUDA_VISIBLE_DEVICES'] = get_cuda_id_by_workerid(worker_id)
p = Process(target=run_pipeline_chat_test,
args=(config, common_case_config, model, 'pytorch'))
p.start()
p.join()
# assert script
assert_pipeline_chat_log(config, common_case_config, model)
@pytest.mark.order(6)
@pytest.mark.usefixtures('common_case_config')
@pytest.mark.pipeline_chat_pytorch
@pytest.mark.gpu_num_2
@pytest.mark.flaky(reruns=0)
@pytest.mark.parametrize('model', getModelList(tp_num=2))
def test_pipeline_chat_pytorch_tp2(config, common_case_config, model,
worker_id):
if 'gw' in worker_id:
os.environ['CUDA_VISIBLE_DEVICES'] = get_cuda_id_by_workerid(worker_id,
tp_num=2)
p = Process(target=run_pipeline_chat_test,
args=(config, common_case_config, model, 'pytorch'))
p.start()
p.join()
# assert script
assert_pipeline_chat_log(config, common_case_config, model)
@pytest.mark.order(6)
@pytest.mark.usefixtures('common_case_config')
@pytest.mark.pipeline_chat_pytorch
@pytest.mark.flaky(reruns=0)
@pytest.mark.pr_test
@pytest.mark.parametrize('model', ['internlm/internlm2-chat-20b'])
def test_pipeline_chat_pytorch_pr(config, common_case_config, model):
p = Process(target=run_pipeline_chat_test,
args=(config, common_case_config, model, 'pytorch'))
p.start()
p.join()
# assert script
assert_pipeline_chat_log(config, common_case_config, model)
import os
from multiprocessing import Process
import pytest
from utils.config_utils import get_all_model_list, get_cuda_id_by_workerid
from utils.pipeline_chat import (assert_pipeline_chat_log,
run_pipeline_chat_test)
@pytest.mark.order(6)
@pytest.mark.usefixtures('common_case_config')
@pytest.mark.pipeline_chat
@pytest.mark.gpu_num_1
@pytest.mark.flaky(reruns=0)
@pytest.mark.parametrize('model', get_all_model_list(tp_num=1))
def test_pipeline_chat_tp1(config, common_case_config, model, worker_id):
if 'gw' in worker_id:
os.environ['CUDA_VISIBLE_DEVICES'] = get_cuda_id_by_workerid(worker_id)
p = Process(target=run_pipeline_chat_test,
args=(config, common_case_config, model, 'turbomind'))
p.start()
p.join()
assert_pipeline_chat_log(config, common_case_config, model)
@pytest.mark.order(6)
@pytest.mark.usefixtures('common_case_config')
@pytest.mark.pipeline_chat
@pytest.mark.gpu_num_2
@pytest.mark.flaky(reruns=0)
@pytest.mark.parametrize('model', get_all_model_list(tp_num=2))
def test_pipeline_chat_tp2(config, common_case_config, model, worker_id):
if 'gw' in worker_id:
os.environ['CUDA_VISIBLE_DEVICES'] = get_cuda_id_by_workerid(worker_id,
tp_num=2)
p = Process(target=run_pipeline_chat_test,
args=(config, common_case_config, model, 'turbomind'))
p.start()
p.join()
assert_pipeline_chat_log(config, common_case_config, model)
@pytest.mark.order(6)
@pytest.mark.usefixtures('common_case_config')
@pytest.mark.pipeline_chat
@pytest.mark.flaky(reruns=0)
@pytest.mark.pr_test
@pytest.mark.parametrize(
'model',
['internlm/internlm2-chat-20b', 'internlm/internlm2-chat-20b-inner-w4a16'])
def test_pipeline_chat_pr(config, common_case_config, model):
p = Process(target=run_pipeline_chat_test,
args=(config, common_case_config, model, 'turbomind'))
p.start()
p.join()
assert_pipeline_chat_log(config, common_case_config, model)
import os
import allure
import pytest
from utils.config_utils import get_cuda_prefix_by_workerid
from utils.quantization_utils import quantization
model_list = [
'meta-llama/Llama-2-7b-chat', 'internlm/internlm-chat-20b',
'internlm/internlm2-chat-20b', 'Qwen/Qwen-7B-Chat', 'Qwen/Qwen-14B-Chat',
'internlm/internlm2-20b', 'baichuan-inc/Baichuan2-7B-Chat'
]
@pytest.mark.order(1)
@pytest.mark.quantization_kvint8
@pytest.mark.timeout(900)
@pytest.mark.parametrize('model', model_list)
def test_quantization_kvint8(config, model, worker_id):
quantization_kvint8(config, model + '-inner-kvint8', model,
get_cuda_prefix_by_workerid(worker_id))
def quantization_kvint8(config, quantization_model_name, origin_model_name,
cuda_prefix):
quantization_type = 'kvint8'
result, msg = quantization(config, quantization_model_name,
origin_model_name, quantization_type,
cuda_prefix)
log_path = config.get('log_path')
quantization_log = os.path.join(
log_path, '_'.join([
'quantization', quantization_type,
quantization_model_name.split('/')[1]
]) + '.log')
allure.attach.file(quantization_log,
attachment_type=allure.attachment_type.TEXT)
assert result, msg
import os
import allure
import pytest
from utils.config_utils import get_cuda_prefix_by_workerid
from utils.quantization_utils import quantization
model_list = [
'meta-llama/Llama-2-7b-chat-inner-kvint8',
'internlm/internlm-chat-20b-inner-kvint8',
'internlm/internlm2-chat-20b-inner-kvint8',
'Qwen/Qwen-7B-Chat-inner-kvint8', 'Qwen/Qwen-14B-Chat-inner-kvint8',
'internlm/internlm2-20b-inner-kvint8',
'baichuan-inc/Baichuan2-7B-Chat-inner-kvint8'
]
@pytest.mark.order(4)
@pytest.mark.quantization_kvint8_w4a16
@pytest.mark.timeout(900)
@pytest.mark.parametrize('model', model_list)
def test_quantization_kvint8_w4a16(config, model, worker_id):
quantization_kvint8(config, model + '-w4a16', model,
get_cuda_prefix_by_workerid(worker_id))
def quantization_kvint8(config, quantization_model_name, origin_model_name,
cuda_prefix):
quantization_type = 'w4a16'
result, msg = quantization(config, quantization_model_name,
origin_model_name, quantization_type,
cuda_prefix)
log_path = config.get('log_path')
quantization_log = os.path.join(
log_path, '_'.join([
'quantization', quantization_type,
quantization_model_name.split('/')[1]
]) + '.log')
allure.attach.file(quantization_log,
attachment_type=allure.attachment_type.TEXT)
assert result, msg