# PyTorch

## Chat in command line

LMDeploy supports chatting with PyTorch models via the submodule `lmdeploy.pytorch.chat`.

This submodule allows users to chat with a language model through the command line, and optionally to accelerate the model with backends such as DeepSpeed.

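The chat entry point is an ordinary Python submodule, so it can presumably also be launched with `python -m` instead of the `lmdeploy` CLI (a minimal sketch, mirroring the module invocation used with DeepSpeed in Example 4 below; `$PATH_TO_HF_MODEL` is a placeholder):

```shell
# assumption: lmdeploy is installed and $PATH_TO_HF_MODEL points to a HuggingFace-format model
python -m lmdeploy.pytorch.chat $PATH_TO_HF_MODEL
```
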
**Example 1**: Chat with default setting

```shell
lmdeploy chat torch $PATH_TO_HF_MODEL
```

**Example 2**: Disable sampling and chat history

```shell
lmdeploy chat torch \
    $PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
    --temperature 0 --max-history 0
```

**Example 3**: Accelerate with DeepSpeed inference

```shell
lmdeploy chat torch \
    $PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
    --accel deepspeed
```

Note: to use DeepSpeed, you need to install it first. If you want to accelerate InternLM, you need a customized version: <https://github.com/wangruohui/DeepSpeed/tree/support_internlm_0.10.0>

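One way to install stock DeepSpeed, or the customized branch referenced above, is via pip (a sketch; the branch name is taken from the URL above):

```shell
# stock DeepSpeed
pip install deepspeed
# or, for InternLM, the customized branch referenced above
pip install git+https://github.com/wangruohui/DeepSpeed.git@support_internlm_0.10.0
```
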
**Example 4**: Run the model with tensor parallelism on 2 GPUs

```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \
    --accel deepspeed
```

This module also supports the following control commands to change generation behavior during a chat (a sketch of their use follows the list).

- `exit`: terminate and exit chat
- `config set key=value`: change generation config `key` to `value`, e.g. `config set temperature=0` disables sampling for subsequent turns
- `clear`: clear chat history

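A hypothetical exchange showing these commands typed at the chat prompt (the actual prompt text printed by `lmdeploy.pytorch.chat` may differ):

```text
>>> config set temperature=0    # greedy decoding from now on
>>> clear                       # forget previous turns
>>> exit                        # quit the chat
```
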
### Simple diagram of components

```mermaid
graph LR;
    subgraph model specific adapter
        p((user_input))-->tokenize-->id((input_ids))-->decorate
        tmpl_ids((template_ids))-->decorate;
    end
    subgraph generate
        model[CausalLM_model.generate]-->gen_result(("gen_result"))
        gen_result-->hid
        gen_result-->attn((attention))
    end
    subgraph streamer
        model-->s[streamer]--value-->decode_single--token-->output
    end
    subgraph session_manager
        prepend_history-->fullid((complete_ids));
        trim-->prepend_history
    end
    decorate-->prepend_history
    hid((history_ids))-->trim;
    attn-->trim;
    fullid-->model
    tokenizer((tokenizer))-->decode_single
    tokenizer-->tokenize
    p-->genconfig(GenConfig)-->model
```