"tests/wav2vec2/test_tokenization_wav2vec2.py" did not exist on "a6938c4721491d681b5520ea0611ceed56e74f22"
quickstart_inference.md 6.54 KB
Newer Older
Rayyyyy's avatar
Rayyyyy committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
# Quick Start: Inference

This document is a quick guide to running inference with the Yuan 2.0-102B and Yuan 2.0-51B models, covering checkpoint (ckpt) conversion and use of the inference service.

## Yuan 2.0-102B:

### step1:

First, you need to convert the ckpt.

The 102B model is released with 32-way pipeline parallelism and 1-way tensor parallelism (32pp, 1tp). To improve parallel efficiency during inference, you need to convert the parallelism layout of the 102B model from (32pp, 1tp) to (1pp, 8tp). (This applies to 80GB GPUs.)

The conversion process is as follows:

(32pp, 1tp) -> (32pp, 8tp) -> (1pp, 8tp)

We provide an automatic conversion script that can be used as follows:

```
1. vim examples/ckpt_partitions_102B.sh

2. Set three environment variables: LOAD_CHECKPOINT_PATH, SAVE_SPLITED_CHECKPOINT_PATH, SAVE_CHECKPOINT_PATH:

LOAD_CHECKPOINT_PATH: The path to the base 102B model (32pp, 1tp); this path must contain the 'latest_checkpointed_iteration.txt' file. An example is shown below:

LOAD_CHECKPOINT_PATH=/mnt/102B

SAVE_SPLITED_CHECKPOINT_PATH: The path for the temporary 102B model (32pp, 8tp), which can be removed once the conversion is complete. An example is shown below:

SAVE_SPLITED_CHECKPOINT_PATH=./ckpt-102B-mid

SAVE_CHECKPOINT_PATH: The path for the resulting 102B model (1pp, 8tp). An example is shown below:

SAVE_CHECKPOINT_PATH=./ckpt-102B-8tp

If you run the script from the Yuan home directory, you can set TOKENIZER_MODEL_PATH=./tokenizer (the Yuan home directory contains the tokenizer); otherwise, you need to specify the tokenizer path explicitly.

3. bash examples/ckpt_partitions_102B.sh
```
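
Putting the example values above together, the three variables in examples/ckpt_partitions_102B.sh might end up looking like the sketch below (the paths are illustrative only and should be adapted to your environment):

```
# Illustrative values only -- adjust the paths to your own setup.
LOAD_CHECKPOINT_PATH=/mnt/102B                  # base (32pp, 1tp) ckpt; must contain latest_checkpointed_iteration.txt
SAVE_SPLITED_CHECKPOINT_PATH=./ckpt-102B-mid    # temporary (32pp, 8tp) ckpt; removable after the conversion finishes
SAVE_CHECKPOINT_PATH=./ckpt-102B-8tp            # final (1pp, 8tp) ckpt used by the inference server
TOKENIZER_MODEL_PATH=./tokenizer                # tokenizer shipped in the Yuan home directory
```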

After the above steps are completed, an 8-way tensor parallel ckpt will be generated in the directory specified by `SAVE_CHECKPOINT_PATH`, which can be used for inference services.

### step2:

```
1. Set the environment variable 'CHECKPOINT_PATH' in the script 'examples/run_inference_server_102B.sh'.

vim examples/run_inference_server_102B.sh

Set the environment variable 'CHECKPOINT_PATH' to the 'SAVE_CHECKPOINT_PATH' specified in step 1. For example, if SAVE_CHECKPOINT_PATH=./ckpt-102B-8tp in step 1, set CHECKPOINT_PATH=./ckpt-102B-8tp in examples/run_inference_server_102B.sh.


2. Start the inference service (requires 8 x 80GB GPUs):

# The default port of the script is 8000. If port 8000 is already occupied, change the environment variable 'PORT' in examples/run_inference_server_102B.sh to an available port.

bash examples/run_inference_server_102B.sh

After the program finishes loading the ckpt and the following information appears, you can proceed to the next step and call the inference service:

  successfully loaded checkpoint from ./ckpt-102B-8tp at iteration 1

 * Serving Flask app 'megatron.text_generation_server'
 * Debug mode: off
   WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8000
 * Running on http://127.0.0.1:8000

3. Use the inference service from within the same Docker container:

# The default port in the script is 8000. If the service is running on a different port, change 'request_url="http://127.0.0.1:8000/yuan"' in the script 'tools/start_inference_server_api.py' to the port actually in use.

python tools/start_inference_server_api.py

If the inference service is running successfully, the inference result will be returned.
```
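
Besides tools/start_inference_server_api.py, the service can also be exercised with a raw HTTP request. The sketch below is only an illustration: the address, port, and /yuan endpoint come from the steps above, but the HTTP method and the JSON field names are assumptions and should be checked against tools/start_inference_server_api.py before use.

```
# Hedged sketch of a manual request. The method (POST) and the payload keys
# ("ques_list", "id", "ques", "tokens_to_generate") are hypothetical -- verify
# the exact request format in tools/start_inference_server_api.py.
curl -X POST http://127.0.0.1:8000/yuan \
     -H "Content-Type: application/json" \
     -d '{"ques_list": [{"id": "000", "ques": "Introduce Beijing."}], "tokens_to_generate": 100}'
```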


## Yuan 2.0-51B:

### step1:

First, you need to convert the ckpt.

The 51B model is released with 16-way pipeline parallelism and 1-way tensor parallelism (16pp, 1tp). To improve parallel efficiency during inference, you need to convert the parallelism layout of the 51B model from (16pp, 1tp) to (1pp, 4tp). (This applies to 80GB GPUs.)

The conversion process is as follows:

(16pp, 1tp) -> (16pp, 4tp) -> (1pp, 4tp)

We provide an automatic conversion script that can be used as follows:

```
1. vim examples/ckpt_partitions_51B.sh

2. Set three environment variables: LOAD_CHECKPOINT_PATH, SAVE_SPLITED_CHECKPOINT_PATH, SAVE_CHECKPOINT_PATH:

LOAD_CHECKPOINT_PATH: The path to the base 51B model (16pp, 1tp); this path must contain the 'latest_checkpointed_iteration.txt' file. An example is shown below:

LOAD_CHECKPOINT_PATH=/mnt/51B

SAVE_SPLITED_CHECKPOINT_PATH: The path for the temporary 51B model (16pp, 4tp), which can be removed once the conversion is complete. An example is shown below:

SAVE_SPLITED_CHECKPOINT_PATH=./ckpt-51B-mid

SAVE_CHECKPOINT_PATH: The path for the resulting 51B model (1pp, 4tp). An example is shown below:

SAVE_CHECKPOINT_PATH=./ckpt-51B-4tp

If you run the script from the Yuan home directory, you can set TOKENIZER_MODEL_PATH=./tokenizer (the Yuan home directory contains the tokenizer); otherwise, you need to specify the tokenizer path explicitly.

3. bash examples/ckpt_partitions_51B.sh
```

After the above steps are completed, a 4-way tensor parallel ckpt will be generated in the directory specified by `SAVE_CHECKPOINT_PATH`, which can be used for inference services.
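
Before starting the server, it can be worth a quick sanity check of the converted directory. The sketch below assumes the usual Megatron-LM checkpoint layout (a 'latest_checkpointed_iteration.txt' file at the top level plus per-rank sub-directories); the exact layout produced by the conversion script may differ:

```
# Assumption: Megatron-style checkpoint layout in the converted directory.
ls ./ckpt-51B-4tp                                      # expect per-rank sub-directories
cat ./ckpt-51B-4tp/latest_checkpointed_iteration.txt   # expect the checkpoint iteration, e.g. 1
```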

### step2:

```
1. Set the environment variable 'CHECKPOINT_PATH' in the script 'examples/run_inference_server_51B.sh'.

vim examples/run_inference_server_51B.sh

Set the environment variable 'CHECKPOINT_PATH' to the 'SAVE_CHECKPOINT_PATH' specified in step 1. For example, if SAVE_CHECKPOINT_PATH=./ckpt-51B-4tp in step 1, set CHECKPOINT_PATH=./ckpt-51B-4tp in examples/run_inference_server_51B.sh.

2. Start the inference service (requires 4 x 80GB GPUs):

# The default port of the script is 8000. If port 8000 is already occupied, change the environment variable 'PORT' in examples/run_inference_server_51B.sh to an available port.

bash examples/run_inference_server_51B.sh

After the program finishes loading the ckpt and the following information appears, you can proceed to the next step and call the inference service:

  successfully loaded checkpoint from ./ckpt-51B-4tp at iteration 1

 * Serving Flask app 'megatron.text_generation_server'
 * Debug mode: off
   WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8000
 * Running on http://127.0.0.1:8000

3. Use the inference service from within the same Docker container:

# The default port in the script is 8000. If the service is running on a different port, change 'request_url="http://127.0.0.1:8000/yuan"' in the script 'tools/start_inference_server_api.py' to the port actually in use.

python tools/start_inference_server_api.py

If the inference service is running successfully, the inference result will be returned.
```