"tests/wav2vec2/test_tokenization_wav2vec2.py" did not exist on "a6938c4721491d681b5520ea0611ceed56e74f22"
quickstart_inference.md 6.54 KB
Newer Older
Rayyyyy's avatar
Rayyyyy committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
# Quick Start: Inference

This document is a quick guide to running inference with the Yuan 2.0-102B and Yuan 2.0-51B models, covering checkpoint (ckpt) conversion and use of the inference service.

## Yuan 2.0-102B:

### step1:

First, you need to convert the ckpt.

The 102B model is released with 32-way pipeline parallelism and 1-way tensor parallelism (32pp, 1tp). To improve parallel efficiency during inference, you need to convert the parallelism layout of the 102B model from (32pp, 1tp) to (1pp, 8tp). (This applies to 80GB GPUs.)

The conversion process is as follows:

(32pp, 1tp) -> (32pp, 8tp) -> (1pp, 8tp)

We provide an automatic conversion script that can be used as follows:

```
1. vim examples/ckpt_partitions_102B.sh

2. Set three environment variables: LOAD_CHECKPOINT_PATH, SAVE_SPLITED_CHECKPOINT_PATH, SAVE_CHECKPOINT_PATH:

LOAD_CHECKPOINT_PATH: The path to the base 102B model (32pp, 1tp); this path must contain the 'latest_checkpointed_iteration.txt' file. An example is shown below:

LOAD_CHECKPOINT_PATH=/mnt/102B

SAVE_SPLITED_CHECKPOINT_PATH: The path for the temporary 102B model (32pp, 8tp), which can be removed once the conversion is complete. An example is shown below:

SAVE_SPLITED_CHECKPOINT_PATH=./ckpt-102B-mid

SAVE_CHECKPOINT_PATH: The path for the resulting 102B model (1pp, 8tp). An example is shown below:

SAVE_CHECKPOINT_PATH=./ckpt-102B-8tp

If you run the script from the Yuan home directory, you can set TOKENIZER_MODEL_PATH=./tokenizer (the Yuan home directory contains the tokenizer); otherwise, you need to specify the tokenizer path explicitly.

3. bash examples/ckpt_partitions_102B.sh
```
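
Putting the example values above together, the three variables in examples/ckpt_partitions_102B.sh might end up looking like the sketch below (the paths are illustrative only and should be adapted to your environment):

```
# Illustrative values only -- adjust the paths to your own setup.
LOAD_CHECKPOINT_PATH=/mnt/102B                  # base (32pp, 1tp) ckpt; must contain latest_checkpointed_iteration.txt
SAVE_SPLITED_CHECKPOINT_PATH=./ckpt-102B-mid    # temporary (32pp, 8tp) ckpt; removable after the conversion finishes
SAVE_CHECKPOINT_PATH=./ckpt-102B-8tp            # final (1pp, 8tp) ckpt used by the inference server
TOKENIZER_MODEL_PATH=./tokenizer                # tokenizer shipped in the Yuan home directory
```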

After the above steps are completed, an 8-way tensor parallel ckpt will be generated in the directory specified by `SAVE_CHECKPOINT_PATH`, which can be used for inference services.

### step2:

```
1. Set the environment variable 'CHECKPOINT_PATH' in the script 'examples/run_inference_server_102B.sh'.

vim examples/run_inference_server_102B.sh

Set the environment variable 'CHECKPOINT_PATH' to the 'SAVE_CHECKPOINT_PATH' specified in step 1. For example, if SAVE_CHECKPOINT_PATH=./ckpt-102B-8tp in step 1, set CHECKPOINT_PATH=./ckpt-102B-8tp in examples/run_inference_server_102B.sh.


2. Start the inference service (requires 8 x 80GB GPUs):

# The default port of the script is 8000. If port 8000 is already occupied, change the environment variable 'PORT' in examples/run_inference_server_102B.sh to an available port.

bash examples/run_inference_server_102B.sh

After the program finishes loading the ckpt and the following information appears, you can proceed to the next step and call the inference service:

  successfully loaded checkpoint from ./ckpt-102B-8tp at iteration 1

 * Serving Flask app 'megatron.text_generation_server'
 * Debug mode: off
   WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8000
 * Running on http://127.0.0.1:8000

3. Use the inference service from within the same Docker container:

# The default port in the script is 8000. If the service is running on a different port, change 'request_url="http://127.0.0.1:8000/yuan"' in the script 'tools/start_inference_server_api.py' to the port actually in use.

python tools/start_inference_server_api.py

If the inference service is running successfully, the inference result will be returned.
```
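
Besides tools/start_inference_server_api.py, the service can also be exercised with a raw HTTP request. The sketch below is only an illustration: the address, port, and /yuan endpoint come from the steps above, but the HTTP method and the JSON field names are assumptions and should be checked against tools/start_inference_server_api.py before use.

```
# Hedged sketch of a manual request. The method (POST) and the payload keys
# ("ques_list", "id", "ques", "tokens_to_generate") are hypothetical -- verify
# the exact request format in tools/start_inference_server_api.py.
curl -X POST http://127.0.0.1:8000/yuan \
     -H "Content-Type: application/json" \
     -d '{"ques_list": [{"id": "000", "ques": "Introduce Beijing."}], "tokens_to_generate": 100}'
```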


## Yuan 2.0-51B:

### step1:

First, you need to convert the ckpt.

The 51B model is released with 16-way pipeline parallelism and 1-way tensor parallelism (16pp, 1tp). To improve parallel efficiency during inference, you need to convert the parallelism layout of the 51B model from (16pp, 1tp) to (1pp, 4tp). (This applies to 80GB GPUs.)

The conversion process is as follows:

(16pp, 1tp) -> (16pp, 4tp) -> (1pp, 4tp)

We provide an automatic conversion script that can be used as follows:

```
1. vim examples/ckpt_partitions_51B.sh

2. Set three environment variables: LOAD_CHECKPOINT_PATH, SAVE_SPLITED_CHECKPOINT_PATH, SAVE_CHECKPOINT_PATH:

LOAD_CHECKPOINT_PATH: The path to the base 51B model (16pp, 1tp); this path must contain the 'latest_checkpointed_iteration.txt' file. An example is shown below:

LOAD_CHECKPOINT_PATH=/mnt/51B

SAVE_SPLITED_CHECKPOINT_PATH: The path for the temporary 51B model (16pp, 4tp), which can be removed once the conversion is complete. An example is shown below:

SAVE_SPLITED_CHECKPOINT_PATH=./ckpt-51B-mid

SAVE_CHECKPOINT_PATH: The path for the resulting 51B model (1pp, 4tp). An example is shown below:

SAVE_CHECKPOINT_PATH=./ckpt-51B-4tp

If you run the script from the Yuan home directory, you can set TOKENIZER_MODEL_PATH=./tokenizer (the Yuan home directory contains the tokenizer); otherwise, you need to specify the tokenizer path explicitly.

3. bash examples/ckpt_partitions_51B.sh
```

After the above steps are completed, a 4-way tensor parallel ckpt will be generated in the directory specified by `SAVE_CHECKPOINT_PATH`, which can be used for inference services.
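
Before starting the server, it can be worth a quick sanity check of the converted directory. The sketch below assumes the usual Megatron-LM checkpoint layout (a 'latest_checkpointed_iteration.txt' file at the top level plus per-rank sub-directories); the exact layout produced by the conversion script may differ:

```
# Assumption: Megatron-style checkpoint layout in the converted directory.
ls ./ckpt-51B-4tp                                      # expect per-rank sub-directories
cat ./ckpt-51B-4tp/latest_checkpointed_iteration.txt   # expect the checkpoint iteration, e.g. 1
```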

### step2:

```
1. Set the environment variable 'CHECKPOINT_PATH' in the script 'examples/run_inference_server_51B.sh'.

vim examples/run_inference_server_51B.sh

Set the environment variable 'CHECKPOINT_PATH' to the 'SAVE_CHECKPOINT_PATH' specified in step 1. For example, if SAVE_CHECKPOINT_PATH=./ckpt-51B-4tp in step 1, set CHECKPOINT_PATH=./ckpt-51B-4tp in examples/run_inference_server_51B.sh.

2. Start the inference service (requires 4 x 80GB GPUs):

# The default port of the script is 8000. If port 8000 is already occupied, change the environment variable 'PORT' in examples/run_inference_server_51B.sh to an available port.

bash examples/run_inference_server_51B.sh

After the program finishes loading the ckpt and the following information appears, you can proceed to the next step and call the inference service:

  successfully loaded checkpoint from ./ckpt-51B-4tp at iteration 1

 * Serving Flask app 'megatron.text_generation_server'
 * Debug mode: off
   WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8000
 * Running on http://127.0.0.1:8000

3. Use the inference service from within the same Docker container:

# The default port in the script is 8000. If the service is running on a different port, change 'request_url="http://127.0.0.1:8000/yuan"' in the script 'tools/start_inference_server_api.py' to the port actually in use.

python tools/start_inference_server_api.py

If the inference service is running successfully, the inference result will be returned.
```