# Yuan2.0 Inference-Server

## Introduction

This document describes how to deploy and test the Yuan2.0 inference server.


  - [CKPT model Inference-Server](#ckpt-model-inference-server)
  - [HuggingFace model Inference-Server](#huggingface-model-inference-server)
  - [API Testing](#api-testing)
  

## CKPT model Inference-Server

- First step: modify the script file (e.g. `examples/run_inference_server_2.1B.sh`) and set the following variables; an illustrative sketch follows the list

    - `TOKENIZER_MODEL_PATH`: storage path of the tokenizer files;
    - `CHECKPOINT_PATH`: storage path of the model checkpoint files;
    - `GPUS_PER_NODE`: number of GPUs used on this node; it must match the model's tensor-parallel size;
    - `CUDA_VISIBLE_DEVICES`: IDs of the GPUs to use; the number of listed IDs must match `GPUS_PER_NODE`;
    - `PORT`: port number used by the service. Each service occupies one port; change it to suit your setup.
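
    For orientation, a minimal sketch of what these settings might look like at the top of the script. The values below are illustrative assumptions only; the script shipped in `examples/` is authoritative:

    ```bash
    # Illustrative values -- adjust to your environment.
    TOKENIZER_MODEL_PATH=./tokenizer        # tokenizer files
    CHECKPOINT_PATH=./ckpts/yuan2.0-2.1B    # model checkpoint files
    GPUS_PER_NODE=1                         # must equal the tensor-parallel size
    export CUDA_VISIBLE_DEVICES=0           # one device ID per GPU counted above
    PORT=8000                               # one port per service instance
    ```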
  
- Second step: run the corresponding script from the repository to deploy the service
    ```bash
    #2.1B deployment command
    bash examples/run_inference_server_2.1B.sh
    
    #51B deployment command
    bash examples/run_inference_server_51B.sh
    
    #102B deployment command
    bash examples/run_inference_server_102B.sh
    ```
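
    Once a script is running, the service listens on the configured `PORT`. A quick connectivity check, assuming the default `127.0.0.1:8000` used in the API Testing section below:

    ```bash
    curl -s http://127.0.0.1:8000/yuan -X PUT \
    --header 'Content-Type: application/json' \
    --data '{"ques_list":[{"id":"000","ques":"Hello"}], "tokens_to_generate":30, "top_k":5}'
    ```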


## HuggingFace model Inference-Server

- First step: modify the script file `examples/run_inference_server_hf.sh` and set the following variables; an illustrative sketch follows the list

    - `HF_PATH`: storage path of the HuggingFace model files;
    - `CUDA_VISIBLE_DEVICES`: IDs of the GPUs to use;
    - `PORT`: port number used by the service. Each service occupies one port; change it to suit your setup.
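
    As above, a minimal sketch with illustrative values; the script in `examples/` is authoritative:

    ```bash
    # Illustrative values -- adjust to your environment.
    HF_PATH=./yuan2.0-hf            # HuggingFace model directory (hypothetical path)
    export CUDA_VISIBLE_DEVICES=0   # device ID(s) to use
    PORT=8000                       # one port per service instance
    ```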
  
- Second step: run the script from the repository to deploy the service

   ```bash
   bash examples/run_inference_server_hf.sh
   ```
   
- Attention: when running on Windows or CPU, flash attention must be disabled manually, and the HuggingFace model files need to be modified as follows:
   ```
   Modify "use_flash_attention" in config.json to false;
   Comment out lines 35 and 36 in yuan_hf_model.py;
   Change line 271 in yuan_hf_model.py to: inference_hidden_states_memory = torch.empty(bsz, 2, hidden_states.shape[2], dtype=hidden_states.dtype)
   ```
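
   The first two edits can also be scripted. A minimal sketch, assuming the model files live in `$HF_PATH` and GNU `sed` is available; the line-271 change is easier to make by hand, since it must preserve the file's indentation:

   ```bash
   # Disable flash attention in config.json (assumes the key is stored as a
   # JSON boolean -- verify against your config.json before running).
   sed -i 's/"use_flash_attention": *true/"use_flash_attention": false/' "$HF_PATH/config.json"

   # Comment out lines 35 and 36 of yuan_hf_model.py.
   sed -i '35,36s/^/# /' "$HF_PATH/yuan_hf_model.py"
   ```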


## API Testing

- Testing with Python

We have also written sample code to test API calls. Before running it, make sure the `ip` and `port` in the code match your deployment.

```bash
python tools/start_inference_server_api.py
```

- Testing with Curl

```bash
# Return the response with non-ASCII characters as Unicode escapes
# (the sample question asks for a poem on the theme of the winter solstice)
curl http://127.0.0.1:8000/yuan -X PUT \
--header 'Content-Type: application/json' \
--data '{"ques_list":[{"id":"000","ques":"请帮忙作一首诗,主题是冬至"}], "tokens_to_generate":500, "top_k":5}'

# Return the response in its original, decoded form
echo -en "$(curl -s http://127.0.0.1:8000/yuan -X PUT --header 'Content-Type: application/json' --data '{"ques_list":[{"id":"000","ques":"作一首词,主题是冬至"}], "tokens_to_generate":500, "top_k":5}')"
```