# Yuan2.0 Inference-Server

## Introduction

This document provides instructions for deploying the inference server of Yuan2.0.

- [CKPT model Inference-Server](#ckpt-model-inference-server)
- [HuggingFace model Inference-Server](#huggingface-model-inference-server)
- [API Testing](#api-testing)

## CKPT model Inference-Server

- First step: modify the deployment script (see the script names in the second step).
  - `TOKENIZER_MODEL_PATH` indicates the storage path of the tokenizer-related files;
  - `CHECKPOINT_PATH` indicates the storage path of the model checkpoint files;
  - `GPUS_PER_NODE` indicates the number of GPUs used on this node; it must match the tensor-parallel size of the model;
  - `CUDA_VISIBLE_DEVICES` indicates which GPU devices are used; the number of devices listed must match `GPUS_PER_NODE`;
  - `PORT` indicates the port number used by the service; each service occupies one port, and users can change it to fit their environment.
- Second step: run the corresponding script in the repository to deploy the service.

```bash
# 2.1B deployment command
bash examples/run_inference_server_2.1B.sh

# 51B deployment command
bash examples/run_inference_server_51B.sh

# 102B deployment command
bash examples/run_inference_server_102B.sh
```

## HuggingFace model Inference-Server

- First step: modify the script file `examples/run_inference_server_hf.sh`.
  - `HF_PATH` indicates the storage path of the HuggingFace model files;
  - `CUDA_VISIBLE_DEVICES` indicates which GPU devices are used;
  - `PORT` indicates the port number used by the service; each service occupies one port, and users can change it to fit their environment.
- Second step: run the script in the repository to deploy the service.

```bash
bash examples/run_inference_server_hf.sh
```

- Attention: when running on Windows or on CPU, flash attention (`flash_attn`) must be disabled manually, and the HuggingFace model files need to be modified as follows:

```
Modify "use_flash_attention" in config.json to false;
Comment out lines 35 and 36 in yuan_hf_model.py;
Modify line 271 in yuan_hf_model.py to:
    inference_hidden_states_memory = torch.empty(bsz, 2, hidden_states.shape[2], dtype=hidden_states.dtype)
```

## API Testing

- Testing with Python

We also provide sample code to test the API. Before running it, modify the `ip` and `port` in the code to match your deployment.

```bash
python tools/start_inference_server_api.py
```

- Testing with Curl

```bash
# Returns the response with Unicode-escaped characters
curl http://127.0.0.1:8000/yuan -X PUT \
	--header 'Content-Type: application/json' \
	--data '{"ques_list":[{"id":"000","ques":"请帮忙作一首诗,主题是冬至"}], "tokens_to_generate":500, "top_k":5}'

# Returns the decoded (original) text
echo -en "$(curl -s http://127.0.0.1:8000/yuan -X PUT --header 'Content-Type: application/json' --data '{"ques_list":[{"id":"000","ques":"作一首词,主题是冬至"}], "tokens_to_generate":500, "top_k":5}')"
```
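For reference, below is a minimal Python sketch of such an API call. It assumes the endpoint, method, and payload shape shown in the curl examples above (`PUT` to `/yuan` with a JSON body); the response format is not specified in this document, so the snippet simply prints the raw body. The host, port, and timeout are placeholders to adapt to your deployment.

```python
import requests  # third-party HTTP client, assumed to be installed

# Placeholder address: use the host and the PORT set in your deployment script.
URL = "http://127.0.0.1:8000/yuan"

# Same payload shape as in the curl examples;
# the example prompt asks for a poem about the winter solstice.
payload = {
    "ques_list": [{"id": "000", "ques": "请帮忙作一首诗,主题是冬至"}],
    "tokens_to_generate": 500,
    "top_k": 5,
}

# The service is called with PUT and a JSON body, as in the curl examples.
resp = requests.put(URL, json=payload, timeout=300)
resp.raise_for_status()

# The raw body may contain \uXXXX escapes (see the first curl example);
# if the body is JSON, resp.json() decodes those escapes automatically.
print(resp.text)
```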