## Quick Start (run accuracy and performance tests)
Please complete the installation steps before proceeding:
1. Modify the task workload, for example [chatglm2-torch-fp16-6b.json](https://github.com/bytedance/ByteMLPerf/blob/main/byte_infer_perf/llm_perf/workloads/chatglm2-torch-fp16-6b.json).
2. Download the model weights using prepare_model.sh or huggingface-cli.
3. Download the model's output logits for specific input cases (.npy files) using prepare_model.sh.
4. Start the accuracy and performance tests.

You can run the following command to automate all of the steps above with the chatglm2 model on the GPU backend:
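```shell
# Example invocation; verify the launcher path and flags against your checkout of byte_infer_perf/llm_perf.
python3 byte_infer_perf/llm_perf/launch.py --hardware_type GPU --task chatglm2-torch-fp16-6b
```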
To test the server with a single prompt, run the command below; it returns the inference result, a logits numpy file, and the model forward time. Output files will be located in `./reports/single_query/`:
```shell
python3 ./script/single_query.py --prompt "What is 7 multiplied by 7?" --batch_size 8
```
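As a quick sanity check, you can diff the dumped logits against the reference logits fetched in step 3. The file names in this sketch are placeholders, not the actual names written by the scripts:
```python
# Placeholder file names -- substitute the real .npy paths produced by
# single_query.py (under ./reports/single_query/) and by prepare_model.sh.
import numpy as np

out = np.load("./reports/single_query/logits.npy")  # logits from the backend under test
ref = np.load("./reference_logits.npy")             # reference logits downloaded earlier
print("max abs diff:", np.abs(out - ref).max())
```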
## Test model_impl model forward performance
You only need to instantiate an MpEngine running mixtral-8x22b (tp_size=8, max_batch_size=8) and feed it proper inputs. Running the following command produces the performance outputs; you can currently modify the test cases in `./bench_model.py`.
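For example (the flags below are assumptions; check `bench_model.py` for the arguments your version actually accepts):
```shell
# Flag names are assumptions -- inspect bench_model.py for the real CLI.
python3 ./bench_model.py --hardware_type GPU --model_config mixtral-torch-bf16-8x22b
```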
The output will be located in `./reports/{hardware_type}/{model_config}/bench_model`:
- **config.json**: perf config
- **context_perf.csv**: prefill latency for each specified {batch_size, seq_len}
- **decode_perf.csv**: decode latency for each specified {batch_size, seq_len}
- **output.txt**: raw latency data
## Demo Project
[GPU Backend](https://github.com/bytedance/ByteMLPerf/tree/main/byte_infer_perf/llm_perf/backends/GPU) provides a demo project that implements LLM inference for chatglm2-6b on A100 with the following features:
- Separate functional components (see the sketch after this list):
    * Scheduler
        - custom scheduling of tasks
    * Inferencer
        - converts tasks into real model inputs and collects outputs
    * Mp Engine
        - handles TP logic using multiple processes
    * Sampler
        - post-processing logic
    * Ckpt Loader
        - custom checkpoint loader with split logic that matches the TP layout
    * Custom model implementation
        - custom model implementation using the hardware backend's torch realization
- Separate scheduling logic:
    * Context: one task, input_ids shape is [1, q_len]
    * Decode: multiple tasks, input_ids shape up to [max_batch_size, 1]
- Tensor parallelism
- KV cache
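To make the component split concrete, here is a minimal, self-contained sketch of how a scheduler, inferencer, engine, and sampler could fit together. All class and method names are illustrative placeholders rather than the demo project's actual API; the real backend additionally handles multi-process TP, checkpoint loading, and the KV cache.
```python
# Minimal sketch of the component split described above. Every name is a
# placeholder; the real demo code lives under byte_infer_perf/llm_perf/backends/GPU
# and differs in detail (multi-process TP, checkpoint loading, KV cache, ...).
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    prompt_ids: List[int]                                # tokenized prompt
    generated: List[int] = field(default_factory=list)   # tokens produced so far

    @property
    def in_context(self) -> bool:                        # still needs the prefill pass?
        return not self.generated


class Scheduler:
    """Custom scheduling: one context (prefill) task at a time, decode tasks batched."""

    def __init__(self, max_batch_size: int = 8):
        self.max_batch_size = max_batch_size
        self.tasks: List[Task] = []

    def add(self, task: Task) -> None:
        self.tasks.append(task)

    def next_batch(self) -> List[Task]:
        context = [t for t in self.tasks if t.in_context]
        if context:                                      # Context: one task, input_ids [1, q_len]
            return context[:1]
        return self.tasks[: self.max_batch_size]         # Decode: up to [max_batch_size, 1]


class MpEngine:
    """Stand-in for the TP engine; the real one spawns one process per TP rank."""

    def forward(self, input_ids: List[List[int]]) -> List[List[float]]:
        # Fake per-sequence logits over a toy 16-token vocabulary.
        return [[random.random() for _ in range(16)] for _ in input_ids]


class Sampler:
    """Post-processing: greedy argmax over the last-step logits."""

    def sample(self, logits: List[List[float]]) -> List[int]:
        return [max(range(len(row)), key=row.__getitem__) for row in logits]


class Inferencer:
    """Turns scheduled tasks into engine inputs and writes sampled tokens back."""

    def __init__(self, engine: MpEngine, sampler: Sampler):
        self.engine, self.sampler = engine, sampler

    def step(self, batch: List[Task]) -> None:
        # Context pass feeds the whole prompt; decode pass feeds only the last token.
        input_ids = [t.prompt_ids if t.in_context else t.generated[-1:] for t in batch]
        for task, token in zip(batch, self.sampler.sample(self.engine.forward(input_ids))):
            task.generated.append(token)


if __name__ == "__main__":
    scheduler, inferencer = Scheduler(), Inferencer(MpEngine(), Sampler())
    for prompt in ([1, 2, 3], [4, 5]):
        scheduler.add(Task(list(prompt)))
    for _ in range(4):                                   # run a few scheduling steps
        inferencer.step(scheduler.next_batch())
    print([t.generated for t in scheduler.tasks])
```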
The demo project is intended to provide a reference implementation, and there's no guarantee of achieving optimal performance. More technical details will be provided later on [ByteMLPerf](https://bytemlperf.ai).
## Vendor Integration
Vendors can refer to this document for guidance on building a backend: [Byte LLM Perf](https://bytemlperf.ai/zh/guide/inference_llm_vendor.html)