### Example: Serving with 16 A100/A800 GPUs with INT8 Quantization
Both block-wise and per-channel quantization methods are available, and the quantized parameters have already been uploaded to Hugging Face. One example is as follows:
...
Assuming that the master node IP is `MASTER_IP`, the checkpoint path is `/path/to/DeepSeek-R1-INT8`, and the port is `5000`, the following commands launch the server:
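
As a minimal sketch of such a launch, the script below assumes that the server is started through SGLang's `sglang.launch_server` module, that the 16 GPUs are split across two 8-GPU nodes, and that the flags shown match your installed version (check `python3 -m sglang.launch_server --help`). The `NODE_RANK` and `RUN` variables are hypothetical helpers introduced here for illustration. By default the script only prints the command; set `RUN=1` to execute it.

```shell
#!/bin/sh
# Sketch only: module name and flags assume an SGLang install; verify with --help.
MASTER_IP="${MASTER_IP:-MASTER_IP}"      # replace with the master node's IP
NODE_RANK="${NODE_RANK:-0}"              # 0 on the master node, 1 on the second node
MODEL_PATH="/path/to/DeepSeek-R1-INT8"   # checkpoint path from the text above

# Assumption: 16 GPUs = 2 nodes x 8 GPUs, so tensor parallelism spans both nodes.
LAUNCH_CMD="python3 -m sglang.launch_server \
--model-path ${MODEL_PATH} --tp 16 \
--dist-init-addr ${MASTER_IP}:5000 --nnodes 2 --node-rank ${NODE_RANK} \
--trust-remote-code --enable-torch-compile"

# Dry run by default: show the command that would be executed.
echo "${LAUNCH_CMD}"

# Execute only when explicitly requested.
if [ "${RUN:-0}" = "1" ]; then
    exec ${LAUNCH_CMD}
fi
```

Run it once per node, for example `MASTER_IP=<ip> NODE_RANK=0 RUN=1 sh launch.sh` on the master node and the same with `NODE_RANK=1` on the second node (`launch.sh` is a hypothetical file name).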
> **Note that the launch command here enables `torch.compile` optimization.** For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).
Then, on the **master node**, assuming the ShareGPT data is located at `/path/to/ShareGPT_V3_unfiltered_cleaned_split.json`, you can run the following commands to benchmark the launched server:
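
A benchmark run could be sketched as follows, assuming the `sglang.bench_serving` module is available and the server listens on its default host and port on the master node. The prompt count is an arbitrary illustrative value, and `RUN` is a hypothetical helper variable: the script prints the command unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch only: module name and flags assume an SGLang install; verify with
#   python3 -m sglang.bench_serving --help
DATASET="/path/to/ShareGPT_V3_unfiltered_cleaned_split.json"
NUM_PROMPTS="${NUM_PROMPTS:-1000}"   # illustrative value, not a recommendation

BENCH_CMD="python3 -m sglang.bench_serving --backend sglang \
--dataset-name sharegpt --dataset-path ${DATASET} \
--num-prompts ${NUM_PROMPTS}"

# Dry run by default: show the command that would be executed.
echo "${BENCH_CMD}"

# Execute only when explicitly requested.
if [ "${RUN:-0}" = "1" ]; then
    exec ${BENCH_CMD}
fi
```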