#### Using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
NVIDIA Model Optimizer (ModelOpt) provides advanced quantization techniques optimized for NVIDIA hardware. SGLang includes a streamlined workflow for quantizing models with ModelOpt and automatically exporting them for deployment.
##### Installation
First, install ModelOpt. You can either install it directly or as an optional SGLang dependency:
```bash
# Option 1: Install ModelOpt directly
pip install nvidia-modelopt
# Option 2: Install SGLang with ModelOpt support (recommended)
pip install "sglang[modelopt]"
```
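Before proceeding, you can sanity-check the installation by importing the `modelopt` package (a quick verification; the exact version string will vary):
```bash
# Confirm ModelOpt is importable and print its version
python -c "import modelopt; print(modelopt.__version__)"
```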
##### Quantization and Export Workflow
SGLang provides an example script that demonstrates the complete ModelOpt quantization and export workflow:
```bash
# Quantize and export a model using ModelOpt FP8 quantization.
# The script path and flags below are illustrative; see the SGLang
# repository's examples for the actual script.
python3 examples/modelopt_quantize_and_export.py \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quant fp8 \
    --export-dir ./llama3.1-8b-fp8-modelopt
```
##### Key Features
- **Hardware Optimization**: Specifically optimized for NVIDIA GPU architectures
- **Advanced Quantization**: Supports cutting-edge FP8 and FP4 quantization techniques
- **Seamless Integration**: Automatic export to HuggingFace format for easy deployment
- **Calibration-based**: Uses calibration datasets for optimal quantization quality
- **Production Ready**: Enterprise-grade quantization with NVIDIA support
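Once exported, the checkpoint can be served directly with SGLang. A minimal sketch, reusing the hypothetical export directory from the example above and assuming your SGLang version supports the `modelopt` quantization backend:
```bash
# Serve the exported ModelOpt checkpoint (directory name is hypothetical)
python3 -m sglang.launch_server \
    --model-path ./llama3.1-8b-fp8-modelopt \
    --quantization modelopt \
    --port 30000
```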
## Online Quantization
To enable online quantization, simply specify `--quantization` on the command line. For example, you can launch the server with the following command to enable `FP8` quantization for the model `meta-llama/Meta-Llama-3.1-8B-Instruct`:
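```bash
# Launch the SGLang server with online FP8 quantization
python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantization fp8 \
    --port 30000
```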