# W8A8 LLM Model Deployment

LMDeploy provides functions for quantization and inference of large language models using 8-bit integers.

Before starting inference, ensure that lmdeploy and openai/triton are correctly installed. Execute the following commands to install them:

```shell
pip install lmdeploy
pip install "triton>=2.1.0"
```

## 8-bit Weight Model Inference

For 8-bit weight model inference, you can directly download the pre-quantized 8-bit weight models from LMDeploy's [model zoo](https://huggingface.co/lmdeploy). For instance, the 8-bit Internlm-chat-7B model is available for direct download from the model zoo:

```shell
git-lfs install
git clone https://huggingface.co/lmdeploy/internlm-chat-7b-w8 (coming soon)
```

Alternatively, you can manually convert the original 16-bit weights into 8-bit by referring to the ["8bit Weight Quantization"](#8bit-weight-quantization) section below, saving the result in the internlm-chat-7b-w8 directory with the command:

```shell
lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8
```

Afterwards, use the following command to interact with the model via the terminal:

```shell
lmdeploy chat torch ./internlm-chat-7b-w8
```

## Launching gradio service

Coming soon...

## Inference Speed

Coming soon...

## 8bit Weight Quantization

Performing 8-bit weight quantization involves three steps:

1. **Smooth Weights**: Start by smoothing the weights of the Language Model (LLM). This process makes the weights more amenable to quantization.
2. **Replace Modules**: Locate the DecoderLayers and replace the RMSNorm and nn.Linear modules with QRMSNorm and QLinear modules respectively. These 'Q' modules are available in the lmdeploy/pytorch/models/q_modules.py file.
3. **Save the Quantized Model**: Once you've made the necessary replacements, save the new quantized model.

The script `lmdeploy/lite/apis/smooth_quant.py` accomplishes all three tasks detailed above. For example, you can obtain the model weights of the quantized Internlm-chat-7B model by running the following command:

```shell
lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8
```

After saving, you can instantiate your quantized model by calling the from_pretrained interface.
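
As a rough illustration of that last step, the sketch below loads the saved directory and runs a short generation. It assumes the checkpoint produced by `lmdeploy lite smooth_quant` at `./internlm-chat-7b-w8` can be opened with the standard Hugging Face `from_pretrained` call; the exact loader may differ by LMDeploy version, and the path, prompt, and generation settings are placeholders rather than part of the original document.

```python
# Minimal sketch, assuming the quantized checkpoint is a Hugging Face-format
# directory loadable via the standard from_pretrained interface.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./internlm-chat-7b-w8"  # hypothetical output of the smooth_quant command above

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).cuda().eval()

prompt = "Hello, please introduce yourself."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```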