This module exports Megatron Core models to different inference frameworks.
Currently, TRTLLM export is supported. In the future we will add support for vLLM and other frameworks.
## PTQ AND EXPORT
Follow the examples in [TensorRT Model Optimizer](../post_training/modelopt) to perform post-training quantization (PTQ), followed by an export to a HF-like checkpoint for TensorRT-LLM, vLLM, and SGLang deployment.
## TRTLLM EXPORT
Follow the instructions in [trtllm_export](./trtllm_export/) to export to the TRTLLM checkpoint format only.
This guide walks you through using the Megatron Core export module to export models to TRTLLM format.
### Contents
- [Megatron Core To TRTLLM Export Documentation](#megatron-core-to-trtllm-export-documentation)
  - [Contents](#contents)
    - [1. Quick Start](#1-quick-start)
      - [1.1 Understanding The Code](#11-understanding-the-code)
      - [1.2 Running The Code](#12-running-the-code)
    - [2. GPU Export](#2-gpu-export)
    - [3. Future work](#3-future-work)
#### 1. Quick Start
This section walks you through converting an MCore GPT model to TRTLLM format in single-device mode. The script can be found at [gpt_single_device_cpu_export.py](./single_device_export/gpt_single_device_cpu_export.py).
NOTE: For faster conversion, if your entire model fits into GPU memory, transfer the model state dict to the GPU before calling the `get_trtllm_pretrained_config_and_model_weights` function.
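
The pre-transfer mentioned in the note can be sketched as follows. This is a minimal illustration, not code from the export scripts: `move_state_dict` is a hypothetical helper name, and it only assumes that each value exposes a `.to(device)` method, as `torch.Tensor` does.

```python
from typing import Any, Dict


def move_state_dict(state_dict: Dict[str, Any], device: str) -> Dict[str, Any]:
    """Return a new dict with every non-None value moved to `device`.

    Hypothetical helper: assumes each value has a `.to(device)` method,
    which is the case for torch.Tensor.
    """
    moved = {}
    for key, val in state_dict.items():
        if val is None:
            # Entries without a value (for example _extra_state) are skipped.
            continue
        moved[key] = val.to(device)
    return moved
```

With PyTorch tensors you would call something like `move_state_dict(gpt_model.state_dict(), "cuda")` and pass the result to the conversion function. Whether this helps depends on the whole model fitting in GPU memory.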
<br>
##### 1.1 Understanding The Code
***STEP 1 - We initialize model parallel and other default arguments***
We initialize TP and PP to 1 so that we can get the full model state dict on the CPU.
We then instantiate the [TRTLLM Helper](../../../megatron/core/export/trtllm/trtllm_helper.py) for the GPT model.
To convert model weights to TRTLLM weights and configs, we use the [single_device_converter](../../../megatron/core/export/trtllm/trtllm_weights_converter/single_device_trtllm_model_weights_converter.py). We pass the model state dict and the export config as inputs. In this example, we use an inference TP size of 2 for the export.
```python
model_state_dict = {}
for key, val in gpt_model.state_dict().items():
    # val is None for _extra_state layers, so we filter those entries out.
    if val is not None:
        model_state_dict[key] = val
```
#### 2. GPU Export
You can use [gpt_distributed_gpu_export.py](./distributed_export/gpt_distributed_gpu_export.py) to run a more optimized, on-device distributed version of the TRTLLM export. Internally, this uses the [distributed_converter](../../../megatron/core/export/trtllm/trtllm_weights_converter/distributed_trtllm_model_weights_converter.py) to convert model weights on the device.
In the single-device version, you collect all the model weights on the CPU or GPU, convert them to TRTLLM format, and then store the engine on disk. In the GPU version, you load each individual state dict on the GPUs, convert it on the device itself, and store the engine on disk.