# When there are image tokens and this stage only receives vision embeddings, adjust the recv buffer seq length to match the image embeddings sequence length.
# If there are image tokens and this stage receives full embeddings, make sure we compensate for the expansion of image tokens.
# Note that this will set a recv_buffer_seq_length for the encoder stage; that length is irrelevant since the encoder's recv buffer is never allocated.
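A rough sketch of the adjustment these notes describe is shown below; every name in it (`num_image_tokens`, `img_embeddings_per_token`, `decoder_seq_len`, `vision_seq_len`, `receives_only_vision_embeddings`) is a hypothetical placeholder rather than an actual variable from the Megatron code.

```python
# Hypothetical placeholder inputs (illustrative values only).
num_image_tokens = 2                     # image placeholder tokens in the text sequence
img_embeddings_per_token = 576           # embeddings each image token expands into
decoder_seq_len = 4096                   # text sequence length before expansion
vision_seq_len = num_image_tokens * img_embeddings_per_token
receives_only_vision_embeddings = False  # True for the stage fed directly by the vision encoder

# Size the pipeline-parallel recv buffer depending on what this stage receives.
if num_image_tokens > 0 and receives_only_vision_embeddings:
    # The stage consumes the vision encoder output directly.
    recv_buffer_seq_length = vision_seq_len
elif num_image_tokens > 0:
    # Full embeddings: each image token expands into multiple image embeddings,
    # so compensate for that expansion.
    recv_buffer_seq_length = decoder_seq_len + num_image_tokens * (img_embeddings_per_token - 1)
else:
    recv_buffer_seq_length = decoder_seq_len
```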
The model card name (see the support list in `conf/`) is expected as an input to all the sample scripts.
Other arguments are specified as variables (e.g. `TP=8`) that you can either set inline before the `bash`
command or export to the current bash environment upfront.
The script will perform per-tensor FP8 fake quantization and generate a few tokens as a quick check that the quantized model still behaves correctly. The end results are stored in `/tmp/megatron_workspace/meta-llama/Llama-3.2-1B-Instruct_quant`. This is a Megatron-Core (MCore) distributed checkpoint (with additional states), which can be loaded for quantization-aware training (QAT) or exported for deployment.
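At its core, the fake-quantization step corresponds roughly to the Model Optimizer `mtq.quantize` call sketched below; `model` and `calib_dataloader` are placeholders for whatever the sample script constructs around the Megatron checkpoint.

```python
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Run a few calibration batches so per-tensor activation ranges (amax) are collected.
    for batch in calib_dataloader:  # placeholder calibration dataloader
        model(**batch)

# Insert per-tensor FP8 fake-quantization nodes into the (placeholder) model.
# Weights and activations are rounded through FP8 while compute stays in the
# original precision, which is why generating tokens works as a sanity check.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```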
## Export for TensorRT-LLM, vLLM, SGLang Deployment
For supported Hugging Face models, TensorRT Model Optimizer can export the quantized model to a Hugging Face-like (HF-like) checkpoint for TensorRT-LLM, vLLM, or SGLang deployment.
> **NOTE:** The HF-like export only supports pipeline parallelism (`PP`); all other parallelism
> degrees must be set to 1. The exported checkpoint is sharded with safetensors. Although it is
> HF-like, this format currently cannot be loaded by `from_pretrained()`.
The exported checkpoint is stored in `/tmp/megatron_workspace/meta-llama/Llama-3.1-8B-Instruct_export`, which can be provided as an input to most of the `LLM` APIs.
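For example, a minimal vLLM offline-inference snippet might look like the following; the `quantization="modelopt"` argument assumes vLLM's Model Optimizer FP8 backend and may need adjusting for your deployment stack.

```python
from vllm import LLM, SamplingParams

# Point vLLM at the exported HF-like checkpoint directory.
llm = LLM(
    model="/tmp/megatron_workspace/meta-llama/Llama-3.1-8B-Instruct_export",
    quantization="modelopt",  # assumption: vLLM's Model Optimizer (FP8) quantization backend
)

# Generate a few tokens as a quick functional check of the deployed checkpoint.
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```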