"""TRTLLM Helper class to convert export and build TRTLLM model."""
def __init__(
    self,
    *,
    transformer_config: TransformerConfig,
    model_type: ModelType,
    trtllm_conversion_dict: dict = {},
    position_embedding_type: str = 'learned_absolute',
    max_position_embeddings: int = None,
    rotary_percentage: float = 1.0,
    rotary_base: int = 10000,
    rope_scaling_factor: float = 8.0,
    moe_tp_mode: int = 2,
    multi_query_mode: bool = False,
    activation: str = "gelu",
    seq_len_interpolation_factor: float = None,
    moe_renorm_mode=None,
    share_embeddings_and_output_weights=False,
):
"""Constructor for the TRTLLMHelper
There are two public API's supported by this helper.
a) get_trtllm_pretrained_config_and_model_weights
b) build_and_save_engine
Args:
transformer_config (TransformerConfig): The transformer config
model_type (ModelType): The type of the input model. Enum (megatron.core.export.model_type.ModelType)
trtllm_conversion_dict (dict, optional): A conversion dictionary that will map your model layer names to trtllm equivalent layer names. Default dictionary is given megatron/core/export/model_to_trtllm_mapping. This dict is merged into the default dict. NOTE: Ignore layer numbers in the model layer names. (e.g) decoder.layers.0.attention_qkv.weight will be decoder.layers.attention_qkv.weight in the mapping dictionary. Defaults to {}.
position_embedding_type (str, optional): The position embedding type. Defaults to None.
max_position_embeddings (int, optional): Max posistion embeddings value. Defaults to None.
rotary_percentage (int, optional): The rotary percentage if using rope embedding. Defaults to 1.0.
rotary_base (int, optional): The rotary base (theta value) if using rope embeddings. Defaults to 10000.
moe_tp_mode (int, optional): TRTLLM Config. Defaults to 2.
multi_query_mode (bool, optional): Defaults to False.
activation (str, optional): Defaults to "gelu".
seq_len_interpolation_factor (float, optional): The sequence length interpolation factor if using rope embeddings. Defaults to None.
moe_renorm_mode (optional) : Renormalization mode if using mixture of experts. Defaults to None.
share_embeddings_and_output_weights (bool, optional): True if input and output layers share weights. Defaults to False.
"""Get the TRTLLM pretrained config and the model weights as a list.

There are two modes of conversion. The default is to use a single device (CPU/GPU) for the conversion.
NOTE: For faster performance, if your entire model fits in memory, pre-transfer the model state dict to a CUDA device and then call this function.
For on-device conversion this returns weights that will be used on the device itself.
The same applies to the pretrained config.

Args:
    model_state_dict (dict): The input model state dictionary (the entire model state loaded on CPU, or the model state dict of each GPU in the case of on-device conversion).
    export_config (ExportConfig): The export config used to define the inference tp size, pp size, etc. Used only for on-device conversion.
    dtype (DataType): The data type (precision) of the model.
    on_device_distributed_conversion (bool, optional): Convert on GPUs in a distributed setting. This assumes that the model state dict is sharded according to the required inference model parallelism and that each GPU gets its part of the model state dict. Defaults to False.
    vocab_size (int, optional): The vocabulary size. Defaults to None.
    gpus_per_node (int, optional): The number of GPUs per node. Used for on-device conversion.
    state_dict_split_by_layer_numbers (bool, optional): Whether the model layers are split by layer numbers in the state dict. For example, mlp.fc1.weight can be represented as mlp.fc1.weight of shape [num_layers, hidden_dim, ffn_hidden_dim], or as mlp.fc1.layers.0.weight of shape [hidden_dim, ffn_hidden_dim], then mlp.fc1.layers.1.weight, and so on for all layers. If you use representation 2, set this to True. Defaults to True.

Returns:
    Two lists: the TRTLLM converted model weights (either on device, or a list of weights for each GPU) and the trtllm_model_configs.
"""
assert model_state_dict is not None, "Model state dict is not set"
"""Get the TRTLLM Pretrained config and model weights list (one per gpu rank) on single device (CPU/GPU)
This function assumes the entire model state dict is present in CPU or on one GPU
Args:
export_config (ExportConfig): The export config to set inference tp, pp size etc.
model_state_dict (dict): The model state dictionary (All collected on cpu)
dtype (DataType): The data type or model precision
gpus_per_node (int, optional): Number of gpus per node
state_dict_split_by_layer_numbers (bool, optional): Are the model layers split by layer numbers in state dict. For example : mlp.fc1.weight can be represented like mlp.fc1.weight of shape [num_layers, hidden_dim, ffn_hidden_dim]} or it can be like mlp.fc1.layers.0.weight of shape [hidden_dim, ffn_hidden_dim], then mlp.fc1.layers.1.weight ... for all layers. If you use represenation 2 set this to True. Defaults to True
scales (dict): Dictionary with fp8 scaling factors
fp8_quantized (bool): True for fp8 checkpoint export
fp8_kvcache (bool): True for fp8 KV-cache quantization
Returns:
Two lists . List of trtllm converted model weights and trtllm model configs (One for each gpu).
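# Illustration of the two state dict layouts that
# `state_dict_split_by_layer_numbers` distinguishes (shapes are hypothetical):
#
#     # Representation 1 -- one stacked tensor across all layers:
#     {'decoder.layers.mlp.linear_fc1.weight': ...}   # shape [num_layers, hidden_dim, ffn_hidden_dim]
#
#     # Representation 2 -- one entry per layer number (set the flag to True):
#     {'decoder.layers.0.mlp.linear_fc1.weight': ...,  # shape [hidden_dim, ffn_hidden_dim]
#      'decoder.layers.1.mlp.linear_fc1.weight': ...}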
"""Helper function to rename model layer names to TRTLLM Layer names
We go through each layer (keys) in the model state dict,
and map it to the equivalent TRTLLMLayer name (megatron/core/export/trtllm/trtllm).
If we have a layer number associated with layer, we extract it out,
map the original layer name to equivalent trtllm layer name and add layer number back.
CPU Conversion will pass in model state dict without layer numbers
(i.e decoder.layers.mlp.linear_fc1.weight of shape [num_layers, hidden_dim, 4 * hidden_dim]) .
GPU conversion will pass model state dict with each layer seperated
(i.e decoder.layers.2.mlp.linear_fc1.weight of shape [hidden_dim, 4 * hidden_dim]).
Args:
model_state_dict (dict): The original model state dict
trtllm_conversion_dict (dict): The conversion dictionary mapping input model layer names to trtllm layer names
state_dict_split_by_layer_numbers (bool, optional): Are the model layers split by layer numbers in state dict. For example : mlp.fc1.weight can be represented like mlp.fc1.weight of shape [num_layers, hidden_dim, ffn_hidden_dim]} or it can be like mlp.fc1.layers.0.weight of shape [hidden_dim, ffn_hidden_dim], then mlp.fc1.layers.1.weight ... for all layers. If you use represenation 2 set this to True. Defaults to True
Raises:
ValueError: In case the keys dont match to trtllm keys or if all model layers are not mapped to equivalent trtllm keys
Returns:
dict: The model state dict with the key (i.e original model layer name) replaced by trtllm layer names
f'Unable to rename key {original_layer_name_without_number}. Provide an appropriate mapping in the trtllm_conversion_dict when you initialize TRTLLMHelper'
), f"{trtllm_layer} is not supported for conversion. Please use one of the TRTLLMLayerNames we provided in megatron/core/export/trtllm/trtllm_layer_names"
"""The TRTLLM Converter class used for GPU (on device) conversion
This class is used to convert models sharded and on gpus. (It assumes that the model is already sharded appropriate to how you want to export it). (i.e) If you want to export to tp2pp2, then load the model in tp2pp2 setting and pass in their respective state dictionaries
"""
def __init__(
    self,
    transformer_config: TransformerConfig,
    dtype: DataType,
    multi_query_mode: bool = False,
    activation: str = "gelu",
    scales: Optional[dict] = None,
):
"""Constructor for the TRTLLMModelWeightsConverterGPU class
This class is responsible to convert the model weights to TRTLLM equivalent weights.
Args:
transformer_config (TransformerConfig): The transformer config
dtype (DataType): The data type or model precision
multi_query_mode (bool, optional): Defaults to False.
activation (str, optional): Defaults to "gelu".
scales (dict, optional): Dictionary with fp8 scaling factors.
"""Convert transformer layers to TRTLLM weights.

Transformer layers refers to the layers within the transformer block. They have a layer number associated with them. Depending on the layer, we either save it directly to trtllm_model_weights, or split it across some dimension and save the splits.

Args:
    model_state_dict (dict): The input model state dictionary (all collected on CPU).
    layer (TRTLLMLayerNames): The TRTLLM layer that we want to convert.
"""
"""Convert Non Transformer layers to TRTLLM weights
Non transformer layers referes to layers that occur only once in the model (e.g Embedding , final output layer etc. ) They dont have any layer number associated with them. We remove this layer from the original state dict and cast it to storage type and convert to numpy and add it to trtllm_model_weights
Args:
model_state_dict (dict): The input model state dictionary (All collected on CPU)
layer (TRTLLMLayerNames): The TRTLLM Layer that we want to change
"""This method goes through each layer in the model state dict and converts it to the equivalent TRTLLM model weights. It also handles splitting across the TP dimension, expert splits, etc.

Args:
    model_state_dict (dict): The full model state dict (all on CPU).
    trtllm_conversion_dict (dict): The conversion dictionary used to convert model layer names to TRTLLM layer names.
    tokenizer_vocab_size (int): The vocab size of the tokenizer.
"""
# First step is to convert input model layer names to equivalent trtllm layer names
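# A hedged sketch of how this step and the subsequent dispatch might look
# (the function and method names below are assumptions based on the
# docstrings in this file, not the exact source):
#
#     model_state_dict = rename_model_to_trtllm_layers(model_state_dict, trtllm_conversion_dict)
#     for layer_name in list(model_state_dict.keys()):
#         if any(part.isdigit() for part in layer_name.split('.')):
#             # Transformer-block weights carry a layer number.
#             self._convert_transformer_layer(layer_name, model_state_dict)
#         else:
#             self._convert_non_transformer_layer(model_state_dict, layer_name)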
"""Convert Non Transformer layers to TRTLLM weights
Non transformer layers referes to layers that occur only once in the model (e.g Embedding , final output layer etc. ) They dont have any layer number associated with them. We remove this layer from the original state dict and cast it to storage type and convert to numpy and add it to trtllm_model_weights
Args:
model_state_dict (dict): The input model state dictionary (All collected on CPU)
layer_name (str): The TRTLLM Layer name that we want to convert
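# A minimal sketch of the step described above, assuming the converter holds
# `self.storage_type` and `self.trtllm_model_weights` (attribute names
# inferred from the docstrings, not verified):
#
#     def _convert_non_transformer_layer(self, model_state_dict, layer_name):
#         if layer_name in model_state_dict:
#             # Pop the weight, cast to the storage dtype, hand it over as numpy.
#             val = model_state_dict.pop(layer_name)
#             # NOTE: bf16 tensors may need a dedicated torch->numpy helper.
#             self.trtllm_model_weights[layer_name] = (
#                 val.to(self.storage_type).detach().cpu().numpy()
#             )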
"""Convert transformer layers to TRTLLM weights.

Transformer layers refers to the layers within the transformer block. They have a layer number associated with them. Depending on the layer, we either save it directly to trtllm_model_weights, or split it across some dimension and save the splits.

Args:
    model_state_dict (dict): The input model state dictionary (all collected on CPU).
    layer (TRTLLMLayerNames): The TRTLLM layer that we want to convert.
"""
"""This method goes through each layer in the model state dict and converts it to the equivalent TRTLLM model weights. It also handles splitting across the TP dimension, expert splits, etc.

Args:
    model_state_dict (dict): The full model state dict (all on CPU).
    trtllm_conversion_dict (dict): The conversion dictionary used to convert model layer names to TRTLLM layer names.
    state_dict_split_by_layer_numbers (bool, optional): Whether the model layers are split by layer numbers in the state dict. For example, mlp.fc1.weight can be represented as mlp.fc1.weight of shape [num_layers, hidden_dim, ffn_hidden_dim], or as mlp.fc1.layers.0.weight of shape [hidden_dim, ffn_hidden_dim], then mlp.fc1.layers.1.weight, and so on for all layers. If you use representation 2, set this to True. Defaults to True.
"""
# First step is to convert input model layer names to equivalent trtllm layer names
"""Given the TRTLLM mapping information (tp rank, pp rank, etc.), split the model weights into a list, with each element of the list corresponding to the weights of one GPU rank.

Args:
    mapping: The TRTLLM mapping information.
    trtllm_model_config (dict): The TRTLLM model config.
"""
def _split(torch_tensor, tp_size, idx, dim=0):
    """Splits the torch tensor along dim and returns the idx-th slice."""