# Mixtral 8x7B Model Inference and Finetuning
## Download Mixtral 8x7B Checkpoints
Download the Mixtral 8x7B checkpoint in HF format from the [HF hub](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/), or run the following script to download it into a specific folder.
```python
from huggingface_hub import snapshot_download
SAVED_DIR = "" # Specify the saved directory
# Download HF checkpoints
snapshot_download(repo_id="mistralai/Mixtral-8x7B-v0.1", ignore_patterns=["*.pt"], local_dir=SAVED_DIR, local_dir_use_symlinks=False)
```
## Convert Mixtral 8x7B checkpoints from HF to MCore
The HF checkpoints can be converted to Megatron format with the provided checkpoint converter for HF format.
The target model parallel sizes (e.g. TP, PP, EP) must be specified.
The converter does not support distributed checkpointing yet, so each parallel configuration requires its own converted checkpoint.
- For training, the recommended model parallel config is TP1EP8PP4
- For inference, the recommended model parallel config is TP1EP1PP2 (see the filled-in example after the script below)
```
TOKENIZER_MODEL=/workspace/checkpoints/mixtral-hf/tokenizer.model
MEGATRON_PATH="/workspace/megatron-lm"
export PYTHONPATH=$MEGATRON_PATH:$PYTHONPATH
export CUDA_DEVICE_MAX_CONNECTIONS=1
TARGET_TP_SIZE=""
TARGET_EP_SIZE=""
TARGET_PP_SIZE=""
HF_FORMAT_DIR=/workspace/checkpoints/mixtral-hf
MEGATRON_FORMAT_DIR=/workspace/checkpoints/mixtral-mcore-TP${TARGET_TP_SIZE}PP${TARGET_PP_SIZE}EP${TARGET_EP_SIZE}
python tools/checkpoint/convert.py \
--model-type GPT \
--loader loader_mixtral_hf \
--saver mcore \
--target-tensor-parallel-size ${TARGET_TP_SIZE} \
--target-pipeline-parallel-size ${TARGET_PP_SIZE} \
--target-expert-parallel-size ${TARGET_EP_SIZE} \
--load-dir ${HF_FORMAT_DIR} \
--save-dir ${MEGATRON_FORMAT_DIR} \
--tokenizer-model ${TOKENIZER_MODEL}
```
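For example, to produce the recommended inference checkpoint (TP1EP1PP2), the placeholders above could be filled in as follows; the paths are illustrative and should match your own setup.
```bash
# Example values for the recommended inference config (TP1 EP1 PP2); adjust paths to your setup.
TARGET_TP_SIZE=1
TARGET_EP_SIZE=1
TARGET_PP_SIZE=2
HF_FORMAT_DIR=/workspace/checkpoints/mixtral-hf
MEGATRON_FORMAT_DIR=/workspace/checkpoints/mixtral-mcore-TP${TARGET_TP_SIZE}PP${TARGET_PP_SIZE}EP${TARGET_EP_SIZE}
```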
## Text generation with Mixtral 8x7B
Inference with Mixtral 8x7B requires at least 2 GPUs, so you need a checkpoint converted with the above script using EP>=2 or PP>=2.
Megatron-LM includes a simple REST server for text generation in `tools/run_text_generation_server.py`; launch it with the following script:
```
#!/bin/bash
# This example will start serving the Mixtral 8x7B model.
DISTRIBUTED_ARGS="--nproc_per_node 2 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
CHECKPOINT=<Path to checkpoint>
TOKENIZER_MODEL=<Path to tokenizer (e.g. /tokenizer.model)>
export CUDA_DEVICE_MAX_CONNECTIONS=1
pip install flask-restful
torchrun $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 2 \
--expert-model-parallel-size 1 \
--load ${CHECKPOINT} \
--tokenizer-type Llama2Tokenizer \
--tokenizer-model $TOKENIZER_MODEL \
--use-mcore-models \
--max-position-embeddings 32768 \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--normalization RMSNorm \
--disable-bias-linear \
--position-embedding-type rope \
--no-position-embedding \
--swiglu \
--untie-embeddings-and-output-weights \
--group-query-attention \
--num-query-groups 8 \
--bf16 \
--micro-batch-size 1 \
--seq-length 1024 \
--seed 42 \
--num-experts 8 \
--moe-router-topk 2 \
--moe-token-dispatcher-type alltoall \
--moe-grouped-gemm \
--mock-data \
--rotary-base 1000000
```
Once the server is running, you can query it with `tools/text_generation_cli.py`, which takes a single argument: the host and port the server is running on.
```
python tools/text_generation_cli.py localhost:5000
```
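You can also query the REST endpoint directly. The sketch below assumes the generic Megatron-LM text generation API (a PUT to `/api` with `prompts` and `tokens_to_generate` fields) and that the server launched above is listening on `localhost:5000`; adjust if your version differs.
```bash
# Minimal sketch: send a prompt straight to the text generation server.
curl 'http://localhost:5000/api' \
  -X 'PUT' \
  -H 'Content-Type: application/json; charset=UTF-8' \
  -d '{"prompts": ["Mixtral is a mixture-of-experts model that"], "tokens_to_generate": 64}'
```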
## Finetuning from pretrained Mixtral 8x7B
To finetune the pretrained Mixtral 8x7B model, use the following script:
```bash
PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:24.04-py3
CHECKPOINT_PATH="" # Speicfy path to checkpoint dir
TOKENIZER_MODEL="" # Specify path to tokenizer.model
DATA_PATH="" # Specify path to data
docker run \
--gpus=all \
--ipc=host \
--workdir /workspace/megatron-lm \
-v /path/to/data:/path/to/data \
-v /path/to/megatron-lm:/workspace/megatron-lm \
$PYTORCH_IMAGE \
bash examples/mixtral/train_mixtral_8x7b_distributed.sh $CHECKPOINT_PATH $TOKENIZER_MODEL $DATA_PATH
```
The above also applies to Mixtral 8x22B; set the model config (including hidden_size, num_attention_heads, num_layers, and ffn_hidden_size) according to the original [config](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1/blob/main/config.json), as sketched below.
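As a rough guide, the model arguments in the training script could be adjusted as follows. The values are taken from the linked config.json at the time of writing and should be verified against it before use.
```bash
# Sketch of Mixtral 8x22B model arguments; verify each value against the linked config.json.
MODEL_ARGS_8x22B=(
    --num-layers 56           # num_hidden_layers
    --hidden-size 6144        # hidden_size
    --ffn-hidden-size 16384   # intermediate_size
    --num-attention-heads 48  # num_attention_heads
    --num-query-groups 8      # num_key_value_heads
)
```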
## Acknowledgements
Contributors outside NVIDIA to the Hugging Face converter and the Mixtral example in Megatron-Core:
- Peng Li <jerry.lp@alibaba-inc.com>
- Jun Huang <huangjun.hj@alibaba-inc.com>
#!/bin/bash
# Runs Mixtral 8x7B model
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=${MASTER_ADDR:-"localhost"}
MASTER_PORT=${MASTER_PORT:-"6000"}
NNODES=${SLURM_NNODES:-"1"}
NODE_RANK=${RANK:-"0"}
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CHECKPOINT_PATH=$1
TOKENIZER_MODEL=$2
DATA_PATH=$3
DISTRIBUTED_ARGS=(
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
)
MODEL_ARGS=(
--use-mcore-models
--disable-bias-linear
--seq-length 4096
--max-position-embeddings 32768
--num-layers 32
--hidden-size 4096
--ffn-hidden-size 14336
--num-attention-heads 32
--init-method-std 0.01
--attention-dropout 0.0
--hidden-dropout 0.0
--normalization RMSNorm
--position-embedding-type rope
--swiglu
--untie-embeddings-and-output-weights
--group-query-attention
--num-query-groups 8
--no-masked-softmax-fusion
--no-position-embedding
--rotary-base 1000000
)
MOE_ARGS=(
--num-experts 8
--moe-router-topk 2
--moe-router-load-balancing-type aux_loss
--moe-aux-loss-coeff 1e-2
--moe-grouped-gemm
--moe-token-dispatcher-type alltoall
--overlap-param-gather
--overlap-grad-reduce
)
DATA_ARGS=(
--tokenizer-type Llama2Tokenizer
--tokenizer-model ${TOKENIZER_MODEL}
--data-path $DATA_PATH
--split 99990,8,2
)
TRAINING_ARGS=(
--micro-batch-size 1
--global-batch-size 256
--lr 1e-4
--train-iters 500000
--lr-decay-iters 320000
--lr-decay-style cosine
--min-lr 1.0e-5
--weight-decay 0.1
--lr-warmup-iters 500
--clip-grad 1.0
--bf16
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 4
--expert-model-parallel-size 8
--use-distributed-optimizer
--sequence-parallel
)
LOGGING_ARGS=(
--log-interval 1 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--tensorboard-dir "${CHECKPOINT_PATH}/tensorboard" \
--no-load-optim \
--no-load-rng
)
if [ -n "${WANDB_API_KEY}" ]; then
LOGGING_ARGS+=(
--wandb-project ${WANDB_PROJECT:-"Mixtral"}
--wandb-exp-name ${WANDB_NAME:-"Mixtral_8x7B"}
)
fi
torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \
${MODEL_ARGS[@]} \
${MOE_ARGS[@]} \
${DATA_ARGS[@]} \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${LOGGING_ARGS[@]}
FROM nvcr.io/nvidia/pytorch:24.02-py3
RUN apt update && \
apt -y upgrade && \
apt install -y --no-install-recommends \
software-properties-common \
build-essential \
python3-pip \
python3-dev \
bash \
git \
vim \
python-is-python3 \
default-jre
RUN pip install --upgrade pip
RUN pip install einops einops-exts sentencepiece braceexpand webdataset
RUN pip install transformers datasets
RUN pip install pytest-cov pytest_mock nltk wrapt
RUN pip install zarr "tensorstore==0.1.45"
RUN pip install git+https://github.com/fanshiqing/grouped_gemm@main
RUN pip install black isort click==8.0.2
RUN pip install pycocoevalcap megatron-energon
RUN pip install git+https://github.com/openai/CLIP.git
# Use --no-deps for the following to avoid outdated and unnecessary dependencies.
RUN pip install open-flamingo[eval] --no-deps
# Multimodal Example
*NOTE: This example is under active development and is expected to change.*
The following walks through all the steps required to pretrain and instruction tune a llava architecture vision-language model (VLM). It is important to precisely follow all steps to obtain the benchmark scores at the end.
This example has been tested on an A100 based DGX cluster. Pretraining and instruction tuning took approximately 1 day and 11 hours respectively on 64 GPUs using four way tensor parallelism (tp=4). Training speed will scale approximately linearly with number of GPUs available.
Multimodal support in Megatron is still under active development. This example is not intended to produce state-of-the-art model quality (that would require more data and model refinements); it is merely intended to demonstrate the multimodal functionality in Megatron. If you hit any problems, please open a GitHub issue.
## Setup
### Docker container
You can build a docker container using `examples/multimodal/Dockerfile` to run this example.
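For instance, from the repository root you might build and tag the image like this; the tag name is an arbitrary choice.
```bash
# Build the multimodal example image from the repo root; the tag is arbitrary.
docker build -t megatron-multimodal -f examples/multimodal/Dockerfile .
```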
### Language model
Follow the instructions in `megatron-lm/docs/llama_mistral.md` to download weights for Mistral-7B-Instruct-v0.3 and convert them to mcore format with tensor parallel size 4.
### Vision model
This example uses the OpenAI CLIP `ViT-L/14@336px` Vision model. To download the weights from OpenAI and convert them to a format that can be loaded in megatron, please run the following:
```
python examples/multimodal/clip_converter.py --download-root /some/download/folder --output /some/output/folder --tensor-parallel-size 4 --use-te
```
### Combined model checkpoint
Update the paths to point to the mcore converted CLIP and Mistral models and run the following script to combine the Mistral and CLIP models into a single multimodal checkpoint folder:
```
examples/multimodal/combine_mistral_clip.sh /path/to/mistral/model /path/to/clip/model /output/dir
```
## Training
### Pretraining
1. Download the LLaVA-Pretrain dataset from Hugging Face and unzip the images folder (NOTE: 79 GB of disk space required):
```
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain
cd LLaVA-Pretrain
unzip images.zip
```
2. Run the following script to convert the data to webdataset format:
```
cd <megatron-lm dir>
python examples/multimodal/convert_llava_pretrain_to_wds.py
```
3. Run the following command to convert to megatron-energon format:
```
cd <LLaVA-Pretrain dir>/wds
energon ./
```
Select the following values for the presented options:
```
> Please enter a desired train/val/test split like "0.5, 0.2, 0.3" or "8,1,1": 9,1,0
> Do you want to create a dataset.yaml interactively? [Y/n]: Y
> Please enter a number to choose a class: 10 (VQAWebdataset)
> Do you want to set a simple field_map[Y] (or write your own sample_loader [n])? [Y/n]: Y
> Please enter a webdataset field name for 'image' (<class 'torch.Tensor'>): jpg
> Please enter a webdataset field name for 'context' (<class 'str'>): json[0][value]
> Please enter a webdataset field name for 'answers' (typing.Optional[typing.List[str]], default: None): json[1][value]
> Please enter a webdataset field name for 'answer_weights' (typing.Optional[torch.Tensor], default: None):
```
4. Update `pretrain_dataset.yaml` so that both `path` variables point to `LLaVA-Pretrain/wds`.
5. Run the following script to pretrain a llava model for image captioning:
```
cd <megatron-lm dir>
examples/multimodal/pretrain_mistral_clip.sh
```
All being well, you should observe training and validation loss curves similar to the following:
<img src="assets/pretrain_curves.png" alt="Pretraining loss curves" width="600"/>
These curves were obtained with a global batch size of 256. Changing this value will likely change the curves. For pretraining and instruction tuning llava models, we have found that loss curves are an unreliable predictor of downstream task performance. Therefore, it is necessary to run test generation and evaluation on a range of metrics to understand model quality. We intend to add training-time zero-shot evaluation in a future update.
You can execute the pretraining script multiple times to resume training. On resuming, the latest model, optimizer, and dataloader state are loaded.
### SFT
1. Prepare an instruction tuning dataset in [megatron-energon format](https://nvidia.github.io/Megatron-Energon/data_prep.html#). NOTE: we do not provide instructions for this.
2. Update `sft_dataset.yaml` so that both `path` variables point to the train and val splits of your instruction tuning dataset.
3. Run the following script to instruction tune the pretrained llava model:
```
examples/multimodal/sft_mistral_clip.sh
```
You can execute the SFT script multiple times to resume training. On resuming, the latest model, optimizer, and dataloader state are loaded.
## Evaluation
### Generation
Run the following script:
```
examples/multimodal/text_generation_mistral_clip.sh --input-image-path /path/to/input/images --output-path /some/output/directory \
--model-path /path/to/model.pt --tokenizer-path /path/to/tokenizer.model --gt-path /path/to/groundtruth/file --task generation-task-name
```
where `--task generation-task-name` is the name of the evaluation benchmark such as `captioning` or `MMMU`.
### After pretraining
#### COCO captioning
1. Download the COCO 2014 test image set:
```wget http://images.cocodataset.org/zips/test2014.zip```
2. Download the COCO test image annotations from:
```https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_test.json```
3. Run text generation using `--task captioning`.
4. Run the following command:
```
python examples/multimodal/evaluate_coco.py --input-path /output/directory/from/generation --groundtruth-path /path/to/groundtruth/file
```
For the mistral-7b-instruct plus clip llava model, you should obtain a COCO CIDEr score of approximately 94.
### After SFT
#### MMMU
The official MMMU repository is not currently pip-installable, so clone their code into `examples/multimodal` by running `git clone https://github.com/MMMU-Benchmark/MMMU.git`.
The MMMU dataset is loaded from HuggingFace automatically as part of the code.
Run text generation using `--task MMMU`. Then, run the following command:
```
python examples/multimodal/evaluate_mmmu.py --input-path /output/directory/from/generation
```
For the mistral-7b-instruct plus clip instruction tuned llava model, you should obtain an MMMU score of approximately 38.
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
import argparse
import os
import torch
import clip
def convert(download_root, output_path, tensor_parallel_size, use_te):
device = "cuda"
model, _ = clip.load("ViT-L/14@336px", device=device, download_root=download_root)
state_dict = model.state_dict()
new_state_dicts = [{"model": dict()} for _ in range(tensor_parallel_size)]
# Indices from mapping pytorch multihead attention to megatron.
kv_channels = 64
hidden_dim = 1024
num_heads = 16
indices = []
for i in range(num_heads):
lb = i * kv_channels
ub = (i + 1) * kv_channels
indices.append(torch.arange(lb, ub, dtype=torch.int))
indices.append(torch.arange(hidden_dim + lb, hidden_dim + ub, dtype=torch.int))
indices.append(torch.arange(2 * hidden_dim + lb, 2 * hidden_dim + ub, dtype=torch.int))
indices = torch.cat(indices)
for name, tensor in state_dict.items():
# Skip text model.
if "visual" not in name:
continue
# Skip final layers not used in our model.
if name == "visual.proj" or "ln_post" in name:
continue
# Map parameter names to ones used in megatron.
new_name = ""
new_tensor = tensor
if new_tensor.dtype == torch.float16:
new_tensor = new_tensor.to(torch.float32)
# This is used for chunking some tensors to target tensor parallel size.
chunk_dim = None
if "class_embedding" in name:
new_name = "class_token"
# Our model uses class token that is expanded to input dimensions already.
new_tensor = new_tensor.expand(1, 1, -1)
elif "positional_embedding" in name:
new_name = "position_embeddings.weight"
elif "conv1" in name:
new_name = "conv1.weight"
elif "ln_pre.weight" in name:
new_name = "ln_pre.weight"
elif "ln_pre.bias" in name:
new_name = "ln_pre.bias"
elif "transformer.resblocks" in name:
layer_idx = name.split(".")[3]
base = f"decoder.layers.{layer_idx}"
if "attn.in_proj_weight" in name:
new_name = f"{base}.self_attention.linear_qkv.weight"
new_tensor = new_tensor[indices]
chunk_dim = 0
elif "attn.in_proj_bias" in name:
new_name = f"{base}.self_attention.linear_qkv.bias"
new_tensor = new_tensor[indices]
chunk_dim = 0
elif "attn.out_proj.weight" in name:
new_name = f"{base}.self_attention.linear_proj.weight"
chunk_dim = 1
elif "attn.out_proj.bias" in name:
new_name = f"{base}.self_attention.linear_proj.bias"
elif "ln_1.weight" in name:
new_name = f"{base}.input_layernorm.weight"
if use_te:
new_name = f"{base}.self_attention.linear_qkv.layer_norm_weight"
elif "ln_1.bias" in name:
new_name = f"{base}.input_layernorm.bias"
if use_te:
new_name = f"{base}.self_attention.linear_qkv.layer_norm_bias"
elif "mlp.c_fc.weight" in name:
new_name = f"{base}.mlp.linear_fc1.weight"
chunk_dim = 0
elif "mlp.c_fc.bias" in name:
new_name = f"{base}.mlp.linear_fc1.bias"
chunk_dim = 0
elif "mlp.c_proj.weight" in name:
new_name = f"{base}.mlp.linear_fc2.weight"
chunk_dim = 1
elif "mlp.c_proj.bias" in name:
new_name = f"{base}.mlp.linear_fc2.bias"
elif "ln_2.weight" in name:
new_name = f"{base}.pre_mlp_layernorm.weight"
if use_te:
new_name = f"{base}.mlp.linear_fc1.layer_norm_weight"
elif "ln_2.bias" in name:
new_name = f"{base}.pre_mlp_layernorm.bias"
if use_te:
new_name = f"{base}.mlp.linear_fc1.layer_norm_bias"
assert new_name != "", f"unexpected layer name {name}"
if chunk_dim is None:
new_tensors = [new_tensor for _ in range(tensor_parallel_size)]
else:
new_tensors = torch.chunk(new_tensor, tensor_parallel_size, dim=chunk_dim)
for i in range(tensor_parallel_size):
# chunk() creates a view of a bigger tensor. clone() is used here to avoid excessive storage.
new_state_dicts[i]["model"][new_name] = new_tensors[i].clone()
# TE sets _extra_state (for FP8 purposes), so set an empty one here for compatibility.
extra_state_layers = ("linear_qkv", "linear_proj", "linear_fc1", "linear_fc2")
is_extra_state_layer = any([l in new_name for l in extra_state_layers])
if use_te and is_extra_state_layer:
layer = new_name.split(".")[-2]
if layer in extra_state_layers:
extra_state_name = (
new_name[: new_name.rfind(".") + 1] + "_extra_state"
) # Replace the weight name.
new_state_dicts[i]["model"][extra_state_name] = None
for i in range(tensor_parallel_size):
output_dir_tp = os.path.join(output_path, "iter_0000001", f"mp_rank_0{i}")
os.makedirs(output_dir_tp)
output_path_tp = os.path.join(output_dir_tp, "model_optim_rng.pt")
torch.save(new_state_dicts[i], output_path_tp)
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="""
Convert OpenAI CLIP VIT weights to megatron format.
Example usage:
python clip_converter.py --download-root /some/download/folder --output /some/output/folder --tensor-parallel-size 4
""",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument(
"--download-root", type=str, required=True, help="Download folder for OpenAI CLIP weights"
)
parser.add_argument(
"--output", type=str, required=True, help="output directory for megatron state dict file(s)"
)
parser.add_argument(
"--tensor-parallel-size", type=int, default=1, help="model tensor parallel size"
)
parser.add_argument("--use-te", action="store_true", help="Use Transformer Engine")
args = parser.parse_args()
convert(args.download_root, args.output, args.tensor_parallel_size, args.use_te)
print("done.")
#!/bin/bash
MCORE_MISTRAL=$1 # <path_to_mcore_mistral_model_folder>
MCORE_CLIP=$2 # <path_to_mcore_clip_model_folder>
OUTPUT_DIR=$3 # <path_to_output_folder_for_combined_checkpoint>
python examples/multimodal/combine_state_dicts.py \
--input \
${MCORE_MISTRAL}/iter_0000001/mp_rank_00/model_optim_rng.pt \
${MCORE_CLIP}/iter_0000001/mp_rank_00/model_optim_rng.pt \
${MCORE_MISTRAL}/iter_0000001/mp_rank_01/model_optim_rng.pt \
${MCORE_CLIP}/iter_0000001/mp_rank_01/model_optim_rng.pt \
${MCORE_MISTRAL}/iter_0000001/mp_rank_02/model_optim_rng.pt \
${MCORE_CLIP}/iter_0000001/mp_rank_02/model_optim_rng.pt \
${MCORE_MISTRAL}/iter_0000001/mp_rank_03/model_optim_rng.pt \
${MCORE_CLIP}/iter_0000001/mp_rank_03/model_optim_rng.pt \
--prefixes language_model vision_model language_model vision_model language_model vision_model language_model vision_model \
--output \
${OUTPUT_DIR}/mistral_instruct_clip336_tp4_combined_mcore/iter_0000001/mp_rank_00/model_optim_rng.pt \
${OUTPUT_DIR}/mistral_instruct_clip336_tp4_combined_mcore/iter_0000001/mp_rank_01/model_optim_rng.pt \
${OUTPUT_DIR}/mistral_instruct_clip336_tp4_combined_mcore/iter_0000001/mp_rank_02/model_optim_rng.pt \
${OUTPUT_DIR}/mistral_instruct_clip336_tp4_combined_mcore/iter_0000001/mp_rank_03/model_optim_rng.pt
echo 1 > ${OUTPUT_DIR}/mistral_instruct_clip336_tp4_combined_mcore/latest_checkpointed_iteration.txt
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
import argparse
import os
import sys
import torch
# Add megatron to the path.
sys.path.append(
os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir, os.path.pardir))
)
def combine(input_files, module_prefixes, output_files):
num_inputs_per_output = int(len(input_files) / len(output_files))
for output_idx, output_file in enumerate(output_files):
combined_state_dict = None
lb = output_idx * num_inputs_per_output
ub = (output_idx + 1) * num_inputs_per_output
current_input_files = input_files[lb:ub]
current_module_prefixes = module_prefixes[lb:ub]
for i, (input_file, module_prefix) in enumerate(
zip(current_input_files, current_module_prefixes)
):
# initialize the combined state dict using the first provided input file
current_state_dict = torch.load(input_file)
if i == 0:
combined_state_dict = current_state_dict.copy()
combined_state_dict["model"] = dict()
# copy model state dict and prefix names with the given module keys.
for k, v in current_state_dict["model"].items():
combined_state_dict["model"]["%s.%s" % (module_prefix, k)] = v
output_dir = os.path.dirname(output_file)
if not os.path.exists(output_dir):
os.makedirs(output_dir, exist_ok=True)
torch.save(combined_state_dict, output_file)
print("saved:", output_file)
print("done.")
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="""
Combine multiple state dicts into a single state dict.
The combined state dict is first initialized by taking a copy of the first provided input state dict.
To avoid conflicts in model parameter names, a prefix must be provided for each input file.
Model parameter names will be renamed from <original name> to <model prefix>.<original name>.
Example usage:
python combine_state_dicts.py --input language_model.pt vision_model.pt --prefixes language_model vision_model --output multimodal.pt
""",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument("--input", nargs="*", required=True, help="paths to input state dict files")
parser.add_argument(
"--prefixes",
nargs="*",
required=True,
help="prefixes to use with each input model's parameters",
)
parser.add_argument(
"--output", nargs="*", required=True, help="path(s) to output state dict file"
)
args = parser.parse_args()
assert len(args.input) > 1, "must provide more than 1 input model to combine"
assert len(args.input) == len(args.prefixes), "each input model must have a corresponding key"
assert (
len(args.input) % len(args.output) == 0
), "each output file must use the same number of input files"
combine(args.input, args.prefixes, args.output)
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
import torch
from megatron.training.activations import quick_gelu, squared_relu
def get_language_model_config(config):
if config.language_model_type == "2b":
config.add_bias_linear = False
config.bias_activation_fusion = False
config.gated_linear_unit = True
config.apply_query_key_layer_scaling = True
config.layernorm_zero_centered_gamma = True
config.bias_dropout_fusion = False
config.rotary_percent = 0.5
config.apply_rope_fusion = False
config.attention_softmax_in_fp32 = True
elif config.language_model_type == "8b":
config.add_bias_linear = False
config.bias_activation_fusion = False
config.gated_linear_unit = False
config.apply_query_key_layer_scaling = True
config.layernorm_zero_centered_gamma = True
config.bias_dropout_fusion = False
config.rotary_percent = 0.5
config.attention_dropout = 0.0
config.apply_rope_fusion = False
config.activation_func = squared_relu
config.ffn_hidden_size = 16384
config.masked_softmax_fusion = True
config.attention_softmax_in_fp32 = True
config.num_query_groups = 32
config.kv_channels = 128
config.rotary_interleaved = False
elif config.language_model_type == "llama3_8b":
config.activation_func = torch.nn.functional.silu
config.add_bias_linear = False
config.bias_activation_fusion = False
config.gated_linear_unit = True
config.apply_query_key_layer_scaling = False
config.layernorm_zero_centered_gamma = (
False # Zero centered gamma not supported for RMSNorm
)
config.bias_dropout_fusion = False
config.apply_rope_fusion = False
config.attention_softmax_in_fp32 = True
config.ffn_hidden_size = 14336
elif config.language_model_type == "mistral_7b":
config.activation_func = torch.nn.functional.silu
config.add_bias_linear = False
config.bias_activation_fusion = False
config.gated_linear_unit = True
config.apply_query_key_layer_scaling = False
config.layernorm_zero_centered_gamma = (
False # Zero centered gamma not supported for RMSNorm
)
config.bias_dropout_fusion = False
config.apply_rope_fusion = False
config.attention_softmax_in_fp32 = True
config.ffn_hidden_size = 14336
return config
def get_vision_model_config(config, apply_query_key_layer_scaling):
if config.vision_model_type == "clip":
config.num_layers = 24
config.num_attention_heads = 16
config.add_bias_linear = True
config.add_qkv_bias = True
config.hidden_size = 1024
config.hidden_dropout = 0.0
config.attention_dropout = 0.0
config.ffn_hidden_size = 4096
config.gated_linear_unit = False
config.activation_func = quick_gelu
config.kv_channels = 64
config.num_attention_heads = 16
config.num_query_groups = 16
config.layernorm_zero_centered_gamma = False
config.apply_query_key_layer_scaling = apply_query_key_layer_scaling
config.bias_activation_fusion = False
config.bias_dropout_fusion = False
config.attention_softmax_in_fp32 = True
config.normalization = 'LayerNorm'
config.apply_rope_fusion = False
return config
def get_vision_projection_config(config, hidden_size):
config.gated_linear_unit = False
config.bias_activation_fusion = False
config.add_bias_linear = False
config.hidden_size = hidden_size # Used as the vision projection output size, i.e., the input to the language model.
if config.language_model_type == "2b":
config.ffn_hidden_size = 5440
config.activation_func = torch.nn.functional.gelu
if config.language_model_type == "8b":
config.ffn_hidden_size = 16384
config.activation_func = squared_relu
elif config.language_model_type == "llama3_8b":
config.ffn_hidden_size = 14336
config.activation_func = torch.nn.functional.gelu
elif config.language_model_type == "mistral_7b":
config.ffn_hidden_size = 14336
config.activation_func = torch.nn.functional.gelu
return config
# From https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/conversation.py
import dataclasses
from enum import auto, Enum
from typing import List
class SeparatorStyle(Enum):
"""Different separator style."""
SINGLE = auto()
TWO = auto()
MPT = auto()
PLAIN = auto()
LLAMA_2 = auto()
@dataclasses.dataclass
class Conversation:
"""A class that keeps all conversation history."""
system: str
roles: List[str]
messages: List[List[str]]
offset: int
sep_style: SeparatorStyle = SeparatorStyle.SINGLE
sep: str = "###"
sep2: str = None
real_sep2: str = None
version: str = "Unknown"
skip_next: bool = False
def get_prompt(self):
messages = self.messages
if len(messages) > 0 and type(messages[0][1]) is tuple:
messages = self.messages.copy()
init_role, init_msg = messages[0].copy()
init_msg = init_msg[0].replace("<image>", "").strip()
if 'mmtag' in self.version:
messages[0] = (init_role, init_msg)
messages.insert(0, (self.roles[0], "<Image><image></Image>"))
messages.insert(1, (self.roles[1], "Received."))
else:
messages[0] = (init_role, "<image>\n" + init_msg)
if self.sep_style == SeparatorStyle.SINGLE:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + self.sep
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.TWO:
seps = [self.sep, self.sep2]
ret = self.system + seps[0]
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + seps[i % 2]
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.MPT:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + message + self.sep
else:
ret += role
elif self.sep_style == SeparatorStyle.LLAMA_2:
wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n"
wrap_inst = lambda msg: f"[INST] {msg} [/INST]"
ret = ""
for i, (role, message) in enumerate(messages):
if i == 0:
assert message, "first message should not be none"
assert role == self.roles[0], "first message should come from user"
if message:
if type(message) is tuple:
message, _, _ = message
if i == 0: message = wrap_sys(self.system) + message
if i % 2 == 0:
message = wrap_inst(message)
ret += self.sep + message
else:
ret += " " + message + " " + self.sep2
else:
ret += ""
ret = ret.lstrip(self.sep)
elif self.sep_style == SeparatorStyle.PLAIN:
seps = [self.sep, self.sep2]
ret = self.system
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += message + seps[i % 2]
else:
ret += ""
else:
raise ValueError(f"Invalid style: {self.sep_style}")
return ret
def append_message(self, role, message):
self.messages.append([role, message])
def get_images(self, return_pil=False):
images = []
for i, (role, msg) in enumerate(self.messages[self.offset:]):
if i % 2 == 0:
if type(msg) is tuple:
import base64
from io import BytesIO
from PIL import Image
msg, image, image_process_mode = msg
if image_process_mode == "Pad":
def expand2square(pil_img, background_color=(122, 116, 104)):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image)
elif image_process_mode in ["Default", "Crop"]:
pass
elif image_process_mode == "Resize":
image = image.resize((336, 336))
else:
raise ValueError(f"Invalid image_process_mode: {image_process_mode}")
max_hw, min_hw = max(image.size), min(image.size)
aspect_ratio = max_hw / min_hw
max_len, min_len = 800, 400
shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
longest_edge = int(shortest_edge * aspect_ratio)
W, H = image.size
if longest_edge != max(image.size):
if H > W:
H, W = longest_edge, shortest_edge
else:
H, W = shortest_edge, longest_edge
image = image.resize((W, H))
if return_pil:
images.append(image)
else:
buffered = BytesIO()
image.save(buffered, format="PNG")
img_b64_str = base64.b64encode(buffered.getvalue()).decode()
images.append(img_b64_str)
return images
def to_gradio_chatbot(self):
ret = []
for i, (role, msg) in enumerate(self.messages[self.offset:]):
if i % 2 == 0:
if type(msg) is tuple:
import base64
from io import BytesIO
msg, image, image_process_mode = msg
max_hw, min_hw = max(image.size), min(image.size)
aspect_ratio = max_hw / min_hw
max_len, min_len = 800, 400
shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
longest_edge = int(shortest_edge * aspect_ratio)
W, H = image.size
if H > W:
H, W = longest_edge, shortest_edge
else:
H, W = shortest_edge, longest_edge
image = image.resize((W, H))
buffered = BytesIO()
image.save(buffered, format="JPEG")
img_b64_str = base64.b64encode(buffered.getvalue()).decode()
img_str = f'<img src="data:image/png;base64,{img_b64_str}" alt="user upload image" />'
msg = img_str + msg.replace('<image>', '').strip()
ret.append([msg, None])
else:
ret.append([msg, None])
else:
ret[-1][-1] = msg
return ret
def copy(self):
return Conversation(
system=self.system,
roles=self.roles,
messages=[[x, y] for x, y in self.messages],
offset=self.offset,
sep_style=self.sep_style,
sep=self.sep,
sep2=self.sep2,
real_sep2=self.real_sep2,
version=self.version)
def dict(self):
if len(self.get_images()) > 0:
return {
"system": self.system,
"roles": self.roles,
"messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages],
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
"real_sep2": self.real_sep2
}
return {
"system": self.system,
"roles": self.roles,
"messages": self.messages,
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
"real_sep2": self.real_sep2
}
conv_mpt = Conversation(
system="""<|im_start|>system
A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=(),
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
### Used for llava-pretraining
conv_llava_plain = Conversation(
system="",
roles=("", ""),
messages=(
),
offset=0,
sep_style=SeparatorStyle.PLAIN,
sep="\n",
)
conv_llava_v0 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant"),
messages=(
),
offset=0,
sep_style=SeparatorStyle.SINGLE,
sep="###",
)
conv_llava_v0_mmtag = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
"The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("Human", "Assistant"),
messages=(
),
offset=0,
sep_style=SeparatorStyle.SINGLE,
sep="###",
version="v0_mmtag",
)
conv_llava_v1 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("USER", "ASSISTANT"),
version="v1",
messages=(),
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
)
conv_llava_v1_mmtag = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
"The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("USER", "ASSISTANT"),
messages=(),
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
version="v1_mmtag",
)
chatqa_sft = Conversation(
system="System: This is a chat between a user and an artificial intelligence assistant. "
"The assistant gives helpful, detailed, and polite answers to the user's questions.",
roles=("User", "Assistant"),
version="chatqa",
messages=(),
offset=0,
sep_style=SeparatorStyle.TWO,
sep="\n\n",
sep2="\n\n",
real_sep2="\n\n"
)
conv_chatml = Conversation(
system="""<|im_start|>system
Answer the questions.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=(),
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
mistral_instruct = Conversation(
system="",
roles=("user", "assistant"),
version="mpt",
messages=(),
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="",
sep2="</s>",
)
llama3_instruct = Conversation(
system="<|start_header_id|>system<|end_header_id|>\n\nAnswer the questions.",
roles=("<|start_header_id|>user<|end_header_id|>\n\n", "<|start_header_id|>assistant<|end_header_id|>\n\n"),
version="mpt",
messages=(),
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|eot_id|>",
)
conv_templates = {
"plain": conv_llava_plain,
"v0_plain": conv_llava_plain,
"llava_v0": conv_llava_v0,
"v0_mmtag": conv_llava_v0_mmtag,
"llava_v1": conv_llava_v1,
"v1_mmtag": conv_llava_v1_mmtag,
"mpt": conv_mpt,
}
import json
import os
import webdataset as wds
from tqdm import tqdm
llava_pretrain_dir = '<path_to_LLaVA-Pretrain>'
# Paths to the dataset files
json_file = os.path.join(llava_pretrain_dir, 'blip_laion_cc_sbu_558k.json')
output = os.path.join(llava_pretrain_dir, 'wds')
if not os.path.exists(output):
os.mkdir(output)
# Load data
with open(json_file, 'r') as f:
data = json.load(f)
with wds.ShardWriter(os.path.join(output, 'pretrain-%d.tar'), maxcount=10000) as shard_writer:
for entry in tqdm(data):
with open(os.path.join(llava_pretrain_dir, entry['image']), "rb") as img_file:
image_data = img_file.read()
sample = {
"__key__": entry['id'],
"jpg": image_data,
"json": json.dumps(entry['conversations']).encode("utf-8"),
}
shard_writer.write(sample)
print(f"Dataset successfully converted to wds")
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
import os
import torch
from dataset_helpers import TaskEncoder, print_error_handler
from megatron.core import mpu
from megatron.energon import (
LimitDataset,
RepeatDataset,
WorkerConfig,
get_loader,
get_savable_loader,
get_train_dataset,
get_val_datasets,
)
from megatron.core.num_microbatches_calculator import get_num_microbatches
from megatron.core.parallel_state import get_tensor_model_parallel_rank
from megatron.training import get_args, print_rank_0
from megatron.training.checkpointing import get_checkpoint_name
def datasets_provider(worker_config=None):
"""Create multimodal train, validation and test datasets."""
args = get_args()
dname = args.data_path[0] if type(args.data_path) is list else args.data_path
train_dataset = get_train_dataset(
dname,
batch_size=args.micro_batch_size,
task_encoder=TaskEncoder(),
worker_config=worker_config,
virtual_epoch_length=1000,
max_samples_per_sequence=100,
shuffle_buffer_size=100,
handler=print_error_handler,
image_decode="pil",
)
val_datasets = get_val_datasets(
dname,
batch_size=args.micro_batch_size,
# This is the total number over all workers
# limit=args.eval_iters * get_num_microbatches(),
task_encoder=TaskEncoder(),
worker_config=worker_config,
handler=print_error_handler,
image_decode="pil",
)
val_datasets_without_source_datasets = [
# Limit the dataset to eval_iters * num_microbatches
LimitDataset(
# Repeat the inner dataset in case it's too short
RepeatDataset(val_ds, worker_config=worker_config),
length=args.eval_iters * get_num_microbatches(),
worker_config=worker_config,
reset_after_epoch=True,
)
for val_ds, _src_ds in val_datasets
]
return train_dataset, val_datasets_without_source_datasets, None
def train_valid_test_dataloaders_provider(train_val_test_num_samples):
"""Build multimodal train, validation and test dataloaders."""
if get_tensor_model_parallel_rank() != 0:
return None, None, None
args = get_args()
worker_debug_path = None
worker_log_level = 0
rank = mpu.get_data_parallel_rank()
world_size = mpu.get_data_parallel_world_size()
data_parallel_group = mpu.get_data_parallel_group()
worker_config = WorkerConfig(
rank=rank,
world_size=world_size,
num_workers=args.num_workers,
data_parallel_group=data_parallel_group,
worker_debug_path=worker_debug_path,
worker_log_level=worker_log_level,
)
train_ds, valid_ds1, test_ds = datasets_provider(worker_config)
train_dataloader = get_savable_loader(train_ds, worker_config=worker_config)
if args.load is not None:
if getattr(args, "dataloader_save", None):
dp_rank = mpu.get_data_parallel_rank()
data_save_name = get_checkpoint_name(
args.dataloader_save,
args.iteration,
basename=f"train_dataloader_dprank{dp_rank:03d}.pt",
)
if os.path.exists(data_save_name):
try:
dataset_state_dict = torch.load(data_save_name, map_location="cpu")
train_dataloader.restore_state_rank(dataset_state_dict["dataloader_state_dict"])
print_rank_0(f"restored dataset state from {data_save_name}")
except Exception as e:
print_rank_0("loading dataloader checkpoint failed. Skipping. " + str(e))
valid_dataloader = [
EnergonDataloader(get_loader(valid_ds, worker_config=worker_config))
for valid_ds in valid_ds1
]
test_dataloader = None
return EnergonDataloader(train_dataloader), valid_dataloader, EnergonDataloader(test_dataloader)
class EnergonDataloader:
"""A wrapper to use Megatron Energon dataloader with the Megatron-LM training loop."""
def __init__(self, dataloader):
self._dataloader = dataloader
self._iter = iter(cyclic_iter(dataloader))
def __next__(self):
return self._iter.__next__()
def __iter__(self):
return self._iter.__iter__()
def save_state(self):
return self._dataloader.save_state_rank()
def cyclic_iter(iter):
while True:
for x in iter:
yield x
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
import dataclasses
import json
import random
import re
import sys
import traceback
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Union
from image_processing import get_visual_transform
import conversation as conversation_lib
import numpy as np
import torch
from PIL import Image, ImageDraw
from torchvision import transforms as T
from megatron.core.models.multimodal.llava_model import IGNORE_INDEX, IMAGE_TOKEN_INDEX
from megatron.energon import (
Batch,
CaptioningSample,
DefaultTaskEncoder,
OCRSample,
SimilarityInterleavedSample,
VQASample,
)
from megatron.energon.transforms import CustomTransform, MergeTransform
from megatron.training import get_args
from megatron.training.tokenizer import build_tokenizer
class RandomResize(CustomTransform):
"""Resizes the image by a random scale factor in the given interval, but at most max_size"""
def __init__(self, min_scale: float, max_scale: float, max_size: int):
self._min_scale = min_scale
self._max_scale = max_scale
self._max_size = max_size
def apply_transform(self, matrix: np.ndarray, dst_size: np.ndarray) -> Tuple[Any, Any, Any]:
scale = random.uniform(self._min_scale, self._max_scale)
new_size = tuple(int(x * scale) for x in dst_size)
if max(new_size) > self._max_size:
scale = self._max_size / max(new_size)
new_size = tuple(int(x * scale) for x in dst_size)
matrix = self.scale(scale, scale) @ matrix
dst_size = np.array(new_size, dtype=dst_size.dtype)
return matrix, dst_size, (self.__class__.__name__, scale)
class RandomResizeLongEdge(CustomTransform):
"""Resizes the image's longer edge to a random length between min_size and max_size pixels."""
def __init__(self, min_size: int, max_size: int):
self._min_size = min_size
self._max_size = max_size
def apply_transform(self, matrix: np.ndarray, dst_size: np.ndarray) -> Tuple[Any, Any, Any]:
new_long = random.randint(self._min_size, self._max_size)
if dst_size[0] > dst_size[1]: # h > w
new_w, new_h = int(new_long * dst_size[1] / dst_size[0]), new_long
else: # w > h
new_w, new_h = new_long, int(new_long * dst_size[0] / dst_size[1])
new_size = (new_h, new_w)
matrix = self.scale(new_w / dst_size[1], new_h / dst_size[0]) @ matrix
dst_size = np.array(new_size, dtype=dst_size.dtype)
return matrix, dst_size, (self.__class__.__name__, new_size)
class RandomPad(CustomTransform):
"""Pads the image to the given size, randomly choosing the position of the image within the new larger image.
If the image is already larger than the given size, it will not be padded in that direction(s)."""
def __init__(self, size: Tuple[int, int]):
self._new_size = size # h, w
def apply_transform(self, matrix: np.ndarray, dst_size: np.ndarray) -> Tuple[Any, Any, Any]:
h_pad = max(self._new_size[0] - dst_size[0], 0)
w_pad = max(self._new_size[1] - dst_size[1], 0)
if h_pad == 0 and w_pad == 0:
return matrix, dst_size, (self.__class__.__name__, None)
else:
# TODO: fix me
# top = random.randint(0, h_pad)
# left = random.randint(0, w_pad)
top = 0
left = 0
matrix = self.translate(left, top) @ matrix
dst_size = np.array(self._new_size, dtype=dst_size.dtype)
return matrix, dst_size, (self.__class__.__name__, (top, left))
def _get_ocr_document_visual_transform(IMG_H=1024, IMG_W=1024):
document_visual_transform = T.Compose(
[
MergeTransform(
[
# T.RandomResizedCrop(size=FINAL_SIZE, scale=(0.5, 1.0), ratio=(0.8, 1.2)),
RandomResizeLongEdge(960, 1008), # Note: 1008 comes from list(range(960, 1024, 16))[-1]
T.RandomRotation(5, interpolation=T.InterpolationMode.BILINEAR),
T.RandomPerspective(distortion_scale=0.1, p=0.1),
RandomPad((IMG_H, IMG_W)),
]
),
T.ColorJitter(brightness=(0.8, 1.2), contrast=(0.7, 1.0)),
T.RandomGrayscale(p=0.5),
T.RandomInvert(p=0.5),
T.RandomAdjustSharpness(sharpness_factor=0.0, p=0.5),
T.RandomAdjustSharpness(sharpness_factor=2.0, p=0.5),
# LogImage(),
# T.ToTensor(),
# T.Normalize(IMAGE_MEAN, IMAGE_STD),
]
)
return document_visual_transform
def _get_ocr_document_identity_transform(IMG_H=1024, IMG_W=1024):
long_edge = max(IMG_H, IMG_W)
document_identity_transform = T.Compose(
[
MergeTransform(
[
RandomResizeLongEdge(long_edge, long_edge),
RandomPad((long_edge, long_edge)),
]
)
]
)
return document_identity_transform
def _get_ocr_paragraph_visual_transform(IMG_H=1024, IMG_W=1024):
paragraph_visual_transform = T.Compose(
[
MergeTransform(
[
# T.RandomResizedCrop(size=FINAL_SIZE, scale=(0.5, 1.0), ratio=(0.8, 1.2)),
RandomResize(0.5, 2.0, min(IMG_H, IMG_W)), #FINAL_SIZE),
T.RandomRotation(1, interpolation=T.InterpolationMode.BILINEAR),
T.RandomPerspective(distortion_scale=0.1, p=0.1),
RandomPad((IMG_H, IMG_W)),
]
),
T.ColorJitter(brightness=(0.8, 1.2), contrast=(0.7, 1.0)),
T.RandomGrayscale(p=0.5),
T.RandomInvert(p=0.5),
# T.RandomAdjustSharpness(sharpness_factor=0.0, p=0.5),
# T.RandomAdjustSharpness(sharpness_factor=2.0, p=0.5),
# LogImage(),
# T.ToTensor(),
# T.Normalize(IMAGE_MEAN, IMAGE_STD),
]
)
return paragraph_visual_transform
# Type for intermediate batch, after batch()
@dataclass
class ImageTaskSample:
__key__: str
__subflavors__: Dict
# (c, h, w)
imgs: List[torch.Tensor]
num_tiles: List[int]
text: np.ndarray
prompt_len: np.int64
target: torch.Tensor = None
# Typing for the resulting batch data after encode_batch()
@dataclass
class ImageTaskBatch(Batch):
__keys__: List[str]
__subflavors__: List[Dict]
# (num_tiles, c, h, w)
imgs: torch.Tensor
num_tiles: List[int]
# (n, seq_len)
text: torch.Tensor
# (n, 1)
prompt_len: torch.Tensor
# (n, seq_len)
target: torch.Tensor
class IdentitySplitter(object):
def tokenize(self, *text):
return text
class Tokenizer:
def __init__(self):
args = get_args()
self.args = args
self.initializer()
def initializer(self):
# Use Encoder class as a container for global data
Tokenizer.tokenizer = build_tokenizer(self.args)
if hasattr(Tokenizer.tokenizer, 'eod'):
self.eod_token = Tokenizer.tokenizer.eod
elif hasattr(Tokenizer.tokenizer, 'eos_id'):
self.eod_token = Tokenizer.tokenizer.eos_id
else:
raise AttributeError('No eod token found in Tokenizer')
self.split_token = 313131
if (
hasattr(self.args, "split_sentences") and self.args.split_sentences
): # default false
if not nltk_available:
print("NLTK is not available to split sentences.")
exit()
library = "tokenizers/punkt/{}.pickle".format("english")
# print("loading: " + library)
splitter = nltk.load(library)
if self.args.keep_newlines:
# this prevents punkt from eating newlines after sentences
Tokenizer.splitter = nltk.tokenize.punkt.PunktSentenceTokenizer(
train_text=splitter._params, lang_vars=CustomLanguageVars()
)
else:
Tokenizer.splitter = splitter
else:
Tokenizer.splitter = IdentitySplitter()
def __call__(self, text: str, padded: bool = True): # -> torch.Tensor:
sentence = Tokenizer.splitter.tokenize(text)[0]
sentence = Tokenizer.tokenizer.tokenize(sentence)
return sentence
def pad(self, content, seq_len=1024):
out = np.pad(content, pad_width=(0,max(0,seq_len-len(content))), mode='constant', constant_values=self.eod_token)
return out
class TaskEncoder(DefaultTaskEncoder[OCRSample, OCRSample, ImageTaskBatch, dict]):
"""A simple task encoder for captioning."""
def __init__(
self
):
# Specify the batch_type for default batching (batching is performed here "manually" by
# overwriting the `batch` method)
super().__init__()
self.args = get_args()
self.tokenizer = Tokenizer()
self.manual_prompts = json.load(open(self.args.prompt_path))
self.seq_len = self.args.decoder_seq_length - self.args.seq_length
self.max_seq_len = self.seq_len
self.txt_to_token_dict = {}
self.img_h, self.img_w = self.args.img_h, self.args.img_w
self.ocr_document_visual_transform = _get_ocr_document_visual_transform(self.img_h, self.img_w)
self.ocr_document_identity_transform = _get_ocr_document_identity_transform(self.img_h, self.img_w)
self.ocr_paragraph_visual_transform = _get_ocr_paragraph_visual_transform(self.img_h, self.img_w)
def encode_sample(self, sample: Union[CaptioningSample, OCRSample, VQASample, SimilarityInterleavedSample]):
if isinstance(sample, OCRSample):
yield self.encode_ocr(sample)
elif isinstance(sample, CaptioningSample):
yield self.encode_captioning(sample)
elif isinstance(sample, VQASample):
is_llava_training = sample.__subflavors__['is_llava_training'] if 'is_llava_training' in sample.__subflavors__ else False
if "llava" in sample.__key__ or is_llava_training:
yield self.encode_llava_pretrain(sample)
else:
yield self.encode_vqa(sample)
elif isinstance(sample, SimilarityInterleavedSample):
if "llava" in sample.__key__:
yield self.encode_llava_sft(sample)
else:
raise NotImplementedError('Sample format not supported')
else:
raise NotImplementedError('Sample format not supported')
def encode_captioning(self, sample: CaptioningSample):
augment = sample.__subflavors__.get("augmentation")
conv_format = sample.__subflavors__['conv_format'] if 'conv_format' in sample.__subflavors__ else 'mistral'
imgs = get_visual_transform(
sample.image, self.img_h, self.img_w, self.args.use_tiling, self.args.max_num_tiles, self.args.use_thumbnail, augment,
)
num_tiles = [len(imgs)]
prompt_list = self.manual_prompts["CaptioningPretraining"]["llava"]
prompt_idx = np.random.randint(len(prompt_list))
cur_prompt = prompt_list[prompt_idx]
cur_prompt = "<image>\n" + cur_prompt + "\n"
caption = sample.caption.strip()
split_by_line_flag = sample.__subflavors__.get("SplitByLine")
if split_by_line_flag:
caption_list = caption.split('\n')
caption = np.random.choice(caption_list)
if conv_format == 'llama3_sft':
conv = conversation_lib.llama3_instruct.copy()
sep = conv.sep
elif conv_format == "mistral":
conv = conversation_lib.mistral_instruct.copy()
sep = conv.sep2
conversation = cur_prompt + caption + sep
input_ids = np.array(tokenizer_image_token(self.args, conversation, self.tokenizer, has_image=True))
target = input_ids.copy()
prompt_len = len(tokenizer_image_token(self.args, cur_prompt, self.tokenizer))
target[:prompt_len] = IGNORE_INDEX
input_ids = self.tokenizer.pad(input_ids, self.max_seq_len+1) # pad with EOD
target = self.tokenizer.pad(target, self.max_seq_len+1) #, pad_value=IGNORE_INDEX) # pad with ignore_index. this will be used to create loss_mask
return ImageTaskSample(
__key__=sample.__key__,
__subflavors__=sample.__subflavors__,
imgs=imgs,
num_tiles=num_tiles,
text=input_ids,
prompt_len=prompt_len,
target=target,
)
def encode_llava_pretrain(self, sample: VQASample):
augment = sample.__subflavors__['augmentation'] if 'augmentation' in sample.__subflavors__ else False
use_chat_format = sample.__subflavors__['use_chat_format'] if 'use_chat_format' in sample.__subflavors__ else False
conv_format = sample.__subflavors__['conv_format'] if 'conv_format' in sample.__subflavors__ else "mistral"
imgs = get_visual_transform(
sample.image, self.img_h, self.img_w, self.args.use_tiling, self.args.max_num_tiles, self.args.use_thumbnail, augment,
)
num_tiles = [len(imgs)]
assert "<image>" in sample.context
has_image = True
if use_chat_format:
prompt_idx = np.random.randint(len(self.manual_prompts["Captioning"]["raw"]))
prompt = self.manual_prompts["Captioning"]["raw"][prompt_idx]
sample.context = "User: <image>" + "\n" + prompt + " Assistant: "
conversation = sample.context + sample.answers + conversation_lib.mistral_instruct.sep
else:
# LLAVA training: override text-prompt with just IMAGE_TOKEN_INDEX
sample.context = "<image>" + "\n"
if conv_format == 'llama3_sft':
conversation = sample.context + sample.answers + conversation_lib.llama3_instruct.sep
elif conv_format == "mistral":
conversation = sample.context + sample.answers + conversation_lib.mistral_instruct.sep2
input_ids = np.array(tokenizer_image_token(self.args, conversation, self.tokenizer, has_image=has_image))
target = input_ids.copy()
prompt_len = len(tokenizer_image_token(self.args, sample.context, self.tokenizer, has_image=has_image))
target[:prompt_len] = IGNORE_INDEX
input_ids = self.tokenizer.pad(input_ids, self.max_seq_len+1) # pad with EOD
target = self.tokenizer.pad(target, self.max_seq_len+1) #, pad_value=IGNORE_INDEX) # pad with ignore_index. this will be used to create loss_mask
return ImageTaskSample(
__key__=sample.__key__,
__subflavors__=sample.__subflavors__,
imgs=imgs,
num_tiles=num_tiles,
text=input_ids,
prompt_len=prompt_len,
target=target,
)
# Based on https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/train/train.py#L500
def encode_llava_sft(self, sample: SimilarityInterleavedSample):
augment = sample.__subflavors__['augmentation'] if 'augmentation' in sample.__subflavors__ else False
use_chat_format = sample.__subflavors__['use_chat_format'] if 'use_chat_format' in sample.__subflavors__ else False
has_image = sample.__subflavors__['has_image'] if 'has_image' in sample.__subflavors__ else False
conv_format = sample.__subflavors__['conv_format'] if 'conv_format' in sample.__subflavors__ else "mistral"
if has_image:
imgs = get_visual_transform(
sample.images[0], self.img_h, self.img_w, self.args.use_tiling, self.args.max_num_tiles, self.args.use_thumbnail, augment,
)
num_tiles = [len(imgs)]
else:
imgs = num_tiles = []
sample.__key__ = "{}-{}".format("no-image", sample.__key__)
if conv_format == 'llama3_sft':
conv = conversation_lib.llama3_instruct.copy()
elif conv_format == "mistral":
conv = conversation_lib.mistral_instruct.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
if use_chat_format:
source = sample.texts
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], sentence
conv.append_message(role, sentence["value"])
conversation = conv.get_prompt()
### Tokenize conversations
input_ids = tokenizer_image_token(self.args, conversation, self.tokenizer, has_image)
input_ids = torch.LongTensor(input_ids)
target = input_ids.clone()
if conv.sep_style == conversation_lib.SeparatorStyle.MPT:
# Mask targets
sep = conv.sep + conv.roles[1]
total_len = int((target != self.tokenizer.eod_token).sum())
rounds = conversation.split(conv.sep)
re_rounds = [conv.sep.join(rounds[:3])] # system + user + gpt
for conv_idx in range(3, len(rounds), 2):
re_rounds.append(conv.sep.join(rounds[conv_idx:conv_idx+2])) # user + gpt
cur_len = 0
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(re_rounds):
if rou == "":
break
rou += conv.sep
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
round_len = len(tokenizer_image_token(self.args, rou, self.tokenizer, has_image))
instruction_len = len(tokenizer_image_token(self.args, parts[0], self.tokenizer, has_image))
if conv_format == 'llama3_sft' and i > 0:
round_len -= 1
instruction_len -= 1
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
elif conv.sep_style == conversation_lib.SeparatorStyle.TWO:
### Mask targets
sep = conv.sep + conv.roles[1] + ": "
total_len = int((target != self.tokenizer.eod_token).sum())
rounds = conversation.split(conv.sep2)
cur_len = 0
for i, rou in enumerate(rounds):
if rou == "":
break
rou += conv.sep2 # put back conv.sep2 since we will lose it while we conversation.split above with conv.sep2
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
round_len = len(tokenizer_image_token(self.args, rou, self.tokenizer, has_image))
instruction_len = len(tokenizer_image_token(self.args, parts[0], self.tokenizer, has_image)) - 2
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
elif conv.sep_style == conversation_lib.SeparatorStyle.LLAMA_2:
raise NotImplementedError("this tokenizer is not supported yet with this data type")
if cur_len < self.max_seq_len:
if cur_len != total_len:
target[:] = IGNORE_INDEX
raise Exception(
f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. Something is wrong, please fix!"
)
else:
return NotImplementedError
# pad to max_seq_len
input_ids = self.tokenizer.pad(input_ids, self.max_seq_len+1) # pad with EOD
target = self.tokenizer.pad(target, self.max_seq_len+1)
return ImageTaskSample(
__key__=sample.__key__,
__subflavors__=sample.__subflavors__,
imgs=imgs,
num_tiles=num_tiles,
text=input_ids,
prompt_len=instruction_len,
target=target,
)
def encode_vqa(self, sample: VQASample):
        augment = sample.__subflavors__.get('augmentation', False)
imgs = get_visual_transform(
sample.image, self.img_h, self.img_w, self.args.use_tiling, self.args.max_num_tiles, self.args.use_thumbnail, augment,
)
num_tiles = [len(imgs)]
if sample.context[-1:] != "\n":
sample.context = sample.context + "\n"
question_token = self.tokenizer(sample.context)
if isinstance(sample.answers, list):
answer_list = sample.answers
weight_list = np.array(sample.answer_weights).astype(np.float32)
weight_list = weight_list / np.sum(weight_list)
answer_idx = np.random.choice(weight_list.shape[0], 1, p=weight_list)[0]
answer = answer_list[answer_idx]
answer_token = self.tokenizer(answer)
else:
answer_token = self.tokenizer(sample.answers)
prompt_len = len(question_token)
seq_len = self.max_seq_len + 4
text_sample = np.concatenate([[IMAGE_TOKEN_INDEX], question_token, answer_token])
text_sample = self.tokenizer.pad(text_sample, seq_len)
target = text_sample.copy()
target[:max(0, prompt_len - 1)] = IGNORE_INDEX
return ImageTaskSample(
__key__=sample.__key__,
__subflavors__=sample.__subflavors__,
imgs=imgs,
num_tiles=num_tiles,
text=text_sample,
prompt_len=prompt_len,
target=target,
)
def encode_ocr(self, sample: OCRSample) -> ImageTaskSample:
if sample.__subflavors__["type"] == "document":
visual_transform = self.ocr_document_visual_transform
elif sample.__subflavors__["type"] == "paragraph":
visual_transform = self.ocr_paragraph_visual_transform
elif sample.__subflavors__["augmentation"] == False:
visual_transform = self.ocr_document_identity_transform
else:
raise ValueError(f"Unknown subflavor {sample.__subflavors__}")
if sample.words_boxes is not None and sample.words_boxes.shape[1] >= 5:
            # Boxes with confidence below 0.9 are blanked out in the image and their text is dropped.
filter_words_mask = sample.words_boxes[:, 4] < 0.9
filter_boxes = sample.words_boxes[filter_words_mask, :4]
for x, y, x2, y2 in filter_boxes:
if isinstance(sample.image, Image.Image):
draw = ImageDraw.Draw(sample.image)
draw.rectangle([int(x), int(y), (int(x2), int(y2))], fill=0)
else:
sample.image[:, int(y) : int(y2) + 1, int(x) : int(x2) + 1] = 0
text = " ".join(
text for skip, text in zip(filter_words_mask, sample.words_text) if not skip
)
else:
text = " ".join(sample.text.splitlines())
match = re.search(r'"text_sequence": "(.*?)"', text)
if match:
text = match.group(1)
img = visual_transform(sample.image)
img = (torch.Tensor(np.array(img)).permute(2, 0, 1) - self.pixel_mean) / self.pixel_std
img = torch.nn.functional.pad(img, (0, self.img_w - img.shape[2], 0, self.img_h - img.shape[1]))
# randomly select a prompt
prompt_idx = np.random.randint(len(self.manual_prompts["OCR"]["raw"]))
cur_prompt = self.manual_prompts["OCR"]["raw"][prompt_idx]
if cur_prompt not in self.txt_to_token_dict:
self.txt_to_token_dict[cur_prompt] = self.tokenizer(cur_prompt)
cur_prompt = self.txt_to_token_dict[cur_prompt]
text_sample = self.tokenizer(text)
prompt_len = len(cur_prompt)
seq_len = self.seq_len + 4
text_sample = np.concatenate([cur_prompt, text_sample])
text_sample = self.tokenizer.pad(text_sample, seq_len=seq_len)
text_sample = text_sample[:seq_len]
return ImageTaskSample(
__key__=sample.__key__,
__subflavors__=sample.__subflavors__,
imgs=[img],
num_tiles=[1],
text=text_sample,
prompt_len=prompt_len
)
def batch(self, samples: List[ImageTaskSample]) -> ImageTaskBatch:
# Stack images to [num_tiles, c, h, w]. If there are no images (text-only), then use a dummy image.
imgs = [img for s in samples for img in s.imgs]
if len(imgs) > 0:
imgs = torch.stack(imgs)
else:
imgs = torch.tensor([[0]], dtype=torch.float32)
# Put tile counts to a single tensor. If there are no images (text-only), then use a dummy tensor.
num_tiles = torch.tensor([n for s in samples for n in s.num_tiles], dtype=torch.int)
if len(num_tiles) == 0:
num_tiles = torch.tensor([[0]], dtype=torch.int)
batch = ImageTaskBatch(
__keys__=[s.__key__ for s in samples],
__subflavors__=[s.__subflavors__ for s in samples],
imgs=imgs,
num_tiles=num_tiles,
text=torch.from_numpy(np.stack([s.text for s in samples], axis=0).astype(np.int64)),
prompt_len=torch.from_numpy(np.array([s.prompt_len for s in samples], dtype=np.int64)),
target=torch.from_numpy(np.stack([s.target for s in samples], axis=0).astype(np.int64)),
)
return batch
def encode_batch(self, batch: ImageTaskBatch) -> dict:
raw = dataclasses.asdict(batch)
del raw["__subflavors__"]
return raw
def print_error_handler(exc: Exception, key: Optional[str]):
print(
f"The following exception occurred in the dataloader for sample {key} and is skipped",
file=sys.stderr,
)
traceback.print_exc()
# From https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/mm_utils.py#L185
def tokenizer_image_token(args, prompt, tokenizer, has_image=True, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
if not has_image:
input_ids = tokenizer(prompt)
else:
prompt_chunks = [tokenizer(chunk) for chunk in prompt.split('<image>')]
def insert_separator(X, sep):
return [ele for sublist in zip(X, [sep]*len(X)) for ele in sublist][:-1]
input_ids = []
offset = 0
if args.tokenizer_type in ['Llama2Tokenizer', 'Llama3Tokenizer'] and len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0:
offset = 1
input_ids.append(prompt_chunks[0][0])
for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
input_ids.extend(x[offset:])
if return_tensors is not None:
if return_tensors == 'pt':
return torch.tensor(input_ids, dtype=torch.long)
raise ValueError(f'Unsupported tensor type: {return_tensors}')
# # remove BOS token
# if args.tokenizer_type in ['Llama2Tokenizer', 'Llama3Tokenizer']:
# return input_ids[1:]
return input_ids
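# A minimal sketch of how tokenizer_image_token interleaves text and image tokens.
# The toy tokenizer and args below are hypothetical stand-ins; the real code passes the
# Megatron tokenizer and parsed training args. The prompt is split on "<image>", each
# chunk is tokenized separately, and image_token_index is inserted between the chunks.
if __name__ == "__main__":
    from types import SimpleNamespace

    def _toy_tokenizer(text):
        # Hypothetical tokenizer: one integer id per whitespace-separated word.
        return [len(w) for w in text.split()]

    _toy_args = SimpleNamespace(tokenizer_type="DummyTokenizer")  # any non-Llama type keeps offset == 0
    ids = tokenizer_image_token(
        _toy_args, "describe this <image> briefly", _toy_tokenizer, has_image=True
    )
    # ids == tokens("describe this ") + [IMAGE_TOKEN_INDEX] + tokens(" briefly")
    print(ids)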
import argparse
import glob
import json
from evaluate_vqav2 import compute_vqa_accuracy
def merge_input_files(input_path):
"""Merge input files to a format compatible with the evaluator."""
output_file_path = input_path + "-ChartQA-merged.json"
pattern = input_path + "-ChartQA-[0-9].*jsonl"
input_file_paths = glob.glob(pattern)
results = []
for input_file_path in input_file_paths:
with open(input_file_path, "r") as input_file:
for line in input_file:
res = json.loads(line)
res["question_id"] = res["sample_id"]
results.append(res)
with open(output_file_path, "w") as output_file:
json.dump(results, output_file)
return output_file_path
def chartqa_eval(input_path):
"""Run ChartQA evaluation."""
result_file_path = merge_input_files(input_path)
compute_vqa_accuracy(result_file_path, use_chartqa_metric=True)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--input-path', type=str, help="Path to input file(s)")
args = parser.parse_args()
chartqa_eval(args.input_path)
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
import argparse
import glob
import json
from pycocoevalcap.eval import COCOEvalCap
from pycocotools.coco import COCO
def convert_to_coco_format(input_path):
"""Convert input files to COCO compatible format."""
output_file_path = input_path + "-captioning-merged.json"
pattern = input_path + "-captioning-[0-9].*jsonl"
input_file_paths = glob.glob(pattern)
captions = []
for input_file_path in input_file_paths:
with open(input_file_path, "r") as input_file:
for line in input_file:
res = json.loads(line)
question_id = res['sample_id']
caption = res['caption'].rstrip('.').lower()
captions.append({"image_id": question_id, "caption": caption})
with open(output_file_path, "w") as output_file:
json.dump(captions, output_file, indent=4)
return output_file_path
def coco_captioning_eval(input_path, groundtruth_file):
"""Run COCO captioning evaluation."""
coco = COCO(groundtruth_file)
input_file = convert_to_coco_format(input_path)
coco_result = coco.loadRes(input_file)
coco_eval = COCOEvalCap(coco, coco_result)
# Evaluate on the input subset of images.
coco_eval.params["image_id"] = coco_result.getImgIds()
coco_eval.evaluate()
print("========== COCO captioning scores ==========")
for metric, score in coco_eval.eval.items():
print(f"{metric} {score * 100:.3f}")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--input-path", type=str, required=True, help="Path to input file(s)")
parser.add_argument(
"--groundtruth-path", type=str, required=True, help="Path to groundtruth file"
)
args = parser.parse_args()
coco_captioning_eval(args.input_path, args.groundtruth_path)
import argparse
import glob
import json
import subprocess
def convert_to_mmmu_format(input_path):
"""Convert input files to MMMU compatible format."""
output_file_path = input_path + "-MMMU-merged.json"
pattern = input_path + "-MMMU-[0-9].*jsonl"
input_file_paths = glob.glob(pattern)
output = dict()
for input_file_path in input_file_paths:
with open(input_file_path, "r") as input_file:
for line in input_file:
res = json.loads(line)
sample_id = res["sample_id"]
prediction = res["prediction"]
output[sample_id] = prediction
with open(output_file_path, "w") as output_file:
json.dump(output, output_file)
return output_file_path
def mmmu_eval(input_path, groundtruth_path):
"""Run MMMU evaluation."""
result_file = convert_to_mmmu_format(input_path)
# The MMMU repo has a script for running the actual evaluation but no API. So launching the script here.
output = subprocess.run(
[
"python",
"examples/multimodal/MMMU/eval/main_eval_only.py",
"--output_path",
result_file,
"--answer_path",
groundtruth_path,
],
capture_output=True,
text=True,
)
print(output.stdout)
def main():
"""Run MMMU evaluation."""
    # Use the validation groundtruth file from the MMMU repo by default. This assumes the MMMU GitHub repo has been cloned under examples/multimodal.
default_groundtruth_path = "examples/multimodal/MMMU/eval/answer_dict_val.json"
parser = argparse.ArgumentParser()
parser.add_argument("--input-path", type=str, required=True, help="Path to input file(s)")
parser.add_argument(
"--groundtruth-path",
type=str,
default=default_groundtruth_path,
help="Path to groundtruth file. Defaults to the validation file in the MMMU repo.",
)
args = parser.parse_args()
mmmu_eval(args.input_path, args.groundtruth_path)
if __name__ == "__main__":
main()
import argparse
import glob
import json
from evaluate_vqav2 import compute_vqa_accuracy
def merge_input_files(input_path):
"""Merge input files to a format compatible with the evaluator."""
output_file_path = input_path + "-TextVQA-merged.json"
pattern = input_path + "-TextVQA-[0-9].*jsonl"
input_file_paths = glob.glob(pattern)
results = []
for input_file_path in input_file_paths:
with open(input_file_path, "r") as input_file:
for line in input_file:
res = json.loads(line)
results.append(
{
"question_id": res["sample_id"],
"answer": res["answer"],
"gt_answer": res["gt_answer"],
}
)
with open(output_file_path, "w") as output_file:
json.dump(results, output_file)
return output_file_path
def textvqa_eval(input_path):
"""Run TextVQA evaluation."""
result_file_path = merge_input_files(input_path)
compute_vqa_accuracy(result_file_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--input-path', type=str, help="Path to input file(s)")
args = parser.parse_args()
textvqa_eval(args.input_path)
import argparse
import glob
import json
from open_flamingo.eval.vqa_metric import VQAEval
def merge_input_files(input_path):
"""Merge input files to a format compatible with the evaluator."""
output_file_path = input_path + "-VQAv2-merged.json"
pattern = input_path + "-VQAv2-[0-9].*jsonl"
input_file_paths = glob.glob(pattern)
results = []
for input_file_path in input_file_paths:
with open(input_file_path, "r") as input_file:
for line in input_file:
res = json.loads(line)
res["question_id"] = res["sample_id"]
results.append(res)
with open(output_file_path, "w") as output_file:
json.dump(results, output_file)
return output_file_path
def is_number(n: str):
try:
float(n)
return True
except ValueError:
return False
def compute_vqa_accuracy(result_file, use_chartqa_metric=False):
"""Compute VQA accuracy."""
    with open(result_file) as f:
        merged_results = json.load(f)
vqa = VQAEval(vqa=None, vqaRes=None)
all_acc = []
for res in merged_results:
pred = res["answer"]
pred = vqa.processPunctuation(pred)
pred = vqa.processDigitArticle(pred)
gt = res["gt_answer"]
gt = [vqa.processPunctuation(ans) for ans in gt]
gt = [vqa.processDigitArticle(ans) for ans in gt]
# ChartQA uses relaxed accuracy:
# "We consider an answer to be correct if it is within 5% of the gold answer.
# For non-numeric answers, we still need an exact match to consider an answer to be correct."
if use_chartqa_metric:
acc = 0.
assert len(gt) == 1, "expected exactly one groundtruth answer."
gt = gt[0]
if is_number(pred) and is_number(gt):
pred = float(pred)
gt = float(gt)
if pred >= (gt * 0.95) and pred <= (gt * 1.05):
acc = 1.0
elif pred == gt:
acc = 1.0
all_acc.append(acc)
else:
num_match = sum([pred == ans for ans in gt])
acc = min(1.0, num_match / 3.0)
all_acc.append(acc)
acc_avg = sum(all_acc) / len(all_acc) * 100
print(f"===== Accuracy {acc_avg:.2f}% =====")
def vqav2_eval(input_path):
"""Run VQAv2 evaluation."""
result_file = merge_input_files(input_path)
compute_vqa_accuracy(result_file)
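# A minimal, self-contained sketch of the partitioned-result format this script expects.
# The file prefix and field values below are made up; merge_input_files globs per-rank
# files such as "<input-path>-VQAv2-0.jsonl" whose lines carry a sample_id, the predicted
# answer, and a list of ground-truth answers. Running it still requires open_flamingo,
# like the script itself.
def _write_demo_results(prefix="/tmp/demo"):
    line = {"sample_id": 0, "answer": "two", "gt_answer": ["two", "2", "two"]}
    with open(f"{prefix}-VQAv2-0.jsonl", "w") as f:
        f.write(json.dumps(line) + "\n")
    return prefix

# Usage (hypothetical): vqav2_eval(_write_demo_results())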
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--input-path', type=str, help="Path to input file(s)")
args = parser.parse_args()
vqav2_eval(args.input_path)
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. Except portions as noted which are Copyright (c) 2023 OpenGVLab and licensed under the MIT license found in LICENSE.
import numpy as np
import torch
from PIL import Image, ImageDraw
from torchvision import transforms as T
from torchvision.transforms import Compose, RandAugment, RandomResizedCrop, Resize, ToPILImage
# ImageNet mean and std.
pixel_mean = [123.675, 116.28, 103.53]
pixel_std = [58.395, 57.12, 57.375]
# Reshape for broadcasting.
pixel_mean = torch.Tensor(pixel_mean).view(-1, 1, 1)
pixel_std = torch.Tensor(pixel_std).view(-1, 1, 1)
def convert_to_rgb(image):
return image.convert("RGB")
def _transform_train_aug(img_h, img_w):
return Compose([
ToPILImage(),
RandomResizedCrop((img_h, img_w), scale=(0.5, 1.0)),
convert_to_rgb,
RandAugment(2, 5, isPIL=True, augs=['Identity', 'AutoContrast', 'Brightness', 'Sharpness', 'Equalize',
'ShearX', 'ShearY', 'TranslateX', 'TranslateY', 'Rotate']),
])
def _transform_test(img_h, img_w):
return Compose([
ToPILImage(),
Resize((img_h, img_w)),
convert_to_rgb,
])
def standardize_image(img):
"""Standardize image pixel values."""
return (torch.Tensor(np.array(img)).permute(2, 0, 1) - pixel_mean) / pixel_std
def get_visual_transform(img, img_h, img_w, use_tiling=False, max_num_tiles=1, use_thumbnail=False, augment=False):
if use_tiling:
assert img_h == img_w, "dynamic tiling expects equal tile height and width"
imgs = dynamic_preprocess(img, min_num=1, max_num=max_num_tiles, image_size=img_h, use_thumbnail=use_thumbnail)
imgs = [standardize_image(img.convert("RGB")) for img in imgs]
else:
img = np.array(img)
original_h, original_w = img.shape[0], img.shape[1]
ratio = float(max(img_h, img_w)) / max(original_h, original_w)
scaled_h, scaled_w = int(original_h * ratio + 0.5), int(original_w * ratio + 0.5)
if augment:
visual_transform = _transform_train_aug(scaled_h, scaled_w)
else:
visual_transform = _transform_test(scaled_h, scaled_w)
img = visual_transform(img)
# Standardize pixel values.
img = standardize_image(img)
# Pad to target image size.
delta_h, delta_w = img_h - scaled_h, img_w - scaled_w
img = torch.nn.functional.pad(img, (0, delta_w, 0, delta_h))
imgs = [img]
return imgs
# From https://github.com/OpenGVLab/InternVL/blob/c62fa4f7c850165d7386bdc48ac6bc5a6fab0864/internvl_chat/internvl/train/dataset.py#L685
# Copyright (c) 2023 OpenGVLab.
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
best_ratio_diff = float('inf')
best_ratio = (1, 1)
area = width * height
for ratio in target_ratios:
target_aspect_ratio = ratio[0] / ratio[1]
ratio_diff = abs(aspect_ratio - target_aspect_ratio)
if ratio_diff < best_ratio_diff:
best_ratio_diff = ratio_diff
best_ratio = ratio
elif ratio_diff == best_ratio_diff:
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
best_ratio = ratio
# print(f'width: {width}, height: {height}, best_ratio: {best_ratio}')
return best_ratio
# From https://github.com/OpenGVLab/InternVL/blob/c62fa4f7c850165d7386bdc48ac6bc5a6fab0864/internvl_chat/internvl/train/dataset.py#L702
# Copyright (c) 2023 OpenGVLab.
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
orig_width, orig_height = image.size
aspect_ratio = orig_width / orig_height
# calculate the existing image aspect ratio
target_ratios = set(
(i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
i * j <= max_num and i * j >= min_num)
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
# find the closest aspect ratio to the target
target_aspect_ratio = find_closest_aspect_ratio(
aspect_ratio, target_ratios, orig_width, orig_height, image_size)
# calculate the target width and height
target_width = image_size * target_aspect_ratio[0]
target_height = image_size * target_aspect_ratio[1]
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
# resize the image
resized_img = image.resize((target_width, target_height))
processed_images = []
for i in range(blocks):
box = (
(i % (target_width // image_size)) * image_size,
(i // (target_width // image_size)) * image_size,
((i % (target_width // image_size)) + 1) * image_size,
((i // (target_width // image_size)) + 1) * image_size
)
# split the image
split_img = resized_img.crop(box)
processed_images.append(split_img)
assert len(processed_images) == blocks
if use_thumbnail and len(processed_images) != 1:
thumbnail_img = image.resize((image_size, image_size))
processed_images.append(thumbnail_img)
return processed_images
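# A minimal sketch of the tiling path above, assuming a square 448x448 tile size.
# The synthetic image and sizes are arbitrary; dynamic_preprocess picks the tile grid
# whose aspect ratio is closest to the input (here 2:1 -> a 2x1 grid), and
# get_visual_transform returns one standardized CHW tensor per tile, plus a thumbnail
# tile when use_thumbnail is set.
if __name__ == "__main__":
    demo = Image.new("RGB", (896, 448), color=(127, 127, 127))
    tiles = get_visual_transform(
        demo, img_h=448, img_w=448, use_tiling=True, max_num_tiles=4, use_thumbnail=True
    )
    print(len(tiles), tiles[0].shape)  # e.g. 3 tensors: 2x1 grid + thumbnail, each [3, 448, 448]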
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
import torch
from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add
from megatron.core.tensor_parallel.layers import ColumnParallelLinear, RowParallelLinear
from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
from megatron.core.transformer.dot_product_attention import DotProductAttention
from megatron.core.transformer.enums import AttnMaskType
from megatron.core.transformer.identity_op import IdentityOp
from megatron.core.transformer.mlp import MLP, MLPSubmodules
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.transformer_layer import TransformerLayer, TransformerLayerSubmodules
try:
from megatron.core.transformer.custom_layers.transformer_engine import (
TEColumnParallelLinear,
TEDotProductAttention,
TELayerNormColumnParallelLinear,
TENorm,
TERowParallelLinear,
)
HAVE_TE = True
except ImportError:
HAVE_TE = False
try:
import apex
from megatron.core.fusions.fused_layer_norm import FusedLayerNorm
HAVE_APEX = True
LNImpl = FusedLayerNorm
except ImportError:
import warnings
from megatron.core.transformer.torch_layer_norm import WrappedTorchLayerNorm
    warnings.warn('Apex is not installed. Falling back to Torch LayerNorm')
LNImpl = WrappedTorchLayerNorm
def get_layer_spec(is_vit, normalization) -> ModuleSpec:
attn_mask_type = AttnMaskType.no_mask if is_vit else AttnMaskType.causal
if normalization == "LayerNorm":
norm = LNImpl
elif normalization == "RMSNorm":
norm = TENorm
else:
        raise RuntimeError(f"unknown normalization: {normalization}")
mlp = get_mlp_module_spec(use_te=False) # doesn't include norm.
return ModuleSpec(
module=TransformerLayer,
submodules=TransformerLayerSubmodules(
input_layernorm=norm,
self_attention=ModuleSpec(
module=SelfAttention,
params={"attn_mask_type": attn_mask_type},
submodules=SelfAttentionSubmodules(
linear_qkv=ColumnParallelLinear,
core_attention=DotProductAttention,
linear_proj=RowParallelLinear,
q_layernorm=IdentityOp,
k_layernorm=IdentityOp,
),
),
self_attn_bda=get_bias_dropout_add,
pre_mlp_layernorm=norm,
mlp=mlp,
mlp_bda=get_bias_dropout_add,
),
)
def get_layer_spec_te(is_vit=False) -> ModuleSpec:
attn_mask_type = AttnMaskType.no_mask if is_vit else AttnMaskType.causal
mlp = get_norm_mlp_module_spec_te()
return ModuleSpec(
module=TransformerLayer,
submodules=TransformerLayerSubmodules(
self_attention=ModuleSpec(
module=SelfAttention,
params={"attn_mask_type": attn_mask_type},
submodules=SelfAttentionSubmodules(
linear_qkv=TELayerNormColumnParallelLinear,
core_attention=TEDotProductAttention,
linear_proj=TERowParallelLinear,
q_layernorm=IdentityOp,
k_layernorm=IdentityOp,
),
),
self_attn_bda=get_bias_dropout_add,
pre_mlp_layernorm=IdentityOp,
mlp=mlp,
mlp_bda=get_bias_dropout_add,
),
)
def get_mlp_module_spec(use_te: bool = True) -> ModuleSpec:
# Dense MLP w/ or w/o TE modules.
return ModuleSpec(
module=MLP,
submodules=MLPSubmodules(
linear_fc1=TEColumnParallelLinear if use_te else ColumnParallelLinear,
linear_fc2=TERowParallelLinear if use_te else RowParallelLinear,
),
)
def get_norm_mlp_module_spec_te() -> ModuleSpec:
return ModuleSpec(
module=MLP,
submodules=MLPSubmodules(
linear_fc1=TELayerNormColumnParallelLinear, linear_fc2=TERowParallelLinear
),
)
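# A minimal sketch of how these helpers might be used when building a model, assuming
# megatron.core is importable. get_layer_spec_te fuses the norms into the TE linear
# layers, while get_layer_spec keeps explicit input and pre-MLP norm modules; the switch
# on HAVE_TE below is illustrative only, not the project's actual wiring.
if __name__ == "__main__":
    layer_spec = (
        get_layer_spec_te(is_vit=False)
        if HAVE_TE
        else get_layer_spec(is_vit=False, normalization="LayerNorm")
    )
    print(layer_spec.module.__name__)  # TransformerLayer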