You can define your own inference samples; the following is an example.

```python
samples = [
    # Use the <image> token to indicate the image position.
    {"image_paths": ["examples/multimodal/assets/pretrain_curves.png"],
     "question": "<image>\nWhat is the curve?"},
    # Optional: include an answer for the question.
    {"image_paths": ["examples/multimodal/assets/pretrain_curves.png"],
     "question": "What is the curve?<image>",
     "answer": "It's a loss function curve."},
    # If a question uses multiple images, add one <image> token per image position.
    {"image_paths": ["examples/multimodal/assets/pretrain_curves.png",
                     "examples/multimodal/assets/pretrain_curves.png"],
     "question": "<image>What is the curve?<image>"},
]
```

Note: `prompt_style="vlmevalkit"` is similar to [VLMEvalKit's InternVL prompt handling](https://github.com/open-compass/VLMEvalKit/blob/5d3cebcf18ef4bfbadc3bd3ef80bdc7aad2c6557/vlmeval/vlm/internvl_chat.py#L499).
## Setup
For the container environment used by this example, see `examples/multimodal/llama_3p1_nemotron_nano_vl_8b_v1/Dockerfile`.
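For instance, you might build and tag the image as follows (the image tag is arbitrary):

```
docker build -t nemotron-nano-vl -f examples/multimodal/llama_3p1_nemotron_nano_vl_8b_v1/Dockerfile .
```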
## Dataset preparation
We use [Megatron Energon](https://github.com/NVIDIA/Megatron-Energon) for multimodal dataloading.
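If your dataset is not yet in Energon format, the `energon prepare` command can generate the required dataset metadata interactively; a minimal sketch, assuming a WebDataset-style dataset of tar shards:

```
# Walks you through creating the Energon metadata (.nv-meta) for the dataset.
energon prepare /path/to/webdataset
```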
## Model
You can download trained Megatron checkpoints with tensor parallel size 1 and 4 [here](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1-mcore).
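For example, one way to fetch the checkpoint, assuming the Hugging Face CLI is installed and you have any required access:

```
huggingface-cli download nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1-mcore --local-dir <local checkpoint directory>
```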
Alternatively, you can follow the steps in [Model conversion](#model-conversion) and [Training](#training) below to prepare a model
and run pretraining and SFT from scratch using a prepared dataset.
### Model conversion
#### Language model conversion
We start from [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on Hugging Face.
Please download it, set `LLAMA_DOWNLOAD_DIR` to the downloaded directory, and convert the checkpoint to Megatron format.
```
export LLAMA_DOWNLOAD_DIR=<downloaded hf model directory>
```
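The exact converter invocation depends on your Megatron-LM version. A sketch using `tools/checkpoint/convert.py` is below; the loader, saver, and flag values shown are assumptions, so check `python tools/checkpoint/convert.py --help` for the options available in your checkout:

```
# Sketch only: loader/saver names and flag values are assumptions and may differ
# between Megatron-LM versions.
python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader llama_mistral \
    --saver mcore \
    --checkpoint-type hf \
    --load-dir ${LLAMA_DOWNLOAD_DIR} \
    --save-dir <megatron language model directory> \
    --tokenizer-model ${LLAMA_DOWNLOAD_DIR} \
    --target-tensor-parallel-size 4
```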
#### Combined model checkpoint
Once you also have a vision model checkpoint in Megatron format, combine the language and vision checkpoints into a single model:

```
examples/multimodal/combine_lm_vision_checkpoints.sh <language model directory> <vision model directory> <output directory>
```
## Training
1. Pretraining: we provide an example pretraining script at `examples/multimodal/llama_3p1_nemotron_nano_vl_8b_v1/pretraining_llama_3p1_nemotron_nano_vl_8b_v1.sh`.
2. SFT: we provide an example SFT script at `examples/multimodal/llama_3p1_nemotron_nano_vl_8b_v1/sft_llama_3p1_nemotron_nano_vl_8b_v1.sh`.
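Both scripts are intended to be launched directly; dataset, checkpoint, and tokenizer paths are typically supplied through environment variables or by editing the script, so check the variables defined at the top of each script before running. A minimal launch sketch:

```
# Sketch: any required environment variables are defined at the top of the script.
bash examples/multimodal/llama_3p1_nemotron_nano_vl_8b_v1/pretraining_llama_3p1_nemotron_nano_vl_8b_v1.sh
```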
## Inference and evaluation
To run a simple inference example:
```
export LLAMA_NEMOTRON_NANO_VL_PATH=<path to the megatron tp=4 checkpoint>
```
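Then launch the text generation script shipped with this example. The script name below is hypothetical, patterned after the training script names above; check the example directory for the actual entry point and its arguments:

```
# Hypothetical: the script name and its arguments are assumptions; see the
# example directory for the actual text generation entry point.
bash examples/multimodal/llama_3p1_nemotron_nano_vl_8b_v1/text_generation_llama_3p1_nemotron_nano_vl_8b_v1.sh
```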
"COMMENT":"Sources for these prompts include https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/viewer and https://huggingface.co/datasets/HuggingFaceM4/M3IT",
"Captioning":{
"raw":[
"Can you briefly explain what you see in the image?",
"Describe what's happening in this image in one short sentence.",
"Write a short caption that accurately represents the content of this image.",
"Please generate a descriptive caption for the image provided.",
"How would you summarize the scene depicted in the picture in short?",
"Describe the image briefly.",
"Write a succinct description of the image, capturing its main components, the relationships between them, and any notable details.",
"Create a concise caption that accurately describes the main elements in the image provided.",
"Write a brief, yet comprehensive, description of the image.",
"Describe the image in a clear and concise manner.",
"For the given image, provide a one-sentence summary that captures the most important details.",
"Generate a short caption for the picture.",
"Write a short and informative description that highlights the primary subjects and actions occurring in the given image.",
"Provide a concise and informative caption for the image, focusing on the primary subjects.",
"Write a clear description of the image, make sure the key features are well covered.",
"Offer a succinct explanation of the picture presented."
]
},
"CaptioningPretraining":{
"raw":[
"Generate a short caption of the image.",
"Describe the image concisely.",
"Provide a brief description of the given image."
],
"llava":[
"Give a brief description of image.",
"Give a brief description of the image.",
"Provide a brief description of the given image.",
"Provide a one-sentence caption for the provided image.",
"Write a terse but informative summary of the picture.",
"Describe the image concisely.",
"Generate a clear and concise summary of the photo."
]
},
"OCR":{
"raw":[
"Can you read the text from image and output here?",
"Extract and document the text from the provided image.",
"Converting the text embedded in this image into a readable document.",
"Transcribe all the text you find.",
"Can you extract all visible text from the image here?"
parser.add_argument("--model-type",required=True,type=str,choices=['radio_v2.5-h','radio_v2.5-g'],help="Type of radio to load for conversion")
parser.add_argument("--version",type=str,default=None,help="Version to pass to torch.hub.load. Can be a local path or a version RADIO on torch hub. By default use the version from the model type.")