# Multimodal Example

NOTE: This example is a work in progress and not yet fully functional.

## Setup

### Docker container

You can build a Docker container from `examples/multimodal/Dockerfile` to run this example.

### Vision model

This example uses the OpenAI CLIP `ViT-L/14@336px` vision model. To download the weights from OpenAI and convert them to a format that Megatron can load, run:

```
python examples/multimodal/clip_converter.py --download-root /some/download/folder --output /some/output/folder --tensor-parallel-size 4
```
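The `--tensor-parallel-size` flag determines how many ranks each weight is partitioned across. As an illustration of the idea only (this is a hypothetical sketch, not the actual logic in `clip_converter.py`), row-wise partitioning divides a weight evenly across ranks:

```python
# Illustrative sketch of tensor-parallel weight partitioning
# (hypothetical; not the converter's actual implementation).

def shard_rows(weight, tp_size):
    """Split a weight (a list of rows) evenly across tp_size ranks."""
    rows_per_rank, remainder = divmod(len(weight), tp_size)
    if remainder != 0:
        raise ValueError("weight rows must divide evenly across ranks")
    return [
        weight[rank * rows_per_rank:(rank + 1) * rows_per_rank]
        for rank in range(tp_size)
    ]

# Example: an 8-row weight split across 4 tensor-parallel ranks,
# giving each rank 2 rows.
weight = [[float(i)] for i in range(8)]
shards = shard_rows(weight, tp_size=4)
assert len(shards) == 4 and all(len(s) == 2 for s in shards)
```

Each rank then saves and later loads only its own shard, which is why the converted output is specific to the tensor-parallel size chosen here.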

## Training

### Pretraining

Run the following script:
```
examples/multimodal/pretrain_8b.sh
```

### SFT

Run the following script:
```
examples/multimodal/sft_8b.sh
```

## Evaluation

### Generation

Run the following script:

```
examples/multimodal/text_generation_8b.sh --input-image-path /path/to/input/images --output-path /some/output/directory \
    --model-path /path/to/model.pt --tokenizer-path /path/to/tokenizer.model --gt-path /path/to/groundtruth/file --task generation-task-name
```

### COCO captioning

First, run text generation using `--task captioning`. Then, run the following command:

```
python examples/multimodal/evaluate_coco.py --input-path /output/directory/from/generation --groundtruth-path /path/to/groundtruth/file
```
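COCO-style captioning results are conventionally stored as a JSON list of records with an `image_id` and a `caption` field. A minimal sketch of producing such a file (the exact format read and written by the scripts here may differ):

```python
import json

# Hypothetical captioning results in the conventional COCO results format;
# the actual files produced by the generation step may differ.
results = [
    {"image_id": 42, "caption": "a cat sitting on a windowsill"},
    {"image_id": 73, "caption": "two dogs playing in the snow"},
]

with open("captions_results.json", "w") as f:
    json.dump(results, f)

# Round-trip check: the file parses back to the same records.
with open("captions_results.json") as f:
    assert json.load(f) == results
```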

### TextVQA

First, run text generation using `--task TextVQA`. Then, run the following command:

```
python examples/multimodal/evaluate_textvqa.py --input-path /output/directory/from/generation --groundtruth-path /path/to/groundtruth/file
```

### VQAv2

First, run text generation using `--task VQAv2`. Then, run the following command:

```
python examples/multimodal/evaluate_textvqa.py --input-path /output/directory/from/generation --groundtruth-path /path/to/groundtruth/file --question-path /path/to/question/file
```
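Both TextVQA and VQAv2 are conventionally scored with the VQA accuracy metric, under which a predicted answer gets full credit if at least 3 of the human annotators gave the same answer, and partial credit otherwise. A simplified sketch of that metric (illustrative; the official metric additionally averages over annotator subsets, and the evaluation script's exact implementation may differ):

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(#matching annotators / 3, 1)."""
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators agree with the prediction -> credit of 2/3.
answers = ["blue", "blue", "teal", "navy"] + ["green"] * 6
assert abs(vqa_accuracy("blue", answers) - 2 / 3) < 1e-9
```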

### MMMU

The official MMMU repository is not currently pip-installable, so clone it into `examples/multimodal` by running `git clone https://github.com/MMMU-Benchmark/MMMU.git`.

The MMMU dataset is loaded from HuggingFace.

Run text generation using `--task MMMU`. Then, run the following command:

```
python examples/multimodal/evaluate_mmmu.py --input-path /output/directory/from/generation
```