README.md 2.38 KB
Newer Older
chenzk's avatar
v1.0  
chenzk committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
# Training on DataCompDR with OpenCLIP
We provide release code and a patch to
[OpenCLIP](https://github.com/mlfoundations/open_clip/tree/main/src/open_clip) 
for training models on DataCompDR.

## Data
Our reinforcements to DataComp are available on HuggingFace.
- [DataCompDR-12M](https://huggingface.co/datasets/apple/DataCompDR-12M)
- [DataCompDR-12M-BFloat16](https://huggingface.co/datasets/apple/DataCompDR-12M-bf16)
- [DataCompDR-1B](https://huggingface.co/datasets/apple/DataCompDR-1B)

Our data does not include the original images and captions. For DataCompDR-12M, 
there is a corresponding 
[DataComp-12M](https://huggingface.co/datasets/mlfoundations/DataComp-12M) with 
original captions. One needs to download both datasets, then run the following 
script to join them:
```bash
#!/bin/bash
DATACOMP12M_PATH="./datasets/DataComp-12M/" # Download path of DataComp-12M from HF
DATACOMPDR12M_NOIMG_PATH="./datasets/DataCompDR-12M-noimage/" # Download path of DataCompDR-12M from HF
DATACOMPDR12M_PATH="./datasets/DataCompDR-12M/"
for  i in {00000000..00001023}
do
  mkdir tmp
  tar -xf $DATACOMP12M_PATH/${i}.tar -C tmp
  tar -xf $DATACOMP12M_NOIMG_PATH/${i}.tar -C tmp
  tar -cf $DATACOMPDR12M_PATH/${i}.tar -C tmp *.*
  rm -rf tmp
done
```

The images have to be downloaded separately. See 
[hf_dataset_example.py](../hf_dataset_example.py) for an example of downloading 
a single image.

## Installing dependencies

We use OpenCLIP for training. We have made minor modifications to OpenCLIP for 
support of loading reinforcements and the training loss. To checkout the 
specific version of each library and apply our corresponding patch run the 
following commands in order:
```bash
# Clone MobileCLIP repository
git clone git@github.com:apple/ml-mobileclip.git
cd ml-mobileclip/

# Clone OpenCLIP repository, apply patch, and install
git clone https://github.com/mlfoundations/open_clip.git
cd open_clip
git checkout cf86ee7ec4658845f640858ecd34d0f15588271a
git apply ../open_clip.patch  # Support for sampling without replacement
cp ../configs/ ./ -r
cp ../dr/ ./src/training/ -r
```

## Training

We provide scripts for training on DataCompDR-12M and DataCompDR-1B.

```bash
cd open_clip/
bash configs/run_datacomp12m.sh  # Train a ViT-B/16 on DataComp-12M without DR
bash configs/run_datacompdr12m.sh  # Train a ViT-B/16 on DataComp-12M with DR
bash configs/run_datacompdr1B.sh  # Train a ViT-B/16 on DataComp-1B with DR
```