"include/ck/ck.hpp" did not exist on "31ded4ac4bc524acdbf897ffff094d7e7cbed991"
Commit 6f43e8fa authored by mashun1's avatar mashun1
Browse files

open_clip

parents
Pipeline #1689 canceled with stages
## Additional training curves for CLIP on Conceptual Captions
### Zero-shot accuracy
![](/docs/clip_zeroshot.png)
### Training loss curve
![](/docs/clip_loss.png)
### Validation loss curve
![](/docs/clip_val_loss.png)
### Validation recall
![](/docs/clip_recall.png)
## CLIPA
In this work, we present a surprising finding that there exists an _inverse_ scaling law for CLIP training,
whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training.
Moreover, we showcase that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law.
![](/docs/inverse_scaling_law.png)
As a result of this finding, we are able to successfully train CLIP even by using academic resources.
For example, on an A100 eight-GPU server, our CLIP models achieve zero-shot top-1 ImageNet accuracies of **63.2%** in about **2 days**,
**67.8%** in about **3 days**, and **69.3%** in about **4 days**.
Moreover, we find that CLIPA at scale leads to state-of-the-art performance. For example, our CLIPA-v2 H/14 achieves a zero-shot top-1 ImageNet accuracy of **81.8%**,
with a budget of less than **$15,000**.
![](/docs/clipa_acc_compute.png)
For more details, please see our paper [An Inverse Scaling Law for CLIP Training](https://arxiv.org/abs/2305.07017) and
[CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy](https://arxiv.org/abs/2306.15658).
Eight token length reduction strategies (four for images, four for text) are investigated in this work, detailed as follows.
## Image token length reduction
![](/docs/clipa_reduce_image_token.png)
* `resize`: Use `--force-image-size` to specify the image size you want to adopt. We find this strategy generally works best, as it retains the full image information.
* `random mask`: Randomly mask out image patches. Use `--force-patch-dropout` to specify the mask ratio you want to adopt (see the token-count sketch after this list).
* `grid mask`: Preserve one patch in each 2 × 2 grid window. We do not provide an implementation of grid masking, as it is only experimental and we generally find resizing works better.
* `block mask`: Keep a single block of patches and remove the rest. We do not provide an implementation of block masking, as it is only experimental and we generally find resizing works better.
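For intuition, the number of image tokens a ViT sees can be worked out from the patch grid. Below is a minimal sketch (plain arithmetic, not code from this repository; the 16-pixel patch size, the helper name, and the example values are assumptions for illustration) of how `resize` and `random mask` shrink the image token length.

```python
# Rough token-count arithmetic for a ViT-style image encoder (illustrative only).
# Assumption: square inputs and a 16-pixel patch size, as in ViT-L/16.

def num_image_tokens(image_size: int, patch_size: int = 16, mask_ratio: float = 0.0) -> int:
    """Patch tokens left after resizing to `image_size` and randomly dropping `mask_ratio` of them."""
    patches = (image_size // patch_size) ** 2
    return round(patches * (1.0 - mask_ratio))

print(num_image_tokens(224))                   # 196 tokens: the usual full-resolution input
print(num_image_tokens(112))                   # 49 tokens: `resize`, e.g. --force-image-size 112
print(num_image_tokens(224, mask_ratio=0.75))  # 49 tokens: `random mask`, e.g. --force-patch-dropout 0.75
```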
## Text token length reduction
* `syntax mask`: Assign different masking priorities to different parts of speech. To use it, specify `"text_mask": "syntax"` under `"tokenizer_kwargs"` in the `"text_cfg"` section of the model config `json` file.
Specifically, we prioritize retaining nouns, followed by adjectives, and then all other words (see the sketch after this list).
We find this strategy generally works best, as it retains the information most critical for contrastive learning.
* `truncate`: Keep the first N text tokens and discard the rest. This is the default setting of `open_clip`.
* `random mask`: Randomly drop a portion of the text tokens. To use it, specify `"text_mask": "random"` under `"tokenizer_kwargs"` in the `"text_cfg"` section of the model config `json` file.
* `block mask`: Randomly preserve a consecutive span of text tokens. To use it, specify `"text_mask": "block"` under `"tokenizer_kwargs"` in the `"text_cfg"` section of the model config `json` file.
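To make the priority order concrete, here is a minimal, illustrative sketch of syntax-style masking built on NLTK part-of-speech tags. It is not the repository's tokenizer code: the helper name `syntax_mask`, the 8-word budget, and the exact tag grouping are assumptions. Nouns are kept first, then adjectives, then everything else.

```python
import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data packages (see Installation below)

def syntax_mask(caption: str, max_words: int = 8) -> list[str]:
    """Keep at most `max_words` words, preferring nouns, then adjectives, then all other words."""
    tagged = nltk.pos_tag(nltk.word_tokenize(caption))

    def priority(tag: str) -> int:
        if tag.startswith("NN"):  # nouns
            return 0
        if tag.startswith("JJ"):  # adjectives
            return 1
        return 2                  # everything else

    # Stable sort by priority, take the word budget, then restore the original word order.
    keep = sorted(range(len(tagged)), key=lambda i: priority(tagged[i][1]))[:max_words]
    return [tagged[i][0] for i in sorted(keep)]

print(syntax_mask("a small brown dog is running happily across the wet green grass"))
```

The stable sort means that, within the same priority class, the surviving words keep their left-to-right order.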
## Installation
Installation is essentially the same as for `open_clip`, except that the `syntax mask` text token length reduction strategy requires the Natural Language Toolkit (NLTK).
Please follow the [official doc](https://www.nltk.org/) to install NLTK.
Note that the usage of NLTK brings two constraints:
* Because certain NLTK functions such as `nltk.pos_tag` currently only support English and Russian, the `syntax mask` only works for English.
We have not tested it on Russian or any other language. In principle, it should work the same, given a proper language-processing toolkit for the target language.
If you still want to apply `syntax mask` to other languages, try finding the right toolkit; otherwise, use one of the other text token length reduction strategies.
* Some NLTK modules, such as `punkt` and `averaged_perceptron_tagger`, need to be downloaded before NLTK can be used.
We have included the downloading code in `tokenizer.py`, but this might cause trouble in certain cases.
You may want to download those modules manually first, via `nltk.download('punkt')` and `nltk.download('averaged_perceptron_tagger')`,
and then set the environment variable before running the script: `export NLTK_DATA=cache` (a short setup sketch follows this list).
Note that this is a one-time effort. Remember to comment out those `nltk.download` lines in `tokenizer.py` afterwards.
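A one-time setup along those lines might look like the sketch below; the `cache` directory name simply mirrors the `NLTK_DATA=cache` example above.

```python
import nltk

# One-time download of the NLTK data used by `syntax mask` into a local `cache`
# directory; expose it with `export NLTK_DATA=cache` before launching training.
nltk.download("punkt", download_dir="cache")
nltk.download("averaged_perceptron_tagger", download_dir="cache")
```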
## Training
We provide example scripts to reproduce our CLIPA results on an A100 eight-GPU machine under `docs/script_examples/clipa`.
For instance, to reproduce the CLIPA-L/16(I37,T8) results, first run the pre-training script
```
bash docs/script_examples/clipa/vit_l16/i37_t8_pretrain.sh
```
and fine-tune the pre-trained checkpoint with
```
bash docs/script_examples/clipa/vit_l16/i37_t8_finetune.sh
```
- Remember to change the dataset path to your own.
- This is a two-stage training pipeline. When fine-tuning, remember to change the path to the pre-trained checkpoint to your own.
- Training takes ~3 days for pre-training and ~1 day for fine-tuning on an A100 eight-GPU machine.
## Model Weights
Below are CLIPA weights trained on LAION-400M with an A100 eight-GPU machine (a minimal loading example follows the tables).
All models are pre-trained for 6 epochs with reduced input token lengths and subsequently fine-tuned for 0.36 epoch with full input token lengths.
| | Pre-trained Weights | zero-shot IN-1K |
|---------------------|:----------------------------------------------------------------------------------------------:|:---------------:|
| CLIPA-B/16(I50,T16) | [download](https://drive.google.com/file/d/1MDpz8gV2Vjaazk16rBhLxU8811U7_cGL/view?usp=sharing) | 59.7 |
| CLIPA-L/16(I17,T16) | [download](https://drive.google.com/file/d/1Tr2GYiKAaMH6EGIn5l7eX_1K20eaA3WA/view?usp=sharing) | 60.3 |
| CLIPA-L/16(I37,T8)  | [download](https://drive.google.com/file/d/1EM1ChRNARpLckkJjf6m7njCY3xyvpGBu/view?usp=sharing) | 57.9 |

|                     | Fine-tuned Weights                                                                               | zero-shot IN-1K |
|---------------------|:----------------------------------------------------------------------------------------------:|:---------------:|
| CLIPA-B/16(I50,T16) | [download](https://drive.google.com/file/d/1fURK0K_a3-83jVEI4PVEbnEJb_V6UbGv/view?usp=sharing) | 63.2 |
| CLIPA-L/16(I17,T16) | [download](https://drive.google.com/file/d/18qqZGOTGOgb3I3JWONuat6qObsgLq7sR/view?usp=sharing) | 67.8 |
| CLIPA-L/16(I37,T8)  | [download](https://drive.google.com/file/d/1lV7pLORUK04T9QKKx9TpYtMws-AZrib0/view?usp=sharing) | 69.3 |
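As a rough guide to using a downloaded checkpoint, the sketch below loads it through `open_clip` and scores one image against a few captions. It is only a sketch: the `ViT-L-16` model name is assumed to match the CLIPA-L/16 config in this repository, and the checkpoint path, image path, and captions are placeholders.

```python
import torch
from PIL import Image
import open_clip

# Assumptions: 'ViT-L-16' matches the CLIPA-L/16 config used here, and
# 'clipa_l16_i37_t8_finetuned.pt' is the fine-tuned checkpoint downloaded above.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-16", pretrained="clipa_l16_i37_t8_finetuned.pt")
tokenizer = open_clip.get_tokenizer("ViT-L-16")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a dog", "a photo of a cat", "a photo of a car"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # zero-shot probabilities of the image matching each caption
```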
## CLIPA-v2
We also provide example scripts to reproduce our CLIPA-v2 H/14 results under path `docs/script_examples/clipav2`.
Note that the original results are obtained with [our JAX implementation](https://github.com/UCSC-VLAA/CLIPA/tree/master/clipa_jax).
These scripts were written by manually transcribing the JAX config files.
As it is infeasible for us to retrain those models with PyTorch, their correctness cannot be verified with 100% confidence. Use them at your own discretion.
## CommonPool and DataComp models
As part of [DataComp](https://github.com/mlfoundations/datacomp), we trained models on CommonPool using various data filtering strategies.
We release models for all four scales of the competition (small, medium, large, and xlarge), corresponding to pool sizes and numbers of samples seen of 12.8M, 128M, 1.28B, and 12.8B, respectively.
The models are specified below; see our paper [DataComp: In search of the next generation of multimodal datasets](https://arxiv.org/abs/2304.14108) for more details. A short loading example follows the lists.
## xlarge scale models
* `datacomp_xl_s13b_b90k`: A ViT-L/14 trained on DataComp-1B for 12.8B samples seen and batch size 90k. Achieves 79.2% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K.
* `commonpool_xl_clip_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using CLIP scores, for 12.8B samples seen and batch size 90k. Achieves 76.4% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.clip-s13B-b90K.
* `commonpool_xl_laion_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using the LAION-2B filtering scheme, for 12.8B samples seen and batch size 90k. Achieves 75.5% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.laion-s13B-b90K.
* `commonpool_xl_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL without any filtering, for 12.8B samples seen and batch size 90k. Achieves 72.3% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL-s13B-b90K.
## large scale models
* `datacomp_l_s1b_b8k`: A ViT-B/16 trained on a 140M subset of DataComp-1B, for 1.28B samples seen and batch size 8k. Achieves 63.1% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-16-DataComp.L-s1B-b8K.
* `commonpool_l_clip_s1b_b8k`: A ViT-B/16 trained on CommonPool-L filtered using CLIP scores, for 1.28B samples seen and batch size 8k. Achieves 57.8% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-16-CommonPool.L.clip-s1B-b8K.
* `commonpool_l_laion_s1b_b8k`: A ViT-B/16 trained on CommonPool-L filtered using the LAION-2B filtering scheme, for 1.28B samples seen and batch size 8k. Achieves 55.3% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-16-CommonPool.L.laion-s1B-b8K.
* `commonpool_l_image_s1b_b8k`: A ViT-B/16 trained on CommonPool-L filtered using image-based filtering, for 1.28B samples seen and batch size 8k. Achieves 57.2% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-16-CommonPool.L.image-s1B-b8K.
* `commonpool_l_text_s1b_b8k`: A ViT-B/16 trained on CommonPool-L filtered using text-based filtering, for 1.28B samples seen and batch size 8k. Achieves 56.1% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-16-CommonPool.L.text-s1B-b8K.
* `commonpool_l_basic_s1b_b8k`: A ViT-B/16 trained on CommonPool-L filtered using basic filtering (English filtering + caption length and image size filtering), for 1.28B samples seen and batch size 8k. Achieves 51.6% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-16-CommonPool.L.basic-s1B-b8K.
* `commonpool_l_s1b_b8k`: A ViT-B/16 trained on CommonPool-L without any filtering, for 1.28B samples seen and batch size 8k. Achieves 45.9% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-16-CommonPool.L-s1B-b8K.
## medium scale models
* `datacomp_m_s128m_b4k`: A ViT-B/32 trained on a 14M subset of DataComp-1B, for 128M samples seen and batch size 4k. Achieves 29.7% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-DataComp.M-s128M-b4K.
* `commonpool_m_clip_s128m_b4k`: A ViT-B/32 trained on CommonPool-M filtered using CLIP scores, for 128M samples seen and batch size 4k. Achieves 27.3% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.M.clip-s128M-b4K.
* `commonpool_m_laion_s128m_b4k`: A ViT-B/32 trained on CommonPool-M filtered using the LAION-2B filtering scheme, for 128M samples seen and batch size 4k. Achieves 23.0% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.M.laion-s128M-b4K.
* `commonpool_m_image_s128m_b4k`: A ViT-B/32 trained on CommonPool-M filtered using image-based filtering, for 128M samples seen and batch size 4k. Achieves 26.8% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.M.image-s128M-b4K.
* `commonpool_m_text_s128m_b4k`: A ViT-B/32 trained on CommonPool-M filtered using text-based filtering, for 128M samples seen and batch size 4k. Achieves 25.5% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.M.text-s128M-b4K.
* `commonpool_m_basic_s128m_b4k`: A ViT-B/32 trained on CommonPool-M filtered using basic filtering (English filtering + caption length and image size filtering), for 128M samples seen and batch size 4k. Achieves 22.6% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.M.basic-s128M-b4K.
* `commonpool_m_s128m_b4k`: A ViT-B/32 trained on CommonPool-M without any filtering, for 128M samples seen and batch size 4k. Achieves 17.6% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.M-s128M-b4K.
## small scale models
* `datacomp_s_s13m_b4k`: A ViT-B/32 trained on a 1.4M subset of DataComp-1B, for 12.8M samples seen and batch size 4k. Achieves 3.9% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-DataComp.S-s13M-b4K.
* `commonpool_s_clip_s13m_b4k`: A ViT-B/32 trained on CommonPool-S filtered using CLIP scores, for 12.8M samples seen and batch size 4k. Achieves 5.1% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.S.clip-s13M-b4K.
* `commonpool_s_laion_s13m_b4k`: A ViT-B/32 trained on CommonPool-S filtered using the LAION-2B filtering scheme, for 12.8M samples seen and batch size 4k. Achieves 3.1% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.S.laion-s13M-b4K.
* `commonpool_s_image_s13m_b4k`: A ViT-B/32 trained on CommonPool-S filtered using image-based filtering, for 12.8M samples seen and batch size 4k. Achieves 4.3% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.S.image-s13M-b4K.
* `commonpool_s_text_s13m_b4k`: A ViT-B/32 trained on CommonPool-S filtered using text-based filtering, for 12.8M samples seen and batch size 4k. Achieves 4.6% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.S.text-s13M-b4K.
* `commonpool_s_basic_s13m_b4k`: A ViT-B/32 trained on CommonPool-S filtered using basic filtering (English filtering + caption length and image size filtering), for 12.8M samples seen and batch size 4k. Achieves 3.0% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.S.basic-s13M-b4K.
* `commonpool_s_s13m_b4k`: A ViT-B/32 trained on CommonPool-S without any filtering, for 12.8M samples seen and batch size 4k. Achieves 2.5% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-B-32-CommonPool.S-s13M-b4K.
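These names can be passed directly as the `pretrained` tag in `open_clip`. A minimal loading sketch, assuming a recent `open_clip` version in which these tags are registered:

```python
import open_clip

# Load the medium-scale DataComp model by the tag listed above; other tags pair
# with their corresponding architectures (e.g. ViT-L-14 for the xlarge models).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="datacomp_m_s128m_b4k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
```

Recent `open_clip` versions can also load the same weights straight from the Hugging Face links above by prefixing the repository id with `hf-hub:` as the model name.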
model,image_size,image_width,text_width,embed_dim,mparams,image_mparams,text_mparams,gflops,image_gflops,text_gflops
ViT-S-32-alt,224,384,256,256,43.22,22.59,20.63,3.56,2.29,1.27
ViT-S-32,224,384,384,384,63.09,22.64,40.44,5.66,2.29,3.38
ViT-M-32-alt,224,512,384,384,80.07,39.63,40.44,7.37,3.99,3.38
ViT-M-32,224,512,512,512,103.12,39.69,63.43,9.95,3.99,5.96
ViT-S-16-alt,224,384,256,256,42.4,21.76,20.63,10.47,9.2,1.27
ViT-S-16,224,384,384,384,62.26,21.81,40.44,12.58,9.2,3.38
ViT-B-32,224,768,512,512,151.28,87.85,63.43,14.78,8.82,5.96
ViT-B-32-quickgelu,224,768,512,512,151.28,87.85,63.43,14.78,8.82,5.96
convnext_tiny,224,768,512,1024,92.3,28.61,63.69,14.87,8.91,5.96
ViT-B-32-256,256,768,512,512,151.29,87.86,63.43,17.46,11.5,5.96
RN50,224,64,512,1024,102.01,38.32,63.69,18.18,12.22,5.96
RN50-quickgelu,224,64,512,1024,102.01,38.32,63.69,18.18,12.22,5.96
ViT-M-16-alt,224,512,384,384,78.98,38.53,40.44,19.36,15.98,3.38
ViT-M-16,224,512,512,512,102.02,38.59,63.43,21.94,15.98,5.96
vit_relpos_medium_patch16_cls_224,224,768,512,512,101.94,38.51,63.43,21.99,16.03,5.96
mt5-base-ViT-B-32,224,768,512,512,365.71,87.85,277.86,22.12,8.82,13.3
convnext_small,224,768,512,512,113.28,49.85,63.43,23.33,17.37,5.96
ViT-B-32-plus-256,256,896,640,640,210.3,119.13,91.16,24.83,15.56,9.27
RN101,224,64,512,512,119.69,56.26,63.43,25.5,19.54,5.96
RN101-quickgelu,224,64,512,512,119.69,56.26,63.43,25.5,19.54,5.96
vit_medium_patch16_gap_256,256,768,512,512,102.04,38.61,63.43,27.1,21.14,5.96
coca_ViT-B-32,224,768,512,512,253.56,89.16,63.43,33.34,9.19,5.96
convnext_base,224,768,512,512,151.52,88.09,63.43,36.67,30.71,5.96
swin_base_patch4_window7_224,224,768,640,640,178.56,87.4,91.16,40.13,30.86,9.27
ViT-B-16,224,768,512,512,149.62,86.19,63.43,41.09,35.13,5.96
ViT-B-16-quickgelu,224,768,512,512,149.62,86.19,63.43,41.09,35.13,5.96
EVA02-B-16,224,768,512,512,149.69,86.26,63.43,41.09,35.13,5.96
ViT-B-16-SigLIP,224,768,768,768,203.16,92.88,110.27,46.44,35.42,11.02
convnext_base_w,256,768,640,640,179.39,88.22,91.16,49.38,40.11,9.27
RN50x4,288,80,640,640,178.3,87.14,91.16,51.82,42.56,9.27
coca_roberta-ViT-B-32,224,768,768,512,420.37,87.85,124.45,53.12,8.82,13.12
ViT-B-16-plus,224,896,640,640,208.35,117.19,91.16,56.75,47.49,9.27
ViT-B-16-SigLIP-256,256,768,768,768,203.2,92.93,110.27,57.84,46.82,11.02
ViT-B-16-SigLIP-i18n-256,256,768,768,768,370.63,92.93,277.7,57.84,46.82,11.02
ViT-B-16-plus-240,240,896,640,640,208.38,117.21,91.16,64.03,54.76,9.27
convnext_base_w_320,320,768,640,640,179.39,88.22,91.16,71.94,62.67,9.27
convnext_large,224,768,768,768,321.06,197.41,123.65,82.02,68.72,13.3
coca_base,288,768,768,512,440.34,86.4,134.66,99.09,46.47,13.3
roberta-ViT-B-32,224,768,512,512,212.72,87.85,124.87,105.87,8.82,97.05
xlm-roberta-base-ViT-B-32,224,768,512,512,366.12,87.85,278.27,105.87,8.82,97.05
convnext_large_d,256,768,768,768,351.77,199.77,152.0,107.5,89.76,17.73
ViT-B-16-SigLIP-384,384,768,768,768,203.45,93.18,110.27,123.15,112.13,11.02
ViT-L-16,224,1024,768,768,427.74,304.09,123.65,136.41,123.11,13.3
convnext_large_d_320,320,768,768,768,351.77,199.77,152.0,157.98,140.25,17.73
RN50x16,384,96,768,768,290.98,167.33,123.65,162.69,149.39,13.3
ViT-L-14-CLIPA,224,1024,768,768,414.21,303.96,110.25,167.5,162.03,5.47
EVA02-L-14,224,768,768,768,427.76,304.11,123.65,175.3,162.0,13.3
ViT-L-14,224,1024,768,768,427.62,303.97,123.65,175.33,162.03,13.3
ViT-L-14-quickgelu,224,1024,768,768,427.62,303.97,123.65,175.33,162.03,13.3
convnext_xlarge,256,768,1024,1024,653.89,350.25,303.65,198.38,159.14,39.24
ViT-L-16-SigLIP-256,256,768,1024,1024,652.15,315.96,336.19,201.62,162.56,39.06
coca_ViT-L-14,224,1024,768,768,638.45,306.72,123.65,214.52,163.64,13.3
ViT-B-16-SigLIP-512,512,768,768,768,203.79,93.52,110.27,227.26,216.24,11.02
ViT-SO400M-14-SigLIP,224,768,1152,1152,877.36,427.68,449.68,233.54,220.35,13.19
ViT-L-14-280,280,1024,768,768,427.76,304.11,123.65,271.79,258.49,13.3
ViT-L-16-320,320,1024,768,768,427.95,304.3,123.65,271.93,258.63,13.3
ViT-H-16,224,1280,1024,1024,986.26,632.23,354.03,301.72,254.63,47.09
ViT-H-14-CLIPA,224,1280,1024,1024,968.24,632.07,336.16,354.02,334.59,19.43
nllb-clip-base,224,768,512,512,501.89,87.85,414.04,369.6,8.82,360.78
ViT-H-14,224,1280,1024,1024,986.11,632.08,354.03,381.68,334.59,47.09
ViT-H-14-quickgelu,224,1280,1024,1024,986.11,632.08,354.03,381.68,334.59,47.09
ViT-L-14-CLIPA-336,336,1024,768,768,414.54,304.29,110.25,387.39,381.92,5.47
EVA02-L-14-336,336,768,768,768,428.08,304.43,123.65,395.16,381.86,13.3
ViT-L-14-336,336,1024,768,768,427.94,304.29,123.65,395.22,381.92,13.3
ViT-L-16-SigLIP-384,384,768,1024,1024,652.48,316.28,336.19,422.91,383.85,39.06
convnext_xxlarge,256,768,1024,1024,1200.58,846.54,354.03,443.03,395.94,47.09
nllb-clip-base-siglip,384,768,512,768,507.47,93.18,414.3,472.91,112.13,360.78
mt5-xl-ViT-H-14,224,1280,512,1024,2306.75,632.08,1674.68,514.04,334.59,179.45
EVA01-g-14,224,768,768,1024,1136.44,1012.59,123.85,547.36,534.06,13.3
RN50x64,448,128,1024,1024,623.26,420.38,202.88,552.65,529.11,23.55
EVA01-g-14-plus,224,768,1024,1024,1366.62,1012.59,354.03,581.15,534.06,47.09
ViT-g-14,224,1408,1024,1024,1366.68,1012.65,354.03,581.15,534.06,47.09
convnext_xxlarge_320,320,768,1024,1024,1200.58,846.54,354.03,665.74,618.65,47.09
xlm-roberta-large-ViT-H-14,224,1280,512,1024,1193.01,632.08,560.94,671.01,334.59,336.42
ViT-SO400M-14-SigLIP-384,384,768,1152,1152,877.96,428.23,449.73,723.48,670.35,53.13
ViT-H-14-CLIPA-336,336,1280,1024,1024,968.64,632.48,336.16,800.88,781.45,19.43
ViT-bigG-14-CLIPA,224,1664,1280,1280,2517.22,1844.9,672.32,1007.93,967.5,40.44
ViT-H-14-378-quickgelu,378,1280,1024,1024,986.71,632.68,354.03,1054.05,1006.96,47.09
ViT-bigG-14,224,1664,1280,1280,2539.57,1844.91,694.66,1065.36,967.5,97.86
nllb-clip-large,224,1280,512,1024,1399.22,632.08,767.14,1468.46,334.59,1133.87
nllb-clip-large-siglip,384,768,512,1152,1195.5,428.23,767.27,1804.22,670.35,1133.87
ViT-e-14,224,1792,1280,1280,4581.09,3807.72,773.37,2091.45,1981.35,110.1
ViT-bigG-14-CLIPA-336,336,1664,1280,1280,2517.76,1845.44,672.32,2271.58,2231.15,40.44
EVA02-E-14,224,768,1024,1024,4704.59,4350.56,354.03,2311.42,2264.33,47.09
EVA02-E-14-plus,224,768,1280,1024,5044.89,4350.56,694.33,2362.19,2264.33,97.86
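The table above profiles each model config (parameter counts in millions and GFLOPs for the full model, image tower, and text tower). As a rough, assumption-laden sketch of how the parameter columns can be reproduced with `open_clip` (the FLOPs columns would additionally need a FLOP counter such as `fvcore`, which is not shown here):

```python
import open_clip

# Count parameters (in millions) for one config, split into image and text towers.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32")
total_m = sum(p.numel() for p in model.parameters()) / 1e6
image_m = sum(p.numel() for p in model.visual.parameters()) / 1e6
print(f"ViT-B-32: {total_m:.2f}M params total, "
      f"{image_m:.2f}M image tower, {total_m - image_m:.2f}M text tower")
```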