Commit 1d5a34cf authored by wanglch

Initial commit

# --------------------------------------------------------
# InternVL
# Copyright (c) 2023 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from .build import build_model
# --------------------------------------------------------
# InternVL
# Copyright (c) 2023 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from .intern_vit_6b import InternViT6B
def build_model(config):
model_type = config.MODEL.TYPE
if model_type == 'intern_vit_6b':
model = InternViT6B(
num_classes=config.MODEL.NUM_CLASSES,
patch_size=config.MODEL.INTERN_VIT_6B.PATCH_SIZE,
img_size=config.DATA.IMG_SIZE,
pretrain_size=config.MODEL.INTERN_VIT_6B.PRETRAIN_SIZE,
qkv_bias=config.MODEL.INTERN_VIT_6B.QKV_BIAS,
drop_path_rate=config.MODEL.DROP_PATH_RATE,
embed_dim=config.MODEL.INTERN_VIT_6B.EMBED_DIM,
num_heads=config.MODEL.INTERN_VIT_6B.NUM_HEADS,
mlp_ratio=config.MODEL.INTERN_VIT_6B.MLP_RATIO,
init_values=config.MODEL.INTERN_VIT_6B.INIT_VALUES,
qk_normalization=config.MODEL.INTERN_VIT_6B.QK_NORMALIZATION,
depth=config.MODEL.INTERN_VIT_6B.DEPTH,
use_flash_attn=config.MODEL.INTERN_VIT_6B.USE_FLASH_ATTN,
with_cp=config.TRAIN.USE_CHECKPOINT,
freeze_vit=config.MODEL.INTERN_VIT_6B.FREEZE_VIT,
pretrained=config.MODEL.INTERN_VIT_6B.PRETRAINED,
cls_target=config.MODEL.INTERN_VIT_6B.CLS_TARGET,
head_norm_type=config.MODEL.INTERN_VIT_6B.HEAD_NORM_TYPE,
)
else:
raise NotImplementedError(f'Unknown model: {model_type}')
return model
import torch
import torch.nn as nn
from einops import rearrange
try: # v1
from flash_attn.flash_attn_interface import \
flash_attn_unpadded_qkvpacked_func
except ImportError:  # v2 renamed the function
from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
from flash_attn.bert_padding import pad_input, unpad_input
class FlashAttention(nn.Module):
"""Implement the scaled dot product attention with softmax.
Arguments
---------
softmax_scale: The temperature to use for the softmax attention.
(default: 1/sqrt(d_keys) where d_keys is computed at
runtime)
attention_dropout: The dropout rate to apply to the attention
(default: 0.0)
"""
def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
super().__init__()
self.softmax_scale = softmax_scale
self.dropout_p = attention_dropout
def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
max_s=None, need_weights=False):
"""Implements the multihead softmax attention.
Arguments
---------
qkv: The tensor containing the query, key, and value. (B, S, 3, H, D) if key_padding_mask is None
if unpadded: (nnz, 3, h, d)
key_padding_mask: a bool tensor of shape (B, S)
"""
assert not need_weights
assert qkv.dtype in [torch.float16, torch.bfloat16]
assert qkv.is_cuda
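# Two paths: if cu_seqlens is None we receive padded (B, S, ...) input and build cu_seqlens here
# (un-padding first when a key_padding_mask is given); otherwise the caller passes already
# unpadded (nnz, 3, h, d) qkv together with cu_seqlens and max_s.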
if cu_seqlens is None:
batch_size = qkv.shape[0]
seqlen = qkv.shape[1]
if key_padding_mask is None:
qkv = rearrange(qkv, 'b s ... -> (b s) ...')
max_s = seqlen
cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
device=qkv.device)
output = flash_attn_unpadded_qkvpacked_func(
qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
softmax_scale=self.softmax_scale, causal=causal
)
output = rearrange(output, '(b s) ... -> b s ...', b=batch_size)
else:
nheads = qkv.shape[-2]
x = rearrange(qkv, 'b s three h d -> b s (three h d)')
x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
output_unpad = flash_attn_unpadded_qkvpacked_func(
x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
softmax_scale=self.softmax_scale, causal=causal
)
output = rearrange(pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'),
indices, batch_size, seqlen),
'b s (h d) -> b s h d', h=nheads)
else:
assert max_s is not None
output = flash_attn_unpadded_qkvpacked_func(
qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
softmax_scale=self.softmax_scale, causal=causal
)
return output, None
# --------------------------------------------------------
# InternVL
# Copyright (c) 2023 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from functools import partial
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as checkpoint
from einops import rearrange
from timm.models.layers import DropPath, to_2tuple
try:
from .flash_attention import FlashAttention
has_flash_attn = True
except ImportError:
print('FlashAttention is not installed.')
has_flash_attn = False
def _freeze_params(module):
for param in module.parameters():
param.requires_grad = False
class CrossAttention(nn.Module):
def __init__(
self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0.,
proj_drop=0., attn_head_dim=None, out_dim=None):
super().__init__()
if out_dim is None:
out_dim = dim
self.num_heads = num_heads
head_dim = dim // num_heads
if attn_head_dim is not None:
head_dim = attn_head_dim
all_head_dim = head_dim * self.num_heads
self.scale = qk_scale or head_dim ** -0.5
assert all_head_dim == dim
self.q = nn.Linear(dim, all_head_dim, bias=False)
self.k = nn.Linear(dim, all_head_dim, bias=False)
self.v = nn.Linear(dim, all_head_dim, bias=False)
if qkv_bias:
self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
self.k_bias = nn.Parameter(torch.zeros(all_head_dim))
self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
else:
self.q_bias = None
self.k_bias = None
self.v_bias = None
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(all_head_dim, out_dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x, k=None, v=None):
B, N, C = x.shape
N_k = k.shape[1]
N_v = v.shape[1]
q_bias, k_bias, v_bias = None, None, None
if self.q_bias is not None:
q_bias = self.q_bias
k_bias = self.k_bias
v_bias = self.v_bias
q = F.linear(input=x, weight=self.q.weight, bias=q_bias)
q = q.reshape(B, N, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0) # (B, N_head, N_q, dim)
k = F.linear(input=k, weight=self.k.weight, bias=k_bias)
k = k.reshape(B, N_k, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0)
v = F.linear(input=v, weight=self.v.weight, bias=v_bias)
v = v.reshape(B, N_v, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0)
q = q * self.scale
attn = (q @ k.transpose(-2, -1)) # (B, N_head, N_q, N_k)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
x = self.proj(x)
x = self.proj_drop(x)
return x
class AttentiveBlock(nn.Module):
def __init__(self, dim, num_heads, qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
drop_path=0., norm_layer=nn.LayerNorm, attn_head_dim=None, out_dim=None):
super().__init__()
self.norm1_q = norm_layer(dim)
self.norm1_k = norm_layer(dim)
self.norm1_v = norm_layer(dim)
self.cross_attn = CrossAttention(
dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop,
proj_drop=drop, attn_head_dim=attn_head_dim, out_dim=out_dim)
self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
def forward(self, x_q, x_kv, pos_q, pos_k, bool_masked_pos, rel_pos_bias=None):
x_q = self.norm1_q(x_q + pos_q)
x_k = self.norm1_k(x_kv + pos_k)
x_v = self.norm1_v(x_kv)
x = self.cross_attn(x_q, k=x_k, v=x_v)
return x
class AttentionPoolingBlock(AttentiveBlock):
def forward(self, x):
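# Attention pooling: the mean over all tokens serves as a single query that cross-attends
# to the full token sequence; the result is squeezed to (B, out_dim).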
x_q = x.mean(1, keepdim=True)
x_kv, pos_q, pos_k = x, 0, 0
x = super().forward(x_q, x_kv, pos_q, pos_k, bool_masked_pos=None, rel_pos_bias=None)
x = x.squeeze(1)
return x
class RMSNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
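# RMSNorm: weight * x / sqrt(mean(x^2) + eps), computed in fp32 and cast back to the
# input dtype for numerical stability.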
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
try:
from apex.normalization import FusedRMSNorm
RMSNorm = FusedRMSNorm # noqa
print('Discovered apex.normalization.FusedRMSNorm - will use it instead of RMSNorm')
except ImportError:
# using the normal RMSNorm
pass
except Exception:
print('Discovered apex but it failed to load; falling back to RMSNorm')
pass
class LayerScale(nn.Module):
def __init__(self, dim, init_values=1e-5, inplace=False, force_fp32=False):
super().__init__()
self.inplace = inplace
self.gamma = nn.Parameter(init_values * torch.ones(dim))
self.force_fp32 = force_fp32
@torch.cuda.amp.autocast(enabled=False)
def forward(self, x):
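# Per-channel learnable scaling (LayerScale); with force_fp32 the multiplication is done
# in fp32 and the result is cast back to the input dtype.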
if self.force_fp32:
output_type = x.dtype
out = x.float().mul_(self.gamma.float()) if self.inplace else x.float() * self.gamma.float()
return out.to(dtype=output_type)
else:
out = x.mul_(self.gamma) if self.inplace else x * self.gamma
return out
class Attention(nn.Module):
def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0., use_flash_attn=False,
causal=False, norm_layer=nn.LayerNorm, qk_normalization=False):
super().__init__()
assert dim % num_heads == 0, 'dim should be divisible by num_heads'
self.num_heads = num_heads
head_dim = dim // num_heads
self.scale = head_dim ** -0.5
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
self.use_flash_attn = use_flash_attn
if use_flash_attn:
self.causal = causal
self.inner_attn = FlashAttention(attention_dropout=attn_drop)
self.qk_normalization = qk_normalization
self.q_norm = norm_layer(dim) if qk_normalization else nn.Identity()
self.k_norm = norm_layer(dim) if qk_normalization else nn.Identity()
def _naive_attn(self, x):
B, N, C = x.shape
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
q, k, v = qkv.unbind(0) # make torchscript happy (cannot use tensor as tuple)
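# Optional QK-Norm: flatten the head dimension, normalize queries and keys over the full
# channel dimension, then reshape back before the scaled dot product.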
if self.qk_normalization:
B_, H_, N_, D_ = q.shape
q = self.q_norm(q.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
k = self.k_norm(k.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
attn = ((q * self.scale) @ k.transpose(-2, -1))
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
x = self.proj_drop(x)
return x
def _flash_attn(self, x, key_padding_mask=None, need_weights=False):
qkv = self.qkv(x)
qkv = rearrange(qkv, 'b s (three h d) -> b s three h d', three=3, h=self.num_heads)
if self.qk_normalization:
q, k, v = qkv.unbind(2)
q = self.q_norm(q.flatten(-2, -1)).view(q.shape)
k = self.k_norm(k.flatten(-2, -1)).view(k.shape)
qkv = torch.stack([q, k, v], dim=2)
context, _ = self.inner_attn(
qkv, key_padding_mask=key_padding_mask, need_weights=need_weights, causal=self.causal
)
outs = self.proj(rearrange(context, 'b s h d -> b s (h d)'))
outs = self.proj_drop(outs)
return outs
def forward(self, x):
x = self._naive_attn(x) if not self.use_flash_attn else self._flash_attn(x)
return x
class Mlp(nn.Module):
""" MLP as used in Vision Transformer, MLP-Mixer and related networks
"""
def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU,
bias=True, drop=0.):
super().__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features
bias = to_2tuple(bias)
drop_probs = to_2tuple(drop)
self.fc1 = nn.Linear(in_features, hidden_features, bias=bias[0])
self.act = act_layer()
self.drop1 = nn.Dropout(drop_probs[0])
self.fc2 = nn.Linear(hidden_features, out_features, bias=bias[1])
self.drop2 = nn.Dropout(drop_probs[1])
def forward(self, x):
x = self.fc1(x)
x = self.act(x)
x = self.drop1(x)
x = self.fc2(x)
x = self.drop2(x)
return x
class Block(nn.Module):
def __init__(
self, dim, num_heads, mlp_ratio=4., qkv_bias=False, drop=0., attn_drop=0., init_values=None,
drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, use_flash_attn=False, with_cp=False,
qk_normalization=False, layerscale_force_fp32=False):
super().__init__()
self.norm1 = norm_layer(dim)
self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop,
use_flash_attn=use_flash_attn, causal=False, norm_layer=norm_layer,
qk_normalization=qk_normalization)
self.ls1 = LayerScale(dim, init_values=init_values,
force_fp32=layerscale_force_fp32) if init_values else nn.Identity()
# NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
self.drop_path1 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
self.norm2 = norm_layer(dim)
mlp_hidden_dim = int(dim * mlp_ratio)
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
self.ls2 = LayerScale(dim, init_values=init_values,
force_fp32=layerscale_force_fp32) if init_values else nn.Identity()
self.drop_path2 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
self.with_cp = with_cp
def forward(self, x):
def _inner_forward(x):
x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x))))
x = x + self.drop_path2(self.ls2(self.mlp(self.norm2(x))))
return x
if self.with_cp:
return checkpoint.checkpoint(_inner_forward, x)
else:
return _inner_forward(x)
class PatchEmbed(nn.Module):
""" 2D Image to Patch Embedding
"""
def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True):
super().__init__()
img_size = to_2tuple(img_size)
patch_size = to_2tuple(patch_size)
num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0])
self.patch_shape = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
self.img_size = img_size
self.patch_size = patch_size
self.num_patches = num_patches
self.flatten = flatten
self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
def forward(self, x, **kwargs):
x = self.proj(x)
_, _, H, W = x.shape
if self.flatten:
x = x.flatten(2).transpose(1, 2) # BCHW -> BNC
x = self.norm(x)
return x, H, W
class InternViT6B(nn.Module):
def __init__(self, in_chans=3, patch_size=14, img_size=224, pretrain_size=224, qkv_bias=False, drop_path_rate=0.0,
embed_dim=3200, num_heads=25, mlp_ratio=4, init_values=0.1, qk_normalization=True, depth=48,
use_flash_attn=True, with_cp=True, layerscale_force_fp32=False, freeze_vit=True,
cls_target='cls_patch_concat', num_classes=1000, attn_pool_num_heads=16, clip_embed_dim=768,
head_norm_type='bn', pretrained=None):
super().__init__()
self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models
self.pretrain_size = pretrain_size
self.drop_path_rate = drop_path_rate
self.img_size = img_size
self.patch_size = patch_size
self.cls_target = cls_target
self.depth = depth
if use_flash_attn and not has_flash_attn:
print('Warning: Flash Attention is not available, use_flash_attn is set to False.')
use_flash_attn = use_flash_attn and has_flash_attn
use_flash_attn = [use_flash_attn] * depth if not isinstance(use_flash_attn, list) else use_flash_attn
norm_layer_for_blocks = partial(RMSNorm, eps=1e-6)
self.norm_layer_for_blocks = norm_layer_for_blocks
self.patch_embed = PatchEmbed(img_size, patch_size, in_chans, embed_dim)
num_patches = self.patch_embed.num_patches
self.num_patches = num_patches
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
self.pos_drop = nn.Identity()
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]
self.blocks = nn.ModuleList([
Block(embed_dim, num_heads, mlp_ratio, qkv_bias=qkv_bias,
norm_layer=norm_layer_for_blocks,
drop_path=dpr[i], init_values=init_values, attn_drop=0.,
use_flash_attn=use_flash_attn[i],
with_cp=with_cp,
qk_normalization=qk_normalization,
layerscale_force_fp32=layerscale_force_fp32)
for i in range(depth)])
if cls_target == 'clip_projector':
self.clip_projector = AttentionPoolingBlock(
dim=embed_dim, num_heads=attn_pool_num_heads, qkv_bias=True, qk_scale=None,
drop=0., attn_drop=0., norm_layer=partial(nn.LayerNorm, eps=1e-5), out_dim=clip_embed_dim)
self.init_weights(pretrained)
if freeze_vit:
_freeze_params(self)
if cls_target == 'cls_patch_concat':
if head_norm_type == 'bn':
self.norm = nn.SyncBatchNorm(embed_dim * 2, eps=1e-6)
else:
self.norm = nn.LayerNorm(embed_dim * 2, eps=1e-6)
self.head = nn.Linear(embed_dim * 2, num_classes) if num_classes > 0 else nn.Identity()
elif cls_target == 'clip_projector':
if head_norm_type == 'bn':
self.norm = nn.SyncBatchNorm(clip_embed_dim, eps=1e-6)
else:
self.norm = nn.LayerNorm(clip_embed_dim, eps=1e-6)
self.head = nn.Linear(clip_embed_dim, num_classes) if num_classes > 0 else nn.Identity()
else:
raise NotImplementedError
if not isinstance(self.head, nn.Identity):
self.head.weight.data.normal_(mean=0.0, std=0.01)
self.head.bias.data.zero_()
def init_weights(self, pretrained=None):
print(f'pretrained: {pretrained}')
def resize_pos_embed(pos_embed, H, W):
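# Keep the cls token, reshape the remaining position embeddings to the pretraining grid
# (pretrain_size // 14 per side), bicubically interpolate to the new H x W grid, and
# concatenate the cls token back.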
cls = pos_embed[:, :1, :]
pos_embed = pos_embed[:, 1:, :].reshape(
1, self.pretrain_size // 14, self.pretrain_size // 14, -1).permute(0, 3, 1, 2)
pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
reshape(1, -1, H * W).permute(0, 2, 1)
pos_embed = torch.cat([cls, pos_embed], dim=1)
return pos_embed
if isinstance(pretrained, str):
checkpoint = torch.load(pretrained, map_location='cpu')
if 'module' in checkpoint:
checkpoint = checkpoint['module']
# resize pos_embed
pos_embed = checkpoint['pos_embed']
checkpoint['pos_embed'] = resize_pos_embed(
pos_embed, self.img_size // self.patch_size, self.img_size // self.patch_size)
# resize patch_embed
patch_embed = checkpoint['patch_embed.proj.weight']
checkpoint['patch_embed.proj.weight'] = F.interpolate(
patch_embed, size=(self.patch_size, self.patch_size),
mode='bicubic', align_corners=False)
message = self.load_state_dict(checkpoint, strict=False)
print(message)
@property
def dtype(self):
return self.patch_embed.proj.weight.dtype
def forward_features(self, x):
x, _, _ = self.patch_embed(x.type(self.dtype))
batch_size, seq_len, _ = x.size()
cls_tokens = self.cls_token.expand(batch_size, -1, -1)
x = torch.cat((cls_tokens, x), dim=1)
x = x + self.pos_embed
for idx, blk in enumerate(self.blocks):
x = blk(x)
return x
def forward(self, x):
x = self.forward_features(x)
if self.cls_target == 'cls_patch_concat':
x = torch.cat((x[:, 0, :], x[:, 1:, :].mean(dim=1)), dim=-1)
elif self.cls_target == 'clip_projector':
x = self.clip_projector(x)
else:
raise NotImplementedError
x = self.norm(x)
x = self.head(x)
return x
@torch.jit.ignore
def lr_decay_keywords(self, decay_ratio=0.95):
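# Layer-wise learning-rate decay: block idx gets a multiplier of decay_ratio ** (depth - idx),
# so deeper blocks keep a larger lr; patch_embed / pos_embed / cls_token get the smallest one.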
lr_ratios = {}
# blocks
for idx in range(self.depth):
tag = 'blocks.{}.'.format(idx)
decay = 1.0 * (decay_ratio ** (self.depth - idx))
lr_ratios[tag] = decay
# patch_embed
lr_ratios['patch_embed'] = 1.0 * (decay_ratio ** (self.depth + 1))
lr_ratios['pos_embed'] = 1.0 * (decay_ratio ** (self.depth + 1))
lr_ratios['cls_token'] = 1.0 * (decay_ratio ** (self.depth + 1))
return lr_ratios
# --------------------------------------------------------
# InternVL
# Copyright (c) 2022 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
from torch import optim as optim
from torch.distributed.optim import ZeroRedundancyOptimizer
def build_optimizer(config, model):
"""
Build the optimizer; weight decay is set to 0 for normalization layers and biases by default.
"""
skip = {}
skip_keywords = {}
if hasattr(model, 'no_weight_decay'):
skip = model.no_weight_decay()
if hasattr(model, 'no_weight_decay_keywords'):
skip_keywords = model.no_weight_decay_keywords()
parameters = set_weight_decay_and_lr(
model,
config.TRAIN.WEIGHT_DECAY,
config.TRAIN.BASE_LR,
skip,
skip_keywords,
lr_layer_decay=config.TRAIN.LR_LAYER_DECAY,
lr_layer_decay_ratio=config.TRAIN.LR_LAYER_DECAY_RATIO,
freeze_backbone=config.TRAIN.OPTIMIZER.FREEZE_BACKBONE,
dcn_lr_mul=config.TRAIN.OPTIMIZER.DCN_LR_MUL,
)
opt_lower = config.TRAIN.OPTIMIZER.NAME.lower()
optimizer = None
use_zero = config.TRAIN.OPTIMIZER.USE_ZERO
if use_zero:
print('\nUsing ZeroRedundancyOptimizer!')
if opt_lower == 'sgd':
# Workaround for torch < 1.12 (fixed upstream, see
# https://github.com/pytorch/pytorch/issues/71347):
# ZeroRedundancyOptimizer only accepts a flat parameter list, so we construct it from the
# first parameter group and append the remaining groups via add_param_group.
optimizer = ZeroRedundancyOptimizer(
parameters[0]['params'],
optimizer_class=optim.SGD,
momentum=config.TRAIN.OPTIMIZER.MOMENTUM, nesterov=True,
lr=parameters[0]['lr'], weight_decay=parameters[0]['weight_decay']
)
if len(parameters) > 1:
for param_group in parameters[1:]:
optimizer.add_param_group(param_group)
elif opt_lower == 'adamw':
optimizer = ZeroRedundancyOptimizer(
parameters[0]['params'],
optimizer_class=optim.AdamW,
eps=config.TRAIN.OPTIMIZER.EPS, betas=config.TRAIN.OPTIMIZER.BETAS,
lr=parameters[0]['lr'], weight_decay=parameters[0]['weight_decay']
)
if len(parameters) > 1:
for param_group in parameters[1:]:
optimizer.add_param_group(param_group)
else:
if opt_lower == 'sgd':
optimizer = optim.SGD(parameters,
momentum=config.TRAIN.OPTIMIZER.MOMENTUM,
nesterov=True,
lr=config.TRAIN.BASE_LR,
weight_decay=config.TRAIN.WEIGHT_DECAY)
elif opt_lower == 'sgd_linear_probing':
optimizer = optim.SGD(parameters,
momentum=0.9,
nesterov=False,
lr=config.TRAIN.BASE_LR,
weight_decay=0)
elif opt_lower == 'adamw':
optimizer = optim.AdamW(parameters,
eps=config.TRAIN.OPTIMIZER.EPS,
betas=config.TRAIN.OPTIMIZER.BETAS,
lr=config.TRAIN.BASE_LR,
weight_decay=config.TRAIN.WEIGHT_DECAY)
else:
raise NotImplementedError
return optimizer
def check_keywords_in_name(name, keywords=()):
isin = False
for keyword in keywords:
if keyword in name:
isin = True
return isin
def check_keywords_in_dict(name, keywords_dict):
for k, v in keywords_dict.items():
if k in name:
return v
return None
def set_weight_decay_and_lr(
model,
weight_decay,
base_lr,
skip_list=(),
skip_keywords=(),
lr_layer_decay=None,
lr_layer_decay_ratio=None,
freeze_backbone=None,
dcn_lr_mul=None,
layerwise_lr=True,
):
parameters = []
no_decay_name = []
lr_ratio_log = {}
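# Build one parameter group per tensor so that weight decay and (optionally) layer-wise
# learning rates can be assigned individually.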
for name, param in model.named_parameters():
if not param.requires_grad:
continue # frozen weights
if freeze_backbone:
for i in freeze_backbone:
if f'levels.{i}' in name:
param.requires_grad = False
# 1. check wd
if len(param.shape) == 1 or name.endswith('.bias') or (
name in skip_list) or check_keywords_in_name(name, skip_keywords):
wd = 0.
no_decay_name.append(name)
else:
wd = weight_decay
if lr_layer_decay:
print('layer-wise lr decay is used!')
assert hasattr(model, 'lr_decay_keywords')
lr_ratio_keywords = model.lr_decay_keywords(lr_layer_decay_ratio)
# 2. check lr
ratio = check_keywords_in_dict(name, lr_ratio_keywords)
if ratio is not None:
lr = ratio * base_lr
else:
lr = base_lr
# dcn lr
if dcn_lr_mul is not None:
if 'offset' in name or 'attention_weights' in name or 'center_feature_scale_proj' in name or 'alpha_beta' in name:
lr = dcn_lr_mul * lr
lr_ratio_log[name] = (base_lr, ratio, wd, param.requires_grad)
else:
lr = base_lr
parameters.append({'params': [param], 'weight_decay': wd, 'lr': lr, 'name': name})
print(f'no decay params: {no_decay_name}')
if layerwise_lr:
print('lr_ratio_params:')
for k, v in lr_ratio_log.items():
print(k, v)
return parameters
#!/usr/bin/env bash
set -x
PARTITION=$1
JOB_NAME=$2
CONFIG=$3
GPUS=${GPUS:-8}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
CPUS_PER_TASK=${CPUS_PER_TASK:-10}
SRUN_ARGS=${SRUN_ARGS:-""}
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--gres=gpu:${GPUS_PER_NODE} \
--ntasks=${GPUS} \
--ntasks-per-node=${GPUS_PER_NODE} \
--cpus-per-task=${CPUS_PER_TASK} \
--kill-on-bad-exit=1 \
--quotatype=reserved \
${SRUN_ARGS} \
python -u main.py \
--cfg ${CONFIG} \
--accumulation-steps 1 \
--local-rank 0 \
--output work_dirs ${@:4}
#!/usr/bin/env bash
set -x
PARTITION=$1
JOB_NAME=$2
CONFIG=$3
GPUS=${GPUS:-8}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
CPUS_PER_TASK=${CPUS_PER_TASK:-10}
SRUN_ARGS=${SRUN_ARGS:-""}
PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--gres=gpu:${GPUS_PER_NODE} \
--ntasks=${GPUS} \
--ntasks-per-node=${GPUS_PER_NODE} \
--cpus-per-task=${CPUS_PER_TASK} \
--kill-on-bad-exit=1 \
--quotatype=spot \
${SRUN_ARGS} \
python -u main_deepspeed.py \
--cfg ${CONFIG} \
--local-rank 0 \
--data-path /mnt/lustre/share/images \
--output work_dirs_deepspeed ${@:4}
# --------------------------------------------------------
# InternVL
# Copyright (c) 2022 OpenGVLab
# Licensed under The MIT License [see LICENSE for details]
# --------------------------------------------------------
import math
import os
from collections import OrderedDict
import numpy as np
import torch
import torch.distributed as dist
from timm.utils import get_state_dict
try:
# noinspection PyUnresolvedReferences
from apex import amp
except ImportError:
amp = None
def load_ema_checkpoint(config, model_ema, logger):
logger.info(
f'==============> Resuming from {config.MODEL.RESUME}....................'
)
if config.MODEL.RESUME.startswith('https'):
checkpoint = torch.hub.load_state_dict_from_url(config.MODEL.RESUME,
map_location='cpu',
check_hash=True)
else:
checkpoint = torch.load(config.MODEL.RESUME, map_location='cpu')
assert isinstance(checkpoint, dict)
if 'model_ema' in checkpoint:
new_state_dict = OrderedDict()
for k, v in checkpoint['model_ema'].items():
if model_ema.ema_has_module:
name = 'module.' + k if not k.startswith('module') else k
else:
name = k
new_state_dict[name] = v
msg = model_ema.ema.load_state_dict(new_state_dict, strict=False)
logger.info(msg)
logger.info('Loaded state_dict_ema')
else:
logger.warning(
'Failed to find state_dict_ema, starting from loaded model weights'
)
max_accuracy_ema = 0
if 'max_accuracy_ema' in checkpoint:
max_accuracy_ema = checkpoint['max_accuracy_ema']
if 'ema_decay' in checkpoint:
model_ema.decay = checkpoint['ema_decay']
return max_accuracy_ema
def load_checkpoint(config, model, optimizer, lr_scheduler, scaler, logger):
logger.info(
f'==============> Resuming from {config.MODEL.RESUME}....................'
)
if config.MODEL.RESUME.startswith('https'):
checkpoint = torch.hub.load_state_dict_from_url(config.MODEL.RESUME,
map_location='cpu',
check_hash=True)
else:
checkpoint = torch.load(config.MODEL.RESUME, map_location='cpu')
print('resuming model')
model_checkpoint = checkpoint['model']
msg = model.load_state_dict(model_checkpoint, strict=False)
logger.info(msg)
max_accuracy = 0.0
if not config.EVAL_MODE and 'optimizer' in checkpoint and 'lr_scheduler' in checkpoint and 'epoch' in checkpoint:
if optimizer is not None:
print('resuming optimizer')
try:
optimizer.load_state_dict(checkpoint['optimizer'])
except Exception:
print('resume optimizer failed')
if lr_scheduler is not None:
print('resuming lr_scheduler')
lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
config.defrost()
config.TRAIN.START_EPOCH = checkpoint['epoch'] + 1
config.freeze()
if 'amp' in checkpoint and config.AMP_OPT_LEVEL != 'O0' and checkpoint['config'].AMP_OPT_LEVEL != 'O0':
scaler.load_state_dict(checkpoint['amp'])
logger.info(
f"=> loaded successfully {config.MODEL.RESUME} (epoch {checkpoint['epoch']})"
)
if 'max_accuracy' in checkpoint:
max_accuracy = checkpoint['max_accuracy']
del checkpoint
torch.cuda.empty_cache()
return max_accuracy
def load_pretrained(config, model, logger):
logger.info(
f'==============> Loading weight {config.MODEL.PRETRAINED} for fine-tuning......'
)
checkpoint = torch.load(config.MODEL.PRETRAINED, map_location='cpu')
state_dict = checkpoint
if 'model' in checkpoint:
state_dict = checkpoint['model']
elif 'module' in checkpoint:
state_dict = checkpoint['module']
first_key = list(state_dict.keys())[0]
# delete teacher weights
if 'student' in first_key or 'teacher' in first_key:
new_state_dict = OrderedDict()
for k, v in state_dict.items():
if 'student_proj' in k:
continue
if 'student' in k:
new_k = k.replace('student.', '')
new_state_dict[new_k] = v
state_dict = new_state_dict
# weights from sim
if 'mask_token' in first_key:
new_state_dict = OrderedDict()
for k, v in state_dict.items():
if 'mm_dcnv3' in k:
continue
if 'dcnv3' not in k and 'clip_projector' not in k:
continue
new_k = k.replace('dcnv3.', '')
new_state_dict[new_k] = v
new_state_dict['fc_norm.weight'] = state_dict[
'clip.classifier_ln.weight']
new_state_dict['fc_norm.bias'] = state_dict['clip.classifier_ln.bias']
new_state_dict['head.weight'] = state_dict['clip.classifier.weight']
new_state_dict['head.bias'] = state_dict['clip.classifier.bias']
state_dict = new_state_dict
# delete relative_position_index since we always re-init it
relative_position_index_keys = [
k for k in state_dict.keys() if 'relative_position_index' in k
]
for k in relative_position_index_keys:
del state_dict[k]
# delete relative_coords_table since we always re-init it
relative_position_index_keys = [
k for k in state_dict.keys() if 'relative_coords_table' in k
]
for k in relative_position_index_keys:
del state_dict[k]
# delete attn_mask since we always re-init it
attn_mask_keys = [k for k in state_dict.keys() if 'attn_mask' in k]
for k in attn_mask_keys:
del state_dict[k]
# bicubic interpolate relative_position_bias_table if not match
relative_position_bias_table_keys = [
k for k in state_dict.keys() if 'relative_position_bias_table' in k
]
for k in relative_position_bias_table_keys:
relative_position_bias_table_pretrained = state_dict[k]
relative_position_bias_table_current = model.state_dict()[k]
L1, nH1 = relative_position_bias_table_pretrained.size()
L2, nH2 = relative_position_bias_table_current.size()
if nH1 != nH2:
logger.warning(f'Error in loading {k}, passing......')
else:
if L1 != L2:
# bicubic interpolate relative_position_bias_table if not match
S1 = int(L1 ** 0.5)
S2 = int(L2 ** 0.5)
relative_position_bias_table_pretrained_resized = torch.nn.functional.interpolate(
relative_position_bias_table_pretrained.permute(1, 0).view(1, nH1, S1, S1),
size=(S2, S2),
mode='bicubic')
state_dict[k] = relative_position_bias_table_pretrained_resized.view(nH2, L2).permute(1, 0)
# bicubic interpolate absolute_pos_embed if not match
absolute_pos_embed_keys = [
k for k in state_dict.keys() if 'absolute_pos_embed' in k
]
for k in absolute_pos_embed_keys:
# dpe
absolute_pos_embed_pretrained = state_dict[k]
absolute_pos_embed_current = model.state_dict()[k]
_, L1, C1 = absolute_pos_embed_pretrained.size()
_, L2, C2 = absolute_pos_embed_current.size()
if C1 != C2:
logger.warning(f'Error in loading {k}, passing......')
else:
if L1 != L2:
S1 = int(L1 ** 0.5)
S2 = int(L2 ** 0.5)
absolute_pos_embed_pretrained = absolute_pos_embed_pretrained.reshape(-1, S1, S1, C1)
absolute_pos_embed_pretrained = absolute_pos_embed_pretrained.permute(0, 3, 1, 2)
absolute_pos_embed_pretrained_resized = torch.nn.functional.interpolate(
absolute_pos_embed_pretrained,
size=(S2, S2),
mode='bicubic')
absolute_pos_embed_pretrained_resized = absolute_pos_embed_pretrained_resized.permute(0, 2, 3, 1)
absolute_pos_embed_pretrained_resized = absolute_pos_embed_pretrained_resized.flatten(1, 2)
state_dict[k] = absolute_pos_embed_pretrained_resized
# check classifier, if not match, then re-init classifier to zero
if 'head.bias' in state_dict:
head_bias_pretrained = state_dict['head.bias']
Nc1 = head_bias_pretrained.shape[0]
Nc2 = model.head.bias.shape[0]
if (Nc1 != Nc2):
if config.TRAIN.RAND_INIT_FT_HEAD:
model.head.weight.data = model.head.weight.data * 0.001
model.head.bias.data = model.head.bias.data * 0.001
del state_dict['head.weight']
del state_dict['head.bias']
logger.warning(f'Error in loading classifier head, re-init classifier head to 0')
elif Nc1 == 21841 and Nc2 == 1000:
logger.info('loading ImageNet-22K weight to ImageNet-1K ......')
map22kto1k_path = 'meta_data/map22kto1k.txt'
logger.info(map22kto1k_path)
with open(map22kto1k_path) as f:
map22kto1k = f.readlines()
map22kto1k = [int(id22k.strip()) for id22k in map22kto1k]
state_dict['head.weight'] = state_dict['head.weight'][map22kto1k, :]
state_dict['head.bias'] = state_dict['head.bias'][map22kto1k]
msg = model.load_state_dict(state_dict, strict=False)
logger.warning(msg)
logger.info(f'=> loaded successfully {config.MODEL.PRETRAINED}')
del checkpoint
torch.cuda.empty_cache()
def convert_22k_head_to_1k(model, logger):
head_weight = model.module.head.weight
head_bias = model.module.head.bias
Nc1 = head_bias.shape[0]
if Nc1 == 21841:
logger.info('converting ImageNet-22K head to ImageNet-1K ......')
map22kto1k_path = 'meta_data/map22kto1k.txt'
logger.info(map22kto1k_path)
with open(map22kto1k_path) as f:
map22kto1k = f.readlines()
map22kto1k = [int(id22k.strip()) for id22k in map22kto1k]
model.module.head.weight = torch.nn.Parameter(head_weight[map22kto1k, :])
model.module.head.bias = torch.nn.Parameter(head_bias[map22kto1k])
else:
logger.warning(f'Error in converting classifier head')
return model
def save_checkpoint(config,
epoch,
model,
max_accuracy,
optimizer,
lr_scheduler,
scaler,
logger,
model_ema=None,
max_accuracy_ema=None,
ema_decay=None,
model_ems=None,
max_accuracy_ems=None,
ems_model_num=None,
best=None):
save_state = {
'model': model.state_dict(),
'optimizer': optimizer.state_dict(),
'lr_scheduler': lr_scheduler.state_dict(),
'max_accuracy': max_accuracy,
'epoch': epoch,
'config': config
}
if model_ema is not None:
save_state['model_ema'] = get_state_dict(model_ema)
if max_accuracy_ema is not None:
save_state['max_accuracy_ema'] = max_accuracy_ema
if ema_decay is not None:
save_state['ema_decay'] = ema_decay
if model_ems is not None:
save_state['model_ems'] = get_state_dict(model_ems)
if max_accuracy_ems is not None:
save_state['max_accuracy_ems'] = max_accuracy_ems
if ems_model_num is not None:
save_state['ems_model_num'] = ems_model_num
if config.AMP_OPT_LEVEL != 'O0':
# save_state['amp'] = amp.state_dict()
save_state['amp'] = scaler.state_dict()
if best is None:
save_path = os.path.join(config.OUTPUT, f'ckpt_epoch_{epoch}.pth')
else:
save_path = os.path.join(config.OUTPUT, f'ckpt_epoch_{best}.pth')
logger.info(f'{save_path} saving......')
torch.save(save_state, save_path)
logger.info(f'{save_path} saved !!!')
if dist.get_rank() == 0 and isinstance(epoch, int):
to_del = epoch - config.SAVE_CKPT_NUM * config.SAVE_FREQ
old_ckpt = os.path.join(config.OUTPUT, f'ckpt_epoch_{to_del}.pth')
if os.path.exists(old_ckpt):
os.remove(old_ckpt)
def get_grad_norm(parameters, norm_type=2):
if isinstance(parameters, torch.Tensor):
parameters = [parameters]
parameters = list(filter(lambda p: p.grad is not None, parameters))
norm_type = float(norm_type)
total_norm = 0
for p in parameters:
param_norm = p.grad.data.norm(norm_type)
total_norm += param_norm.item() ** norm_type
total_norm = total_norm ** (1. / norm_type)
return total_norm
def auto_resume_helper(output_dir):
checkpoints = os.listdir(output_dir)
checkpoints = [ckpt for ckpt in checkpoints if ckpt.endswith('pth')]
print(f'All checkpoints found in {output_dir}: {checkpoints}')
if len(checkpoints) > 0:
latest_checkpoint = max(
[os.path.join(output_dir, d) for d in checkpoints],
key=os.path.getmtime)
print(f'The latest checkpoint found: {latest_checkpoint}')
resume_file = latest_checkpoint
else:
resume_file = None
return resume_file
def reduce_tensor(tensor):
rt = tensor.clone()
dist.all_reduce(rt, op=dist.ReduceOp.SUM)
rt /= dist.get_world_size()
return rt
# https://github.com/facebookresearch/ConvNeXt/blob/main/utils.py
class NativeScalerWithGradNormCount:
state_dict_key = 'amp_scaler'
def __init__(self):
self._scaler = torch.cuda.amp.GradScaler()
def __call__(self,
loss,
optimizer,
clip_grad=None,
parameters=None,
create_graph=False,
update_grad=True):
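# AMP step: scale the loss and backprop; when updating, unscale the gradients, optionally
# clip them, report the gradient norm, then step the optimizer and update the scaler.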
self._scaler.scale(loss).backward(create_graph=create_graph)
if update_grad:
if clip_grad is not None:
assert parameters is not None
self._scaler.unscale_(optimizer) # unscale the gradients of optimizer's assigned params in-place
norm = torch.nn.utils.clip_grad_norm_(parameters, clip_grad)
else:
self._scaler.unscale_(optimizer)
norm = get_grad_norm(parameters)
self._scaler.step(optimizer)
self._scaler.update()
else:
norm = None
return norm
def state_dict(self):
return self._scaler.state_dict()
def load_state_dict(self, state_dict):
self._scaler.load_state_dict(state_dict)
class MyAverageMeter(object):
"""Computes and stores the average and current value."""
def __init__(self, max_len=-1):
self.val_list = []
self.count = []
self.max_len = max_len
self.val = 0
self.avg = 0
self.var = 0
def update(self, val):
self.val = val
self.avg = 0
self.var = 0
if not math.isnan(val) and not math.isinf(val):
self.val_list.append(val)
if self.max_len > 0 and len(self.val_list) > self.max_len:
self.val_list = self.val_list[-self.max_len:]
if len(self.val_list) > 0:
self.avg = np.mean(np.array(self.val_list))
self.var = np.std(np.array(self.val_list))
=======
Credits
=======
* `Mehdi Cherti <https://github.com/mehdidc>`_
* `Romain Beaumont <https://github.com/rom1504>`_
.. highlight:: shell
============
Contributing
============
Contributions are welcome, and they are greatly appreciated! Every little bit
helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions
----------------------
Report Bugs
~~~~~~~~~~~
Report bugs at https://github.com/LAION-AI/CLIP_benchmark/issues.
If you are reporting a bug, please include:
* Your operating system name and version.
* Any details about your local setup that might be helpful in troubleshooting.
* Detailed steps to reproduce the bug.
Fix Bugs
~~~~~~~~
Look through the GitHub issues for bugs. Anything tagged with "bug" and "help
wanted" is open to whoever wants to implement it.
Implement Features
~~~~~~~~~~~~~~~~~~
Look through the GitHub issues for features. Anything tagged with "enhancement"
and "help wanted" is open to whoever wants to implement it.
Write Documentation
~~~~~~~~~~~~~~~~~~~
CLIP Benchmark could always use more documentation, whether as part of the
official CLIP Benchmark docs, in docstrings, or even on the web in blog posts,
articles, and such.
Submit Feedback
~~~~~~~~~~~~~~~
The best way to send feedback is to file an issue at https://github.com/LAION-AI/CLIP_benchmark/issues.
If you are proposing a feature:
* Explain in detail how it would work.
* Keep the scope as narrow as possible, to make it easier to implement.
* Remember that this is a volunteer-driven project, and that contributions
are welcome :)
Get Started!
------------
Ready to contribute? Here's how to set up `clip_benchmark` for local development.
1. Fork the `clip_benchmark` repo on GitHub.
2. Clone your fork locally::
$ git clone git@github.com:your_name_here/clip_benchmark.git
3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development::
$ mkvirtualenv clip_benchmark
$ cd clip_benchmark/
$ python setup.py develop
4. Create a branch for local development::
$ git checkout -b name-of-your-bugfix-or-feature
Now you can make your changes locally.
5. When you're done making changes, check that your changes pass flake8 and the
tests, including testing other Python versions with tox::
$ flake8 clip_benchmark tests
$ python setup.py test or pytest
$ tox
To get flake8 and tox, just pip install them into your virtualenv.
6. Commit your changes and push your branch to GitHub::
$ git add .
$ git commit -m "Your detailed description of your changes."
$ git push origin name-of-your-bugfix-or-feature
7. Submit a pull request through the GitHub website.
Pull Request Guidelines
-----------------------
Before you submit a pull request, check that it meets these guidelines:
1. The pull request should include tests.
2. If the pull request adds functionality, the docs should be updated. Put
your new functionality into a function with a docstring, and add the
feature to the list in README.rst.
3. The pull request should work for Python 3.5, 3.6, 3.7 and 3.8, and for PyPy. Check
https://travis-ci.com/mehdidc/clip_benchmark/pull_requests
and make sure that the tests pass for all supported Python versions.
Tips
----
To run a subset of tests::
$ python -m unittest tests.test_clip_benchmark
Deploying
---------
A reminder for the maintainers on how to deploy.
Make sure all your changes are committed (including an entry in HISTORY.rst).
Then run::
$ bump2version patch # possible: major / minor / patch
$ git push
$ git push --tags
Travis will then deploy to PyPI if tests pass.
## History
### 1.4.0
* Fix silent webdataset error-handling
* Added support for wds/voc2007_multilabel
* default to float32
* add mscoco generative benchmark
### 1.3.0
* update flickr8k results, solve issue #48, thanks to @orchidmajumder
* Evaluate multiple models/datasets/languages using the CLI directly
* Support Japanese CLIP by rinna
* Add arabic imagenet
* updating CuPL prompts with more generated sentences + ensembled with openAI prompts
* put model in eval mode before evaluation
* Webdataset updates
* Make verbose the default
### 1.2.0
* Added support for loading webdatasets
### 1.1.0
* Added better support for multilingual eval
* Added better support for linear probing
* Added support for CuPL prompts
### 1.0.1
* pypi description as markdown
### 1.0.0
* Actual first release on PyPI.
### 0.1.0
* First release on PyPI.
MIT License
Copyright (c) 2022, Mehdi Cherti
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
include AUTHORS.rst
include CONTRIBUTING.rst
include HISTORY.rst
include LICENSE
include README.rst
recursive-include tests *
recursive-exclude * __pycache__
recursive-exclude * *.py[co]
recursive-include * *.json
recursive-include docs *.rst conf.py Makefile make.bat *.jpg *.png *.gif
.PHONY: clean clean-build clean-pyc clean-test coverage dist docs help install lint lint/flake8
.DEFAULT_GOAL := help
define BROWSER_PYSCRIPT
import os, webbrowser, sys
from urllib.request import pathname2url
webbrowser.open("file://" + pathname2url(os.path.abspath(sys.argv[1])))
endef
export BROWSER_PYSCRIPT
define PRINT_HELP_PYSCRIPT
import re, sys
for line in sys.stdin:
match = re.match(r'^([a-zA-Z_-]+):.*?## (.*)$$', line)
if match:
target, help = match.groups()
print("%-20s %s" % (target, help))
endef
export PRINT_HELP_PYSCRIPT
BROWSER := python -c "$$BROWSER_PYSCRIPT"
help:
@python -c "$$PRINT_HELP_PYSCRIPT" < $(MAKEFILE_LIST)
clean: clean-build clean-pyc clean-test ## remove all build, test, coverage and Python artifacts
clean-build: ## remove build artifacts
rm -fr build/
rm -fr dist/
rm -fr .eggs/
find . -name '*.egg-info' -exec rm -fr {} +
find . -name '*.egg' -exec rm -f {} +
clean-pyc: ## remove Python file artifacts
find . -name '*.pyc' -exec rm -f {} +
find . -name '*.pyo' -exec rm -f {} +
find . -name '*~' -exec rm -f {} +
find . -name '__pycache__' -exec rm -fr {} +
clean-test: ## remove test and coverage artifacts
rm -fr .tox/
rm -f .coverage
rm -fr htmlcov/
rm -fr .pytest_cache
lint/flake8: ## check style with flake8
flake8 clip_benchmark tests
lint: lint/flake8 ## check style
test-all: ## run tests on every Python version with tox
tox
coverage: ## check code coverage quickly with the default Python
coverage run --source clip_benchmark setup.py test
coverage report -m
coverage html
$(BROWSER) htmlcov/index.html
docs: ## generate Sphinx HTML documentation, including API docs
rm -f docs/clip_benchmark.rst
rm -f docs/modules.rst
sphinx-apidoc -o docs/ clip_benchmark
$(MAKE) -C docs clean
$(MAKE) -C docs html
$(BROWSER) docs/_build/html/index.html
servedocs: docs ## compile the docs watching for changes
watchmedo shell-command -p '*.rst' -c '$(MAKE) -C docs html' -R -D .
release: dist ## package and upload a release
twine upload dist/*
dist: clean ## builds source and wheel package
python setup.py sdist
python setup.py bdist_wheel
ls -l dist
install: ## [Local development] Upgrade pip, install requirements, install package.
python -m pip install -U pip
python -m pip install -e .
install-dev: ## [Local development] Install test requirements
python -m pip install -r requirements-test.txt
test: ## [Local development] Run unit tests
python -m pytest -x -s -v tests
# InternVL for Zero-Shot Image Classification & Image-Text Retrieval
This folder contains the implementation of InternVL 1.0 for zero-shot image classification and zero-shot image-text retrieval, which corresponds to Section 4.3 of our [InternVL 1.0 paper](https://arxiv.org/pdf/2312.14238).
We mainly use [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark) to evaluate the performance of InternVL. Thanks for this great work.
## 🛠️ Installation
First, follow the [installation guide](../INSTALLATION.md) to set up the basic environment.
In addition, using this codebase requires executing the following steps:
- Install other requirements:
```bash
pip install -r requirements.txt
```
- Install `clip_benchmark` in development mode:
```bash
python setup.py develop
# You can also add the current directory to PYTHONPATH instead.
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
```
## 📦 Data Preparation
This codebase automatically downloads the required datasets. If a dataset fails to download automatically, please refer to [builder.py](./clip_benchmark/datasets/builder.py) and download it manually.
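If you download datasets manually, place them under `./data/` so that the `--dataset_root` paths used in the evaluation commands below resolve. A sketch of the assumed layout (the exact folder names are defined in `builder.py`):

```sh
data
├── imagenet-1k/
├── imagenet-a/
├── imagenet-r/
├── imagenetv2/
├── imagenet-sketch/
└── objectnet-1.0/
```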
## 📦 Model Preparation
| model name | type | download | size |
| ------------------------ | :---------: | ------------------------------------------------------------------------------------------ | :-----: |
| internvl_c_13b_224px.pth | pytorch | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL/blob/main/internvl_c_13b_224px.pth) | 25.4 GB |
| InternVL-14B-224px | huggingface | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px) | 27.7 GB |
Please download the above model weights and place them in the `pretrained/` folder.
You can download either the PyTorch version or the Hugging Face version based on your needs.
```sh
cd pretrained/
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/internvl_c_13b_224px.pth
# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL-14B-224px --local-dir InternVL-14B-224px
```
The directory structure is:
```sh
pretrained
├── internvl_c_13b_224px.pth
└── InternVL-14B-224px/
```
## 📊 Evaluation: Zero-Shot Image Classification
### ImageNet variants and ObjectNet
| model name | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet | ∆ | average |
| :--------: | :---: | :--: | :--: | :---: | :-------: | :-------: | :-: | :-----: |
| InternVL-C | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 | 0.8 | 82.4 |
<details>
<summary>[InternVL-C] ImageNet-1K val</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" \
--task "zeroshot_classification" --dataset "imagenet1k" --dataset_root ./data/imagenet-1k/ \
--model internvl_c_classification --pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "imagenet1k", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.83178, "acc5": 0.97322, "mean_per_class_recall": 0.83204}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] ImageNet-A</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" \
--task "zeroshot_classification" --dataset "imagenet-a" --dataset_root ./data/imagenet-a/ \
--model internvl_c_classification --pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "imagenet-a", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.8377333333333333, "acc5": 0.9558666666666666, "mean_per_class_recall": 0.8183934468491632}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] ImageNet-R</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" \
--task "zeroshot_classification" --dataset "imagenet-r" --dataset_root ./data/imagenet-r/ \
--model internvl_c_classification --pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "imagenet-r", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.9549666666666666, "acc5": 0.9918333333333333, "mean_per_class_recall": 0.9460205918105684}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] ImageNet-V2</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" \
--task "zeroshot_classification" --dataset "imagenetv2" --dataset_root ./data/imagenetv2/ \
--model internvl_c_classification --pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "imagenetv2", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.7726, "acc5": 0.9468, "mean_per_class_recall": 0.7738000000000001}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] ImageNet-Sketch</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" \
--task "zeroshot_classification" --dataset "imagenet_sketch" --dataset_root ./data/imagenet-sketch/ \
--model internvl_c_classification --pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "imagenet_sketch", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.7385879070133035, "acc5": 0.9199827074613374, "mean_per_class_recall": 0.7386403921568627}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] ObjectNet</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" \
--task "zeroshot_classification" --dataset "objectnet" --dataset_root ./data/objectnet-1.0/ \
--model internvl_c_classification --pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "objectnet", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.8059114891784215, "acc5": 0.9387853989447615, "mean_per_class_recall": 0.797040815749882}, "language": "en"}
```
</details>
### Multilingual ImageNet-1K
| model name | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) | average |
| :--------: | :--------: | :--------: | :--------: | :--------: | :--------: | :-----: |
| InternVL-C | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 | 64.0 |
<details>
<summary>[InternVL-C] ImageNet-1K val (ZH, Chinese)</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "cn" \
--task "zeroshot_classification" --dataset "imagenet1k" --dataset_root ./data/imagenet-1k/ \
--model internvl_c_classification --pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "imagenet1k", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.6446, "acc5": 0.87842, "mean_per_class_recall": 0.6444200000000001}, "language": "cn"}
```
</details>
<details>
<summary>[InternVL-C] ImageNet-1K val (JP, Japanese)</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "jp" \
--task "zeroshot_classification" --dataset "imagenet1k" --dataset_root ./data/imagenet-1k/ \
--model internvl_c_classification --pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "imagenet1k", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.61488, "acc5": 0.81146, "mean_per_class_recall": 0.6140599999999999}, "language": "jp"}
```
</details>
<details>
<summary>[InternVL-C] ImageNet-1K val (AR, Arabic)</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "ar" \
--task "zeroshot_classification" --dataset "imagenet1k" --dataset_root ./data/imagenet-1k/ \
--model internvl_c_classification --pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "imagenet1k", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.4486, "acc5": 0.66418, "mean_per_class_recall": 0.44764}, "language": "ar"}
```
</details>
<details>
<summary>[InternVL-C] ImageNet-1K val (IT, Italian)</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "it" \
--task "zeroshot_classification" --dataset "imagenet1k" --dataset_root ./data/imagenet-1k/ \
--model internvl_c_classification --pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "imagenet1k", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.65686, "acc5": 0.85254, "mean_per_class_recall": 0.6557799999999999}, "language": "it"}
```
</details>
### Other Datasets
<img width="1219" alt="image" src="https://github.com/OpenGVLab/InternVL/assets/23737120/5de18a6c-8979-432d-bcb6-eb7796b4a08f">
<details>
<summary>[InternVL-C] CIFAR-10</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "cifar10" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "cifar10", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.9935, "acc5": 0.9996, "mean_per_class_recall": 0.9935}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] CIFAR-100</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "cifar100" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "cifar100", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.9315, "acc5": 0.9925, "mean_per_class_recall": 0.9314}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] MNIST</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "mnist" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "mnist", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.806, "acc5": 0.9743, "mean_per_class_recall": 0.8028667364603377}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] Caltech-101</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "caltech101" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "caltech101", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.8949037620297463, "acc5": 0.9847987751531059, "mean_per_class_recall": 0.9548738053818752}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] SUN397</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "sun397" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "sun397", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.7600180223256157, "acc5": 0.9623370174890119, "mean_per_class_recall": 0.7641970904214413}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] FGVC Aircraft</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "fgvc_aircraft" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "fgvc_aircraft", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.5271527152715272, "acc5": 0.9426942694269427, "mean_per_class_recall": 0.5255169340463458}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] Country-211</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "country211" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "country211", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.34080568720379145, "acc5": 0.6048815165876777, "mean_per_class_recall": 0.3406635071090047}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] Stanford Cars</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "cars" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "cars", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.9416739211540853, "acc5": 0.99950254943415, "mean_per_class_recall": 0.9416684924576828}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] Birdsnap</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "birdsnap" --dataset_root ./data/birdsnap/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "birdsnap", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.7203252032520325, "acc5": 0.9636856368563685, "mean_per_class_recall": 0.7027551020408164}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] DTD</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "dtd" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "dtd", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.7074468085106383, "acc5": 0.9367021276595745, "mean_per_class_recall": 0.7079787234042553}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] Eurosat</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "eurosat" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "eurosat", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.7937407407407407, "acc5": 0.9984074074074074, "mean_per_class_recall": 0.8013766666666665}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] FER2013</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "fer2013" --dataset_root ./data/fer2013 --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "fer2013", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.561994984675397, "acc5": 0.9732516021175815, "mean_per_class_recall": 0.5305440899910082}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] Flowers-102</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "vtab/flowers" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "vtab/flowers", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.8606277443486746, "acc5": 0.953651000162628, "mean_per_class_recall": 0.8563173902114554}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] Food-101</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "food101" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "food101", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.9526336633663366, "acc5": 0.9954851485148515, "mean_per_class_recall": 0.9527524752475246}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] GTSRB</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "gtsrb" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "gtsrb", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.6548693586698338, "acc5": 0.9089469517022961, "mean_per_class_recall": 0.5775180283147926}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] Pets</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "pets" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "pets", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.9604796947397111, "acc5": 0.9991823385118561, "mean_per_class_recall": 0.9602545246926443}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] Rendered SST2</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "renderedsst2" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "renderedsst2", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.6792970895112576, "acc5": NaN, "mean_per_class_recall": 0.6792944097041282}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] Resisc45</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "vtab/resisc45" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "vtab/resisc45", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.7422631328360577, "acc5": 0.9663545468973179, "mean_per_class_recall": 0.7481098478511045}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] STL10</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "stl10" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "stl10", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.9945, "acc5": 1.0, "mean_per_class_recall": 0.9945}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] VOC2007</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_classification" \
--dataset "voc2007" --dataset_root ./data/ --model internvl_c_classification \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "voc2007", "model": "internvl_c_classification", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_classification",
"metrics": {"acc1": 0.7997462606837606, "acc5": 0.9795005341880342, "mean_per_class_recall": 0.9048832641726575}, "language": "en"}
```
</details>
## 📊 Evaluation: Zero-Shot Image-Text Retrieval
### Flickr30K & COCO
<table>
<tr align=center>
<td rowspan="3" align=center><b>model</b></td>
<td colspan="6" align=center><b>Flickr30K</b></td>
<td colspan="6" align=center><b>COCO</b></td>
<td rowspan="3" align=center><b>avg</b></td>
</tr>
<tr align=center>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
</tr>
<tr>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
</tr>
<tr align=center>
<td>InternVL-C</td>
<td>94.7</td>
<td>99.6</td>
<td>99.9</td>
<td>81.7</td>
<td>96.0</td>
<td>98.2</td>
<td>70.6</td>
<td>89.0</td>
<td>93.5</td>
<td>54.1</td>
<td>77.3</td>
<td>84.6</td>
<td>86.6</td>
</tr>
<tr align=center>
<td>InternVL-G</td>
<td>95.7</td>
<td>99.7</td>
<td>99.9</td>
<td>85.0</td>
<td>97.0</td>
<td>98.6</td>
<td>74.9</td>
<td>91.3</td>
<td>95.2</td>
<td>58.6</td>
<td>81.3</td>
<td>88.0</td>
<td>88.8</td>
</tr>
</table>
<details>
<summary>[InternVL-C] Flickr30K</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_retrieval" \
--dataset "flickr30k" --dataset_root ./data/flickr30k --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "flickr30k", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval",
"metrics": {"image_retrieval_recall@1": 0.8166000247001648, "text_retrieval_recall@1": 0.9470000267028809,
"image_retrieval_recall@5": 0.9603999853134155, "text_retrieval_recall@5": 0.9959999918937683,
"image_retrieval_recall@10": 0.9819999933242798, "text_retrieval_recall@10": 0.9990000128746033}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-C] COCO</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_retrieval" \
--dataset "mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "mscoco_captions", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval",
"metrics": {"image_retrieval_recall@1": 0.5411835312843323, "text_retrieval_recall@1": 0.7059999704360962,
"image_retrieval_recall@5": 0.7731707096099854, "text_retrieval_recall@5": 0.8902000188827515,
"image_retrieval_recall@10": 0.8463414907455444, "text_retrieval_recall@10": 0.9354000091552734}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-G] Flickr30K</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_retrieval" \
--dataset "flickr30k" --dataset_root ./data/flickr30k --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json
```
Expected results:
```
{"dataset": "flickr30k", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval",
"metrics": {"image_retrieval_recall@1": 0.8497999906539917, "text_retrieval_recall@1": 0.9570000171661377,
"image_retrieval_recall@5": 0.9700000286102295, "text_retrieval_recall@5": 0.996999979019165,
"image_retrieval_recall@10": 0.98580002784729, "text_retrieval_recall@10": 0.9990000128746033}, "language": "en"}
```
</details>
<details>
<summary>[InternVL-G] COCO</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "en" --task "zeroshot_retrieval" \
--dataset "mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json
```
Expected results:
```
{"dataset": "mscoco_captions", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval",
"metrics": {"image_retrieval_recall@1": 0.5858056545257568, "text_retrieval_recall@1": 0.7491999864578247,
"image_retrieval_recall@5": 0.813194751739502, "text_retrieval_recall@5": 0.9129999876022339,
"image_retrieval_recall@10": 0.8795281648635864, "text_retrieval_recall@10": 0.9521999955177307}, "language": "en"}
```
</details>
### Flickr30K-CN & COCO-CN
<table>
<tr align=center>
<td rowspan="3" align=center><b>model</b></td>
<td colspan="6" align=center><b>Flickr30K-CN</b></td>
<td colspan="6" align=center><b>COCO-CN</b></td>
<td rowspan="3" align=center><b>avg</b></td>
</tr>
<tr align=center>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
<td colspan="3" align=center><b>image-to-text</b></td>
<td colspan="3" align=center><b>text-to-image</b></td>
</tr>
<tr>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
<td>R@1</td>
<td>R@5</td>
<td>R@10</td>
</tr>
<tr align=center>
<td>InternVL-C</td>
<td>90.3</td>
<td>98.8</td>
<td>99.7</td>
<td>75.1</td>
<td>92.9</td>
<td>96.4</td>
<td>68.8</td>
<td>92.0</td>
<td>96.7</td>
<td>68.9</td>
<td>91.9</td>
<td>96.5</td>
<td>89.0</td>
</tr>
<tr align=center>
<td>InternVL-G</td>
<td>92.9</td>
<td>99.4</td>
<td>99.8</td>
<td>77.7</td>
<td>94.8</td>
<td>97.3</td>
<td>71.4</td>
<td>93.9</td>
<td>97.7</td>
<td>73.8</td>
<td>94.4</td>
<td>98.1</td>
<td>90.9</td>
</tr>
</table>
<details>
<summary>[InternVL-C] Flickr30K-CN</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "cn" --task "zeroshot_retrieval" \
--dataset "flickr30k" --dataset_root ./data/flickr30k --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "flickr30k", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval",
"metrics": {"image_retrieval_recall@1": 0.7509999871253967, "text_retrieval_recall@1": 0.902999997138977,
"image_retrieval_recall@5": 0.9290000200271606, "text_retrieval_recall@5": 0.9879999756813049,
"image_retrieval_recall@10": 0.9638000130653381, "text_retrieval_recall@10": 0.996999979019165}, "language": "cn"}
```
</details>
<details>
<summary>[InternVL-C] COCO-CN</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "cn" --task "zeroshot_retrieval" \
--dataset "mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json
```
Expected results:
```
{"dataset": "mscoco_captions", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval",
"metrics": {"image_retrieval_recall@1": 0.6885090470314026, "text_retrieval_recall@1": 0.6880000233650208,
"image_retrieval_recall@5": 0.9192782640457153, "text_retrieval_recall@5": 0.9200000166893005,
"image_retrieval_recall@10": 0.9648622870445251, "text_retrieval_recall@10": 0.9670000076293945}, "language": "cn"}
```
</details>
<details>
<summary>[InternVL-G] Flickr30K-CN</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "cn" --task "zeroshot_retrieval" \
--dataset "flickr30k" --dataset_root ./data/flickr30k --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json
```
Expected results:
```
{"dataset": "flickr30k", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval",
"metrics": {"image_retrieval_recall@1": 0.7767999768257141, "text_retrieval_recall@1": 0.9290000200271606,
"image_retrieval_recall@5": 0.9476000070571899, "text_retrieval_recall@5": 0.9940000176429749,
"image_retrieval_recall@10": 0.9728000164031982, "text_retrieval_recall@10": 0.9980000257492065}, "language": "cn"}
```
</details>
<details>
<summary>[InternVL-G] COCO-CN</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --language "cn" --task "zeroshot_retrieval" \
--dataset "mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json
```
Expected results:
```
{"dataset": "mscoco_captions", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval",
"metrics": {"image_retrieval_recall@1": 0.7378917336463928, "text_retrieval_recall@1": 0.7139999866485596,
"image_retrieval_recall@5": 0.9439696073532104, "text_retrieval_recall@5": 0.9390000104904175,
"image_retrieval_recall@10": 0.9810066223144531, "text_retrieval_recall@10": 0.9769999980926514}, "language": "cn"}
```
</details>
### XTD
| model name | EN | ES | FR | ZH | IT | KO | RU | JP | average |
| :--------: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :-----: |
| InternVL-C | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
| InternVL-G | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
<details>
<summary>[InternVL-C] XTD</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json --language=en
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json --language=es
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json --language=fr
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json --language=zh
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json --language=it
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json --language=ko
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json --language=ru
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_c_retrieval \
--pretrained ./pretrained/internvl_c_13b_224px.pth --output result.json --language=jp
```
Expected results:
```
{"dataset": "multilingual_mscoco_captions", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.7670000195503235, "text_retrieval_recall@1": 0.7480000257492065, "image_retrieval_recall@5": 0.9200000166893005, "text_retrieval_recall@5": 0.921999990940094, "image_retrieval_recall@10": 0.9670000076293945, "text_retrieval_recall@10": 0.9729999899864197}, "language": "en"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.7059999704360962, "text_retrieval_recall@1": 0.7009999752044678, "image_retrieval_recall@5": 0.9020000100135803, "text_retrieval_recall@5": 0.8960000276565552, "image_retrieval_recall@10": 0.9430000185966492, "text_retrieval_recall@10": 0.9570000171661377}, "language": "es"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.6970000267028809, "text_retrieval_recall@1": 0.6899999976158142, "image_retrieval_recall@5": 0.8830000162124634, "text_retrieval_recall@5": 0.8889999985694885, "image_retrieval_recall@10": 0.9350000023841858, "text_retrieval_recall@10": 0.9509999752044678}, "language": "fr"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.6480000019073486, "text_retrieval_recall@1": 0.6710000038146973, "image_retrieval_recall@5": 0.8759999871253967, "text_retrieval_recall@5": 0.8769999742507935, "image_retrieval_recall@10": 0.9419999718666077, "text_retrieval_recall@10": 0.9559999704360962}, "language": "zh"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.6790000200271606, "text_retrieval_recall@1": 0.7039999961853027, "image_retrieval_recall@5": 0.8989999890327454, "text_retrieval_recall@5": 0.8999999761581421, "image_retrieval_recall@10": 0.9440000057220459, "text_retrieval_recall@10": 0.9599999785423279}, "language": "it"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.5830000042915344, "text_retrieval_recall@1": 0.5920000076293945, "image_retrieval_recall@5": 0.8399999737739563, "text_retrieval_recall@5": 0.8360000252723694, "image_retrieval_recall@10": 0.9079999923706055, "text_retrieval_recall@10": 0.921999990940094}, "language": "ko"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.6430000066757202, "text_retrieval_recall@1": 0.6439999938011169, "image_retrieval_recall@5": 0.8510000109672546, "text_retrieval_recall@5": 0.8640000224113464, "image_retrieval_recall@10": 0.9169999957084656, "text_retrieval_recall@10": 0.9330000281333923}, "language": "ru"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_c_retrieval", "pretrained": "./pretrained/internvl_c_13b_224px.pth", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.6330000162124634, "text_retrieval_recall@1": 0.6759999990463257, "image_retrieval_recall@5": 0.875, "text_retrieval_recall@5": 0.8989999890327454, "image_retrieval_recall@10": 0.9359999895095825, "text_retrieval_recall@10": 0.9549999833106995}, "language": "jp"}
```
</details>
<details>
<summary>[InternVL-G] XTD</summary>
```bash
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json --language=en
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json --language=es
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json --language=fr
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json --language=zh
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json --language=it
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json --language=ko
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json --language=ru
CUDA_VISIBLE_DEVICES=0 python3 clip_benchmark/cli.py eval --model_type internvl --task "zeroshot_retrieval" \
--dataset "multilingual_mscoco_captions" --dataset_root ./data/mscoco_captions --model internvl_g_retrieval_hf \
--pretrained ./pretrained/InternVL-14B-224px --output result_g.json --language=jp
```
Expected results:
```
{"dataset": "multilingual_mscoco_captions", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.8119999766349792, "text_retrieval_recall@1": 0.7979999780654907, "image_retrieval_recall@5": 0.9470000267028809, "text_retrieval_recall@5": 0.9480000138282776, "image_retrieval_recall@10": 0.9829999804496765, "text_retrieval_recall@10": 0.9860000014305115}, "language": "en"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.7549999952316284, "text_retrieval_recall@1": 0.7450000047683716, "image_retrieval_recall@5": 0.9350000023841858, "text_retrieval_recall@5": 0.925000011920929, "image_retrieval_recall@10": 0.9660000205039978, "text_retrieval_recall@10": 0.9769999980926514}, "language": "es"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.7450000047683716, "text_retrieval_recall@1": 0.7279999852180481, "image_retrieval_recall@5": 0.9179999828338623, "text_retrieval_recall@5": 0.9190000295639038, "image_retrieval_recall@10": 0.9620000123977661, "text_retrieval_recall@10": 0.9649999737739563}, "language": "fr"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.6980000138282776, "text_retrieval_recall@1": 0.6949999928474426, "image_retrieval_recall@5": 0.9120000004768372, "text_retrieval_recall@5": 0.9110000133514404, "image_retrieval_recall@10": 0.9620000123977661, "text_retrieval_recall@10": 0.9670000076293945}, "language": "zh"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.7329999804496765, "text_retrieval_recall@1": 0.7450000047683716, "image_retrieval_recall@5": 0.9309999942779541, "text_retrieval_recall@5": 0.9309999942779541, "image_retrieval_recall@10": 0.9639999866485596, "text_retrieval_recall@10": 0.968999981880188}, "language": "it"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.6430000066757202, "text_retrieval_recall@1": 0.6470000147819519, "image_retrieval_recall@5": 0.8790000081062317, "text_retrieval_recall@5": 0.8769999742507935, "image_retrieval_recall@10": 0.9419999718666077, "text_retrieval_recall@10": 0.9509999752044678}, "language": "ko"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.6850000023841858, "text_retrieval_recall@1": 0.6899999976158142, "image_retrieval_recall@5": 0.8740000128746033, "text_retrieval_recall@5": 0.8920000195503235, "image_retrieval_recall@10": 0.9390000104904175, "text_retrieval_recall@10": 0.9480000138282776}, "language": "ru"}
{"dataset": "multilingual_mscoco_captions", "model": "internvl_g_retrieval_hf", "pretrained": "./pretrained/InternVL-14B-224px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@1": 0.6850000023841858, "text_retrieval_recall@1": 0.703000009059906, "image_retrieval_recall@5": 0.9020000100135803, "text_retrieval_recall@5": 0.9100000262260437, "image_retrieval_recall@10": 0.9539999961853027, "text_retrieval_recall@10": 0.9610000252723694}, "language": "jp"}
```
</details>
## Original README of CLIP Benchmark
[![pypi](https://img.shields.io/pypi/v/clip_benchmark.svg)](https://pypi.python.org/pypi/clip_benchmark)
The goal of this repo is to evaluate CLIP-like models on a standard set of datasets across different tasks, such as zero-shot classification and zero-shot retrieval.
Below we show the average rank (1 is the best, lower is better) of different CLIP models, evaluated
on different datasets.
![benchmark.png](benchmark.png)
The current detailed results of the benchmark can be seen [here](benchmark/README.md)
or directly in the [notebook](benchmark/results.ipynb).
### Features
- Support for zero-shot classification and zero-shot retrieval
- Support for [OpenCLIP](https://github.com/mlfoundations/open_clip) pre-trained models
- Support for various datasets from [torchvision](https://pytorch.org/vision/stable/datasets.html), [tensorflow datasets](https://www.tensorflow.org/datasets), and [VTAB](https://github.com/google-research/task_adaptation)
- Support for [Japanese CLIP by rinna](https://github.com/rinnakk/japanese-clip)
### How to install?
`pip install clip-benchmark`
### How to use?
To evaluate, we recommend creating a `models.txt` file like:
```
ViT-B-32,openai
```
To get the list of datasets:
```
wget https://raw.githubusercontent.com/LAION-AI/CLIP_benchmark/main/benchmark/webdatasets.txt
```
Then run the benchmark:
```
clip_benchmark eval --pretrained_model models.txt \
--dataset "webdatasets.txt" \
--dataset_root "https://huggingface.co/datasets/clip-benchmark/wds_{dataset_cleaned}/tree/main" \
--output "benchmark_{dataset}_{pretrained}_{model}_{language}_{task}.json"
```
Then, to build the full results table:
```
clip_benchmark build benchmark_*.json --output benchmark.csv
```
#### Command line interface (CLI)
The easiest way to benchmark the models is using the CLI, `clip_benchmark`.
You can specify the model to use, the dataset, and the task to evaluate on. Once the run is done, the results are written into a JSON file.
#### Using other models than openclip
It is possible to use models other than OpenCLIP ones; for example, japanese-clip is supported.
Here is an example of use:
```
>>> python3 clip_benchmark/cli.py eval \
--model_type "ja_clip" \ # flag to use japanese-clip
--pretrained "rinna/japanese-cloob-vit-b-16" \ # now, we have `rinna/japanese-cloob-vit-b-16` or `rinna/japanese-clip-vit-b-16`.
--language "jp" \
--task "zeroshot_classification" \
--dataset "imagenet1k" \
--dataset_root {ROOT_PATH}
>>> cat result.json
{"dataset": "imagenet1k", "model": "ViT-B-32-quickgelu", "pretrained": "rinna/japanese-cloob-vit-b-16", "task": "zeroshot_classification", "metrics": {"acc1": 0.54636, "acc5": 0.72856, "mean_per_class_recall": 0.54522}, "language": "jp"}
```
#### How to add other CLIP models
Please follow these steps:
1. Add an identity file to load the model in `clip_benchmark/models`
2. Define a loading function that returns a tuple (model, transform, tokenizer), as sketched below. See `clip_benchmark/models/open_clip.py` as an example.
3. Add the function to `TYPE2FUNC` in `clip_benchmark/models/__init__.py`
Remarks:
- The new tokenizer/model must support the following operations, as described in https://github.com/openai/CLIP#usage:
  - `tokenizer(texts).to(device)`, where `texts` is a list of strings
  - `model.encode_text(tokenized_texts)`, where `tokenized_texts` is the output of `tokenizer(texts).to(device)`
  - `model.encode_image(images)`, where `images` is an image tensor produced by the `transform`
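Below is a minimal, self-contained sketch of such a loading function. All names here (`MyTokenizer`, `MyCLIP`, `load_my_clip`, `my_clip`) are hypothetical placeholders rather than part of the benchmark, and the exact loader signature should be checked against `clip_benchmark/models/open_clip.py`; the point is only to show the interface a custom model has to expose.

```python
# Hypothetical sketch of a custom loader; only the returned interface matters:
# tokenizer(texts).to(device), model.encode_text(...), model.encode_image(...).
import torch
import torch.nn as nn
from torchvision import transforms


class MyTokenizer:
    """Toy tokenizer mapping each text to a fixed-length tensor of token ids."""

    def __init__(self, context_length=77, vocab_size=30000):
        self.context_length = context_length
        self.vocab_size = vocab_size

    def __call__(self, texts):
        out = torch.zeros(len(texts), self.context_length, dtype=torch.long)
        for i, text in enumerate(texts):
            ids = [hash(w) % self.vocab_size for w in text.split()][: self.context_length]
            if ids:
                out[i, : len(ids)] = torch.tensor(ids, dtype=torch.long)
        return out  # a plain tensor, so .to(device) works


class MyCLIP(nn.Module):
    """Toy CLIP-like model exposing encode_text / encode_image."""

    def __init__(self, embed_dim=512, vocab_size=30000):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.visual = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))

    def encode_text(self, tokenized_texts):
        return self.token_embedding(tokenized_texts).mean(dim=1)

    def encode_image(self, images):
        return self.visual(images)


def load_my_clip(pretrained, device="cuda", **kwargs):
    """Return the (model, transform, tokenizer) tuple expected by the benchmark."""
    model = MyCLIP().to(device).eval()
    # model.load_state_dict(torch.load(pretrained, map_location=device))  # if weights exist
    transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    return model, transform, MyTokenizer()
```

After step 3 (e.g. adding a hypothetical `"my_clip": load_my_clip` entry to `TYPE2FUNC`), such a model could then be selected with `--model_type my_clip`.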
#### CIFAR-10 example
Here is an example for CIFAR-10 zero-shot classification using OpenCLIP's pre-trained model on LAION-400m:
`clip_benchmark eval --dataset=cifar10 --task=zeroshot_classification --pretrained=laion400m_e32 --model=ViT-B-32-quickgelu --output=result.json --batch_size=64`
By default, the dataset is downloaded into `--dataset_root`, which defaults to `root`.
Here is the content of `result.json` after the evaluation is done:
```json
{
"dataset": "cifar10", "model": "ViT-B-32-quickgelu",
"pretrained": "laion400m_e32", "task": "zeroshot_classification",
"metrics": {"acc1": 0.9074, "acc5": 0.998}
}
```
#### VOC2007 example
Here is another example with VOC2007, which is a multi-label classification dataset.
`clip_benchmark eval --dataset=voc2007_multilabel --task=zeroshot_classification --pretrained=laion400m_e32 --model=ViT-B-32-quickgelu --output=result.json --batch_size=64`
Here is the content of `result.json` after the evaluation is done:
```json
{"dataset": "voc2007_multilabel", "model": "ViT-B-32-quickgelu", "pretrained": "laion400m_e32", "task": "zeroshot_classification", "metrics": {"mean_average_precision": 0.7627869844436646}}
```
Here, we compute the mean average precision (mAP); more details about this metric in the context of multi-label classification can be found [here](https://fangdahan.medium.com/calculate-mean-average-precision-map-for-multi-label-classification-b082679d31be).
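For intuition, mAP is simply the per-class average precision averaged over all classes. A tiny illustration of the metric itself (not the benchmark's internal code), using scikit-learn:

```python
# Illustration of multi-label mAP with scikit-learn (not the benchmark's own code).
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([[1, 0, 1],          # binary label matrix, shape (n_samples, n_classes)
                   [0, 1, 0],
                   [1, 1, 0]])
y_score = np.array([[0.8, 0.1, 0.6],   # model scores, same shape
                    [0.2, 0.9, 0.3],
                    [0.7, 0.4, 0.2]])

# "macro" averages the per-class average precision, i.e. mAP.
print(average_precision_score(y_true, y_score, average="macro"))
```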
#### VTAB example
Here is an example on how to run it on [VTAB](https://github.com/google-research/task_adaptation) classification tasks.
First, you need to install VTAB's dedicated package.
`pip install task_adaptation==0.1`
Then, you can run it by providing the full dataset name.
Example with `eurosat`:
`clip_benchmark eval --dataset=vtab/eurosat --task=zeroshot_classification --pretrained=laion400m_e32 --model=ViT-B-32-quickgelu --output=result.json --batch_size=64`
See [clip_benchmark/datasets/builder.py#L634](clip_benchmark/datasets/builder.py#L634) for the full list of
VTAB dataset collection.
#### TensorFlow dataset example
Here is an example on how to run it on [Tensorflow datasets](https://www.tensorflow.org/datasets).
First, you need to install `tfds-nightly` and `timm`.
`pip install timm tfds-nightly`
The name of the dataset follows the template `tfds/<DATASET_NAME>`.
Example with `cifar10`:
`clip_benchmark eval --dataset=tfds/cifar10 --task=zeroshot_classification --pretrained=laion400m_e32 --model=ViT-B-32-quickgelu --output=result.json --batch_size=64`
#### COCO captions example
Here is an example for COCO captions zero-shot retrieval:
`clip_benchmark eval --dataset=mscoco_captions --task=zeroshot_retrieval --pretrained=laion400m_e32 --model=ViT-B-32-quickgelu --output=result.json --batch_size=64`
Note that for using COCO, you also need to install `pycocotools` (e.g., using `pip install pycocotools`).
#### Webdataset example
Here is an example on how to run it on [webdatasets](https://github.com/webdataset/webdataset).
First, you need to install `webdataset`.
`pip install webdataset`
##### Creating a webdataset
You can either convert an already supported CLIP_benchmark dataset to webdataset format, or manually create your own with the same file structure. For already supported datasets, use the CLI command `clip_benchmark_export_wds` as in this example:
```
$ clip_benchmark_export_wds --dataset cifar10 --split train --dataset_root DATA_DIR/ --output wds_cifar10/
$ clip_benchmark_export_wds --dataset cifar10 --split test --dataset_root DATA_DIR/ --output wds_cifar10/
```
which will convert the train and test splits for CIFAR-10 (downloaded to `DATA_DIR/`) and save the webdataset to `wds_cifar10/` (upload to Huggingface Hub must be done manually for now). Retrieval datasets are also supported with the `--retrieval` flag.
For other datasets, data must be stored with the following file structure:
```
root_dir/
train/
nshards.txt
0.tar
1.tar
...
test/
nshards.txt
0.tar
...
classnames.txt
zeroshot_classification_templates.txt
dataset_type.txt
```
Each split should be contained in its own folder and `nshards.txt` should contain a single integer corresponding to the number of TAR files. The TAR files should follow webdataset format, with an image file (.webp, .png, or .jpg) and a label (.cls) for each example. Classnames and templates are required for zeroshot classification evaluation, with each classname or template on its own line. Dataset type is required for distinguishing zeroshot retrieval evaluation: the file should just contain the text `retrieval`.
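As a rough illustration of this layout, the sketch below writes a single test shard with the `webdataset` package; the paths and samples are placeholders, and `root_dir/test/` is assumed to already exist:

```python
# Sketch only: write one shard in the layout above (paths/samples are placeholders).
import io
import webdataset as wds
from PIL import Image

samples = [
    (Image.new("RGB", (32, 32), color=(255, 0, 0)), 0),  # (image, class index)
    (Image.new("RGB", (32, 32), color=(0, 255, 0)), 1),
]

with wds.TarWriter("root_dir/test/0.tar") as sink:
    for i, (img, label) in enumerate(samples):
        buf = io.BytesIO()
        img.save(buf, format="JPEG")
        sink.write({
            "__key__": f"{i:06d}",                 # shared key for all files of this sample
            "jpg": buf.getvalue(),                 # image bytes -> <key>.jpg
            "cls": str(label).encode("utf-8"),     # class index -> <key>.cls
        })

# root_dir/test/nshards.txt should then contain the single line: 1
```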
##### Evaluating on a webdataset
The name of the dataset follows the template `wds/<DATASET_NAME>`. Note that the dataset name currently only affects the name in the results output - classnames and templates are loaded directly from the included files. The dataset root directory can be either a local path to the `root_dir` as specified above, or an HTTP URL pointing to a Huggingface Hub dataset file tree.
Example with `vtab/cifar10`:
```
$ clip_benchmark eval --dataset wds/vtab/cifar10 --dataset_root ROOT_DIR/wds_vtab-cifar10/
$ clip_benchmark eval --dataset wds/vtab/cifar10 --dataset_root https://huggingface.co/datasets/clip-benchmark/wds_vtab-cifar10/tree/main
```
All other arguments remain the same as in the other examples. See `https://huggingface.co/clip-benchmark` for a full list of datasets that have already been uploaded to Huggingface.
### Evaluate multiple models on multiple datasets
For the purpose of benchmarking, it is possible to run the CLI with multiple
pre-trained models on multiple datasets.
#### Pretrained models and datasets list as arguments
For models, we can provide a list of pretrained model names in the form 'model,pretrained' (so `model` and `pretrained` are comma-separated). For datasets, we can provide a list of datasets. For languages, we can provide a list of languages.
Example:
```bash
clip_benchmark eval --pretrained_model ViT-B-32-quickgelu,laion400m_e32 ViT-L-14,laion400m_e32 \
--dataset cifar10 cifar100 --dataset_root "clip_benchmark_datasets/{dataset}" --language en jp \
--output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```
Note that `--dataset_root` and `--output` can now be templates that depend on the dataset/model/language/task (for `--output`) and on the dataset name (for `--dataset_root`).
If the benchmark fails at some point, it is possible to resume it and skip already evaluated models using `--skip_existing`.
#### Pretrained models and datasets list as files
We can also provide paths to files listing models (each line in the form 'model,pretrained', where `model` and `pretrained` are comma-separated) and datasets (one dataset per line):
```bash
clip_benchmark eval --pretrained_model benchmark/models.txt \
--dataset benchmark/datasets.txt --dataset_root "clip_benchmark_datasets/{dataset}" \
--output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```
Examples are available in [benchmark/datasets.txt](benchmark/datasets.txt) and [benchmark/models.txt](benchmark/models.txt)
#### Model and dataset collections
We can also provide model collection names (`openai`, `openclip_base`, `openclip_multilingual`, `openclip_full` are supported) or dataset collection names (`vtab`, `vtab+`, `retrieval`, `imagenet_robustness` are supported):
```bash
clip_benchmark eval --pretrained_model openai openclip_base --dataset vtab+ retrieval \
--dataset_root "clip_benchmark_datasets/{dataset}" --not quiet \
--output "{dataset}_{pretrained}_{model}_{language}_{task}.json"
```
#### Development
For development, you can also do this:
```bash
git clone https://github.com/LAION-AI/CLIP_benchmark
cd CLIP_benchmark
python setup.py install
```
### Credits
- Thanks to [OpenCLIP](https://github.com/mlfoundations/open_clip) authors, zero-shot accuracy code is adapted from there and pre-trained models are used in the command line interface.
- Thanks to [SLIP](https://github.com/facebookresearch/SLIP) authors, some zero-shot templates and classnames are from there.
- Thanks to [Wise-ft](https://github.com/mlfoundations/wise-ft) authors, the ImageNet robustness datasets code is adapted from there.
- Thanks to [LiT](https://arxiv.org/abs/2111.07991.pdf) authors, some zero-shot templates and classnames of VTAB datasets are from there.
- This package was created with [Cookiecutter](https://github.com/audreyr/cookiecutter) and the [audreyr/cookiecutter-pypackage](https://github.com/audreyr/cookiecutter-pypackage) project template. Thanks to the author.