"...text-generation-inference.git" did not exist on "c4422e567811051e40469551c58a8078158a6656"
Unverified commit b1054cbb, authored by Aditya Oke, committed by GitHub

Add Video SwinTransformer (#6521)



* Just start adding mere copy paste

* Replace d with t and D with T

* Align swin transformer video to image a bit

* Rename d -> t

* align with 2d impl

* align with 2d impl

* Add helpful comments and config for 3d

* add docs

* Add docs

* Add configurations

* Add docs

* Fix bugs

* Fix wrong edit

* Fix wrong edit

* Fix bugs

* Fix bugs

* Fix as per fx suggestions

* Update torchvision/models/video/swin_transformer.py

* Fix as per fx suggestions

* Fix expect files and code

* Update the expect files

* Modify video swin

* Add min size and min temporal size, num params

* Add flops and size

* Fix types

* Fix url recipe
Co-authored-by: Yosua Michael Maranatha <yosuamichael@fb.com>
parent 3f4dcae6
@@ -518,6 +518,7 @@ pre-trained weights:
     models/video_mvit
     models/video_resnet
     models/video_s3d
+    models/video_swin_transformer
     |
@@ -15,7 +15,7 @@ Model builders
 --------------
 The following model builders can be used to instantiate a SwinTransformer model (original and V2) with and without pre-trained weights.
 All the model builders internally rely on the ``torchvision.models.swin_transformer.SwinTransformer``
 base class. Please refer to the `source code
 <https://github.com/pytorch/vision/blob/main/torchvision/models/swin_transformer.py>`_ for
 more details about this class.
Video SwinTransformer
=====================

.. currentmodule:: torchvision.models.video

The Video SwinTransformer model is based on the `Video Swin Transformer <https://arxiv.org/abs/2106.13230>`__ paper.

.. betastatus:: video module

Model builders
--------------

The following model builders can be used to instantiate a Video SwinTransformer model, with or
without pre-trained weights. All the model builders internally rely on the
``torchvision.models.video.swin_transformer.SwinTransformer3d`` base class. Please refer to the `source
code
<https://github.com/pytorch/vision/blob/main/torchvision/models/video/swin_transformer.py>`_ for
more details about this class.

.. autosummary::
    :toctree: generated/
    :template: function.rst

    swin3d_t
    swin3d_s
    swin3d_b
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
@@ -494,6 +494,8 @@ class SwinTransformerBlockV2(SwinTransformerBlock):
         )

     def forward(self, x: Tensor):
+        # Here is the difference, we apply norm after the attention in V2.
+        # In V1 we applied norm before the attention.
         x = x + self.stochastic_depth(self.norm1(self.attn(x)))
         x = x + self.stochastic_depth(self.norm2(self.mlp(x)))
         return x
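The comment added in this hunk describes the only difference between the V1 and V2 blocks: where the norm sits on each residual branch. A minimal sketch with plain callables (stochastic depth omitted; the function names are illustrative, not torchvision's API):

```python
def v1_block(x, attn, mlp, norm1, norm2):
    # V1 ("pre-norm"): normalize first, then attention / MLP, then residual add.
    x = x + attn(norm1(x))
    x = x + mlp(norm2(x))
    return x

def v2_block(x, attn, mlp, norm1, norm2):
    # V2 ("res-post-norm"): attention / MLP first, then normalize the branch
    # output before the residual add.
    x = x + norm1(attn(x))
    x = x + norm2(mlp(x))
    return x
```

With identity norms the two orderings coincide; with any non-trivial norm they diverge, which is the point of the V2 change (the Swin V2 paper reports better training stability at scale from post-norm residual branches).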
@@ -587,7 +589,7 @@ class SwinTransformer(nn.Module):
         num_features = embed_dim * 2 ** (len(depths) - 1)
         self.norm = norm_layer(num_features)
-        self.permute = Permute([0, 3, 1, 2])
+        self.permute = Permute([0, 3, 1, 2])  # B H W C -> B C H W
         self.avgpool = nn.AdaptiveAvgPool2d(1)
         self.flatten = nn.Flatten(1)
         self.head = nn.Linear(num_features, num_classes)
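The classifier tail shown in the hunk above (norm, permute, average-pool, flatten, linear) is equivalent to taking the mean of the ``B H W C`` feature map over its spatial positions and applying a linear layer. A small sketch assuming PyTorch, with illustrative sizes rather than the model's real ones:

```python
import torch
from torch import nn

# Illustrative sizes; in the real model num_features would be
# embed_dim * 2 ** (len(depths) - 1), as in the diff above.
num_features, num_classes = 8, 5
norm = nn.LayerNorm(num_features)
head = nn.Linear(num_features, num_classes)

x = torch.randn(2, 7, 7, num_features)   # B H W C feature map
y = norm(x).permute(0, 3, 1, 2)          # B C H W
y = nn.AdaptiveAvgPool2d(1)(y)           # B C 1 1
y = torch.flatten(y, 1)                  # B C
logits = head(y)                         # B num_classes

# The same computation written as a plain mean over the spatial dims:
logits_direct = head(norm(x).mean(dim=(1, 2)))
```

The permute is needed only because `AdaptiveAvgPool2d` expects channels-first input; the `# B H W C -> B C H W` comment added in this commit documents exactly that.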
from .mvit import *
from .resnet import *
from .s3d import *
from .swin_transformer import *