Unverified commit b090b790, authored by NielsRogge, committed by GitHub

Make Swin work with VisionEncoderDecoderModel (#15527)



* Add attribute_map

* Add mention in docs

* Set hidden_size attribute correctly

* Add note about Transformer-based models only
Co-authored-by: Niels Rogge <nielsrogge@Nielss-MBP.localdomain>
parent ec15da24
@@ -13,8 +13,8 @@ specific language governing permissions and limitations under the License.

 # Vision Encoder Decoder Models

 The [`VisionEncoderDecoderModel`] can be used to initialize an image-to-text-sequence model with any
-pretrained vision autoencoding model as the encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit))
-and any pretrained language model as the decoder (*e.g.* [RoBERTa](roberta), [GPT2](gpt2), [BERT](bert)).
+pretrained Transformer-based vision autoencoding model as the encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit), [Swin](swin))
+and any pretrained language model as the decoder (*e.g.* [RoBERTa](roberta), [GPT2](gpt2), [BERT](bert), [DistilBERT](distilbert)).
 The effectiveness of initializing image-to-text-sequence models with pretrained checkpoints has been shown in (for
 example) [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
......
@@ -90,6 +90,10 @@ class SwinConfig(PretrainedConfig):
     ```"""

     model_type = "swin"

+    attribute_map = {
+        "num_attention_heads": "num_heads",
+    }
+
     def __init__(
         self,
         image_size=224,
@@ -130,3 +134,6 @@ class SwinConfig(PretrainedConfig):
         self.path_norm = patch_norm
         self.layer_norm_eps = layer_norm_eps
         self.initializer_range = initializer_range
+        # we set the hidden_size attribute in order to make Swin work with VisionEncoderDecoderModel
+        # this indicates the channel dimension after the last stage of the model
+        self.hidden_size = embed_dim * 8
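The two changes above can be illustrated with a minimal, self-contained sketch (this is *not* the actual `transformers` implementation; `MiniSwinConfig` and its defaults are hypothetical stand-ins). It shows why the `attribute_map` lets generic code ask for `num_attention_heads` while Swin stores `num_heads`, and why the final channel dimension is `embed_dim * 8`: Swin's default four stages double the embedding dimension at each of the three downsampling steps, so the last stage has `embed_dim * 2**3` channels.

```python
# Minimal sketch of the commit's two ideas; not the real transformers code.
class MiniSwinConfig:
    # Alias the generic name used by composite models to Swin's own name.
    attribute_map = {"num_attention_heads": "num_heads"}

    def __init__(self, embed_dim=96, num_heads=(3, 6, 12, 24)):
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        # Channel dimension after the last stage: three doublings over
        # four stages, i.e. embed_dim * 2**3 == embed_dim * 8.
        self.hidden_size = embed_dim * 8

    def __getattr__(self, name):
        # Only called for attributes not found normally, so aliased
        # names are resolved through attribute_map without recursion.
        mapped = type(self).attribute_map.get(name)
        if mapped is not None:
            return getattr(self, mapped)
        raise AttributeError(name)


config = MiniSwinConfig()
print(config.hidden_size)          # 768 for the default embed_dim=96
print(config.num_attention_heads)  # resolves to config.num_heads
```

With this in place, a composite model that only knows the generic `hidden_size` / `num_attention_heads` names can consume a Swin-style config unchanged, which is exactly what `VisionEncoderDecoderModel` needs from its encoder config.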