Unverified Commit 87208a05 authored by Clémentine Fourrier, committed by GitHub

Graphormer model for Graph Classification (#20968)



* [FT] First commit for graphormer architecture.

The model has no tokenizer, as it uses a collator and preprocessing function for its input management.
Architecture to be tested against original one.
The arch might need to be changed to fit the checkpoint, but a revert to the original arch will make the code less nice to read.
TODO: doc

* [FIX] removed test model

* [FIX] import error

* [FIX] black and flake

* [DOC] added paper refs

* [FIX] [DOC]

* [FIX] black

* [DOC] Updated READMEs

* [FIX] Order of imports + rm Tokenizer calls

* [FIX] Moved assert in class to prevent doc build failure

* [FIX] make fix-copies

* [Doc] update from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [FIX] Removed Graphormer from Sequence classification model list

* [DOC] Added HF copyright to Cython file

* [DOC] Fixed comments

* [FIX] typos in class doc + removed config classes.

Todo: update doc from paper definitions

* [FIX] Removed dependency to fairseq, and replaced all asserts with Exception management

* [FIX] Homogenized initialization of weights to pretrained constructor

* [FIX] [CP] Updated multi_hop parameter to get same results as in original implementation

* [DOC] Relevant parameter description in the configuration file

* [DOC] Updated doc and comments in main graphormer file

* [FIX] make style and quality checks

* [DOC] Fix doc format

* [FIX] [WIP] Updated part of the tests, though still a wip

* [FIX] [WIP]

* [FIX] repo consistency

* [FIX] Changed input names for more understandability

* [FIX] [BUG] updated num_classes params for propagation in the model

* simplified collator

* [FIX] Updated tests to follow new naming pattern

* [TESTS] Updated test suite along with model

* [FIX] rm tokenizer import

* [DOC] add link to graphormerdoc

* Changed section in doc from text model to graph model

* Apply suggestions from code review

Spacing, inits
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [DOC] Explain algos_graphormer functions

* Cython soft import protection

* Rm call to Callable in configuration graphormer

* [FIX] replaced asserts with Exceptions

* Add org to graphormer checkpoints

* Prefixed classes with Graphormer

* Management of init functions

* format

* fixes

* fix length file

* update indent

* relaunching ci

* Errors for missing cython imports

* fix style

* fix style doc
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 758bd39e
@@ -334,6 +334,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
...
@@ -327,6 +327,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
...
@@ -299,6 +299,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (ओपनएआई से) साथ में पेपर [लैंग्वेज मॉडल्स अनसुपरवाइज्ड मल्टीटास्क लर्नर्स हैं](https://blog.openai.com/better-language-models/) एलेक रैडफोर्ड*, जेफरी वू*, रेवन चाइल्ड, डेविड लुआन, डारियो एमोडी* द्वारा * और इल्या सुत्सकेवर** ने पोस्ट किया।
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI से) साथ वाला पेपर [kingoflolz/mesh-transformer-jax](https://github. com/kingoflolz/mesh-transformer-jax/) बेन वांग और अरन कोमात्सुजाकी द्वारा।
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA से) साथ में कागज [GroupViT: टेक्स्ट सुपरविजन से सिमेंटिक सेगमेंटेशन इमर्जेस](https://arxiv .org/abs/2202.11094) जियारुई जू, शालिनी डी मेलो, सिफ़ी लियू, वोनमिन बायन, थॉमस ब्रेउएल, जान कौट्ज़, ज़ियाओलोंग वांग द्वारा।
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (फेसबुक से) साथ में पेपर [ह्यूबर्ट: सेल्फ सुपरवाइज्ड स्पीच रिप्रेजेंटेशन लर्निंग बाय मास्क्ड प्रेडिक्शन ऑफ हिडन यूनिट्स](https ://arxiv.org/abs/2106.07447) वेई-निंग सू, बेंजामिन बोल्टे, याओ-हंग ह्यूबर्ट त्साई, कुशाल लखोटिया, रुस्लान सालाखुतदीनोव, अब्देलरहमान मोहम्मद द्वारा।
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (बर्कले से) साथ में कागज [I-BERT: Integer-only BERT Quantization](https:// arxiv.org/abs/2101.01321) सेहून किम, अमीर घोलमी, ज़ेवेई याओ, माइकल डब्ल्यू महोनी, कर्ट केटज़र द्वारा।
...
@@ -361,6 +361,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI から) Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** から公開された研究論文: [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/)
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI から) Ben Wang and Aran Komatsuzaki から公開されたレポジトリー [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/)
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (AI-Sweden から) Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren から公開された研究論文: [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf)
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (Microsoft から) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu から公開された研究論文: [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234).
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA から) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang から公開された研究論文: [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094)
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook から) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed から公開された研究論文: [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447)
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley から) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer から公開された研究論文: [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321)
...
@@ -276,6 +276,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI 에서) Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**[Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 논문과 함께 발표했습니다.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (AI-Sweden 에서) Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. 의 [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) 논문과 함께 발표했습니다.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu 의 [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) 논문과 함께 발표했습니다.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA 에서) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 의 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 논문과 함께 발표했습니다.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook 에서) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 의 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 논문과 함께 발표했습니다.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley 에서) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 의 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 논문과 함께 발표했습니다.
...
@@ -300,6 +300,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (来自 OpenAI) 伴随论文 [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 由 Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** 发布。
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (来自 EleutherAI) 伴随论文 [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) 由 Ben Wang and Aran Komatsuzaki 发布。
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (来自 UCSD, NVIDIA) 伴随论文 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 由 Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 发布。
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (来自 Facebook) 伴随论文 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 由 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 发布。
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (来自 Berkeley) 伴随论文 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 由 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 发布。
...
@@ -312,6 +312,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released with the paper [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
...
@@ -571,6 +571,11 @@
- local: model_doc/time_series_transformer
title: Time Series Transformer
title: Time series models
- isExpanded: false
sections:
- local: model_doc/graphormer
title: Graphormer
title: Graph models
title: Models
- sections:
- local: internal/modeling_utils
...
@@ -113,6 +113,7 @@ The documentation is organized into five sections:
1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
@@ -289,6 +290,7 @@ Flax), PyTorch, and/or TensorFlow.
| GPT NeoX Japanese | ✅ | ❌ | ✅ | ❌ | ❌ |
| GPT-J | ❌ | ❌ | ✅ | ✅ | ✅ |
| GPT-Sw3 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Graphormer | ❌ | ❌ | ✅ | ❌ | ❌ |
| GroupViT | ❌ | ❌ | ✅ | ✅ | ❌ |
| Hubert | ❌ | ❌ | ✅ | ✅ | ❌ |
| I-BERT | ❌ | ❌ | ✅ | ❌ | ❌ |
...
<!--Copyright 2022 The HuggingFace Team and Microsoft. All rights reserved.
Licensed under the MIT License; you may not use this file except in compliance with
the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Graphormer
## Overview
The Graphormer model was proposed in [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by
Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen and Tie-Yan Liu. It is a Graph Transformer model, modified to allow computation on graphs instead of text sequences by generating embeddings and features of interest during preprocessing and collation, then using a modified attention.
The abstract from the paper is the following:
*The Transformer architecture has become a dominant choice in many domains, such as natural language processing and computer vision. Yet, it has not achieved competitive performance on popular leaderboards of graph-level prediction compared to mainstream GNN variants. Therefore, it remains a mystery how Transformers could perform well for graph representation learning. In this paper, we solve this mystery by presenting Graphormer, which is built upon the standard Transformer architecture, and could attain excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods to help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and exhibit that with our ways of encoding the structural information of graphs, many popular GNN variants could be covered as the special cases of Graphormer.*
Tips:
This model will not work well on large graphs (more than 100 nodes/edges), as memory usage will explode.
You can reduce the batch size, increase your RAM, or decrease the `UNREACHABLE_NODE_DISTANCE` parameter in `algos_graphormer.pyx`, but it will be hard to go above 700 nodes/edges.
This model does not use a tokenizer, but instead a special collator during training.
This model was contributed by [clefourrier](https://huggingface.co/clefourrier). The original code can be found [here](https://github.com/microsoft/Graphormer).
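Below is a minimal sketch of how toy graphs could be preprocessed, collated and passed to the model. It is illustrative rather than taken from the original documentation: the toy graphs and the `num_classes` value are made up, the collator import path follows this PR's file layout, and the printed attributes assume a standard classification output with `loss` and `logits`. Cython must be installed, since preprocessing compiles `algos_graphormer.pyx` on the fly via `pyximport`.

```python
from transformers import GraphormerConfig, GraphormerForGraphClassification
from transformers.models.graphormer.collating_graphormer import GraphormerDataCollator, preprocess_item

# Two toy graphs: `edge_index` holds [source nodes, target nodes], `y` is the graph-level label.
graphs = [
    {"edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]], "num_nodes": 3, "y": [1]},  # 3-node path graph
    {"edge_index": [[0, 1], [1, 0]], "num_nodes": 2, "y": [0]},              # 2-node graph
]

# Preprocessing computes shortest paths, degrees, edge features and attention biases for each graph.
dataset = [preprocess_item(graph) for graph in graphs]

# The collator (used instead of a tokenizer) pads graphs of different sizes into one batch of tensors.
collator = GraphormerDataCollator()
batch = collator(dataset)

model = GraphormerForGraphClassification(GraphormerConfig(num_classes=2))
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)
```

The same collator can be passed as `data_collator` to a `Trainer` together with a dataset of preprocessed graphs.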
## GraphormerConfig
[[autodoc]] GraphormerConfig
## GraphormerModel
[[autodoc]] GraphormerModel
- forward
## GraphormerForGraphClassification
[[autodoc]] GraphormerForGraphClassification
- forward
@@ -271,6 +271,7 @@ _import_structure = {
"models.gpt_neox_japanese": ["GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoXJapaneseConfig"],
"models.gpt_sw3": [],
"models.gptj": ["GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTJConfig"],
"models.graphormer": ["GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "GraphormerConfig"],
"models.groupvit": [
"GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP",
"GroupViTConfig",
@@ -1539,6 +1540,14 @@ else:
"GPTJPreTrainedModel",
]
)
_import_structure["models.graphormer"].extend(
[
"GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
"GraphormerForGraphClassification",
"GraphormerModel",
"GraphormerPreTrainedModel",
]
)
_import_structure["models.groupvit"].extend(
[
"GROUPVIT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -3672,6 +3681,7 @@ if TYPE_CHECKING:
from .models.gpt_neox import GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXConfig
from .models.gpt_neox_japanese import GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXJapaneseConfig
from .models.gptj import GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTJConfig
from .models.graphormer import GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, GraphormerConfig
from .models.groupvit import (
GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP,
GroupViTConfig,
@@ -4747,6 +4757,12 @@ if TYPE_CHECKING:
GPTJModel,
GPTJPreTrainedModel,
)
from .models.graphormer import (
GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
GraphormerForGraphClassification,
GraphormerModel,
GraphormerPreTrainedModel,
)
from .models.groupvit import (
GROUPVIT_PRETRAINED_MODEL_ARCHIVE_LIST,
GroupViTModel,
...
@@ -82,6 +82,7 @@ from . import (
gpt_neox_japanese,
gpt_sw3,
gptj,
graphormer,
groupvit,
herbert,
hubert,
...
@@ -86,6 +86,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("gpt_neox", "GPTNeoXConfig"),
("gpt_neox_japanese", "GPTNeoXJapaneseConfig"),
("gptj", "GPTJConfig"),
("graphormer", "GraphormerConfig"),
("groupvit", "GroupViTConfig"),
("hubert", "HubertConfig"),
("ibert", "IBertConfig"),
@@ -247,6 +248,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("gpt_neox", "GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gpt_neox_japanese", "GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gptj", "GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("graphormer", "GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("groupvit", "GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("hubert", "HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("ibert", "IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -409,6 +411,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("gpt_neox", "GPT NeoX"),
("gpt_neox_japanese", "GPT NeoX Japanese"),
("gptj", "GPT-J"),
("graphormer", "Graphormer"),
("groupvit", "GroupViT"),
("herbert", "HerBERT"),
("hubert", "Hubert"),
...
@@ -85,6 +85,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("gpt_neox", "GPTNeoXModel"),
("gpt_neox_japanese", "GPTNeoXJapaneseModel"),
("gptj", "GPTJModel"),
("graphormer", "GraphormerModel"),
("groupvit", "GroupViTModel"),
("hubert", "HubertModel"),
("ibert", "IBertModel"),
...
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
# rely on isort to merge the imports
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available
_import_structure = {
"configuration_graphormer": ["GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "GraphormerConfig"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_graphormer"] = [
"GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
"GraphormerForGraphClassification",
"GraphormerModel",
"GraphormerPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_graphormer import GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, GraphormerConfig
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_graphormer import (
GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
GraphormerForGraphClassification,
GraphormerModel,
GraphormerPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# Copyright (c) Microsoft Corporation and HuggingFace
# Licensed under the MIT License.
import cython
cimport numpy
from cython.parallel cimport parallel, prange
import numpy as np
# Reduce this number if matrices are too big for large graphs
UNREACHABLE_NODE_DISTANCE = 510
def floyd_warshall(adjacency_matrix):
"""
Applies the Floyd-Warshall algorithm to the adjacency matrix, to compute the
shortest paths distance between all nodes, up to UNREACHABLE_NODE_DISTANCE.
"""
(nrows, ncols) = adjacency_matrix.shape
assert nrows == ncols
cdef unsigned int n = nrows
adj_mat_copy = adjacency_matrix.astype(np.int32, order='C', casting='safe', copy=True)
assert adj_mat_copy.flags['C_CONTIGUOUS']
cdef numpy.ndarray[numpy.int32_t, ndim=2, mode='c'] M = adj_mat_copy
cdef numpy.ndarray[numpy.int32_t, ndim=2, mode='c'] path = -1 * np.ones([n, n], dtype=np.int32)
cdef unsigned int i, j, k
cdef numpy.int32_t M_ij, M_ik, cost_ikkj
cdef numpy.int32_t* M_ptr = &M[0,0]
cdef numpy.int32_t* M_i_ptr
cdef numpy.int32_t* M_k_ptr
# set unreachable nodes distance to UNREACHABLE_NODE_DISTANCE
for i in range(n):
for j in range(n):
if i == j:
M[i][j] = 0
elif M[i][j] == 0:
M[i][j] = UNREACHABLE_NODE_DISTANCE
# Floyd-Warshall algorithm main loop
for k in range(n):
M_k_ptr = M_ptr + n*k
for i in range(n):
M_i_ptr = M_ptr + n*i
M_ik = M_i_ptr[k]
for j in range(n):
cost_ikkj = M_ik + M_k_ptr[j]
M_ij = M_i_ptr[j]
if M_ij > cost_ikkj:
M_i_ptr[j] = cost_ikkj
path[i][j] = k
# set unreachable path to UNREACHABLE_NODE_DISTANCE
for i in range(n):
for j in range(n):
if M[i][j] >= UNREACHABLE_NODE_DISTANCE:
path[i][j] = UNREACHABLE_NODE_DISTANCE
M[i][j] = UNREACHABLE_NODE_DISTANCE
return M, path
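# Worked example (for illustration): for the 3-node path graph with adjacency
# [[0, 1, 0], [1, 0, 1], [0, 1, 0]], this returns M = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
# and path[0][2] == path[2][0] == 1 (node 1 is the recorded intermediate node);
# all other path entries remain -1.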
def get_all_edges(path, i, j):
"""
Recursive function to compute all possible paths between two nodes from the graph adjacency matrix.
"""
cdef int k = path[i][j]
if k == -1:
return []
else:
return get_all_edges(path, i, k) + [k] + get_all_edges(path, k, j)
def gen_edge_input(max_dist, path, edge_feat):
"""
Generates the full edge feature and adjacency matrix.
Shape: num_nodes * num_nodes * max_distance_between_nodes * num_edge_features
Dim 1 is the input node, dim 2 the output node of the edge, dim 3 the depth of the edge, dim 4 the feature
"""
(nrows, ncols) = path.shape
assert nrows == ncols
cdef unsigned int n = nrows
cdef unsigned int max_dist_copy = max_dist
path_copy = path.astype(long, order='C', casting='safe', copy=True)
edge_feat_copy = edge_feat.astype(long, order='C', casting='safe', copy=True)
assert path_copy.flags['C_CONTIGUOUS']
assert edge_feat_copy.flags['C_CONTIGUOUS']
cdef numpy.ndarray[numpy.int32_t, ndim=4, mode='c'] edge_fea_all = -1 * np.ones([n, n, max_dist_copy, edge_feat.shape[-1]], dtype=np.int32)
cdef unsigned int i, j, k, num_path, cur
for i in range(n):
for j in range(n):
if i == j:
continue
if path_copy[i][j] == UNREACHABLE_NODE_DISTANCE:
continue
path = [i] + get_all_edges(path_copy, i, j) + [j]
num_path = len(path) - 1
for k in range(num_path):
edge_fea_all[i, j, k, :] = edge_feat_copy[path[k], path[k+1], :]
return edge_fea_all
# Copyright (c) Microsoft Corporation and HuggingFace
# Licensed under the MIT License.
from typing import Any, Dict, List, Mapping
import numpy as np
import torch
from ...utils import is_cython_available, requires_backends
if is_cython_available():
import pyximport
pyximport.install(setup_args={"include_dirs": np.get_include()})
from . import algos_graphormer # noqa E402
# Shifts each feature column by a distinct offset so that all categorical features map to disjoint embedding index ranges.
def convert_to_single_emb(x, offset: int = 512):
feature_num = x.shape[1] if len(x.shape) > 1 else 1
feature_offset = 1 + np.arange(0, feature_num * offset, offset, dtype=np.int64)
x = x + feature_offset
return x
def preprocess_item(item, keep_features=True):
requires_backends(preprocess_item, ["Cython"])
if not is_cython_available():
raise ImportError("Graphormer preprocessing needs Cython (pyximport)")
if keep_features and "edge_attr" in item.keys(): # edge_attr
edge_attr = np.asarray(item["edge_attr"], dtype=np.int64)
else:
edge_attr = np.ones((len(item["edge_index"][0]), 1), dtype=np.int64) # same embedding for all
if keep_features and "node_feat" in item.keys(): # input_nodes
node_feature = np.asarray(item["node_feat"], dtype=np.int64)
else:
node_feature = np.ones((item["num_nodes"], 1), dtype=np.int64) # same embedding for all
edge_index = np.asarray(item["edge_index"], dtype=np.int64)
input_nodes = convert_to_single_emb(node_feature) + 1
num_nodes = item["num_nodes"]
if len(edge_attr.shape) == 1:
edge_attr = edge_attr[:, None]
attn_edge_type = np.zeros([num_nodes, num_nodes, edge_attr.shape[-1]], dtype=np.int64)
attn_edge_type[edge_index[0], edge_index[1]] = convert_to_single_emb(edge_attr) + 1
# node adj matrix [num_nodes, num_nodes] bool
adj = np.zeros([num_nodes, num_nodes], dtype=bool)
adj[edge_index[0], edge_index[1]] = True
shortest_path_result, path = algos_graphormer.floyd_warshall(adj)
max_dist = np.amax(shortest_path_result)
input_edges = algos_graphormer.gen_edge_input(max_dist, path, attn_edge_type)
attn_bias = np.zeros([num_nodes + 1, num_nodes + 1], dtype=np.single) # with graph token
# combine
item["input_nodes"] = input_nodes + 1 # we shift all indices by one for padding
item["attn_bias"] = attn_bias
item["attn_edge_type"] = attn_edge_type
item["spatial_pos"] = shortest_path_result.astype(np.int64) + 1 # we shift all indices by one for padding
item["in_degree"] = np.sum(adj, axis=1).reshape(-1) + 1 # we shift all indices by one for padding
item["out_degree"] = item["in_degree"] # for undirected graph
item["input_edges"] = input_edges + 1 # we shift all indices by one for padding
if "labels" not in item:
item["labels"] = item["y"]
return item
class GraphormerDataCollator:
def __init__(self, spatial_pos_max=20, on_the_fly_processing=False):
if not is_cython_available():
raise ImportError("Graphormer preprocessing needs Cython (pyximport)")
self.spatial_pos_max = spatial_pos_max
self.on_the_fly_processing = on_the_fly_processing
def __call__(self, features: List[dict]) -> Dict[str, Any]:
if self.on_the_fly_processing:
features = [preprocess_item(i) for i in features]
if not isinstance(features[0], Mapping):
features = [vars(f) for f in features]
batch = {}
max_node_num = max(len(i["input_nodes"]) for i in features)
node_feat_size = len(features[0]["input_nodes"][0])
edge_feat_size = len(features[0]["attn_edge_type"][0][0])
max_dist = max(len(i["input_edges"][0][0]) for i in features)
edge_input_size = len(features[0]["input_edges"][0][0][0])
batch_size = len(features)
batch["attn_bias"] = torch.zeros(batch_size, max_node_num + 1, max_node_num + 1, dtype=torch.float)
batch["attn_edge_type"] = torch.zeros(batch_size, max_node_num, max_node_num, edge_feat_size, dtype=torch.long)
batch["spatial_pos"] = torch.zeros(batch_size, max_node_num, max_node_num, dtype=torch.long)
batch["in_degree"] = torch.zeros(batch_size, max_node_num, dtype=torch.long)
batch["input_nodes"] = torch.zeros(batch_size, max_node_num, node_feat_size, dtype=torch.long)
batch["input_edges"] = torch.zeros(
batch_size, max_node_num, max_node_num, max_dist, edge_input_size, dtype=torch.long
)
for ix, f in enumerate(features):
for k in ["attn_bias", "attn_edge_type", "spatial_pos", "in_degree", "input_nodes", "input_edges"]:
f[k] = torch.tensor(f[k])
if len(f["attn_bias"][1:, 1:][f["spatial_pos"] >= self.spatial_pos_max]) > 0:
f["attn_bias"][1:, 1:][f["spatial_pos"] >= self.spatial_pos_max] = float("-inf")
batch["attn_bias"][ix, : f["attn_bias"].shape[0], : f["attn_bias"].shape[1]] = f["attn_bias"]
batch["attn_edge_type"][ix, : f["attn_edge_type"].shape[0], : f["attn_edge_type"].shape[1], :] = f[
"attn_edge_type"
]
batch["spatial_pos"][ix, : f["spatial_pos"].shape[0], : f["spatial_pos"].shape[1]] = f["spatial_pos"]
batch["in_degree"][ix, : f["in_degree"].shape[0]] = f["in_degree"]
batch["input_nodes"][ix, : f["input_nodes"].shape[0], :] = f["input_nodes"]
batch["input_edges"][
ix, : f["input_edges"].shape[0], : f["input_edges"].shape[1], : f["input_edges"].shape[2], :
] = f["input_edges"]
batch["out_degree"] = batch["in_degree"]
sample = features[0]["labels"]
if len(sample) == 1: # one task
if isinstance(sample[0], float): # regression
batch["labels"] = torch.from_numpy(np.concatenate([i["labels"] for i in features]))
else: # binary classification
batch["labels"] = torch.from_numpy(np.concatenate([i["labels"] for i in features]))
else: # multi task classification, left to float to keep the NaNs
batch["labels"] = torch.from_numpy(np.stack([i["labels"] for i in features], dim=0))
return batch
# coding=utf-8
# Copyright 2022 Microsoft, clefourrier and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Graphormer model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
# pcqm4mv1 now deprecated
"graphormer-base": "https://huggingface.co/graphormer-base-pcqm4mv2/resolve/main/config.json",
# See all Graphormer models at https://huggingface.co/models?filter=graphormer
}
class GraphormerConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`~GraphormerModel`]. It is used to instantiate a
Graphormer model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the Graphormer
[graphormer-base-pcqm4mv1](https://huggingface.co/graphormer-base-pcqm4mv1) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
num_classes (`int`, *optional*, defaults to 2):
Number of target classes or labels, set to 1 if the task is a regression task.
num_atoms (`int`, *optional*, defaults to 512*9):
Number of node types in the graphs.
num_edges (`int`, *optional*, defaults to 512*3):
Number of edge types in the graphs.
num_in_degree (`int`, *optional*, defaults to 512):
Number of in-degree types in the input graphs.
num_out_degree (`int`, *optional*, defaults to 512):
Number of out-degree types in the input graphs.
num_spatial (`int`, *optional*, defaults to 512):
Number of spatial position types in the input graphs.
num_edge_dis (`int`, *optional*, defaults to 128):
Number of edge distance types used by the multi-hop edge encoder.
multi_hop_max_dist (`int`, *optional*, defaults to 5):
Maximum distance of multi hop edges between two nodes.
spatial_pos_max (`int`, *optional*, defaults to 1024):
Maximum distance between nodes in the graph attention bias matrices, used during preprocessing and
collation.
edge_type (`str`, *optional*, defaults to `"multi_hop"`):
Type of edge relation chosen.
max_nodes (`int`, *optional*, defaults to 512):
Maximum number of nodes which can be parsed for the input graphs.
share_input_output_embed (`bool`, *optional*, defaults to `False`):
Shares the embedding layer between encoder and decoder - careful, True is not implemented.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the encoder.
embedding_dim (`int`, *optional*, defaults to 768):
Dimension of the embedding layer in encoder.
ffn_embedding_dim (`int`, *optional*, defaults to 768):
Dimension of the "intermediate" (often named feed-forward) layer in encoder.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads in the encoder.
self_attention (`bool`, *optional*, defaults to `True`):
Model is self attentive (False not implemented).
activation_fn (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for the attention weights.
activation_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability after activation in the FFN.
layerdrop (`float`, *optional*, defaults to 0.0):
The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
for more details.
bias (`bool`, *optional*, defaults to `True`):
Uses bias in the attention module - unsupported at the moment.
embed_scale (`float`, *optional*, defaults to `None`):
Scaling factor for the node embeddings.
num_trans_layers_to_freeze (`int`, *optional*, defaults to 0):
Number of transformer layers to freeze.
encoder_normalize_before (`bool`, *optional*, defaults to `False`):
Normalize features before encoding the graph.
pre_layernorm (`bool`, *optional*, defaults to `False`):
Apply layernorm before self attention and the feed forward network. Without this, post layernorm will be
used.
apply_graphormer_init (`bool`, *optional*, defaults to `False`):
Apply a custom graphormer initialisation to the model before training.
freeze_embeddings (`bool`, *optional*, defaults to `False`):
Freeze the embedding layer, or train it along the model.
q_noise (`float`, *optional*, defaults to 0.0):
Amount of quantization noise (see "Training with Quantization Noise for Extreme Model Compression"). (For
more detail, see fairseq's documentation on quant_noise).
qn_block_size (`int`, *optional*, defaults to 8):
Size of the blocks for subsequent quantization with iPQ (see q_noise).
kdim (`int`, *optional*, defaults to None):
Dimension of the key in the attention, if different from the other values.
vdim (`int`, *optional*, defaults to None):
Dimension of the value in the attention, if different from the other values.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
traceable (`bool`, *optional*, defaults to `False`):
Changes return value of the encoder's inner_state to stacked tensors.
Example:
```python
>>> from transformers import GraphormerForGraphClassification, GraphormerConfig
>>> # Initializing a Graphormer graphormer-base-pcqm4mv2 style configuration
>>> configuration = GraphormerConfig()
>>> # Initializing a model from the graphormer-base-pcqm4mv2 style configuration
>>> model = GraphormerForGraphClassification(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
"""
model_type = "graphormer"
keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
num_classes: int = 2,
num_atoms: int = 512 * 9,
num_edges: int = 512 * 3,
num_in_degree: int = 512,
num_out_degree: int = 512,
num_spatial: int = 512,
num_edge_dis: int = 128,
multi_hop_max_dist: int = 5, # sometimes is 20
spatial_pos_max: int = 1024,
edge_type: str = "multi_hop",
max_nodes: int = 512,
share_input_output_embed: bool = False,
num_hidden_layers: int = 12,
embedding_dim: int = 768,
ffn_embedding_dim: int = 768,
num_attention_heads: int = 32,
dropout: float = 0.1,
attention_dropout: float = 0.1,
activation_dropout: float = 0.1,
layerdrop: float = 0.0,
encoder_normalize_before: bool = False,
pre_layernorm: bool = False,
apply_graphormer_init: bool = False,
activation_fn: str = "gelu",
embed_scale: float = None,
freeze_embeddings: bool = False,
num_trans_layers_to_freeze: int = 0,
traceable: bool = False,
q_noise: float = 0.0,
qn_block_size: int = 8,
kdim: int = None,
vdim: int = None,
bias: bool = True,
self_attention: bool = True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
**kwargs,
):
self.num_classes = num_classes
self.num_atoms = num_atoms
self.num_in_degree = num_in_degree
self.num_out_degree = num_out_degree
self.num_edges = num_edges
self.num_spatial = num_spatial
self.num_edge_dis = num_edge_dis
self.edge_type = edge_type
self.multi_hop_max_dist = multi_hop_max_dist
self.spatial_pos_max = spatial_pos_max
self.max_nodes = max_nodes
self.num_hidden_layers = num_hidden_layers
self.embedding_dim = embedding_dim
self.hidden_size = embedding_dim
self.ffn_embedding_dim = ffn_embedding_dim
self.num_attention_heads = num_attention_heads
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.layerdrop = layerdrop
self.encoder_normalize_before = encoder_normalize_before
self.pre_layernorm = pre_layernorm
self.apply_graphormer_init = apply_graphormer_init
self.activation_fn = activation_fn
self.embed_scale = embed_scale
self.freeze_embeddings = freeze_embeddings
self.num_trans_layers_to_freeze = num_trans_layers_to_freeze
self.share_input_output_embed = share_input_output_embed
self.traceable = traceable
self.q_noise = q_noise
self.qn_block_size = qn_block_size
# These parameters are here for future extensions
# atm, the model only supports self attention
self.kdim = kdim
self.vdim = vdim
self.self_attention = self_attention
self.bias = bias
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
**kwargs,
)
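# A minimal sketch of task-specific configurations (illustrative values only, not tied to a
# released checkpoint): `num_classes` follows the task, as described in
# GraphormerForGraphClassification.
#
#   GraphormerConfig(num_classes=1)    # graph regression, one float label per graph
#   GraphormerConfig(num_classes=10)   # single-task classification over 10 classes
#   GraphormerConfig(num_classes=128)  # binary multi-task classification over 128 targets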
# coding=utf-8
# Copyright 2022 Microsoft, clefourrier and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch Graphormer model."""
import math
from typing import Optional, Tuple, Union
import torch
import torch.nn as nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from ...activations import ACT2FN
from ...modeling_outputs import BaseModelOutputWithNoAttention, SequenceClassifierOutput
from ...modeling_utils import PreTrainedModel
from ...utils import logging
from .configuration_graphormer import GraphormerConfig
logger = logging.get_logger(__name__)
_CHECKPOINT_FOR_DOC = "graphormer-base-pcqm4mv1"
_CONFIG_FOR_DOC = "GraphormerConfig"
GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [
"clefourrier/graphormer-base-pcqm4mv1",
"clefourrier/graphormer-base-pcqm4mv2",
# See all Graphormer models at https://huggingface.co/models?filter=graphormer
]
def quant_noise(module, p, block_size):
"""
From:
https://github.com/facebookresearch/fairseq/blob/dd0079bde7f678b0cd0715cbd0ae68d661b7226d/fairseq/modules/quant_noise.py
Wraps modules and applies quantization noise to the weights for subsequent quantization with Iterative Product
Quantization as described in "Training with Quantization Noise for Extreme Model Compression"
Args:
- module: nn.Module
- p: amount of Quantization Noise
- block_size: size of the blocks for subsequent quantization with iPQ
Remarks:
- Module weights must have the right sizes wrt the block size
- Only Linear, Embedding and Conv2d modules are supported for the moment
- For more detail on how to quantize by blocks with convolutional weights, see "And the Bit Goes Down:
Revisiting the Quantization of Neural Networks"
- We implement the simplest form of noise here as stated in the paper which consists in randomly dropping
blocks
"""
# if no quantization noise, don't register hook
if p <= 0:
return module
# supported modules
if not isinstance(module, (nn.Linear, nn.Embedding, nn.Conv2d)):
raise NotImplementedError("Module unsupported for quant_noise.")
# test whether module.weight has the right sizes wrt block_size
is_conv = module.weight.ndim == 4
# 2D matrix
if not is_conv:
if module.weight.size(1) % block_size != 0:
raise AssertionError("Input features must be a multiple of block sizes")
# 4D matrix
else:
# 1x1 convolutions
if module.kernel_size == (1, 1):
if module.in_channels % block_size != 0:
raise AssertionError("Input channels must be a multiple of block sizes")
# regular convolutions
else:
k = module.kernel_size[0] * module.kernel_size[1]
if k % block_size != 0:
raise AssertionError("Kernel size must be a multiple of block size")
def _forward_pre_hook(mod, input):
# no noise for evaluation
if mod.training:
if not is_conv:
# gather weight and sizes
weight = mod.weight
in_features = weight.size(1)
out_features = weight.size(0)
# split weight matrix into blocks and randomly drop selected blocks
mask = torch.zeros(in_features // block_size * out_features, device=weight.device)
mask.bernoulli_(p)
mask = mask.repeat_interleave(block_size, -1).view(-1, in_features)
else:
# gather weight and sizes
weight = mod.weight
in_channels = mod.in_channels
out_channels = mod.out_channels
# split weight matrix into blocks and randomly drop selected blocks
if mod.kernel_size == (1, 1):
mask = torch.zeros(
int(in_channels // block_size * out_channels),
device=weight.device,
)
mask.bernoulli_(p)
mask = mask.repeat_interleave(block_size, -1).view(-1, in_channels)
else:
mask = torch.zeros(weight.size(0), weight.size(1), device=weight.device)
mask.bernoulli_(p)
mask = mask.unsqueeze(2).unsqueeze(3).repeat(1, 1, mod.kernel_size[0], mod.kernel_size[1])
# scale weights and apply mask
mask = mask.to(torch.bool) # x.bool() is not currently supported in TorchScript
s = 1 / (1 - p)
mod.weight.data = s * weight.masked_fill(mask, 0)
module.register_forward_pre_hook(_forward_pre_hook)
return module
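# A minimal sketch of how `quant_noise` is meant to be used (hypothetical layer sizes): with
# p=0.1 and block_size=8, roughly 10% of the 8-column weight blocks are zeroed at each training
# forward pass and the remaining weights are rescaled by 1 / (1 - p); in eval mode the hook is a
# no-op and the wrapped layer behaves like a plain nn.Linear.
#
#   layer = quant_noise(nn.Linear(64, 64), p=0.1, block_size=8)
#   layer.train()
#   _ = layer(torch.randn(2, 64))  # weights are stochastically masked inside the pre-hook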
class LayerDropModuleList(nn.ModuleList):
"""
From:
https://github.com/facebookresearch/fairseq/blob/dd0079bde7f678b0cd0715cbd0ae68d661b7226d/fairseq/modules/layer_drop.py
A LayerDrop implementation based on [`torch.nn.ModuleList`]. LayerDrop as described in
https://arxiv.org/abs/1909.11556.
We refresh the choice of which layers to drop every time we iterate over the LayerDropModuleList instance. During
evaluation we always iterate over all layers.
Usage:
```python
layers = LayerDropList(p=0.5, modules=[layer1, layer2, layer3])
for layer in layers: # this might iterate over layers 1 and 3
x = layer(x)
for layer in layers: # this might iterate over all layers
x = layer(x)
for layer in layers: # this might not iterate over any layers
x = layer(x)
```
Args:
p (float): probability of dropping out each layer
modules (iterable, optional): an iterable of modules to add
"""
def __init__(self, p, modules=None):
super().__init__(modules)
self.p = p
def __iter__(self):
dropout_probs = torch.empty(len(self)).uniform_()
for i, m in enumerate(super().__iter__()):
if not self.training or (dropout_probs[i] > self.p):
yield m
class GraphormerGraphNodeFeature(nn.Module):
"""
Compute node features for each node in the graph.
"""
def __init__(self, config):
super().__init__()
self.num_heads = config.num_attention_heads
self.num_atoms = config.num_atoms
self.atom_encoder = nn.Embedding(config.num_atoms + 1, config.hidden_size, padding_idx=config.pad_token_id)
self.in_degree_encoder = nn.Embedding(
config.num_in_degree, config.hidden_size, padding_idx=config.pad_token_id
)
self.out_degree_encoder = nn.Embedding(
config.num_out_degree, config.hidden_size, padding_idx=config.pad_token_id
)
self.graph_token = nn.Embedding(1, config.hidden_size)
def forward(self, input_nodes, in_degree, out_degree):
n_graph, n_node = input_nodes.size()[:2]
node_feature = ( # node feature + graph token
self.atom_encoder(input_nodes).sum(dim=-2) # [n_graph, n_node, n_hidden]
+ self.in_degree_encoder(in_degree)
+ self.out_degree_encoder(out_degree)
)
graph_token_feature = self.graph_token.weight.unsqueeze(0).repeat(n_graph, 1, 1)
graph_node_feature = torch.cat([graph_token_feature, node_feature], dim=1)
return graph_node_feature
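# Shape note (following the forward pass above): for `input_nodes` of shape
# (n_graph, n_node, n_node_features), the atom embeddings are summed over the feature axis and
# combined with the in-/out-degree embeddings, then the learned graph-token embedding is
# prepended, giving an output of shape (n_graph, n_node + 1, hidden_size).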
class GraphormerGraphAttnBias(nn.Module):
"""
Compute attention bias for each head.
"""
def __init__(self, config):
super().__init__()
self.num_heads = config.num_attention_heads
self.multi_hop_max_dist = config.multi_hop_max_dist
# We do not change edge feature embedding learning, as edge embeddings are represented as a combination of the original features
# + shortest path
self.edge_encoder = nn.Embedding(config.num_edges + 1, config.num_attention_heads, padding_idx=0)
self.edge_type = config.edge_type
if self.edge_type == "multi_hop":
self.edge_dis_encoder = nn.Embedding(
config.num_edge_dis * config.num_attention_heads * config.num_attention_heads,
1,
)
self.spatial_pos_encoder = nn.Embedding(config.num_spatial, config.num_attention_heads, padding_idx=0)
self.graph_token_virtual_distance = nn.Embedding(1, config.num_attention_heads)
def forward(self, input_nodes, attn_bias, spatial_pos, input_edges, attn_edge_type):
n_graph, n_node = input_nodes.size()[:2]
graph_attn_bias = attn_bias.clone()
graph_attn_bias = graph_attn_bias.unsqueeze(1).repeat(
1, self.num_heads, 1, 1
) # [n_graph, n_head, n_node+1, n_node+1]
# spatial pos
# [n_graph, n_node, n_node, n_head] -> [n_graph, n_head, n_node, n_node]
spatial_pos_bias = self.spatial_pos_encoder(spatial_pos).permute(0, 3, 1, 2)
graph_attn_bias[:, :, 1:, 1:] = graph_attn_bias[:, :, 1:, 1:] + spatial_pos_bias
# reset spatial pos here
t = self.graph_token_virtual_distance.weight.view(1, self.num_heads, 1)
graph_attn_bias[:, :, 1:, 0] = graph_attn_bias[:, :, 1:, 0] + t
graph_attn_bias[:, :, 0, :] = graph_attn_bias[:, :, 0, :] + t
# edge feature
if self.edge_type == "multi_hop":
spatial_pos_ = spatial_pos.clone()
spatial_pos_[spatial_pos_ == 0] = 1 # set pad to 1
# set 1 to 1, spatial_pos > 1 to spatial_pos - 1
spatial_pos_ = torch.where(spatial_pos_ > 1, spatial_pos_ - 1, spatial_pos_)
if self.multi_hop_max_dist > 0:
spatial_pos_ = spatial_pos_.clamp(0, self.multi_hop_max_dist)
input_edges = input_edges[:, :, :, : self.multi_hop_max_dist, :]
# [n_graph, n_node, n_node, max_dist, n_head]
input_edges = self.edge_encoder(input_edges).mean(-2)
max_dist = input_edges.size(-2)
edge_input_flat = input_edges.permute(3, 0, 1, 2, 4).reshape(max_dist, -1, self.num_heads)
edge_input_flat = torch.bmm(
edge_input_flat,
self.edge_dis_encoder.weight.reshape(-1, self.num_heads, self.num_heads)[:max_dist, :, :],
)
input_edges = edge_input_flat.reshape(max_dist, n_graph, n_node, n_node, self.num_heads).permute(
1, 2, 3, 0, 4
)
input_edges = (input_edges.sum(-2) / (spatial_pos_.float().unsqueeze(-1))).permute(0, 3, 1, 2)
else:
# [n_graph, n_node, n_node, n_head] -> [n_graph, n_head, n_node, n_node]
input_edges = self.edge_encoder(attn_edge_type).mean(-2).permute(0, 3, 1, 2)
graph_attn_bias[:, :, 1:, 1:] = graph_attn_bias[:, :, 1:, 1:] + input_edges
graph_attn_bias = graph_attn_bias + attn_bias.unsqueeze(1) # reset
return graph_attn_bias
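# Shape note (following the forward pass above): the returned bias has shape
# (n_graph, n_head, n_node + 1, n_node + 1) and is later added to the raw attention scores in
# GraphormerMultiheadAttention; row and column 0 correspond to the virtual graph token, whose
# distance to every real node is encoded by `graph_token_virtual_distance`.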
class GraphormerMultiheadAttention(nn.Module):
"""Multi-headed attention.
See "Attention Is All You Need" for more details.
"""
def __init__(self, config):
super().__init__()
self.embedding_dim = config.embedding_dim
self.kdim = config.kdim if config.kdim is not None else config.embedding_dim
self.vdim = config.vdim if config.vdim is not None else config.embedding_dim
self.qkv_same_dim = self.kdim == config.embedding_dim and self.vdim == config.embedding_dim
self.num_heads = config.num_attention_heads
self.dropout_module = torch.nn.Dropout(p=config.dropout, inplace=False)
self.head_dim = config.embedding_dim // config.num_attention_heads
if not (self.head_dim * config.num_attention_heads == self.embedding_dim):
raise AssertionError("The embedding_dim must be divisible by num_heads.")
self.scaling = self.head_dim**-0.5
self.self_attention = True # config.self_attention
if not (self.self_attention):
raise NotImplementedError("The Graphormer model only supports self attention for now.")
if self.self_attention and not self.qkv_same_dim:
raise AssertionError("Self-attention requires query, key and value to be of the same size.")
self.k_proj = quant_noise(
nn.Linear(self.kdim, config.embedding_dim, bias=config.bias),
config.q_noise,
config.qn_block_size,
)
self.v_proj = quant_noise(
nn.Linear(self.vdim, config.embedding_dim, bias=config.bias),
config.q_noise,
config.qn_block_size,
)
self.q_proj = quant_noise(
nn.Linear(config.embedding_dim, config.embedding_dim, bias=config.bias),
config.q_noise,
config.qn_block_size,
)
self.out_proj = quant_noise(
nn.Linear(config.embedding_dim, config.embedding_dim, bias=config.bias),
config.q_noise,
config.qn_block_size,
)
self.onnx_trace = False
def reset_parameters(self):
if self.qkv_same_dim:
# Empirically observed the convergence to be much better with
# the scaled initialization
nn.init.xavier_uniform_(self.k_proj.weight, gain=1 / math.sqrt(2))
nn.init.xavier_uniform_(self.v_proj.weight, gain=1 / math.sqrt(2))
nn.init.xavier_uniform_(self.q_proj.weight, gain=1 / math.sqrt(2))
else:
nn.init.xavier_uniform_(self.k_proj.weight)
nn.init.xavier_uniform_(self.v_proj.weight)
nn.init.xavier_uniform_(self.q_proj.weight)
nn.init.xavier_uniform_(self.out_proj.weight)
if self.out_proj.bias is not None:
nn.init.constant_(self.out_proj.bias, 0.0)
def forward(
self,
query,
key: Optional[torch.Tensor],
value: Optional[torch.Tensor],
attn_bias: Optional[torch.Tensor],
key_padding_mask: Optional[torch.Tensor] = None,
need_weights: bool = True,
attn_mask: Optional[torch.Tensor] = None,
before_softmax: bool = False,
need_head_weights: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
"""
Args:
key_padding_mask (torch.ByteTensor, optional): mask to exclude
keys that are pads, of shape `(batch, src_len)`, where padding elements are indicated by 1s.
need_weights (bool, optional): return the attention weights,
averaged over heads (default: True).
attn_mask (torch.ByteTensor, optional): typically used to
implement causal attention, where the mask prevents the attention from looking forward in time
(default: None).
before_softmax (bool, optional): return the raw attention
weights and values before the attention softmax.
need_head_weights (bool, optional): return the attention
weights for each head. Implies *need_weights*. Default: return the average attention weights over all
heads.
"""
if need_head_weights:
need_weights = True
tgt_len, bsz, embedding_dim = query.size()
src_len = tgt_len
if not (embedding_dim == self.embedding_dim):
raise AssertionError(
f"The query embedding dimension {embedding_dim} is not equal to the expected embedding_dim"
f" {self.embedding_dim}."
)
if not (list(query.size()) == [tgt_len, bsz, embedding_dim]):
raise AssertionError("Query size incorrect in Graphormer, compared to model dimensions.")
if key is not None:
src_len, key_bsz, _ = key.size()
if not torch.jit.is_scripting():
if (key_bsz != bsz) or (value is None) or (value.shape[:2] != (src_len, bsz)):
raise AssertionError(
"The batch shape does not match the key or value shapes provided to the attention."
)
q = self.q_proj(query)
k = self.k_proj(query)
v = self.v_proj(query)
q *= self.scaling
q = q.contiguous().view(tgt_len, bsz * self.num_heads, self.head_dim).transpose(0, 1)
if k is not None:
k = k.contiguous().view(-1, bsz * self.num_heads, self.head_dim).transpose(0, 1)
if v is not None:
v = v.contiguous().view(-1, bsz * self.num_heads, self.head_dim).transpose(0, 1)
if (k is None) or not (k.size(1) == src_len):
raise AssertionError("The shape of the key generated in the attention is incorrect")
# This is part of a workaround to get around fork/join parallelism
# not supporting Optional types.
if key_padding_mask is not None and key_padding_mask.dim() == 0:
key_padding_mask = None
if key_padding_mask is not None:
if key_padding_mask.size(0) != bsz or key_padding_mask.size(1) != src_len:
raise AssertionError(
"The shape of the generated padding mask for the key does not match expected dimensions."
)
attn_weights = torch.bmm(q, k.transpose(1, 2))
attn_weights = self.apply_sparse_mask(attn_weights, tgt_len, src_len, bsz)
if list(attn_weights.size()) != [bsz * self.num_heads, tgt_len, src_len]:
raise AssertionError("The attention weights generated do not match the expected dimensions.")
if attn_bias is not None:
attn_weights += attn_bias.view(bsz * self.num_heads, tgt_len, src_len)
if attn_mask is not None:
attn_mask = attn_mask.unsqueeze(0)
attn_weights += attn_mask
if key_padding_mask is not None:
# don't attend to padding symbols
attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
attn_weights = attn_weights.masked_fill(
key_padding_mask.unsqueeze(1).unsqueeze(2).to(torch.bool), float("-inf")
)
attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
if before_softmax:
return attn_weights, v
attn_weights_float = torch.nn.functional.softmax(attn_weights, dim=-1)
attn_weights = attn_weights_float.type_as(attn_weights)
attn_probs = self.dropout_module(attn_weights)
if v is None:
raise AssertionError("No value generated")
attn = torch.bmm(attn_probs, v)
if list(attn.size()) != [bsz * self.num_heads, tgt_len, self.head_dim]:
raise AssertionError("The attention generated do not match the expected dimensions.")
attn = attn.transpose(0, 1).contiguous().view(tgt_len, bsz, embedding_dim)
attn = self.out_proj(attn)
attn_weights = None
if need_weights:
attn_weights = attn_weights_float.contiguous().view(bsz, self.num_heads, tgt_len, src_len).transpose(1, 0)
if not need_head_weights:
# average attention weights over heads
attn_weights = attn_weights.mean(dim=0)
return attn, attn_weights
def apply_sparse_mask(self, attn_weights, tgt_len: int, src_len: int, bsz: int):
return attn_weights
class GraphormerGraphEncoderLayer(nn.Module):
def __init__(self, config) -> None:
super().__init__()
# Initialize parameters
self.embedding_dim = config.embedding_dim
self.num_attention_heads = config.num_attention_heads
self.attention_dropout = config.attention_dropout
self.q_noise = config.q_noise
self.qn_block_size = config.qn_block_size
self.pre_layernorm = config.pre_layernorm
self.dropout_module = torch.nn.Dropout(p=config.dropout, inplace=False)
self.activation_dropout_module = torch.nn.Dropout(p=config.dropout, inplace=False)
# Initialize blocks
self.activation_fn = ACT2FN[config.activation_fn]
self.self_attn = GraphormerMultiheadAttention(config)
# layer norm associated with the self attention layer
self.self_attn_layer_norm = nn.LayerNorm(self.embedding_dim)
self.fc1 = self.build_fc(
self.embedding_dim,
config.ffn_embedding_dim,
q_noise=config.q_noise,
qn_block_size=config.qn_block_size,
)
self.fc2 = self.build_fc(
config.ffn_embedding_dim,
self.embedding_dim,
q_noise=config.q_noise,
qn_block_size=config.qn_block_size,
)
# layer norm associated with the position wise feed-forward NN
self.final_layer_norm = nn.LayerNorm(self.embedding_dim)
def build_fc(self, input_dim, output_dim, q_noise, qn_block_size):
return quant_noise(nn.Linear(input_dim, output_dim), q_noise, qn_block_size)
def forward(
self,
input_nodes: torch.Tensor,
self_attn_bias: Optional[torch.Tensor] = None,
self_attn_mask: Optional[torch.Tensor] = None,
self_attn_padding_mask: Optional[torch.Tensor] = None,
):
"""
nn.LayerNorm is applied either before or after the self-attention/ffn modules similar to the original
Transformer implementation.
"""
residual = input_nodes
if self.pre_layernorm:
input_nodes = self.self_attn_layer_norm(input_nodes)
input_nodes, attn = self.self_attn(
query=input_nodes,
key=input_nodes,
value=input_nodes,
attn_bias=self_attn_bias,
key_padding_mask=self_attn_padding_mask,
need_weights=False,
attn_mask=self_attn_mask,
)
input_nodes = self.dropout_module(input_nodes)
input_nodes = residual + input_nodes
if not self.pre_layernorm:
input_nodes = self.self_attn_layer_norm(input_nodes)
residual = input_nodes
if self.pre_layernorm:
input_nodes = self.final_layer_norm(input_nodes)
input_nodes = self.activation_fn(self.fc1(input_nodes))
input_nodes = self.activation_dropout_module(input_nodes)
input_nodes = self.fc2(input_nodes)
input_nodes = self.dropout_module(input_nodes)
input_nodes = residual + input_nodes
if not self.pre_layernorm:
input_nodes = self.final_layer_norm(input_nodes)
return input_nodes, attn
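# Note on the two layer-norm placements handled above (summary of the code, not an additional
# feature): with `config.pre_layernorm=True` the norms run before the attention and FFN blocks
# (Pre-LN), otherwise they run after each residual addition (Post-LN).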
class GraphormerGraphEncoder(nn.Module):
def __init__(self, config):
super().__init__()
self.dropout_module = torch.nn.Dropout(p=config.dropout, inplace=False)
self.layerdrop = config.layerdrop
self.embedding_dim = config.embedding_dim
self.apply_graphormer_init = config.apply_graphormer_init
self.traceable = config.traceable
self.graph_node_feature = GraphormerGraphNodeFeature(config)
self.graph_attn_bias = GraphormerGraphAttnBias(config)
self.embed_scale = config.embed_scale
if config.q_noise > 0:
self.quant_noise = quant_noise(
nn.Linear(self.embedding_dim, self.embedding_dim, bias=False),
config.q_noise,
config.qn_block_size,
)
else:
self.quant_noise = None
if config.encoder_normalize_before:
self.emb_layer_norm = nn.LayerNorm(self.embedding_dim)
else:
self.emb_layer_norm = None
if config.pre_layernorm:
self.final_layer_norm = nn.LayerNorm(self.embedding_dim)
if self.layerdrop > 0.0:
self.layers = LayerDropModuleList(p=self.layerdrop)
else:
self.layers = nn.ModuleList([])
self.layers.extend([GraphormerGraphEncoderLayer(config) for _ in range(config.num_hidden_layers)])
# Apply initialization of model params after building the model
if config.freeze_embeddings:
raise NotImplementedError("Freezing embeddings is not implemented yet.")
for layer in range(config.num_trans_layers_to_freeze):
m = self.layers[layer]
if m is not None:
for p in m.parameters():
p.requires_grad = False
def forward(
self,
input_nodes,
input_edges,
attn_bias,
in_degree,
out_degree,
spatial_pos,
attn_edge_type,
perturb=None,
last_state_only: bool = False,
token_embeddings: Optional[torch.Tensor] = None,
attn_mask: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
# compute padding mask. This is needed for multi-head attention
data_x = input_nodes
n_graph, n_node = data_x.size()[:2]
padding_mask = (data_x[:, :, 0]).eq(0)
padding_mask_cls = torch.zeros(n_graph, 1, device=padding_mask.device, dtype=padding_mask.dtype)
padding_mask = torch.cat((padding_mask_cls, padding_mask), dim=1)
attn_bias = self.graph_attn_bias(input_nodes, attn_bias, spatial_pos, input_edges, attn_edge_type)
if token_embeddings is not None:
input_nodes = token_embeddings
else:
input_nodes = self.graph_node_feature(input_nodes, in_degree, out_degree)
if perturb is not None:
input_nodes[:, 1:, :] += perturb
if self.embed_scale is not None:
input_nodes = input_nodes * self.embed_scale
if self.quant_noise is not None:
input_nodes = self.quant_noise(input_nodes)
if self.emb_layer_norm is not None:
input_nodes = self.emb_layer_norm(input_nodes)
input_nodes = self.dropout_module(input_nodes)
input_nodes = input_nodes.transpose(0, 1)
inner_states = []
if not last_state_only:
inner_states.append(input_nodes)
for layer in self.layers:
input_nodes, _ = layer(
input_nodes,
self_attn_padding_mask=padding_mask,
self_attn_mask=attn_mask,
self_attn_bias=attn_bias,
)
if not last_state_only:
inner_states.append(input_nodes)
graph_rep = input_nodes[0, :, :]
if last_state_only:
inner_states = [input_nodes]
if self.traceable:
return torch.stack(inner_states), graph_rep
else:
return inner_states, graph_rep
class GraphormerDecoderHead(nn.Module):
def __init__(self, embedding_dim, num_classes):
"""num_classes should be 1 for regression, or the number of classes for classification"""
super().__init__()
self.lm_output_learned_bias = nn.Parameter(torch.zeros(1))
self.classifier = nn.Linear(embedding_dim, num_classes, bias=False)
self.num_classes = num_classes
def forward(self, input_nodes, **unused):
input_nodes = self.classifier(input_nodes)
input_nodes = input_nodes + self.lm_output_learned_bias
return input_nodes
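# Shape note (following the forward pass above): the head maps hidden states of shape
# (n_graph, n_node + 1, embedding_dim) to (n_graph, n_node + 1, num_classes) and adds a learned
# scalar bias; GraphormerForGraphClassification below keeps only position 0 (the virtual graph
# token) as the graph-level prediction.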
class GraphormerPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = GraphormerConfig
base_model_prefix = "graphormer"
supports_gradient_checkpointing = True
_keys_to_ignore_on_load_missing = [r"position_ids"]
main_input_name_nodes = "input_nodes"
main_input_name_edges = "input_edges"
def normal_(self, data):
# with FSDP, module params will be on CUDA, so we cast them back to CPU
# so that the RNG is consistent with and without FSDP
data.copy_(data.cpu().normal_(mean=0.0, std=0.02).to(data.device))
def init_graphormer_params(self, module):
"""
Initialize the weights specific to the Graphormer Model.
"""
if isinstance(module, nn.Linear):
self.normal_(module.weight.data)
if module.bias is not None:
module.bias.data.zero_()
if isinstance(module, nn.Embedding):
self.normal_(module.weight.data)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
if isinstance(module, GraphormerMultiheadAttention):
self.normal_(module.q_proj.weight.data)
self.normal_(module.k_proj.weight.data)
self.normal_(module.v_proj.weight.data)
def _init_weights(self, module):
"""
Initialize the weights
"""
if isinstance(module, (nn.Linear, nn.Conv2d)):
# We might be missing part of the Linear init, dependent on the layer num
module.weight.data.normal_(mean=0.0, std=0.02)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=0.02)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, GraphormerMultiheadAttention):
module.q_proj.weight.data.normal_(mean=0.0, std=0.02)
module.k_proj.weight.data.normal_(mean=0.0, std=0.02)
module.v_proj.weight.data.normal_(mean=0.0, std=0.02)
module.reset_parameters()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
elif isinstance(module, GraphormerGraphEncoder):
if module.apply_graphormer_init:
module.apply(self.init_graphormer_params)
def _set_gradient_checkpointing(self, module, value=False):
if isinstance(module, GraphormerModel):
module.gradient_checkpointing = value
class GraphormerModel(GraphormerPreTrainedModel):
"""The Graphormer model is a graph-encoder model.
It goes from a graph to its representation. If you want to use the model for a downstream classification task, use
GraphormerForGraphClassification instead. For any other downstream task, feel free to add a new class, or combine
this model with a downstream model of your choice, following the example in GraphormerForGraphClassification.
"""
def __init__(self, config):
super().__init__(config)
self.max_nodes = config.max_nodes
self.graph_encoder = GraphormerGraphEncoder(config)
self.share_input_output_embed = config.share_input_output_embed
self.lm_output_learned_bias = None
# remove_head is set to True during fine-tuning
self.load_softmax = not getattr(config, "remove_head", False)
self.lm_head_transform_weight = nn.Linear(config.embedding_dim, config.embedding_dim)
self.activation_fn = ACT2FN[config.activation_fn]
self.layer_norm = nn.LayerNorm(config.embedding_dim)
self.post_init()
def reset_output_layer_parameters(self):
self.lm_output_learned_bias = nn.Parameter(torch.zeros(1))
def forward(
self,
input_nodes,
input_edges,
attn_bias,
in_degree,
out_degree,
spatial_pos,
attn_edge_type,
perturb=None,
masked_tokens=None,
return_dict: Optional[bool] = True,
**unused
):
inner_states, graph_rep = self.graph_encoder(
input_nodes, input_edges, attn_bias, in_degree, out_degree, spatial_pos, attn_edge_type, perturb=perturb
)
# take the last inner state, then swap back the batch and graph-length dimensions
input_nodes = inner_states[-1].transpose(0, 1)
# project masked tokens only
if masked_tokens is not None:
raise NotImplementedError
input_nodes = self.layer_norm(self.activation_fn(self.lm_head_transform_weight(input_nodes)))
# project back to size of vocabulary
if self.share_input_output_embed and hasattr(self.graph_encoder.embed_tokens, "weight"):
input_nodes = torch.nn.functional.linear(input_nodes, self.graph_encoder.embed_tokens.weight)
if not return_dict:
return (input_nodes, inner_states)
return BaseModelOutputWithNoAttention(last_hidden_state=input_nodes, hidden_states=inner_states)
def max_nodes(self):
"""Maximum output length supported by the encoder."""
return self.max_nodes
class GraphormerForGraphClassification(GraphormerPreTrainedModel):
"""
This model can be used for graph-level classification or regression tasks.
It can be trained on
- regression (by setting config.num_classes to 1); there should be one float-type label per graph
- one task classification (by setting config.num_classes to the number of classes); there should be one integer
label per graph
- binary multi-task classification (by setting config.num_classes to the number of labels); there should be a list
of integer labels for each graph.
"""
def __init__(self, config):
super().__init__(config)
self.encoder = GraphormerModel(config)
self.embedding_dim = config.embedding_dim
self.num_classes = config.num_classes
self.classifier = GraphormerDecoderHead(self.embedding_dim, self.num_classes)
self.is_encoder_decoder = True
# Initialize weights and apply final processing
self.post_init()
def forward(
self,
input_nodes,
input_edges,
attn_bias,
in_degree,
out_degree,
spatial_pos,
attn_edge_type,
labels: Optional[torch.LongTensor] = None,
return_dict: Optional[bool] = True,
**unused,
) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
encoder_outputs = self.encoder(
input_nodes,
input_edges,
attn_bias,
in_degree,
out_degree,
spatial_pos,
attn_edge_type,
)
outputs, hidden_states = encoder_outputs["last_hidden_state"], encoder_outputs["hidden_states"]
head_outputs = self.classifier(outputs)
logits = head_outputs[:, 0, :].contiguous()
if labels is not None:
mask = ~torch.isnan(labels)
if self.num_classes == 1: # regression
loss_fct = MSELoss()
loss = loss_fct(logits[mask].squeeze(), labels[mask].squeeze().float())
elif self.num_classes > 1 and len(labels.shape) == 1: # One task classification
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits[mask].view(-1, self.num_classes), labels[mask].view(-1))
else: # Binary multi-task classification
loss_fct = BCEWithLogitsLoss(reduction="sum")
loss = loss_fct(logits[mask], labels[mask])
if not return_dict:
return (loss, logits, hidden_states)
return SequenceClassifierOutput(loss=loss, logits=logits, hidden_states=hidden_states, attentions=None)
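# A minimal end-to-end sketch with random, suitably shaped inputs (illustrative only; the sizes
# below are arbitrary and real inputs come from the Graphormer preprocessing and collation code):
if __name__ == "__main__":
    config = GraphormerConfig(
        num_classes=2, num_hidden_layers=2, embedding_dim=64, ffn_embedding_dim=64, num_attention_heads=4
    )
    model = GraphormerForGraphClassification(config)
    n_graph, n_node, max_dist, node_feat, edge_feat = 2, 16, 5, 9, 3
    batch = {
        "input_nodes": torch.randint(1, 64, (n_graph, n_node, node_feat)),
        "input_edges": torch.randint(1, 64, (n_graph, n_node, n_node, max_dist, edge_feat)),
        "attn_bias": torch.zeros(n_graph, n_node + 1, n_node + 1),
        "in_degree": torch.randint(1, 64, (n_graph, n_node)),
        "out_degree": torch.randint(1, 64, (n_graph, n_node)),
        "spatial_pos": torch.randint(1, 64, (n_graph, n_node, n_node)),
        "attn_edge_type": torch.randint(1, 64, (n_graph, n_node, n_node, edge_feat)),
        "labels": torch.randint(0, 2, (n_graph,)),
    }
    outputs = model(**batch)
    print(outputs.loss, outputs.logits.shape)  # logits: (n_graph, num_classes)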
...@@ -51,6 +51,7 @@ from .utils import (
is_apex_available,
is_bitsandbytes_available,
is_bs4_available,
is_cython_available,
is_decord_available,
is_detectron2_available,
is_faiss_available,
...@@ -711,6 +712,13 @@ def require_jumanpp(test_case):
return unittest.skipUnless(is_jumanpp_available(), "test requires jumanpp")(test_case)
def require_cython(test_case):
"""
Decorator marking a test that requires Cython
"""
return unittest.skipUnless(is_cython_available(), "test requires cython")(test_case)
def get_gpu_count():
"""
Return the number of available gpus (regardless of whether torch, tf or jax is used)
......