Unverified Commit 87208a05 authored by Clémentine Fourrier, committed by GitHub

Graphormer model for Graph Classification (#20968)



* [FT] First commit for graphormer architecture.

The model has no tokenizer, as it uses a collator and preprocessing function for its input management.
Architecture to be tested against original one.
The arch might need to be changed to fit the checkpoint, but a revert to the original arch will make the code less nice to read.
TODO: doc

* [FIX] removed test model

* [FIX] import error

* [FIX] black and flake

* [DOC] added paper refs

* [FIX] [DOC]

* [FIX] black

* [DOC] Updated READMEs

* [FIX] Order of imports + rm Tokenizer calls

* [FIX] Moved assert in class to prevent doc build failure

* [FIX] make fix-copies

* [Doc] update from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [FIX] Removed Graphormer from Sequence classification model list

* [DOC] Added HF copyright to Cython file

* [DOC] Fixed comments

* [FIX] typos in class doc + removed config classes.

Todo: update doc from paper definitions

* [FIX] Removed dependency to fairseq, and replaced all asserts with Exception management

* [FIX] Homogenized initialization of weights to pretrained constructor

* [FIX] [CP] Updated multi_hop parameter to get same results as in original implementation

* [DOC] Relevant parameter description in the configuration file

* [DOC] Updated doc and comments in main graphormer file

* [FIX] make style and quality checks

* [DOC] Fix doc format

* [FIX] [WIP] Updated part of the tests, though still a wip

* [FIX] [WIP]

* [FIX] repo consistency

* [FIX] Changed input names for more understandability

* [FIX] [BUG] updated num_classes params for propagation in the model

* simplified collator

* [FIX] Updated tests to follow new naming pattern

* [TESTS] Updated test suite along with model

* [FIX] rm tokenizer import

* [DOC] add link to Graphormer doc

* Changed section in doc from text model to graph model

* Apply suggestions from code review

Spacing, inits
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [DOC] Explain algos_graphormer functions

* Cython soft import protection

* Rm call to Callable in configuration graphormer

* [FIX] replaced asserts with Exceptions

* Add org to graphormer checkpoints

* Prefixed classes with Graphormer

* Management of init functions

* format

* fixes

* fix length file

* update indent

* relaunching ci

* Errors for missing cython imports

* fix style

* fix style doc
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 758bd39e
...@@ -334,6 +334,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
...
...@@ -327,6 +327,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
...
...@@ -299,6 +299,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (ओपनएआई से) साथ में पेपर [लैंग्वेज मॉडल्स अनसुपरवाइज्ड मल्टीटास्क लर्नर्स हैं](https://blog.openai.com/better-language-models/) एलेक रैडफोर्ड*, जेफरी वू*, रेवन चाइल्ड, डेविड लुआन, डारियो एमोडी* द्वारा * और इल्या सुत्सकेवर** ने पोस्ट किया।
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI से) साथ वाला पेपर [kingoflolz/mesh-transformer-jax](https://github. com/kingoflolz/mesh-transformer-jax/) बेन वांग और अरन कोमात्सुजाकी द्वारा।
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA से) साथ में कागज [GroupViT: टेक्स्ट सुपरविजन से सिमेंटिक सेगमेंटेशन इमर्जेस](https://arxiv .org/abs/2202.11094) जियारुई जू, शालिनी डी मेलो, सिफ़ी लियू, वोनमिन बायन, थॉमस ब्रेउएल, जान कौट्ज़, ज़ियाओलोंग वांग द्वारा।
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (फेसबुक से) साथ में पेपर [ह्यूबर्ट: सेल्फ सुपरवाइज्ड स्पीच रिप्रेजेंटेशन लर्निंग बाय मास्क्ड प्रेडिक्शन ऑफ हिडन यूनिट्स](https ://arxiv.org/abs/2106.07447) वेई-निंग सू, बेंजामिन बोल्टे, याओ-हंग ह्यूबर्ट त्साई, कुशाल लखोटिया, रुस्लान सालाखुतदीनोव, अब्देलरहमान मोहम्मद द्वारा।
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (बर्कले से) साथ में कागज [I-BERT: Integer-only BERT Quantization](https:// arxiv.org/abs/2101.01321) सेहून किम, अमीर घोलमी, ज़ेवेई याओ, माइकल डब्ल्यू महोनी, कर्ट केटज़र द्वारा।
...
...@@ -361,6 +361,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI から) Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** から公開された研究論文: [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/)
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI から) Ben Wang and Aran Komatsuzaki から公開されたレポジトリー [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/)
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (AI-Sweden から) Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren から公開された研究論文: [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf)
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (Microsoft から) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu から公開された研究論文: [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234).
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA から) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang から公開された研究論文: [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094)
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook から) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed から公開された研究論文: [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447)
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley から) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer から公開された研究論文: [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321)
...
...@@ -276,6 +276,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI 에서) Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**[Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 논문과 함께 발표했습니다.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (AI-Sweden 에서) Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. 의 [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) 논문과 함께 발표했습니다.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu 의 [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) 논문과 함께 발표했습니다.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA 에서) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 의 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 논문과 함께 발표했습니다.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook 에서) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 의 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 논문과 함께 발표했습니다.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley 에서) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 의 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 논문과 함께 발표했습니다.
...
...@@ -300,6 +300,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (来自 OpenAI) 伴随论文 [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 由 Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** 发布。
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (来自 EleutherAI) 伴随论文 [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) 由 Ben Wang and Aran Komatsuzaki 发布。
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (来自 UCSD, NVIDIA) 伴随论文 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 由 Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 发布。
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (来自 Facebook) 伴随论文 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 由 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 发布。
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (来自 Berkeley) 伴随论文 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 由 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 发布。
...
...@@ -312,6 +312,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released with the paper [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
...
...@@ -571,6 +571,11 @@
- local: model_doc/time_series_transformer
title: Time Series Transformer
title: Time series models
- isExpanded: false
sections:
- local: model_doc/graphormer
title: Graphormer
title: Graph models
title: Models
- sections:
- local: internal/modeling_utils
...
...@@ -113,6 +113,7 @@ The documentation is organized into five sections:
1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
...@@ -289,6 +290,7 @@ Flax), PyTorch, and/or TensorFlow.
| GPT NeoX Japanese | ✅ | ❌ | ✅ | ❌ | ❌ |
| GPT-J | ❌ | ❌ | ✅ | ✅ | ✅ |
| GPT-Sw3 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Graphormer | ❌ | ❌ | ✅ | ❌ | ❌ |
| GroupViT | ❌ | ❌ | ✅ | ✅ | ❌ |
| Hubert | ❌ | ❌ | ✅ | ✅ | ❌ |
| I-BERT | ❌ | ❌ | ✅ | ❌ | ❌ |
...
<!--Copyright 2022 The HuggingFace Team and Microsoft. All rights reserved.
Licensed under the MIT License; you may not use this file except in compliance with
the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Graphormer
## Overview
The Graphormer model was proposed in [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by
Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen and Tie-Yan Liu. It is a Graph Transformer model, modified to allow computations on graphs instead of text sequences by generating embeddings and features of interest during preprocessing and collation, then using a modified attention mechanism.
The abstract from the paper is the following:
*The Transformer architecture has become a dominant choice in many domains, such as natural language processing and computer vision. Yet, it has not achieved competitive performance on popular leaderboards of graph-level prediction compared to mainstream GNN variants. Therefore, it remains a mystery how Transformers could perform well for graph representation learning. In this paper, we solve this mystery by presenting Graphormer, which is built upon the standard Transformer architecture, and could attain excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods to help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and exhibit that with our ways of encoding the structural information of graphs, many popular GNN variants could be covered as the special cases of Graphormer.*
Tips:
This model will not work well on large graphs (more than 100 nodes/edges), as memory usage will explode.
You can reduce the batch size, increase your RAM, or decrease the `UNREACHABLE_NODE_DISTANCE` parameter in algos_graphormer.pyx, but it will be hard to go above 700 nodes/edges.
This model does not use a tokenizer, but instead a special collator during training.
This model was contributed by [clefourrier](https://huggingface.co/clefourrier). The original code can be found [here](https://github.com/microsoft/Graphormer).
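Below is a minimal, illustrative sketch of graph classification with this model. It assumes the checkpoint is published on the Hub as `clefourrier/graphormer-base-pcqm4mv2`, uses a toy graph instead of a real dataset, and imports the collator from its module path since it is not exported at the top level; Cython must be installed for the preprocessing step.

```python
import torch
from transformers import GraphormerForGraphClassification
from transformers.models.graphormer.collating_graphormer import GraphormerDataCollator

# A toy undirected 3-node path graph (0 - 1 - 2), stored edge-list style.
graph = {
    "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]],
    "num_nodes": 3,
    "y": [1],  # graph-level label
}

model = GraphormerForGraphClassification.from_pretrained("clefourrier/graphormer-base-pcqm4mv2")
collator = GraphormerDataCollator(on_the_fly_processing=True)  # runs preprocess_item on each graph

batch = collator([graph, graph])
labels = batch.pop("labels")  # keep this forward pass loss-free for the sketch
with torch.no_grad():
    outputs = model(**batch)
print(outputs["logits"].shape)
```

The same collator can be passed as `data_collator` to a `Trainer` when fine-tuning on a graph classification dataset.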
## GraphormerConfig
[[autodoc]] GraphormerConfig
## GraphormerModel
[[autodoc]] GraphormerModel
- forward
## GraphormerForGraphClassification
[[autodoc]] GraphormerForGraphClassification
- forward
...@@ -271,6 +271,7 @@ _import_structure = {
"models.gpt_neox_japanese": ["GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoXJapaneseConfig"],
"models.gpt_sw3": [],
"models.gptj": ["GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTJConfig"],
"models.graphormer": ["GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "GraphormerConfig"],
"models.groupvit": [ "models.groupvit": [
"GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP",
"GroupViTConfig", "GroupViTConfig",
...@@ -1539,6 +1540,14 @@ else: ...@@ -1539,6 +1540,14 @@ else:
"GPTJPreTrainedModel", "GPTJPreTrainedModel",
] ]
) )
_import_structure["models.graphormer"].extend(
[
"GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
"GraphormerForGraphClassification",
"GraphormerModel",
"GraphormerPreTrainedModel",
]
)
_import_structure["models.groupvit"].extend( _import_structure["models.groupvit"].extend(
[ [
"GROUPVIT_PRETRAINED_MODEL_ARCHIVE_LIST", "GROUPVIT_PRETRAINED_MODEL_ARCHIVE_LIST",
...@@ -3672,6 +3681,7 @@ if TYPE_CHECKING: ...@@ -3672,6 +3681,7 @@ if TYPE_CHECKING:
from .models.gpt_neox import GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXConfig from .models.gpt_neox import GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXConfig
from .models.gpt_neox_japanese import GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXJapaneseConfig from .models.gpt_neox_japanese import GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXJapaneseConfig
from .models.gptj import GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTJConfig from .models.gptj import GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTJConfig
from .models.graphormer import GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, GraphormerConfig
from .models.groupvit import (
GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP,
GroupViTConfig,
...@@ -4747,6 +4757,12 @@ if TYPE_CHECKING:
GPTJModel,
GPTJPreTrainedModel,
)
from .models.graphormer import (
GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
GraphormerForGraphClassification,
GraphormerModel,
GraphormerPreTrainedModel,
)
from .models.groupvit import (
GROUPVIT_PRETRAINED_MODEL_ARCHIVE_LIST,
GroupViTModel,
...
...@@ -82,6 +82,7 @@ from . import (
gpt_neox_japanese,
gpt_sw3,
gptj,
graphormer,
groupvit,
herbert,
hubert,
...
...@@ -86,6 +86,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("gpt_neox", "GPTNeoXConfig"),
("gpt_neox_japanese", "GPTNeoXJapaneseConfig"),
("gptj", "GPTJConfig"),
("graphormer", "GraphormerConfig"),
("groupvit", "GroupViTConfig"), ("groupvit", "GroupViTConfig"),
("hubert", "HubertConfig"), ("hubert", "HubertConfig"),
("ibert", "IBertConfig"), ("ibert", "IBertConfig"),
...@@ -247,6 +248,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict( ...@@ -247,6 +248,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("gpt_neox", "GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("gpt_neox", "GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gpt_neox_japanese", "GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("gpt_neox_japanese", "GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gptj", "GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("gptj", "GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("graphormer", "GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("groupvit", "GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("groupvit", "GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("hubert", "HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("hubert", "HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("ibert", "IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("ibert", "IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
...@@ -409,6 +411,7 @@ MODEL_NAMES_MAPPING = OrderedDict( ...@@ -409,6 +411,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("gpt_neox", "GPT NeoX"), ("gpt_neox", "GPT NeoX"),
("gpt_neox_japanese", "GPT NeoX Japanese"), ("gpt_neox_japanese", "GPT NeoX Japanese"),
("gptj", "GPT-J"), ("gptj", "GPT-J"),
("graphormer", "Graphormer"),
("groupvit", "GroupViT"), ("groupvit", "GroupViT"),
("herbert", "HerBERT"), ("herbert", "HerBERT"),
("hubert", "Hubert"), ("hubert", "Hubert"),
......
...@@ -85,6 +85,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("gpt_neox", "GPTNeoXModel"),
("gpt_neox_japanese", "GPTNeoXJapaneseModel"),
("gptj", "GPTJModel"),
("graphormer", "GraphormerModel"),
("groupvit", "GroupViTModel"), ("groupvit", "GroupViTModel"),
("hubert", "HubertModel"), ("hubert", "HubertModel"),
("ibert", "IBertModel"), ("ibert", "IBertModel"),
......
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
# rely on isort to merge the imports
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available
_import_structure = {
"configuration_graphormer": ["GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "GraphormerConfig"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_graphormer"] = [
"GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
"GraphormerForGraphClassification",
"GraphormerModel",
"GraphormerPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_graphormer import GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, GraphormerConfig
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_graphormer import (
GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
GraphormerForGraphClassification,
GraphormerModel,
GraphormerPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
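# Note (illustrative, not part of the original file): with the lazy structure above,
# `from transformers import GraphormerConfig` is always resolvable, while
# `from transformers import GraphormerModel` is only resolved when first requested
# and requires torch to be installed.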
# Copyright (c) Microsoft Corporation and HuggingFace
# Licensed under the MIT License.
import cython
cimport numpy
from cython.parallel cimport parallel, prange
import numpy as np
# Reduce this number if matrices are too big for large graphs
UNREACHABLE_NODE_DISTANCE = 510
def floyd_warshall(adjacency_matrix):
"""
Applies the Floyd-Warshall algorithm to the adjacency matrix, to compute the
shortest paths distance between all nodes, up to UNREACHABLE_NODE_DISTANCE.
"""
(nrows, ncols) = adjacency_matrix.shape
assert nrows == ncols
cdef unsigned int n = nrows
adj_mat_copy = adjacency_matrix.astype(np.int32, order='C', casting='safe', copy=True)
assert adj_mat_copy.flags['C_CONTIGUOUS']
cdef numpy.ndarray[numpy.int32_t, ndim=2, mode='c'] M = adj_mat_copy
cdef numpy.ndarray[numpy.int32_t, ndim=2, mode='c'] path = -1 * np.ones([n, n], dtype=np.int32)
cdef unsigned int i, j, k
cdef numpy.int32_t M_ij, M_ik, cost_ikkj
cdef numpy.int32_t* M_ptr = &M[0,0]
cdef numpy.int32_t* M_i_ptr
cdef numpy.int32_t* M_k_ptr
# set unreachable nodes distance to UNREACHABLE_NODE_DISTANCE
for i in range(n):
for j in range(n):
if i == j:
M[i][j] = 0
elif M[i][j] == 0:
M[i][j] = UNREACHABLE_NODE_DISTANCE
# Floyd-Warshall algorithm
for k in range(n):
M_k_ptr = M_ptr + n*k
for i in range(n):
M_i_ptr = M_ptr + n*i
M_ik = M_i_ptr[k]
for j in range(n):
cost_ikkj = M_ik + M_k_ptr[j]
M_ij = M_i_ptr[j]
if M_ij > cost_ikkj:
M_i_ptr[j] = cost_ikkj
path[i][j] = k
# set unreachable path to UNREACHABLE_NODE_DISTANCE
for i in range(n):
for j in range(n):
if M[i][j] >= UNREACHABLE_NODE_DISTANCE:
path[i][j] = UNREACHABLE_NODE_DISTANCE
M[i][j] = UNREACHABLE_NODE_DISTANCE
return M, path
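# Illustrative example (not part of the original file): for the 3-node path graph with adjacency
#   [[0, 1, 0],
#    [1, 0, 1],
#    [0, 1, 0]]
# floyd_warshall returns the distance matrix
#   [[0, 1, 2],
#    [1, 0, 1],
#    [2, 1, 0]]
# and path[0][2] == 1: the shortest route from node 0 to node 2 passes through node 1
# (entries of path stay -1 wherever the direct edge is already optimal).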
def get_all_edges(path, i, j):
"""
Recursive function to compute all possible paths between two nodes from the graph adjacency matrix.
"""
cdef int k = path[i][j]
if k == -1:
return []
else:
return get_all_edges(path, i, k) + [k] + get_all_edges(path, k, j)
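# Continuing the toy example above: get_all_edges(path, 0, 2) == [1], the single intermediate
# node on the shortest path from node 0 to node 2; directly connected pairs return [].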
def gen_edge_input(max_dist, path, edge_feat):
"""
Generates the full edge feature and adjacency matrix.
Shape: num_nodes * num_nodes * max_distance_between_nodes * num_edge_features
Dim 1 is the input node, dim 2 the output node of the edge, dim 3 the depth of the edge, dim 4 the feature
"""
(nrows, ncols) = path.shape
assert nrows == ncols
cdef unsigned int n = nrows
cdef unsigned int max_dist_copy = max_dist
path_copy = path.astype(long, order='C', casting='safe', copy=True)
edge_feat_copy = edge_feat.astype(long, order='C', casting='safe', copy=True)
assert path_copy.flags['C_CONTIGUOUS']
assert edge_feat_copy.flags['C_CONTIGUOUS']
cdef numpy.ndarray[numpy.int32_t, ndim=4, mode='c'] edge_fea_all = -1 * np.ones([n, n, max_dist_copy, edge_feat.shape[-1]], dtype=np.int32)
cdef unsigned int i, j, k, num_path, cur
for i in range(n):
for j in range(n):
if i == j:
continue
if path_copy[i][j] == UNREACHABLE_NODE_DISTANCE:
continue
path = [i] + get_all_edges(path_copy, i, j) + [j]
num_path = len(path) - 1
for k in range(num_path):
edge_fea_all[i, j, k, :] = edge_feat_copy[path[k], path[k+1], :]
return edge_fea_all
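# Continuing the toy example above: with max_dist == 2 and a single edge feature,
# edge_fea_all has shape (3, 3, 2, 1); edge_fea_all[0, 2, 0] holds the feature of edge (0, 1)
# and edge_fea_all[0, 2, 1] that of edge (1, 2), while unused slots keep the value -1.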
# Copyright (c) Microsoft Corporation and HuggingFace
# Licensed under the MIT License.
from typing import Any, Dict, List, Mapping
import numpy as np
import torch
from ...utils import is_cython_available, requires_backends
if is_cython_available():
import pyximport
pyximport.install(setup_args={"include_dirs": np.get_include()})
from . import algos_graphormer # noqa E402
def convert_to_single_emb(x, offset: int = 512):
feature_num = x.shape[1] if len(x.shape) > 1 else 1
feature_offset = 1 + np.arange(0, feature_num * offset, offset, dtype=np.int64)
x = x + feature_offset
return x
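# Illustrative example (not part of the original file): with the default offset of 512,
# x = [[0, 1], [2, 3]] gets feature_offset = [1, 513] and becomes [[1, 514], [3, 516]]:
# each feature column is shifted into its own block of 512 ids, so all categorical
# features can share a single embedding table without collisions.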
def preprocess_item(item, keep_features=True):
requires_backends(preprocess_item, ["Cython"])
if not is_cython_available():
raise ImportError("Graphormer preprocessing needs Cython (pyximport)")
if keep_features and "edge_attr" in item.keys(): # edge_attr
edge_attr = np.asarray(item["edge_attr"], dtype=np.int64)
else:
edge_attr = np.ones((len(item["edge_index"][0]), 1), dtype=np.int64) # same embedding for all
if keep_features and "node_feat" in item.keys(): # input_nodes
node_feature = np.asarray(item["node_feat"], dtype=np.int64)
else:
node_feature = np.ones((item["num_nodes"], 1), dtype=np.int64) # same embedding for all
edge_index = np.asarray(item["edge_index"], dtype=np.int64)
input_nodes = convert_to_single_emb(node_feature) + 1
num_nodes = item["num_nodes"]
if len(edge_attr.shape) == 1:
edge_attr = edge_attr[:, None]
attn_edge_type = np.zeros([num_nodes, num_nodes, edge_attr.shape[-1]], dtype=np.int64)
attn_edge_type[edge_index[0], edge_index[1]] = convert_to_single_emb(edge_attr) + 1
# node adj matrix [num_nodes, num_nodes] bool
adj = np.zeros([num_nodes, num_nodes], dtype=bool)
adj[edge_index[0], edge_index[1]] = True
shortest_path_result, path = algos_graphormer.floyd_warshall(adj)
max_dist = np.amax(shortest_path_result)
input_edges = algos_graphormer.gen_edge_input(max_dist, path, attn_edge_type)
attn_bias = np.zeros([num_nodes + 1, num_nodes + 1], dtype=np.single) # with graph token
# combine
item["input_nodes"] = input_nodes + 1 # we shift all indices by one for padding
item["attn_bias"] = attn_bias
item["attn_edge_type"] = attn_edge_type
item["spatial_pos"] = shortest_path_result.astype(np.int64) + 1 # we shift all indices by one for padding
item["in_degree"] = np.sum(adj, axis=1).reshape(-1) + 1 # we shift all indices by one for padding
item["out_degree"] = item["in_degree"] # for undirected graph
item["input_edges"] = input_edges + 1 # we shift all indices by one for padding
if "labels" not in item:
item["labels"] = item["y"]
return item
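# Illustrative example (not part of the original file): for a 3-node path graph
# (edges 0-1 and 1-2, both directions) with no node or edge features, this adds
#   input_nodes (3, 1), attn_bias (4, 4), attn_edge_type (3, 3, 1), spatial_pos (3, 3),
#   in_degree == out_degree (3,), input_edges (3, 3, 2, 1)
# on top of the original keys, with all indices shifted by one so that 0 can serve as padding.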
class GraphormerDataCollator:
def __init__(self, spatial_pos_max=20, on_the_fly_processing=False):
if not is_cython_available():
raise ImportError("Graphormer preprocessing needs Cython (pyximport)")
self.spatial_pos_max = spatial_pos_max
self.on_the_fly_processing = on_the_fly_processing
def __call__(self, features: List[dict]) -> Dict[str, Any]:
if self.on_the_fly_processing:
features = [preprocess_item(i) for i in features]
if not isinstance(features[0], Mapping):
features = [vars(f) for f in features]
batch = {}
max_node_num = max(len(i["input_nodes"]) for i in features)
node_feat_size = len(features[0]["input_nodes"][0])
edge_feat_size = len(features[0]["attn_edge_type"][0][0])
max_dist = max(len(i["input_edges"][0][0]) for i in features)
edge_input_size = len(features[0]["input_edges"][0][0][0])
batch_size = len(features)
batch["attn_bias"] = torch.zeros(batch_size, max_node_num + 1, max_node_num + 1, dtype=torch.float)
batch["attn_edge_type"] = torch.zeros(batch_size, max_node_num, max_node_num, edge_feat_size, dtype=torch.long)
batch["spatial_pos"] = torch.zeros(batch_size, max_node_num, max_node_num, dtype=torch.long)
batch["in_degree"] = torch.zeros(batch_size, max_node_num, dtype=torch.long)
batch["input_nodes"] = torch.zeros(batch_size, max_node_num, node_feat_size, dtype=torch.long)
batch["input_edges"] = torch.zeros(
batch_size, max_node_num, max_node_num, max_dist, edge_input_size, dtype=torch.long
)
for ix, f in enumerate(features):
for k in ["attn_bias", "attn_edge_type", "spatial_pos", "in_degree", "input_nodes", "input_edges"]:
f[k] = torch.tensor(f[k])
if len(f["attn_bias"][1:, 1:][f["spatial_pos"] >= self.spatial_pos_max]) > 0:
f["attn_bias"][1:, 1:][f["spatial_pos"] >= self.spatial_pos_max] = float("-inf")
batch["attn_bias"][ix, : f["attn_bias"].shape[0], : f["attn_bias"].shape[1]] = f["attn_bias"]
batch["attn_edge_type"][ix, : f["attn_edge_type"].shape[0], : f["attn_edge_type"].shape[1], :] = f[
"attn_edge_type"
]
batch["spatial_pos"][ix, : f["spatial_pos"].shape[0], : f["spatial_pos"].shape[1]] = f["spatial_pos"]
batch["in_degree"][ix, : f["in_degree"].shape[0]] = f["in_degree"]
batch["input_nodes"][ix, : f["input_nodes"].shape[0], :] = f["input_nodes"]
batch["input_edges"][
ix, : f["input_edges"].shape[0], : f["input_edges"].shape[1], : f["input_edges"].shape[2], :
] = f["input_edges"]
batch["out_degree"] = batch["in_degree"]
sample = features[0]["labels"]
if len(sample) == 1: # one task
if isinstance(sample[0], float): # regression
batch["labels"] = torch.from_numpy(np.concatenate([i["labels"] for i in features]))
else: # binary classification
batch["labels"] = torch.from_numpy(np.concatenate([i["labels"] for i in features]))
else: # multi task classification, left to float to keep the NaNs
batch["labels"] = torch.from_numpy(np.stack([i["labels"] for i in features], dim=0))
return batch
# coding=utf-8
# Copyright 2022 Microsoft, clefourrier and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Graphormer model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
# pcqm4mv1 now deprecated
"graphormer-base": "https://huggingface.co/graphormer-base-pcqm4mv2/resolve/main/config.json",
# See all Graphormer models at https://huggingface.co/models?filter=graphormer
}
class GraphormerConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`~GraphormerModel`]. It is used to instantiate a
Graphormer model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the Graphormer
[graphormer-base-pcqm4mv1](https://huggingface.co/graphormer-base-pcqm4mv1) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
num_classes (`int`, *optional*, defaults to 2):
Number of target classes or labels, set to 1 if the task is a regression task.
num_atoms (`int`, *optional*, defaults to 512*9):
Number of node types in the graphs.
num_edges (`int`, *optional*, defaults to 512*3):
Number of edge types in the graph.
num_in_degree (`int`, *optional*, defaults to 512):
Number of in-degree types in the input graphs.
num_out_degree (`int`, *optional*, defaults to 512):
Number of out-degree types in the input graphs.
num_edge_dis (`int`, *optional*, defaults to 128):
Number of edge distance (edge_dis) types in the input graphs.
multi_hop_max_dist (`int`, *optional*, defaults to 5):
Maximum distance of multi-hop edges between two nodes.
spatial_pos_max (`int`, *optional*, defaults to 1024):
Maximum distance between nodes in the graph attention bias matrices, used during preprocessing and
collation.
edge_type (`str`, *optional*, defaults to `"multi_hop"`):
Type of edge relation chosen.
max_nodes (`int`, *optional*, defaults to 512):
Maximum number of nodes which can be parsed for the input graphs.
share_input_output_embed (`bool`, *optional*, defaults to `False`):
Shares the embedding layer between encoder and decoder - careful, True is not implemented.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the encoder.
embedding_dim (`int`, *optional*, defaults to 768):
Dimension of the embedding layer in encoder.
ffn_embedding_dim (`int`, *optional*, defaults to 768):
Dimension of the "intermediate" (often named feed-forward) layer in encoder.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads in the encoder.
self_attention (`bool`, *optional*, defaults to `True`):
Whether the model is self-attentive (`False` is not implemented).
activation_fn (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for the attention weights.
activation_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability after activation in the FFN.
layerdrop (`float`, *optional*, defaults to 0.0):
The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
for more details.
bias (`bool`, *optional*, defaults to `True`):
Whether to use bias in the attention module (changing this is unsupported at the moment).
embed_scale (`float`, *optional*, defaults to `None`):
Scaling factor for the node embeddings.
num_trans_layers_to_freeze (`int`, *optional*, defaults to 0):
Number of transformer layers to freeze.
encoder_normalize_before (`bool`, *optional*, defaults to `False`):
Normalize features before encoding the graph.
pre_layernorm (`bool`, *optional*, defaults to `False`):
Apply layernorm before self attention and the feed forward network. Without this, post layernorm will be
used.
apply_graphormer_init (`bool`, *optional*, defaults to `False`):
Apply a custom graphormer initialisation to the model before training.
freeze_embeddings (`bool`, *optional*, defaults to `False`):
Freeze the embedding layer, or train it along the model.
q_noise (`float`, *optional*, defaults to 0.0):
Amount of quantization noise (see "Training with Quantization Noise for Extreme Model Compression"); for
more detail, see fairseq's documentation on quant_noise.
qn_block_size (`int`, *optional*, defaults to 8):
Size of the blocks for subsequent quantization with iPQ (see q_noise).
kdim (`int`, *optional*, defaults to `None`):
Dimension of the key in the attention, if different from the other values.
vdim (`int`, *optional*, defaults to `None`):
Dimension of the value in the attention, if different from the other values.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
traceable (`bool`, *optional*, defaults to `False`):
Changes return value of the encoder's inner_state to stacked tensors.
Example:
```python
>>> from transformers import GraphormerForGraphClassification, GraphormerConfig
>>> # Initializing a Graphormer graphormer-base-pcqm4mv2 style configuration
>>> configuration = GraphormerConfig()
>>> # Initializing a model from the graphormer-base-pcqm4mv2 style configuration
>>> model = GraphormerForGraphClassification(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
"""
model_type = "graphormer"
keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
num_classes: int = 2,
num_atoms: int = 512 * 9,
num_edges: int = 512 * 3,
num_in_degree: int = 512,
num_out_degree: int = 512,
num_spatial: int = 512,
num_edge_dis: int = 128,
multi_hop_max_dist: int = 5, # sometimes is 20
spatial_pos_max: int = 1024,
edge_type: str = "multi_hop",
max_nodes: int = 512,
share_input_output_embed: bool = False,
num_hidden_layers: int = 12,
embedding_dim: int = 768,
ffn_embedding_dim: int = 768,
num_attention_heads: int = 32,
dropout: float = 0.1,
attention_dropout: float = 0.1,
activation_dropout: float = 0.1,
layerdrop: float = 0.0,
encoder_normalize_before: bool = False,
pre_layernorm: bool = False,
apply_graphormer_init: bool = False,
activation_fn: str = "gelu",
embed_scale: float = None,
freeze_embeddings: bool = False,
num_trans_layers_to_freeze: int = 0,
traceable: bool = False,
q_noise: float = 0.0,
qn_block_size: int = 8,
kdim: int = None,
vdim: int = None,
bias: bool = True,
self_attention: bool = True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
**kwargs,
):
self.num_classes = num_classes
self.num_atoms = num_atoms
self.num_in_degree = num_in_degree
self.num_out_degree = num_out_degree
self.num_edges = num_edges
self.num_spatial = num_spatial
self.num_edge_dis = num_edge_dis
self.edge_type = edge_type
self.multi_hop_max_dist = multi_hop_max_dist
self.spatial_pos_max = spatial_pos_max
self.max_nodes = max_nodes
self.num_hidden_layers = num_hidden_layers
self.embedding_dim = embedding_dim
self.hidden_size = embedding_dim
self.ffn_embedding_dim = ffn_embedding_dim
self.num_attention_heads = num_attention_heads
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.layerdrop = layerdrop
self.encoder_normalize_before = encoder_normalize_before
self.pre_layernorm = pre_layernorm
self.apply_graphormer_init = apply_graphormer_init
self.activation_fn = activation_fn
self.embed_scale = embed_scale
self.freeze_embeddings = freeze_embeddings
self.num_trans_layers_to_freeze = num_trans_layers_to_freeze
self.share_input_output_embed = share_input_output_embed
self.traceable = traceable
self.q_noise = q_noise
self.qn_block_size = qn_block_size
# These parameters are here for future extensions
# atm, the model only supports self attention
self.kdim = kdim
self.vdim = vdim
self.self_attention = self_attention
self.bias = bias
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
**kwargs,
)
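To complement the docstring example above, a short, hedged sketch of overriding a few of the hyperparameters defined in the signature; the values are arbitrary illustrations, and `GraphormerForGraphClassification` is taken from the docstring example rather than from this file.

```python
from transformers import GraphormerConfig, GraphormerForGraphClassification

# Arbitrary, illustrative overrides of the defaults defined in __init__ above.
config = GraphormerConfig(
    num_classes=1,           # single target, regression-style per the docstring
    num_hidden_layers=6,     # smaller encoder than the 12-layer default
    embedding_dim=256,
    ffn_embedding_dim=256,
    num_attention_heads=8,
    multi_hop_max_dist=5,
    edge_type="multi_hop",
)
model = GraphormerForGraphClassification(config)
print(config.hidden_size)  # mirrors embedding_dim (256), as set in __init__
```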
...@@ -51,6 +51,7 @@ from .utils import (
is_apex_available,
is_bitsandbytes_available,
is_bs4_available,
is_cython_available,
is_decord_available,
is_detectron2_available,
is_faiss_available,
...@@ -711,6 +712,13 @@ def require_jumanpp(test_case):
return unittest.skipUnless(is_jumanpp_available(), "test requires jumanpp")(test_case)

def require_cython(test_case):
"""
Decorator marking a test that requires Cython
"""
return unittest.skipUnless(is_cython_available(), "test requires cython")(test_case)

def get_gpu_count():
"""
Return the number of available gpus (regardless of whether torch, tf or jax is used)
......
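As a hedged illustration of how the new `require_cython` decorator might be applied in a test file (the test class and its body are hypothetical):

```python
import unittest

from transformers.testing_utils import require_cython


@require_cython  # the test is skipped automatically when Cython is not installed
class GraphormerPreprocessingTest(unittest.TestCase):  # hypothetical test case
    def test_collation_runs(self):
        # Placeholder assertion standing in for a real preprocessing check.
        self.assertTrue(True)
```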