Unverified Commit 87208a05 authored by Clémentine Fourrier, committed by GitHub

Graphormer model for Graph Classification (#20968)



* [FT] First commit for graphormer architecture.

The model has no tokenizer, as it uses a collator and preprocessing function for its input management.
Architecture to be tested against original one.
The arch might need to be changed to fit the checkpoint, but a revert to the original arch will make the code less nice to read.
TODO: doc

* [FIX] removed test model

* [FIX] import error

* [FIX] black and flake

* [DOC] added paper refs

* [FIX] [DOC]

* [FIX] black

* [DOC] Updated READMEs

* [FIX] Order of imports + rm Tokenizer calls

* [FIX] Moved assert in class to prevent doc build failure

* [FIX] make fix-copies

* [Doc] update from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [FIX] Removed Graphormer from Sequence classification model list

* [DOC] Added HF copyright to Cython file

* [DOC] Fixed comments

* [FIX] typos in class doc + removed config classes.

Todo: update doc from paper definitions

* [FIX] Removed dependency to fairseq, and replaced all asserts with Exception management

* [FIX] Homogenized initialization of weights to pretrained constructor

* [FIX] [CP] Updated multi_hop parameter to get same results as in original implementation

* [DOC] Relevant parameter description in the configuration file

* [DOC] Updated doc and comments in main graphormer file

* [FIX] make style and quality checks

* [DOC] Fix doc format

* [FIX] [WIP] Updated part of the tests, though still a wip

* [FIX] [WIP]

* [FIX] repo consistency

* [FIX] Changed input names for more understandability

* [FIX] [BUG] updated num_classes params for propagation in the model

* simplified collator

* [FIX] Updated tests to follow new naming pattern

* [TESTS] Updated test suite along with model

* [FIX] rm tokenizer import

* [DOC] add link to graphormerdoc

* Changed section in doc from text model to graph model

* Apply suggestions from code review

Spacing, inits
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* [DOC] Explain algos_graphormer functions

* Cython soft import protection

* Rm call to Callable in configuration graphormer

* [FIX] replaced asserts with Exceptions

* Add org to graphormer checkpoints

* Prefixed classes with Graphormer

* Management of init functions

* format

* fixes

* fix length file

* update indent

* relaunching ci

* Errors for missing cython imports

* fix style

* fix style doc
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
parent 758bd39e
@@ -334,6 +334,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
...
@@ -327,6 +327,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
...
@@ -299,6 +299,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (ओपनएआई से) साथ में पेपर [लैंग्वेज मॉडल्स अनसुपरवाइज्ड मल्टीटास्क लर्नर्स हैं](https://blog.openai.com/better-language-models/) एलेक रैडफोर्ड*, जेफरी वू*, रेवन चाइल्ड, डेविड लुआन, डारियो एमोडी* द्वारा * और इल्या सुत्सकेवर** ने पोस्ट किया।
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI से) साथ वाला पेपर [kingoflolz/mesh-transformer-jax](https://github. com/kingoflolz/mesh-transformer-jax/) बेन वांग और अरन कोमात्सुजाकी द्वारा।
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA से) साथ में कागज [GroupViT: टेक्स्ट सुपरविजन से सिमेंटिक सेगमेंटेशन इमर्जेस](https://arxiv .org/abs/2202.11094) जियारुई जू, शालिनी डी मेलो, सिफ़ी लियू, वोनमिन बायन, थॉमस ब्रेउएल, जान कौट्ज़, ज़ियाओलोंग वांग द्वारा।
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (फेसबुक से) साथ में पेपर [ह्यूबर्ट: सेल्फ सुपरवाइज्ड स्पीच रिप्रेजेंटेशन लर्निंग बाय मास्क्ड प्रेडिक्शन ऑफ हिडन यूनिट्स](https ://arxiv.org/abs/2106.07447) वेई-निंग सू, बेंजामिन बोल्टे, याओ-हंग ह्यूबर्ट त्साई, कुशाल लखोटिया, रुस्लान सालाखुतदीनोव, अब्देलरहमान मोहम्मद द्वारा।
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (बर्कले से) साथ में कागज [I-BERT: Integer-only BERT Quantization](https:// arxiv.org/abs/2101.01321) सेहून किम, अमीर घोलमी, ज़ेवेई याओ, माइकल डब्ल्यू महोनी, कर्ट केटज़र द्वारा।
...
@@ -361,6 +361,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI から) Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** から公開された研究論文: [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/)
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI から) Ben Wang and Aran Komatsuzaki から公開されたレポジトリー [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/)
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (AI-Sweden から) Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren から公開された研究論文: [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf)
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (Microsoft から) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu から公開された研究論文: [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234).
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA から) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang から公開された研究論文: [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094)
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook から) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed から公開された研究論文: [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447)
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley から) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer から公開された研究論文: [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321)
...
@@ -276,6 +276,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI 에서) Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**[Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 논문과 함께 발표했습니다.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (AI-Sweden 에서) Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. 의 [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) 논문과 함께 발표했습니다.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu 의 [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) 논문과 함께 발표했습니다.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA 에서) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 의 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 논문과 함께 발표했습니다.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook 에서) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 의 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 논문과 함께 발표했습니다.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley 에서) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 의 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 논문과 함께 발표했습니다.
...
@@ -300,6 +300,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (来自 OpenAI) 伴随论文 [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 由 Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** 发布。
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (来自 EleutherAI) 伴随论文 [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) 由 Ben Wang and Aran Komatsuzaki 发布。
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (来自 UCSD, NVIDIA) 伴随论文 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 由 Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 发布。
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (来自 Facebook) 伴随论文 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 由 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 发布。
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (来自 Berkeley) 伴随论文 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 由 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 发布。
...
@@ -312,6 +312,7 @@ conda install -c huggingface transformers
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released with the paper [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](https://huggingface.co/docs/transformers/main/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](https://huggingface.co/docs/transformers/main/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
...
@@ -571,6 +571,11 @@
- local: model_doc/time_series_transformer
title: Time Series Transformer
title: Time series models
- isExpanded: false
sections:
- local: model_doc/graphormer
title: Graphormer
title: Graph models
title: Models
- sections:
- local: internal/modeling_utils
...
@@ -113,6 +113,7 @@ The documentation is organized into five sections:
1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPT-Sw3](model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.
1. **[Graphormer](model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu.
1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
@@ -289,6 +290,7 @@ Flax), PyTorch, and/or TensorFlow.
| GPT NeoX Japanese | ✅ | ❌ | ✅ | ❌ | ❌ |
| GPT-J | ❌ | ❌ | ✅ | ✅ | ✅ |
| GPT-Sw3 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Graphormer | ❌ | ❌ | ✅ | ❌ | ❌ |
| GroupViT | ❌ | ❌ | ✅ | ✅ | ❌ |
| Hubert | ❌ | ❌ | ✅ | ✅ | ❌ |
| I-BERT | ❌ | ❌ | ✅ | ❌ | ❌ |
...
<!--Copyright 2022 The HuggingFace Team and Microsoft. All rights reserved.
Licensed under the MIT License; you may not use this file except in compliance with
the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Graphormer
## Overview
The Graphormer model was proposed in [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by
Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen and Tie-Yan Liu. It is a Graph Transformer model, modified to allow computation on graphs instead of text sequences by generating embeddings and features of interest during preprocessing and collation, then using a modified attention.
The abstract from the paper is the following:
*The Transformer architecture has become a dominant choice in many domains, such as natural language processing and computer vision. Yet, it has not achieved competitive performance on popular leaderboards of graph-level prediction compared to mainstream GNN variants. Therefore, it remains a mystery how Transformers could perform well for graph representation learning. In this paper, we solve this mystery by presenting Graphormer, which is built upon the standard Transformer architecture, and could attain excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods to help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and exhibit that with our ways of encoding the structural information of graphs, many popular GNN variants could be covered as the special cases of Graphormer.*
Tips:
This model will not work well on large graphs (more than 100 nodes/edges), as memory usage will explode.
You can reduce the batch size, increase your RAM, or decrease the `UNREACHABLE_NODE_DISTANCE` parameter in `algos_graphormer.pyx`, but it will be hard to go above 700 nodes/edges.
This model does not use a tokenizer, but instead a special collator during training.
This model was contributed by [clefourrier](https://huggingface.co/clefourrier). The original code can be found [here](https://github.com/microsoft/Graphormer).
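Below is a minimal sketch of how toy graphs could be preprocessed, collated and passed to the model. It is illustrative rather than taken from the original documentation: the toy graphs and the `num_classes` value are made up, the collator import path follows this PR's file layout, and the printed attributes assume a standard classification output with `loss` and `logits`. Cython must be installed, since preprocessing compiles `algos_graphormer.pyx` on the fly via `pyximport`.

```python
from transformers import GraphormerConfig, GraphormerForGraphClassification
from transformers.models.graphormer.collating_graphormer import GraphormerDataCollator, preprocess_item

# Two toy graphs: `edge_index` holds [source nodes, target nodes], `y` is the graph-level label.
graphs = [
    {"edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]], "num_nodes": 3, "y": [1]},  # 3-node path graph
    {"edge_index": [[0, 1], [1, 0]], "num_nodes": 2, "y": [0]},              # 2-node graph
]

# Preprocessing computes shortest paths, degrees, edge features and attention biases for each graph.
dataset = [preprocess_item(graph) for graph in graphs]

# The collator (used instead of a tokenizer) pads graphs of different sizes into one batch of tensors.
collator = GraphormerDataCollator()
batch = collator(dataset)

model = GraphormerForGraphClassification(GraphormerConfig(num_classes=2))
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)
```

The same collator can be passed as `data_collator` to a `Trainer` together with a dataset of preprocessed graphs.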
## GraphormerConfig
[[autodoc]] GraphormerConfig
## GraphormerModel
[[autodoc]] GraphormerModel
- forward
## GraphormerForGraphClassification
[[autodoc]] GraphormerForGraphClassification
- forward
@@ -271,6 +271,7 @@ _import_structure = {
"models.gpt_neox_japanese": ["GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoXJapaneseConfig"],
"models.gpt_sw3": [],
"models.gptj": ["GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTJConfig"],
"models.graphormer": ["GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "GraphormerConfig"],
"models.groupvit": [
"GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP",
"GroupViTConfig",
@@ -1539,6 +1540,14 @@ else:
"GPTJPreTrainedModel",
]
)
_import_structure["models.graphormer"].extend(
[
"GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
"GraphormerForGraphClassification",
"GraphormerModel",
"GraphormerPreTrainedModel",
]
)
_import_structure["models.groupvit"].extend(
[
"GROUPVIT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -3672,6 +3681,7 @@ if TYPE_CHECKING:
from .models.gpt_neox import GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXConfig
from .models.gpt_neox_japanese import GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXJapaneseConfig
from .models.gptj import GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTJConfig
from .models.graphormer import GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, GraphormerConfig
from .models.groupvit import (
GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP,
GroupViTConfig,
@@ -4747,6 +4757,12 @@ if TYPE_CHECKING:
GPTJModel,
GPTJPreTrainedModel,
)
from .models.graphormer import (
GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
GraphormerForGraphClassification,
GraphormerModel,
GraphormerPreTrainedModel,
)
from .models.groupvit import (
GROUPVIT_PRETRAINED_MODEL_ARCHIVE_LIST,
GroupViTModel,
...
@@ -82,6 +82,7 @@ from . import (
gpt_neox_japanese,
gpt_sw3,
gptj,
graphormer,
groupvit,
herbert,
hubert,
...
@@ -86,6 +86,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("gpt_neox", "GPTNeoXConfig"),
("gpt_neox_japanese", "GPTNeoXJapaneseConfig"),
("gptj", "GPTJConfig"),
("graphormer", "GraphormerConfig"),
("groupvit", "GroupViTConfig"),
("hubert", "HubertConfig"),
("ibert", "IBertConfig"),
@@ -247,6 +248,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("gpt_neox", "GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gpt_neox_japanese", "GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("gptj", "GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("graphormer", "GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("groupvit", "GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("hubert", "HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("ibert", "IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -409,6 +411,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("gpt_neox", "GPT NeoX"),
("gpt_neox_japanese", "GPT NeoX Japanese"),
("gptj", "GPT-J"),
("graphormer", "Graphormer"),
("groupvit", "GroupViT"),
("herbert", "HerBERT"),
("hubert", "Hubert"),
...
@@ -85,6 +85,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("gpt_neox", "GPTNeoXModel"),
("gpt_neox_japanese", "GPTNeoXJapaneseModel"),
("gptj", "GPTJModel"),
("graphormer", "GraphormerModel"),
("groupvit", "GroupViTModel"),
("hubert", "HubertModel"),
("ibert", "IBertModel"),
...
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
# rely on isort to merge the imports
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available
_import_structure = {
"configuration_graphormer": ["GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "GraphormerConfig"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_graphormer"] = [
"GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
"GraphormerForGraphClassification",
"GraphormerModel",
"GraphormerPreTrainedModel",
]
if TYPE_CHECKING:
from .configuration_graphormer import GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, GraphormerConfig
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_graphormer import (
GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
GraphormerForGraphClassification,
GraphormerModel,
GraphormerPreTrainedModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
# Copyright (c) Microsoft Corporation and HuggingFace
# Licensed under the MIT License.
import cython
cimport numpy
from cython.parallel cimport parallel, prange
import numpy as np
# Reduce this number if matrices are too big for large graphs
UNREACHABLE_NODE_DISTANCE = 510
def floyd_warshall(adjacency_matrix):
"""
Applies the Floyd-Warshall algorithm to the adjacency matrix, to compute the
shortest paths distance between all nodes, up to UNREACHABLE_NODE_DISTANCE.
"""
(nrows, ncols) = adjacency_matrix.shape
assert nrows == ncols
cdef unsigned int n = nrows
adj_mat_copy = adjacency_matrix.astype(np.int32, order='C', casting='safe', copy=True)
assert adj_mat_copy.flags['C_CONTIGUOUS']
cdef numpy.ndarray[numpy.int32_t, ndim=2, mode='c'] M = adj_mat_copy
cdef numpy.ndarray[numpy.int32_t, ndim=2, mode='c'] path = -1 * np.ones([n, n], dtype=np.int32)
cdef unsigned int i, j, k
cdef numpy.int32_t M_ij, M_ik, cost_ikkj
cdef numpy.int32_t* M_ptr = &M[0,0]
cdef numpy.int32_t* M_i_ptr
cdef numpy.int32_t* M_k_ptr
# set unreachable nodes distance to UNREACHABLE_NODE_DISTANCE
for i in range(n):
for j in range(n):
if i == j:
M[i][j] = 0
elif M[i][j] == 0:
M[i][j] = UNREACHABLE_NODE_DISTANCE
# Floyd-Warshall algorithm main loop
for k in range(n):
M_k_ptr = M_ptr + n*k
for i in range(n):
M_i_ptr = M_ptr + n*i
M_ik = M_i_ptr[k]
for j in range(n):
cost_ikkj = M_ik + M_k_ptr[j]
M_ij = M_i_ptr[j]
if M_ij > cost_ikkj:
M_i_ptr[j] = cost_ikkj
path[i][j] = k
# set unreachable path to UNREACHABLE_NODE_DISTANCE
for i in range(n):
for j in range(n):
if M[i][j] >= UNREACHABLE_NODE_DISTANCE:
path[i][j] = UNREACHABLE_NODE_DISTANCE
M[i][j] = UNREACHABLE_NODE_DISTANCE
return M, path
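# Worked example (for illustration): for the 3-node path graph with adjacency
# [[0, 1, 0], [1, 0, 1], [0, 1, 0]], this returns M = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
# and path[0][2] == path[2][0] == 1 (node 1 is the recorded intermediate node);
# all other path entries remain -1.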
def get_all_edges(path, i, j):
"""
Recursive function to compute all possible paths between two nodes from the graph adjacency matrix.
"""
cdef int k = path[i][j]
if k == -1:
return []
else:
return get_all_edges(path, i, k) + [k] + get_all_edges(path, k, j)
def gen_edge_input(max_dist, path, edge_feat):
"""
Generates the full edge feature and adjacency matrix.
Shape: num_nodes * num_nodes * max_distance_between_nodes * num_edge_features
Dim 1 is the input node, dim 2 the output node of the edge, dim 3 the depth of the edge, dim 4 the feature
"""
(nrows, ncols) = path.shape
assert nrows == ncols
cdef unsigned int n = nrows
cdef unsigned int max_dist_copy = max_dist
path_copy = path.astype(long, order='C', casting='safe', copy=True)
edge_feat_copy = edge_feat.astype(long, order='C', casting='safe', copy=True)
assert path_copy.flags['C_CONTIGUOUS']
assert edge_feat_copy.flags['C_CONTIGUOUS']
cdef numpy.ndarray[numpy.int32_t, ndim=4, mode='c'] edge_fea_all = -1 * np.ones([n, n, max_dist_copy, edge_feat.shape[-1]], dtype=np.int32)
cdef unsigned int i, j, k, num_path, cur
for i in range(n):
for j in range(n):
if i == j:
continue
if path_copy[i][j] == UNREACHABLE_NODE_DISTANCE:
continue
path = [i] + get_all_edges(path_copy, i, j) + [j]
num_path = len(path) - 1
for k in range(num_path):
edge_fea_all[i, j, k, :] = edge_feat_copy[path[k], path[k+1], :]
return edge_fea_all
# Copyright (c) Microsoft Corporation and HuggingFace
# Licensed under the MIT License.
from typing import Any, Dict, List, Mapping
import numpy as np
import torch
from ...utils import is_cython_available, requires_backends
if is_cython_available():
import pyximport
pyximport.install(setup_args={"include_dirs": np.get_include()})
from . import algos_graphormer # noqa E402
# Shifts each feature column by a distinct offset so that all categorical features map to disjoint embedding index ranges.
def convert_to_single_emb(x, offset: int = 512):
feature_num = x.shape[1] if len(x.shape) > 1 else 1
feature_offset = 1 + np.arange(0, feature_num * offset, offset, dtype=np.int64)
x = x + feature_offset
return x
def preprocess_item(item, keep_features=True):
requires_backends(preprocess_item, ["Cython"])
if not is_cython_available():
raise ImportError("Graphormer preprocessing needs Cython (pyximport)")
if keep_features and "edge_attr" in item.keys(): # edge_attr
edge_attr = np.asarray(item["edge_attr"], dtype=np.int64)
else:
edge_attr = np.ones((len(item["edge_index"][0]), 1), dtype=np.int64) # same embedding for all
if keep_features and "node_feat" in item.keys(): # input_nodes
node_feature = np.asarray(item["node_feat"], dtype=np.int64)
else:
node_feature = np.ones((item["num_nodes"], 1), dtype=np.int64) # same embedding for all
edge_index = np.asarray(item["edge_index"], dtype=np.int64)
input_nodes = convert_to_single_emb(node_feature) + 1
num_nodes = item["num_nodes"]
if len(edge_attr.shape) == 1:
edge_attr = edge_attr[:, None]
attn_edge_type = np.zeros([num_nodes, num_nodes, edge_attr.shape[-1]], dtype=np.int64)
attn_edge_type[edge_index[0], edge_index[1]] = convert_to_single_emb(edge_attr) + 1
# node adj matrix [num_nodes, num_nodes] bool
adj = np.zeros([num_nodes, num_nodes], dtype=bool)
adj[edge_index[0], edge_index[1]] = True
shortest_path_result, path = algos_graphormer.floyd_warshall(adj)
max_dist = np.amax(shortest_path_result)
input_edges = algos_graphormer.gen_edge_input(max_dist, path, attn_edge_type)
attn_bias = np.zeros([num_nodes + 1, num_nodes + 1], dtype=np.single) # with graph token
# combine
item["input_nodes"] = input_nodes + 1 # we shift all indices by one for padding
item["attn_bias"] = attn_bias
item["attn_edge_type"] = attn_edge_type
item["spatial_pos"] = shortest_path_result.astype(np.int64) + 1 # we shift all indices by one for padding
item["in_degree"] = np.sum(adj, axis=1).reshape(-1) + 1 # we shift all indices by one for padding
item["out_degree"] = item["in_degree"] # for undirected graph
item["input_edges"] = input_edges + 1 # we shift all indices by one for padding
if "labels" not in item:
item["labels"] = item["y"]
return item
class GraphormerDataCollator:
def __init__(self, spatial_pos_max=20, on_the_fly_processing=False):
if not is_cython_available():
raise ImportError("Graphormer preprocessing needs Cython (pyximport)")
self.spatial_pos_max = spatial_pos_max
self.on_the_fly_processing = on_the_fly_processing
def __call__(self, features: List[dict]) -> Dict[str, Any]:
if self.on_the_fly_processing:
features = [preprocess_item(i) for i in features]
if not isinstance(features[0], Mapping):
features = [vars(f) for f in features]
batch = {}
max_node_num = max(len(i["input_nodes"]) for i in features)
node_feat_size = len(features[0]["input_nodes"][0])
edge_feat_size = len(features[0]["attn_edge_type"][0][0])
max_dist = max(len(i["input_edges"][0][0]) for i in features)
edge_input_size = len(features[0]["input_edges"][0][0][0])
batch_size = len(features)
batch["attn_bias"] = torch.zeros(batch_size, max_node_num + 1, max_node_num + 1, dtype=torch.float)
batch["attn_edge_type"] = torch.zeros(batch_size, max_node_num, max_node_num, edge_feat_size, dtype=torch.long)
batch["spatial_pos"] = torch.zeros(batch_size, max_node_num, max_node_num, dtype=torch.long)
batch["in_degree"] = torch.zeros(batch_size, max_node_num, dtype=torch.long)
batch["input_nodes"] = torch.zeros(batch_size, max_node_num, node_feat_size, dtype=torch.long)
batch["input_edges"] = torch.zeros(
batch_size, max_node_num, max_node_num, max_dist, edge_input_size, dtype=torch.long
)
for ix, f in enumerate(features):
for k in ["attn_bias", "attn_edge_type", "spatial_pos", "in_degree", "input_nodes", "input_edges"]:
f[k] = torch.tensor(f[k])
if len(f["attn_bias"][1:, 1:][f["spatial_pos"] >= self.spatial_pos_max]) > 0:
f["attn_bias"][1:, 1:][f["spatial_pos"] >= self.spatial_pos_max] = float("-inf")
batch["attn_bias"][ix, : f["attn_bias"].shape[0], : f["attn_bias"].shape[1]] = f["attn_bias"]
batch["attn_edge_type"][ix, : f["attn_edge_type"].shape[0], : f["attn_edge_type"].shape[1], :] = f[
"attn_edge_type"
]
batch["spatial_pos"][ix, : f["spatial_pos"].shape[0], : f["spatial_pos"].shape[1]] = f["spatial_pos"]
batch["in_degree"][ix, : f["in_degree"].shape[0]] = f["in_degree"]
batch["input_nodes"][ix, : f["input_nodes"].shape[0], :] = f["input_nodes"]
batch["input_edges"][
ix, : f["input_edges"].shape[0], : f["input_edges"].shape[1], : f["input_edges"].shape[2], :
] = f["input_edges"]
batch["out_degree"] = batch["in_degree"]
sample = features[0]["labels"]
if len(sample) == 1: # one task
if isinstance(sample[0], float): # regression
batch["labels"] = torch.from_numpy(np.concatenate([i["labels"] for i in features]))
else: # binary classification
batch["labels"] = torch.from_numpy(np.concatenate([i["labels"] for i in features]))
else: # multi task classification, left to float to keep the NaNs
batch["labels"] = torch.from_numpy(np.stack([i["labels"] for i in features], dim=0))
return batch
# coding=utf-8
# Copyright 2022 Microsoft, clefourrier and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Graphormer model configuration"""
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
# pcqm4mv1 now deprecated
"graphormer-base": "https://huggingface.co/graphormer-base-pcqm4mv2/resolve/main/config.json",
# See all Graphormer models at https://huggingface.co/models?filter=graphormer
}
class GraphormerConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`~GraphormerModel`]. It is used to instantiate a
Graphormer model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the Graphormer
[graphormer-base-pcqm4mv1](https://huggingface.co/graphormer-base-pcqm4mv1) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
num_classes (`int`, *optional*, defaults to 2):
Number of target classes or labels, set to 1 if the task is a regression task.
num_atoms (`int`, *optional*, defaults to 512*9):
Number of node types in the graphs.
num_edges (`int`, *optional*, defaults to 512*3):
Number of edge types in the graphs.
num_in_degree (`int`, *optional*, defaults to 512):
Number of in-degree types in the input graphs.
num_out_degree (`int`, *optional*, defaults to 512):
Number of out-degree types in the input graphs.
num_spatial (`int`, *optional*, defaults to 512):
Number of spatial position types in the input graphs.
num_edge_dis (`int`, *optional*, defaults to 128):
Number of edge distance types used by the multi-hop edge encoder.
multi_hop_max_dist (`int`, *optional*, defaults to 5):
Maximum distance of multi hop edges between two nodes.
spatial_pos_max (`int`, *optional*, defaults to 1024):
Maximum distance between nodes in the graph attention bias matrices, used during preprocessing and
collation.
edge_type (`str`, *optional*, defaults to `"multi_hop"`):
Type of edge relation chosen.
max_nodes (`int`, *optional*, defaults to 512):
Maximum number of nodes which can be parsed for the input graphs.
share_input_output_embed (`bool`, *optional*, defaults to `False`):
Shares the embedding layer between encoder and decoder - careful, True is not implemented.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the encoder.
embedding_dim (`int`, *optional*, defaults to 768):
Dimension of the embedding layer in encoder.
ffn_embedding_dim (`int`, *optional*, defaults to 768):
Dimension of the "intermediate" (often named feed-forward) layer in encoder.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads in the encoder.
self_attention (`bool`, *optional*, defaults to `True`):
Model is self attentive (False not implemented).
activation_fn (`str` or `function`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability for the attention weights.
activation_dropout (`float`, *optional*, defaults to 0.1):
The dropout probability after activation in the FFN.
layerdrop (`float`, *optional*, defaults to 0.0):
The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
for more details.
bias (`bool`, *optional*, defaults to `True`):
Uses bias in the attention module - unsupported at the moment.
embed_scale (`float`, *optional*, defaults to `None`):
Scaling factor for the node embeddings.
num_trans_layers_to_freeze (`int`, *optional*, defaults to 0):
Number of transformer layers to freeze.
encoder_normalize_before (`bool`, *optional*, defaults to `False`):
Normalize features before encoding the graph.
pre_layernorm (`bool`, *optional*, defaults to `False`):
Apply layernorm before self attention and the feed forward network. Without this, post layernorm will be
used.
apply_graphormer_init (`bool`, *optional*, defaults to `False`):
Apply a custom graphormer initialisation to the model before training.
freeze_embeddings (`bool`, *optional*, defaults to `False`):
Freeze the embedding layer, or train it along the model.
q_noise (`float`, *optional*, defaults to 0.0):
Amount of quantization noise (see "Training with Quantization Noise for Extreme Model Compression"). (For
more detail, see fairseq's documentation on quant_noise).
qn_block_size (`int`, *optional*, defaults to 8):
Size of the blocks for subsequent quantization with iPQ (see q_noise).
kdim (`int`, *optional*, defaults to None):
Dimension of the key in the attention, if different from the other values.
vdim (`int`, *optional*, defaults to None):
Dimension of the value in the attention, if different from the other values.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
traceable (`bool`, *optional*, defaults to `False`):
Changes return value of the encoder's inner_state to stacked tensors.
Example:
```python
>>> from transformers import GraphormerForGraphClassification, GraphormerConfig
>>> # Initializing a Graphormer graphormer-base-pcqm4mv2 style configuration
>>> configuration = GraphormerConfig()
>>> # Initializing a model from the graphormer-base-pcqm4mv2 style configuration
>>> model = GraphormerForGraphClassification(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
"""
model_type = "graphormer"
keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
num_classes: int = 2,
num_atoms: int = 512 * 9,
num_edges: int = 512 * 3,
num_in_degree: int = 512,
num_out_degree: int = 512,
num_spatial: int = 512,
num_edge_dis: int = 128,
multi_hop_max_dist: int = 5, # sometimes is 20
spatial_pos_max: int = 1024,
edge_type: str = "multi_hop",
max_nodes: int = 512,
share_input_output_embed: bool = False,
num_hidden_layers: int = 12,
embedding_dim: int = 768,
ffn_embedding_dim: int = 768,
num_attention_heads: int = 32,
dropout: float = 0.1,
attention_dropout: float = 0.1,
activation_dropout: float = 0.1,
layerdrop: float = 0.0,
encoder_normalize_before: bool = False,
pre_layernorm: bool = False,
apply_graphormer_init: bool = False,
activation_fn: str = "gelu",
embed_scale: float = None,
freeze_embeddings: bool = False,
num_trans_layers_to_freeze: int = 0,
traceable: bool = False,
q_noise: float = 0.0,
qn_block_size: int = 8,
kdim: int = None,
vdim: int = None,
bias: bool = True,
self_attention: bool = True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
**kwargs,
):
self.num_classes = num_classes
self.num_atoms = num_atoms
self.num_in_degree = num_in_degree
self.num_out_degree = num_out_degree
self.num_edges = num_edges
self.num_spatial = num_spatial
self.num_edge_dis = num_edge_dis
self.edge_type = edge_type
self.multi_hop_max_dist = multi_hop_max_dist
self.spatial_pos_max = spatial_pos_max
self.max_nodes = max_nodes
self.num_hidden_layers = num_hidden_layers
self.embedding_dim = embedding_dim
self.hidden_size = embedding_dim
self.ffn_embedding_dim = ffn_embedding_dim
self.num_attention_heads = num_attention_heads
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.layerdrop = layerdrop
self.encoder_normalize_before = encoder_normalize_before
self.pre_layernorm = pre_layernorm
self.apply_graphormer_init = apply_graphormer_init
self.activation_fn = activation_fn
self.embed_scale = embed_scale
self.freeze_embeddings = freeze_embeddings
self.num_trans_layers_to_freeze = num_trans_layers_to_freeze
self.share_input_output_embed = share_input_output_embed
self.traceable = traceable
self.q_noise = q_noise
self.qn_block_size = qn_block_size
# These parameters are here for future extensions
# atm, the model only supports self attention
self.kdim = kdim
self.vdim = vdim
self.self_attention = self_attention
self.bias = bias
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
**kwargs,
)
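# A minimal sketch of task-specific configurations (illustrative values only, not tied to a
# released checkpoint): `num_classes` follows the task, as described in
# GraphormerForGraphClassification.
#
#   GraphormerConfig(num_classes=1)    # graph regression, one float label per graph
#   GraphormerConfig(num_classes=10)   # single-task classification over 10 classes
#   GraphormerConfig(num_classes=128)  # binary multi-task classification over 128 targets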
# coding=utf-8
# Copyright 2022 Microsoft, clefourrier and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch Graphormer model."""
import math
from typing import Optional, Tuple, Union
import torch
import torch.nn as nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from ...activations import ACT2FN
from ...modeling_outputs import BaseModelOutputWithNoAttention, SequenceClassifierOutput
from ...modeling_utils import PreTrainedModel
from ...utils import logging
from .configuration_graphormer import GraphormerConfig
logger = logging.get_logger(__name__)
_CHECKPOINT_FOR_DOC = "graphormer-base-pcqm4mv1"
_CONFIG_FOR_DOC = "GraphormerConfig"
GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [
"clefourrier/graphormer-base-pcqm4mv1",
"clefourrier/graphormer-base-pcqm4mv2",
# See all Graphormer models at https://huggingface.co/models?filter=graphormer
]
def quant_noise(module, p, block_size):
"""
From:
https://github.com/facebookresearch/fairseq/blob/dd0079bde7f678b0cd0715cbd0ae68d661b7226d/fairseq/modules/quant_noise.py
Wraps modules and applies quantization noise to the weights for subsequent quantization with Iterative Product
Quantization as described in "Training with Quantization Noise for Extreme Model Compression"
Args:
- module: nn.Module
- p: amount of Quantization Noise
- block_size: size of the blocks for subsequent quantization with iPQ
Remarks:
- Module weights must have the right sizes wrt the block size
- Only Linear, Embedding and Conv2d modules are supported for the moment
- For more detail on how to quantize by blocks with convolutional weights, see "And the Bit Goes Down:
Revisiting the Quantization of Neural Networks"
- We implement the simplest form of noise here as stated in the paper which consists in randomly dropping
blocks
"""
# if no quantization noise, don't register hook
if p <= 0:
return module
# supported modules
if not isinstance(module, (nn.Linear, nn.Embedding, nn.Conv2d)):
raise NotImplementedError("Module unsupported for quant_noise.")
# test whether module.weight has the right sizes wrt block_size
is_conv = module.weight.ndim == 4
# 2D matrix
if not is_conv:
if module.weight.size(1) % block_size != 0:
raise AssertionError("Input features must be a multiple of block sizes")
# 4D matrix
else:
# 1x1 convolutions
if module.kernel_size == (1, 1):
if module.in_channels % block_size != 0:
raise AssertionError("Input channels must be a multiple of block sizes")
# regular convolutions
else:
k = module.kernel_size[0] * module.kernel_size[1]
if k % block_size != 0:
raise AssertionError("Kernel size must be a multiple of block size")
def _forward_pre_hook(mod, input):
# no noise for evaluation
if mod.training:
if not is_conv:
# gather weight and sizes
weight = mod.weight
in_features = weight.size(1)
out_features = weight.size(0)
# split weight matrix into blocks and randomly drop selected blocks
mask = torch.zeros(in_features // block_size * out_features, device=weight.device)
mask.bernoulli_(p)
mask = mask.repeat_interleave(block_size, -1).view(-1, in_features)
else:
# gather weight and sizes
weight = mod.weight
in_channels = mod.in_channels
out_channels = mod.out_channels
# split weight matrix into blocks and randomly drop selected blocks
if mod.kernel_size == (1, 1):
mask = torch.zeros(
int(in_channels // block_size * out_channels),
device=weight.device,
)
mask.bernoulli_(p)
mask = mask.repeat_interleave(block_size, -1).view(-1, in_channels)
else:
mask = torch.zeros(weight.size(0), weight.size(1), device=weight.device)
mask.bernoulli_(p)
mask = mask.unsqueeze(2).unsqueeze(3).repeat(1, 1, mod.kernel_size[0], mod.kernel_size[1])
# scale weights and apply mask
mask = mask.to(torch.bool) # x.bool() is not currently supported in TorchScript
s = 1 / (1 - p)
mod.weight.data = s * weight.masked_fill(mask, 0)
module.register_forward_pre_hook(_forward_pre_hook)
return module
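# A minimal sketch of how `quant_noise` is meant to be used (hypothetical layer sizes): with
# p=0.1 and block_size=8, roughly 10% of the 8-column weight blocks are zeroed at each training
# forward pass and the remaining weights are rescaled by 1 / (1 - p); in eval mode the hook is a
# no-op and the wrapped layer behaves like a plain nn.Linear.
#
#   layer = quant_noise(nn.Linear(64, 64), p=0.1, block_size=8)
#   layer.train()
#   _ = layer(torch.randn(2, 64))  # weights are stochastically masked inside the pre-hook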
class LayerDropModuleList(nn.ModuleList):
"""
From:
https://github.com/facebookresearch/fairseq/blob/dd0079bde7f678b0cd0715cbd0ae68d661b7226d/fairseq/modules/layer_drop.py
A LayerDrop implementation based on [`torch.nn.ModuleList`]. LayerDrop as described in
https://arxiv.org/abs/1909.11556.
We refresh the choice of which layers to drop every time we iterate over the LayerDropModuleList instance. During
evaluation we always iterate over all layers.
Usage:
```python
layers = LayerDropList(p=0.5, modules=[layer1, layer2, layer3])
for layer in layers: # this might iterate over layers 1 and 3
x = layer(x)
for layer in layers: # this might iterate over all layers
x = layer(x)
for layer in layers: # this might not iterate over any layers
x = layer(x)
```
Args:
p (float): probability of dropping out each layer
modules (iterable, optional): an iterable of modules to add
"""
def __init__(self, p, modules=None):
super().__init__(modules)
self.p = p
def __iter__(self):
dropout_probs = torch.empty(len(self)).uniform_()
for i, m in enumerate(super().__iter__()):
if not self.training or (dropout_probs[i] > self.p):
yield m
class GraphormerGraphNodeFeature(nn.Module):
"""
Compute node features for each node in the graph.
"""
def __init__(self, config):
super().__init__()
self.num_heads = config.num_attention_heads
self.num_atoms = config.num_atoms
self.atom_encoder = nn.Embedding(config.num_atoms + 1, config.hidden_size, padding_idx=config.pad_token_id)
self.in_degree_encoder = nn.Embedding(
config.num_in_degree, config.hidden_size, padding_idx=config.pad_token_id
)
self.out_degree_encoder = nn.Embedding(
config.num_out_degree, config.hidden_size, padding_idx=config.pad_token_id
)
self.graph_token = nn.Embedding(1, config.hidden_size)
def forward(self, input_nodes, in_degree, out_degree):
n_graph, n_node = input_nodes.size()[:2]
node_feature = ( # node feature + graph token
self.atom_encoder(input_nodes).sum(dim=-2) # [n_graph, n_node, n_hidden]
+ self.in_degree_encoder(in_degree)
+ self.out_degree_encoder(out_degree)
)
graph_token_feature = self.graph_token.weight.unsqueeze(0).repeat(n_graph, 1, 1)
graph_node_feature = torch.cat([graph_token_feature, node_feature], dim=1)
return graph_node_feature
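# Shape note (following the forward pass above): for `input_nodes` of shape
# (n_graph, n_node, n_node_features), the atom embeddings are summed over the feature axis and
# combined with the in-/out-degree embeddings, then the learned graph-token embedding is
# prepended, giving an output of shape (n_graph, n_node + 1, hidden_size).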
class GraphormerGraphAttnBias(nn.Module):
"""
Compute attention bias for each head.
"""
def __init__(self, config):
super().__init__()
self.num_heads = config.num_attention_heads
self.multi_hop_max_dist = config.multi_hop_max_dist
# We do not change edge feature embedding learning, as edge embeddings are represented as a combination of the original features
# + shortest path
self.edge_encoder = nn.Embedding(config.num_edges + 1, config.num_attention_heads, padding_idx=0)
self.edge_type = config.edge_type
if self.edge_type == "multi_hop":
self.edge_dis_encoder = nn.Embedding(
config.num_edge_dis * config.num_attention_heads * config.num_attention_heads,
1,
)
self.spatial_pos_encoder = nn.Embedding(config.num_spatial, config.num_attention_heads, padding_idx=0)
self.graph_token_virtual_distance = nn.Embedding(1, config.num_attention_heads)
def forward(self, input_nodes, attn_bias, spatial_pos, input_edges, attn_edge_type):
n_graph, n_node = input_nodes.size()[:2]
graph_attn_bias = attn_bias.clone()
graph_attn_bias = graph_attn_bias.unsqueeze(1).repeat(
1, self.num_heads, 1, 1
) # [n_graph, n_head, n_node+1, n_node+1]
# spatial pos
# [n_graph, n_node, n_node, n_head] -> [n_graph, n_head, n_node, n_node]
spatial_pos_bias = self.spatial_pos_encoder(spatial_pos).permute(0, 3, 1, 2)
graph_attn_bias[:, :, 1:, 1:] = graph_attn_bias[:, :, 1:, 1:] + spatial_pos_bias
# reset spatial pos here
t = self.graph_token_virtual_distance.weight.view(1, self.num_heads, 1)
graph_attn_bias[:, :, 1:, 0] = graph_attn_bias[:, :, 1:, 0] + t
graph_attn_bias[:, :, 0, :] = graph_attn_bias[:, :, 0, :] + t
# edge feature
if self.edge_type == "multi_hop":
spatial_pos_ = spatial_pos.clone()
spatial_pos_[spatial_pos_ == 0] = 1 # set pad to 1
# set 1 to 1, spatial_pos > 1 to spatial_pos - 1
spatial_pos_ = torch.where(spatial_pos_ > 1, spatial_pos_ - 1, spatial_pos_)
if self.multi_hop_max_dist > 0:
spatial_pos_ = spatial_pos_.clamp(0, self.multi_hop_max_dist)
input_edges = input_edges[:, :, :, : self.multi_hop_max_dist, :]
# [n_graph, n_node, n_node, max_dist, n_head]
input_edges = self.edge_encoder(input_edges).mean(-2)
max_dist = input_edges.size(-2)
edge_input_flat = input_edges.permute(3, 0, 1, 2, 4).reshape(max_dist, -1, self.num_heads)
edge_input_flat = torch.bmm(
edge_input_flat,
self.edge_dis_encoder.weight.reshape(-1, self.num_heads, self.num_heads)[:max_dist, :, :],
)
input_edges = edge_input_flat.reshape(max_dist, n_graph, n_node, n_node, self.num_heads).permute(
1, 2, 3, 0, 4
)
input_edges = (input_edges.sum(-2) / (spatial_pos_.float().unsqueeze(-1))).permute(0, 3, 1, 2)
else:
# [n_graph, n_node, n_node, n_head] -> [n_graph, n_head, n_node, n_node]
input_edges = self.edge_encoder(attn_edge_type).mean(-2).permute(0, 3, 1, 2)
graph_attn_bias[:, :, 1:, 1:] = graph_attn_bias[:, :, 1:, 1:] + input_edges
graph_attn_bias = graph_attn_bias + attn_bias.unsqueeze(1) # reset
return graph_attn_bias
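# Shape note (following the forward pass above): the returned bias has shape
# (n_graph, n_head, n_node + 1, n_node + 1) and is later added to the raw attention scores in
# GraphormerMultiheadAttention; row and column 0 correspond to the virtual graph token, whose
# distance to every real node is encoded by `graph_token_virtual_distance`.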
class GraphormerMultiheadAttention(nn.Module):
"""Multi-headed attention.
See "Attention Is All You Need" for more details.
"""
def __init__(self, config):
super().__init__()
self.embedding_dim = config.embedding_dim
self.kdim = config.kdim if config.kdim is not None else config.embedding_dim
self.vdim = config.vdim if config.vdim is not None else config.embedding_dim
self.qkv_same_dim = self.kdim == config.embedding_dim and self.vdim == config.embedding_dim
self.num_heads = config.num_attention_heads
self.dropout_module = torch.nn.Dropout(p=config.dropout, inplace=False)
self.head_dim = config.embedding_dim // config.num_attention_heads
if not (self.head_dim * config.num_attention_heads == self.embedding_dim):
raise AssertionError("The embedding_dim must be divisible by num_heads.")
self.scaling = self.head_dim**-0.5
self.self_attention = True # config.self_attention
if not (self.self_attention):
raise NotImplementedError("The Graphormer model only supports self attention for now.")
if self.self_attention and not self.qkv_same_dim:
raise AssertionError("Self-attention requires query, key and value to be of the same size.")
self.k_proj = quant_noise(
nn.Linear(self.kdim, config.embedding_dim, bias=config.bias),
config.q_noise,
config.qn_block_size,
)
self.v_proj = quant_noise(
nn.Linear(self.vdim, config.embedding_dim, bias=config.bias),
config.q_noise,
config.qn_block_size,
)
self.q_proj = quant_noise(
nn.Linear(config.embedding_dim, config.embedding_dim, bias=config.bias),
config.q_noise,
config.qn_block_size,
)
self.out_proj = quant_noise(
nn.Linear(config.embedding_dim, config.embedding_dim, bias=config.bias),
config.q_noise,
config.qn_block_size,
)
self.onnx_trace = False
def reset_parameters(self):
if self.qkv_same_dim:
# Empirically observed the convergence to be much better with
# the scaled initialization
nn.init.xavier_uniform_(self.k_proj.weight, gain=1 / math.sqrt(2))
nn.init.xavier_uniform_(self.v_proj.weight, gain=1 / math.sqrt(2))
nn.init.xavier_uniform_(self.q_proj.weight, gain=1 / math.sqrt(2))
else:
nn.init.xavier_uniform_(self.k_proj.weight)
nn.init.xavier_uniform_(self.v_proj.weight)
nn.init.xavier_uniform_(self.q_proj.weight)
nn.init.xavier_uniform_(self.out_proj.weight)
if self.out_proj.bias is not None:
nn.init.constant_(self.out_proj.bias, 0.0)
def forward(
self,
query,
key: Optional[torch.Tensor],
value: Optional[torch.Tensor],
attn_bias: Optional[torch.Tensor],
key_padding_mask: Optional[torch.Tensor] = None,
need_weights: bool = True,
attn_mask: Optional[torch.Tensor] = None,
before_softmax: bool = False,
need_head_weights: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
"""
Args:
key_padding_mask (torch.ByteTensor, optional): mask to exclude
keys that are pads, of shape `(batch, src_len)`, where padding elements are indicated by 1s.
need_weights (bool, optional): return the attention weights,
averaged over heads (default: True).
attn_mask (torch.ByteTensor, optional): typically used to
implement causal attention, where the mask prevents the attention from looking forward in time
(default: None).
before_softmax (bool, optional): return the raw attention
weights and values before the attention softmax.
need_head_weights (bool, optional): return the attention
weights for each head. Implies *need_weights*. Default: return the average attention weights over all
heads.
"""
if need_head_weights:
need_weights = True
tgt_len, bsz, embedding_dim = query.size()
src_len = tgt_len
if not (embedding_dim == self.embedding_dim):
raise AssertionError(
f"The query embedding dimension {embedding_dim} is not equal to the expected embedding_dim"
f" {self.embedding_dim}."
)
if not (list(query.size()) == [tgt_len, bsz, embedding_dim]):
raise AssertionError("Query size incorrect in Graphormer, compared to model dimensions.")
if key is not None:
src_len, key_bsz, _ = key.size()
if not torch.jit.is_scripting():
if (key_bsz != bsz) or (value is None) or (value.shape[:2] != (src_len, bsz)):
raise AssertionError(
"The batch shape does not match the key or value shapes provided to the attention."
)
q = self.q_proj(query)
k = self.k_proj(query)
v = self.v_proj(query)
q *= self.scaling
q = q.contiguous().view(tgt_len, bsz * self.num_heads, self.head_dim).transpose(0, 1)
if k is not None:
k = k.contiguous().view(-1, bsz * self.num_heads, self.head_dim).transpose(0, 1)
if v is not None:
v = v.contiguous().view(-1, bsz * self.num_heads, self.head_dim).transpose(0, 1)
if (k is None) or not (k.size(1) == src_len):
raise AssertionError("The shape of the key generated in the attention is incorrect")
# This is part of a workaround to get around fork/join parallelism
# not supporting Optional types.
if key_padding_mask is not None and key_padding_mask.dim() == 0:
key_padding_mask = None
if key_padding_mask is not None:
if key_padding_mask.size(0) != bsz or key_padding_mask.size(1) != src_len:
raise AssertionError(
"The shape of the generated padding mask for the key does not match expected dimensions."
)
attn_weights = torch.bmm(q, k.transpose(1, 2))
attn_weights = self.apply_sparse_mask(attn_weights, tgt_len, src_len, bsz)
if list(attn_weights.size()) != [bsz * self.num_heads, tgt_len, src_len]:
raise AssertionError("The attention weights generated do not match the expected dimensions.")
if attn_bias is not None:
attn_weights += attn_bias.view(bsz * self.num_heads, tgt_len, src_len)
if attn_mask is not None:
attn_mask = attn_mask.unsqueeze(0)
attn_weights += attn_mask
if key_padding_mask is not None:
# don't attend to padding symbols
attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
attn_weights = attn_weights.masked_fill(
key_padding_mask.unsqueeze(1).unsqueeze(2).to(torch.bool), float("-inf")
)
attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
if before_softmax:
return attn_weights, v
attn_weights_float = torch.nn.functional.softmax(attn_weights, dim=-1)
attn_weights = attn_weights_float.type_as(attn_weights)
attn_probs = self.dropout_module(attn_weights)
if v is None:
raise AssertionError("No value generated")
attn = torch.bmm(attn_probs, v)
if list(attn.size()) != [bsz * self.num_heads, tgt_len, self.head_dim]:
raise AssertionError("The attention generated do not match the expected dimensions.")
attn = attn.transpose(0, 1).contiguous().view(tgt_len, bsz, embedding_dim)
attn = self.out_proj(attn)
attn_weights = None
if need_weights:
attn_weights = attn_weights_float.contiguous().view(bsz, self.num_heads, tgt_len, src_len).transpose(1, 0)
if not need_head_weights:
# average attention weights over heads
attn_weights = attn_weights.mean(dim=0)
return attn, attn_weights
def apply_sparse_mask(self, attn_weights, tgt_len: int, src_len: int, bsz: int):
return attn_weights
class GraphormerGraphEncoderLayer(nn.Module):
def __init__(self, config) -> None:
super().__init__()
# Initialize parameters
self.embedding_dim = config.embedding_dim
self.num_attention_heads = config.num_attention_heads
self.attention_dropout = config.attention_dropout
self.q_noise = config.q_noise
self.qn_block_size = config.qn_block_size
self.pre_layernorm = config.pre_layernorm
self.dropout_module = torch.nn.Dropout(p=config.dropout, inplace=False)
self.activation_dropout_module = torch.nn.Dropout(p=config.dropout, inplace=False)
# Initialize blocks
self.activation_fn = ACT2FN[config.activation_fn]
self.self_attn = GraphormerMultiheadAttention(config)
# layer norm associated with the self attention layer
self.self_attn_layer_norm = nn.LayerNorm(self.embedding_dim)
self.fc1 = self.build_fc(
self.embedding_dim,
config.ffn_embedding_dim,
q_noise=config.q_noise,
qn_block_size=config.qn_block_size,
)
self.fc2 = self.build_fc(
config.ffn_embedding_dim,
self.embedding_dim,
q_noise=config.q_noise,
qn_block_size=config.qn_block_size,
)
# layer norm associated with the position wise feed-forward NN
self.final_layer_norm = nn.LayerNorm(self.embedding_dim)
def build_fc(self, input_dim, output_dim, q_noise, qn_block_size):
return quant_noise(nn.Linear(input_dim, output_dim), q_noise, qn_block_size)
def forward(
self,
input_nodes: torch.Tensor,
self_attn_bias: Optional[torch.Tensor] = None,
self_attn_mask: Optional[torch.Tensor] = None,
self_attn_padding_mask: Optional[torch.Tensor] = None,
):
"""
nn.LayerNorm is applied either before or after the self-attention/ffn modules similar to the original
Transformer implementation.
"""
residual = input_nodes
if self.pre_layernorm:
input_nodes = self.self_attn_layer_norm(input_nodes)
input_nodes, attn = self.self_attn(
query=input_nodes,
key=input_nodes,
value=input_nodes,
attn_bias=self_attn_bias,
key_padding_mask=self_attn_padding_mask,
need_weights=False,
attn_mask=self_attn_mask,
)
input_nodes = self.dropout_module(input_nodes)
input_nodes = residual + input_nodes
if not self.pre_layernorm:
input_nodes = self.self_attn_layer_norm(input_nodes)
residual = input_nodes
if self.pre_layernorm:
input_nodes = self.final_layer_norm(input_nodes)
input_nodes = self.activation_fn(self.fc1(input_nodes))
input_nodes = self.activation_dropout_module(input_nodes)
input_nodes = self.fc2(input_nodes)
input_nodes = self.dropout_module(input_nodes)
input_nodes = residual + input_nodes
if not self.pre_layernorm:
input_nodes = self.final_layer_norm(input_nodes)
return input_nodes, attn
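# Note on the two layer-norm placements handled above (summary of the code, not an additional
# feature): with `config.pre_layernorm=True` the norms run before the attention and FFN blocks
# (Pre-LN), otherwise they run after each residual addition (Post-LN).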
class GraphormerGraphEncoder(nn.Module):
def __init__(self, config):
super().__init__()
self.dropout_module = torch.nn.Dropout(p=config.dropout, inplace=False)
self.layerdrop = config.layerdrop
self.embedding_dim = config.embedding_dim
self.apply_graphormer_init = config.apply_graphormer_init
self.traceable = config.traceable
self.graph_node_feature = GraphormerGraphNodeFeature(config)
self.graph_attn_bias = GraphormerGraphAttnBias(config)
self.embed_scale = config.embed_scale
if config.q_noise > 0:
self.quant_noise = quant_noise(
nn.Linear(self.embedding_dim, self.embedding_dim, bias=False),
config.q_noise,
config.qn_block_size,
)
else:
self.quant_noise = None
if config.encoder_normalize_before:
self.emb_layer_norm = nn.LayerNorm(self.embedding_dim)
else:
self.emb_layer_norm = None
if config.pre_layernorm:
self.final_layer_norm = nn.LayerNorm(self.embedding_dim)
if self.layerdrop > 0.0:
self.layers = LayerDropModuleList(p=self.layerdrop)
else:
self.layers = nn.ModuleList([])
self.layers.extend([GraphormerGraphEncoderLayer(config) for _ in range(config.num_hidden_layers)])
# Apply initialization of model params after building the model
if config.freeze_embeddings:
raise NotImplementedError("Freezing embeddings is not implemented yet.")
for layer in range(config.num_trans_layers_to_freeze):
m = self.layers[layer]
if m is not None:
for p in m.parameters():
p.requires_grad = False
def forward(
self,
input_nodes,
input_edges,
attn_bias,
in_degree,
out_degree,
spatial_pos,
attn_edge_type,
perturb=None,
last_state_only: bool = False,
token_embeddings: Optional[torch.Tensor] = None,
attn_mask: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
# compute padding mask. This is needed for multi-head attention
data_x = input_nodes
n_graph, n_node = data_x.size()[:2]
padding_mask = (data_x[:, :, 0]).eq(0)
padding_mask_cls = torch.zeros(n_graph, 1, device=padding_mask.device, dtype=padding_mask.dtype)
padding_mask = torch.cat((padding_mask_cls, padding_mask), dim=1)
attn_bias = self.graph_attn_bias(input_nodes, attn_bias, spatial_pos, input_edges, attn_edge_type)
if token_embeddings is not None:
input_nodes = token_embeddings
else:
input_nodes = self.graph_node_feature(input_nodes, in_degree, out_degree)
if perturb is not None:
input_nodes[:, 1:, :] += perturb
if self.embed_scale is not None:
input_nodes = input_nodes * self.embed_scale
if self.quant_noise is not None:
input_nodes = self.quant_noise(input_nodes)
if self.emb_layer_norm is not None:
input_nodes = self.emb_layer_norm(input_nodes)
input_nodes = self.dropout_module(input_nodes)
input_nodes = input_nodes.transpose(0, 1)
inner_states = []
if not last_state_only:
inner_states.append(input_nodes)
for layer in self.layers:
input_nodes, _ = layer(
input_nodes,
self_attn_padding_mask=padding_mask,
self_attn_mask=attn_mask,
self_attn_bias=attn_bias,
)
if not last_state_only:
inner_states.append(input_nodes)
graph_rep = input_nodes[0, :, :]
if last_state_only:
inner_states = [input_nodes]
if self.traceable:
return torch.stack(inner_states), graph_rep
else:
return inner_states, graph_rep
class GraphormerDecoderHead(nn.Module):
def __init__(self, embedding_dim, num_classes):
"""num_classes should be 1 for regression, or the number of classes for classification"""
super().__init__()
self.lm_output_learned_bias = nn.Parameter(torch.zeros(1))
self.classifier = nn.Linear(embedding_dim, num_classes, bias=False)
self.num_classes = num_classes
def forward(self, input_nodes, **unused):
input_nodes = self.classifier(input_nodes)
input_nodes = input_nodes + self.lm_output_learned_bias
return input_nodes
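# Shape note (following the forward pass above): the head maps hidden states of shape
# (n_graph, n_node + 1, embedding_dim) to (n_graph, n_node + 1, num_classes) and adds a learned
# scalar bias; GraphormerForGraphClassification below keeps only position 0 (the virtual graph
# token) as the graph-level prediction.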
class GraphormerPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = GraphormerConfig
base_model_prefix = "graphormer"
supports_gradient_checkpointing = True
_keys_to_ignore_on_load_missing = [r"position_ids"]
main_input_name_nodes = "input_nodes"
main_input_name_edges = "input_edges"
def normal_(self, data):
# with FSDP, module params will be on CUDA, so we cast them back to CPU
# so that the RNG is consistent with and without FSDP
data.copy_(data.cpu().normal_(mean=0.0, std=0.02).to(data.device))
def init_graphormer_params(self, module):
"""
Initialize the weights specific to the Graphormer Model.
"""
if isinstance(module, nn.Linear):
self.normal_(module.weight.data)
if module.bias is not None:
module.bias.data.zero_()
if isinstance(module, nn.Embedding):
self.normal_(module.weight.data)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
if isinstance(module, GraphormerMultiheadAttention):
self.normal_(module.q_proj.weight.data)
self.normal_(module.k_proj.weight.data)
self.normal_(module.v_proj.weight.data)
def _init_weights(self, module):
"""
Initialize the weights
"""
if isinstance(module, (nn.Linear, nn.Conv2d)):
# We might be missing part of the Linear init, dependent on the layer num
module.weight.data.normal_(mean=0.0, std=0.02)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=0.02)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, GraphormerMultiheadAttention):
module.q_proj.weight.data.normal_(mean=0.0, std=0.02)
module.k_proj.weight.data.normal_(mean=0.0, std=0.02)
module.v_proj.weight.data.normal_(mean=0.0, std=0.02)
module.reset_parameters()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
elif isinstance(module, GraphormerGraphEncoder):
if module.apply_graphormer_init:
module.apply(self.init_graphormer_params)
def _set_gradient_checkpointing(self, module, value=False):
if isinstance(module, GraphormerModel):
module.gradient_checkpointing = value
class GraphormerModel(GraphormerPreTrainedModel):
"""The Graphormer model is a graph-encoder model.
It goes from a graph to its representation. If you want to use the model for a downstream classification task, use
GraphormerForGraphClassification instead. For any other downstream task, feel free to add a new class, or combine
this model with a downstream model of your choice, following the example in GraphormerForGraphClassification.
"""
def __init__(self, config):
super().__init__(config)
self.max_nodes = config.max_nodes
self.graph_encoder = GraphormerGraphEncoder(config)
self.share_input_output_embed = config.share_input_output_embed
self.lm_output_learned_bias = None
# remove_head is set to True during fine-tuning
self.load_softmax = not getattr(config, "remove_head", False)
self.lm_head_transform_weight = nn.Linear(config.embedding_dim, config.embedding_dim)
self.activation_fn = ACT2FN[config.activation_fn]
self.layer_norm = nn.LayerNorm(config.embedding_dim)
self.post_init()
def reset_output_layer_parameters(self):
self.lm_output_learned_bias = nn.Parameter(torch.zeros(1))
def forward(
self,
input_nodes,
input_edges,
attn_bias,
in_degree,
out_degree,
spatial_pos,
attn_edge_type,
perturb=None,
masked_tokens=None,
return_dict: Optional[bool] = True,
**unused
):
inner_states, graph_rep = self.graph_encoder(
input_nodes, input_edges, attn_bias, in_degree, out_degree, spatial_pos, attn_edge_type, perturb=perturb
)
# take the last inner state, then swap back the batch and graph-length dimensions
input_nodes = inner_states[-1].transpose(0, 1)
# project masked tokens only
if masked_tokens is not None:
raise NotImplementedError
input_nodes = self.layer_norm(self.activation_fn(self.lm_head_transform_weight(input_nodes)))
# project back to size of vocabulary
if self.share_input_output_embed and hasattr(self.graph_encoder.embed_tokens, "weight"):
input_nodes = torch.nn.functional.linear(input_nodes, self.graph_encoder.embed_tokens.weight)
if not return_dict:
return (input_nodes, inner_states)
return BaseModelOutputWithNoAttention(last_hidden_state=input_nodes, hidden_states=inner_states)
def max_nodes(self):
"""Maximum output length supported by the encoder."""
return self.max_nodes
class GraphormerForGraphClassification(GraphormerPreTrainedModel):
"""
This model can be used for graph-level classification or regression tasks.
It can be trained on
- regression (by setting config.num_classes to 1); there should be one float-type label per graph
- one task classification (by setting config.num_classes to the number of classes); there should be one integer
label per graph
- binary multi-task classification (by setting config.num_classes to the number of labels); there should be a list
of integer labels for each graph.
"""
def __init__(self, config):
super().__init__(config)
self.encoder = GraphormerModel(config)
self.embedding_dim = config.embedding_dim
self.num_classes = config.num_classes
self.classifier = GraphormerDecoderHead(self.embedding_dim, self.num_classes)
self.is_encoder_decoder = True
# Initialize weights and apply final processing
self.post_init()
def forward(
self,
input_nodes,
input_edges,
attn_bias,
in_degree,
out_degree,
spatial_pos,
attn_edge_type,
labels: Optional[torch.LongTensor] = None,
return_dict: Optional[bool] = True,
**unused,
) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
encoder_outputs = self.encoder(
input_nodes,
input_edges,
attn_bias,
in_degree,
out_degree,
spatial_pos,
attn_edge_type,
)
outputs, hidden_states = encoder_outputs["last_hidden_state"], encoder_outputs["hidden_states"]
head_outputs = self.classifier(outputs)
logits = head_outputs[:, 0, :].contiguous()
if labels is not None:
mask = ~torch.isnan(labels)
if self.num_classes == 1: # regression
loss_fct = MSELoss()
loss = loss_fct(logits[mask].squeeze(), labels[mask].squeeze().float())
elif self.num_classes > 1 and len(labels.shape) == 1: # One task classification
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits[mask].view(-1, self.num_classes), labels[mask].view(-1))
else: # Binary multi-task classification
loss_fct = BCEWithLogitsLoss(reduction="sum")
loss = loss_fct(logits[mask], labels[mask])
if not return_dict:
return (loss, logits, hidden_states)
return SequenceClassifierOutput(loss=loss, logits=logits, hidden_states=hidden_states, attentions=None)
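# A minimal end-to-end sketch with random, suitably shaped inputs (illustrative only; the sizes
# below are arbitrary and real inputs come from the Graphormer preprocessing and collation code):
if __name__ == "__main__":
    config = GraphormerConfig(
        num_classes=2, num_hidden_layers=2, embedding_dim=64, ffn_embedding_dim=64, num_attention_heads=4
    )
    model = GraphormerForGraphClassification(config)
    n_graph, n_node, max_dist, node_feat, edge_feat = 2, 16, 5, 9, 3
    batch = {
        "input_nodes": torch.randint(1, 64, (n_graph, n_node, node_feat)),
        "input_edges": torch.randint(1, 64, (n_graph, n_node, n_node, max_dist, edge_feat)),
        "attn_bias": torch.zeros(n_graph, n_node + 1, n_node + 1),
        "in_degree": torch.randint(1, 64, (n_graph, n_node)),
        "out_degree": torch.randint(1, 64, (n_graph, n_node)),
        "spatial_pos": torch.randint(1, 64, (n_graph, n_node, n_node)),
        "attn_edge_type": torch.randint(1, 64, (n_graph, n_node, n_node, edge_feat)),
        "labels": torch.randint(0, 2, (n_graph,)),
    }
    outputs = model(**batch)
    print(outputs.loss, outputs.logits.shape)  # logits: (n_graph, num_classes)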
...@@ -51,6 +51,7 @@ from .utils import (
is_apex_available,
is_bitsandbytes_available,
is_bs4_available,
is_cython_available,
is_decord_available,
is_detectron2_available,
is_faiss_available,
...@@ -711,6 +712,13 @@ def require_jumanpp(test_case):
return unittest.skipUnless(is_jumanpp_available(), "test requires jumanpp")(test_case)
def require_cython(test_case):
"""
Decorator marking a test that requires Cython
"""
return unittest.skipUnless(is_cython_available(), "test requires cython")(test_case)
def get_gpu_count():
"""
Return the number of available gpus (regardless of whether torch, tf or jax is used)
......