Decision transformer gym (#15845)

* Created the Decision Transformer Model
* updating tests, copy to other machine
* Added last hidden size to Decision Transformer modelling outputs
* Removed copy of original DT file
* made a temporary change to gpt2 to have it conform with the Decision Transformer version
* Updated tests
* Ignoring a file used to test the DT model
* added comments to config file
* added comments and argument descriptions to decision transformer file
* Updated doc
* Ran "make style"
* Remove old model imports
* Removed unused imports, cleaned up init file
* Update docs/source/model_doc/decision_transformer.mdx
  added my username
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Reverted changes made to gpt2
* Removed datasets submodule
* Update the modeling outputs to include gpt2 attentions, hidden states and last hidden states
* Added support for return of hidden states, attentions and return dict of gpt2 model.
* Updated tests to include many of the ModelTesterMixin tests. The following tests are skipped: test_generate_without_input_ids, test_pruning, test_resize_embeddings, test_head_masking, test_attention_outputs, test_hidden_states_output, test_inputs_embeds, test_model_common_attributes
* Added missing line to the end of gpt2 file
* Added an integration test for the Decision Transformer
  Test performs an autoregressive evaluation for two time steps
* Set done and info to _ to fix failing test
* Updated integration test to be deterministic and check expected outputs
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Removed unnecessary config options
* Cleaned up commented code and old comments.
* Cleaned up commented code.
* Changed DecisionTransformer to Decision Transformer
* Added Decision Transformer to the main README file
* Added copy of GPT2 called DecisionTransformerGPT2Model
* isorted imports
* isorted imports
* Added model to non-English README files
* Ran make fix-copies and corrected some cases.
* Updated index file to include Decision Transformer
* Added gpt2 model as copy inside the Decision Transformer model file
* Added the unit test file to the list of TEST_FILES_WITH_NO_COMMON_TESTS
* Deleted redundant checkpoint files (I don't know how these got committed)
* Removed testing files. (These should have never been committed)
* Removed accidentally committed files
* Moved the Decision Transformer test to its own directory
* Add type hints for Pegasus (#16324)
* Funnel type hints (#16323)
* add pt funnel type hints
* add tf funnel type hints
* Add type hints for ProphetNet PyTorch (#16272)
* [GLPN] Improve docs (#16331)
* Add link to notebook
* Add link
* Fix bug
  Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
* Added type hints for Pytorch Marian calls (#16200)
* Added type hinting for forward functions in pytorch marian
* typo correction
* Removed type hints on functions from BART per Suraj Patil request
* fix import pb
* fix typo
* corrected tuple call
* ran black
* after fix-copies
  Some optional tags on primitives were removed, past_key_values in MarianForCausalLM changed from Tuple of Tuple to List
* Fixing copies to roformer and pegasus
  Co-authored-by: Clementine Fourrier <cfourrie@inria.fr>
  Co-authored-by: matt <rocketknight1@gmail.com>
* Moved DecisionTransformOutput to modeling_decision_transformer
* Moved the example usage to research project and cleaned comments
* Made tests ignore the copy of gpt2 in Decision Transformer
* Added module output to modelling decision transformer
* removed copied gpt2 model from list of transformers models
* Updated tests and created __init__ file for new test location
* Update README.md
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/models/decision_transformer/configuration_decision_transformer.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Removed unneeded summary type from config file
* Fixed copies
* Updated pretrained config map to refer to hopper-medium checkpoint
* done (#16340)
* Added Decision transformer to model docs
* Update src/transformers/models/decision_transformer/modeling_decision_transformer.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/models/decision_transformer/modeling_decision_transformer.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/models/decision_transformer/configuration_decision_transformer.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Add type annotations for Rembert/Splinter and copies (#16338)
* undo black autoformat
* minor fix to rembert forward with default
* make fix-copies, make quality
* Adding types to template model
* Removing List from the template types
* Remove `Optional` from a couple of types that don't accept `None`
  Co-authored-by: matt <rocketknight1@gmail.com>
* [Bug template] Shift responsibilities for long-range (#16344)
* Fix code repetition in serialization guide (#16346)
* Adopt framework-specific blocks for content (#16342)
* ✨ refactor code samples with framework-specific blocks
* ✨ update training.mdx
* 🖍 apply feedback
* Updates the default branch from master to main (#16326)
* Updates the default branch from master to main
* Links from `master` to `main`
* Typo
* Update examples/flax/README.md
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Updated model with custom docstring example
* Created the Decision Transformer Model
* updating tests, copy to other machine
* Added last hidden size to Decision Transformer modelling outputs
* Removed copy of original DT file
* made a temporary change to gpt2 to have it conform with the Decision Transformer version
* Updated tests
* Ignoring a file used to test the DT model
* added comments to config file
* added comments and argument descriptions to decision transformer file
* Updated doc
* Ran "make style"
* Remove old model imports
* Removed unused imports, cleaned up init file
* Update docs/source/model_doc/decision_transformer.mdx
  added my username
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Reverted changes made to gpt2
* Removed datasets submodule
* Update the modeling outputs to include gpt2 attentions, hidden states and last hidden states
* Added support for return of hidden states, attentions and return dict of gpt2 model.
* Updated tests to include many of the ModelTesterMixin tests. The following tests are skipped: test_generate_without_input_ids, test_pruning, test_resize_embeddings, test_head_masking, test_attention_outputs, test_hidden_states_output, test_inputs_embeds, test_model_common_attributes
* Added missing line to the end of gpt2 file
* Added an integration test for the Decision Transformer
  Test performs an autoregressive evaluation for two time steps
* Set done and info to _ to fix failing test
* Updated integration test to be deterministic and check expected outputs
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Removed unnecessary config options
* Cleaned up commented code and old comments.
* Cleaned up commented code.
* Changed DecisionTransformer to Decision Transformer
* Added Decision Transformer to the main README file
* Added copy of GPT2 called DecisionTransformerGPT2Model
* isorted imports
* isorted imports
* Added model to non-English README files
* Ran make fix-copies and corrected some cases.
* Updated index file to include Decision Transformer
* Added gpt2 model as copy inside the Decision Transformer model file
* Added the unit test file to the list of TEST_FILES_WITH_NO_COMMON_TESTS
* Deleted redundant checkpoint files (I don't know how these got committed)
* Removed testing files. (These should have never been committed)
* Removed accidentally committed files
* Moved the Decision Transformer test to its own directory
* Moved DecisionTransformOutput to modeling_decision_transformer
* Moved the example usage to research project and cleaned comments
* Made tests ignore the copy of gpt2 in Decision Transformer
* Added module output to modelling decision transformer
* removed copied gpt2 model from list of transformers models
* Updated tests and created __init__ file for new test location
* Update README.md
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/models/decision_transformer/configuration_decision_transformer.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Removed unneeded summary type from config file
* Fixed copies
* Updated pretrained config map to refer to hopper-medium checkpoint
* Added Decision transformer to model docs
* Update src/transformers/models/decision_transformer/modeling_decision_transformer.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/models/decision_transformer/modeling_decision_transformer.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/models/decision_transformer/configuration_decision_transformer.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Updated model with custom docstring example
* Updated copies, config auto, and readme files.
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Dan Tegzes <48134725+Tegzes@users.noreply.github.com>
Co-authored-by: Adam Montgomerie <adam@avanssion.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Clementine Fourrier <cfourrie@inria.fr>
Co-authored-by: matt <rocketknight1@gmail.com>
Co-authored-by: Francesco Saverio Zuppichini <francesco.zuppichini@gmail.com>
Co-authored-by: Jacob Dineen <54680234+jacobdineen@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>

Unverified Commit aff9bc40 authored Mar 23, 2022 by Edward Beeching Committed by GitHub Mar 23, 2022
20 changed files
--- a/README.md
+++ b/README.md
@@ -252,7 +252,8 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[Data2Vec](https://huggingface.co/docs/transformers/main/model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec:  A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli.
 1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
-1. **[DiT](https://huggingface.co/docs/transformers/main/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
+1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch.
+1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
 1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.

--- a/README_ko.md
+++ b/README_ko.md
@@ -233,11 +233,12 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[Data2Vec](https://huggingface.co/docs/transformers/main/model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec:  A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli.
 1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
+1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch.
 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
 1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/distillation) and a German version of DistilBERT.
-1. **[DiT](https://huggingface.co/docs/transformers/main/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
+1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
 1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -257,11 +257,12 @@ conda install -c huggingface transformers
 1. **[Data2Vec](https://huggingface.co/docs/transformers/main/model_doc/data2vec)** (来自 Facebook) 伴随论文 [Data2Vec:  A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) 由 Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli 发布。
 1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (来自 Microsoft) 伴随论文 [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) 由 Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen 发布。
 1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (来自 Microsoft) 伴随论文 [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) 由 Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen 发布。
+1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (来自 Berkeley/Facebook/Google) 伴随论文 [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) 由 Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch 发布。
 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (来自 Facebook) 伴随论文 [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) 由 Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou 发布。
 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (来自 Facebook) 伴随论文 [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) 由 Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko 发布。
 1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (来自 Microsoft Research) 伴随论文 [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) 由 Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan 发布。
 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (来自 HuggingFace), 伴随论文 [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) 由 Victor Sanh, Lysandre Debut and Thomas Wolf 发布。 同样的方法也应用于压缩 GPT-2 到 [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/distillation), RoBERTa 到 [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/distillation), Multilingual BERT 到 [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/distillation) 和德语版 DistilBERT。
-1. **[DiT](https://huggingface.co/docs/transformers/main/model_doc/dit)** (来自 Microsoft Research) 伴随论文 [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) 由 Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei 发布。
+1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (来自 Microsoft Research) 伴随论文 [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) 由 Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei 发布。
 1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (来自 Facebook) 伴随论文 [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) 由 Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih 发布。
 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (来自 Google Research/Stanford University) 伴随论文 [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) 由 Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning 发布。
 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (来自 Google Research) 伴随论文 [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) 由 Sascha Rothe, Shashi Narayan, Aliaksei Severyn 发布。

--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -269,11 +269,12 @@ conda install -c huggingface transformers
 1. **[Data2Vec](https://huggingface.co/docs/transformers/main/model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec:  A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli.
 1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
+1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch.
 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
 1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/distillation) and a German version of DistilBERT.
-1. **[DiT](https://huggingface.co/docs/transformers/main/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
+1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
 1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -194,6 +194,8 @@
      title: DeBERTa
    - local: model_doc/deberta-v2
      title: DeBERTa-v2
+    - local: model_doc/decision_transformer
+      title: Decision Transformer
    - local: model_doc/deit
      title: DeiT
    - local: model_doc/detr

--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -78,6 +78,7 @@ conversion utilities for the following models.
 1. **[Data2Vec](model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec:  A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli.
 1. **[DeBERTa](model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
 1. **[DeBERTa-v2](model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
+1. **[Decision Transformer](model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch.
 1. **[DiT](model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei.
 1. **[DeiT](model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
 1. **[DETR](model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
@@ -191,6 +192,7 @@ Flax), PyTorch, and/or TensorFlow.
 |        Data2VecText         |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |           DeBERTa           |       ✅       |       ✅       |       ✅        |         ✅         |      ❌      |
 |         DeBERTa-v2          |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
+|    Decision Transformer     |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |            DeiT             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |            DETR             |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |         DistilBERT          |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |

--- a/docs/source/model_doc/decision_transformer.mdx
+++ b/docs/source/model_doc/decision_transformer.mdx
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Decision Transformer
+
+## Overview
+
+The Decision Transformer model was proposed in [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345)  
+by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch.
+
+The abstract from the paper is the following:
+
+*We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. 
+This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances
+ in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that 
+ casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or 
+ compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked 
+ Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our 
+ Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, 
+ Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on 
+ Atari, OpenAI Gym, and Key-to-Door tasks.*
+
+Tips:
+
+This version of the model is for tasks where the state is a vector; image-based states will come soon.
+
+This model was contributed by [edbeeching](https://huggingface.co/edbeeching). The original code can be found [here](https://github.com/kzl/decision-transformer).
+
+## DecisionTransformerConfig
+
+[[autodoc]] DecisionTransformerConfig
+
+
+## DecisionTransformerGPT2Model
+
+[[autodoc]] DecisionTransformerGPT2Model
+    - forward
+
+## DecisionTransformerModel
+
+[[autodoc]] DecisionTransformerModel
+    - forward
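
The abstract above describes conditioning the model on a desired return. In the paper this takes the form of a "return-to-go" at each timestep: the sum of rewards from that step to the end of the episode. The following is a minimal sketch of that quantity, not code from this PR; the function name `returns_to_go` and its `scale` argument are illustrative assumptions (the example script in this PR divides returns by a similar constant).

```python
from typing import List


def returns_to_go(rewards: List[float], scale: float = 1.0) -> List[float]:
    """Return the scaled sum of current and future rewards at every timestep."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry is its own reward plus everything after it.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running / scale
    return rtg


rewards = [1.0, 2.0, 3.0]
rtg = returns_to_go(rewards)
print(rtg)  # [6.0, 5.0, 3.0]

# The online update used at evaluation time is the same recurrence run
# forwards: the next target is the current target minus the reward received.
target = rtg[0]
for t, reward in enumerate(rewards[:-1]):
    target -= reward
    assert target == rtg[t + 1]
```

At evaluation time the true future rewards are unknown, so the first target is chosen by the user and then decremented step by step, as the forward loop above illustrates.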
--- a/examples/research_projects/decision_transformer/requirements.txt
+++ b/examples/research_projects/decision_transformer/requirements.txt
+absl-py==1.0.0
+aiohttp==3.8.1
+aiosignal==1.2.0
+alembic==1.7.7
+appdirs==1.4.4
+APScheduler==3.9.1
+arrow==1.2.2
+asttokens==2.0.5
+astunparse==1.6.3
+async-timeout==4.0.2
+attrs==21.4.0
+audioread==2.1.9
+autopage==0.5.0
+backcall==0.2.0
+backoff==1.11.1
+backports.zoneinfo==0.2.1
+binaryornot==0.4.4
+black==22.1.0
+boto3==1.16.34
+botocore==1.19.63
+Brotli==1.0.9
+cachetools==5.0.0
+certifi==2021.10.8
+cffi==1.15.0
+chardet==4.0.0
+charset-normalizer==2.0.12
+chex==0.1.1
+click==8.0.4
+cliff==3.10.1
+clldutils==3.11.1
+cloudpickle==2.0.0
+cmaes==0.8.2
+cmd2==2.4.0
+codecarbon==1.2.0
+colorlog==6.6.0
+cookiecutter==1.7.2
+cryptography==36.0.2
+csvw==2.0.0
+cycler==0.11.0
+Cython==0.29.28
+dash==2.3.0
+dash-bootstrap-components==1.0.3
+dash-core-components==2.0.0
+dash-html-components==2.0.0
+dash-table==5.0.0
+datasets==2.0.0
+decorator==5.1.1
+Deprecated==1.2.13
+dill==0.3.4
+dlinfo==1.2.1
+dm-tree==0.1.6
+docker==4.4.4
+execnet==1.9.0
+executing==0.8.3
+faiss-cpu==1.7.2
+fasteners==0.17.3
+filelock==3.6.0
+fire==0.4.0
+flake8==4.0.1
+Flask==2.0.3
+Flask-Compress==1.11
+flatbuffers==2.0
+flax==0.4.0
+fonttools==4.31.1
+frozenlist==1.3.0
+fsspec==2022.2.0
+fugashi==1.1.2
+gast==0.5.3
+gitdb==4.0.9
+GitPython==3.1.18
+glfw==2.5.1
+google-auth==2.6.2
+google-auth-oauthlib==0.4.6
+google-pasta==0.2.0
+greenlet==1.1.2
+grpcio==1.44.0
+gym==0.23.1
+gym-notices==0.0.6
+h5py==3.6.0
+huggingface-hub==0.4.0
+hypothesis==6.39.4
+idna==3.3
+imageio==2.16.1
+importlib-metadata==4.11.3
+importlib-resources==5.4.0
+iniconfig==1.1.1
+ipadic==1.0.0
+ipython==8.1.1
+isodate==0.6.1
+isort==5.10.1
+itsdangerous==2.1.1
+jax==0.3.4
+jaxlib==0.3.2
+jedi==0.18.1
+Jinja2==2.11.3
+jinja2-time==0.2.0
+jmespath==0.10.0
+joblib==1.1.0
+jsonschema==4.4.0
+keras==2.8.0
+Keras-Preprocessing==1.1.2
+kiwisolver==1.4.0
+kubernetes==12.0.1
+libclang==13.0.0
+librosa==0.9.1
+llvmlite==0.38.0
+Mako==1.2.0
+Markdown==3.3.6
+MarkupSafe==1.1.1
+matplotlib==3.5.1
+matplotlib-inline==0.1.3
+mccabe==0.6.1
+msgpack==1.0.3
+mujoco-py==2.1.2.14
+multidict==6.0.2
+multiprocess==0.70.12.2
+mypy-extensions==0.4.3
+nltk==3.7
+numba==0.55.1
+numpy==1.22.3
+oauthlib==3.2.0
+onnx==1.11.0
+onnxconverter-common==1.9.0
+opt-einsum==3.3.0
+optax==0.1.1
+optuna==2.10.0
+packaging==21.3
+pandas==1.4.1
+parameterized==0.8.1
+parso==0.8.3
+pathspec==0.9.0
+pbr==5.8.1
+pexpect==4.8.0
+phonemizer==3.0.1
+pickleshare==0.7.5
+Pillow==9.0.1
+Pint==0.16.1
+plac==1.3.4
+platformdirs==2.5.1
+plotly==5.6.0
+pluggy==1.0.0
+pooch==1.6.0
+portalocker==2.0.0
+poyo==0.5.0
+prettytable==3.2.0
+prompt-toolkit==3.0.28
+protobuf==3.19.4
+psutil==5.9.0
+ptyprocess==0.7.0
+pure-eval==0.2.2
+py==1.11.0
+py-cpuinfo==8.0.0
+pyarrow==7.0.0
+pyasn1==0.4.8
+pyasn1-modules==0.2.8
+pycodestyle==2.8.0
+pycparser==2.21
+pyctcdecode==0.3.0
+pyflakes==2.4.0
+Pygments==2.11.2
+pygtrie==2.4.2
+pynvml==11.4.1
+pyOpenSSL==22.0.0
+pyparsing==3.0.7
+pyperclip==1.8.2
+pypng==0.0.21
+pyrsistent==0.18.1
+pytest==7.1.1
+pytest-forked==1.4.0
+pytest-timeout==2.1.0
+pytest-xdist==2.5.0
+python-dateutil==2.8.2
+python-slugify==6.1.1
+pytz==2022.1
+pytz-deprecation-shim==0.1.0.post0
+PyYAML==6.0
+ray==1.11.0
+redis==4.1.4
+regex==2022.3.15
+requests==2.27.1
+requests-oauthlib==1.3.1
+resampy==0.2.2
+responses==0.18.0
+rfc3986==1.5.0
+rouge-score==0.0.4
+rsa==4.8
+s3transfer==0.3.7
+sacrebleu==1.5.1
+sacremoses==0.0.49
+scikit-learn==1.0.2
+scipy==1.8.0
+segments==2.2.0
+sentencepiece==0.1.96
+sigopt==8.2.0
+six==1.16.0
+smmap==5.0.0
+sortedcontainers==2.4.0
+SoundFile==0.10.3.post1
+SQLAlchemy==1.4.32
+stack-data==0.2.0
+stevedore==3.5.0
+tabulate==0.8.9
+tenacity==8.0.1
+tensorboard==2.8.0
+tensorboard-data-server==0.6.1
+tensorboard-plugin-wit==1.8.1
+tensorboardX==2.5
+tensorflow==2.8.0
+tensorflow-io-gcs-filesystem==0.24.0
+termcolor==1.1.0
+text-unidecode==1.3
+tf-estimator-nightly==2.8.0.dev2021122109
+tf2onnx==1.9.3
+threadpoolctl==3.1.0
+timeout-decorator==0.5.0
+timm==0.5.4
+tokenizers==0.11.6
+tomli==2.0.1
+toolz==0.11.2
+torch==1.11.0
+torchaudio==0.11.0
+torchvision==0.12.0
+tqdm==4.63.0
+traitlets==5.1.1
+-e git+git@github.com:edbeeching/transformers.git@77b90113ca0a0e4058b046796c874bdc98f1da61#egg=transformers
+typing-extensions==4.1.1
+tzdata==2022.1
+tzlocal==4.1
+unidic==1.1.0
+unidic-lite==1.0.8
+uritemplate==4.1.1
+urllib3==1.26.9
+wasabi==0.9.0
+wcwidth==0.2.5
+websocket-client==1.3.1
+Werkzeug==2.0.3
+wrapt==1.14.0
+xxhash==3.0.0
+yarl==1.7.2
+zipp==3.7.0
\ No newline at end of file
--- a/examples/research_projects/decision_transformer/run_decision_transformer.py
+++ b/examples/research_projects/decision_transformer/run_decision_transformer.py
+import numpy as np
+import torch
+
+import gym
+from mujoco_py import GlfwContext
+from transformers import DecisionTransformerModel
+
+
+GlfwContext(offscreen=True)  # Create a window to init GLFW.
+
+
+def get_action(model, states, actions, rewards, returns_to_go, timesteps):
+    # we don't care about the past rewards in this model
+
+    states = states.reshape(1, -1, model.config.state_dim)
+    actions = actions.reshape(1, -1, model.config.act_dim)
+    returns_to_go = returns_to_go.reshape(1, -1, 1)
+    timesteps = timesteps.reshape(1, -1)
+
+    if model.config.max_length is not None:
+        states = states[:, -model.config.max_length :]
+        actions = actions[:, -model.config.max_length :]
+        returns_to_go = returns_to_go[:, -model.config.max_length :]
+        timesteps = timesteps[:, -model.config.max_length :]
+
+        # pad all tokens to sequence length
+        attention_mask = torch.cat(
+            [torch.zeros(model.config.max_length - states.shape[1]), torch.ones(states.shape[1])]
+        )
+        attention_mask = attention_mask.to(dtype=torch.long, device=states.device).reshape(1, -1)
+        states = torch.cat(
+            [
+                torch.zeros(
+                    (states.shape[0], model.config.max_length - states.shape[1], model.config.state_dim),
+                    device=states.device,
+                ),
+                states,
+            ],
+            dim=1,
+        ).to(dtype=torch.float32)
+        actions = torch.cat(
+            [
+                torch.zeros(
+                    (actions.shape[0], model.config.max_length - actions.shape[1], model.config.act_dim),
+                    device=actions.device,
+                ),
+                actions,
+            ],
+            dim=1,
+        ).to(dtype=torch.float32)
+        returns_to_go = torch.cat(
+            [
+                torch.zeros(
+                    (returns_to_go.shape[0], model.config.max_length - returns_to_go.shape[1], 1),
+                    device=returns_to_go.device,
+                ),
+                returns_to_go,
+            ],
+            dim=1,
+        ).to(dtype=torch.float32)
+        timesteps = torch.cat(
+            [
+                torch.zeros(
+                    (timesteps.shape[0], model.config.max_length - timesteps.shape[1]), device=timesteps.device
+                ),
+                timesteps,
+            ],
+            dim=1,
+        ).to(dtype=torch.long)
+    else:
+        attention_mask = None
+
+    _, action_preds, _ = model(
+        states=states,
+        actions=actions,
+        rewards=rewards,
+        returns_to_go=returns_to_go,
+        timesteps=timesteps,
+        attention_mask=attention_mask,
+        return_dict=False,
+    )
+
+    return action_preds[0, -1]
+
+
+# build the environment
+
+env = gym.make("Hopper-v3")
+state_dim = env.observation_space.shape[0]
+act_dim = env.action_space.shape[0]
+max_ep_len = 1000
+device = "cuda"
+scale = 1000.0  # normalization for rewards/returns
+TARGET_RETURN = 3600 / scale  # evaluation conditioning targets, 3600 is reasonable from the paper LINK
+state_mean = np.array(
+    [
+        1.311279,
+        -0.08469521,
+        -0.5382719,
+        -0.07201576,
+        0.04932366,
+        2.1066856,
+        -0.15017354,
+        0.00878345,
+        -0.2848186,
+        -0.18540096,
+        -0.28461286,
+    ]
+)
+state_std = np.array(
+    [
+        0.17790751,
+        0.05444621,
+        0.21297139,
+        0.14530419,
+        0.6124444,
+        0.85174465,
+        1.4515252,
+        0.6751696,
+        1.536239,
+        1.6160746,
+        5.6072536,
+    ]
+)
+state_mean = torch.from_numpy(state_mean).to(device=device)
+state_std = torch.from_numpy(state_std).to(device=device)
+
+# Create the decision transformer model
+model = DecisionTransformerModel.from_pretrained("edbeeching/decision-transformer-gym-hopper-medium")
+model = model.to(device)
+model.eval()
+
+for ep in range(10):
+    episode_return, episode_length = 0, 0
+    state = env.reset()
+    target_return = torch.tensor(TARGET_RETURN, device=device, dtype=torch.float32).reshape(1, 1)
+    states = torch.from_numpy(state).reshape(1, state_dim).to(device=device, dtype=torch.float32)
+    actions = torch.zeros((0, act_dim), device=device, dtype=torch.float32)
+    rewards = torch.zeros(0, device=device, dtype=torch.float32)
+
+    timesteps = torch.tensor(0, device=device, dtype=torch.long).reshape(1, 1)
+    for t in range(max_ep_len):
+        env.render()
+        # add padding
+        actions = torch.cat([actions, torch.zeros((1, act_dim), device=device)], dim=0)
+        rewards = torch.cat([rewards, torch.zeros(1, device=device)])
+
+        action = get_action(
+            model,
+            (states.to(dtype=torch.float32) - state_mean) / state_std,
+            actions.to(dtype=torch.float32),
+            rewards.to(dtype=torch.float32),
+            target_return.to(dtype=torch.float32),
+            timesteps.to(dtype=torch.long),
+        )
+        actions[-1] = action
+        action = action.detach().cpu().numpy()
+
+        state, reward, done, _ = env.step(action)
+
+        cur_state = torch.from_numpy(state).to(device=device).reshape(1, state_dim)
+        states = torch.cat([states, cur_state], dim=0)
+        rewards[-1] = reward
+
+        pred_return = target_return[0, -1] - (reward / scale)
+        target_return = torch.cat([target_return, pred_return.reshape(1, 1)], dim=1)
+        timesteps = torch.cat([timesteps, torch.ones((1, 1), device=device, dtype=torch.long) * (t + 1)], dim=1)
+
+        episode_return += reward
+        episode_length += 1
+
+        if done:
+            break
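
The `get_action` helper in the script above truncates each input to the last `max_length` steps, left-pads it with zeros, and builds an attention mask that is 0 over the padding and 1 over real steps. The sketch below shows the same idea on plain Python lists so it can be run without PyTorch or MuJoCo; `left_pad` is a hypothetical helper, not part of the PR.

```python
from typing import List, Tuple


def left_pad(seq: List[float], max_length: int, pad_value: float = 0.0) -> Tuple[List[float], List[int]]:
    """Keep the last `max_length` items, left-pad with `pad_value`, return (padded, mask)."""
    if max_length <= 0:
        # seq[-0:] would return the whole list, so reject this case explicitly.
        raise ValueError("max_length must be positive")
    seq = seq[-max_length:]  # keep only the most recent context
    n_pad = max_length - len(seq)
    padded = [pad_value] * n_pad + seq
    mask = [0] * n_pad + [1] * len(seq)  # 0 = padding, 1 = real step
    return padded, mask


padded, mask = left_pad([1.0, 2.0], max_length=4)
print(padded)  # [0.0, 0.0, 1.0, 2.0]
print(mask)    # [0, 0, 1, 1]
```

Padding on the left keeps the most recent step in the final position, which is why the script can read the prediction from index `-1`.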
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -173,6 +173,7 @@ _import_structure = {
    "models.data2vec": ["DATA2VEC_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP", "Data2VecAudioConfig", "Data2VecTextConfig"],
    "models.deberta": ["DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaConfig", "DebertaTokenizer"],
    "models.deberta_v2": ["DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaV2Config"],
+    "models.decision_transformer": ["DECISION_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "DecisionTransformerConfig"],
    "models.deit": ["DEIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "DeiTConfig"],
    "models.detr": ["DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "DetrConfig"],
    "models.dialogpt": [],
@@ -901,6 +902,15 @@ if is_torch_available():
            "DebertaV2PreTrainedModel",
        ]
    )
+    _import_structure["models.decision_transformer"].extend(
+        [
+            "DECISION_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "DecisionTransformerGPT2Model",
+            "DecisionTransformerGPT2PreTrainedModel",
+            "DecisionTransformerModel",
+            "DecisionTransformerPreTrainedModel",
+        ]
+    )
    _import_structure["models.deit"].extend(
        [
            "DEIT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -2509,6 +2519,10 @@ if TYPE_CHECKING:
    from .models.data2vec import DATA2VEC_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP, Data2VecAudioConfig, Data2VecTextConfig
    from .models.deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig, DebertaTokenizer
    from .models.deberta_v2 import DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaV2Config
+    from .models.decision_transformer import (
+        DECISION_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        DecisionTransformerConfig,
+    )
    from .models.deit import DEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, DeiTConfig
    from .models.detr import DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, DetrConfig
    from .models.distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig, DistilBertTokenizer
@@ -3128,6 +3142,13 @@ if TYPE_CHECKING:
            DebertaV2Model,
            DebertaV2PreTrainedModel,
        )
+        from .models.decision_transformer import (
+            DECISION_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
+            DecisionTransformerGPT2Model,
+            DecisionTransformerGPT2PreTrainedModel,
+            DecisionTransformerModel,
+            DecisionTransformerPreTrainedModel,
+        )
        from .models.deit import (
            DEIT_PRETRAINED_MODEL_ARCHIVE_LIST,
            DeiTForImageClassification,

--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -43,6 +43,7 @@ from . import (
    data2vec,
    deberta,
    deberta_v2,
+    decision_transformer,
    deit,
    detr,
    dialogpt,

--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -29,8 +29,10 @@ logger = logging.get_logger(__name__)
 CONFIG_MAPPING_NAMES = OrderedDict(
    [
        # Add configs here
+        ("decision_transformer", "DecisionTransformerConfig"),
        ("glpn", "GLPNConfig"),
        ("maskformer", "MaskFormerConfig"),
+        ("decision_transformer", "DecisionTransformerConfig"),
        ("poolformer", "PoolFormerConfig"),
        ("convnext", "ConvNextConfig"),
        ("van", "VanConfig"),
@@ -222,6 +224,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
 MODEL_NAMES_MAPPING = OrderedDict(
    [
        # Add full (and cased) model names here
+        ("decision_transformer", "Decision Transformer"),
        ("glpn", "GLPN"),
        ("maskformer", "MaskFormer"),
        ("poolformer", "PoolFormer"),
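
Note that the `CONFIG_MAPPING_NAMES` hunk above adds the `decision_transformer` entry twice. Building an `OrderedDict` from a list of pairs accepts repeated keys silently: the key keeps its first position and takes the last value. The sketch below illustrates this with a shortened copy of the list; `find_duplicate_keys` is a hypothetical helper and not something the repository provides.

```python
from collections import Counter, OrderedDict
from typing import List, Tuple


def find_duplicate_keys(pairs: List[Tuple[str, str]]) -> List[str]:
    """Return keys that occur more than once, in order of first appearance."""
    counts = Counter(key for key, _ in pairs)
    return [key for key, count in counts.items() if count > 1]


pairs = [
    ("decision_transformer", "DecisionTransformerConfig"),
    ("glpn", "GLPNConfig"),
    ("maskformer", "MaskFormerConfig"),
    ("decision_transformer", "DecisionTransformerConfig"),
]

mapping = OrderedDict(pairs)
print(list(mapping))               # ['decision_transformer', 'glpn', 'maskformer']
print(find_duplicate_keys(pairs))  # ['decision_transformer']
```

Because both entries here carry the same value the duplicate is harmless at runtime, but a check like this would catch a case where the two values disagree.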

--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -28,8 +28,11 @@ logger = logging.get_logger(__name__)
 MODEL_MAPPING_NAMES = OrderedDict(
    [
        # Base model mapping
+        ("decision_transformer", "DecisionTransformerModel"),
        ("glpn", "GLPNModel"),
        ("maskformer", "MaskFormerModel"),
+        ("decision_transformer", "DecisionTransformerModel"),
+        ("decision_transformer_gpt2", "DecisionTransformerGPT2Model"),
        ("poolformer", "PoolFormerModel"),
        ("convnext", "ConvNextModel"),
        ("van", "VanModel"),

--- a/src/transformers/models/decision_transformer/__init__.py
+++ b/src/transformers/models/decision_transformer/__init__.py
+# flake8: noqa
+# There's no way to ignore "F401 '...' imported but unused" warnings in this
+# module, but to preserve other warnings. So, don't check this module at all.
+
+# Copyright 2020 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+# rely on isort to merge the imports
+from ...file_utils import _LazyModule, is_torch_available
+
+
+_import_structure = {
+    "configuration_decision_transformer": [
+        "DECISION_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "DecisionTransformerConfig",
+    ],
+}
+
+if is_torch_available():
+    _import_structure["modeling_decision_transformer"] = [
+        "DECISION_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "DecisionTransformerGPT2Model",
+        "DecisionTransformerGPT2PreTrainedModel",
+        "DecisionTransformerModel",
+        "DecisionTransformerPreTrainedModel",
+    ]
+
+
+if TYPE_CHECKING:
+    from .configuration_decision_transformer import (
+        DECISION_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        DecisionTransformerConfig,
+    )
+
+    if is_torch_available():
+        from .modeling_decision_transformer import (
+            DECISION_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
+            DecisionTransformerGPT2Model,
+            DecisionTransformerGPT2PreTrainedModel,
+            DecisionTransformerModel,
+            DecisionTransformerPreTrainedModel,
+        )
+
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
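
The `__init__.py` above declares an `_import_structure` dictionary and replaces the module with a `_LazyModule`, so that heavy submodules are only imported when one of their names is first accessed. The following is a simplified stand-in for that pattern, written as an assumption about the general idea rather than a copy of the library's `_LazyModule`; it maps names to standard-library modules so that it runs on its own.

```python
import importlib
import types
from typing import Dict, List


class LazyModule(types.ModuleType):
    """Resolve attributes on first access by importing the module that defines them."""

    def __init__(self, name: str, import_structure: Dict[str, List[str]]):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for direct lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr: str):
        # Only called when normal attribute lookup fails.
        name_to_module = self.__dict__.get("_name_to_module", {})
        if attr not in name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        value = getattr(importlib.import_module(name_to_module[attr]), attr)
        setattr(self, attr, value)  # cache so __getattr__ is not hit again
        return value


lazy = LazyModule("lazy_demo", {"json": ["dumps"], "math": ["sqrt"]})
print(lazy.sqrt(9.0))  # 3.0
```

The real module in this PR resolves names relative to the package and only does so when `TYPE_CHECKING` is false; the sketch leaves both details out.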
--- a/src/transformers/models/decision_transformer/configuration_decision_transformer.py
+++ b/src/transformers/models/decision_transformer/configuration_decision_transformer.py
+# coding=utf-8
+# Copyright 2022 The HuggingFace Team and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Decision Transformer model configuration"""
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+DECISION_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "edbeeching/decision-transformer-gym-hopper-medium": "https://huggingface.co/edbeeching/decision-transformer-gym-hopper-medium/resolve/main/config.json",
+    # See all DecisionTransformer models at https://huggingface.co/models?filter=decision_transformer
+}
+
+
+class DecisionTransformerConfig(PretrainedConfig):
+    """
+    This is the configuration class to store the configuration of a [`DecisionTransformerModel`]. It is used to
+    instantiate a Decision Transformer model according to the specified arguments, defining the model architecture.
+    Instantiating a configuration with the defaults will yield a similar configuration to that of the standard
+    DecisionTransformer architecture. Many of the config options are used to instantiate the GPT2 model that is used as
+    part of the architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+
+    Args:
+        state_dim (`int`, *optional*, defaults to 17):
+            The state size for the RL environment
+        act_dim (`int`, *optional*, defaults to 4):
+            The size of the output action space
+        hidden_size (`int`, *optional*, defaults to 128):
+            The size of the hidden layers
+        max_ep_len (`int`, *optional*, defaults to 4096):
+            The maximum length of an episode in the environment
+        action_tanh (`bool`, *optional*, defaults to True):
+            Whether to use a tanh activation on action prediction
+        vocab_size (`int`, *optional*, defaults to 50257):
+            Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the
+            `inputs_ids` passed when calling [`DecisionTransformerModel`].
+        n_positions (`int`, *optional*, defaults to 1024):
+            The maximum sequence length that this model might ever be used with. Typically set this to something large
+            just in case (e.g., 512 or 1024 or 2048).
+        n_embd (`int`, *optional*, defaults to 768):
+            Dimensionality of the embeddings and hidden states.
+        n_layer (`int`, *optional*, defaults to 12):
+            Number of hidden layers in the Transformer encoder.
+        n_head (`int`, *optional*, defaults to 12):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        n_inner (`int`, *optional*):
+            Dimensionality of the inner feed-forward layers. If unset, will default to 4 times `n_embd`.
+        activation_function (`str`, *optional*, defaults to `"relu"`):
+            Activation function, to be selected in the list `["relu", "silu", "gelu", "tanh", "gelu_new"]`.
+        resid_pdrop (`float`, *optional*, defaults to 0.1):
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        embd_pdrop (`float`, *optional*, defaults to 0.1):
+            The dropout ratio for the embeddings.
+        attn_pdrop (`float`, *optional*, defaults to 0.1):
+            The dropout ratio for the attention.
+        layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
+            The epsilon to use in the layer normalization layers.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        scale_attn_weights (`bool`, *optional*, defaults to `True`):
+            Scale attention weights by dividing by sqrt(hidden_size).
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models).
+        scale_attn_by_inverse_layer_idx (`bool`, *optional*, defaults to `False`):
+            Whether to additionally scale attention weights by `1 / (layer_idx + 1)`.
+        reorder_and_upcast_attn (`bool`, *optional*, defaults to `False`):
+            Whether to scale keys (K) prior to computing attention (dot-product) and upcast attention
+            dot-product/softmax to float() when training with mixed precision.
+
+    Example:
+
+    ```python
+    >>> from transformers import DecisionTransformerModel, DecisionTransformerConfig
+
+    >>> # Initializing a DecisionTransformer configuration
+    >>> configuration = DecisionTransformerConfig()
+
+    >>> # Initializing a model from the configuration
+    >>> model = DecisionTransformerModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "decision_transformer"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    attribute_map = {
+        "max_position_embeddings": "n_positions",
+        "num_attention_heads": "n_head",
+        "num_hidden_layers": "n_layer",
+    }
+
+    def __init__(
+        self,
+        state_dim=17,
+        act_dim=4,
+        hidden_size=128,
+        max_ep_len=4096,
+        action_tanh=True,
+        vocab_size=1,
+        n_positions=1024,
+        n_embd=768,
+        n_layer=3,
+        n_head=1,
+        n_inner=None,
+        activation_function="relu",
+        resid_pdrop=0.1,
+        embd_pdrop=0.1,
+        attn_pdrop=0.1,
+        layer_norm_epsilon=1e-5,
+        initializer_range=0.02,
+        summary_type="cls_index",
+        summary_use_proj=True,
+        summary_activation=None,
+        summary_proj_to_labels=True,
+        summary_first_dropout=0.1,
+        scale_attn_weights=True,
+        use_cache=True,
+        bos_token_id=50256,
+        eos_token_id=50256,
+        scale_attn_by_inverse_layer_idx=False,
+        reorder_and_upcast_attn=False,
+        **kwargs,
+    ):
+
+        self.state_dim = state_dim
+        self.act_dim = act_dim
+        self.hidden_size = hidden_size
+        self.max_ep_len = max_ep_len
+        self.action_tanh = action_tanh
+        self.vocab_size = vocab_size
+        self.n_positions = n_positions
+        self.n_embd = n_embd
+        self.n_layer = n_layer
+        self.n_head = n_head
+        self.n_inner = n_inner
+        self.activation_function = activation_function
+        self.resid_pdrop = resid_pdrop
+        self.embd_pdrop = embd_pdrop
+        self.attn_pdrop = attn_pdrop
+        self.layer_norm_epsilon = layer_norm_epsilon
+        self.initializer_range = initializer_range
+        self.summary_type = summary_type
+        self.summary_use_proj = summary_use_proj
+        self.summary_activation = summary_activation
+        self.summary_first_dropout = summary_first_dropout
+        self.summary_proj_to_labels = summary_proj_to_labels
+        self.scale_attn_weights = scale_attn_weights
+        self.use_cache = use_cache
+        self.scale_attn_by_inverse_layer_idx = scale_attn_by_inverse_layer_idx
+        self.reorder_and_upcast_attn = reorder_and_upcast_attn
+
+        self.bos_token_id = bos_token_id
+        self.eos_token_id = eos_token_id
+
+        super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
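The `attribute_map` in the configuration above lets the common names (`max_position_embeddings`, `num_attention_heads`, `num_hidden_layers`) resolve to the GPT-2 style fields (`n_positions`, `n_head`, `n_layer`). A minimal sketch of that aliasing idea, assuming a toy class `AliasedConfig` that is hypothetical and is not the `PretrainedConfig` implementation:

```python
class AliasedConfig:
    """Toy stand-in for a config with an attribute_map (hypothetical, not the transformers class)."""

    attribute_map = {
        "max_position_embeddings": "n_positions",
        "num_attention_heads": "n_head",
        "num_hidden_layers": "n_layer",
    }

    def __init__(self, n_positions=1024, n_layer=3, n_head=1):
        self.n_positions = n_positions
        self.n_layer = n_layer
        self.n_head = n_head

    def __setattr__(self, key, value):
        # Redirect a write on an alias to the canonical attribute name.
        key = type(self).attribute_map.get(key, key)
        super().__setattr__(key, value)

    def __getattr__(self, key):
        # Only reached when normal lookup fails, which is the case for alias names.
        attribute_map = type(self).attribute_map
        if key in attribute_map:
            return getattr(self, attribute_map[key])
        raise AttributeError(key)


config = AliasedConfig()
print(config.num_hidden_layers)  # 3
config.num_attention_heads = 4
print(config.n_head)  # 4
```

Reads and writes through an alias both land on the canonical field, so only one value is ever stored.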
--- a/src/transformers/models/decision_transformer/modeling_decision_transformer.py
+++ b/src/transformers/models/decision_transformer/modeling_decision_transformer.py
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -1425,6 +1425,37 @@ class DebertaV2PreTrainedModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


+DECISION_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None
+
+
+class DecisionTransformerGPT2Model(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class DecisionTransformerGPT2PreTrainedModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class DecisionTransformerModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class DecisionTransformerPreTrainedModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
 DEIT_PRETRAINED_MODEL_ARCHIVE_LIST = None
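The dummy classes added above exist so that the names can be imported without PyTorch installed and only fail when used. A rough sketch of that pattern, assuming a hypothetical `AVAILABLE_BACKENDS` set and a hypothetical helper `requires_backends_sketch` (neither is the library's own code):

```python
AVAILABLE_BACKENDS = set()  # hypothetical: pretend no optional backend is installed


def requires_backends_sketch(obj, backends):
    # Use the class name whether we were handed a class or an instance.
    name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
    missing = [b for b in backends if b not in AVAILABLE_BACKENDS]
    if missing:
        raise ImportError(f"{name} requires the following backends: {', '.join(missing)}")


class PlaceholderModel:
    """Importable placeholder that raises as soon as it is instantiated."""

    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends_sketch(self, self._backends)


try:
    PlaceholderModel()
    message = None
except ImportError as err:
    message = str(err)

print(message)  # PlaceholderModel requires the following backends: torch
```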



--- a/tests/decision_transformer/__init__.py
+++ b/tests/decision_transformer/__init__.py
--- a/tests/decision_transformer/test_modeling_decision_transformer.py
+++ b/tests/decision_transformer/test_modeling_decision_transformer.py
+# coding=utf-8
+# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Testing suite for the PyTorch DecisionTransformer model. """
+
+
+import inspect
+import unittest
+
+from transformers import DecisionTransformerConfig, is_torch_available
+from transformers.testing_utils import require_torch, slow, torch_device
+
+from ..generation.test_generation_utils import GenerationTesterMixin
+from ..test_configuration_common import ConfigTester
+from ..test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
+
+
+if is_torch_available():
+    import torch
+
+    from transformers import DecisionTransformerModel
+    from transformers.models.decision_transformer.modeling_decision_transformer import (
+        DECISION_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
+    )
+
+
+class DecisionTransformerModelTester:
+    def __init__(
+        self,
+        parent,
+        batch_size=13,
+        seq_length=7,
+        act_dim=6,
+        state_dim=17,
+        hidden_size=23,
+        max_length=11,
+        is_training=True,
+    ):
+        self.parent = parent
+        self.batch_size = batch_size
+        self.seq_length = seq_length
+        self.act_dim = act_dim
+        self.state_dim = state_dim
+        self.hidden_size = hidden_size
+        self.max_length = max_length
+        self.is_training = is_training
+
+    def prepare_config_and_inputs(self):
+        states = floats_tensor((self.batch_size, self.seq_length, self.state_dim))
+        actions = floats_tensor((self.batch_size, self.seq_length, self.act_dim))
+        rewards = floats_tensor((self.batch_size, self.seq_length, 1))
+        returns_to_go = floats_tensor((self.batch_size, self.seq_length, 1))
+        timesteps = ids_tensor((self.batch_size, self.seq_length), vocab_size=1000)
+        attention_mask = random_attention_mask((self.batch_size, self.seq_length))
+
+        config = self.get_config()
+
+        return (
+            config,
+            states,
+            actions,
+            rewards,
+            returns_to_go,
+            timesteps,
+            attention_mask,
+        )
+
+    def get_config(self):
+        return DecisionTransformerConfig(
+            batch_size=self.batch_size,
+            seq_length=self.seq_length,
+            act_dim=self.act_dim,
+            state_dim=self.state_dim,
+            hidden_size=self.hidden_size,
+            max_length=self.max_length,
+        )
+
+    def create_and_check_model(
+        self,
+        config,
+        states,
+        actions,
+        rewards,
+        returns_to_go,
+        timesteps,
+        attention_mask,
+    ):
+        model = DecisionTransformerModel(config=config)
+        model.to(torch_device)
+        model.eval()
+        result = model(states, actions, rewards, returns_to_go, timesteps, attention_mask)
+
+        self.parent.assertEqual(result.state_preds.shape, states.shape)
+        self.parent.assertEqual(result.action_preds.shape, actions.shape)
+        self.parent.assertEqual(result.return_preds.shape, returns_to_go.shape)
+        self.parent.assertEqual(
+            result.last_hidden_state.shape, (self.batch_size, self.seq_length * 3, self.hidden_size)
+        )  # seq length *3 as there are 3 modalities: states, returns and actions
+
+    def prepare_config_and_inputs_for_common(self):
+        config_and_inputs = self.prepare_config_and_inputs()
+        (
+            config,
+            states,
+            actions,
+            rewards,
+            returns_to_go,
+            timesteps,
+            attention_mask,
+        ) = config_and_inputs
+        inputs_dict = {
+            "states": states,
+            "actions": actions,
+            "rewards": rewards,
+            "returns_to_go": returns_to_go,
+            "timesteps": timesteps,
+            "attention_mask": attention_mask,
+        }
+        return config, inputs_dict
+
+
+@require_torch
+class DecisionTransformerModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
+
+    all_model_classes = (DecisionTransformerModel,) if is_torch_available() else ()
+    all_generative_model_classes = ()
+
+    # Skip a failing test from GenerationTesterMixin, as the model does not use input_ids
+    test_generate_without_input_ids = False
+
+    # Skip failing tests from ModelTesterMixin, as the model does not implement these features
+    test_pruning = False
+    test_resize_embeddings = False
+    test_head_masking = False
+    test_attention_outputs = False
+    test_hidden_states_output = False
+    test_inputs_embeds = False
+    test_model_common_attributes = False
+    test_gradient_checkpointing = False
+
+    def setUp(self):
+        self.model_tester = DecisionTransformerModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=DecisionTransformerConfig, hidden_size=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_model(*config_and_inputs)
+
+    @slow
+    def test_model_from_pretrained(self):
+        for model_name in DECISION_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
+            model = DecisionTransformerModel.from_pretrained(model_name)
+            self.assertIsNotNone(model)
+
+    def test_forward_signature(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            signature = inspect.signature(model.forward)
+            # signature.parameters is an OrderedDict => so arg_names order is deterministic
+            arg_names = [*signature.parameters.keys()]
+
+            expected_arg_names = [
+                "states",
+                "actions",
+                "rewards",
+                "returns_to_go",
+                "timesteps",
+                "attention_mask",
+            ]
+
+            self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)
+
+
+@require_torch
+class DecisionTransformerModelIntegrationTest(unittest.TestCase):
+    @slow
+    def test_autoregressive_prediction(self):
+        """
+        An integration test that performs autoregressive prediction of state, action and return
+        from a sequence of states, actions and returns. Test is performed over two timesteps.
+
+        """
+
+        NUM_STEPS = 2  # number of steps of autoregressive prediction we will perform
+        TARGET_RETURN = 10  # defined by the RL environment, may be normalized
+        model = DecisionTransformerModel.from_pretrained("edbeeching/decision-transformer-gym-hopper-expert")
+        model = model.to(torch_device)
+        config = model.config
+        torch.manual_seed(0)
+        state = torch.randn(1, 1, config.state_dim).to(device=torch_device, dtype=torch.float32)  # env.reset()
+
+        expected_outputs = torch.tensor([[0.2384, -0.2955, 0.8741], [0.6765, -0.0793, -0.1298]], device=torch_device)
+
+        returns_to_go = torch.tensor(TARGET_RETURN, device=torch_device, dtype=torch.float32).reshape(1, 1, 1)
+        states = state
+        actions = torch.zeros(1, 0, config.act_dim, device=torch_device, dtype=torch.float32)
+        rewards = torch.zeros(1, 0, device=torch_device, dtype=torch.float32)
+        timesteps = torch.tensor(0, device=torch_device, dtype=torch.long).reshape(1, 1)
+
+        for step in range(NUM_STEPS):
+            actions = torch.cat([actions, torch.zeros(1, 1, config.act_dim, device=torch_device)], dim=1)
+            rewards = torch.cat([rewards, torch.zeros(1, 1, device=torch_device)], dim=1)
+
+            attention_mask = torch.ones(1, states.shape[1]).to(dtype=torch.long, device=states.device)
+
+            with torch.no_grad():
+                _, action_pred, _ = model(
+                    states=states,
+                    actions=actions,
+                    rewards=rewards,
+                    returns_to_go=returns_to_go,
+                    timesteps=timesteps,
+                    attention_mask=attention_mask,
+                    return_dict=False,
+                )
+
+            self.assertEqual(action_pred.shape, actions.shape)
+            self.assertTrue(torch.allclose(action_pred[0, -1], expected_outputs[step], atol=1e-4))
+            state, reward, _, _ = (  # env.step(action)
+                torch.randn(1, 1, config.state_dim).to(device=torch_device, dtype=torch.float32),
+                1.0,
+                False,
+                {},
+            )
+
+            actions[0, -1] = action_pred[0, -1]  # overwrite only the latest action slot of the single batch element
+            states = torch.cat([states, state], dim=1)
+            pred_return = returns_to_go[0, -1] - reward
+            returns_to_go = torch.cat([returns_to_go, pred_return.reshape(1, 1, 1)], dim=1)
+            timesteps = torch.cat(
+                [timesteps, torch.ones((1, 1), device=torch_device, dtype=torch.long) * (step + 1)], dim=1
+            )
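The loop in the integration test above extends `returns_to_go` and `timesteps` by one entry per step: the return still to be achieved drops by the reward just received, and the timestep index increases by one. A minimal sketch of that bookkeeping with plain Python lists instead of tensors; the function name `rollout_bookkeeping` is hypothetical:

```python
def rollout_bookkeeping(target_return, rewards):
    """Track returns-to-go and timestep indices over a rollout (illustrative only)."""
    returns_to_go = [float(target_return)]
    timesteps = [0]
    for step, reward in enumerate(rewards):
        # The remaining return shrinks by the reward collected at this step.
        returns_to_go.append(returns_to_go[-1] - reward)
        timesteps.append(step + 1)
    return returns_to_go, timesteps


returns_to_go, timesteps = rollout_bookkeeping(10, [1.0, 1.0])
print(returns_to_go)  # [10.0, 9.0, 8.0]
print(timesteps)  # [0, 1, 2]
```

With a target return of 10 and a reward of 1.0 per step, as in the test, two steps leave a return-to-go of 8.0.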
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -45,6 +45,7 @@ PRIVATE_MODELS = [
 # Being in this list is an exception and should **not** be the rule.
 IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
    # models to ignore for not tested
+    "DecisionTransformerGPT2Model",  # Building part of bigger (tested) model.
    "SegformerDecodeHead",  # Building part of bigger (tested) model.
    "PLBartEncoder",  # Building part of bigger (tested) model.
    "PLBartDecoder",  # Building part of bigger (tested) model.
@@ -95,6 +96,7 @@ IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
 # Update this list with test files that don't have a tester with a `all_model_classes` variable and which don't
 # trigger the common tests.
 TEST_FILES_WITH_NO_COMMON_TESTS = [
+    "decision_transformer/test_modeling_decision_transformer.py",
    "camembert/test_modeling_camembert.py",
    "mt5/test_modeling_flax_mt5.py",
    "mbart/test_modeling_mbart.py",
@@ -108,12 +110,14 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [
    "xlm_roberta/test_modeling_xlm_roberta.py",
    "vision_text_dual_encoder/test_modeling_vision_text_dual_encoder.py",
    "vision_text_dual_encoder/test_modeling_flax_vision_text_dual_encoder.py",
+    "decision_transformer/test_modeling_decision_transformer.py",
 ]

 # Update this list for models that are not in any of the auto MODEL_XXX_MAPPING. Being in this list is an exception and
 # should **not** be the rule.
 IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    # models to ignore for model xxx mapping
+    "DecisionTransformerGPT2Model",
    "GLPNForDepthEstimation",
    "ViltForQuestionAnswering",
    "ViltForImagesAndTextClassification",