Add QDQBert model and quantization examples of SQUAD task (#14066)

* clean up branch for add-qdqbert-model
* README update for QAT example; update docstrings in modeling_qdqbert.py
* Update qdqbert.rst
* Update README.md
* Update README.md
* calibration data using training set; QAT example runs in fp32
* re-use BERTtokenizer for qdqbert
* Update docs/source/model_doc/qdqbert.rst
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update docs/source/model_doc/qdqbert.rst
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update docs/source/model_doc/qdqbert.rst
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* remove qdqbert tokenizer
* Update qdqbert.rst
* update evaluate-hf-trt-qa.py
* update configuration_qdqbert.py
* update modeling_qdqbert.py: add copied statement; replace assert with ValueError
* update copied from statement
* add is_quantization_available; run make fix-copies
* unittest add require_quantization
* add backend dependency to qdqbert model
* update README; update evaluate script; make style
* lint
* docs qdqbert update
* circleci build_doc add pytorch-quantization for qdqbert
* update README
* update example readme with instructions to upgrade TensorRT to 8.2
* Update src/transformers/models/qdqbert/configuration_qdqbert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/models/qdqbert/configuration_qdqbert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/models/qdqbert/configuration_qdqbert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/models/qdqbert/configuration_qdqbert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* change quantization to pytorch_quantization for backend requirement
* feed_forward_chunking not supported in QDQBert
* make style
* update model docstrings and comments in testing scripts
* rename example to quantization-qdqbert; rename example scripts from qat to quant
* Update src/transformers/models/qdqbert/modeling_qdqbert.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* rm experimental functions in quant_trainer
* qa cleanup
* make fix-copies for docs index.rst
* fix doctree; use post_init() for qdqbert
* fix early device assignment for qdqbert
* fix CI:Model templates runner

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

Unverified Commit a59e7c1e authored Nov 19, 2021 by Shang Zhang Committed by GitHub Nov 19, 2021
20 changed files
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -754,6 +754,7 @@ jobs:
            - run: pip install --upgrade pip
            - run: pip install ."[docs]"
            - run: pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.10.0+cpu.html
+            - run: pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com
            - save_cache:
                  key: v0.4-build_doc-{{ checksum "setup.py" }}
                  paths:

--- a/README.md
+++ b/README.md
@@ -268,6 +268,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
 1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
 1. **[PhoBERT](https://huggingface.co/transformers/model_doc/phobert.html)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
 1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
+1. **[QDQBert](https://huggingface.co/transformers/model_doc/qdqbert.html)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
 1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
 1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
 1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.

--- a/README_ko.md
+++ b/README_ko.md
@@ -266,6 +266,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
 1. **[PhoBERT](https://huggingface.co/transformers/model_doc/phobert.html)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
 1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
+1. **[QDQBert](https://huggingface.co/transformers/model_doc/qdqbert.html)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
 1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
 1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
 1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.

--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -290,6 +290,7 @@ conda install -c huggingface transformers
 1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (来自 Google) 伴随论文 [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) 由 Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu 发布。
 1. **[PhoBERT](https://huggingface.co/transformers/model_doc/phobert.html)** (来自 VinAI Research) 伴随论文 [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 由 Dat Quoc Nguyen and Anh Tuan Nguyen 发布。
 1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。
+1. **[QDQBert](https://huggingface.co/transformers/model_doc/qdqbert.html)** (来自 NVIDIA) 伴随论文 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 由 Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 发布。
 1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (来自 Google Research) 伴随论文 [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) 由 Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya 发布。
 1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (来自 Google Research) 伴随论文 [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) 由 Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder 发布。
 1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (来自 Facebook), 伴随论文 [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) 由 Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov 发布。

--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -302,6 +302,7 @@ conda install -c huggingface transformers
 1. **[Pegasus](https://huggingface.co/transformers/model_doc/pegasus.html)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
 1. **[PhoBERT](https://huggingface.co/transformers/model_doc/phobert.html)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen.
 1. **[ProphetNet](https://huggingface.co/transformers/model_doc/prophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
+1. **[QDQBert](https://huggingface.co/transformers/model_doc/qdqbert.html)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
 1. **[Reformer](https://huggingface.co/transformers/model_doc/reformer.html)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
 1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
 1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.

--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -270,85 +270,88 @@ Supported models
 57. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
    Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
    Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
-58. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
+58. :doc:`QDQBert <model_doc/qdqbert>` (from NVIDIA) released with the paper `Integer Quantization for Deep Learning
+    Inference: Principles and Empirical Evaluation <https://arxiv.org/abs/2004.09602>`__ by Hao Wu, Patrick Judd,
+    Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
+59. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
    Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
-59. :doc:`RemBERT <model_doc/rembert>` (from Google Research) released with the paper `Rethinking embedding coupling in
+60. :doc:`RemBERT <model_doc/rembert>` (from Google Research) released with the paper `Rethinking embedding coupling in
    pre-trained language models <https://arxiv.org/pdf/2010.12821.pdf>`__ by Hyung Won Chung, Thibault Févry, Henry
    Tsai, M. Johnson, Sebastian Ruder.
-60. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
+61. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
    Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
    Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-61. :doc:`RoFormer <model_doc/roformer>` (from ZhuiyiTechnology), released together with the paper a `RoFormer:
+62. :doc:`RoFormer <model_doc/roformer>` (from ZhuiyiTechnology), released together with the paper a `RoFormer:
    Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/pdf/2104.09864v1.pdf>`__ by Jianlin Su and
    Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
-62. :doc:`SegFormer <model_doc/segformer>` (from NVIDIA) released with the paper `SegFormer: Simple and Efficient
+63. :doc:`SegFormer <model_doc/segformer>` (from NVIDIA) released with the paper `SegFormer: Simple and Efficient
    Design for Semantic Segmentation with Transformers <https://arxiv.org/abs/2105.15203>`__ by Enze Xie, Wenhai Wang,
    Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
-63. :doc:`SEW <model_doc/sew>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in Unsupervised
+64. :doc:`SEW <model_doc/sew>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in Unsupervised
    Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu
    Han, Kilian Q. Weinberger, Yoav Artzi.
-64. :doc:`SEW-D <model_doc/sew_d>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in
+65. :doc:`SEW-D <model_doc/sew_d>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in
    Unsupervised Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim,
    Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
-65. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
+66. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
    `fairseq S2T: Fast Speech-to-Text Modeling with fairseq <https://arxiv.org/abs/2010.05171>`__ by Changhan Wang, Yun
    Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
-66. :doc:`SpeechToTextTransformer2 <model_doc/speech_to_text_2>` (from Facebook), released together with the paper
+67. :doc:`SpeechToTextTransformer2 <model_doc/speech_to_text_2>` (from Facebook), released together with the paper
    `Large-Scale Self- and Semi-Supervised Learning for Speech Translation <https://arxiv.org/abs/2104.06678>`__ by
    Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
-67. :doc:`Splinter <model_doc/splinter>` (from Tel Aviv University), released together with the paper `Few-Shot
+68. :doc:`Splinter <model_doc/splinter>` (from Tel Aviv University), released together with the paper `Few-Shot
    Question Answering by Pretraining Span Selection <https://arxiv.org/abs/2101.00438>`__ by Ori Ram, Yuval Kirstain,
    Jonathan Berant, Amir Globerson, Omer Levy.
-68. :doc:`SqueezeBert <model_doc/squeezebert>` (from Berkeley) released with the paper `SqueezeBERT: What can computer
+69. :doc:`SqueezeBert <model_doc/squeezebert>` (from Berkeley) released with the paper `SqueezeBERT: What can computer
    vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola,
    Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
-69. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
+70. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
    Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
    Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
-70. :doc:`T5v1.1 <model_doc/t5v1.1>` (from Google AI) released in the repository
+71. :doc:`T5v1.1 <model_doc/t5v1.1>` (from Google AI) released in the repository
    `google-research/text-to-text-transfer-transformer
    <https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511>`__ by
    Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi
    Zhou and Wei Li and Peter J. Liu.
-71. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
+72. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
    Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
    Francesco Piccinno and Julian Martin Eisenschlos.
-72. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
+73. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
    Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
    Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
-73. :doc:`TrOCR <model_doc/trocr>` (from Microsoft), released together with the paper `TrOCR: Transformer-based Optical
+74. :doc:`TrOCR <model_doc/trocr>` (from Microsoft), released together with the paper `TrOCR: Transformer-based Optical
    Character Recognition with Pre-trained Models <https://arxiv.org/abs/2109.10282>`__ by Minghao Li, Tengchao Lv, Lei
    Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
-74. :doc:`UniSpeech <model_doc/unispeech>` (from Microsoft Research) released with the paper `UniSpeech: Unified Speech
+75. :doc:`UniSpeech <model_doc/unispeech>` (from Microsoft Research) released with the paper `UniSpeech: Unified Speech
    Representation Learning with Labeled and Unlabeled Data <https://arxiv.org/abs/2101.07597>`__ by Chengyi Wang, Yu
    Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
-75. :doc:`UniSpeechSat <model_doc/unispeech_sat>` (from Microsoft Research) released with the paper `UNISPEECH-SAT:
+76. :doc:`UniSpeechSat <model_doc/unispeech_sat>` (from Microsoft Research) released with the paper `UNISPEECH-SAT:
    UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING <https://arxiv.org/abs/2110.05752>`__ by
    Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li,
    Xiangzhan Yu.
-76. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
+77. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
    Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>`__ by Alexey Dosovitskiy,
    Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
    Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
-77. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
+78. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
    Performant Baseline for Vision and Language <https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark
    Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
-78. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
+79. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
    Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
    Zhou, Abdelrahman Mohamed, Michael Auli.
-79. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
+80. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
    Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
-80. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
+81. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
    Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
    Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
-81. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
+82. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
    Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
    Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
    Zettlemoyer and Veselin Stoyanov.
-82. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
+83. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
    Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
    Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
-83. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
+84. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
    Cross-Lingual Representation Learning For Speech Recognition <https://arxiv.org/abs/2006.13979>`__ by Alexis
    Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.

@@ -464,6 +467,8 @@ Flax), PyTorch, and/or TensorFlow.
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 |         ProphetNet          |       ✅       |       ❌       |       ✅        |         ❌         |      ❌      |
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
+|           QDQBert           |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 |             RAG             |       ✅       |       ❌       |       ✅        |         ✅         |      ❌      |
 +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
 |          Reformer           |       ✅       |       ✅       |       ✅        |         ❌         |      ❌      |
@@ -658,6 +663,7 @@ Flax), PyTorch, and/or TensorFlow.
    model_doc/pegasus
    model_doc/phobert
    model_doc/prophetnet
+    model_doc/qdqbert
    model_doc/rag
    model_doc/reformer
    model_doc/rembert

--- a/docs/source/model_doc/qdqbert.rst
+++ b/docs/source/model_doc/qdqbert.rst
+.. 
+    Copyright 2021 NVIDIA Corporation and The HuggingFace Team. All rights reserved.
+
+    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+    the License. You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations under the License.
+
+QDQBERT
+-----------------------------------------------------------------------------------------------------------------------
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The QDQBERT model can be referenced in `Integer Quantization for Deep Learning Inference: Principles and Empirical
+Evaluation <https://arxiv.org/abs/2004.09602>`__ by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius
+Micikevicius.
+
+The abstract from the paper is the following:
+
+*Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by
+taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of
+quantization parameters and evaluate their choices on a wide range of neural network models for different application
+domains, including vision, speech, and language. We focus on quantization techniques that are amenable to acceleration
+by processors with high-throughput integer math pipelines. We also present a workflow for 8-bit quantization that is
+able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are
+more difficult to quantize, such as MobileNets and BERT-large.*
+
+Tips:
+
+- The QDQBERT model adds fake quantization operations (pairs of QuantizeLinear/DequantizeLinear ops) to (i) linear
+  layer inputs and weights, (ii) matmul inputs, and (iii) residual add inputs in the BERT model.
+
+- QDQBERT depends on the `Pytorch Quantization Toolkit
+  <https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization>`__. To install it, run ``pip install
+  pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com``.
+
+- A QDQBERT model can be loaded from any checkpoint of a HuggingFace BERT model (for example *bert-base-uncased*) and
+  then used for Quantization Aware Training or Post Training Quantization.
+
+- A complete example of using the QDQBERT model to perform Quantization Aware Training and Post Training Quantization
+  for the SQUAD task can be found at `transformers/examples/research_projects/quantization-qdqbert/
+  </examples/research_projects/quantization-qdqbert/>`_.
+
+This model was contributed by `shangz <https://huggingface.co/shangz>`__.
+
+
+Set default quantizers
+_______________________________________________________________________________________________________________________
+
+The QDQBERT model adds fake quantization operations (pairs of QuantizeLinear/DequantizeLinear ops) to BERT using
+:obj:`TensorQuantizer` from the `Pytorch Quantization Toolkit
+<https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization>`__. :obj:`TensorQuantizer` is the module
+for quantizing tensors, with :obj:`QuantDescriptor` defining how the tensor should be quantized. Refer to the `Pytorch
+Quantization Toolkit userguide
+<https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/userguide.html>`__ for more details.
+
+Before creating a QDQBERT model, set the default :obj:`QuantDescriptor` for inputs and weights, which defines the
+default tensor quantizers. Example:
+
+.. code-block::
+
+    >>> import pytorch_quantization.nn as quant_nn
+    >>> from pytorch_quantization.tensor_quant import QuantDescriptor
+
+    >>> # By default, input quantizers use the max calibration method
+    >>> input_desc = QuantDescriptor(num_bits=8, calib_method="max")
+    >>> # By default, weight quantizers use per-channel quantization along axis 0
+    >>> weight_desc = QuantDescriptor(num_bits=8, axis=(0,))
+    >>> quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
+    >>> quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)
+
+
+Calibration
+_______________________________________________________________________________________________________________________
+
+Calibration is the process of passing data samples through the quantizers to determine the best scaling factors for
+tensors. After setting up the tensor quantizers, one can use the following example to calibrate the model:
+
+.. code-block::
+
+    >>> # Find the TensorQuantizer and enable calibration
+    >>> for name, module in model.named_modules():
+    ...     if name.endswith('_input_quantizer'):
+    ...         module.enable_calib()
+    ...         module.disable_quant()  # Use full precision data to calibrate
+
+    >>> # Feeding data samples
+    >>> model(x)
+    >>> # ...
+
+    >>> # Finalize calibration
+    >>> for name, module in model.named_modules():
+    ...     if name.endswith('_input_quantizer'):
+    ...         module.load_calib_amax()
+    ...         module.enable_quant()
+
+    >>> # If running on GPU, call .cuda() again because the calibration process creates new tensors
+    >>> model.cuda()
+
+    >>> # Keep running the quantized model
+    >>> # ...
+
+
+Export to ONNX
+_______________________________________________________________________________________________________________________
+
+The goal of exporting to ONNX is to deploy inference with `TensorRT <https://developer.nvidia.com/tensorrt>`__. Fake
+quantization will be broken into a pair of QuantizeLinear/DequantizeLinear ONNX ops. After setting the static member
+``use_fb_fake_quant`` of TensorQuantizer so that it uses PyTorch's own fake quantization functions, the fake quantized
+model can be exported to ONNX by following the instructions in `torch.onnx
+<https://pytorch.org/docs/stable/onnx.html>`__. Example:
+
+.. code-block::
+
+    >>> from pytorch_quantization.nn import TensorQuantizer
+    >>> TensorQuantizer.use_fb_fake_quant = True
+
+    >>> # Load the calibrated model
+    >>> ...
+    >>> # ONNX export
+    >>> torch.onnx.export(...)
+
+
+QDQBertConfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.QDQBertConfig
+    :members:
+
+
+QDQBertModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.QDQBertModel
+    :members: forward
+
+
+QDQBertLMHeadModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.QDQBertLMHeadModel
+    :members: forward
+
+
+QDQBertForMaskedLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.QDQBertForMaskedLM
+    :members: forward
+
+
+QDQBertForSequenceClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.QDQBertForSequenceClassification
+    :members: forward
+
+
+QDQBertForNextSentencePrediction
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.QDQBertForNextSentencePrediction
+    :members: forward
+
+
+QDQBertForMultipleChoice
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.QDQBertForMultipleChoice
+    :members: forward
+
+
+QDQBertForTokenClassification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.QDQBertForTokenClassification
+    :members: forward
+
+
+QDQBertForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.QDQBertForQuestionAnswering
+    :members: forward
+
--- a/examples/research_projects/quantization-qdqbert/Dockerfile
+++ b/examples/research_projects/quantization-qdqbert/Dockerfile
+# coding=utf-8
+# Copyright 2021 NVIDIA Corporation. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+FROM nvcr.io/nvidia/pytorch:21.07-py3
+LABEL maintainer="Hugging Face"
+LABEL repository="transformers"
+
+RUN apt-get update && apt-get install -y sudo
+
+RUN python3 -m pip install --no-cache-dir --upgrade pip
+RUN python3 -m pip install --no-cache-dir --ignore-installed ruamel.yaml \
+    mkl \
+    absl-py \
+    yamlpy \
+    tensorboardX
+RUN python3 -m pip install --no-cache-dir \
+    pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com
+
+WORKDIR /workspace
+COPY . transformers/
+RUN cd transformers/ && \
+    python3 -m pip install --no-cache-dir .
+
+RUN python3 -m pip install --no-cache-dir datasets \
+    accelerate
\ No newline at end of file
--- a/examples/research_projects/quantization-qdqbert/README.md
+++ b/examples/research_projects/quantization-qdqbert/README.md
+<!---
+Copyright 2021 NVIDIA Corporation. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Huggingface QDQBERT Quantization Example
+
+The QDQBERT model adds fake quantization (pair of QuantizeLinear/DequantizeLinear ops) to:
+ * linear layer inputs and weights
+ * matmul inputs
+ * residual add inputs
+
+In this example, we use the QDQBERT model to quantize BERT on the SQuAD task, covering Quantization Aware Training (QAT), Post Training Quantization (PTQ) and inference with TensorRT.
+
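To make "fake quantization" concrete, here is a minimal pure-Python sketch of what one QuantizeLinear/DequantizeLinear pair computes. The helper names (`quantize_linear`, `dequantize_linear`, `fake_quantize`), the symmetric signed integer range and the example `amax` value are illustrative assumptions, not the toolkit's actual implementation.

```python
def quantize_linear(x, scale, num_bits=8):
    """Map a float to a signed integer code, saturating at the range limits."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits (symmetric range, an assumption)
    q = round(x / scale)  # Python's round() rounds halves to the nearest even integer
    return max(-qmax, min(qmax, q))


def dequantize_linear(q, scale):
    """Map an integer code back to a float."""
    return q * scale


def fake_quantize(values, amax, num_bits=8):
    """Quantize then immediately dequantize, so the values stay in floating point."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax  # amax is the largest magnitude that can be represented
    return [dequantize_linear(quantize_linear(v, scale, num_bits), scale) for v in values]


# Made-up activations; -3.0 and 5.0 lie outside [-amax, amax] and saturate.
activations = [-3.0, -0.5, 0.0, 0.26, 1.0, 5.0]
print(fake_quantize(activations, amax=2.54))
```

Values inside `[-amax, amax]` come back with a rounding error of at most half a quantization step, while values outside saturate at `±amax`. This is why the choice of `amax` during calibration matters.
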
+Required:
+- [pytorch-quantization toolkit](https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization)
+- [TensorRT >= 8.2](https://developer.nvidia.com/tensorrt)
+- PyTorch >= 1.10.0
+
+## Set up the environment with Dockerfile
+
+From the `transformers/` directory, build the docker image:
+```
+docker build . -f examples/research_projects/quantization-qdqbert/Dockerfile -t bert_quantization:latest
+```
+
+Run the container:
+```
+docker run --gpus all --privileged --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 bert_quantization:latest
+```
+
+*Note that the current NGC pytorch container (pytorch:21.07-py3) ships TensorRT 8.0, which doesn't meet the requirement of TensorRT >= 8.2. One can either update the Dockerfile with the latest [NGC pytorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) once it supports TensorRT 8.2, or manually download and install [TensorRT >= 8.2](https://developer.nvidia.com/nvidia-tensorrt-download) in the container.*
+
+
+In the container:
+```
+cd transformers/examples/research_projects/quantization-qdqbert/
+```
+
+## Quantization Aware Training (QAT)
+
+Calibrate the pretrained model, then fine-tune it with quantization awareness:
+
+```
+python3 run_quant_qa.py \
+  --model_name_or_path bert-base-uncased \
+  --dataset_name squad \
+  --max_seq_length 128 \
+  --doc_stride 32 \
+  --output_dir calib/bert-base-uncased \
+  --do_calib \
+  --calibrator percentile \
+  --percentile 99.99
+```
+
+```
+python3 run_quant_qa.py \
+  --model_name_or_path calib/bert-base-uncased \
+  --dataset_name squad \
+  --do_train \
+  --do_eval \
+  --per_device_train_batch_size 12 \
+  --learning_rate 4e-5 \
+  --num_train_epochs 2 \
+  --max_seq_length 128 \
+  --doc_stride 32 \
+  --output_dir finetuned_int8/bert-base-uncased \
+  --tokenizer_name bert-base-uncased \
+  --save_steps 0
+```
+
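The calibration command above selects `--calibrator percentile` with `--percentile 99.99`. The sketch below illustrates, in plain Python, why a percentile range can be preferable to a plain max range when activations contain rare outliers. The function names and the data are hypothetical, and the nearest-rank computation on raw values is a simplification: the toolkit's percentile calibrator works on a histogram instead.

```python
import math


def max_calibrate(samples):
    """Max calibration: amax is the largest magnitude seen."""
    return max(abs(v) for v in samples)


def percentile_calibrate(samples, percentile):
    """Percentile calibration (nearest-rank): amax is the magnitude below
    which `percentile` percent of the samples fall."""
    magnitudes = sorted(abs(v) for v in samples)
    rank = math.ceil(percentile * len(magnitudes) / 100.0)
    rank = min(max(rank, 1), len(magnitudes))  # keep the rank inside the list
    return magnitudes[rank - 1]


# 19999 well-behaved activations plus a single large outlier (made-up data).
samples = [math.sin(i) for i in range(19999)] + [40.0]

print("max amax:       ", max_calibrate(samples))
print("percentile amax:", percentile_calibrate(samples, 99.99))
```

With max calibration the single outlier sets the range, so almost all of the integer codes are spent on values that never occur. The percentile range ignores it and keeps finer resolution for the bulk of the activations, at the cost of saturating the outlier.
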
+### Export QAT model to ONNX
+
+To export the QAT model finetuned above:
+
+```
+python3 run_quant_qa.py \
+  --model_name_or_path finetuned_int8/bert-base-uncased \
+  --output_dir ./ \
+  --save_onnx \
+  --per_device_eval_batch_size 1 \
+  --max_seq_length 128 \
+  --doc_stride 32 \
+  --dataset_name squad \
+  --tokenizer_name bert-base-uncased
+```
+
+Use `--recalibrate-weights` to calibrate the weight ranges according to the quantizer axis. Use `--quant-per-tensor` for per tensor quantization (default is per channel).
+Recalibrating will affect the accuracy of the model, but the change should be minimal (< 0.5 F1).
+
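As a rough illustration of the difference between the two granularities, the sketch below computes weight ranges (amax values) for a small weight matrix, either one per output channel (axis 0) or one for the whole tensor. `weight_amax` and the weights are hypothetical and only mirror the idea of taking the max of the weights along the quantizer axis.

```python
def weight_amax(weight, per_tensor=False):
    """Return the quantization range(s) of a weight matrix.

    `weight` is a list of rows, one row per output channel (axis 0).
    Per-channel: one amax per row. Per-tensor: a single amax for all rows.
    """
    row_amax = [max(abs(w) for w in row) for row in weight]
    return max(row_amax) if per_tensor else row_amax


# Made-up 3x2 weight matrix whose rows have very different magnitudes.
weight = [
    [0.1, -0.2],
    [1.5, 0.3],
    [-0.05, 0.02],
]

print(weight_amax(weight))                   # one range per output channel
print(weight_amax(weight, per_tensor=True))  # a single range for the tensor
```

With per-tensor quantization the row with the largest weights dictates the range for every row, so rows with small weights are represented more coarsely than with per-channel ranges.
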
+### Benchmark the INT8 QAT ONNX model inference with TensorRT using dummy input
+
+```
+trtexec --onnx=model.onnx --explicitBatch --workspace=16384 --int8 --shapes=input_ids:64x128,attention_mask:64x128,token_type_ids:64x128 --verbose
+```
+
+### Evaluate the INT8 QAT ONNX model inference with TensorRT
+
+```
+python3 evaluate-hf-trt-qa.py \
+  --onnx_model_path=./model.onnx \
+  --output_dir ./ \
+  --per_device_eval_batch_size 64 \
+  --max_seq_length 128 \
+  --doc_stride 32 \
+  --dataset_name squad \
+  --tokenizer_name bert-base-uncased \
+  --int8 \
+  --seed 42
+```
+
+## Fine-tuning of FP32 model for comparison
+
+Finetune a fp32 precision model with [transformers/examples/pytorch/question-answering/](../../pytorch/question-answering/):
+
+```
+python3 ../../pytorch/question-answering/run_qa.py \
+  --model_name_or_path bert-base-uncased \
+  --dataset_name squad \
+  --per_device_train_batch_size 12 \
+  --learning_rate 3e-5 \
+  --num_train_epochs 2 \
+  --max_seq_length 128 \
+  --doc_stride 32 \
+  --output_dir ./finetuned_fp32/bert-base-uncased \
+  --save_steps 0 \
+  --do_train \
+  --do_eval
+```
+
+## Post Training Quantization (PTQ)
+
+### PTQ by calibrating and evaluating the finetuned FP32 model above:
+
+```
+python3 run_quant_qa.py \
+  --model_name_or_path ./finetuned_fp32/bert-base-uncased \
+  --dataset_name squad \
+  --calibrator percentile \
+  --percentile 99.99 \
+  --max_seq_length 128 \
+  --doc_stride 32 \
+  --output_dir ./calib/bert-base-uncased \
+  --save_steps 0 \
+  --do_calib \
+  --do_eval
+```
+
+### Export the INT8 PTQ model to ONNX
+
+```
+python3 run_quant_qa.py \
+  --model_name_or_path ./calib/bert-base-uncased \
+  --output_dir ./ \
+  --save_onnx \
+  --per_device_eval_batch_size 1 \
+  --max_seq_length 128 \
+  --doc_stride 32 \
+  --dataset_name squad \
+  --tokenizer_name bert-base-uncased
+```
+
+### Evaluate the INT8 PTQ ONNX model inference with TensorRT
+
+```
+python3 evaluate-hf-trt-qa.py \
+  --onnx_model_path=./model.onnx \
+  --output_dir ./ \
+  --per_device_eval_batch_size 64 \
+  --max_seq_length 128 \
+  --doc_stride 32 \
+  --dataset_name squad \
+  --tokenizer_name bert-base-uncased \
+  --int8 \
+  --seed 42
+```
+
+### Quantization options
+
+The following options support different implementations and optimizations. They should be specified for both calibration and finetuning.
+
+|argument|description|
+|--------|-----------|
+|`--quant-per-tensor`| quantize weights with one quantization range per tensor |
+|`--fuse-qkv` | use a single range (the max) for quantizing QKV weights and output activations  |
+|`--clip-gelu N` | clip the output of GELU to a maximum of N when quantizing (e.g. 10) |
+|`--disable-dropout` | disable dropout for consistent activation ranges |
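
The effect of `--fuse-qkv` and `--clip-gelu N` can be pictured with the small sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, the ranges are scalars rather than per-channel tensors, and the exact GELU formula is used, which may differ from how the example scripts apply these options.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def clipped_gelu(x, clip_max):
    """GELU whose output is clipped to at most clip_max, bounding the range to quantize."""
    return min(gelu(x), clip_max)


def fused_qkv_amax(q_amax, k_amax, v_amax):
    """Single shared range for Q, K and V: the max of the three individual ranges."""
    return max(q_amax, k_amax, v_amax)


print(fused_qkv_amax(2.1, 3.4, 1.7))  # Q, K and V all use the largest range
print(clipped_gelu(25.0, 10.0))       # a large activation is capped at N = 10
print(clipped_gelu(1.0, 10.0))        # typical activations pass through unchanged
```
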
--- a/examples/research_projects/quantization-qdqbert/evaluate-hf-trt-qa.py
+++ b/examples/research_projects/quantization-qdqbert/evaluate-hf-trt-qa.py
+# coding=utf-8
+# Copyright 2021 NVIDIA Corporation. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Evaluating an ONNX question-answering model on SQuAD with TensorRT."""
+import argparse
+import logging
+import os
+import time
+import timeit
+
+import datasets
+import numpy as np
+import torch
+from absl import logging as absl_logging
+from datasets import load_dataset, load_metric
+from torch.utils.data import DataLoader
+
+import pycuda.autoinit  # noqa: F401
+import pycuda.driver as cuda
+import tensorrt as trt
+import transformers
+from accelerate import Accelerator
+from transformers import AutoTokenizer, EvalPrediction, default_data_collator, set_seed
+from transformers.trainer_pt_utils import nested_concat, nested_truncate
+from utils_qa import postprocess_qa_predictions
+
+
+TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
+absl_logger = absl_logging.get_absl_logger()
+absl_logger.setLevel(logging.WARNING)
+
+logger = logging.getLogger(__name__)
+
+parser = argparse.ArgumentParser()
+
+# Required parameters
+parser.add_argument(
+    "--onnx_model_path",
+    default=None,
+    type=str,
+    required=True,
+    help="Path to ONNX model: ",
+)
+
+parser.add_argument(
+    "--output_dir",
+    default=None,
+    type=str,
+    required=True,
+    help="The output directory where the model checkpoints and predictions will be written.",
+)
+
+# Other parameters
+
+parser.add_argument(
+    "--tokenizer_name",
+    default="",
+    type=str,
+    required=True,
+    help="Pretrained tokenizer name or path if not the same as model_name",
+)
+
+parser.add_argument(
+    "--version_2_with_negative",
+    action="store_true",
+    help="If true, the SQuAD examples contain some that do not have an answer.",
+)
+parser.add_argument(
+    "--null_score_diff_threshold",
+    type=float,
+    default=0.0,
+    help="If null_score - best_non_null is greater than the threshold predict null.",
+)
+
+parser.add_argument(
+    "--max_seq_length",
+    default=384,
+    type=int,
+    help="The maximum total input sequence length after WordPiece tokenization. Sequences "
+    "longer than this will be truncated, and sequences shorter than this will be padded.",
+)
+parser.add_argument(
+    "--doc_stride",
+    default=128,
+    type=int,
+    help="When splitting up a long document into chunks, how much stride to take between chunks.",
+)
+
+parser.add_argument("--per_device_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation.")
+
+parser.add_argument(
+    "--n_best_size",
+    default=20,
+    type=int,
+    help="The total number of n-best predictions to generate in the nbest_predictions.json output file.",
+)
+parser.add_argument(
+    "--max_answer_length",
+    default=30,
+    type=int,
+    help="The maximum length of an answer that can be generated. This is needed because the start "
+    "and end predictions are not conditioned on one another.",
+)
+
+parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
+
+parser.add_argument(
+    "--dataset_name",
+    type=str,
+    default=None,
+    required=True,
+    help="The name of the dataset to use (via the datasets library).",
+)
+parser.add_argument(
+    "--dataset_config_name",
+    type=str,
+    default=None,
+    help="The configuration name of the dataset to use (via the datasets library).",
+)
+parser.add_argument(
+    "--preprocessing_num_workers", type=int, default=4, help="The number of processes to use for the preprocessing."
+)
+parser.add_argument(
+    "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
+)
+parser.add_argument(
+    "--fp16",
+    action="store_true",
+    help="Whether to use 16-bit (mixed) precision instead of 32-bit",
+)
+parser.add_argument(
+    "--int8",
+    action="store_true",
+    help="Whether to use INT8",
+)
+
+args = parser.parse_args()
+
+if args.tokenizer_name:
+    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=True)
+else:
+    raise ValueError(
+        "You are instantiating a new tokenizer from scratch. This is not supported by this script. "
+        "You can do it from another script, save it, and load it from here, using --tokenizer_name."
+    )
+
+logger.info("Training/evaluation parameters %s", args)
+
+args.eval_batch_size = args.per_device_eval_batch_size
+
+INPUT_SHAPE = (args.eval_batch_size, args.max_seq_length)
+
+# TRT Engine properties
+STRICT_TYPES = True
+
+engine_name = "temp_engine/bert-fp32.engine"
+if args.fp16:
+    engine_name = "temp_engine/bert-fp16.engine"
+if args.int8:
+    engine_name = "temp_engine/bert-int8.engine"
+
+# import ONNX file
+if not os.path.exists("temp_engine"):
+    os.makedirs("temp_engine")
+
+EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
+with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) as network, trt.OnnxParser(
+    network, TRT_LOGGER
+) as parser:
+    with open(args.onnx_model_path, "rb") as model:
+        if not parser.parse(model.read()):
+            for error in range(parser.num_errors):
+                print(parser.get_error(error))
+
+    # Query input names and shapes from parsed TensorRT network
+    network_inputs = [network.get_input(i) for i in range(network.num_inputs)]
+    input_names = [_input.name for _input in network_inputs]  # ex: ["actual_input1"]
+
+    with builder.create_builder_config() as config:
+        config.max_workspace_size = 1 << 50
+        if STRICT_TYPES:
+            config.set_flag(trt.BuilderFlag.STRICT_TYPES)
+        if args.fp16:
+            config.set_flag(trt.BuilderFlag.FP16)
+        if args.int8:
+            config.set_flag(trt.BuilderFlag.INT8)
+        profile = builder.create_optimization_profile()
+        config.add_optimization_profile(profile)
+        for i in range(len(input_names)):
+            profile.set_shape(input_names[i], INPUT_SHAPE, INPUT_SHAPE, INPUT_SHAPE)
+        engine = builder.build_engine(network, config)
+
+        # serialize_engine and store in file (can be directly loaded and deserialized):
+        with open(engine_name, "wb") as f:
+            f.write(engine.serialize())
+
+
+# run inference with TRT
+def model_infer(inputs, context, d_inputs, h_output0, h_output1, d_output0, d_output1, stream):
+    input_ids = np.asarray(inputs["input_ids"], dtype=np.int32)
+    attention_mask = np.asarray(inputs["attention_mask"], dtype=np.int32)
+    token_type_ids = np.asarray(inputs["token_type_ids"], dtype=np.int32)
+
+    # Copy inputs
+    cuda.memcpy_htod_async(d_inputs[0], input_ids.ravel(), stream)
+    cuda.memcpy_htod_async(d_inputs[1], attention_mask.ravel(), stream)
+    cuda.memcpy_htod_async(d_inputs[2], token_type_ids.ravel(), stream)
+    # start time
+    start_time = time.time()
+    # Run inference
+    context.execute_async(
+        bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output0), int(d_output1)], stream_handle=stream.handle
+    )
+    # Transfer predictions back from GPU
+    cuda.memcpy_dtoh_async(h_output0, d_output0, stream)
+    cuda.memcpy_dtoh_async(h_output1, d_output1, stream)
+    # Synchronize the stream and take time
+    stream.synchronize()
+    # end time
+    end_time = time.time()
+    infer_time = end_time - start_time
+    outputs = (h_output0, h_output1)
+    # print(outputs)
+    return outputs, infer_time
+
+
+# Initialize the accelerator. We will let the accelerator handle device placement for us in this example.
+accelerator = Accelerator()
+# Make one log on every process with the configuration for debugging.
+logging.basicConfig(
+    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+    datefmt="%m/%d/%Y %H:%M:%S",
+    level=logging.INFO,
+)
+
+# Setup logging, we only want one process per machine to log things on the screen.
+# accelerator.is_local_main_process is only True for one process per machine.
+logger.setLevel(logging.INFO if accelerator.is_local_main_process else logging.ERROR)
+if accelerator.is_local_main_process:
+    datasets.utils.logging.set_verbosity_warning()
+    transformers.utils.logging.set_verbosity_info()
+else:
+    datasets.utils.logging.set_verbosity_error()
+    transformers.utils.logging.set_verbosity_error()
+
+# If passed along, set the training seed now.
+if args.seed is not None:
+    set_seed(args.seed)
+
+# Get the dataset: provide the name of one of the public datasets available on the hub at
+# https://huggingface.co/datasets/ (the dataset will be downloaded automatically from the datasets Hub).
+if args.dataset_name is not None:
+    # Downloading and loading a dataset from the hub.
+    raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name)
+else:
+    raise ValueError("Evaluation requires a dataset name")
+# See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
+# https://huggingface.co/docs/datasets/loading_datasets.html.
+
+# Preprocessing the datasets.
+# Preprocessing is slightly different for training and evaluation.
+
+column_names = raw_datasets["validation"].column_names
+
+question_column_name = "question" if "question" in column_names else column_names[0]
+context_column_name = "context" if "context" in column_names else column_names[1]
+answer_column_name = "answers" if "answers" in column_names else column_names[2]
+
+# Padding side determines if we do (question|context) or (context|question).
+pad_on_right = tokenizer.padding_side == "right"
+
+if args.max_seq_length > tokenizer.model_max_length:
+    logger.warning(
+        f"The max_seq_length passed ({args.max_seq_length}) is larger than the maximum length for the "
+        f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}."
+    )
+
+max_seq_length = min(args.max_seq_length, tokenizer.model_max_length)
+
+
+# Validation preprocessing
+def prepare_validation_features(examples):
+    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
+    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
+    # left whitespace
+    examples[question_column_name] = [q.lstrip() for q in examples[question_column_name]]
+
+    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
+    # in one example possibly giving several features when a context is long, each of those features having a
+    # context that overlaps a bit the context of the previous feature.
+    tokenized_examples = tokenizer(
+        examples[question_column_name if pad_on_right else context_column_name],
+        examples[context_column_name if pad_on_right else question_column_name],
+        truncation="only_second" if pad_on_right else "only_first",
+        max_length=max_seq_length,
+        stride=args.doc_stride,
+        return_overflowing_tokens=True,
+        return_offsets_mapping=True,
+        padding="max_length",
+    )
+
+    # Since one example might give us several features if it has a long context, we need a map from a feature to
+    # its corresponding example. This key gives us just that.
+    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
+
+    # For evaluation, we will need to convert our predictions to substrings of the context, so we keep the
+    # corresponding example_id and we will store the offset mappings.
+    tokenized_examples["example_id"] = []
+
+    for i in range(len(tokenized_examples["input_ids"])):
+        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
+        sequence_ids = tokenized_examples.sequence_ids(i)
+        context_index = 1 if pad_on_right else 0
+
+        # One example can give several spans, this is the index of the example containing this span of text.
+        sample_index = sample_mapping[i]
+        tokenized_examples["example_id"].append(examples["id"][sample_index])
+
+        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
+        # position is part of the context or not.
+        tokenized_examples["offset_mapping"][i] = [
+            (o if sequence_ids[k] == context_index else None)
+            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
+        ]
+
+    return tokenized_examples
+
+
+eval_examples = raw_datasets["validation"]
+# Validation Feature Creation
+eval_dataset = eval_examples.map(
+    prepare_validation_features,
+    batched=True,
+    num_proc=args.preprocessing_num_workers,
+    remove_columns=column_names,
+    load_from_cache_file=not args.overwrite_cache,
+    desc="Running tokenizer on validation dataset",
+)
+
+data_collator = default_data_collator
+
+eval_dataset_for_model = eval_dataset.remove_columns(["example_id", "offset_mapping"])
+eval_dataloader = DataLoader(
+    eval_dataset_for_model, collate_fn=data_collator, batch_size=args.per_device_eval_batch_size
+)
+
+
+# Post-processing:
+def post_processing_function(examples, features, predictions, stage="eval"):
+    # Post-processing: we match the start logits and end logits to answers in the original context.
+    predictions = postprocess_qa_predictions(
+        examples=examples,
+        features=features,
+        predictions=predictions,
+        version_2_with_negative=args.version_2_with_negative,
+        n_best_size=args.n_best_size,
+        max_answer_length=args.max_answer_length,
+        null_score_diff_threshold=args.null_score_diff_threshold,
+        output_dir=args.output_dir,
+        prefix=stage,
+    )
+    # Format the result to the format the metric expects.
+    if args.version_2_with_negative:
+        formatted_predictions = [
+            {"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in predictions.items()
+        ]
+    else:
+        formatted_predictions = [{"id": k, "prediction_text": v} for k, v in predictions.items()]
+
+    references = [{"id": ex["id"], "answers": ex[answer_column_name]} for ex in examples]
+    return EvalPrediction(predictions=formatted_predictions, label_ids=references)
+
+
+metric = load_metric("squad_v2" if args.version_2_with_negative else "squad")
+
+# Evaluation!
+logger.info("Loading ONNX model %s for evaluation", args.onnx_model_path)
+with open(engine_name, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime, runtime.deserialize_cuda_engine(
+    f.read()
+) as engine, engine.create_execution_context() as context:
+
+    # setup for TRT inference
+    for i in range(len(input_names)):
+        context.set_binding_shape(i, INPUT_SHAPE)
+    assert context.all_binding_shapes_specified
+
+    def binding_nbytes(binding):
+        return trt.volume(engine.get_binding_shape(binding)) * engine.get_binding_dtype(binding).itemsize
+
+    # Allocate device memory for inputs and outputs.
+    d_inputs = [cuda.mem_alloc(binding_nbytes(binding)) for binding in engine if engine.binding_is_input(binding)]
+
+    # Allocate output buffer
+    h_output0 = cuda.pagelocked_empty(tuple(context.get_binding_shape(3)), dtype=np.float32)
+    h_output1 = cuda.pagelocked_empty(tuple(context.get_binding_shape(4)), dtype=np.float32)
+    d_output0 = cuda.mem_alloc(h_output0.nbytes)
+    d_output1 = cuda.mem_alloc(h_output1.nbytes)
+
+    # Create a stream in which to copy inputs/outputs and run inference.
+    stream = cuda.Stream()
+
+    # Evaluation
+    logger.info("***** Running Evaluation *****")
+    logger.info(f"  Num examples = {len(eval_dataset)}")
+    logger.info(f"  Batch size = {args.per_device_eval_batch_size}")
+
+    total_time = 0.0
+    niter = 0
+    start_time = timeit.default_timer()
+
+    all_preds = None
+    for step, batch in enumerate(eval_dataloader):
+
+        outputs, infer_time = model_infer(batch, context, d_inputs, h_output0, h_output1, d_output0, d_output1, stream)
+        total_time += infer_time
+        niter += 1
+
+        start_logits, end_logits = outputs
+        start_logits = torch.tensor(start_logits)
+        end_logits = torch.tensor(end_logits)
+
+        # necessary to pad predictions and labels for being gathered
+        start_logits = accelerator.pad_across_processes(start_logits, dim=1, pad_index=-100)
+        end_logits = accelerator.pad_across_processes(end_logits, dim=1, pad_index=-100)
+
+        logits = (accelerator.gather(start_logits).cpu().numpy(), accelerator.gather(end_logits).cpu().numpy())
+        all_preds = logits if all_preds is None else nested_concat(all_preds, logits, padding_index=-100)
+
+    if all_preds is not None:
+        all_preds = nested_truncate(all_preds, len(eval_dataset))
+
+    evalTime = timeit.default_timer() - start_time
+    logger.info("  Evaluation done in total %f secs (%f sec per example)", evalTime, evalTime / len(eval_dataset))
+    # Inference time from TRT
+    logger.info("Average Inference Time = {:.3f} ms".format(total_time * 1000 / niter))
+    logger.info("Total Inference Time =  {:.3f} ms".format(total_time * 1000))
+    logger.info("Total Number of Inference =  %d", niter)
+
+prediction = post_processing_function(eval_examples, eval_dataset, all_preds)
+eval_metric = metric.compute(predictions=prediction.predictions, references=prediction.label_ids)
+logger.info(f"Evaluation metrics: {eval_metric}")
--- a/examples/research_projects/quantization-qdqbert/quant_trainer.py
+++ b/examples/research_projects/quantization-qdqbert/quant_trainer.py
+# coding=utf-8
+# Copyright 2021 NVIDIA Corporation. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Helper functions for training models with pytorch-quantization"""
+import logging
+import re
+
+import torch
+
+import pytorch_quantization
+import pytorch_quantization.nn as quant_nn
+from pytorch_quantization import calib
+from pytorch_quantization.tensor_quant import QuantDescriptor
+
+
+logger = logging.getLogger(__name__)
+
+name_width = 50  # max width of layer names
+qname_width = 70  # max width of quantizer names
+
+# ========================================== Quant Trainer API ==========================================
+
+
+def add_arguments(parser):
+    """Add arguments to parser for functions defined in quant_trainer."""
+
+    group = parser.add_argument_group("quant_trainer arguments")
+    group.add_argument("--wprec", type=int, default=8, help="weight precision")
+    group.add_argument("--aprec", type=int, default=8, help="activation precision")
+    group.add_argument("--quant-per-tensor", action="store_true", help="per tensor weight scaling")
+    group.add_argument("--quant-disable", action="store_true", help="disable all quantizers")
+    group.add_argument("--quant-disable-embeddings", action="store_true", help="disable all embeddings quantizers")
+    group.add_argument("--quant-disable-keyword", type=str, nargs="+", help="disable quantizers by keyword")
+    group.add_argument("--quant-disable-layer-module", type=str, help=r"disable quantizers by keyword under layer.\d+.")
+    group.add_argument("--quant-enable-layer-module", type=str, help=r"enable quantizers by keyword under layer.\d+.")
+    group.add_argument("--calibrator", default="max", help="which quantization range calibrator to use")
+    group.add_argument("--percentile", default=None, type=float, help="percentile for PercentileCalibrator")
+    group.add_argument("--fuse-qkv", action="store_true", help="use the same scale factor for qkv")
+    group.add_argument("--clip-gelu", metavar="N", type=float, help="clip gelu output maximum value to N")
+    group.add_argument(
+        "--recalibrate-weights",
+        action="store_true",
+        help="recalibrate weight amaxes by taking the max of the weights."
+        " amaxes will be computed with the current quantization granularity (axis).",
+    )
+
+
+def set_default_quantizers(args):
+    """Set default quantizers before creating the model."""
+
+    if args.calibrator == "max":
+        calib_method = "max"
+    elif args.calibrator == "percentile":
+        if args.percentile is None:
+            raise ValueError("Specify --percentile when using percentile calibrator")
+        calib_method = "histogram"
+    elif args.calibrator == "mse":
+        calib_method = "histogram"
+    else:
+        raise ValueError(f"Invalid calibrator {args.calibrator}")
+
+    input_desc = QuantDescriptor(num_bits=args.aprec, calib_method=calib_method)
+    weight_desc = QuantDescriptor(num_bits=args.wprec, axis=(None if args.quant_per_tensor else (0,)))
+    quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
+    quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)
+
+
+def configure_model(model, args, calib=False, eval=False):
+    """Function called before the training loop."""
+
+    logger.info("Configuring Model for Quantization")
+    logger.info(f"using quantization package {pytorch_quantization.__file__}")
+
+    if not calib:
+        if args.quant_disable_embeddings:
+            set_quantizer_by_name(model, ["embeddings"], which="weight", _disabled=True)
+
+        if args.quant_disable:
+            set_quantizer_by_name(model, [""], _disabled=True)
+
+        if args.quant_disable_keyword:
+            set_quantizer_by_name(model, args.quant_disable_keyword, _disabled=True)
+
+        if args.quant_disable_layer_module:
+            set_quantizer_by_name(model, [r"layer.\d+." + args.quant_disable_layer_module], _disabled=True)
+
+        if args.quant_enable_layer_module:
+            set_quantizer_by_name(model, [r"layer.\d+." + args.quant_enable_layer_module], _disabled=False)
+
+        if args.recalibrate_weights:
+            recalibrate_weights(model)
+
+        if args.fuse_qkv:
+            fuse_qkv(model, args)
+
+    if args.clip_gelu:
+        clip_gelu(model, args.clip_gelu)
+
+    # if args.local_rank in [-1, 0] and not calib:
+    print_quant_summary(model)
+
+
+def enable_calibration(model):
+    """Enable calibration of all *_quantizer modules in model."""
+
+    logger.info("Enabling Calibration")
+    for name, module in model.named_modules():
+        if name.endswith("_quantizer"):
+            if module._calibrator is not None:
+                module.disable_quant()
+                module.enable_calib()
+            else:
+                module.disable()
+            logger.info(f"{name:80}: {module}")
+
+
+def finish_calibration(model, args):
+    """Disable calibration and load amax for all *_quantizer modules in model."""
+
+    logger.info("Loading calibrated amax")
+    for name, module in model.named_modules():
+        if name.endswith("_quantizer"):
+            if module._calibrator is not None:
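+                if isinstance(module._calibrator, calib.MaxCalibrator):
+                    module.load_calib_amax()
+                elif args.calibrator == "percentile":
+                    module.load_calib_amax("percentile", percentile=args.percentile)
+                else:
+                    # histogram calibrators take the method name, e.g. "mse"
+                    module.load_calib_amax(args.calibrator)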
+                module.enable_quant()
+                module.disable_calib()
+            else:
+                module.enable()
+    model.cuda()
+    print_quant_summary(model)
+
+
+# ========================================== Helper Function ==========================================
+
+
+def fuse_qkv(model, args):
+    """Adjust quantization ranges to match an implementation where the QKV projections are implemented with a single GEMM.
+    Force the weight and output scale factors to match by taking the max of (Q,K,V).
+    """
+
+    def fuse3(qq, qk, qv):
+        for mod in [qq, qk, qv]:
+            if not hasattr(mod, "_amax"):
+                logger.warning("          WARNING: NO AMAX BUFFER")
+                return
+        q = qq._amax.detach().item()
+        k = qk._amax.detach().item()
+        v = qv._amax.detach().item()
+
+        amax = max(q, k, v)
+        qq._amax.fill_(amax)
+        qk._amax.fill_(amax)
+        qv._amax.fill_(amax)
+        logger.info(f"          q={q:5.2f} k={k:5.2f} v={v:5.2f} -> {amax:5.2f}")
+
+    for name, mod in model.named_modules():
+        if name.endswith(".attention.self"):
+            logger.info(f"FUSE_QKV: {name:{name_width}}")
+            fuse3(mod.matmul_q_input_quantizer, mod.matmul_k_input_quantizer, mod.matmul_v_input_quantizer)
+            if args.quant_per_tensor:
+                fuse3(mod.query._weight_quantizer, mod.key._weight_quantizer, mod.value._weight_quantizer)
+
+
+def clip_gelu(model, maxval):
+    """Clip activations generated by GELU to maxval when quantized.
+    Implemented by adjusting the amax of the following input_quantizer.
+    """
+
+    for name, mod in model.named_modules():
+        if name.endswith(".output.dense") and not name.endswith("attention.output.dense"):
+            amax_init = mod._input_quantizer._amax.data.detach().item()
+            mod._input_quantizer._amax.data.detach().clamp_(max=maxval)
+            amax = mod._input_quantizer._amax.data.detach().item()
+            logger.info(f"CLIP_GELU: {name:{name_width}} amax: {amax_init:5.2f} -> {amax:5.2f}")
+
+
+def expand_amax(model):
+    """Expand per-tensor amax to be per channel, where each channel is assigned the per-tensor amax."""
+
+    for name, mod in model.named_modules():
+        if hasattr(mod, "_weight_quantizer") and mod._weight_quantizer.axis is not None:
+            k = mod.weight.shape[0]
+            amax = mod._weight_quantizer._amax.detach()
+            mod._weight_quantizer._amax = torch.ones(k, dtype=amax.dtype, device=amax.device) * amax
+            print(f"expanding {name} {amax} -> {mod._weight_quantizer._amax}")
+
+
+def recalibrate_weights(model):
+    """Performs max calibration on the weights and updates amax."""
+
+    for name, mod in model.named_modules():
+        if hasattr(mod, "_weight_quantizer"):
+            if not hasattr(mod._weight_quantizer, "_amax"):
+                print(f"RECALIB: {name:{name_width}} WARNING: NO AMAX BUFFER")
+                continue
+
+            # determine which axes to reduce across
+            # e.g. a 4D tensor quantized per axis 0 should reduce over (1,2,3)
+            axis_set = set() if mod._weight_quantizer.axis is None else set(mod._weight_quantizer.axis)
+            reduce_axis = set(range(len(mod.weight.size()))) - axis_set
+            amax = pytorch_quantization.utils.reduce_amax(mod.weight, axis=reduce_axis, keepdims=True).detach()
+            logger.info(f"RECALIB: {name:{name_width}} {mod._weight_quantizer._amax.flatten()} -> {amax.flatten()}")
+            mod._weight_quantizer._amax = amax
+
+
+def print_model_summary(model, name_width=25, line_width=180, ignore=None):
+    """Print model quantization configuration."""
+
+    if ignore is None:
+        ignore = []
+    elif not isinstance(ignore, list):
+        ignore = [ignore]
+
+    name_width = 0
+    for name, mod in model.named_modules():
+        if not hasattr(mod, "weight"):
+            continue
+        name_width = max(name_width, len(name))
+
+    for name, mod in model.named_modules():
+        input_q = getattr(mod, "_input_quantizer", None)
+        weight_q = getattr(mod, "_weight_quantizer", None)
+        if not hasattr(mod, "weight"):
+            continue
+        if type(mod) in ignore:
+            continue
+        if [True for s in ignore if type(s) is str and s in name]:
+            continue
+        act_str = f"Act:{input_q.extra_repr() if input_q is not None else None}"
+        wgt_str = f"Wgt:{weight_q.extra_repr() if weight_q is not None else None}"
+        s = f"{name:{name_width}} {act_str} {wgt_str}"
+        if len(s) <= line_width:
+            logger.info(s)
+        else:
+            logger.info(f"{name:{name_width}} {act_str}")
+            logger.info(f'{"  ":{name_width}} {wgt_str}')
+
+
+def print_quant_summary(model):
+    """Print summary of all quantizer modules in the model."""
+
+    count = 0
+    for name, mod in model.named_modules():
+        if isinstance(mod, pytorch_quantization.nn.TensorQuantizer):
+            print(f"{name:80} {mod}")
+            count += 1
+    print(f"{count} TensorQuantizers found in model")
+
+
+def set_quantizer(name, mod, quantizer, k, v):
+    """Set attributes for mod.quantizer."""
+
+    quantizer_mod = getattr(mod, quantizer, None)
+    if quantizer_mod is not None:
+        assert hasattr(quantizer_mod, k)
+        setattr(quantizer_mod, k, v)
+    else:
+        logger.warning(f"{name} has no {quantizer}")
+
+
+def set_quantizers(name, mod, which="both", **kwargs):
+    """Set quantizer attributes for mod."""
+
+    s = f"Warning: changing {which} quantizers of {name:{qname_width}}"
+    for k, v in kwargs.items():
+        s += f" {k}={v}"
+        if which in ["input", "both"]:
+            set_quantizer(name, mod, "_input_quantizer", k, v)
+        if which in ["weight", "both"]:
+            set_quantizer(name, mod, "_weight_quantizer", k, v)
+    logger.info(s)
+
+
+def set_quantizer_by_name(model, names, **kwargs):
+    """Set quantizer attributes for layers where name contains a substring in names."""
+
+    for name, mod in model.named_modules():
+        if hasattr(mod, "_input_quantizer") or hasattr(mod, "_weight_quantizer"):
+            for n in names:
+                if re.search(n, name):
+                    set_quantizers(name, mod, **kwargs)
+        elif name.endswith("_quantizer"):
+            for n in names:
+                if re.search(n, name):
+                    s = f"Warning: changing {name:{name_width}}"
+                    for k, v in kwargs.items():
+                        s += f" {k}={v}"
+                        setattr(mod, k, v)
+                    logger.info(s)
--- a/examples/research_projects/quantization-qdqbert/run_quant_qa.py
+++ b/examples/research_projects/quantization-qdqbert/run_quant_qa.py
--- a/examples/research_projects/quantization-qdqbert/trainer_quant_qa.py
+++ b/examples/research_projects/quantization-qdqbert/trainer_quant_qa.py
+# coding=utf-8
+# Copyright 2020 The HuggingFace Team All rights reserved.
+# Copyright 2021 NVIDIA Corporation. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+A subclass of `Trainer` specific to Question-Answering tasks
+"""
+
+import logging
+import os
+
+import torch
+from torch.utils.data import DataLoader
+
+import quant_trainer
+from transformers import Trainer, is_torch_tpu_available
+from transformers.trainer_utils import PredictionOutput
+
+
+logger = logging.getLogger(__name__)
+
+if is_torch_tpu_available():
+    import torch_xla.core.xla_model as xm
+    import torch_xla.debug.metrics as met
+
+
+class QuestionAnsweringTrainer(Trainer):
+    def __init__(self, *args, eval_examples=None, post_process_function=None, quant_trainer_args=None, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.eval_examples = eval_examples
+        self.post_process_function = post_process_function
+        self.quant_trainer_args = quant_trainer_args
+        self.calib_num = 128  # default number of calibration samples
+
+    def get_calib_dataloader(self, calib_dataset=None):
+        """
+        Returns the calibration dataloader :class:`~torch.utils.data.DataLoader`.
+
+        Args:
+            calib_dataset (:obj:`torch.utils.data.Dataset`, `optional`)
+        """
+        if calib_dataset is None:
+            # `calib_dataset` is not set in `__init__`, so fall back to it only if a caller has attached one
+            calib_dataset = getattr(self, "calib_dataset", None)
+        if calib_dataset is None:
+            raise ValueError("Trainer: calibration requires a calib_dataset.")
+
+        calib_dataset = self._remove_unused_columns(calib_dataset, description="Calibration")
+
+        return DataLoader(
+            calib_dataset,
+            batch_size=self.args.eval_batch_size,
+            collate_fn=self.data_collator,
+            drop_last=self.args.dataloader_drop_last,
+            num_workers=self.args.dataloader_num_workers,
+            pin_memory=self.args.dataloader_pin_memory,
+            shuffle=True,
+        )
+
+    def calibrate(self, calib_dataset=None):
+        calib_dataset = self.train_dataset if calib_dataset is None else calib_dataset
+        calib_dataloader = self.get_calib_dataloader(calib_dataset)
+
+        model = self.model
+        quant_trainer.configure_model(model, self.quant_trainer_args, calib=True)
+        model.eval()
+        quant_trainer.enable_calibration(model)
+
+        logger.info("***** Running calibration *****")
+        logger.info(f"  Num examples = {self.calib_num}")
+        logger.info(f"  Batch size = {calib_dataloader.batch_size}")
+
+        for step, inputs in enumerate(calib_dataloader):
+            # Prediction step
+            loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only=True)
+            if (step + 1) * calib_dataloader.batch_size >= self.calib_num:
+                break
+
+        quant_trainer.finish_calibration(model, self.quant_trainer_args)
+        self.model = model
+
+    def evaluate(self, eval_dataset=None, eval_examples=None, ignore_keys=None, metric_key_prefix: str = "eval"):
+        eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset
+        eval_dataloader = self.get_eval_dataloader(eval_dataset)
+        eval_examples = self.eval_examples if eval_examples is None else eval_examples
+
+        # Temporarily disable metric computation, we will do it in the loop here.
+        compute_metrics = self.compute_metrics
+        self.compute_metrics = None
+        eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
+        try:
+            output = eval_loop(
+                eval_dataloader,
+                description="Evaluation",
+                # No point gathering the predictions if there are no metrics, otherwise we defer to
+                # self.args.prediction_loss_only
+                prediction_loss_only=True if compute_metrics is None else None,
+                ignore_keys=ignore_keys,
+            )
+        finally:
+            self.compute_metrics = compute_metrics
+
+        if self.post_process_function is not None and self.compute_metrics is not None:
+            eval_preds = self.post_process_function(eval_examples, eval_dataset, output.predictions)
+            metrics = self.compute_metrics(eval_preds)
+
+            # Prefix all keys with metric_key_prefix + '_'
+            for key in list(metrics.keys()):
+                if not key.startswith(f"{metric_key_prefix}_"):
+                    metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key)
+
+            self.log(metrics)
+        else:
+            metrics = {}
+
+        if self.args.tpu_metrics_debug or self.args.debug:
+            # tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
+            xm.master_print(met.metrics_report())
+
+        self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, metrics)
+        return metrics
+
+    def predict(self, predict_dataset, predict_examples, ignore_keys=None, metric_key_prefix: str = "test"):
+        predict_dataloader = self.get_test_dataloader(predict_dataset)
+
+        # Temporarily disable metric computation, we will do it in the loop here.
+        compute_metrics = self.compute_metrics
+        self.compute_metrics = None
+        eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
+        try:
+            output = eval_loop(
+                predict_dataloader,
+                description="Prediction",
+                # No point gathering the predictions if there are no metrics, otherwise we defer to
+                # self.args.prediction_loss_only
+                prediction_loss_only=True if compute_metrics is None else None,
+                ignore_keys=ignore_keys,
+            )
+        finally:
+            self.compute_metrics = compute_metrics
+
+        if self.post_process_function is None or self.compute_metrics is None:
+            return output
+
+        predictions = self.post_process_function(predict_examples, predict_dataset, output.predictions, "predict")
+        metrics = self.compute_metrics(predictions)
+
+        # Prefix all keys with metric_key_prefix + '_'
+        for key in list(metrics.keys()):
+            if not key.startswith(f"{metric_key_prefix}_"):
+                metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key)
+
+        return PredictionOutput(predictions=predictions.predictions, label_ids=predictions.label_ids, metrics=metrics)
+
+    def save_onnx(self, output_dir="./"):
+        eval_dataset = self.eval_dataset
+        eval_dataloader = self.get_eval_dataloader(eval_dataset)
+
+        batch = next(iter(eval_dataloader))
+
+        # saving device - to make it consistent
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+        # convert to tuple
+        input_tuple = tuple(v.to(device) for k, v in batch.items())
+
+        logger.info("Converting model to be onnx compatible")
+        from pytorch_quantization.nn import TensorQuantizer
+
+        TensorQuantizer.use_fb_fake_quant = True
+
+        model = self.model.to(device)
+
+        model.eval()
+        model.float()
+
+        model_to_save = model.module if hasattr(model, "module") else model
+        quant_trainer.configure_model(model_to_save, self.quant_trainer_args)
+
+        output_model_file = os.path.join(output_dir, "model.onnx")
+        logger.info(f"exporting model to {output_model_file}")
+
+        axes = {0: "batch_size", 1: "seq_len"}
+
+        torch.onnx.export(
+            model_to_save,
+            input_tuple,
+            output_model_file,
+            export_params=True,
+            opset_version=13,
+            do_constant_folding=True,
+            input_names=["input_ids", "attention_mask", "token_type_ids"],
+            output_names=["output_start_logits", "output_end_logits"],
+            dynamic_axes={
+                "input_ids": axes,
+                "attention_mask": axes,
+                "token_type_ids": axes,
+                "output_start_logits": axes,
+                "output_end_logits": axes,
+            },
+            verbose=True,
+        )
+        logger.info("onnx export finished")
--- a/examples/research_projects/quantization-qdqbert/utils_qa.py
+++ b/examples/research_projects/quantization-qdqbert/utils_qa.py
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -44,6 +44,7 @@ from . import dependency_versions_check
 from .file_utils import (
    _LazyModule,
    is_flax_available,
+    is_pytorch_quantization_available,
    is_scatter_available,
    is_sentencepiece_available,
    is_speech_available,
@@ -248,6 +249,7 @@ _import_structure = {
    "models.pegasus": ["PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP", "PegasusConfig", "PegasusTokenizer"],
    "models.phobert": ["PhobertTokenizer"],
    "models.prophetnet": ["PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "ProphetNetConfig", "ProphetNetTokenizer"],
+    "models.qdqbert": ["QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "QDQBertConfig"],
    "models.rag": ["RagConfig", "RagRetriever", "RagTokenizer"],
    "models.reformer": ["REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "ReformerConfig"],
    "models.rembert": ["REMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "RemBertConfig"],
@@ -529,6 +531,30 @@ else:
        name for name in dir(dummy_scatter_objects) if not name.startswith("_")
    ]

+if is_torch_available() and is_pytorch_quantization_available():
+    _import_structure["models.qdqbert"].extend(
+        [
+            "QDQBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "QDQBertForMaskedLM",
+            "QDQBertForMultipleChoice",
+            "QDQBertForNextSentencePrediction",
+            "QDQBertForQuestionAnswering",
+            "QDQBertForSequenceClassification",
+            "QDQBertForTokenClassification",
+            "QDQBertLayer",
+            "QDQBertLMHeadModel",
+            "QDQBertModel",
+            "QDQBertPreTrainedModel",
+            "load_tf_weights_in_qdqbert",
+        ]
+    )
+else:
+    from .utils import dummy_pytorch_quantization_and_torch_objects
+
+    _import_structure["utils.dummy_pytorch_quantization_and_torch_objects"] = [
+        name for name in dir(dummy_pytorch_quantization_and_torch_objects) if not name.startswith("_")
+    ]
+
 # PyTorch-backed objects
 if is_torch_available():
    _import_structure["benchmark.benchmark"] = ["PyTorchBenchmark"]
@@ -2188,6 +2214,7 @@ if TYPE_CHECKING:
    from .models.pegasus import PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP, PegasusConfig, PegasusTokenizer
    from .models.phobert import PhobertTokenizer
    from .models.prophetnet import PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ProphetNetConfig, ProphetNetTokenizer
+    from .models.qdqbert import QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, QDQBertConfig
    from .models.rag import RagConfig, RagRetriever, RagTokenizer
    from .models.reformer import REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, ReformerConfig
    from .models.rembert import REMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RemBertConfig
@@ -2415,6 +2442,24 @@ if TYPE_CHECKING:
    else:
        from .utils.dummy_scatter_objects import *

+    if is_torch_available() and is_pytorch_quantization_available():
+        from .models.qdqbert import (
+            QDQBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
+            QDQBertForMaskedLM,
+            QDQBertForMultipleChoice,
+            QDQBertForNextSentencePrediction,
+            QDQBertForQuestionAnswering,
+            QDQBertForSequenceClassification,
+            QDQBertForTokenClassification,
+            QDQBertLayer,
+            QDQBertLMHeadModel,
+            QDQBertModel,
+            QDQBertPreTrainedModel,
+            load_tf_weights_in_qdqbert,
+        )
+    else:
+        from .utils.dummy_pytorch_quantization_and_torch_objects import *
+
    if is_torch_available():
        # Benchmarks
        from .benchmark.benchmark import PyTorchBenchmark

--- a/src/transformers/file_utils.py
+++ b/src/transformers/file_utils.py
@@ -196,6 +196,14 @@ except importlib_metadata.PackageNotFoundError:
    _scatter_available = False


+_pytorch_quantization_available = importlib.util.find_spec("pytorch_quantization") is not None
+try:
+    _pytorch_quantization_version = importlib_metadata.version("pytorch_quantization")
+    logger.debug(f"Successfully imported pytorch-quantization version {_pytorch_quantization_version}")
+except importlib_metadata.PackageNotFoundError:
+    _pytorch_quantization_available = False
+
+
 _soundfile_available = importlib.util.find_spec("soundfile") is not None
 try:
    _soundfile_version = importlib_metadata.version("soundfile")
@@ -431,6 +439,10 @@ def is_scatter_available():
    return _scatter_available


+def is_pytorch_quantization_available():
+    return _pytorch_quantization_available
+
+
 def is_pandas_available():
    return importlib.util.find_spec("pandas") is not None

@@ -610,6 +622,12 @@ SCATTER_IMPORT_ERROR = """
 explained here: https://github.com/rusty1s/pytorch_scatter.
 """

+# docstyle-ignore
+PYTORCH_QUANTIZATION_IMPORT_ERROR = """
+{0} requires the pytorch-quantization library but it was not found in your environment. You can install it with pip:
+`pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com`
+"""
+

 # docstyle-ignore
 PANDAS_IMPORT_ERROR = """
@@ -661,6 +679,7 @@ BACKENDS_MAPPING = OrderedDict(
        ("protobuf", (is_protobuf_available, PROTOBUF_IMPORT_ERROR)),
        ("pytesseract", (is_pytesseract_available, PYTESSERACT_IMPORT_ERROR)),
        ("scatter", (is_scatter_available, SCATTER_IMPORT_ERROR)),
+        ("pytorch_quantization", (is_pytorch_quantization_available, PYTORCH_QUANTIZATION_IMPORT_ERROR)),
        ("sentencepiece", (is_sentencepiece_available, SENTENCEPIECE_IMPORT_ERROR)),
        ("sklearn", (is_sklearn_available, SKLEARN_IMPORT_ERROR)),
        ("speech", (is_speech_available, SPEECH_IMPORT_ERROR)),

--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -79,6 +79,7 @@ from . import (
    pegasus,
    phobert,
    prophetnet,
+    qdqbert,
    rag,
    reformer,
    rembert,

--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -31,6 +31,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
    [
        # Add configs here
        ("imagegpt", "ImageGPTConfig"),
+        ("qdqbert", "QDQBertConfig"),
        ("vision-encoder-decoder", "VisionEncoderDecoderConfig"),
        ("trocr", "TrOCRConfig"),
        ("fnet", "FNetConfig"),
@@ -113,6 +114,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
    [
        # Add archive maps here
        ("imagegpt", "IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+        ("qdqbert", "QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("fnet", "FNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("pegasus", "PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("segformer", "SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -185,6 +187,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
    [
        # Add full (and cased) model names here
        ("imagegpt", "ImageGPT"),
+        ("qdqbert", "QDQBert"),
        ("vision-encoder-decoder", "Vision Encoder decoder"),
        ("trocr", "TrOCR"),
        ("fnet", "FNet"),

--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -29,6 +29,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
    [
        # Base model mapping
        ("imagegpt", "ImageGPTModel"),
+        ("qdqbert", "QDQBertModel"),
        ("fnet", "FNetModel"),
        ("segformer", "SegformerModel"),
        ("gptj", "GPTJModel"),
@@ -147,6 +148,7 @@ MODEL_WITH_LM_HEAD_MAPPING_NAMES = OrderedDict(
    [
        # Model with LM heads mapping
        ("imagegpt", "ImageGPTForCausalLM"),
+        ("qdqbert", "QDQBertForMaskedLM"),
        ("fnet", "FNetForMaskedLM"),
        ("gptj", "GPTJForCausalLM"),
        ("rembert", "RemBertForMaskedLM"),
@@ -198,6 +200,7 @@ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
    [
        # Model for Causal LM mapping
        ("imagegpt", "ImageGPTForCausalLM"),
+        ("qdqbert", "QDQBertLMHeadModel"),
        ("trocr", "TrOCRForCausalLM"),
        ("gptj", "GPTJForCausalLM"),
        ("rembert", "RemBertForCausalLM"),
@@ -257,6 +260,7 @@ MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = OrderedDict(
 MODEL_FOR_MASKED_LM_MAPPING_NAMES = OrderedDict(
    [
        # Model for Masked LM mapping
+        ("qdqbert", "QDQBertForMaskedLM"),
        ("fnet", "FNetForMaskedLM"),
        ("rembert", "RemBertForMaskedLM"),
        ("roformer", "RoFormerForMaskedLM"),
@@ -327,6 +331,7 @@ MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
 MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
    [
        # Model for Sequence Classification mapping
+        ("qdqbert", "QDQBertForSequenceClassification"),
        ("fnet", "FNetForSequenceClassification"),
        ("gptj", "GPTJForSequenceClassification"),
        ("layoutlmv2", "LayoutLMv2ForSequenceClassification"),
@@ -372,6 +377,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
 MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = OrderedDict(
    [
        # Model for Question Answering mapping
+        ("qdqbert", "QDQBertForQuestionAnswering"),
        ("fnet", "FNetForQuestionAnswering"),
        ("layoutlmv2", "LayoutLMv2ForQuestionAnswering"),
        ("rembert", "RemBertForQuestionAnswering"),
@@ -418,6 +424,7 @@ MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING_NAMES = OrderedDict(
 MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
    [
        # Model for Token Classification mapping
+        ("qdqbert", "QDQBertForTokenClassification"),
        ("fnet", "FNetForTokenClassification"),
        ("layoutlmv2", "LayoutLMv2ForTokenClassification"),
        ("rembert", "RemBertForTokenClassification"),
@@ -452,6 +459,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
 MODEL_FOR_MULTIPLE_CHOICE_MAPPING_NAMES = OrderedDict(
    [
        # Model for Multiple Choice mapping
+        ("qdqbert", "QDQBertForMultipleChoice"),
        ("fnet", "FNetForMultipleChoice"),
        ("rembert", "RemBertForMultipleChoice"),
        ("canine", "CanineForMultipleChoice"),
@@ -480,6 +488,7 @@ MODEL_FOR_MULTIPLE_CHOICE_MAPPING_NAMES = OrderedDict(

 MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING_NAMES = OrderedDict(
    [
+        ("qdqbert", "QDQBertForNextSentencePrediction"),
        ("bert", "BertForNextSentencePrediction"),
        ("fnet", "FNetForNextSentencePrediction"),
        ("megatron-bert", "MegatronBertForNextSentencePrediction"),

--- a/src/transformers/models/auto/tokenization_auto.py
+++ b/src/transformers/models/auto/tokenization_auto.py
@@ -173,6 +173,7 @@ else:
                ),
            ),
            ("ibert", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)),
+            ("qdqbert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
            ("wav2vec2", ("Wav2Vec2CTCTokenizer", None)),
            ("hubert", ("Wav2Vec2CTCTokenizer", None)),
            ("gpt_neo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),