Transformers — transformers 4.3.0 documentation

The library currently contains PyTorch, TensorFlow and Flax implementations, pretrained model weights, usage scripts and conversion utilities for the following models:
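
As a quick illustration of how these implementations are used, here is a minimal sketch (not part of the original model list) that loads a pretrained checkpoint and its tokenizer through the Auto classes; the checkpoint name bert-base-uncased is only an example, and any checkpoint from the list below can be substituted:

    # Load a pretrained checkpoint and its tokenizer through the Auto classes.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    # Tokenize a sentence and run it through the model to get its hidden states.
    inputs = tokenizer("Hello, Transformers!", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)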

  • ALBERT (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.

  • BART (from Facebook) released with the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.

  • BARThez (from École polytechnique) released with the paper BARThez: a Skilled Pretrained French Sequence-to-Sequence Model by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.

  • BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

  • BERT For Sequence Generation (from Google) released with the paper Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

  • Blenderbot (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.

  • BlenderbotSmall (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.

  • BORT (from Alexa) released with the paper Optimal Subarchitecture Extraction For BERT by Adrian de Wynter and Daniel J. Perry.

  • CamemBERT (from Inria/Facebook/Sorbonne) released with the paper CamemBERT: a Tasty French Language Model by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.

  • ConvBERT (from YituTech) released with the paper ConvBERT: Improving BERT with Span-based Dynamic Convolution by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.

  • CTRL (from Salesforce) released with the paper CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.

  • DeBERTa (from Microsoft Research) released with the paper DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.

  • DialoGPT (from Microsoft Research) released with the paper DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.

  • DistilBERT (from HuggingFace), released together with the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into DistilGPT2, RoBERTa into DistilRoBERTa, Multilingual BERT into DistilmBERT and a German version of DistilBERT.

  • DPR (from Facebook) released with the paper Dense Passage Retrieval for Open-Domain Question Answering by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.

  • ELECTRA (from Google Research/Stanford University) released with the paper ELECTRA: Pre-training text encoders as discriminators rather than generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.

  • FlauBERT (from CNRS) released with the paper FlauBERT: Unsupervised Language Model Pre-training for French by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.

  • Funnel Transformer (from CMU/Google Brain) released with the paper Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.

  • GPT (from OpenAI) released with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.

  • GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.

  • LayoutLM (from Microsoft Research Asia) released with the paper LayoutLM: Pre-training of Text and Layout for Document Image Understanding by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.

  • LED (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.

  • Longformer (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.

  • LXMERT (from UNC Chapel Hill) released with the paper LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering by Hao Tan and Mohit Bansal.

  • MarianMT Machine translation models trained using OPUS data by Jörg Tiedemann. The Marian Framework is being developed by the Microsoft Translator Team.

  • MBart (from Facebook) released with the paper Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.

  • MPNet (from Microsoft Research) released with the paper MPNet: Masked and Permuted Pre-training for Language Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.

  • MT5 (from Google AI) released with the paper mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.

  • Pegasus (from Google) released with the paper PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.

  • ProphetNet (from Microsoft Research) released with the paper ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.

  • Reformer (from Google Research) released with the paper Reformer: The Efficient Transformer by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.

  • RoBERTa (from Facebook), released together with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.

  • SqueezeBert released with the paper SqueezeBERT: What can computer vision teach NLP about efficient neural networks? by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.

  • T5 (from Google AI) released with the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.

  • TAPAS (from Google AI) released with the paper TAPAS: Weakly Supervised Table Parsing via Pre-training by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.

  • Transformer-XL (from Google/CMU) released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.

  • Wav2Vec2 (from Facebook AI) released with the paper wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

  • XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.

  • XLM-ProphetNet (from Microsoft Research) released with the paper ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.

  • XLM-RoBERTa (from Facebook AI), released together with the paper Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.

  • XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.

The table below summarizes the current support in the library for each of those models: whether they have a Python tokenizer (called "slow"), a "fast" tokenizer backed by the 🤗 Tokenizers library, and whether they have support in PyTorch, TensorFlow and/or Flax.
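
In practice, the tokenizer backend and framework are chosen at load time. The following is a minimal sketch, assuming the example checkpoint bert-base-uncased and that both PyTorch and TensorFlow are installed:

    from transformers import AutoModel, AutoTokenizer, TFAutoModel

    # "Slow" pure-Python tokenizer vs. "fast" tokenizer backed by 🤗 Tokenizers.
    slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
    fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

    # The same pretrained weights loaded as a PyTorch module ...
    pt_model = AutoModel.from_pretrained("bert-base-uncased")
    # ... or as a TensorFlow/Keras model.
    tf_model = TFAutoModel.from_pretrained("bert-base-uncased")

Note that AutoTokenizer selects the fast tokenizer by default whenever one is available for the model.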
