PaddleNLP Transformer Pre-trained Models#
With the advancement of deep learning, the NLP field has witnessed a surge of high-quality Transformer-based pre-trained models, which have repeatedly broken SOTA (state-of-the-art) records across various NLP tasks, greatly advancing the progress of natural language processing. PaddleNLP provides users with commonly used pre-trained models and their corresponding weights, such as BERT, ERNIE, ALBERT, RoBERTa, and XLNet, with unified APIs for loading, training, and inference. This enables developers to conveniently and efficiently apply various Transformer-based pre-trained models to their downstream tasks, while ensuring fast and stable downloads of the corresponding pre-trained model weights.
Usage of Pre-trained Models#
The PaddleNLP Transformer API not only offers a rich set of pre-trained models but also lowers the barrier to entry for users. With the Auto module, users can load pre-trained models with different network architectures without needing to look up the model-specific classes. With just a dozen or so lines of code, users can complete model loading and downstream-task fine-tuning.
```python
from functools import partial

import numpy as np
import paddle
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the ChnSentiCorp sentiment classification training set.
train_ds = load_dataset("chnsenticorp", splits=["train"])

# Load the pre-trained model and tokenizer by weight name.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-wwm-chinese", num_classes=len(train_ds.label_list))
tokenizer = AutoTokenizer.from_pretrained("bert-wwm-chinese")

def convert_example(example, tokenizer):
    # Convert raw text into padded input ids, segment ids, and the label.
    encoded_inputs = tokenizer(text=example["text"], max_seq_len=512, pad_to_max_seq_len=True)
    return tuple([np.array(x, dtype="int64") for x in [
        encoded_inputs["input_ids"], encoded_inputs["token_type_ids"], [example["label"]]]])

train_ds = train_ds.map(partial(convert_example, tokenizer=tokenizer))

# Batch and shuffle the data.
batch_sampler = paddle.io.BatchSampler(dataset=train_ds, batch_size=8, shuffle=True)
train_data_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=batch_sampler, return_list=True)

optimizer = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters())
criterion = paddle.nn.loss.CrossEntropyLoss()

# Fine-tune the model.
for input_ids, token_type_ids, labels in train_data_loader():
    logits = model(input_ids, token_type_ids)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.clear_grad()
```
The code above provides a concise example of using a pre-trained model. For more complete and detailed sample code, please refer to: Fine-tune Pretrained Models for Chinese Text Classification Task
- Load Dataset: PaddleNLP provides built-in datasets that can be loaded with a single line of code.
- Load Pretrained Models: PaddleNLP's pre-trained models are loaded via the `from_pretrained()` method. The Auto modules (including `AutoModel`, `AutoTokenizer`, and classes for various downstream tasks) provide user-friendly APIs, eliminating the need to specify the concrete model class when loading different architectures. The first argument corresponds to the Pretrained Weight in the summary table and loads the matching pre-trained weights. Additional initialization arguments for `AutoModelForSequenceClassification`, such as `num_classes`, are also passed through `from_pretrained()`. The `Tokenizer` is loaded with the same `from_pretrained()` method.
- Data Preprocessing: Use the `map` function of `Dataset` to convert raw text into model inputs via the `tokenizer` (see the sketch after this list).
- Data Preparation: Define a `BatchSampler` and a `DataLoader` to shuffle the data and assemble batches.
- Model Training: Define the optimizer, loss function, and other training components, then start fine-tuning.
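To make the preprocessing step concrete, the short sketch below runs the tokenizer on one raw sentence and prints the fields consumed by the model. The sample sentence is made up for illustration; the weight name and tokenizer arguments follow the example above.

```python
from paddlenlp.transformers import AutoTokenizer

# Same tokenizer as in the fine-tuning example above.
tokenizer = AutoTokenizer.from_pretrained("bert-wwm-chinese")

# Tokenize one raw sentence (the text here is only an illustrative example).
encoded = tokenizer(text="这家酒店的服务很好", max_seq_len=512)

# The returned dict holds the inputs fed to the model in the training loop.
print(encoded["input_ids"])       # token ids, including [CLS] and [SEP]
print(encoded["token_type_ids"])  # segment ids (all zeros for a single sentence)
```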
Transformer Pretrained Models Summary#
PaddleNLP's Transformer pre-trained models include weights converted from huggingface.co as well as Baidu's self-developed models, making it easy for the community to migrate. They currently cover 40+ mainstream pre-trained model architectures and 500+ sets of model weights.
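Because all of these models share the same loading interface, switching architectures only means changing the weight name passed to `from_pretrained()`. A minimal sketch, assuming the weight names used below appear in the model summary:

```python
from paddlenlp.transformers import AutoModel, AutoTokenizer

# The same Auto API resolves different architectures from the weight name alone.
# The weight names below are illustrative; check the summary table for the full list.
for name in ["bert-wwm-chinese", "ernie-1.0", "roberta-wwm-ext"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    print(name, "->", type(model).__name__)
```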
Applicable Tasks Summary for Transformer Pre-trained Models#
| Model | Sequence Classification | Token Classification | Question Answering | Text Generation | Multiple Choice |
|---|---|---|---|---|---|
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ✅ | ✅ | ✅ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ❌ | ❌ | ❌ | ✅ | ❌ |
|  | ❌ | ❌ | ❌ | ✅ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ❌ | ❌ | ❌ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ✅ | ✅ | ❌ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ❌ | ✅ | ❌ | ❌ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ❌ |
|  | ❌ | ❌ | ❌ | ✅ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ✅ | ✅ | ❌ | ❌ |
|  | ✅ | ✅ | ❌ | ✅ | ❌ |
|  | ✅ | ✅ | ❌ | ❌ | ❌ |
|  | ❌ | ✅ | ❌ | ❌ | ❌ |
|  | ❌ | ✅ | ❌ | ❌ | ❌ |
|  | ❌ | ✅ | ✅ | ❌ | ❌ |
|  | ✅ | ❌ | ✅ | ❌ | ✅ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ❌ | ✅ | ❌ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ❌ | ❌ | ❌ | ❌ |
|  | ❌ | ❌ | ❌ | ✅ | ❌ |
|  | ✅ | ❌ | ✅ | ❌ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
|  | ✅ | ✅ | ✅ | ❌ | ❌ |
|  | ✅ | ✅ | ❌ | ❌ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ❌ |
|  | ❌ | ❌ | ❌ | ✅ | ❌ |
|  | ✅ | ❌ | ❌ | ❌ | ❌ |
|  | ❌ | ❌ | ❌ | ✅ | ❌ |
|  | ✅ | ✅ | ✅ | ❌ | ✅ |
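Each task column in the table corresponds to a task-specific Auto class. As a hedged sketch, assuming your PaddleNLP version provides the classes below and that the chosen weights support these tasks (a ✅ in the matching column), loading the same weights with different downstream heads looks like this:

```python
from paddlenlp.transformers import (
    AutoModelForQuestionAnswering,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
)

# One set of pre-trained weights, three different task heads; the class names
# mirror the task columns of the table above. num_classes values are examples.
seq_cls = AutoModelForSequenceClassification.from_pretrained("bert-wwm-chinese", num_classes=2)
tok_cls = AutoModelForTokenClassification.from_pretrained("bert-wwm-chinese", num_classes=7)
qa_model = AutoModelForQuestionAnswering.from_pretrained("bert-wwm-chinese")
```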
Reference#
- Some Chinese pre-trained models are from: brightmart/albert_zh, ymcui/Chinese-BERT-wwm, huawei-noah/Pretrained-Language-Model/TinyBERT, ymcui/Chinese-XLNet, huggingface/xlnet_chinese_large, Knover/luge-dialogue, huawei-noah/Pretrained-Language-Model/NEZHA-PyTorch/, ZhuiyiTechnology/simbert
- Lan, Zhenzhong, et al. “Albert: A lite bert for self-supervised learning of language representations.” arXiv preprint arXiv:1909.11942 (2019).
- Lewis, Mike, et al. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.” arXiv preprint arXiv:1910.13461 (2019).
- Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
- Zaheer, Manzil, et al. “Big bird: Transformers for longer sequences.” arXiv preprint arXiv:2007.14062 (2020).
- Roller, Stephen, et al. “Blenderbot: Recipes for building an open-domain chatbot.” arXiv preprint arXiv:2004.13637 (2020).
- Roller, Stephen, et al. “Blenderbot-Small: Recipes for building an open-domain chatbot.” arXiv preprint arXiv:2004.13637 (2020).
- Sun, Zijun, et al. “Chinesebert: Chinese pretraining enhanced by glyph and pinyin information.” arXiv preprint arXiv:2106.16038 (2021).
- Zhang, Zhengyan, et al. “CPM: A Large-scale Generative Chinese Pre-trained Language Model.” arXiv preprint arXiv:2012.00413 (2020).
- Jiang, Zihang, et al. “ConvBERT: Improving BERT with Span-based Dynamic Convolution.” arXiv preprint arXiv:2008.02496 (2020).
- Keskar, Nitish Shirish, et al. “CTRL: A Conditional Transformer Language Model for Controllable Generation.” arXiv preprint arXiv:1909.05858 (2019).
- Sanh, Victor, et al. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” arXiv preprint arXiv:1910.01108 (2019).
- Clark, Kevin, et al. “Electra: Pre-training text encoders as discriminators rather than generators.” arXiv preprint arXiv:2003.10555 (2020).
- Sun, Yu, et al. “Ernie: Enhanced representation through knowledge integration.” arXiv preprint arXiv:1904.09223 (2019).
- Ding, Siyu, et al. “ERNIE-Doc: A retrospective long-document modeling transformer.” arXiv preprint arXiv:2012.15688 (2020).
- Xiao, Dongling, et al. “Ernie-gen: An enhanced multi-flow pre-training and fine-tuning framework for natural language generation.” arXiv preprint arXiv:2001.11314 (2020).
- Xiao, Dongling, et al. “ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding.” arXiv preprint arXiv:2010.12148 (2020).
- Ouyang, Xuan, et al. “ERNIE-M: enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora.” arXiv preprint arXiv:2012.15674 (2020).
- Lee-Thorp, James, et al. “Fnet: Mixing tokens with fourier transforms.” arXiv preprint arXiv:2105.03824 (2021).
- Dai, Zihang, et al. “Funnel-transformer: Filtering out sequential redundancy for efficient language processing.” Advances in neural information processing systems 33 (2020): 4271-4282.
- Radford, Alec, et al. “Language models are unsupervised multitask learners.” OpenAI blog 1.8 (2019): 9.
- Xu, Yiheng, et al. “LayoutLM: Pre-training of Text and Layout for Document Image Understanding.” arXiv preprint arXiv:1912.13318 (2019).
- Xu, Yang, et al. “LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding” arXiv preprint arXiv:2012.14740 (2020).
- Xu, Yiheng, et al. “LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding” arXiv preprint arXiv:2104.08836 (2021).
- Yamada, Ikuya, et al. “Luke: deep contextualized entity representations with entity-aware self-attention.” arXiv preprint arXiv:2010.01057 (2020).
- Liu, Yinhan, et al. “MBart: Multilingual Denoising Pre-training for Neural Machine Translation” arXiv preprint arXiv:2001.08210 (2020).
- Shoeybi, Mohammad, et al. “Megatron-lm: Training multi-billion parameter language models using model parallelism.” arXiv preprint arXiv:1909.08053 (2019).
- Sun, Zhiqing, et al. “MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices” arXiv preprint arXiv:2004.02984 (2020).
- Song, Kaitao, et al. “MPNet: Masked and Permuted Pre-training for Language Understanding.” arXiv preprint arXiv:2004.09297 (2020).
- Wei, Junqiu, et al. “NEZHA: Neural contextualized representation for chinese language understanding.” arXiv preprint arXiv:1909.00204 (2019).
- Qi, Weizhen, et al. “Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training.” arXiv preprint arXiv:2001.04063 (2020).
- Kitaev, Nikita, et al. “Reformer: The efficient Transformer.” arXiv preprint arXiv:2001.04451 (2020).
- Chung, Hyung Won, et al. “Rethinking embedding coupling in pre-trained language models.” arXiv preprint arXiv:2010.12821 (2020).
- Liu, Yinhan, et al. “Roberta: A robustly optimized bert pretraining approach.” arXiv preprint arXiv:1907.11692 (2019).
- Su, Jianlin, et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv preprint arXiv:2104.09864 (2021).
- Tian, Hao, et al. “SKEP: Sentiment knowledge enhanced pre-training for sentiment analysis.” arXiv preprint arXiv:2005.05635 (2020).
- Iandola, Forrest N., et al. “SqueezeBERT: What can computer vision teach NLP about efficient neural networks?” arXiv preprint arXiv:2006.11316 (2020).
- Raffel, Colin, et al. “T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” arXiv preprint arXiv:1910.10683 (2019).
- Vaswani, Ashish, et al. “Attention is all you need.” arXiv preprint arXiv:1706.03762 (2017).
- Jiao, Xiaoqi, et al. “Tinybert: Distilling bert for natural language understanding.” arXiv preprint arXiv:1909.10351 (2019).
- Bao, Siqi, et al. “Plato-2: Towards building an open-domain chatbot via curriculum learning.” arXiv preprint arXiv:2006.16779 (2020).
- Yang, Zhilin, et al. “Xlnet: Generalized autoregressive pretraining for language understanding.” arXiv preprint arXiv:1906.08237 (2019).
- Cui, Yiming, et al. “Pre-training with whole word masking for chinese bert.” arXiv preprint arXiv:1906.08101 (2019).
- Wang, Quan, et al. “Building Chinese Biomedical Language Models via Multi-Level Text Discrimination.” arXiv preprint arXiv:2110.07244 (2021).