mt5-base
README (from Hugging Face)
language:
multilingual
af
am
ar
az
be
bg
bn
ca
ceb
co
cs
cy
da
de
el
en
eo
es
et
eu
fa
fi
fil
fr
fy
ga
gd
gl
gu
ha
haw
hi
hmn
ht
hu
hy
ig
is
it
iw
ja
jv
ka
kk
km
kn
ko
ku
ky
la
lb
lo
lt
lv
mg
mi
mk
ml
mn
mr
ms
mt
my
ne
nl
no
ny
pa
pl
ps
pt
ro
ru
sd
si
sk
sl
sm
sn
so
sq
sr
st
su
sv
sw
ta
te
tg
th
tr
uk
und
ur
uz
vi
xh
yi
yo
zh
zu
datasets:
mc4
license: apache-2.0
mT5 is pretrained on the mC4 corpus, covering 101 languages:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
Note: mT5 was pre-trained only on mC4, without any supervised training. The model therefore has to be fine-tuned before it is usable on a downstream task.
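Because the checkpoint has seen no supervised data, any downstream use starts with a fine-tuning loop over text-to-text pairs. The following is a minimal sketch of one such step, not the authors' training recipe. It assumes the Hugging Face `transformers`, `torch`, and `sentencepiece` packages are installed and that the checkpoint is available under the Hub name `google/mt5-base`. The helper names `to_text_pair`, `fine_tune_step`, and `main`, the task prefix, the example pair, and the learning rate are all illustrative choices rather than anything prescribed by the model card.

```python
def to_text_pair(task_prefix, source, target):
    """Cast one supervised example into text-to-text form (hypothetical helper)."""
    return f"{task_prefix}: {source.strip()}", target.strip()


def fine_tune_step(model, tokenizer, optimizer, pairs):
    """Run a single gradient step on a list of (input_text, target_text) pairs."""
    model.train()
    inputs = tokenizer([p[0] for p in pairs], padding=True,
                       truncation=True, return_tensors="pt")
    labels = tokenizer(text_target=[p[1] for p in pairs], padding=True,
                       truncation=True, return_tensors="pt").input_ids
    # Padding positions are set to -100 so the loss ignores them.
    labels[labels == tokenizer.pad_token_id] = -100
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()


def main():
    # Imports live here so the helpers above can be imported without torch.
    import torch
    from transformers import AutoTokenizer, MT5ForConditionalGeneration

    name = "google/mt5-base"  # assumed Hub identifier for this checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = MT5ForConditionalGeneration.from_pretrained(name)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative value

    # A single toy pair; a real run iterates over a labelled dataset.
    pairs = [to_text_pair("translate English to German", "Hello.", "Hallo.")]
    print("loss:", fine_tune_step(model, tokenizer, optimizer, pairs))
```

Calling `main()` downloads the checkpoint and runs one step on the toy pair. In practice the step would be repeated over many batches of a task-specific dataset, with evaluation on held-out data.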
Pretraining Dataset: mC4
Other Community Checkpoints: here
Paper: mT5: A massively multilingual pre-trained text-to-text transformer
Authors: Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
Abstract
The recent "Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We describe the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. All of the code and model checkpoints used in this work are publicly available.
Model Files
README.md (2.8 KB)
config.json (702.0 B)
model_state.pdparams (3.6 GB)
special_tokens_map.json (65.0 B)
spiece.model (4.1 MB)
tokenizer_config.json (376.0 B)
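After downloading, it can be useful to confirm that a local copy of the checkpoint is complete before trying to load it. The sketch below checks a directory against the file names listed above. The helper `missing_files` and the constant `EXPECTED_FILES` are hypothetical names introduced for illustration, and `README.md` is left out of the check on the assumption that it is not needed to load the model.

```python
from pathlib import Path

# File names taken from the list above; README.md is intentionally omitted.
EXPECTED_FILES = (
    "config.json",
    "model_state.pdparams",
    "special_tokens_map.json",
    "spiece.model",
    "tokenizer_config.json",
)


def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_files("path/to/mt5-base")` indicates that every listed file is present. The function does not verify sizes or contents.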