CLUECorpus2020 Corpus


Name              Text Type    Plain Text Size
CLUECorpus2020    Chinese      200GB

CLUECorpus2020 was built by cleaning the Chinese portion of Common Crawl. The open-source release contains approximately 200GB of text. Detailed information is available on the official website. Access is granted on request by email, as described below.

How to apply for download: Send an email to the address below stating your research purpose and intended use of the corpus, your research plan, your institutional affiliation, and a brief introduction of the applicant. The email must include a commitment not to provide the data to third parties.

Email: CLUEbenchmark@163.com, Subject: CLUECorpus2020 200G Corpus