# OpenWebText2
| Name | Text Type | Plain Text Size |
| --- | --- | --- |
| OpenWebText2 | English | 70GB |
## Data Acquisition
OpenWebTextCorpus is an open-source English web-text dataset sourced from Reddit; after deduplication, cleaning, and content extraction, it contains over 8 million documents. This example uses the cleaned OpenWebText2 data released by EleutherAI.
Download the archive and extract it with the following commands:

```shell
# wget https://mystic.the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar
wget https://paddlenlp.bj.bcebos.com/models/transformers/gpt/openwebtext2.jsonl.zst.tar
tar -xvf openwebtext2.jsonl.zst.tar -C /path/to/openwebtext
```
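If you want to peek at the raw data before preprocessing, each extracted shard is a `.jsonl.zst` file. The sketch below streams one shard and prints its first record; the shard filename is a placeholder, and the `text`/`meta` layout is assumed from the Pile-style release rather than guaranteed here:

```python
import io
import json

import zstandard as zstd  # pip install zstandard

# Placeholder filename: substitute an actual shard from /path/to/openwebtext.
shard = "/path/to/openwebtext/example.jsonl.zst"

with open(shard, "rb") as fh:
    # Stream-decompress so the whole shard never has to fit in memory.
    reader = zstd.ZstdDecompressor().stream_reader(fh)
    lines = io.TextIOWrapper(reader, encoding="utf-8")
    record = json.loads(lines.readline())
    print(record.keys())         # expected: dict_keys(['text', 'meta'])
    print(record["text"][:200])  # preview the first document
```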
## Llama Training Data Preparation
Then use the `create_pretraining_data.py` script to prepare the dataset:
```shell
python -u create_pretraining_data.py \
    --model_name meta-llama/Llama-2-7b \
    --tokenizer_name LlamaTokenizer \
    --data_format JSON \
    --input_path /path/to/openwebtext/ \
    --append_eos \
    --output_prefix llama_openwebtext \
    --workers 40 \
    --log_interval 10000 \
    --data_impl "mmap"
```
Processing takes approximately one hour and produces the required dataset files `llama_openwebtext.bin` and `llama_openwebtext.idx`.
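Before moving the outputs, it can be worth confirming that both files were written and are non-empty. A minimal check, assuming you run it from the directory that holds the outputs:

```python
import os

# Report the size of each output file; a missing or zero-byte file
# indicates the preprocessing run did not complete.
for path in ("llama_openwebtext.bin", "llama_openwebtext.idx"):
    size_gb = os.path.getsize(path) / 1024**3
    print(f"{path}: {size_gb:.2f} GB")
```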
Organize all preprocessed files into a unified directory for training:

```shell
mkdir data
mv llama_openwebtext.bin ./data
mv llama_openwebtext.idx ./data
```
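The `./data` directory is what the pretraining script consumes. As an illustration only (the script name and `--input_dir` flag follow PaddleNLP's pretraining examples; verify both against the version you are using):

```shell
# Hypothetical invocation; confirm the flag name against your training script.
python -u run_pretrain.py --input_dir "./data" [other training arguments]
```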