Blame · examples/detxoify_lm/annotations/preprocess.sh · d48d95ab8a8b4d4d1dec10c8d6ed7abe90e3ac32 · OpenDAS / Megatron-LM · GitLab

Open sourcing lm detoxification code

Boxin Wang committed Nov 23, 2022

VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt

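# $1: input file, $2: output prefix for the tokenized dataset.
# Run from the repository root so that tools/preprocess_data.py resolves.
python3 tools/preprocess_data.py \
    --input "$1" \
    --output-prefix "$2" \
    --vocab-file "$VOCAB_FILE" \
    --merge-file "$MERGE_FILE" \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod \
    --workers 20 \
    --chunk-size 25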
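
The script takes two positional arguments and does not check them, so a missing argument only surfaces as an error inside `tools/preprocess_data.py`. The following is a minimal sketch of a guard that could run first. The function name `check_preprocess_args` and its messages are hypothetical and not part of the repository; it only assumes that `$1` is an existing input file and `$2` is a non-empty output prefix.

```shell
#!/bin/sh
# Hypothetical helper (not part of Megatron-LM): validate the two positional
# arguments that preprocess.sh forwards to tools/preprocess_data.py.
check_preprocess_args() {
    # Exactly two arguments are expected: <input-file> <output-prefix>.
    if [ "$#" -ne 2 ]; then
        echo "usage: preprocess.sh <input-file> <output-prefix>" >&2
        return 2
    fi
    # The input must be an existing regular file.
    if [ ! -f "$1" ]; then
        echo "input file not found: $1" >&2
        return 1
    fi
    # The output prefix must not be empty.
    if [ -z "$2" ]; then
        echo "output prefix must not be empty" >&2
        return 1
    fi
    return 0
}

# Possible use, from the repository root (left as a comment so that this
# sketch has no side effects when sourced):
#   check_preprocess_args "$@" && sh examples/detxoify_lm/annotations/preprocess.sh "$@"
```

Keeping the check in a separate function leaves the original command untouched and lets the script exit before tokenization starts.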