Commit 97070abf authored by Aleksey Vlasenko

Fixed preprocessing link and added clarifications about vocabulary size.

parent c280c4ee
@@ -68,7 +68,7 @@ Note that the dataset is large (~1TB).
### Preprocess the data
-Follow the instructions in [Data Preprocessing](data/preprocessing) to
+Follow the instructions in [Data Preprocessing](./preprocessing) to
preprocess the Criteo Terabyte dataset.
Data preprocessing steps are summarized below.
@@ -87,7 +87,8 @@ Categorical features:
function such as modulus will suffice, i.e. feature_value % MAX_INDEX.
The vocabulary sizes resulting from pre-processing are passed in to the model
-trainer using the model.vocab_sizes config.
+trainer using the model.vocab_sizes config. Note that the values provided in the sample below
+are only valid for the Criteo Terabyte dataset.
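As an illustration of the modulus re-mapping described above, here is a minimal Python sketch; the function name and the `MAX_INDEX` value are hypothetical stand-ins for whatever vocabulary size is configured in `model.vocab_sizes`, not part of the preprocessing script.

```python
# Minimal sketch of modulus bucketing for a hashed categorical feature.
# MAX_INDEX is a hypothetical vocabulary size; in practice it should match
# the corresponding entry passed to the trainer via model.vocab_sizes.
MAX_INDEX = 5_000_000


def bucketize(feature_value: int, max_index: int = MAX_INDEX) -> int:
    """Map an arbitrary integer feature value into [0, max_index)."""
    return feature_value % max_index


print(bucketize(7_654_321_098))  # some index in [0, 5_000_000)
```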
The full dataset is composed of 24 directories. Partition the data into training
and eval sets, for example days 1-23 for training and day 24 for evaluation.
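For reference, a small sketch of one way to express that days 1-23 / day 24 split as file patterns; the base path and the `day_*` directory naming are assumptions made for illustration, not output of the preprocessing script.

```python
# Sketch: build train/eval file patterns for a days 1-23 / day 24 split.
# BASE and the "day_*" naming below are assumed for illustration only.
BASE = "gs://my-bucket/criteo_preprocessed"

train_patterns = [f"{BASE}/day_{d}/*" for d in range(1, 24)]  # days 1-23
eval_patterns = [f"{BASE}/day_{d}/*" for d in [24]]           # day 24

print(len(train_patterns), len(eval_patterns))  # 23 1
```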
...
@@ -69,7 +69,9 @@ python3 criteo_preprocess.py \
--vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000000 \
--project ${PROJECT} --region ${REGION}
```
+A vocabulary file for each feature will be generated at
+${STORAGE_BUCKET}/criteo_vocab/tftransform_tmp/feature_??_vocab.
+The vocabulary size of each feature can be found with `wc -l <feature_vocab_file>`.
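Since the vocabulary files are written to a Cloud Storage bucket, one way to count entries without downloading them first is `tf.io.gfile`, which understands `gs://` paths; the sketch below assumes TensorFlow is available, and the bucket name and exact file name are examples only.

```python
# Sketch: count the entries in one generated vocabulary file.
# tf.io.gfile handles gs:// paths directly; the path below is an example,
# not a guaranteed output name of the preprocessing script.
import tensorflow as tf

vocab_file = "gs://my-bucket/criteo_vocab/tftransform_tmp/feature_10_vocab"
with tf.io.gfile.GFile(vocab_file, "r") as f:
    vocab_size = sum(1 for _ in f)

print(vocab_size)  # value to use in model.vocab_sizes for this feature
```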
Preprocess training and test data:
...