Commit 97070abf authored by Aleksey Vlasenko

Fixed preprocessing link and added clarifications about vocabulary size.

parent c280c4ee
@@ -68,7 +68,7 @@ Note that the dataset is large (~1TB).
### Preprocess the data
-Follow the instructions in [Data Preprocessing](data/preprocessing) to
+Follow the instructions in [Data Preprocessing](./preprocessing) to
preprocess the Criteo Terabyte dataset.
Data preprocessing steps are summarized below.
@@ -87,7 +87,8 @@ Categorical features:
function such as modulus will suffice, i.e. feature_value % MAX_INDEX.
The vocabulary sizes resulting from pre-processing are passed in to the model
-trainer using the model.vocab_sizes config.
+trainer using the model.vocab_sizes config. Note that the values provided in the sample below
+are only valid for the Criteo Terabyte dataset.
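As an illustration of the modulus re-mapping described above, here is a minimal Python sketch; the function name and the `MAX_INDEX` value are hypothetical stand-ins for whatever vocabulary size is configured in `model.vocab_sizes`, not part of the preprocessing script.

```python
# Minimal sketch of modulus bucketing for a hashed categorical feature.
# MAX_INDEX is a hypothetical vocabulary size; in practice it should match
# the corresponding entry passed to the trainer via model.vocab_sizes.
MAX_INDEX = 5_000_000


def bucketize(feature_value: int, max_index: int = MAX_INDEX) -> int:
    """Map an arbitrary integer feature value into [0, max_index)."""
    return feature_value % max_index


print(bucketize(7_654_321_098))  # some index in [0, 5_000_000)
```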
The full dataset is composed of 24 directories. Partition the data into training
and eval sets, for example days 1-23 for training and day 24 for evaluation.
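For reference, a small sketch of one way to express that days 1-23 / day 24 split as file patterns; the base path and the `day_*` directory naming are assumptions made for illustration, not output of the preprocessing script.

```python
# Sketch: build train/eval file patterns for a days 1-23 / day 24 split.
# BASE and the "day_*" naming below are assumed for illustration only.
BASE = "gs://my-bucket/criteo_preprocessed"

train_patterns = [f"{BASE}/day_{d}/*" for d in range(1, 24)]  # days 1-23
eval_patterns = [f"{BASE}/day_{d}/*" for d in [24]]           # day 24

print(len(train_patterns), len(eval_patterns))  # 23 1
```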
...
@@ -69,7 +69,9 @@ python3 criteo_preprocess.py \
--vocab_gen_mode --runner DataflowRunner --max_vocab_size 5000000 \
--project ${PROJECT} --region ${REGION}
```
+A vocabulary file for each feature will be generated at
+${STORAGE_BUCKET}/criteo_vocab/tftransform_tmp/feature_??_vocab.
+The vocabulary size of each feature can be found with `wc -l <feature_vocab_file>`.
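Since the vocabulary files are written to a Cloud Storage bucket, one way to count entries without downloading them first is `tf.io.gfile`, which understands `gs://` paths; the sketch below assumes TensorFlow is available, and the bucket name and exact file name are examples only.

```python
# Sketch: count the entries in one generated vocabulary file.
# tf.io.gfile handles gs:// paths directly; the path below is an example,
# not a guaranteed output name of the preprocessing script.
import tensorflow as tf

vocab_file = "gs://my-bucket/criteo_vocab/tftransform_tmp/feature_10_vocab"
with tf.io.gfile.GFile(vocab_file, "r") as f:
    vocab_size = sum(1 for _ in f)

print(vocab_size)  # value to use in model.vocab_sizes for this feature
```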
Preprocess training and test data:
...