# Variational Autoencoder for Collaborative Filtering for TensorFlow

## Table Of Contents

* [Model Overview](#model-overview)
  * [Model Architecture](#model-architecture)
* [Testing Guide](#testing-guide)
* [Parameters](#parameters)
  * [Main Parameters](#main-parameters)
  * [Other Parameters](#other-parameters)
* [Inference Benchmark](#inference-benchmark)
* [Reference Results](#reference-results)

## Model Overview

Reference paper: ["Variational Autoencoders for Collaborative Filtering"](https://arxiv.org/abs/1802.05814)

### Model Architecture


Figure 1. The architecture of the VAE-CF model

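To make the objective behind the architecture concrete, the following is a minimal NumPy sketch (not the repository's implementation) of the VAE-CF training loss from the paper: a multinomial log-likelihood reconstruction term plus a KL term whose weight is linearly annealed, mirroring the `--total_anneal_steps` and `--anneal_cap` flags. All function names here are illustrative.

```python
import numpy as np

def kl_beta(step, total_anneal_steps=200000, anneal_cap=0.2):
    """Linearly anneal the KL weight from 0 up to anneal_cap."""
    return min(anneal_cap, step / total_anneal_steps)

def elbo_loss(x, logits, mu, logvar, beta):
    """Negative ELBO for a batch of user interaction vectors x."""
    # Multinomial log-likelihood: log-softmax of the decoder logits,
    # weighted by the observed clicks.
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    neg_ll = -(x * log_softmax).sum(axis=1).mean()
    # KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar).sum(axis=1).mean()
    return neg_ll + beta * kl

# Toy batch: 4 users, 10 items, 8 latent dimensions.
rng = np.random.default_rng(0)
x = (rng.random((4, 10)) > 0.5).astype(np.float32)  # binary click matrix
logits = rng.normal(size=(4, 10))
mu, logvar = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = elbo_loss(x, logits, mu, logvar, kl_beta(step=50000))
```

The annealing keeps the KL weight small early in training so the encoder does not collapse to the prior before the decoder has learned a useful reconstruction.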
## Testing Guide

1. Data preparation
   * Dataset: [MovieLens 20m dataset](https://grouplens.org/datasets/movielens/20m/)
   * Extract the archive into the `/data/ml-20m/extracted/` directory:

   ```bash
   cd /data
   mkdir -p ml-20m/extracted
   cd ml-20m/extracted
   wget http://files.grouplens.org/datasets/movielens/ml-20m.zip
   unzip ml-20m.zip
   ```

2. Data preprocessing

   ```bash
   python prepare_dataset.py
   # or pass the data directory explicitly: --data_dir=/data/ml-20m/extracted
   ```

3. Single-GPU training

   ```bash
   export MIOPEN_USE_APPROXIMATE_PERFORMANCE=0
   export MIOPEN_FIND_MODE=1
   python main.py --train --checkpoint_dir ./checkpoints
   ```

4. Multi-GPU training

   ```bash
   mpirun --bind-to numa --allow-run-as-root -np 8 -H localhost:8 python main.py --train --amp --checkpoint_dir ./checkpoints
   ```

## Parameters

### Main Parameters

Common runtime parameters:

* `--data_dir`: path to the test data; defaults to `/data`
* `--checkpoint_dir`: directory in which trained model checkpoints are saved

### Other Parameters

```bash
python main.py --help
usage: main.py [-h] [--train] [--test] [--inference_benchmark] [--amp]
               [--epochs EPOCHS] [--batch_size_train BATCH_SIZE_TRAIN]
               [--batch_size_validation BATCH_SIZE_VALIDATION]
               [--validation_step VALIDATION_STEP]
               [--warm_up_epochs WARM_UP_EPOCHS]
               [--total_anneal_steps TOTAL_ANNEAL_STEPS]
               [--anneal_cap ANNEAL_CAP] [--lam LAM] [--lr LR]
               [--beta1 BETA1] [--beta2 BETA2] [--top_results TOP_RESULTS]
               [--xla] [--trace] [--activation ACTIVATION]
               [--log_path LOG_PATH] [--seed SEED] [--data_dir DATA_DIR]
               [--checkpoint_dir CHECKPOINT_DIR]

Train a Variational Autoencoder for Collaborative Filtering in TensorFlow

optional arguments:
  -h, --help            show this help message and exit
  --train               Run training of VAE
  --test                Run validation of VAE
  --inference_benchmark
                        Benchmark the inference throughput and latency
  --amp                 Enable Automatic Mixed Precision
  --epochs EPOCHS       Number of epochs to train
  --batch_size_train BATCH_SIZE_TRAIN
                        Global batch size for training
  --batch_size_validation BATCH_SIZE_VALIDATION
                        Used both for validation and testing
  --validation_step VALIDATION_STEP
                        Train epochs for one validation
  --warm_up_epochs WARM_UP_EPOCHS
                        Number of epochs to omit during benchmark
  --total_anneal_steps TOTAL_ANNEAL_STEPS
                        Number of annealing steps
  --anneal_cap ANNEAL_CAP
                        Annealing cap
  --lam LAM             Regularization parameter
  --lr LR               Learning rate
  --beta1 BETA1         Adam beta1
  --beta2 BETA2         Adam beta2
  --top_results TOP_RESULTS
                        Number of results to be recommended
  --xla                 Enable XLA
  --trace               Save profiling traces
  --activation ACTIVATION
                        Activation function
  --log_path LOG_PATH   Path to the detailed JSON log to be created
  --seed SEED           Random seed for TensorFlow and numpy
  --data_dir DATA_DIR   Directory for storing the training data
  --checkpoint_dir CHECKPOINT_DIR
                        Path for saving a checkpoint after the training
```

## Inference Benchmark

Inference benchmarking is enabled with the `--inference_benchmark` flag:

```bash
python main.py --inference_benchmark --checkpoint_dir ./checkpoints
```

## Reference Results

### Training accuracy results

#### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

| GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to train - TF32 [s] | Time to train - mixed precision [s] | Time to train speedup (TF32 to mixed precision) |
|-------:|-----------------:|-------------:|-----------:|----------------:|--------------:|---------------:|
| 1 | 24,576 | 0.430298 | 0.430398 | 112.8 | 109.4 | 1.03 |
| 8 | 3,072 | 0.430897 | 0.430353 | 25.9 | 30.4 | 0.85 |

#### Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

| GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 [s] | Time to train - mixed precision [s] | Time to train speedup (FP32 to mixed precision) |
|-------:|-----------------:|-------------:|-----------:|----------------:|--------------:|---------------:|
| 1 | 24,576 | 0.430592 | 0.430525 | 346.5 | 186.5 | 1.86 |
| 8 | 3,072 | 0.430753 | 0.431202 | 59.1 | 42.2 | 1.40 |

### Training performance results

Performance numbers below show throughput in users processed per second, averaged over an entire training run.
#### Training performance: NVIDIA DGX A100 (8x A100 40GB)

| GPUs | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 - mixed precision) | Strong scaling - TF32 | Strong scaling - mixed precision |
|-------:|------------:|-------------------:|-----------------:|---------------------:|---:|---:|
| 1 | 24,576 | 354,032 | 365,474 | 1.03 | 1 | 1 |
| 8 | 3,072 | 1,660,700 | 1,409,770 | 0.85 | 4.69 | 3.86 |

#### Training performance: NVIDIA DGX-1 (8x V100 32GB)

| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Strong scaling - FP32 | Strong scaling - mixed precision |
|-------:|------------:|-------------------:|-----------------:|---------------------:|---:|---:|
| 1 | 24,576 | 114,125 | 213,283 | 1.87 | 1 | 1 |
| 8 | 3,072 | 697,628 | 1,001,210 | 1.44 | 6.11 | 4.69 |

### Inference performance results

Our results were obtained by running:

```bash
python main.py --inference_benchmark [--amp]
```

in the TensorFlow 20.06 NGC container. We use users processed per second as the throughput metric for measuring inference performance. All latency numbers are in seconds.
#### Inference performance: NVIDIA DGX A100 (1x A100 40GB)

TF32

| Batch size | Throughput Avg | Latency Avg | Latency 90% | Latency 95% | Latency 99% |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 1181 | 0.000847 | 0.000863 | 0.000871 | 0.000901 |

FP16

| Batch size | Throughput Avg | Latency Avg | Latency 90% | Latency 95% | Latency 99% |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 1215 | 0.000823 | 0.000858 | 0.000864 | 0.000877 |

#### Inference performance: NVIDIA DGX-1 (1x V100 16GB)

FP32

| Batch size | Throughput Avg | Latency Avg | Latency 90% | Latency 95% | Latency 99% |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 718 | 0.001392 | 0.001443 | 0.001458 | 0.001499 |

FP16

| Batch size | Throughput Avg | Latency Avg | Latency 90% | Latency 95% | Latency 99% |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 707 | 0.001413 | 0.001511 | 0.001543 | 0.001622 |
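As a rough illustration of how columns like those above can be derived from per-batch wall-clock timings, here is a hypothetical Python sketch; the repository's benchmark harness may compute these statistics differently, and `summarize` is an illustrative name, not part of the codebase.

```python
import statistics

def summarize(latencies_s, batch_size):
    """Reduce per-batch latencies (in seconds) to throughput and percentiles."""
    lat_sorted = sorted(latencies_s)

    def pct(p):
        # Nearest-rank percentile over the sorted latencies.
        return lat_sorted[min(len(lat_sorted) - 1, int(p / 100 * len(lat_sorted)))]

    avg = statistics.mean(latencies_s)
    return {
        "throughput_avg": batch_size / avg,  # users processed per second
        "latency_avg": avg,
        "latency_90": pct(90),
        "latency_95": pct(95),
        "latency_99": pct(99),
    }

# Example with four synthetic batch timings at batch size 1.
stats = summarize([0.0008, 0.0009, 0.0011, 0.0010], batch_size=1)
```

Note that at batch size 1 the average throughput is simply the reciprocal of the average latency, which matches the shape of the tables above.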