# Variational Autoencoder for Collaborative Filtering in TensorFlow



## Table of Contents

  * [Model Overview](#model-overview)
     * [Model Architecture](#model-architecture)
  * [Testing Guide](#testing-guide)
  * [Parameter Description](#parameter-description)
     * [Main Parameters](#main-parameters)
     * [Other Parameters](#other-parameters)
  * [Inference Testing](#inference-testing)
  * [Reference Results](#reference-results)

## Model Overview

This model follows the paper "Variational Autoencoders for Collaborative Filtering" (https://arxiv.org/abs/1802.05814).


### Model Architecture

<p align="center">
   <img width="70%" src="images/autoencoder.png" />
   <br>
   Figure 1. The architecture of the VAE-CF model </p>
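The structure in Figure 1 — an MLP encoder producing a Gaussian over a latent code, and an MLP decoder producing a multinomial distribution over all items — can be sketched in plain numpy. This is a minimal illustration only: the layer sizes (200 hidden, 64 latent) are made up for the example, and the repository's actual TensorFlow implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_items, hidden, latent = 1000, 200, 64  # illustrative sizes only

# Randomly initialised weights for a one-hidden-layer encoder/decoder.
W_enc = rng.normal(0, 0.01, (n_items, hidden))
W_mu = rng.normal(0, 0.01, (hidden, latent))
W_logvar = rng.normal(0, 0.01, (hidden, latent))
W_dec = rng.normal(0, 0.01, (latent, n_items))

def forward(x):
    """x: binary user-item interaction vectors, shape (batch, n_items)."""
    h = relu(x @ W_enc)
    mu, logvar = h @ W_mu, h @ W_logvar
    # Reparameterisation trick: sample z ~ N(mu, sigma^2).
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
    # Decoder outputs a multinomial distribution over all items.
    return softmax(z @ W_dec)

x = (rng.random((2, n_items)) < 0.01).astype(float)
probs = forward(x)
print(probs.shape)  # (2, 1000); each row sums to 1
```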


## Testing Guide
1. Data preparation
* Dataset: [MovieLens 20m dataset](https://grouplens.org/datasets/movielens/20m/).

* Processing: download and extract the archive into the `/data/ml-20m/extracted/` directory

       ```bash
       cd /data
       mkdir -p ml-20m/extracted
       cd ml-20m/extracted
       wget http://files.grouplens.org/datasets/movielens/ml-20m.zip
       unzip ml-20m.zip
       ```
   

2. Data preprocessing

   ```bash
   python prepare_dataset.py
   # or point it at the extracted data: --data_dir=/data/ml-20m/extracted
   ```

3. Single-GPU training test
   ```bash
   export MIOPEN_USE_APPROXIMATE_PERFORMANCE=0
   export MIOPEN_FIND_MODE=1
   python main.py --train --checkpoint_dir ./checkpoints
   ```
4. Multi-GPU training test

   ```bash
   mpirun --bind-to numa --allow-run-as-root -np 8 -H localhost:8 python main.py --train --amp --checkpoint_dir ./checkpoints
   ```
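Training optimizes the objective from the paper: a multinomial reconstruction log-likelihood plus a KL term weighted by an annealed coefficient. Below is a minimal numpy sketch of the reconstruction term only; it is an illustration, not the repository's TensorFlow code.

```python
import numpy as np

def log_softmax(logits):
    m = logits.max(axis=-1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))

def multinomial_log_likelihood(x, logits):
    # Each user's click vector x is scored under the decoder's
    # multinomial distribution; higher is better.
    return (x * log_softmax(logits)).sum(axis=-1).mean()

x = np.array([[1.0, 0.0, 1.0, 0.0]])
good = np.array([[5.0, -5.0, 5.0, -5.0]])   # mass on the clicked items
bad = np.array([[-5.0, 5.0, -5.0, 5.0]])    # mass on the wrong items
print(multinomial_log_likelihood(x, good) > multinomial_log_likelihood(x, bad))  # True
```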


## Parameter Description

### Main Parameters

Common runtime parameters:
* `--data_dir`: path to the dataset; defaults to `/data`
* `--checkpoint_dir`: directory where trained model parameters are saved


### Other Parameters

```bash
python main.py --help

usage: main.py [-h] [--train] [--test] [--inference_benchmark]
               [--amp] [--epochs EPOCHS]
               [--batch_size_train BATCH_SIZE_TRAIN]
               [--batch_size_validation BATCH_SIZE_VALIDATION]
               [--validation_step VALIDATION_STEP]
               [--warm_up_epochs WARM_UP_EPOCHS]
               [--total_anneal_steps TOTAL_ANNEAL_STEPS]
               [--anneal_cap ANNEAL_CAP] [--lam LAM] [--lr LR] [--beta1 BETA1]
               [--beta2 BETA2] [--top_results TOP_RESULTS] [--xla] [--trace]
               [--activation ACTIVATION] [--log_path LOG_PATH] [--seed SEED]
               [--data_dir DATA_DIR] [--checkpoint_dir CHECKPOINT_DIR]

Train a Variational Autoencoder for Collaborative Filtering in TensorFlow

optional arguments:
  -h, --help            show this help message and exit
  --train               Run training of VAE
  --test                Run validation of VAE
  --inference_benchmark
                        Benchmark the inference throughput and latency
  --amp                 Enable Automatic Mixed Precision
  --epochs EPOCHS       Number of epochs to train
  --batch_size_train BATCH_SIZE_TRAIN
                        Global batch size for training
  --batch_size_validation BATCH_SIZE_VALIDATION
                        Used both for validation and testing
  --validation_step VALIDATION_STEP
                        Train epochs for one validation
  --warm_up_epochs WARM_UP_EPOCHS
                        Number of epochs to omit during benchmark
  --total_anneal_steps TOTAL_ANNEAL_STEPS
                        Number of annealing steps
  --anneal_cap ANNEAL_CAP
                        Annealing cap
  --lam LAM             Regularization parameter
  --lr LR               Learning rate
  --beta1 BETA1         Adam beta1
  --beta2 BETA2         Adam beta2
  --top_results TOP_RESULTS
                        Number of results to be recommended
  --xla                 Enable XLA
  --trace               Save profiling traces
  --activation ACTIVATION
                        Activation function
  --log_path LOG_PATH   Path to the detailed JSON log to be created
  --seed SEED           Random seed for TensorFlow and numpy
  --data_dir DATA_DIR   Directory for storing the training data
  --checkpoint_dir CHECKPOINT_DIR
                        Path for saving a checkpoint after the training

```
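`--total_anneal_steps` and `--anneal_cap` control the KL-annealing schedule described in the paper: the KL weight grows linearly with the global training step and is clipped at the cap. A sketch under that assumption (the exact logic in `main.py` may differ):

```python
def kl_anneal_beta(step, total_anneal_steps, anneal_cap):
    """Linear KL annealing: the KL weight grows with the global step
    and is clipped at anneal_cap once it reaches that value."""
    if total_anneal_steps > 0:
        return min(anneal_cap, step / total_anneal_steps)
    return anneal_cap

print(kl_anneal_beta(20000, 200000, 0.2))   # 0.1 (still warming up)
print(kl_anneal_beta(500000, 200000, 0.2))  # 0.2 (clipped at the cap)
```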


## Inference Testing

The inference benchmark is enabled with the `--inference_benchmark` flag:

```bash
python main.py --inference_benchmark --checkpoint_dir ./checkpoints
```
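The Latency 90%/95%/99% columns reported in the tables below are percentiles over many timed batches. A minimal sketch of how such a summary can be computed from raw per-batch timings (the benchmark's own bookkeeping in `main.py` may differ; `summarize_latencies` is a hypothetical helper):

```python
import statistics

def summarize_latencies(latencies, batch_size):
    """Reduce per-batch wall-clock timings (in seconds) to average
    throughput in users/s plus average and tail latency."""
    s = sorted(latencies)
    pct = lambda p: s[min(len(s) - 1, int(p / 100 * len(s)))]
    return {
        "throughput_avg": batch_size / statistics.mean(s),
        "latency_avg": statistics.mean(s),
        "latency_90": pct(90),
        "latency_95": pct(95),
        "latency_99": pct(99),
    }

# Synthetic timings around 1-2 ms, batch size 1 as in the tables below:
fake = [0.001 + 0.0001 * (i % 10) for i in range(1000)]
stats = summarize_latencies(fake, batch_size=1)
print(round(stats["latency_avg"], 5))  # 0.00145
```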

## Reference Results



### Training accuracy results

#### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

| GPUs    | Batch size / GPU    | Accuracy - TF32  | Accuracy - mixed precision  |   Time to train - TF32 [s] |  Time to train - mixed precision [s] | Time to train speedup (TF32 to mixed precision) |
|-------:|-----------------:|-------------:|-----------:|----------------:|--------------:|---------------:|
|      1 |       24,576 |         0.430298 |       0.430398 |     112.8  |    109.4 |           1.03 |
|      8 |        3,072 |         0.430897 |       0.430353 |      25.9 |     30.4 |           0.85 |

#### Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

| GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision  | Time to train - FP32 [s] |  Time to train - mixed precision [s] | Time to train speedup (FP32 to mixed precision) |
|-------:|-----------------:|-------------:|-----------:|----------------:|--------------:|---------------:|
|      1 |       24,576 |         0.430592 |       0.430525 |     346.5 |   186.5  |           1.86 |
|      8 |        3,072 |         0.430753 |       0.431202 |      59.1 |    42.2 |           1.40  |


### Training performance results

Performance numbers below show throughput in users processed per second. They were averaged over an entire training run.

#### Training performance: NVIDIA DGX A100 (8x A100 40GB)

| GPUs   | Batch size / GPU | Throughput - TF32  | Throughput - mixed precision    | Throughput speedup (TF32 - mixed precision)   | Strong scaling - TF32    | Strong scaling - mixed precision |
|-------:|------------:|-------------------:|-----------------:|---------------------:|---:|---:|
|      1 |       24,576 |    354,032   |         365,474   |                 1.03 | 1    | 1    |
|      8 |        3,072 |    1,660,700 |         1,409,770 |                 0.85 | 4.69 | 3.86 |
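The strong-scaling column is derived from the throughput columns: multi-GPU throughput divided by single-GPU throughput (the ideal value equals the GPU count). Using the TF32 numbers from the table above:

```python
# Throughput numbers (users/s) for TF32 taken from the table above.
throughput_1gpu = 354_032
throughput_8gpu = 1_660_700

strong_scaling = throughput_8gpu / throughput_1gpu   # speedup over 1 GPU
efficiency = strong_scaling / 8                      # fraction of ideal 8x
print(round(strong_scaling, 2))  # 4.69
```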

#### Training performance: NVIDIA DGX-1 (8x V100 32GB)

| GPUs   | Batch size / GPU   | Throughput - FP32    | Throughput - mixed precision    | Throughput speedup (FP32 - mixed precision)   | Strong scaling - FP32    | Strong scaling - mixed precision |
|-------:|------------:|-------------------:|-----------------:|---------------------:|---:|---:|
|      1 |       24,576 |             114,125 | 213,283        |                 1.87 | 1 | 1 |
|      8 |        3,072 |             697,628 |      1,001,210 |                 1.44 | 6.11 | 4.69 |

### Inference performance results

Our results were obtained by running:
```bash
python main.py  --inference_benchmark [--amp]
```
in the TensorFlow 20.06 NGC container.

We use users processed per second as a throughput metric for measuring inference performance.
All latency numbers are in seconds.

#### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
TF32

|   Batch size |   Throughput Avg |   Latency Avg |   Latency 90% |   Latency 95% |   Latency 99%  |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 1181 | 0.000847 | 0.000863 | 0.000871 | 0.000901 |

FP16

|   Batch size |   Throughput Avg |   Latency Avg |   Latency 90% |   Latency 95% |   Latency 99%  |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 1215 | 0.000823 | 0.000858 | 0.000864 | 0.000877 |

#### Inference performance: NVIDIA DGX-1 (1x V100 16GB)

FP32

|   Batch size |   Throughput Avg |   Latency Avg |   Latency 90% |   Latency 95% |   Latency 99%  |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
|   1 | 718 |  0.001392 |   0.001443 | 0.001458 | 0.001499 |


FP16

|   Batch size |   Throughput Avg |   Latency Avg |   Latency 90% |   Latency 95% |   Latency 99%  |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 |   707 | 0.001413 | 0.001511 | 0.001543 | 0.001622 |