1. 06 Sep, 2022 2 commits
  2. 05 Sep, 2022 2 commits
    • Analyzer - Format int type and unify empty value to N/A in diagnosis output files (#406) · 117e0adc
      Yuting Jiang authored
      **Description**
      Format int type and unify np.nan in diagnosis output files.
      
      **Major Revision**
      - format all int columns
      - unify NA values to 'N/A' in json, jsonl, md, and html files
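
A minimal sketch of the two revisions above, using only the standard library. The helper names and sample rows are hypothetical and this is not the analyzer's actual code; it only illustrates treating a column as int when every non-missing value is whole, and writing missing values as 'N/A'.

```python
import math


def is_missing(value):
    """True for None and float NaN, the two 'empty' markers handled here."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_whole_number(value):
    """True for ints and for floats such as 8.0 that carry no fraction."""
    return isinstance(value, int) or (isinstance(value, float) and value.is_integer())


def format_rows(rows):
    # A column counts as an int column if every non-missing value is whole.
    int_columns = {}
    for row in rows:
        for key, value in row.items():
            if not is_missing(value):
                int_columns[key] = int_columns.get(key, True) and is_whole_number(value)

    formatted = []
    for row in rows:
        out = {}
        for key, value in row.items():
            if is_missing(value):
                out[key] = 'N/A'          # unified empty value
            elif int_columns.get(key):
                out[key] = int(value)     # 8.0 is written as 8
            else:
                out[key] = value
        formatted.append(out)
    return formatted


# Hypothetical sample rows; NaN stands for a missing measurement.
rows = [
    {'index': 'node0', 'gpu_count': 8.0, 'busbw': 92.5},
    {'index': 'node1', 'gpu_count': float('nan'), 'busbw': float('nan')},
]
formatted = format_rows(rows)
```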
    • Auto generate ibstat file by pssh (#402) · 2fabac52
      Yang Wang authored
      **Description**
      MPI cannot be initialized twice in the same process (once from Python and once from C).
      In addition, an MPI environment initialized by mpi4py cannot be reused from C code.
      
      To avoid the MPI init issue introduced by mpi4py, rewrite the gen_ibstat_file function to generate the ibstat file with pssh.
      
      **Major Revision**
      - Rewrite the gen_ibstat_file function to generate the ibstat file with pssh
      
      **Minor Revision**
      - Remove mpi4py dependency
      
      Tested the topo-aware functionality on a 36-node cluster.
  3. 02 Sep, 2022 2 commits
    • Enhance timeout cleanup to avoid possible hanging (#405) · 8afaa376
      Yifan Xiong authored
      Enhance timeout cleanup to avoid possible hanging.
      
      __Major Revisions__
      * Skip postprocess (mainly torch.dist.barrier and destroy) when an exception happens (e.g., timeout, GPU crash) to avoid hanging subprocesses.
      * Add cleanup to kill sb exec processes when the Ansible run fails for a certain benchmark.
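
The first revision above can be sketched as follows. This is an illustrative pattern only, with hypothetical function names; it does not reproduce the benchmark base class. The idea is that collective postprocess steps are skipped once the benchmark has raised, because peers may never reach them.

```python
def run_with_guarded_postprocess(benchmark, postprocess):
    """Run `benchmark`; call `postprocess` only if it finished without raising."""
    try:
        result = benchmark()
    except Exception as exc:  # e.g., a timeout or a crashed GPU
        # Skipping postprocess here avoids waiting on peers that may never arrive.
        return {'succeeded': False, 'error': repr(exc)}
    postprocess()
    return {'succeeded': True, 'result': result}


calls = []


def failing_benchmark():
    raise TimeoutError('benchmark timed out')


def healthy_benchmark():
    return 42


def postprocess():
    # Stand-in for barrier/destroy style cleanup.
    calls.append('postprocess')


failed = run_with_guarded_postprocess(failing_benchmark, postprocess)
passed = run_with_guarded_postprocess(healthy_benchmark, postprocess)
```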
      
      __Minor Revisions__
      * Update extra Ansible timeout from 300s to 60s.
    • Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399) · db842892
      Yuting Jiang authored
      **Description**
      Make baseline check optional in data diagnosis and fix bugs.
      
      **Major Revision**
      - make baseline file optional in data diagnosis
      - fix output bugs in md and excel format when 'function' is not in the rule
      - fix bug in the multi_rules function where a missed/failed test may fail the whole process
      
      **Minor Revision**
      - revise docs related to data diagnosis
      - resolve the warning message from the baseline-not-found check; only raise an exception if the baseline is not found in the 'variance' function
      - move summary fields to the top of the json file
      - unify 'Index' and 'machine' to 'index' in the output file
  4. 01 Sep, 2022 1 commit
  5. 31 Aug, 2022 1 commit
  6. 26 Aug, 2022 1 commit
  7. 25 Aug, 2022 2 commits
  8. 23 Aug, 2022 1 commit
    • Analyzer - Add support to store values of metrics in data diagnosis (#392) · 733860d7
      Yuting Jiang authored
      **Description**
      Add support to store values of metrics in data diagnosis.
      
      Take the following rules as example: 
      ```
          nccl_store_rule:
            categories: NCCL_DIS
            store: True
            metrics:
              - nccl-bw:allreduce-run0/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run1/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run2/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run3/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run4/allreduce_1073741824_busbw
          nccl_rule:
            function: multi_rules
            criteria: 'lambda label:True if min(label["nccl_store_rule"].values())/max(label["nccl_store_rule"].values())<0.95 else False'
            categories: NCCL_DIS
      ```
      **nccl_store_rule** stores the values of the metrics in a dict and saves them into `label["nccl_store_rule"]`, and then **nccl_rule** can use those values through `label["nccl_store_rule"].values()` in its criteria.
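
To see how the criteria behaves, the sketch below evaluates the same lambda against two hand-written `label` dicts. The bandwidth numbers are hypothetical, and evaluating the criteria string with `eval` is an assumption about how such a rule could be applied, not a description of the analyzer's internals.

```python
# Criteria string copied from the rule above.
criteria = (
    'lambda label:True if min(label["nccl_store_rule"].values())'
    '/max(label["nccl_store_rule"].values())<0.95 else False'
)
check = eval(criteria)  # assumption: the criteria is a plain Python expression

metric_names = [
    'nccl-bw:allreduce-run{}/allreduce_1073741824_busbw'.format(run)
    for run in range(5)
]

# Hypothetical stored values: one run is clearly slower than the others.
unstable = {'nccl_store_rule': dict(zip(metric_names, [230.0, 231.0, 229.0, 210.0, 232.0]))}
# Hypothetical stored values: all runs are within 5% of each other.
stable = {'nccl_store_rule': dict(zip(metric_names, [230.0, 231.0, 229.0, 230.5, 232.0]))}

unstable_flagged = check(unstable)  # 210 / 232 is below 0.95
stable_flagged = check(stable)      # 229 / 232 is above 0.95
```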
  9. 22 Aug, 2022 1 commit
  10. 17 Aug, 2022 1 commit
    • Update Python setup for require packages (#387) · 626ac0a4
      Yifan Xiong authored
      __Description__
      
      Update Python setup for require packages.
      
      __Major Revisions__
      * downgrade requests version to be compatible with python 3.6, add corresponding pipeline for 3.6
      * add extra entry in extras_require for nested packages
      * update `pip install` contents accordingly
  11. 16 Aug, 2022 1 commit
  12. 13 Aug, 2022 1 commit
  13. 09 Aug, 2022 1 commit
  14. 08 Aug, 2022 1 commit
  15. 04 Aug, 2022 1 commit
  16. 01 Aug, 2022 1 commit
    • Analyzer - Add failure check feature in data diagnosis (#378) · ec16d425
      Yuting Jiang authored
      **Description**
      Add failure check feature in data diagnosis.
      
      **Major Revision**
      - Add a failure check rule op: if any metric_regex is not matched by any metric in the result, the result is labeled as failedtest
      - Split performance issue and failedtest in categories
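
The failure check described above can be sketched with the standard `re` module. The function name, the use of `re.search`, and the sample metric names are assumptions for illustration; the actual rule op may match differently.

```python
import re


def find_unmatched_patterns(metric_regexes, result_metrics):
    """Return the patterns that match none of the metric names in the result."""
    return [
        pattern for pattern in metric_regexes
        if not any(re.search(pattern, metric) for metric in result_metrics)
    ]


# Hypothetical rule patterns and result metrics.
metric_regexes = [
    r'nccl-bw:allreduce-run\d+/allreduce_\d+_busbw',
    r'gemm-flops/fp16_.*',
]
result_metrics = [
    'nccl-bw:allreduce-run0/allreduce_1073741824_busbw',
    'nccl-bw:allreduce-run1/allreduce_1073741824_busbw',
]

unmatched = find_unmatched_patterns(metric_regexes, result_metrics)
# A pattern with no matching metric suggests the test did not produce output.
category = 'failedtest' if unmatched else 'ok'
```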
      
      
      **Minor Revision**
      - replace DataFrame.append() with pd.concat since append() will be removed in later version of pandas
  17. 26 Jul, 2022 1 commit
    • Support topo-aware IB performance validation (#373) · ef4d6574
      Jie Zhang authored
      
      
      * Support topo-aware IB performance validation
      
      Add a new pattern `topo-aware`, so the user can run IB performance
      test based on VM's topology information. This way, the user can
      validate the IB performance across VM pairs with different distance
      as a quick test instead of pair-wise test.
      
      To run with topo-aware pattern, user needs to specify three required
      (and two optional) parameters in YAML config file:
      --pattern	topo-aware
      --ibstat	path to ibstat output
      --ibnetdiscover	path to ibnetdiscover output
      --min_dist	minimum distance of VM pairs (optional, default 2)
      --max_dist	maximum distance of VM pairs (optional, default 6)
      
      The newly added topo_aware module then parses the topology
      information, builds a graph, and generates the VM pairs with
      the specified distance (# hops).
      
      The specified IB test then runs across these
      generated VM pairs.
      Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
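
As a rough illustration of the pairing step described in this commit, the sketch below builds a toy graph and groups VM pairs by hop count using breadth-first search. The topology, node names, and function names are hypothetical; the real topo_aware module parses ibstat and ibnetdiscover output and may select pairs differently. The defaults mirror the documented `min_dist` of 2 and `max_dist` of 6.

```python
from collections import deque


def hop_distances(graph, source):
    """Breadth-first search returning the hop count from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def vm_pairs_by_distance(graph, vms, min_dist=2, max_dist=6):
    """Group unordered VM pairs by hop distance within [min_dist, max_dist]."""
    pairs = {}
    for i, src in enumerate(vms):
        dist = hop_distances(graph, src)
        for dst in vms[i + 1:]:
            hops = dist.get(dst)
            if hops is not None and min_dist <= hops <= max_dist:
                pairs.setdefault(hops, []).append((src, dst))
    return pairs


# Toy topology: two leaf switches under one spine, two VMs per leaf.
graph = {
    'vm0': ['leaf0'], 'vm1': ['leaf0'],
    'vm2': ['leaf1'], 'vm3': ['leaf1'],
    'leaf0': ['vm0', 'vm1', 'spine0'],
    'leaf1': ['vm2', 'vm3', 'spine0'],
    'spine0': ['leaf0', 'leaf1'],
}
pairs = vm_pairs_by_distance(graph, ['vm0', 'vm1', 'vm2', 'vm3'])
```

In this toy graph, VMs under the same leaf are 2 hops apart and VMs under different leaves are 4 hops apart, so one pair per distance would be enough for a quick check.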
      
      * Add description about topology aware ib traffic tests
      Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
      
      * Add unit test to verify generated topology aware config file
      
      This commit adds a unit test to verify that the generated topology aware
      config file is correct. To do so, four new data files are added so that
      the test can invoke the gen_topo_aware_config function to generate a topology
      aware config file and compare it with the expected config file.
      Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
      
      * Fix lint issue on Azure pipeline
      Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
  18. 25 Jul, 2022 1 commit
  19. 22 Jul, 2022 1 commit
  20. 20 Jul, 2022 1 commit
    • Fix port conflict in ib loopback (#375) · 352ae0c9
      Yifan Xiong authored
      Fix a potential port conflict caused by a race condition between time-of-check
      and time-of-use, by keeping the port bound throughout.
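
The general idea can be sketched with Python's `socket` module. This is an illustration of holding a port to close a check-then-use window, under the assumption that a TCP port on localhost is being reserved; it is not the benchmark's actual code, and `reserve_port` is a hypothetical helper.

```python
import socket


def reserve_port(host='127.0.0.1'):
    """Bind to an OS-chosen free port and return the open socket with its port.

    Keeping the socket open means no other process can bind the same port
    between the moment it is chosen and the moment it is used.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS for any free port
    return sock, sock.getsockname()[1]


holder, port = reserve_port()

# While `holder` stays open, a second bind to the same port is rejected.
other = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    other.bind(('127.0.0.1', port))
    conflict_detected = False
except OSError:
    conflict_detected = True
finally:
    other.close()

holder.close()
```

A check-then-release approach (bind, read the port, close, then use it later) leaves a gap in which another process can take the port, which is the race the commit message refers to.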
      
      Modify the function to resolve flake8 C901 while keeping the logic same.
  21. 13 Jul, 2022 1 commit
    • Add dependencies (#374) · 16b6385d
      Yifan Xiong authored
      Add dependencies
      
      * include ndv4-topo.xml in cuda docker images
      * require requests version to avoid RequestsDependencyWarning
  22. 09 Jul, 2022 1 commit
    • Fix issues in ib validation benchmark (#370) · b2875179
      Yifan Xiong authored
      Fix several issues in ib validation benchmark:
      * continue running when a timeout occurs in the middle, instead of aborting the whole mpi process
      * make timeout parameter configurable, set default to 120 seconds
      * avoid mixing stdio and iostream when printing to stdout
      * set default message size to 8M which will saturate ib in most cases
      * fix hostfile path issue so that it can be auto found in different cases
  23. 08 Jul, 2022 1 commit
    • Support node_num=1 in mpi mode (#372) · e00a8180
      Yifan Xiong authored
      Support `node_num: 1` in mpi mode, so that we can run mpi benchmarks in
      both 1 node and all nodes in one config by changing `node_num`.
      Update docs and add test case accordingly.
  24. 06 Jul, 2022 1 commit
    • Update dependencies and Dockerfile (#371) · 9f03d568
      Yifan Xiong authored
      Update dependencies and Dockerfile:
      * upgrade nccl-tests and rccl-tests to current latest version to match
        NCCL/RCCL versions
      * unify image tag names on DockerHub
      * remove verbose output in Dockerfile and fix some minor flag issues
  25. 05 Jul, 2022 1 commit
  26. 29 Jun, 2022 2 commits
    • Fix issues in ib loopback benchmark (#369) · 620192a2
      Yifan Xiong authored
      Fix several issues in ib loopback benchmark:
      * use `--report_gbits` and divide by 8 to get GB/s; previous results were
        MiB/s / 1000
      * use the ib_write_bw binary built in third_party instead of the one in the system path
      * update the metric names so that different hca indices share the same metric name
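
The unit change in the first item can be checked with a little arithmetic. The 200 Gb/s figure below is a hypothetical link rate chosen for illustration, and the function names are not from the benchmark.

```python
def gbits_to_gbytes(gbit_per_sec):
    """Convert Gb/s (as reported with --report_gbits) to GB/s."""
    return gbit_per_sec / 8


def legacy_value(mib_per_sec):
    """Previous calculation: MiB/s divided by 1000, which is not true GB/s."""
    return mib_per_sec / 1000


reported_gbits = 200.0                            # hypothetical measurement in Gb/s
new_result = gbits_to_gbytes(reported_gbits)      # 25.0 GB/s

# The same throughput expressed in MiB/s, then run through the old formula.
bytes_per_sec = new_result * 1e9
old_result = legacy_value(bytes_per_sec / 2**20)  # roughly 23.84
```

Under these assumptions the old formula under-reports by about 4.6%, because 1000 MiB is not the same as 1 GB.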
    • Deployment - Refine error message when GPU is not detected (#368) · 8ef7163a
      Yifan Xiong authored
      Refine error message when GPU is not detected.
      
      Possible solutions if hardware exists and drivers are already installed:
      * nvidia gpus:
        ```sh
        /sbin/modprobe nvidia-uvm
        D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
        mknod -m 666 /dev/nvidia-uvm c $D 0
        ```
      
      * amd gpus
        ```sh
        modprobe amdgpu
        ```
  27. 24 Jun, 2022 2 commits
    • Fix incorrect ulimit config in Dockerfile (#364) · 325a7338
      Yifan Xiong authored
      Fix incorrect ulimit nofile config in Dockerfile.
      
      sh is used by default instead of bash. Its `echo` does not accept any options, so `-e` was written into /etc/security/limits.conf.
    • Support multiple IB/GPU in ib validation (#363) · bfaa1c83
      Yifan Xiong authored
      **Description**
      
      Support running multiple IB/GPU devices simultaneously in ib validation benchmark.
      
      **Major Revisions**
      - Revise ib_validation_performance.cc so that multiple processes per node can launch multiple perftest commands simultaneously. For each node pair in the config, the configured number of processes per node run in parallel.
      - Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
      - Fix env issues in Dockerfile for end-to-end test.
      - Update ib-traffic configuration examples in config files.
      - Update unit tests and docs accordingly.
      
      Closes #326.
  28. 19 Jun, 2022 2 commits
  29. 15 Jun, 2022 1 commit
    • Fix cmake and build issues (#360) · 60a3c743
      Yifan Xiong authored
      **Description**
      
      Fix cmake and build issues.
      
      **Major Revision**
      
      * Remove unnecessary boost build
      * Remove user-agent for mlc
      * Remove -j for third party to build each project in sequence
      * Fix ansible collections installation path
  30. 14 Jun, 2022 1 commit
    • Support `sb run` on host directly without Docker (#358) · a4937e95
      Yifan Xiong authored
      **Description**
      
      Support `sb run` on host directly without Docker
      
      **Major Revisions**
      - Add `--no-docker` argument for `sb run`.
      - Run on host directly if `--no-docker` is specified.
      - Update docs and tests correspondingly.
  31. 06 Jun, 2022 1 commit
  32. 02 Jun, 2022 2 commits