    Analyzer - Add support to store values of metrics in data diagnosis (#392) · 733860d7
    Yuting Jiang authored
    **Description**
    Add support to store values of metrics in data diagnosis.
    
    Take the following rules as an example:
    ```
        nccl_store_rule:
          categories: NCCL_DIS
          store: True
          metrics:
            - nccl-bw:allreduce-run0/allreduce_1073741824_busbw
            - nccl-bw:allreduce-run1/allreduce_1073741824_busbw
            - nccl-bw:allreduce-run2/allreduce_1073741824_busbw
            - nccl-bw:allreduce-run3/allreduce_1073741824_busbw
            - nccl-bw:allreduce-run4/allreduce_1073741824_busbw
        nccl_rule:
          function: multi_rules
          criteria: 'lambda label:True if min(label["nccl_store_rule"].values())/max(label["nccl_store_rule"].values())<0.95 else False'
          categories: NCCL_DIS
    ```
    **nccl_store_rule** stores the values of its metrics in a dict and saves them to `label["nccl_store_rule"]`. **nccl_rule** can then read those values in its criteria through `label["nccl_store_rule"].values()`. In this example, the criteria is met when the lowest stored bus bandwidth is less than 95% of the highest.
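
The criteria can be checked by hand with a minimal sketch like the one below. It assumes the criteria string is evaluated as a Python expression and called with the `label` dict. The helper `is_violated` and the bandwidth numbers are hypothetical, made up for illustration, and are not part of the analyzer or taken from real runs.

```python
# Criteria string copied from nccl_rule above.
criteria = (
    'lambda label:True if min(label["nccl_store_rule"].values())'
    '/max(label["nccl_store_rule"].values())<0.95 else False'
)


def is_violated(criteria: str, label: dict) -> bool:
    """Hypothetical helper: evaluate a criteria string against stored values."""
    # Assumption: the criteria string is a trusted Python lambda expression.
    return eval(criteria)(label)


# Metric names from nccl_store_rule (run0 .. run4).
metrics = [f"nccl-bw:allreduce-run{i}/allreduce_1073741824_busbw" for i in range(5)]

# Made-up values: all runs close together vs. one run clearly slower.
healthy = {"nccl_store_rule": dict(zip(metrics, [200.0, 199.0, 201.0, 198.0, 200.0]))}
degraded = {"nccl_store_rule": dict(zip(metrics, [200.0, 199.0, 180.0, 198.0, 200.0]))}

print(is_violated(criteria, healthy))   # min/max = 198/201, above 0.95
print(is_violated(criteria, degraded))  # min/max = 180/201, below 0.95
```

With these made-up numbers, only the second label satisfies the criteria, because a single slow run pulls the min/max ratio under the 0.95 threshold.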