- 05 Sep, 2022 1 commit
-
-
Yuting Jiang authored
**Description** Format integer columns and unify np.nan values in diagnosis output files. **Major Revision** - format all int columns - unify NA values to 'N/A' in json, jsonl, md, and html files
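The two revisions above can be illustrated with a minimal pandas sketch. It is not the code from this commit: the helper names `format_int_columns` and `unify_na`, and the column names in the usage note, are hypothetical, and the sketch assumes the diagnosis results are held in a `DataFrame` before being written out.

```python
import numpy as np
import pandas as pd


def format_int_columns(df):
    """Cast float columns whose non-null values are all whole numbers to nullable Int64."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_float_dtype(series):
            non_null = series.dropna()
            # Only convert when every present value is a whole number.
            if len(non_null) > 0 and (non_null % 1 == 0).all():
                out[col] = series.astype('Int64')
    return out


def unify_na(df, na_rep='N/A'):
    """Replace every missing value (np.nan, None, pd.NA) with one placeholder string."""
    out = df.astype(object)
    return out.where(pd.notna(out), na_rep)
```

Usage, with hypothetical column names: `unify_na(format_int_columns(pd.DataFrame({'issue_num': [1.0, np.nan]})))` yields an `issue_num` column holding the integer `1` and the string `'N/A'`, so the same frame can be written to json, jsonl, md, or html without mixed NaN spellings.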
-
- 02 Sep, 2022 1 commit
-
-
Yuting Jiang authored
**Description** Make baseline check optional in data diagnosis and fix bugs. **Major Revision** - make baseline file optional in data diagnosis - fix bugs in md and excel output when 'function' is not in the rule - fix bug in multi_rules function where a missed or failed test could fail the whole process **Minor Revision** - revise docs related to data diagnosis - resolve warning message about the baseline-not-found check; only raise an exception if the baseline is not found in the 'variance' function - move summary fields to the top of the json file - unify 'Index', 'machine' -> 'index' in output file
-
- 01 Sep, 2022 1 commit
-
-
Yuting Jiang authored
**Description** Update error handling to support exit code of sb result diagnosis. **Major Revision** - raise exception for any error to make exit_code=1
-
- 23 Aug, 2022 1 commit
-
-
Yuting Jiang authored
**Description** Add support to store values of metrics in data diagnosis. Take the following rules as an example:

```
nccl_store_rule:
  categories: NCCL_DIS
  store: True
  metrics:
    - nccl-bw:allreduce-run0/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run1/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run2/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run3/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run4/allreduce_1073741824_busbw
nccl_rule:
  function: multi_rules
  criteria: 'lambda label:True if min(label["nccl_store_rule"].values())/max(label["nccl_store_rule"].values())<0.95 else False'
  categories: NCCL_DIS
```

**nccl_store_rule** stores the values of the metrics in a dict and saves them into `label["nccl_store_rule"]`; **nccl_rule** can then use those values through `label["nccl_store_rule"].values()` in its criteria.
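A rough sketch of how such a criteria string could be applied to stored values. It assumes the criteria string is turned into a callable with `eval`, which may differ from the actual implementation; `evaluate_criteria` is a hypothetical helper and the bandwidth numbers are made up for illustration.

```python
def evaluate_criteria(criteria, label):
    """Turn a criteria string (a lambda over `label`) into a callable and apply it."""
    # Assumption: rule files are trusted, so eval on the criteria string is acceptable here.
    check = eval(criteria)
    return bool(check(label))


# The criteria string from the nccl_rule example above.
CRITERIA = (
    'lambda label:True if min(label["nccl_store_rule"].values())'
    '/max(label["nccl_store_rule"].values())<0.95 else False'
)

# Hypothetical bus bandwidth values for the five runs saved by the store rule.
stored = {
    f'nccl-bw:allreduce-run{i}/allreduce_1073741824_busbw': value
    for i, value in enumerate([100.0, 99.0, 98.0, 97.0, 90.0])
}
label = {'nccl_store_rule': stored}

# min/max = 90/100 = 0.90, which is below 0.95, so the rule is violated.
violated = evaluate_criteria(CRITERIA, label)
```

With these made-up values the slowest run is 10% below the fastest, so `violated` is `True`; if all five runs were within 5% of each other, the criteria would return `False`.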
-
- 22 Aug, 2022 1 commit
-
-
Yuting Jiang authored
**Description** Add support for both jsonl and json format in data diagnosis. **Major Revision** - Add support for both jsonl and json format in data diagnosis **Minor Revision** - change related doc - add jsonl support in cli
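For context, the difference between the two formats can be sketched as follows. This is only an illustration under the assumption that a json file holds one document (a list of records or a single record) while a jsonl file holds one record per line; `load_results` is a hypothetical helper, not a function from this repository.

```python
import json
from pathlib import Path


def load_results(path):
    """Load records from a .json or .jsonl file and return them as a list of dicts."""
    text = Path(path).read_text()
    if str(path).endswith('.jsonl'):
        # jsonl: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # json: a single document, either a list of records or one record.
    data = json.loads(text)
    return data if isinstance(data, list) else [data]
```

Dispatching on the file extension keeps the caller's code identical for both formats.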
-
- 09 Aug, 2022 1 commit
-
-
Yuting Jiang authored
**Description** Rename field in data diagnosis to be more readable. **Major Revision** - rename fields according to diagnosis/metric format **Minor Revision** - change type of diagnosis/issue_num to be int
-
- 01 Aug, 2022 1 commit
-
-
Yuting Jiang authored
**Description** Add failure check feature in data diagnosis. **Major Revision** - Add failure check rule op so that a metric_regex not matched by any metric in the result is labeled as failedtest - Split performance issue and failedtest in categories **Minor Revision** - replace DataFrame.append() with pd.concat since append() will be removed in a later version of pandas
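Both changes can be sketched briefly. The snippet below is an illustration, not the rule op from this commit: `find_unmatched_regexes` and `append_row` are hypothetical helpers, full-string matching with `re.fullmatch` is an assumption, and the metric and column names in the usage note are examples only.

```python
import re

import pandas as pd


def find_unmatched_regexes(metric_regexes, result_metrics):
    """Return the regexes that match none of the metric names present in a result."""
    unmatched = []
    for pattern in metric_regexes:
        # Assumption: a regex must match a whole metric name to count as matched.
        if not any(re.fullmatch(pattern, name) for name in result_metrics):
            unmatched.append(pattern)
    return unmatched


def append_row(df, row):
    """Add one row with pd.concat, the replacement for the removed DataFrame.append()."""
    return pd.concat([df, pd.DataFrame([row])], ignore_index=True)
```

For example, `find_unmatched_regexes([r'nccl-bw:default/allreduce_\d+_busbw', r'gemm-flops/fp16_flops'], ['nccl-bw:default/allreduce_8_busbw'])` returns only the second pattern, which a failure check could then report as a failed test for that node.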
-
- 01 Jun, 2022 1 commit
-
-
user4543 authored
**Description** Fix bugs in data diagnosis. **Major Revision** - add support for getting the baseline of a metric that uses custom benchmark naming with ':', such as 'nccl-bw:default/allreduce_8_bw:0' - save raw data of all metrics, rather than only the metrics defined in diagnosis_rules.yaml, when output_all is True - fix bug of using the wrong column index when applying formatting (red color and percentile) in the excel output
-
- 10 Apr, 2022 1 commit
-
-
user4543 authored
**Description** Output results of all nodes in data diagnosis.
-
- 24 Mar, 2022 1 commit
-
-
user4543 authored
**Description** Add result summary in excel,md,html format. **Major Revision** - Add ResultSummary class to support result summary in excel,md,html format. - Abstract RuleBase class for common-used functions in DataDiagnosis and ResultSummary.
-
- 15 Mar, 2022 1 commit
-
-
user4543 authored
**Description** Add md and html output format for DataDiagnosis. **Major Revision** - add md and html support in file_handler - add interface in DataDiagnosis for md and HTML output **Minor Revision** - move excel and json output interface into DataDiagnosis
-
- 07 Mar, 2022 1 commit
-
-
user4543 authored
**Description** Abstract RuleBase from DataDiagnosis.
-
- 20 Feb, 2022 1 commit
-
-
user4543 authored
**Description** Add multi-rules feature for data diagnosis to support multiple rules' combined check. **Major Revision** - revise rule design to support multiple rules combination check - update related codes and tests
-
- 30 Dec, 2021 1 commit
-
-
Yifan Xiong authored
__Description__ Cherry-pick bug fixes from v0.4.0 to main. __Major Revisions__ * Bug - Fix issues for Ansible and benchmarks (#267) * Tests - Refine test cases for microbenchmark (#268) * Bug - Build openmpi with ucx support in rocm dockerfiles (#269) * Benchmarks: Fix Bug - Fix fio build issue (#272) * Docs - Unify metric and add doc for cublas and cudnn functions (#271) * Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274) * Bug - Fix bug of detecting if gpu_index is none (#275) * Bug - Fix bugs in data diagnosis (#273) * Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270) * Benchmarks: Configuration - Update inference and network benchmarks in configs (#276) * Docs - Upgrade version and release note (#277) Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
-
- 08 Dec, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Add data diagnosis module. **Major Revision** - Add DataDiagnosis class to support rule-based data diagnosis for the result summary jsonl file of multiple nodes - Add RuleOp class to define rule operators
-