- 11 Aug, 2022 1 commit
-
-
Minjie Wang authored
* code changes for bug fixes identified during mag_lsc dataset (#4187)
* code changes for bug fixes identified during mag_lsc dataset
  1. Changed the call from torch.Tensor() to torch.from_numpy() to address memory corruption issues when creating large tensors. The tricky thing is that this works correctly for small tensors.
  2. Changed the dgl.graph() function call to include the "num_nodes" argument, to explicitly specify all the nodes in a graph partition.
* Update convert_partition.py
  Moving the changes to the "create_metadata_json" function to the "multiple-file-format" support, where this change is more appropriate, since multiple-machine testing was done with these code changes.
* Addressing review comments.
  Removed space at the end of the line, as suggested.
* Revert "Revert "[Distributed Training Pipeline] Initial implementation of Distributed data processing step in the Dis… (#3926)" (#4037)"
  This reverts commit 7c598aac.
* Added code to support multiple-file-support feature and removed singl… (#4188)
* Added code to support multiple-file-support feature and removed single-file-support code
  1. Added code to read dataset in multiple-file-format
  2. Removed code for single-file format
* added files missing in the previous commit
  This commit includes dataset_utils.py, which reads the dataset in multiple-file-format, gloo_wrapper function calls to support exchanging dictionaries as objects, and helper functions in utils.py
* Update convert_partition.py
  Updated the "create_metadata_json" function call to include partition_id, so that each rank only creates its own metadata object; these are later accumulated on rank-0 to create the graph-level metadata json file.
* addressing code review comments during the CI process
  Code changes resulting from the code review comments received during the CI process.
* Code reorganization
  Addressing CI comments and code reorganization for easier understanding.
* Removed commented out line
  Removed commented out line.
* Support new format for multi-file support in distributed partitioning. (#4217)
* Code changes for the following
  1. Generating node data at each process
  2. Reading csv files using pyarrow
  3. Feature complete code.
* Removed some typos because of which unit tests were failing
  1. Change the file name to the correct file name when loading edges from file
  2. When storing node-features after shuffling, use the correct key to store the global-nids of node features which are received after being transmitted.
* Code changes to address CI comments by reviewers
  1. Removed some redundant code and added text in the doc-strings to describe the functionality of some functions.
  2. Function signatures and invocations now match w.r.t. argument list
  3. Added detailed description of the metadata json structure so that users understand the type of information present in this file and how it is used throughout the code.
* Addressing code review comments
  1. Addressed all the CI comments; some of the changes include simplifying the code related to the concatenation of lists and enhancing the docstrings of functions which are changed in this process.
* Update docstrings of two functions appropriately in response to code review comments
  Removed "todo" from the docstring of the gen_nodedata function. Added "todo" to the gen_dist_partitions function when node-id to partition-id's are read for the first time. Removed 'num-node-weights' from the docstring for the get_dataset function and added schema_map docstring to the argument list.
* [Distributed] Change for the new input format for distributed partitioning (#4273)
* Code changes to address the updated file format support for massively large graphs.
  1. Updated the docstring for the starting function "gen_dist_partitions" to describe the newly proposed file format for input dataset.
  2. Code which was dependent on the structure of the old-metadata json object has been updated to read from the newly proposed metadata file.
  3. Fixed some errors when appropriate functions were invoked and the calling function expects return values from the invoked function.
  4. This modified code has been tested on "mag" dataset using 4-way partitions and verified the results
* Code changes to address the CI review comments
  1. Improved docstrings for some functions.
  2. Added a new function in the utils.py to compute the id ranges and this is used in multiple places.
* Added TODO to indicate the redundant data structure.
  Because of the new file format changes, one of the dictionaries (node_feature_tids, node_tids) will be redundant. Added TODO text so that this will be removed in the next iteration of code changes.
* [Distributed] use alltoall fix to bypass gloo - alltoallv bug in distributed partitioning (#4311)
* Alltoall Fix to bypass gloo - alltoallv bug which is preventing further testing
  1. Replaced alltoallv gloo wrapper call with alltoall message.
  2. All the messages are padded to be of same length
  3. Receiving side unpads the messages and continues processing.
* Code changes to address CI comments
  1. Removed unused functions from gloo_wrapper.py
  2. Changed the function signature of alltoallv_cpu_data as suggested.
  3. Added docstring to include more description of the functionality inside alltoallv_cpu_data. Included more asserts to validate the assumptions.
* Changed the function name appropriately
  Changed the function name from "alltoallv_cpu_data" to "alltoallv_cpu", which I believe is appropriate because the underlying functionality provides alltoallv, which is basically alltoall_cpu + padding
* Added code and text to address the review comments.
  1. Changed the function name to indicate the local use of this function.
  2. Changed docstring to indicate the assumptions made by alltoallv_cpu function.
* Removed unused function from import statement
  Removed unused/removed function from import statement.
* [Distributed] reduce memory consumption in distributed graph partitioning. (#4338)
* Fix for node_subgraph function, which seems to generate segmentation fault for very large partitions
  1. Removed three graph dgl objects and we create the final dgl object directly by maintaining the following constraints
     a) Nodes are reordered so that local nodes are placed in the beginning of the nodes list compared to non-local nodes.
     b) Edge order is maintained as passed into this function.
     c) src/dst end points are mapped to target values based on the reshuffled nodes order.
* Code changes addressing CI comments for this PR
  1. Used Da's suggested map to map nodes from old to new order. This is much simpler and more memory-efficient.
* Addressing CI Comments.
  1. Reduced the amount of documentation to reflect the actual implementation.
  2. Named the mapping object appropriately.
* [Distributed] Graph chunking UX (#4365)
* first commit
* update
* huh
* fix
* update
* revert core
* fix
* update
* rewrite
* oops
* address comments
* add graph name
* address comments
* remove sample metadata file
* address comments
* fix
* remove
* add docs
* Adding launch script and wrapper script to trigger distributed graph … (#4276)
* Adding launch script and wrapper script to trigger distributed graph partitioning pipeline as defined in the UX document
  1. dispatch_data.py is a wrapper script which builds the command and triggers the distributed partitioning pipeline
  2. distgraphlaunch.py is the main python script which triggers the pipeline and to simplify its usage dispatch_data.py is included as a wrapper script around it.
* Added code to auto-detect python version and retrieve some parameters from the input metadata json file
  1. Auto detect python version
  2. Read the metadata json file and extract some parameters to pass to the user defined command which is used to trigger the pipeline.
* Updated the json file name to metadata.json file per UX documentation
  1. Renamed json file name per UX documentation.
* address comments
* fix
* fix doc
* use unbuffered logging to cure anxiety
* cure more anxiety
* Update tools/dispatch_data.py
  Co-authored-by: Minjie Wang <minjie.wang@nyu.edu>
* oops
Co-authored-by: Quan Gan <coin2028@hotmail.com>
Co-authored-by: Minjie Wang <minjie.wang@nyu.edu>
Co-authored-by: kylasa <kylasa@gmail.com>
Co-authored-by: Da Zheng <zhengda1936@gmail.com>
Co-authored-by: Quan (Andy) Gan <coin2028@hotmail.com>
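
The alltoall workaround described under #4311 above (pad every message to a common length, exchange fixed-size buffers, unpad on the receiving side) can be sketched roughly as follows. This is a single-process simulation with plain Python lists, not the code from gloo_wrapper.py; the names `pad_messages`, `simulate_alltoall`, `alltoallv_via_padding` and the `PAD` filler value are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

PAD = -1  # hypothetical filler value; true lengths are tracked separately


def pad_messages(outgoing: List[List[List[int]]]) -> Tuple[list, list]:
    """Pad every message to one common length and record the true lengths.

    outgoing[src][dst] is the message that rank `src` sends to rank `dst`.
    In a real multi-process run the common length would itself have to be
    agreed on across ranks; here it is simply computed globally.
    """
    max_len = max(len(msg) for row in outgoing for msg in row)
    lengths = [[len(msg) for msg in row] for row in outgoing]
    padded = [[msg + [PAD] * (max_len - len(msg)) for msg in row]
              for row in outgoing]
    return padded, lengths


def simulate_alltoall(matrix: list) -> list:
    """Fixed-size all-to-all: rank `dst` receives matrix[src][dst] from each src."""
    n = len(matrix)
    return [[matrix[src][dst] for src in range(n)] for dst in range(n)]


def alltoallv_via_padding(outgoing: List[List[List[int]]]) -> list:
    """Emulate a variable-length all-to-all using only fixed-size exchanges."""
    padded, lengths = pad_messages(outgoing)
    recv_lengths = simulate_alltoall(lengths)  # exchange the true sizes
    recv_padded = simulate_alltoall(padded)    # exchange equal-length buffers
    # Receiving side strips the padding using the sizes it was sent.
    return [[buf[:size] for buf, size in zip(bufs, sizes)]
            for bufs, sizes in zip(recv_padded, recv_lengths)]
```

Calling `alltoallv_via_padding([[[1], [2, 3]], [[4, 5, 6], []]])` simulates two ranks: each rank ends up with the messages addressed to it, in sender order, with the padding removed. Exchanging the true lengths separately avoids relying on the filler value never appearing in real data.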
-
- 14 Jun, 2022 1 commit
-
-
Rhett Ying authored
* [Dist] master port should be fixed for all trainers * add tests for tools/launch.py
-
- 09 Jun, 2022 1 commit
-
-
Rhett Ying authored
-
- 09 Feb, 2022 1 commit
-
-
Rhett Ying authored
* enable to launch multiple client groups sequentially * launch simultaneously is enabled * refine docstring * revert unnecessary change * [DOC] add doc for long live server * refine * refine doc * refine doc
-
- 23 Sep, 2021 1 commit
-
-
xiang song(charlie.song) authored
[Distributed] Allow user to pass-in extra env parameters when launching a distributed training task. (#3375) * Allow user to pass-in extra env parameters when launching a distributed training task. * Update * upd
Co-authored-by: xiangsx <xiangsx@ip-10-3-59-214.eu-west-1.compute.internal>
-
- 14 Sep, 2021 1 commit
-
-
xiang song(charlie.song) authored
* put PYTHONPATH in server launch * remove prints
Co-authored-by: xiangsx <xiangsx@ip-10-3-59-214.eu-west-1.compute.internal>
-
- 17 Aug, 2021 1 commit
-
-
Eric Kim authored
[Tools] In `tools/launch.py`, correctly pass all DGL client/server env vars if udf is a multi-command (#3245) * Correctly pass all DGL client/server env vars if udf is a multi-command * Refactor to use wrap_cmd_with_local_envvars() helper fn
-
- 02 Aug, 2021 1 commit
-
-
Eric Kim authored
* Refactors torch dist launcher udf-wrap code to handle more python versions * minor changes
-
- 30 Jul, 2021 1 commit
-
-
Eric Kim authored
-
- 26 May, 2021 1 commit
-
-
Da Zheng authored
* explicitly set the graph format. * fix. * fix. * fix launch script. * fix readme.
Co-authored-by: Zheng <dzzhen@3c22fba32af5.ant.amazon.com>
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-71-112.ec2.internal>
-
- 01 May, 2021 1 commit
-
-
Da Zheng authored
* kill training jobs. * update. * fix.
Co-authored-by: Zheng <dzzhen@3c22fba32af5.ant.amazon.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-71-112.ec2.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-73-81.ec2.internal>
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
-
- 08 Apr, 2021 1 commit
-
-
Da Zheng authored
Co-authored-by: Ubuntu <ubuntu@ip-172-31-73-81.ec2.internal>
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
-
- 04 Apr, 2021 1 commit
-
-
Da Zheng authored
* set omp thread. * add comment. * fix.
-
- 30 Mar, 2021 1 commit
-
-
Da Zheng authored
* remove num_workers. * remove num_workers. * remove num_workers. * remove num-servers. * update error message. * update docstring. * fix docs. * fix tests. * fix test. * fix. * print messages in test. * fix. * fix test. * fix.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-9-132.us-west-1.compute.internal>
-
- 27 Aug, 2020 1 commit
-
-
Chao Ma authored
* check num_workers * update * update * update * update * update * update
-
- 12 Aug, 2020 1 commit
-
-
Chao Ma authored
* update * update
-
- 11 Aug, 2020 2 commits
-
-
Da Zheng authored
* move server start code to initialize. * fix. * fix lint. * fix examples. * add more checks.
-
Chao Ma authored
* remove server_count from ip_config.txt * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * lint * update * update * update * update * update * update * update * update * update * update * update * update * update * Update dist_context.py * fix lint. * make it work for multiple spaces. * update ip_config.txt. * fix examples. * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update
Co-authored-by: Da Zheng <zhengda1936@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-19-1.us-west-2.compute.internal>
-
- 10 Aug, 2020 1 commit
-
-
Da Zheng authored
* fix tests. * fix. * remove a test. * make code work in the standalone mode. * fix example. * more fix. * make DistDataloader work with num_workers=0 * fix DistDataloader tests. * fix. * fix lint. * fix cleanup. * fix test * remove unnecessary code. * remove tests. * fix. * fix. * fix. * fix example * fix. * fix. * fix launch script.
Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-19-1.us-west-2.compute.internal>
-
- 09 Aug, 2020 1 commit
-
-
Da Zheng authored
* temp fix omp. * set server threads. * add CAPI to set up OMP threads. * fix. * fix. * update namespace. * set cpi properly. * allow to config num worker threads. * set #threads. * fix.
-
- 08 Aug, 2020 1 commit
-
-
Da Zheng authored
* update launch script * check the correctness of launch script. * fix.
-
- 31 Jul, 2020 1 commit
-
-
Da Zheng authored
* fix bugs. * eval on both validation and testing. * add script. * update. * update launch. * make train_dist.py independent. * update readme. * update readme. * update readme. * update readme. * generate undirected graph. * rename conf_file to part_config * use rsync * make train_dist independent.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-19-1.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-19-115.us-west-2.compute.internal>
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>
-
- 27 Jul, 2020 1 commit
-
-
Chao Ma authored
* update * update * update * update
-
- 17 Jul, 2020 1 commit
-
-
Da Zheng authored
Co-authored-by: Ubuntu <ubuntu@ip-172-31-19-1.us-west-2.compute.internal>
-
- 16 Jul, 2020 1 commit
-
-
Chao Ma authored
* update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * fix launch script. Co-authored-by:Da Zheng <zhengda1936@gmail.com>
-