Commit 89edb819 authored by ChongyuNVIDIA, committed by GitHub

Add the permutation related support as the extension for asp lib. (#1194)

* Add the permutation related support as the extension for asp lib.

* [Fix] Track the permutation sequence for progressive channel swap strategy.

* Fix the corner case where a layer is not sparse but still needs permutation applied because of its siblings.

* Fix the deprecated functions in ASP unit tests.

* Fix the sparsity info typo in ASP lib.

* [Enhancement] Set an identical random seed on all GPUs so that permutation search generates the same results.

* Update the README.md with identical random seed setting and NeurIPS info.

* Integrate the Pybind11 enhancement of permutation search into ASP lib.
parent 79c01877
...@@ -3,6 +3,7 @@ ...@@ -3,6 +3,7 @@
This serves as a quick-start for ASP (Automatic SParsity), a tool that enables sparse training and inference for PyTorch models by adding 2 lines of Python. This serves as a quick-start for ASP (Automatic SParsity), a tool that enables sparse training and inference for PyTorch models by adding 2 lines of Python.
## Importing ASP ## Importing ASP
``` ```
from apex.contrib.sparsity import ASP from apex.contrib.sparsity import ASP
``` ```
...@@ -10,11 +11,13 @@ from apex.contrib.sparsity import ASP ...@@ -10,11 +11,13 @@ from apex.contrib.sparsity import ASP
## Initializing ASP ## Initializing ASP
Apart from the import statement, it is sufficient to add just the following line of code before the training phase to augment the model and the optimizer for sparse training/inference: Apart from the import statement, it is sufficient to add just the following line of code before the training phase to augment the model and the optimizer for sparse training/inference:
``` ```
ASP.prune_trained_model(model, optimizer) ASP.prune_trained_model(model, optimizer)
``` ```
In the context of a typical PyTorch training loop, it might look like this: In the context of a typical PyTorch training loop, it might look like this:
``` ```
ASP.prune_trained_model(model, optimizer) ASP.prune_trained_model(model, optimizer)
...@@ -27,6 +30,7 @@ for epoch in range(epochs): ...@@ -27,6 +30,7 @@ for epoch in range(epochs):
torch.save(...) torch.save(...)
``` ```
The `prune_trained_model` step calculates the sparse mask and applies it to the weights. This is done once, i.e., sparse locations in the weights matrix remain fixed after this step. The `prune_trained_model` step calculates the sparse mask and applies it to the weights. This is done once, i.e., sparse locations in the weights matrix remain fixed after this step.
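To make the "sparse mask" concrete, here is a minimal pure-Python sketch of a 2:4 mask. It assumes the pattern keeps the two largest-magnitude values in every group of four consecutive weights. The helper name `mask_2_of_4` is hypothetical and is not part of ASP, which computes its masks on tensors.

```python
def mask_2_of_4(row):
    """Return a 0/1 mask that keeps the 2 largest-magnitude values in each group of 4."""
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    mask = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # indices of the two largest magnitudes inside this group of four
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        mask.extend(1 if i in keep else 0 for i in range(4))
    return mask


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
mask = mask_2_of_4(row)
# Applying the mask zeroes the pruned positions; the mask stays fixed afterwards.
pruned = [w * m for w, m in zip(row, mask)]
```

Because the mask is computed once, later optimizer steps only need to re-apply the same 0/1 pattern to keep the pruned positions at zero.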
## Generate a Sparse Network ## Generate a Sparse Network
...@@ -42,7 +46,6 @@ The following approach serves as a guiding example on how to generate a pruned m ...@@ -42,7 +46,6 @@ The following approach serves as a guiding example on how to generate a pruned m
In code, below is a sketch on how to use ASP for this approach (steps 1 and 2 above). In code, below is a sketch on how to use ASP for this approach (steps 1 and 2 above).
``` ```
model = define_model(..., pretrained=True) # define model architecture and load parameter tensors with trained values (by reading a trained checkpoint) model = define_model(..., pretrained=True) # define model architecture and load parameter tensors with trained values (by reading a trained checkpoint)
criterion = ... # compare ground truth with model prediction; use the same criterion as used to generate the dense trained model criterion = ... # compare ground truth with model prediction; use the same criterion as used to generate the dense trained model
optimizer = ... # optimize model parameters; use the same optimizer as used to generate the dense trained model optimizer = ... # optimize model parameters; use the same optimizer as used to generate the dense trained model
...@@ -72,7 +75,60 @@ ASP.compute_sparse_masks() ...@@ -72,7 +75,60 @@ ASP.compute_sparse_masks()
A more thorough example can be found in `./test/toy_problem.py`. A more thorough example can be found in `./test/toy_problem.py`.
## Advanced Usage: Channel Permutation
We introduce channel permutations as an advanced method to maximize the accuracy of structured sparse networks. By permuting weight matrices along their channel dimension and adjusting the surrounding layers appropriately, we demonstrate accuracy recovery for even small, parameter-efficient networks, without affecting inference run-time.
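The following pure-Python sketch illustrates why a permutation can be applied offline without changing the network's output. The same permutation is applied to the output channels (rows) of a producer layer and to the input channels (columns) of its consumer. The two tiny layers, the ReLU between them, and all names here are assumptions chosen for illustration; they are not taken from the library.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


# Producer layer: 4 output channels, 2 inputs. Consumer layer: 2 outputs, 4 input channels.
W1 = [[1.0, -2.0], [3.0, 4.0], [-5.0, 6.0], [7.0, -8.0]]
W2 = [[1.0, -1.0, 2.0, 0.5], [0.0, 3.0, -2.0, 1.0]]
x = [1.0, 0.5]
perm = [2, 0, 3, 1]

W1_p = [W1[i] for i in perm]                    # permute producer rows (output channels)
W2_p = [[row[i] for i in perm] for row in W2]   # permute consumer columns (input channels)

y = matvec(W2, relu(matvec(W1, x)))
y_p = matvec(W2_p, relu(matvec(W1_p, x)))
```

Both `y` and `y_p` hold the same values: the element-wise activation commutes with the reordering, so the permuted pair computes the same function as the original pair.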
The final accuracy depends strongly on the quality of the permutations. We provide default algorithms to search for high-quality permutations. The permutation search process can be accelerated by the Apex CUDA extension `apex.contrib.sparsity.permutation_search_kernels`.
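As a rough illustration of what "quality" can mean here, the sketch below scores a column ordering by the total weight magnitude that survives 2:4 pruning. The scoring function `kept_magnitude` and the toy matrix are hypothetical; the library's own search algorithms and kernels are more involved.

```python
def kept_magnitude(matrix):
    """Total magnitude kept when each group of 4 columns retains its 2 largest values."""
    total = 0
    for row in matrix:
        for start in range(0, len(row), 4):
            group = sorted((abs(v) for v in row[start:start + 4]), reverse=True)
            total += group[0] + group[1]
    return total


W = [
    [9, 8, 7, 6, 1, 1, 1, 1],
    [5, 5, 5, 5, 0, 0, 0, 0],
]
# Candidate permutation: spread the large columns across both groups of four.
perm = [0, 1, 4, 5, 2, 3, 6, 7]
W_perm = [[row[i] for i in perm] for row in W]

before = kept_magnitude(W)
after = kept_magnitude(W_perm)
```

In this toy case the permuted ordering keeps more magnitude (`after > before`) because the large columns no longer compete within the same group of four. A search procedure tries to find orderings that raise such a score.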
If you want to use the GPU to accelerate the permutation search process, we recommend installing Apex with the permutation search CUDA extension:
```
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--permutation_search" ./
```
If you want to disable the permutation search process, pass `allow_permutation=False` to the `init_model_for_pruning` function. For example:
```
ASP.init_model_for_pruning(model, mask_calculator="m4n2_1d", verbosity=2, whitelist=[torch.nn.Linear, torch.nn.Conv2d], allow_recompute_mask=False, allow_permutation=False)
```
Note that when using multiple GPUs, the same random seed should be set on all GPUs so that permutation search generates the same results on each of them. The library implements a `set_identical_seed` function in `permutation_lib.py`, which is called by the ASP library. We still suggest that users set an identical random seed in their own code when using multiple GPUs, for example:
```
import torch
import numpy
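import random
identical_seed = 1  # use the same value on every GPU/process; 1 is the default of set_identical_seed
torch.manual_seed(identical_seed)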
torch.cuda.manual_seed_all(identical_seed)
numpy.random.seed(identical_seed)
random.seed(identical_seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```
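The sketch below shows, with the standard library only, why the seed matters. Two simulated processes each draw a random candidate permutation, and they agree only because they were seeded identically. The function `propose_permutation` is a hypothetical stand-in for a randomized search step, not an ASP API.

```python
import random


def propose_permutation(num_channels, seed):
    """Draw one random channel ordering from a generator seeded with `seed`."""
    rng = random.Random(seed)
    perm = list(range(num_channels))
    rng.shuffle(perm)
    return perm


# Two "ranks" that were given the same seed propose the same permutation.
rank0 = propose_permutation(8, seed=1)
rank1 = propose_permutation(8, seed=1)
same_on_all_ranks = rank0 == rank1
```

If each process used a different seed, the replicas could settle on different channel orderings and their weights would no longer correspond to one another.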
## Reference Papers
For more details about sparsity support on NVIDIA Ampere GPUs with Sparse Tensor Cores, refer to our [white paper](https://arxiv.org/abs/2104.08378).
```
@article{mishra2021accelerating,
title={Accelerating sparse deep neural networks},
author={Mishra, Asit and Latorre, Jorge Albericio and Pool, Jeff and Stosic, Darko and Stosic, Dusan and Venkatesh, Ganesh and Yu, Chong and Micikevicius, Paulius},
journal={arXiv preprint arXiv:2104.08378},
year={2021}
}
```
For details about sparsity with permutation, refer to our [paper](https://proceedings.neurips.cc/paper/2021/hash/6e8404c3b93a9527c8db241a1846599a-Abstract.html) published at the *Thirty-fifth Conference on Neural Information Processing Systems* (**NeurIPS 2021**):
```
@article{pool2021channel,
title={Channel Permutations for N:M Sparsity},
author={Pool, Jeff and Yu, Chong},
journal={Advances in Neural Information Processing Systems},
volume={34},
year={2021}
}
```
import types import types
import torch import torch
from .sparse_masklib import create_mask from .sparse_masklib import create_mask
from .permutation_lib import Permutation
torchvision_imported=True torchvision_imported=True
try: try:
...@@ -9,6 +10,11 @@ except ImportError: ...@@ -9,6 +10,11 @@ except ImportError:
print("[ASP][Warning] torchvision cannot be imported.") print("[ASP][Warning] torchvision cannot be imported.")
torchvision_imported=False torchvision_imported=False
import json
import os
import string
import time
def eligible_modules(model, whitelist_layer_types, allowed_layer_names, disallowed_layer_names): def eligible_modules(model, whitelist_layer_types, allowed_layer_names, disallowed_layer_names):
eligible_modules_list = [] eligible_modules_list = []
for name, mod in model.named_modules(): for name, mod in model.named_modules():
...@@ -18,19 +24,25 @@ def eligible_modules(model, whitelist_layer_types, allowed_layer_names, disallow ...@@ -18,19 +24,25 @@ def eligible_modules(model, whitelist_layer_types, allowed_layer_names, disallow
eligible_modules_list.append((name, mod)) eligible_modules_list.append((name, mod))
return eligible_modules_list return eligible_modules_list
class ASP: class ASP:
__model = None __model = None
__verbosity = 0 __verbosity = 0
__optimizer = None __optimizer = None
__sparse_parameters = [] __sparse_parameters = []
__calculate_mask = None __calculate_mask = None
__allow_permutation = True
__all_parameters = []
__save_permutation_graph = False
__permutation_output_dir = ''
@classmethod @classmethod
def init_model_for_pruning(cls, model, mask_calculator="m4n2_1d", def init_model_for_pruning(cls, model, mask_calculator="m4n2_1d",
verbosity=3, verbosity=3,
whitelist=[torch.nn.Linear, torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d], whitelist=[torch.nn.Linear, torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d],
allowed_layer_names=None, disallowed_layer_names=[], allowed_layer_names=None, disallowed_layer_names=[],
allow_recompute_mask=False, custom_layer_dict={}): allow_recompute_mask=False, custom_layer_dict={},
allow_permutation=True):
"""Call this method to modify your model to take advantage of sparse matrix multiplication. """Call this method to modify your model to take advantage of sparse matrix multiplication.
Note that this call alone only augments the model with additional buffers needed for sparse MMA, Note that this call alone only augments the model with additional buffers needed for sparse MMA,
it does not enable use of sparse MMA. it does not enable use of sparse MMA.
...@@ -63,12 +75,14 @@ class ASP: ...@@ -63,12 +75,14 @@ class ASP:
allow_recompute_mask If True, stores pruned values so that dense weights can be restored. allow_recompute_mask If True, stores pruned values so that dense weights can be restored.
Pruned weights are stored in CPU memory, hence this option does not increase GPU memory usage. Pruned weights are stored in CPU memory, hence this option does not increase GPU memory usage.
custom_layer_dict Dictionary of additional layer parameters to sparsify. e.g. {CustomLinear: ['weight']} custom_layer_dict Dictionary of additional layer parameters to sparsify. e.g. {CustomLinear: ['weight']}
allow_permutation If True, allow the input channel permutation to ease the influence of weight pruning.
[Future] Support for allow_recompute_mask can be removed, it is not part of sparse inference recipe -- AKM. [Future] Support for allow_recompute_mask can be removed, it is not part of sparse inference recipe.
""" """
assert (cls.__model is None), "ASP has been initialized already." assert (cls.__model is None), "ASP has been initialized already."
cls.__model = model cls.__model = model
cls.__verbosity = verbosity cls.__verbosity = verbosity
cls.__allow_permutation = allow_permutation
if isinstance(mask_calculator, str): if isinstance(mask_calculator, str):
def create_mask_from_pattern(param): def create_mask_from_pattern(param):
...@@ -91,6 +105,28 @@ class ASP: ...@@ -91,6 +105,28 @@ class ASP:
for module_type in whitelist: for module_type in whitelist:
assert (module_type in sparse_parameter_list), "Module %s :: Don't know how to sparsify module." % module.dtype() assert (module_type in sparse_parameter_list), "Module %s :: Don't know how to sparsify module." % module.dtype()
if allow_permutation: # find all named modules, extract parameters and decorate, used for offline permutation in K dim
for module_name, module in model.named_modules():
module_type_str = str(type(module)).split("\'")[1]
if module_type_str == 'torch.nn.modules.container.Sequential' or module_type_str.startswith('torchvision.models'):
# filter out the 'torch.nn.modules.container.Sequential' type and the whole model, like 'torchvision.models.vgg.VGG'
continue
for p_name, p in module.named_parameters():
cls.__all_parameters.append((module_name, module, p_name, p))
if module_type_str == 'torch.nn.modules.batchnorm.BatchNorm2d':
# need to get the running_mean and running_var from model.state_dict(), as they are not the learnable parameters
module_mean_name = module_name + '.running_mean'
module_var_name = module_name + '.running_var'
for param_key in model.state_dict():
if module_mean_name == param_key or module_var_name == param_key:
cls.__all_parameters.append((module_name, module, param_key.split(".")[-1], model.state_dict()[param_key]))
# add the __permutation_output_dir field to save the intermediate results for permutation
cls.__permutation_output_dir = '.'
# Set the corresponding params from ASP class to the Permutation class
Permutation.set_permutation_params_from_asp(cls.__model, cls.__sparse_parameters, cls.__all_parameters)
# Set the identical random seed for all GPUs to make sure the same results generated in permutation search
Permutation.set_identical_seed()
# find all sparse modules, extract sparse parameters and decorate # find all sparse modules, extract sparse parameters and decorate
def add_sparse_attributes(module_name, module): def add_sparse_attributes(module_name, module):
sparse_parameters = sparse_parameter_list[type(module)] sparse_parameters = sparse_parameter_list[type(module)]
...@@ -123,6 +159,19 @@ class ASP: ...@@ -123,6 +159,19 @@ class ASP:
for name, sparse_module in eligible_modules(model, tuple(whitelist), allowed_layer_names, disallowed_layer_names): for name, sparse_module in eligible_modules(model, tuple(whitelist), allowed_layer_names, disallowed_layer_names):
add_sparse_attributes(name, sparse_module) add_sparse_attributes(name, sparse_module)
@classmethod
def already_init_asp_model(cls):
"""Call this method to check whether ASP has been initialized already.
"""
if cls.__model is None:
if cls.__verbosity >= 3:
print("[ASP] ASP has not been initialized.")
return False
else:
if cls.__verbosity >= 3:
print("[ASP] ASP has been initialized already.")
return True
@classmethod @classmethod
def init_optimizer_for_pruning(cls, optimizer): def init_optimizer_for_pruning(cls, optimizer):
"""Call this method to monkey patch optimizer step function so that masks can be applied to """Call this method to monkey patch optimizer step function so that masks can be applied to
...@@ -157,6 +206,38 @@ class ASP: ...@@ -157,6 +206,38 @@ class ASP:
If init(...) was called with allow_recompute_mask=False AND sparsity is disabled, pruned field can be None. If init(...) was called with allow_recompute_mask=False AND sparsity is disabled, pruned field can be None.
""" """
with torch.no_grad(): with torch.no_grad():
if cls.__allow_permutation:
# Step 1: use the Torch.FX library to build the graph
# Step 2: permutation search with the customized kernel
# Note: the Torch.FX graph must be built on a single GPU
# The simplest approach that needs no user intervention:
# A. try to build the graph from the distributed wrapper of the original model (model.module)
# B. if that raises an error, build it from the non-distributed original model
start_time_build_offline_permutation_graph = time.perf_counter()
try:
offline_permutation_fx_graph, success_in_build_offline_permutation_graph = Permutation.build_offline_permutation_graph(cls.__model.module, dump_fx_graph=cls.__save_permutation_graph, save_dumped_fx_graph=os.path.join(cls.__permutation_output_dir, 'model_offline_permutation_graph.json'))
print("\n[compute_sparse_masks] build offline permutation graph on distributed model.")
except AttributeError:
offline_permutation_fx_graph, success_in_build_offline_permutation_graph = Permutation.build_offline_permutation_graph(cls.__model, dump_fx_graph=cls.__save_permutation_graph, save_dumped_fx_graph=os.path.join(cls.__permutation_output_dir, 'model_offline_permutation_graph.json'))
print("\n[compute_sparse_masks] build offline permutation graph on none-distributed model.")
duration_build_offline_permutation_graph = time.perf_counter() - start_time_build_offline_permutation_graph
print("[compute_sparse_masks] Take {:.4f} seconds to finish build_offline_permutation_graph function.".format(duration_build_offline_permutation_graph))
# Step 3: off-line permutation to avoid the runtime overhead in deployment
if success_in_build_offline_permutation_graph:
start_time_apply_offline_permutation = time.perf_counter()
try:
Permutation.apply_offline_permutation(cls.__model.module, fx_graph=offline_permutation_fx_graph)
print("\n[compute_sparse_masks] apply offline permutation on distributed model.")
except AttributeError:
Permutation.apply_offline_permutation(cls.__model, fx_graph=offline_permutation_fx_graph)
print("\n[compute_sparse_masks] apply offline permutation on none-distributed model.")
duration_apply_offline_permutation = time.perf_counter() - start_time_apply_offline_permutation
print("[compute_sparse_masks] Take {:.4f} seconds to finish apply_offline_permutation function.\n".format(duration_apply_offline_permutation))
else:
print("[compute_sparse_masks] skip applying offline permutation because there is no valid offline_permutation_fx_graph.")
# Finally, once permutation search and offline permutation are done, hand the model back to ASP to generate the normal structured sparse mask
for module_name, module, p_name, p, mask, pruned in cls.__sparse_parameters: for module_name, module, p_name, p, mask, pruned in cls.__sparse_parameters:
if mask.sum() < mask.numel(): # when recalculating masks if mask.sum() < mask.numel(): # when recalculating masks
# restore dense parameter if allow_recompute_mask is enabled # restore dense parameter if allow_recompute_mask is enabled
...@@ -170,7 +251,7 @@ class ASP: ...@@ -170,7 +251,7 @@ class ASP:
p.mul_(mask) # in-place multiplication, so pruned weights are 0-values, hence checkpoint will have 0s for pruned weights p.mul_(mask) # in-place multiplication, so pruned weights are 0-values, hence checkpoint will have 0s for pruned weights
if cls.__verbosity >= 2: if cls.__verbosity >= 2:
print("[ASP] Enabled %.2f%% sparsity for %s::%s of size=%s and type=%s" % (100.0*mask.sum()/mask.numel(), module_name, p_name, str(p.size()), str(p.dtype))) print("[ASP] Enabled %.2f%% sparsity for %s::%s of size=%s and type=%s" % (100.0-100.0*mask.sum()/mask.numel(), module_name, p_name, str(p.size()), str(p.dtype)))
@classmethod @classmethod
def restore_pruned_weights(cls): def restore_pruned_weights(cls):
...@@ -215,3 +296,17 @@ class ASP: ...@@ -215,3 +296,17 @@ class ASP:
cls.init_optimizer_for_pruning(optimizer) cls.init_optimizer_for_pruning(optimizer)
cls.compute_sparse_masks() cls.compute_sparse_masks()
@classmethod
def set_permutation_saving_params(cls, allow_permutation=True, save_permutation_graph=False, permutation_output_dir='.'):
"""This function is used to set the permutation saving related parameters in ASP class and inside of the Permutation class."""
print("\n[ASP][set_permutation_saving_param] Set permutation saving related parameters")
print("\n[set_permutation_saving_param] Set permutation saving related parameters")
cls.__allow_permutation = allow_permutation
print("[set_permutation_saving_param]\t Allow permutation: {}".format(cls.__allow_permutation))
cls.__save_permutation_graph = save_permutation_graph
print("[set_permutation_saving_param]\t Save permutation graphs: {}".format(cls.__save_permutation_graph))
cls.__permutation_output_dir = permutation_output_dir
print("[set_permutation_saving_param]\t Permutation graphs saving dir: {}".format(cls.__permutation_output_dir))
Permutation.set_permutation_saving_params(allow_permutation, save_permutation_graph, permutation_output_dir)
import os
import torch
import json
import string
import time
try:
from .permutation_search_kernels import accelerated_search_for_good_permutation, sum_after_2_to_4
print("[ASP][Info] permutation_search_kernels can be imported.")
except ImportError:
print("[ASP][Warning] permutation_search_kernels cannot be imported.")
print("[ASP][Warning] If you want to accelerate the permutation search process by GPU, please build APEX by following the instructions at https://github.com/NVIDIA/apex/blob/master/apex/contrib/sparsity/README.md")
def convert_fx_node_name(fx_node_name):
converted_fx_node_name = fx_node_name
converted_fx_node_name = converted_fx_node_name.replace('_', '.')
return converted_fx_node_name
def get_node_parent_children(fx_node):
# get node parent list, and convert node name to module name
node_parent_name_converted = []
if len(fx_node.all_input_nodes) > 0:
node_parent = fx_node.all_input_nodes
for item in node_parent:
converted_item = convert_fx_node_name(item.name)
node_parent_name_converted.append(converted_item)
else:
node_parent = ['None']  # list('None') would split the string into single characters
node_parent_name_converted.append('None')
# get node children list, and convert node name to module name
node_children_name_converted = []
if len(list(fx_node.users.keys())) > 0:
node_children = list(fx_node.users.keys())
for item in node_children:
converted_item = convert_fx_node_name(item.name)
node_children_name_converted.append(converted_item)
else:
node_children = ['None']  # list('None') would split the string into single characters
node_children_name_converted.append('None')
return node_parent_name_converted, node_children_name_converted
class Permutation:
__model = None
__sparse_parameters = []
__allow_permutation = False
__all_parameters = []
__save_permutation_graph = False
__permutation_output_dir = ''
@classmethod
def set_permutation_params_from_asp(cls, model, sparse_parameters, all_parameters):
"""Set the parameters that the permutation needs, as passed in from the ASP class."""
print("\n[set_permutation_params_from_asp] Set permutation needed parameters")
cls.__model = model
cls.__sparse_parameters = sparse_parameters
cls.__all_parameters = all_parameters
@classmethod
def set_identical_seed(cls, identical_seed=1):
print("\n[set_identical_seed] Set the identical seed: {:} for all GPUs to make sure the same results generated in permutation search".format(identical_seed))
torch.manual_seed(identical_seed)
torch.cuda.manual_seed_all(identical_seed)
import numpy as np
import random
np.random.seed(identical_seed)
random.seed(identical_seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
@classmethod
def set_permutation_saving_params(cls, allow_permutation=False, save_permutation_graph=False, permutation_output_dir='.'):
"""This function is used to set the permutation saving related parameters."""
print("\n[permutation_lib][set_permutation_saving_param] Set permutation saving related parameters")
cls.__allow_permutation = allow_permutation
print("[set_permutation_saving_param]\t Allow permutation: {}".format(cls.__allow_permutation))
cls.__save_permutation_graph = save_permutation_graph
print("[set_permutation_saving_param]\t Save permutation graphs: {}".format(cls.__save_permutation_graph))
cls.__permutation_output_dir = permutation_output_dir
print("[set_permutation_saving_param]\t Permutation graphs saving dir: {}".format(cls.__permutation_output_dir))
@classmethod
def apply_offline_permutation(cls, model, fx_graph):
"""Apply the offline permutation to each node according to the whole network graph built with Torch.FX."""
print("\n[apply_offline_permutation] Offline permutation for each node according to the whole network graph built with Torch.FX")
# First, convert the sparse masks to all-ones dense masks
cls.transfer_to_dense_mask()
for node_name in fx_graph.keys():
node_module_type = fx_graph.get(node_name).get('module_type')
# check whether the current layer can be permuted as planned, e.g., the flatten layer in VGG changes the shape and breaks the permutation chain
# only 'is_node_real_parents_K_permuted' needs to be checked, because 'is_node_real_parents_C_permuted' has no influence on the children
node_real_parents = fx_graph.get(node_name).get('real_parents')
is_node_real_parents_K_permuted = True
if node_real_parents is not None: # filter out the 'unique_siblings' item
for real_parent_item in node_real_parents:
if fx_graph.get(real_parent_item).get('permutation_type') in ['K', 'KC']:
if fx_graph.get(real_parent_item).get('k_permuted') == 'False':
is_node_real_parents_K_permuted = False
if fx_graph[node_name]['permutation_type'] == 'KC': # intermediate Conv, FC
C_permutation_sequence = cls.fetch_C_permutation_sequence_value(node_name, fx_graph)
K_permutation_sequence = cls.fetch_K_permutation_sequence_value(node_name, fx_graph)
print("\n[apply_offline_permutation] node_name: \'{:}\', node module type: \'{:}\', need to do offline permutation in K and C dims.".format(node_name, node_module_type))
if is_node_real_parents_K_permuted == True:
fx_graph[node_name]['c_permuted'] = str(cls.apply_permutation_in_C_dim(node_name, C_permutation_sequence))
fx_graph[node_name]['k_permuted'] = str(cls.apply_permutation_in_K_dim(node_name, K_permutation_sequence))
else:
print("[apply_offline_permutation][warning] node_name: \'{:}\', its real parents have trouble in permutation in K dim, so skip the offline permutation in C dim.".format(node_name, node_module_type))
fx_graph[node_name]['k_permuted'] = str(cls.apply_permutation_in_K_dim(node_name, K_permutation_sequence))
elif fx_graph[node_name]['permutation_type'] == 'K': # BN, first layer Conv/FC
K_permutation_sequence = cls.fetch_K_permutation_sequence_value(node_name, fx_graph)
print("\n[apply_offline_permutation] node_name: \'{:}\', node module type: \'{:}\', need to do offline permutation in K dim.".format(node_name, node_module_type))
if is_node_real_parents_K_permuted == True:
fx_graph[node_name]['k_permuted'] = str(cls.apply_permutation_in_K_dim(node_name, K_permutation_sequence))
else: # for BN, if the previous Conv cannot do permutation in K dim, then no need to do permutation in K dim for this BN
print("[apply_offline_permutation][warning] node_name: \'{:}\', its real parents have trouble in permutation in K dim, so skip the offline permutation in K dim.".format(node_name, node_module_type))
elif fx_graph[node_name]['permutation_type'] == 'C': # last layer FC/Conv
C_permutation_sequence = cls.fetch_C_permutation_sequence_value(node_name, fx_graph)
print("\n[apply_offline_permutation] node_name: \'{:}\', node module type: \'{:}\', need to do offline permutation in C dim.".format(node_name, node_module_type))
if is_node_real_parents_K_permuted == True:
fx_graph[node_name]['c_permuted'] = str(cls.apply_permutation_in_C_dim(node_name, C_permutation_sequence))
else:
print("[apply_offline_permutation][warning] node_name: \'{:}\', its real parents have trouble in permutation in K dim, so skip the offline permutation in C dim.".format(node_name, node_module_type))
if cls.__save_permutation_graph:
cls.save_graph_to_json(fx_graph, save_dumped_graph_path_with_name=os.path.join(cls.__permutation_output_dir, './model_graph_apply_offline_permutation.json')) # save the intermediate graph as JSON file for debugging
return fx_graph
@classmethod
def transfer_to_dense_mask(cls):
"""Call this method to transfer the sparse mask to all-one dense mask."""
with torch.no_grad():
for module_name, module, p_name, p, mask, pruned in cls.__sparse_parameters:
mask.fill_(1)
@classmethod
def fetch_C_permutation_sequence_value(cls, node_name, fx_graph):
"""This function is used to fetch the permutation sequence value in C dim from the unique_siblings record."""
# C_permutation_sequence is the corresponding 'permutation_sequence' value stored in the fx_graph.get('unique_siblings') item which contains node_name
unique_siblings_groups = fx_graph.get('unique_siblings').get('name')
unique_siblings_groups_permutation_sequence = fx_graph.get('unique_siblings').get('permutation_sequence')
item_index = 0
fetched_C_permutation_sequence = []
for item in unique_siblings_groups:
if node_name in item:
fetched_C_permutation_sequence = unique_siblings_groups_permutation_sequence[item_index]
item_index = item_index + 1
return fetched_C_permutation_sequence
@classmethod
def fetch_K_permutation_sequence_value(cls, node_name, fx_graph):
"""This function is used to fetch the permutation sequence value in K dim from the unique_siblings record."""
# K_permutation_sequence is its real_children's corresponding 'permutation_sequence' value stored in the fx_graph.get('unique_siblings') item which contains real_children name
# we have the assumption that all the real children are in one unique_sibling group, so should share the same permutation_sequence value
unique_siblings_groups = fx_graph.get('unique_siblings').get('name')
unique_siblings_groups_permutation_sequence = fx_graph.get('unique_siblings').get('permutation_sequence')
node_real_children = fx_graph.get(node_name).get('real_children')
fetched_K_permutation_sequence = []
if len(node_real_children) > 0:
node_representative_child = node_real_children[0]
fetched_K_permutation_sequence = cls.fetch_C_permutation_sequence_value(node_representative_child, fx_graph)
return fetched_K_permutation_sequence
@classmethod
def apply_permutation_in_C_dim(cls, node_name, permutation_sequence):
"""Apply the permutation to a node in the C dim. (Only the weight of the node needs to be handled.)"""
print("[apply_permutation_in_C_dim] Permutation for node: \'{:}\' in C dim".format(node_name))
if len(permutation_sequence) == 0:
print("[apply_permutation_in_C_dim] the permutation sequence is empty, fail to apply permutation in C dim.")
return False
is_node_in_sparse_parameters = False
success_permutation = False
for module_name, module, p_name, p, mask, pruned in cls.__sparse_parameters:
processed_module_name = ''.join(c for c in module_name if c not in string.punctuation).lower()
processed_node_name = ''.join(c for c in node_name if c not in string.punctuation).lower()
distributed_node_name = 'module.' + node_name
processed_distributed_node_name = 'module.' + processed_node_name
if (module_name == node_name) or (module_name == distributed_node_name) or (processed_module_name == processed_node_name) or (processed_module_name == processed_distributed_node_name): # Inception-V3, module_name: Conv2d_2a_3x3.conv, node_name: conv2d.1a.3x3.conv
print("[apply_permutation_in_C_dim] find the node: \'{:}\' in cls.__sparse_parameters, succeed to apply permutation in C dim.".format(node_name))
is_node_in_sparse_parameters = True
temp_weight = torch.zeros_like(p)
temp_weight.copy_(p[:, permutation_sequence, ...])
p.data.copy_(temp_weight)
success_permutation = True
if is_node_in_sparse_parameters == False:
# A special case: if the node itself is not in sparse_module_names but one of its real_siblings is, the node does not take part in the permutation search, but it may still need the offline permutation applied in C dim, using the permutation sequence found for its real_siblings in sparse_module_names
try:
for module_name_from_all_parameters, module_from_all_parameters, p_name_from_all_parameters, p_from_all_parameters in cls.__all_parameters:
if ((node_name == module_name_from_all_parameters) or ('module.' + node_name == module_name_from_all_parameters)) and p_name_from_all_parameters == "weight":
print("[apply_permutation_in_C_dim] cannot find the node: \'{:}\' in cls.__sparse_parameters, but can find in cls.__all_parameters.".format(node_name))
temp_weight = torch.zeros_like(p_from_all_parameters)
temp_weight.copy_(p_from_all_parameters[:, permutation_sequence, ...])
p_from_all_parameters.data.copy_(temp_weight)
success_permutation = True
print("[apply_permutation_in_C_dim] cannot find the node: \'{:}\' in cls.__sparse_parameters, after trying with cls.__all_parameters, succeed to apply permutation in C dim.".format(node_name))
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt and SystemExit
success_permutation = False
print("[apply_permutation_in_C_dim] cannot find the node: \'{:}\' in cls.__sparse_parameters, after trying with cls.__all_parameters, still fail to apply permutation in C dim.".format(node_name))
return success_permutation
@classmethod
def apply_permutation_in_K_dim(cls, node_name, permutation_sequence):
"""Apply the permutation to a node in the K dim. (The weight/bias/running_mean/running_var of the node all need to be handled.)"""
print("[apply_permutation_in_K_dim] Permutation for node: \'{:}\' in K dim".format(node_name))
if len(permutation_sequence) == 0:
print("[apply_permutation_in_K_dim] the permutation sequence is empty, fail to apply permutation in K dim.")
return False
is_node_in_all_parameters = False
success_permutation = False
for module_name, module, p_name, p in cls.__all_parameters:
processed_module_name = ''.join(c for c in module_name if c not in string.punctuation).lower()
processed_node_name = ''.join(c for c in node_name if c not in string.punctuation).lower()
distributed_node_name = 'module.' + node_name
processed_distributed_node_name = 'module.' + processed_node_name
if (module_name == node_name) or (module_name == distributed_node_name) or (processed_module_name == processed_node_name) or (processed_module_name == processed_distributed_node_name): # Inception-V3, module_name: Conv2d_2a_3x3.conv, node_name: conv2d.1a.3x3.conv
print("[apply_permutation_in_K_dim] find the node: \'{:}\' with \'{:}\' in cls.__all_parameters, may succeed to apply permutation in K dim.".format(node_name, p_name))
is_node_in_all_parameters = True
temp_weight = torch.zeros_like(p)
if p.shape[0] != len(permutation_sequence):
print("[apply_permutation_in_K_dim][warning] the node: \'{:}\' with shape: \'{:}\', cannot match the size of permutation sequence with len: \'{:}\', fail to apply permutation in K dim.".format(node_name, p.shape, len(permutation_sequence)))
success_permutation = False
else:
print("[apply_permutation_in_K_dim] the node: \'{:}\' with shape: \'{:}\', can match the size of permutation sequence with len: \'{:}\', succeed to apply permutation in K dim.".format(node_name, p.shape, len(permutation_sequence)))
temp_weight.copy_(p[permutation_sequence, ...])
p.data.copy_(temp_weight)
success_permutation = True
if is_node_in_all_parameters == False:
print("[apply_permutation_in_K_dim] cannot find the node: \'{:}\' in cls.__all_parameters, fail to apply permutation in K dim.".format(node_name))
success_permutation = False
return success_permutation
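
The two apply functions index the weight as `p[permutation_sequence, ...]` (K dim, output channels) and `p[:, permutation_sequence, ...]` (C dim, input channels). A minimal sketch of why the two must be paired between a parent and its child, using plain Python lists instead of tensors; the helper names and the tiny matrices are illustrative assumptions, not part of the ASP lib:

```python
def matvec(w, x):
    """Multiply a K x C matrix (list of rows) by a length-C vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def permute_rows(w, perm):
    """K dim: reorder output channels, like p[perm, ...]."""
    return [w[i] for i in perm]


def permute_cols(w, perm):
    """C dim: reorder input channels, like p[:, perm, ...]."""
    return [[row[j] for j in perm] for row in w]


parent = [[1, 2], [3, 4], [5, 6]]   # K=3, C=2
child = [[1, 0, 2], [0, 1, 3]]      # K=2, C=3 (the child's C equals the parent's K)
x = [1, 1]
perm = [2, 0, 1]

# Output of the original two-layer stack.
reference = matvec(child, matvec(parent, x))

# Permute the parent in K and the child in C with the same sequence.
permuted = matvec(permute_cols(child, perm), matvec(permute_rows(parent, perm), x))
```

Both `reference` and `permuted` evaluate to the same vector, which is the property the offline permutation relies on: reordering the child's input channels is harmless as long as the parent's output channels are reordered identically.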
@classmethod
def build_offline_permutation_graph(cls, model, dump_fx_graph=False, save_dumped_fx_graph='./model_offline_permutation_graph.json'):
"""This function is used to refine the whole network graph built with Torch.FX with some extra information needed for offline permutation."""
print("\n[build_offline_permutation_graph] Further refine the model graph built by Torch.FX for offline permutation")
# extract the output_dir, so all the intermediate fx_graph can be saved under that path
extract_output_dir=os.path.split(save_dumped_fx_graph)[0]
cls.__permutation_output_dir = extract_output_dir
fx_graph, success_in_build_fx_graph = cls.build_fx_graph(model, dump_fx_graph=dump_fx_graph, save_dumped_fx_graph=save_dumped_fx_graph)
if success_in_build_fx_graph:
fx_graph_after_find_real_parents = cls.find_real_parents(fx_graph)
fx_graph_after_find_real_children = cls.find_real_children(fx_graph_after_find_real_parents)
fx_graph_after_find_real_siblings = cls.find_real_siblings(fx_graph_after_find_real_children)
fx_graph_after_extract_all_unique_siblings = cls.extract_all_unique_siblings(fx_graph_after_find_real_siblings)
fx_graph_after_init_permutation_flag = cls.init_permutation_flag(fx_graph_after_extract_all_unique_siblings)
start_time_search_for_good_permutation = time.perf_counter()
fx_graph_after_search_for_good_permutation = cls.search_for_good_permutation(fx_graph_after_init_permutation_flag)
duration_search_for_good_permutation = time.perf_counter() - start_time_search_for_good_permutation
print("\n[build_offline_permutation_graph] Take {:.4f} seconds to finish search_for_good_permutation function.".format(duration_search_for_good_permutation))
else:
fx_graph_after_search_for_good_permutation = {}
return fx_graph_after_search_for_good_permutation, success_in_build_fx_graph
# Note that the apply_offline_permutation step cannot be folded into the search_for_good_permutation step above,
# because a real_parent node must be permuted offline in the K dim according to the permutation sequence found for its real_children.
# When search_for_good_permutation visits that node, however, its real_children have not yet been processed.
if cls.__save_permutation_graph:
cls.save_graph_to_json(fx_graph_after_search_for_good_permutation, save_dumped_graph_path_with_name=os.path.join(cls.__permutation_output_dir, './model_graph_build_offline_permutation_graph.json')) # save the intermediate graph as JSON file for debugging
return fx_graph_after_search_for_good_permutation, success_in_build_fx_graph
@classmethod
def search_for_good_permutation(cls, fx_graph):
"""This function is used to:
1. search for the good permutation sequence for each node weight, or each siblings_group weights by calling the permutation search kernels as ASP extension.
2. add the searched permutation sequence for each node according to the whole network graph built with Torch.FX."""
print("\n[search_for_good_permutation] Search for the good permutation sequence for each node according to the whole network graph built with Torch.FX")
unique_siblings_groups = fx_graph.get('unique_siblings').get('name')
unique_siblings_groups_module_type = fx_graph.get('unique_siblings').get('module_type')
unique_siblings_groups_permutation_sequence = []
item_index = 0
for unique_siblings_group in unique_siblings_groups: # loop through all unique siblings groups that must share a permutation sequence
print("\n[search_for_good_permutation] this unique_siblings_group has {:} real siblings: \'{:}\', with module type: \'{:}\'.".format(len(unique_siblings_group), unique_siblings_group, unique_siblings_groups_module_type[item_index]))
item_index = item_index + 1
# concat the weight for layers in the same unique_siblings_group
matrix_group = None
for node_name in unique_siblings_group:
node_module_type = fx_graph.get(node_name).get('module_type')
print("[search_for_good_permutation] try to merge the weight for node: \'{:}\', with module type: \'{:}\'.".format(node_name, node_module_type))
is_node_in_sparse_parameters = False
node_weight = torch.zeros(0)
for module_name, module, p_name, p, mask, pruned in cls.__sparse_parameters:
processed_module_name = ''.join(c for c in module_name if c not in string.punctuation).lower()
processed_node_name = ''.join(c for c in node_name if c not in string.punctuation).lower()
distributed_node_name = 'module.' + node_name
processed_distributed_node_name = 'module.' + processed_node_name
if (module_name == node_name) or (module_name == distributed_node_name) or (processed_module_name == processed_node_name) or (processed_module_name == processed_distributed_node_name): # Inception-V3, module_name: Conv2d_2a_3x3.conv, node_name: conv2d.1a.3x3.conv
module_type_from_sparse_parameters = str(type(module)) # e.g. <class 'torch.nn.modules.conv.Conv2d'>
module_type_from_sparse_parameters = module_type_from_sparse_parameters[8:-2]
print("[search_for_good_permutation] find the node: \'{:}\' in cls.__sparse_parameters, module type match: \'{:}\'.".format(node_name, node_module_type==module_type_from_sparse_parameters))
is_node_in_sparse_parameters = True
node_weight = torch.zeros_like(p)
node_weight.copy_(p)
# Need to handle the concat for layers with different R & S
shape = node_weight.shape
# 1d-tensor
if len(shape) == 1:
node_weight = node_weight.view(1, shape[0])
# 2d-tensor (out, in), i.e. (K, C)
elif len(shape) == 2:
node_weight = node_weight.view(shape[0], shape[1])
# 3d-tensor (batch, in, out)
elif len(shape) == 3:
node_weight = node_weight.view(shape[0]*shape[1], shape[2])
# 4d-tensor (out, in, h, w), i.e. (K, C, R, S)
elif len(shape) == 4:
# convs
node_weight = node_weight.permute(2,3,0,1).contiguous().view(shape[2]*shape[3]*shape[0], shape[1])
if is_node_in_sparse_parameters == False:
print("[search_for_good_permutation] cannot find the node: \'{:}\' in cls.__sparse_parameters, no need to merge its weight for permutation.".format(node_name))
else:
if matrix_group is None:
matrix_group = node_weight
else:
try:
if matrix_group.dim() == node_weight.dim():
matrix_group = torch.cat((matrix_group, node_weight), dim=0) # concat the weights in K dimension, and keep the same C dimension
else: # e.g. when try to merge the Conv and FC layers
print("[search_for_good_permutation] matrix_group dim: {:} is not matched with node_weight dim: {:}.".format(matrix_group.dim(), node_weight.dim()))
print("[search_for_good_permutation] matrix_group shape: \'{:}\' is not matched with node_weight shape: \'{:}\'.".format(matrix_group.size(), node_weight.size()))
if matrix_group.dim() < node_weight.dim():
while node_weight.dim() - matrix_group.dim() > 0:
matrix_group = matrix_group.unsqueeze(matrix_group.dim())
else:
while matrix_group.dim() - node_weight.dim() > 0:
node_weight = node_weight.unsqueeze(node_weight.dim())
print("[search_for_good_permutation] matrix_group shape: \'{:}\' is now matched with node_weight shape: \'{:}\'.".format(matrix_group.size(), node_weight.size()))
matrix_group = torch.cat((matrix_group, node_weight), dim=0) # concat the weights in K dimension, and keep the same C dimension
except Exception:
print("[search_for_good_permutation][warning] cannot merge the weight for node: \'{:}\', with its weight shape: \'{:}\', the matrix_group shape: \'{:}\'.".format(node_name, node_weight.size(), matrix_group.size()))
continue
print("[search_for_good_permutation] have merged the weight for node: \'{:}\', with its weight shape: \'{:}\', the matrix_group shape: \'{:}\'.".format(node_name, node_weight.size(), matrix_group.size()))
if matrix_group is None: # none of the nodes in this unique_siblings_group was found in cls.__sparse_parameters
input_channel_num = 0
print("\n[search_for_good_permutation] init the all-zero list with length \'{:}\' for permutation search sequence of this unique_siblings_group.".format(input_channel_num))
print("[search_for_good_permutation] no need to search the permutation_sequence for empty matrix_group.")
permutation_sequence = [0 for n in range(input_channel_num)]
unique_siblings_groups_permutation_sequence.append(permutation_sequence)
continue
else:
input_channel_num = matrix_group.size()[1]
print("\n[search_for_good_permutation] init the all-zero list with length \'{:}\' for permutation search sequence of this unique_siblings_group.".format(input_channel_num))
permutation_sequence = [0 for n in range(input_channel_num)]
# automatic check for skipping the permutation search process
original_magnitude = (torch.abs(matrix_group)).sum(dtype=torch.float64)
pruned_magnitude = sum_after_2_to_4(matrix_group.cpu().detach().numpy())
diff_ratio = abs(original_magnitude - pruned_magnitude)/original_magnitude
epsilon = 1e-3
print("\n[search_for_good_permutation] Original element abs sum: {:}, Pruned element abs sum: {:}, Diff ratio: {:}".format(original_magnitude, pruned_magnitude, diff_ratio))
if diff_ratio < epsilon:
print("[search_for_good_permutation] Original element abs sum is almost same as the pruned element abs sum, further permutation search will not help, skipping!")
print("[search_for_good_permutation] Change the all-zero permutation search sequence to a sequential permutation search sequence.")
permutation_sequence = [n for n in range(input_channel_num)]
unique_siblings_groups_permutation_sequence.append(permutation_sequence)
continue
else:
print("[search_for_good_permutation] Original element abs sum is different from the pruned element abs sum, further permutation search will help, continue with the permutation search!")
# Call the permutation search CUDA kernels provided as an ASP extension.
# Users can select a preferred search strategy by passing a valid 'search_options' dictionary,
# or implement their own customized 'accelerated_search_for_good_permutation' function.
search_options = {}
# No.1 Strategy: Exhaustive Search
# search_options['strategy'] = 'exhaustive'
# search_options['stripe_group_size'] = 8
# search_options['escape_attempts'] = 100
# No.2 Strategy: Progressive Channel Swap Search
# search_options['strategy'] = 'progressive channel swap'
# search_options['progressive_search_time_limit'] = 10
# search_options['improvement_threshold'] = 1e-9
# No.3 Strategy: User Defined Search
# search_options['strategy'] = 'user defined'
# The permutation search takes too long for a matrix_group with a large number of channels,
# so switch from Exhaustive Search to Progressive Channel Swap Search based on the input matrix_group size.
if input_channel_num > 2048:
search_options['strategy'] = 'progressive channel swap'
search_options['progressive_search_time_limit'] = 120
search_options['improvement_threshold'] = 1e-9
print("[search_for_good_permutation] Change to Progressive Channel Swap Search with a {} second limit, because the input channel number {} is too large and would lead to a very long permutation search time with Exhaustive Search.".format(search_options['progressive_search_time_limit'], input_channel_num))
start_time_accelerated_search_for_good_permutation = time.perf_counter()
permutation_sequence = accelerated_search_for_good_permutation(matrix_group, options=search_options)
duration_accelerated_search_for_good_permutation = time.perf_counter() - start_time_accelerated_search_for_good_permutation
print("[search_for_good_permutation] Take {:.4f} seconds to finish accelerated_search_for_good_permutation function.".format(duration_accelerated_search_for_good_permutation))
unique_siblings_groups_permutation_sequence.append(permutation_sequence)
fx_graph['unique_siblings']['permutation_sequence'] = unique_siblings_groups_permutation_sequence
if cls.__save_permutation_graph:
cls.save_graph_to_json(fx_graph, save_dumped_graph_path_with_name=os.path.join(cls.__permutation_output_dir, './model_graph_search_for_good_permutation.json')) # save the intermediate graph as JSON file for debugging
return fx_graph
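
Before calling the search kernels, `search_for_good_permutation` compares the total weight magnitude with the magnitude that survives 2:4 pruning, and skips the search when the relative difference is below `epsilon = 1e-3`. A minimal sketch of that check on plain Python lists; `sum_after_2_to_4_sketch` is a hypothetical stand-in that assumes the real `sum_after_2_to_4` keeps the two largest magnitudes in every group of four along a row, and the zero-magnitude guard is an addition that the code above does not have:

```python
def sum_after_2_to_4_sketch(matrix):
    """Sum of magnitudes kept when every group of 4 along a row keeps its 2 largest."""
    total = 0.0
    for row in matrix:
        for start in range(0, len(row), 4):
            group = sorted((abs(v) for v in row[start:start + 4]), reverse=True)
            total += sum(group[:2])
    return total


def permutation_search_is_worthwhile(matrix, epsilon=1e-3):
    """Return False when 2:4 pruning already keeps (almost) all of the magnitude."""
    original = sum(abs(v) for row in matrix for v in row)
    if original == 0:
        # Assumption: an all-zero matrix cannot benefit from a permutation.
        return False
    pruned = sum_after_2_to_4_sketch(matrix)
    return abs(original - pruned) / original >= epsilon


already_sparse = [[1.0, 0.0, 2.0, 0.0]]   # 2:4 pruning removes nothing
dense = [[1.0, 1.0, 1.0, 1.0]]            # 2:4 pruning removes half of the magnitude
```

With these inputs, `already_sparse` does not need a search, while `dense` does, which corresponds to the `continue` and the fall-through branches of the check above.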
@classmethod
def init_permutation_flag(cls, fx_graph):
"""This function is used to init the permutation flag for each node according to the whole network graph built with Torch.FX."""
print("\n[init_permutation_flag] Init the permutation flag for each node according to the whole network graph built with Torch.FX")
sparse_module_names = []
processed_sparse_module_names = [] # Inception-V3, module_name: Conv2d_2a_3x3.conv, node_name: conv2d.1a.3x3.conv
for module_name, module, p_name, p, mask, pruned in cls.__sparse_parameters:
sparse_module_names.append(module_name)
processed_module_name = ''.join(c for c in module_name if c not in string.punctuation).lower()
processed_sparse_module_names.append(processed_module_name)
for node_name in fx_graph.keys():
processed_node_name = ''.join(c for c in node_name if c not in string.punctuation).lower()
distributed_node_name = 'module.' + node_name
processed_distributed_node_name = 'module.' + processed_node_name
node_module_type = fx_graph.get(node_name).get('module_type')
if node_module_type in ['torch.nn.modules.conv.Conv2d', 'torch.nn.modules.linear.Linear']:
node_parents = fx_graph.get(node_name).get('parents')
node_children = fx_graph.get(node_name).get('children')
node_real_parents = fx_graph.get(node_name).get('real_parents')
node_real_children = fx_graph.get(node_name).get('real_children')
node_groups_param = fx_graph.get(node_name).get('groups_param')
is_node_real_children_in_sparse_parameters = False
is_node_real_children_has_group_conv = False
for real_child_item in node_real_children:
processed_real_child_item = ''.join(c for c in real_child_item if c not in string.punctuation).lower()
distributed_real_child_item = 'module.' + real_child_item
processed_distributed_real_child_item = 'module.' + processed_real_child_item
if (real_child_item in sparse_module_names) or (processed_real_child_item in processed_sparse_module_names) or (distributed_real_child_item in sparse_module_names) or (processed_distributed_real_child_item in processed_sparse_module_names):
is_node_real_children_in_sparse_parameters = True
if (fx_graph.get(real_child_item).get('groups_param') not in ['None', '1']):
is_node_real_children_has_group_conv = True
is_node_real_parents_has_group_conv = False
for real_parent_item in node_real_parents:
# Note: we assume that if any item of real_parents needs to be permuted in the C or K dim, then the corresponding flag should be set.
# The items of real_parents may not all share the same 'permutation_type' (e.g., one of them may be a Group Conv),
# which is why 'is_node_real_parents_has_group_conv' also has to be checked.
if (fx_graph.get(real_parent_item).get('groups_param') not in ['None', '1']):
is_node_real_parents_has_group_conv = True
# If the node itself is in sparse_module_names or one of its real_children in sparse_module_names, then it may need the offline permutation
if ((node_name in sparse_module_names) or (processed_node_name in processed_sparse_module_names) or (distributed_node_name in sparse_module_names) or (processed_distributed_node_name in processed_sparse_module_names)) or (is_node_real_children_in_sparse_parameters == True):
if node_groups_param not in ['None', '1']:
# for Group Conv, disable the permutation in 'C' and 'K' dim
fx_graph[node_name]['permutation_type'] = 'None'
elif ('x' in node_parents) or ((node_name not in sparse_module_names) and (processed_node_name not in processed_sparse_module_names) and (distributed_node_name not in sparse_module_names) and (processed_distributed_node_name not in processed_sparse_module_names)):
# for the first Conv/FC (it is connected to the 'x' node or is itself not in sparse_module_names), or a Conv/FC that is not compatible with NVIDIA's TC, only permute in the K dim
if is_node_real_children_has_group_conv == False:
fx_graph[node_name]['permutation_type'] = 'K'
fx_graph[node_name]['k_permuted'] = 'False'
else: # if node real_children contains Group Conv, disable the permutation for node in 'K' dim
fx_graph[node_name]['permutation_type'] = 'None'
elif ('output' in node_children) or (is_node_real_children_in_sparse_parameters == False):
# for the last FC/Conv (it is connected to the 'output' node or to a node which is not in sparse_module_names), only permute in the C dim
if is_node_real_parents_has_group_conv == False:
fx_graph[node_name]['permutation_type'] = 'C'
fx_graph[node_name]['c_permuted'] = 'False'
else: # if node real_parents contains Group Conv, disable the permutation for node in 'C' dim
fx_graph[node_name]['permutation_type'] = 'None'
else:
if (is_node_real_parents_has_group_conv == False) and (is_node_real_children_has_group_conv == False):
fx_graph[node_name]['permutation_type'] = 'KC'
fx_graph[node_name]['k_permuted'] = 'False'
fx_graph[node_name]['c_permuted'] = 'False'
elif is_node_real_parents_has_group_conv == True: # TODO: if node real_parents contains Group Conv, disable the permutation for node in 'C' dim
fx_graph[node_name]['permutation_type'] = 'K'
fx_graph[node_name]['k_permuted'] = 'False'
else: # if node real_children contains Group Conv, disable the permutation for node in 'K' dim
fx_graph[node_name]['permutation_type'] = 'C'
fx_graph[node_name]['c_permuted'] = 'False'
else:
fx_graph[node_name]['permutation_type'] = 'None'
elif node_module_type in ['torch.nn.modules.batchnorm.BatchNorm2d']:
node_real_parents = fx_graph.get(node_name).get('real_parents')
is_node_real_parents_need_K_permutation = False
is_node_real_parents_has_group_conv = False
for real_parent_item in node_real_parents:
# Note: we assume that if any item of real_parents needs to be permuted in the K dim, then the corresponding flag should be set.
# In most cases a BN follows only one Conv, so this should be OK for now.
if fx_graph.get(real_parent_item).get('permutation_type') in ['K', 'KC']:
is_node_real_parents_need_K_permutation = True
if (fx_graph.get(real_parent_item).get('groups_param') not in ['None', '1']):
is_node_real_parents_has_group_conv = True
node_real_children = fx_graph.get(node_name).get('real_children')
is_node_real_children_in_sparse_parameters = False
for real_child_item in node_real_children:
processed_real_child_item = ''.join(c for c in real_child_item if c not in string.punctuation).lower()
distributed_real_child_item = 'module.' + real_child_item
processed_distributed_real_child_item = 'module.' + processed_real_child_item
if (real_child_item in sparse_module_names) or (processed_real_child_item in processed_sparse_module_names) or (distributed_real_child_item in sparse_module_names) or (processed_distributed_real_child_item in processed_sparse_module_names):
is_node_real_children_in_sparse_parameters = True
# First, make sure the BN is not the last one (i.e., not connected to an FC/Conv node that is missing from sparse_module_names). Then:
# if the real_parents of the BN node are in sparse_module_names, it may need the offline permutation,
# or if the real_parents of the BN node only need to be permuted in the K dim.
if (is_node_real_children_in_sparse_parameters == True) and (is_node_real_parents_need_K_permutation == True):
if (is_node_real_parents_has_group_conv == False) and (is_node_real_parents_need_K_permutation == True):
fx_graph[node_name]['permutation_type'] = 'K'
fx_graph[node_name]['k_permuted'] = 'False'
else: # if node real_parents contains Group Conv or does not need permutation in 'K' dim, disable the permutation for node in 'K' dim
fx_graph[node_name]['permutation_type'] = 'None'
else:
fx_graph[node_name]['permutation_type'] = 'None'
else:
fx_graph[node_name]['permutation_type'] = 'None'
# Special case: the node itself is not in sparse_module_names, but one of its real_siblings is. Such a node is skipped by the permutation search, yet it may still need the offline permutation in the C dim, using the permutation sequence found for its real_siblings in sparse_module_names.
# This is done as post-processing, because adding it to the logic above would make that logic too complex.
# Post-processing Step No.1:
print("\n[init_permutation_flag] Post-processing Step No.1.")
node_change_permutation_due_to_siblings = []
for node_name in fx_graph.keys():
node_real_siblings = fx_graph.get(node_name).get('real_siblings')
if node_real_siblings is not None:
is_node_real_siblings_needs_C_permutation = False
for real_sibling_item in node_real_siblings:
if fx_graph.get(real_sibling_item).get('permutation_type') in ['C', 'KC']:
is_node_real_siblings_needs_C_permutation = True
if is_node_real_siblings_needs_C_permutation == True:
print("[init_permutation_flag] node_name: \'{:}\', one of its real siblings need do offline permutation in C dim.".format(node_name))
node_original_permutation_type = fx_graph.get(node_name).get('permutation_type')
if node_original_permutation_type in ['C', 'KC']:
print("[init_permutation_flag] node_name: \'{:}\', its original permutation: \'{:}\' already includes C dim, no need to do No.1 post-processing change.".format(node_name, node_original_permutation_type))
elif node_original_permutation_type == 'None':
fx_graph[node_name]['permutation_type'] = 'C'
print("[init_permutation_flag] node_name: \'{:}\', change its original permutation: \'{:}\' to new permutation: 'C'.".format(node_name, node_original_permutation_type))
node_change_permutation_due_to_siblings.append(node_name)
elif node_original_permutation_type == 'K':
fx_graph[node_name]['permutation_type'] = 'KC'
print("[init_permutation_flag] node_name: \'{:}\', change its original permutation: \'{:}\' to new permutation: 'KC'.".format(node_name, node_original_permutation_type))
node_change_permutation_due_to_siblings.append(node_name)
# Post-processing Step No.2:
print("\n[init_permutation_flag] Post-processing Step No.2.")
for node_name in fx_graph.keys():
node_real_children = fx_graph.get(node_name).get('real_children')
node_module_type = fx_graph.get(node_name).get('module_type')
if (node_real_children is not None) and (node_module_type in ['torch.nn.modules.conv.Conv2d', 'torch.nn.modules.linear.Linear', 'torch.nn.modules.batchnorm.BatchNorm2d']):
is_node_real_children_has_node_change_permutation = False
for real_child_item in node_real_children:
if real_child_item in node_change_permutation_due_to_siblings:
is_node_real_children_has_node_change_permutation = True
if is_node_real_children_has_node_change_permutation == True:
print("[init_permutation_flag] node_name: \'{:}\', one of its real children has changed permutation due to its siblings.".format(node_name))
node_original_permutation_type = fx_graph.get(node_name).get('permutation_type')
if node_original_permutation_type in ['K', 'KC']:
print("[init_permutation_flag] node_name: \'{:}\', its original permutation: \'{:}\' already includes K dim, no need to do No.2 post-processing change.".format(node_name, node_original_permutation_type))
elif node_original_permutation_type == 'None':
fx_graph[node_name]['permutation_type'] = 'K'
print("[init_permutation_flag] node_name: \'{:}\', change its original permutation: \'{:}\' to new permutation: 'K'.".format(node_name, node_original_permutation_type))
elif node_original_permutation_type == 'C':
fx_graph[node_name]['permutation_type'] = 'KC'
print("[init_permutation_flag] node_name: \'{:}\', change its original permutation: \'{:}\' to new permutation: 'KC'.".format(node_name, node_original_permutation_type))
if cls.__save_permutation_graph:
cls.save_graph_to_json(fx_graph, save_dumped_graph_path_with_name=os.path.join(cls.__permutation_output_dir, './model_graph_init_permutation_flag.json')) # save the intermediate graph as JSON file for debugging
return fx_graph
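
`init_permutation_flag` and the functions above all compare Torch.FX node names with module names through the same four-way test (exact, `module.`-prefixed, normalized, normalized plus prefix). A minimal sketch of that test factored into one helper; `normalize` and `names_match` are hypothetical names introduced here, and the sample names are illustrative:

```python
import string


def normalize(name):
    """Strip punctuation and lowercase, as done for processed_module_name."""
    return ''.join(c for c in name if c not in string.punctuation).lower()


def names_match(module_name, node_name):
    """Mirror the four comparisons used to match a module name to an FX node name."""
    processed_module = normalize(module_name)
    processed_node = normalize(node_name)
    return (module_name == node_name
            or module_name == 'module.' + node_name          # DistributedDataParallel prefix
            or processed_module == processed_node            # e.g. Inception-V3 naming
            or processed_module == 'module.' + processed_node)


exact_or_distributed = names_match('module.layer1.conv1', 'layer1.conv1')
inception_style = names_match('Conv2d_2a_3x3.conv', 'conv2d.2a.3x3.conv')
different_layers = names_match('layer1.conv1', 'layer1.conv2')
distributed_inception = names_match('module.Conv2d_2a_3x3.conv', 'conv2d.2a.3x3.conv')
```

One observation from this sketch: `string.punctuation` contains `.`, so a normalized module name never contains the dot in `'module.'`, and the fourth comparison does not appear able to succeed. Under that reading, a `module.`-prefixed name that also needs normalization (`distributed_inception` here) is not matched.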
@classmethod
def extract_all_unique_siblings(cls, fx_graph):
"""This function is used to extract all unique siblings for the whole network graph built with Torch.FX."""
print("\n[extract_all_unique_siblings] Extract all unique siblings for the whole network graph built with Torch.FX")
all_unique_siblings_name = []
all_unique_siblings_module_type = []
for node_name in fx_graph.keys():
fx_graph[node_name]['node_type'] = 'network_node' # use the 'node_type' to divide the real nodes apart from the auxiliary info node, like 'unique_siblings' node
node_module_type = fx_graph.get(node_name).get('module_type')
node_real_siblings = fx_graph.get(node_name).get('real_siblings')
node_real_siblings_module_type = fx_graph.get(node_name).get('real_siblings_module_type')
if node_real_siblings == []:
print("[extract_all_unique_siblings] node_name: \'{:}\', node module type: \'{:}\', has no real siblings.".format(node_name, node_module_type))
# for a Conv/FC layer without real_siblings, insert the layer itself as a unique_siblings group
if node_module_type in ['torch.nn.modules.conv.Conv2d', 'torch.nn.modules.linear.Linear']:
# direct insert will change the real_siblings info for the node in the fx_graph
node_real_siblings_with_node_itself = node_real_siblings.copy()
node_real_siblings_with_node_itself.insert(0, node_name)
node_real_siblings_module_type_with_node_itself = node_real_siblings_module_type.copy()
node_real_siblings_module_type_with_node_itself.insert(0, node_module_type)
all_unique_siblings_name.append(node_real_siblings_with_node_itself)
all_unique_siblings_module_type.append(node_real_siblings_module_type_with_node_itself)
else:
print("[extract_all_unique_siblings] node_name: \'{:}\', node module type: \'{:}\', has {:} real siblings: \'{:}\'.".format(node_name, node_module_type, len(node_real_siblings), node_real_siblings))
# Two duplicated siblings lists contain the same node names.
# If the node name is already in one of the unique_siblings_name lists, the real_siblings of this node duplicate that list.
# Otherwise, insert [node_name + real_siblings] as a new unique_siblings_name list.
has_include_siblings = False
for unique_siblings_item in all_unique_siblings_name:
if node_name in unique_siblings_item:
has_include_siblings = True
if has_include_siblings == False:
# direct insert will change the real_siblings info for the node in the fx_graph
node_real_siblings_with_node_itself = node_real_siblings.copy()
node_real_siblings_with_node_itself.insert(0, node_name)
node_real_siblings_module_type_with_node_itself = node_real_siblings_module_type.copy()
node_real_siblings_module_type_with_node_itself.insert(0, node_module_type)
all_unique_siblings_name.append(node_real_siblings_with_node_itself)
all_unique_siblings_module_type.append(node_real_siblings_module_type_with_node_itself)
fx_graph['unique_siblings'] = {}
fx_graph['unique_siblings']['name'] = all_unique_siblings_name
fx_graph['unique_siblings']['module_type'] = all_unique_siblings_module_type
fx_graph['unique_siblings']['node_type'] = 'auxiliary_info_node'
if cls.__save_permutation_graph:
cls.save_graph_to_json(fx_graph, save_dumped_graph_path_with_name=os.path.join(cls.__permutation_output_dir, './model_graph_extract_all_unique_siblings.json')) # save the intermediate graph as JSON file for debugging
return fx_graph
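
The grouping done by `extract_all_unique_siblings` can be sketched without Torch.FX. The example below assumes every node is a Conv/FC layer and takes a plain dict mapping each node name to its `real_siblings` list; `unique_sibling_groups` and the node names are illustrative, not part of the ASP lib:

```python
def unique_sibling_groups(real_siblings):
    """Collect each node together with its siblings once, in first-seen order."""
    groups = []
    for node_name, siblings in real_siblings.items():
        # Skip a node that an earlier group already covers.
        if any(node_name in group for group in groups):
            continue
        # Build a new list so the input lists are left unchanged.
        groups.append([node_name] + list(siblings))
    return groups


real_siblings = {
    'conv_a': ['conv_b'],   # conv_a and conv_b share a real parent
    'conv_b': ['conv_a'],
    'fc': [],               # a layer without siblings forms its own group
}
groups = unique_sibling_groups(real_siblings)
```

Each resulting group is a set of layers that must share one permutation sequence in the C dim, which is why the search later concatenates their weights before looking for a permutation.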
@classmethod
def find_real_siblings(cls, fx_graph):
"""This function is used to find all siblings for each node according to the whole network graph built with Torch.FX.
we need to find siblings recursively, because siblings may have siblings via other parents we don't know about.
"""
print("\n[find_real_siblings] Find all siblings for each node according to the whole network graph built with Torch.FX")
for node_name in fx_graph.keys():
node_real_siblings_name = []
node_real_siblings_module_type = []
node_real_parents = fx_graph.get(node_name).get('real_parents')
node_module_type = fx_graph.get(node_name).get('module_type')
if node_module_type not in ['torch.nn.modules.conv.Conv2d', 'torch.nn.modules.linear.Linear']:
print("[find_real_siblings] node_name: \'{:}\', node module type: \'{:}\', has no real siblings.".format(node_name, node_module_type))
else:
print("[find_real_siblings] node_name: \'{:}\', node module type: \'{:}\', may have real siblings.".format(node_name, node_module_type))
# sibling means the nodes share the same real parent
for real_parent_item in node_real_parents:
for real_child_item in fx_graph.get(real_parent_item).get('real_children'):
if real_child_item != node_name:
sibling_module_type = fx_graph.get(real_child_item).get('module_type')
print("[find_real_siblings] node_name: \'{:}\', has one real sibling: \'{:}\', its real sibling module type: \'{:}\'.".format(node_name, real_child_item, sibling_module_type))
node_real_siblings_name.append(real_child_item)
node_real_siblings_module_type.append(sibling_module_type)
# remove the duplicated real siblings
exclusive_node_real_siblings_name = []
exclusive_node_real_siblings_module_type = []
item_index = 0
duplicated_real_siblings = 0
for item in node_real_siblings_name:
if item not in exclusive_node_real_siblings_name:
exclusive_node_real_siblings_name.append(item)
exclusive_node_real_siblings_module_type.append(node_real_siblings_module_type[item_index])
else:
duplicated_real_siblings = duplicated_real_siblings + 1
item_index = item_index + 1
if duplicated_real_siblings > 0:
print("[find_real_siblings] node_name: \'{:}\', remove {:} duplicated real siblings.".format(node_name, duplicated_real_siblings))
fx_graph[node_name]['real_siblings'] = exclusive_node_real_siblings_name
fx_graph[node_name]['real_siblings_module_type'] = exclusive_node_real_siblings_module_type
if cls.__save_permutation_graph:
cls.save_graph_to_json(fx_graph, save_dumped_graph_path_with_name=os.path.join(cls.__permutation_output_dir, './model_graph_find_real_siblings.json')) # save the intermediate graph as JSON file for debugging
return fx_graph
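
As the comment in `find_real_siblings` puts it, siblings are nodes that share a real parent. A minimal sketch of that lookup on a hand-written graph dict; `find_siblings` and the three-node graph are illustrative assumptions, and the duplicate removal is folded into the loop rather than done in a second pass:

```python
def find_siblings(graph, node_name):
    """Return the other real children of this node's real parents, without duplicates."""
    siblings = []
    for parent in graph[node_name]['real_parents']:
        for child in graph[parent]['real_children']:
            if child != node_name and child not in siblings:
                siblings.append(child)
    return siblings


graph = {
    'stem': {'real_parents': [], 'real_children': ['branch_a', 'branch_b']},
    'branch_a': {'real_parents': ['stem'], 'real_children': []},
    'branch_b': {'real_parents': ['stem'], 'real_children': []},
}
siblings_of_branch_a = find_siblings(graph, 'branch_a')
siblings_of_stem = find_siblings(graph, 'stem')
```

Here `branch_a` and `branch_b` both read the output of `stem`, so they see the same input channels and end up as each other's siblings, while `stem` has none.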
@classmethod
def recursive_find_real_children(cls, node_name, fx_graph):
"""This function is used to recursively find the real children for each node according to the whole network graph built with Torch.FX.
Used as the sub-function of find_real_children.
"""
node_real_children_name = []
node_real_children_module_type = []
if node_name in fx_graph.keys(): # this check could be removed, because node_name always comes from the 'children' item of some node in the fx_graph
node_children = fx_graph.get(node_name).get('children')
node_module_type = fx_graph.get(node_name).get('module_type')
has_visit_children_num = 0
has_real_children_num = 0
sub_node_need_recursive_search = []
while has_visit_children_num < len(node_children):
for child_name in node_children:
if child_name != 'output': # 'output' node has no 'module_type'
child_module_type = fx_graph.get(child_name).get('module_type')
if child_module_type in ['torch.nn.modules.conv.Conv2d', 'torch.nn.modules.linear.Linear']:
print("[recursive_find_real_children] node_name: \'{:}\', has one real child: \'{:}\', its real child module type: \'{:}\'.".format(node_name, child_name, child_module_type))
node_real_children_name.append(child_name)
node_real_children_module_type.append(child_module_type)
has_real_children_num = has_real_children_num + 1
else:
print("[recursive_find_real_children] node_name: \'{:}\', its child: \'{:}\' with module type: \'{:}\', needs recursive search.".format(node_name, child_name, child_module_type))
sub_node_need_recursive_search.append(child_name)
else:
print("[recursive_find_real_children] node_name: \'{:}\', its child: \'{:}\' with no module type, is not its real child.".format(node_name, child_name))
has_visit_children_num = has_visit_children_num + 1
if len(sub_node_need_recursive_search) > 0:
for sub_node in sub_node_need_recursive_search:
if fx_graph.get(sub_node).get('real_children') == []:
sub_node_real_children_name, sub_node_real_children_module_type = cls.recursive_find_real_children(sub_node, fx_graph)
else:
# if the sub_node has already found its 'real_children', no recursive search is needed
sub_node_real_children_name = fx_graph.get(sub_node).get('real_children')
sub_node_real_children_module_type = fx_graph.get(sub_node).get('real_children_module_type')
node_real_children_name.extend(sub_node_real_children_name)
node_real_children_module_type.extend(sub_node_real_children_module_type)
return node_real_children_name, node_real_children_module_type
@classmethod
def find_real_children(cls, fx_graph):
"""This function is used to find the real children for each node according to the whole network graph built with Torch.FX.
For example:
The real children of Conv is the subsequent Conv/FC.
The real children of BN or other no-need-permutation layers is the subsequent Conv/FC.
"""
print("\n[find_real_children] Find the real children for each node according to the whole network graph built with Torch.FX")
from sys import version_info
if version_info.major == 3 and version_info.minor >= 8:
reversible_fx_graph_keys = fx_graph.keys()
else: # a 'dict_keys' object is not reversible before Python 3.8
reversible_fx_graph_keys = list(fx_graph.keys())
for node_name in reversed(reversible_fx_graph_keys): # as the optimization, we need to find the real children from back to front, to use the already saved 'real_children'
node_real_children_name = []
node_real_children_module_type = []
node_children = fx_graph.get(node_name).get('children')
node_module_type = fx_graph.get(node_name).get('module_type')
if node_module_type not in ['torch.nn.modules.conv.Conv2d', 'torch.nn.modules.linear.Linear']:
print("\n[find_real_children] node_name: \'{:}\', node module type: \'{:}\', children num: {:}, recursive to find real children.".format(node_name, node_module_type, len(node_children)))
node_real_children_name, node_real_children_module_type = cls.recursive_find_real_children(node_name, fx_graph)
else: # Quick method, but cannot get the real children for no-need-permutation layers like BN
print("\n[find_real_children] node_name: \'{:}\', node module type: \'{:}\', children num: {:}, can directly find real children.".format(node_name, node_module_type, len(node_children)))
# if the node is in the 'real_parents' list of the other node, then the other node is the real children for this node
for other_node_name in fx_graph.keys():
if (other_node_name != node_name) and (node_name in fx_graph.get(other_node_name).get('real_parents')):
child_module_type = fx_graph.get(other_node_name).get('module_type')
if child_module_type in ['torch.nn.modules.conv.Conv2d', 'torch.nn.modules.linear.Linear']:
print("[find_real_children] node_name: \'{:}\', has one real child: \'{:}\', its real child module type: \'{:}\'.".format(node_name, other_node_name, child_module_type))
node_real_children_name.append(other_node_name)
node_real_children_module_type.append(child_module_type)
# remove the duplicated real children
exclusive_node_real_children_name = []
exclusive_node_real_children_module_type = []
item_index = 0
duplicated_real_children = 0
for item in node_real_children_name:
if item not in exclusive_node_real_children_name:
exclusive_node_real_children_name.append(item)
exclusive_node_real_children_module_type.append(node_real_children_module_type[item_index])
else:
duplicated_real_children = duplicated_real_children + 1
item_index = item_index + 1
if duplicated_real_children > 0:
print("[find_real_children] node_name: \'{:}\', remove {:} duplicated real children.".format(node_name, duplicated_real_children))
fx_graph[node_name]['real_children'] = exclusive_node_real_children_name
fx_graph[node_name]['real_children_module_type'] = exclusive_node_real_children_module_type
if cls.__save_permutation_graph:
cls.save_graph_to_json(fx_graph, save_dumped_graph_path_with_name=os.path.join(cls.__permutation_output_dir, './model_graph_find_real_children.json')) # save the intermediate graph as JSON file for debugging
return fx_graph
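# The notion of "real children" can be illustrated on a toy graph. This is a
# simplified sketch, not the library's implementation: it uses plain recursion
# for every node (no reuse of saved results, no 'real_parents' shortcut), and
# the graph below is a hypothetical example with the same dictionary layout.

```python
PERMUTABLE = ("torch.nn.modules.conv.Conv2d", "torch.nn.modules.linear.Linear")


def real_children(node_name, graph):
    """Return the nearest Conv2d/Linear descendants of node_name."""
    found = []
    for child in graph[node_name]["children"]:
        if child == "output":  # the 'output' node has no module type
            continue
        if graph[child].get("module_type") in PERMUTABLE:
            found.append(child)
        else:
            # BN, ReLU, add, ...: look through the layer to what follows it
            found.extend(real_children(child, graph))
    return list(dict.fromkeys(found))  # order-preserving de-duplication


CONV = "torch.nn.modules.conv.Conv2d"
toy_graph = {
    "conv1": {"children": ["bn1"], "module_type": CONV},
    "bn1": {"children": ["relu"], "module_type": "torch.nn.modules.batchnorm.BatchNorm2d"},
    "relu": {"children": ["conv2", "conv3"], "module_type": "torch.nn.modules.activation.ReLU"},
    "conv2": {"children": ["add"], "module_type": CONV},
    "conv3": {"children": ["add"], "module_type": CONV},
    "add": {"children": ["fc"]},  # call_function node: no module_type
    "fc": {"children": ["output"], "module_type": "torch.nn.modules.linear.Linear"},
}

print(real_children("conv1", toy_graph))  # ['conv2', 'conv3']
print(real_children("conv2", toy_graph))  # ['fc']
```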
@classmethod
def find_real_parents(cls, fx_graph):
"""This function is used to find the real parents for each node according to the whole network graph built with Torch.FX.
For example:
The real parent of BN is the previous Conv/FC.
The real parent of Conv is the previous Conv/FC.
"""
print("\n[find_real_parents] Find the real parents for each node according to the whole network graph built with Torch.FX")
for node_name in fx_graph.keys():
node_real_parents_name = []
node_real_parents_module_type = []
node_parents = fx_graph.get(node_name).get('parents')
print("[find_real_parents] node_name: \'{:}\', parents num: {:}".format(node_name, len(node_parents)))
has_visit_parent_num = 0
while has_visit_parent_num < len(node_parents):
for parent_name in node_parents:
if parent_name in fx_graph:
parent_module_type = fx_graph.get(parent_name).get('module_type')
if parent_module_type in ['torch.nn.modules.conv.Conv2d', 'torch.nn.modules.linear.Linear']:
print("[find_real_parents] node_name: \'{:}\', has one real parent: \'{:}\', its real parent module type: \'{:}\'.".format(node_name, parent_name, parent_module_type))
node_real_parents_name.append(parent_name)
node_real_parents_module_type.append(parent_module_type)
else:
print("[find_real_parents] node_name: \'{:}\', has one/several real parent(s): \'{:}\', its real parent module type: \'{:}\'.".format(node_name, fx_graph[parent_name]['real_parents'], fx_graph[parent_name]['real_parents_module_type']))
for real_parent_item in fx_graph[parent_name]['real_parents']:
node_real_parents_name.append(real_parent_item)
for real_parent_module_type_item in fx_graph[parent_name]['real_parents_module_type']:
node_real_parents_module_type.append(real_parent_module_type_item)
else:
print("[find_real_parents] node_name: \'{:}\', has no real parent because this is the first node.".format(node_name))
has_visit_parent_num = has_visit_parent_num + 1
# remove the duplicated real parents
exclusive_node_real_parents_name = []
exclusive_node_real_parents_module_type = []
exclusive_node_real_parents_groups_param = []
item_index = 0
duplicated_real_parents = 0
for item in node_real_parents_name:
if item not in exclusive_node_real_parents_name:
exclusive_node_real_parents_name.append(item)
exclusive_node_real_parents_module_type.append(node_real_parents_module_type[item_index])
exclusive_node_real_parents_groups_param.append(fx_graph.get(item).get('groups_param'))
else:
duplicated_real_parents = duplicated_real_parents + 1
item_index = item_index + 1
if duplicated_real_parents > 0:
print("[find_real_parents] node_name: \'{:}\', remove {:} duplicated real parents.".format(node_name, duplicated_real_parents))
fx_graph[node_name]['real_parents'] = exclusive_node_real_parents_name
fx_graph[node_name]['real_parents_module_type'] = exclusive_node_real_parents_module_type
fx_graph[node_name]['real_parents_groups_param'] = exclusive_node_real_parents_groups_param
if cls.__save_permutation_graph:
cls.save_graph_to_json(fx_graph, save_dumped_graph_path_with_name=os.path.join(cls.__permutation_output_dir, './model_graph_find_real_parent.json')) # save the intermediate graph as JSON file for debugging
return fx_graph
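# The forward pass above can be sketched on a toy graph. This is an
# illustrative simplification under two assumptions: the dictionary keys are in
# topological order (so a parent's result exists before its children are
# visited), and parents missing from the graph (such as the input placeholder)
# contribute nothing. The function and graph names are hypothetical.

```python
PERMUTABLE = ("torch.nn.modules.conv.Conv2d", "torch.nn.modules.linear.Linear")


def real_parents_of_all(graph):
    """Map each node name to its nearest Conv2d/Linear ancestors."""
    real = {}
    for name, node in graph.items():
        collected = []
        for parent in node["parents"]:
            if parent not in graph:  # e.g. the input placeholder
                continue
            if graph[parent].get("module_type") in PERMUTABLE:
                collected.append(parent)
            else:
                # inherit the real parents already computed for this parent
                collected.extend(real[parent])
        real[name] = list(dict.fromkeys(collected))  # drop duplicates, keep order
    return real


CONV = "torch.nn.modules.conv.Conv2d"
toy_graph = {
    "conv1": {"parents": ["x"], "module_type": CONV},
    "bn1": {"parents": ["conv1"], "module_type": "torch.nn.modules.batchnorm.BatchNorm2d"},
    "relu": {"parents": ["bn1"], "module_type": "torch.nn.modules.activation.ReLU"},
    "conv2": {"parents": ["relu"], "module_type": CONV},
    "conv3": {"parents": ["relu"], "module_type": CONV},
    "add": {"parents": ["conv2", "conv3"]},  # call_function node: no module_type
    "fc": {"parents": ["add"], "module_type": "torch.nn.modules.linear.Linear"},
}

parents = real_parents_of_all(toy_graph)
print(parents["relu"])  # ['conv1']
print(parents["fc"])    # ['conv2', 'conv3']
```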
@classmethod
def build_fx_graph(cls, model, dump_fx_graph=False, save_dumped_fx_graph='./model_fx_graph.json'):
"""This function is used to build the whole network graph with Torch.FX features."""
success = True
torch_version = str(torch.__version__)
torch_version_major = int(torch_version.split('.')[0])
torch_version_minor = int(torch_version.split('.')[1])
try:
torch_version_minimum = int(torch_version.split('.')[2])
except ValueError: # support non-standard version strings
torch_version_minimum = torch_version.split('.')[2]
print("[build_fx_graph] The torch version is: {}, version major is: {}, version minor is: {}, version minimum is: {}".format(torch_version, torch_version_major, torch_version_minor, torch_version_minimum))
if (torch_version_major, torch_version_minor) >= (1, 8): # tuple comparison also accepts 2.0 and later
print("[build_fx_graph] The Torch.FX is supported.")
else: # Torch.FX is introduced in torch 1.8.0
print("[build_fx_graph] The Torch.FX is not supported. So cannot build the Torch.FX graph.")
success = False
network_fx_graph = {}
return network_fx_graph, success
print("\n[build_fx_graph] Print the model structure with pure PyTorch function")
print(model)
print("\n[build_fx_graph] Build the module name and type dictionary")
module_name_type_dict = {}
module_name_group_conv_dict = {}
for name, mod in model.named_modules():
print("[build_fx_graph] module_name: {}, module type: {}".format(name, type(mod)))
module_name_type_dict[name] = str(type(mod)).split("\'")[1]
try:
print("[build_fx_graph] this module has \'group\' param with value: {}".format(mod.groups))
module_name_group_conv_dict[name] = str(mod.groups)
except AttributeError: # this module has no 'groups' attribute
module_name_group_conv_dict[name] = 'None'
continue
graph_module = cls.print_raw_fx_graph(model, print_tabular=True)
# keep track of children and parents for each layer (could be call_module or call_function)
print("\n[build_fx_graph] Print the children and parents relationship for each layer")
network_fx_graph = {}
for node in graph_module.graph.nodes:
if node.op == 'placeholder':
print("[build_fx_graph] This is the \'input\' node: {:}".format(node.target))
continue
elif node.op == 'get_attr':
print("[build_fx_graph] This is the \'get_attr\' node: {:}".format(node.target))
continue
elif node.op == 'call_function': # e.g. 'adaptive.avg.pool2d', 'add', 'cat', 'flatten', 'floordiv', 'getattr', 'getitem', 'hardsigmoid', 'mean', 'mul', 'relu', 'transpose'
node_parent, node_children = get_node_parent_children(node)
converted_node_name=convert_fx_node_name(node.name)
print("[build_fx_graph] This is the \'call_function\' node: {:}, its parent list: {:}, its children list: {:}".format(converted_node_name, node_parent, node_children))
network_fx_graph[converted_node_name] = {}
network_fx_graph[converted_node_name]['parents'] = node_parent
network_fx_graph[converted_node_name]['children'] = node_children
network_fx_graph[converted_node_name]['fx_op'] = 'call_function'
elif node.op == 'call_method': # e.g. 'chunk', 'contiguous', 'mean', 'size', 'unsqueeze', 'view'
node_parent, node_children = get_node_parent_children(node)
converted_node_name=convert_fx_node_name(node.name)
print("[build_fx_graph] This is the \'call_method\' node: {:}, its parent list: {:}, its children list: {:}".format(converted_node_name, node_parent, node_children))
network_fx_graph[converted_node_name] = {}
network_fx_graph[converted_node_name]['parents'] = node_parent
network_fx_graph[converted_node_name]['children'] = node_children
network_fx_graph[converted_node_name]['fx_op'] = 'call_method'
continue
elif node.op == 'call_module':
node_parent, node_children = get_node_parent_children(node)
converted_node_name=convert_fx_node_name(node.name)
# check whether the converted_node_name is same as node.target, especially for ReLU case
if converted_node_name != node.target:
print("[build_fx_graph][warning] The target name from Torch.FX is \'{:}\', the manually converted node name is \'{:}\', not the same one, choose the converted node name".format(node.target, converted_node_name))
# assume that modules sharing the same target name have the same type, because converted_node_name may not be returned by model.named_modules(), e.g. a ReLU defined in the forward function
node_type = module_name_type_dict[node.target]
print("[build_fx_graph] This is the \'call_module\' node: {:}, its parent list: {:}, its children list: {:}, its type: {:}".format(converted_node_name, node_parent, node_children, node_type))
network_fx_graph[converted_node_name] = {}
network_fx_graph[converted_node_name]['parents'] = node_parent
network_fx_graph[converted_node_name]['children'] = node_children
network_fx_graph[converted_node_name]['fx_op'] = 'call_module'
network_fx_graph[converted_node_name]['module_type'] = node_type
network_fx_graph[converted_node_name]['groups_param'] = module_name_group_conv_dict[node.target]
elif node.op == 'output':
print("[build_fx_graph] This is the \'output\' node: {:}".format(node.target))
continue
if dump_fx_graph:
print("\n[build_fx_graph] Dump the overall dict for children and parents relationship into JSON file")
cls.save_graph_to_json(network_fx_graph, save_dumped_graph_path_with_name=save_dumped_fx_graph)
return network_fx_graph, success
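# The Torch.FX availability check depends on parsing the version string.
# A minimal sketch of a tolerant parser follows, assuming only that the string
# starts with "<major>.<minor>". Comparing (major, minor) as a tuple treats
# releases such as 2.0 correctly, whereas testing the two numbers independently
# would not. `parse_major_minor` and `supports_torch_fx` are hypothetical
# helper names, not part of the ASP library.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from strings such as '1.8.0' or '2.0.1+cu117'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: {!r}".format(version))
    return int(match.group(1)), int(match.group(2))


def supports_torch_fx(version):
    # Torch.FX was introduced in torch 1.8.0
    return parse_major_minor(version) >= (1, 8)


for candidate in ("1.7.1", "1.8.0", "1.13.0a0+git1234", "2.0.1+cu117"):
    print(candidate, supports_torch_fx(candidate))
```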
@classmethod
def print_raw_fx_graph(cls, model, print_tabular=False, generate_python_code=False):
"""This function is used to print the intermediate representation (IR) - Graph representation with Torch.FX features."""
from torch.fx import symbolic_trace
# Symbolic tracing frontend - captures the semantics of the module
try:
symbolic_traced : torch.fx.GraphModule = symbolic_trace(model)
except Exception:
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
print("\n[print_raw_fx_graph] Meet the fatal fault when trying to symbolic trace the model with Torch.FX")
raise
# High-level intermediate representation (IR) - Graph representation
print("\n[print_raw_fx_graph] Print the intermediate representation (IR) with Torch.FX")
print(symbolic_traced.graph)
if print_tabular:
print("\n[print_raw_fx_graph] Print the intermediate representation (IR) with Torch.FX in a table format")
try:
symbolic_traced.graph.print_tabular()
except AttributeError: # to avoid the AttributeError: 'Graph' object has no attribute 'print_tabular'
print("[print_raw_fx_graph][Warning] \'print_tabular\' function is not supported in current Torch version. Skip!")
# Code generation - valid Python code
if generate_python_code:
print("\n[print_raw_fx_graph] Create valid Python code matching the IR/Graph's semantics with Torch.FX")
print(symbolic_traced.code)
return symbolic_traced
@classmethod
def save_graph_to_json(cls, graph, save_dumped_graph_path_with_name='./model_fx_graph.json'):
"""This function is used to save the graph into a JSON file."""
# use dumps to transfer the dict to JSON string
json_graph_str = json.dumps(graph)
with open(save_dumped_graph_path_with_name, 'w', encoding='utf-8') as dumped_graph_file:
dumped_graph_file.write(json_graph_str) # write the transferred JSON string into JSON file
#include <stdio.h>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
namespace py = pybind11;
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert %d: %s %s %d\n", (int)code, cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__device__ float group_2_to_4(float4 vals)
{
vals.x = fabs(vals.x);
vals.y = fabs(vals.y);
vals.z = fabs(vals.z);
vals.w = fabs(vals.w);
float sum0 = vals.x + vals.y;
float sum1 = vals.x + vals.z;
float sum2 = vals.x + vals.w;
float sum3 = vals.y + vals.z;
float sum4 = vals.y + vals.w;
float sum5 = vals.z + vals.w;
float best_sum0 = fmax(sum0, sum1);
float best_sum1 = fmax(sum2, sum3);
float best_sum2 = fmax(sum4, sum5);
float best_sum = fmax(fmax(best_sum0, best_sum1), best_sum2);
return best_sum;
}
inline float* float_ptr_from_numpy(py::array_t<float>& py_float)
{
return (float*)py_float.data();
}
inline unsigned int* uint_ptr_from_numpy(py::array_t<unsigned int>& py_uint)
{
return (unsigned int*)py_uint.data();
}
__global__ void subset_sum_after_2_to_4(float* matrix,
unsigned int rows,
unsigned int cols,
unsigned int start_col,
unsigned int end_col,
float* output)
{
// vectorize
float4* mat4 = (float4*) matrix;
cols /= 4;
start_col /= 4;
end_col /= 4;
// each thread in a block takes some number of rows
size_t num_rows = max((int)ceilf((float)rows / (float)blockDim.x), 1);
size_t row_offset = num_rows * threadIdx.x;
// each block takes some number of columns
size_t num_cols = (end_col - start_col) / gridDim.x;
size_t col_offset = num_cols * blockIdx.x;
start_col += col_offset;
end_col = start_col + num_cols;
float sum = 0.0f;
for ( unsigned int r = row_offset; r < row_offset + num_rows; ++r ) {
if (r < rows) {
for ( unsigned int c = start_col; c < end_col; c++ ) {
sum += group_2_to_4(mat4[r * cols + c]);
}
}
}
atomicAdd(output, sum);
}
// build the entire permute map at once
// each block handles one group of stripes
// all threads in the block handle the same permutation at the same time on different rows before moving to the next permutation
__global__ void build_permute_map(float* matrix,
unsigned int rows,
unsigned int cols,
unsigned int* stripes,
unsigned int group_width,
unsigned int* permutations,
unsigned int num_permutations,
unsigned int perm_length,
float* output,
unsigned int* best_indices)
{
// vectorize
float4* mat4 = (float4*) matrix;
cols /= 4;
// each block handles a group of stripes
unsigned int* stripe_group = (unsigned int*)&stripes[blockIdx.x*group_width];
// shared memory: 32 threads each need 16*2
extern __shared__ float pm_shared[32][32];
float4* local_stripes = (float4*)&pm_shared[threadIdx.x];
float* local_columns = (float*) &pm_shared[threadIdx.x];
float4* permuted_stripes = (float4*) &local_stripes[4];
float* permuted_columns = (float*) &local_columns[16];
// each thread handles all permutations in the row before moving on to the next row
size_t num_rows = max((int)ceilf((float)rows / (float)blockDim.x), 1);
size_t row_offset = num_rows * threadIdx.x;
for ( unsigned int r = row_offset; r < row_offset + num_rows; ++r) {
if (r >= rows)
break;
// load a row into smem
for ( unsigned int s = 0; s < group_width; ++s) {
unsigned int const stripe = stripe_group[s];
local_stripes[s] = mat4[r*cols+stripe];
}
for ( unsigned int p = 0; p < num_permutations; ++p) {
unsigned int* permutation = &permutations[p*perm_length];
float sum = 0.0f;
// permute
#pragma unroll 4
for ( unsigned int c = 0; c < group_width*4; ++c) {
permuted_columns[c] = local_columns[permutation[c]];
}
// sum 2:4
for ( unsigned int s = 0; s < group_width; ++s) {
sum += group_2_to_4(permuted_stripes[s]);
}
// update the running sum for this stripe group's permutation
atomicAdd(&output[blockIdx.x*num_permutations + p], sum);
}
}
// at this point, each permutation's sum in this stripe group has been calculated
// now, find the best option
__syncthreads();
if (threadIdx.x == 0) {
unsigned int best_permutation = 0;
float best_magnitude = output[blockIdx.x*num_permutations];
float base_magnitude = best_magnitude;
//#pragma unroll 32
for (unsigned int p = 1; p < num_permutations; ++p) {
float magnitude = output[blockIdx.x*num_permutations+p];
if (magnitude > best_magnitude) {
best_permutation = p;
best_magnitude = magnitude;
}
}
output[blockIdx.x*num_permutations] = best_magnitude - base_magnitude;
best_indices[blockIdx.x] = best_permutation;
}
}
void free_sum_after_2_to_4_memory(float** dmatrix,
float** dresult)
{
cudaFree(*dmatrix);
cudaFree(*dresult);
}
int set_up_sum_after_2_to_4_memory(float** dmatrix,
unsigned int rows,
unsigned int cols,
float** dresult)
{
static unsigned int setupRows = 0;
static unsigned int setupCols = 0;
static bool allocated = false;
int fresh_allocation = 0;
if (!allocated ||
setupRows != rows ||
setupCols != cols)
{
if (allocated)
free_sum_after_2_to_4_memory(dmatrix, dresult);
gpuErrchk(cudaMalloc( (void**) dmatrix, rows*cols*sizeof(float)));
gpuErrchk(cudaMalloc( (void**) dresult, sizeof(float)));
setupRows = rows;
setupCols = cols;
fresh_allocation = 1;
}
allocated = true;
return fresh_allocation;
}
int run_subset_sum_after_2_to_4(py::array_t<float>& py_matrix,
unsigned int rows,
unsigned int cols,
unsigned int start_col,
unsigned int end_col,
unsigned int blocks,
unsigned int threads,
py::array_t<float>& py_output)
{
static float* d_matrix;
static float* d_result;
int fresh_allocation = set_up_sum_after_2_to_4_memory(&d_matrix, rows, cols, &d_result);
float* matrix = float_ptr_from_numpy(py_matrix);
float* output = float_ptr_from_numpy(py_output);
gpuErrchk(cudaMemcpy( d_matrix, matrix, rows*cols*sizeof(float), cudaMemcpyHostToDevice ));
gpuErrchk(cudaMemset( d_result, 0, sizeof(float)));
subset_sum_after_2_to_4<<<blocks, threads>>>(d_matrix, rows, cols, start_col, end_col, d_result);
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaMemcpy( output, d_result, sizeof(float), cudaMemcpyDeviceToHost ));
return 0;
}
void set_up_permute_map_memory(float** dmatrix,
unsigned int rows,
unsigned int cols,
unsigned int** dstripes,
unsigned int num_groups,
unsigned int group_width,
unsigned int** dpermutations,
unsigned int num_permutations,
unsigned int perm_length,
float** doutput,
unsigned int** dindices,
float** hresult,
unsigned int** hindices)
{
static unsigned int setUpRows = 0;
static unsigned int setUpCols = 0;
static unsigned int setUpGroupWidth = 0;
static unsigned int setUpNumGroups = 0;
static unsigned int setUpNumPerms = 0;
static unsigned int setUpPermLength = 0;
if (setUpRows != rows ||
setUpCols != cols) {
if (*dmatrix != NULL) { gpuErrchk(cudaFree(*dmatrix)); *dmatrix = NULL; }
gpuErrchk(cudaMalloc( (void**) dmatrix, rows*cols*sizeof(float)));
}
if (setUpGroupWidth < group_width ||
setUpNumGroups < num_groups) {
if (*dstripes != NULL) { gpuErrchk(cudaFree(*dstripes)); *dstripes = NULL; }
gpuErrchk(cudaMalloc( (void**) dstripes, num_groups*group_width*sizeof(unsigned int)));
if (setUpNumGroups < num_groups) {
if (*dindices != NULL) { gpuErrchk(cudaFree(*dindices)); *dindices = NULL; }
gpuErrchk(cudaMalloc( (void**) dindices, num_groups*sizeof(unsigned int)));
if (*hindices != NULL) { free(*hindices); *hindices = NULL; }
*hindices = (unsigned int*) malloc (num_groups*sizeof(unsigned int));
}
}
if (setUpNumPerms < num_permutations ||
setUpPermLength < perm_length) {
if (*dpermutations != NULL) { gpuErrchk(cudaFree(*dpermutations)); *dpermutations = NULL; }
gpuErrchk(cudaMalloc( (void**) dpermutations, perm_length*num_permutations*sizeof(unsigned int)));
}
if (setUpNumPerms < num_permutations ||
setUpNumGroups < num_groups) {
if (*doutput != NULL) { gpuErrchk(cudaFree(*doutput)); *doutput = NULL; }
gpuErrchk(cudaMalloc( (void**) doutput, num_permutations*num_groups*sizeof(float)));
if (*hresult != NULL) { free(*hresult); *hresult = NULL; }
*hresult = (float*) malloc(num_permutations*num_groups*sizeof(float));
}
setUpRows = rows;
setUpCols = cols;
setUpGroupWidth = group_width;
setUpNumGroups = num_groups;
setUpNumPerms = num_permutations;
setUpPermLength = perm_length;
}
int run_build_permute_map(py::array_t<float>& py_matrix,
unsigned int rows,
unsigned int cols,
py::array_t<unsigned int>& py_stripes,
unsigned int num_groups,
unsigned int group_width,
py::array_t<unsigned int>& py_permutations,
//unsigned int num_permutations,
unsigned int perm_length,
py::array_t<float>& py_improvements,
py::array_t<unsigned int>& py_best_indices)
{
static float* d_matrix = NULL;
static unsigned int* d_stripes = NULL;
static unsigned int* d_permutations = NULL;
static float* d_output = NULL;
static unsigned int* d_indices = NULL;
static float* hresult = NULL;
static unsigned int* hindices = NULL;
//const unsigned int cols = py_matrix.size() / rows;
//const unsigned int num_groups = py_stripes.size() / group_width;
//const unsigned int perm_length = group_width * 4; // 2:4 sparsity - each stripe in the group is 4 elements wide
const unsigned int num_permutations = py_permutations.size() / perm_length;
const unsigned int MAX_GROUPS_PER_LAUNCH = num_permutations <= 5775 ? 1820 : 40;
const unsigned int full_launches = num_groups / MAX_GROUPS_PER_LAUNCH;
const unsigned int final_launch = num_groups % MAX_GROUPS_PER_LAUNCH;
const unsigned int launches = full_launches + (final_launch != 0 ? 1 : 0);
set_up_permute_map_memory(&d_matrix, rows, cols, &d_stripes, min(num_groups,MAX_GROUPS_PER_LAUNCH), group_width, &d_permutations, num_permutations, perm_length, &d_output, &d_indices, &hresult, &hindices);
float* matrix = float_ptr_from_numpy(py_matrix);
unsigned int* stripes = uint_ptr_from_numpy(py_stripes);
unsigned int* permutations = uint_ptr_from_numpy(py_permutations);
float* improvements = float_ptr_from_numpy(py_improvements);
unsigned int* best_indices = uint_ptr_from_numpy(py_best_indices);
gpuErrchk(cudaMemcpy( d_matrix, matrix, rows*cols*sizeof(float), cudaMemcpyHostToDevice ));
gpuErrchk(cudaMemcpy( d_permutations, permutations, num_permutations*perm_length*sizeof(unsigned int), cudaMemcpyHostToDevice ));
unsigned int group_offset = 0;
for (unsigned int l = 0; l < launches; ++l)
{
unsigned int groups_this_launch = (l < full_launches) ? MAX_GROUPS_PER_LAUNCH : final_launch;
gpuErrchk(cudaMemcpy( d_stripes, &stripes[group_offset*group_width], groups_this_launch*group_width*sizeof(unsigned int), cudaMemcpyHostToDevice ));
gpuErrchk(cudaMemset( d_output, 0, groups_this_launch*num_permutations*sizeof(float)));
gpuErrchk(cudaMemset( d_indices, 0, groups_this_launch*sizeof(unsigned int)));
unsigned int shmem = 32*(32)*sizeof(float);
build_permute_map<<<groups_this_launch, 32, shmem>>>(d_matrix, rows, cols, d_stripes, group_width, d_permutations, num_permutations, perm_length, d_output, d_indices);
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaMemcpy( hresult, d_output, num_permutations*groups_this_launch*sizeof(float), cudaMemcpyDeviceToHost ));
gpuErrchk(cudaMemcpy( hindices, d_indices, groups_this_launch*sizeof(unsigned int), cudaMemcpyDeviceToHost ));
// thread 0 stored the best improvement in the first slot of each group's output
for (unsigned int g = 0; g < groups_this_launch; ++g) {
improvements[group_offset+g] = hresult[g*num_permutations];
best_indices[group_offset+g] = hindices[g];
}
group_offset += groups_this_launch;
}
return 0;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
m.def("sum_after_2_to_4", &run_subset_sum_after_2_to_4, "matrix sum after applying 2:4 (CUDA)");
m.def("build_permute_map", &run_build_permute_map, "optimize stripe groups (CUDA)");
}
from .call_permutation_search_kernels import accelerated_search_for_good_permutation
from .permutation_utilities import sum_after_2_to_4
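# For reference, the quantity the search tries to maximize can be written in a
# few lines of pure Python. This is a sketch of the metric only, not the
# implementation exported above: for every group of 4 consecutive values in a
# row, 2:4 sparsity keeps the two largest magnitudes, which is the same value
# the CUDA helper group_2_to_4 obtains as the maximum of the six pairwise sums.
# The name `sum_after_2_to_4_reference` is hypothetical.

```python
def sum_after_2_to_4_reference(matrix):
    """Sum the two largest magnitudes of every group of 4 consecutive values per row."""
    total = 0.0
    for row in matrix:
        if len(row) % 4 != 0:
            raise ValueError("row length must be a multiple of 4")
        for start in range(0, len(row), 4):
            magnitudes = sorted(abs(v) for v in row[start:start + 4])
            total += magnitudes[-1] + magnitudes[-2]
    return total


weights = [
    [1.0, -3.0, 2.0, 0.5, 0.0, 0.0, 4.0, -4.0],   # keeps 3+2 and 4+4
    [-1.0, -1.0, -1.0, -1.0, 5.0, 0.0, 0.0, 0.0],  # keeps 1+1 and 5+0
]
print(sum_after_2_to_4_reference(weights))  # 20.0
```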
import time  # used by the 'progressive channel swap' strategy below
import numpy as np
from .permutation_utilities import *
from .exhaustive_search import Exhaustive_Search
def accelerated_search_for_good_permutation(matrix_group, options=None):
"""This function is used to call the permutation search CUDA kernels.
Users can select a search strategy by passing a valid 'options' dictionary,
or implement their own customized 'accelerated_search_for_good_permutation' function.
"""
input_matrix = matrix_group.cpu().detach().numpy()
print("\n[accelerated_search_for_good_permutation] input matrix shape: \'{:}\'.".format(input_matrix.shape))
result = np.copy(input_matrix)
# init a sequential permutation search sequence
input_channel_num = matrix_group.size()[1]
permutation_sequence = [n for n in range(input_channel_num)]
duration = 0.0
if options is None:
options = {}
if 'strategy' not in options: # right now, the default permutation search strategy is: 'exhaustive' search
options['strategy'] = 'exhaustive'
print("[accelerated_search_for_good_permutation] the permutation strategy is: \'{:} search\'.".format(options['strategy']))
# define sub options for each search strategy
if options['strategy'] == 'exhaustive':
# right now, the default options for 'exhaustive' search are: 'exhaustive,8,100'
if 'stripe_group_size' not in options:
options['stripe_group_size'] = 8
if 'escape_attempts' not in options:
options['escape_attempts'] = 100
elif options['strategy'] == 'progressive channel swap':
# just swaps meaningful channels, keeping the good swaps, until the search time limit expires.
if 'progressive_search_time_limit' not in options:
options['progressive_search_time_limit'] = 60
if 'improvement_threshold' not in options:
options['improvement_threshold'] = 1e-9
# execute the requested strategy
if options['strategy'] == 'exhaustive':
result, duration, permutation_sequence = Exhaustive_Search(result, stripe_group_size=options['stripe_group_size'], escape_attempts=options['escape_attempts'])
elif options['strategy'] == 'progressive channel swap':
real_swap_num = 0
start_time = time.perf_counter()
while time.perf_counter() - start_time < options['progressive_search_time_limit']:
src = np.random.randint(result.shape[1])
dst = np.random.randint(result.shape[1])
src_group = src // 4
dst_group = dst // 4
if src_group == dst_group: # channel swapping within a stripe does nothing
continue
new_sum, improvement = try_swap(result, dst, src)
if improvement > options['improvement_threshold']:
result[...,[src,dst]] = result[...,[dst,src]]
permutation_sequence[src], permutation_sequence[dst] = permutation_sequence[dst], permutation_sequence[src]
real_swap_num += 1
duration = time.perf_counter() - start_time
print("\tFinally swap {} channel pairs until the search time limit expires.".format(real_swap_num))
elif options['strategy'] == 'user defined': # need to get the permutated matrix (result) by applying customized permutation search function
print("[accelerated_search_for_good_permutation] Use the user customized permutation search function!")
else:
print("[accelerated_search_for_good_permutation] Cannot find the implementation of the required strategy!")
print("[accelerated_search_for_good_permutation] Take {:.4f} seconds to search the permutation sequence.".format(duration))
# In the new version of Exhaustive_Search function, there’s no need to use the find_permutation(result, input_matrix) function
# to recover the permutation sequence applied to the input_matrix to get the result separately any more.
#start_time_find_permutation = time.perf_counter()
#permutation_sequence = find_permutation(result, input_matrix)
#duration_find_permutation = time.perf_counter() - start_time_find_permutation
#print("[accelerated_search_for_good_permutation] Take {:.4f} seconds to finish find_permutation function.".format(duration_find_permutation))
#print("[accelerated_search_for_good_permutation] The permutation sequence is: {:}".format(permutation_sequence))
#print("[accelerated_search_for_good_permutation] The length of permutation sequence is: {:}".format(len(permutation_sequence)))
return permutation_sequence
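# The 'progressive channel swap' strategy can be sketched without NumPy or the
# CUDA kernels. This is a simplified illustration, not the library's code: it
# runs a fixed number of attempts with a seeded generator instead of a time
# limit (so the run is repeatable), and it recomputes the full 2:4 sum instead
# of calling try_swap. All names in the block are hypothetical.

```python
import random


def sum_after_2_to_4(matrix):
    """Total magnitude kept when each group of 4 columns keeps its 2 largest values."""
    total = 0.0
    for row in matrix:
        for start in range(0, len(row), 4):
            mags = sorted(abs(v) for v in row[start:start + 4])
            total += mags[-1] + mags[-2]
    return total


def swap_columns(matrix, a, b):
    for row in matrix:
        row[a], row[b] = row[b], row[a]


def progressive_channel_swap(matrix, attempts=200, threshold=1e-9, seed=0):
    rng = random.Random(seed)
    result = [list(row) for row in matrix]  # work on a copy
    num_cols = len(result[0])
    sequence = list(range(num_cols))        # sequence[j] = original column now at j
    best = sum_after_2_to_4(result)
    for _ in range(attempts):
        src, dst = rng.randrange(num_cols), rng.randrange(num_cols)
        if src // 4 == dst // 4:            # swapping within a stripe changes nothing
            continue
        swap_columns(result, src, dst)
        candidate = sum_after_2_to_4(result)
        if candidate - best > threshold:    # keep only real improvements
            best = candidate
            sequence[src], sequence[dst] = sequence[dst], sequence[src]
        else:
            swap_columns(result, src, dst)  # undo the swap
    return result, sequence, best


original = [[9.0, 8.0, 7.0, 6.0, 0.0, 0.0, 0.0, 0.0],
            [5.0, 4.0, 3.0, 2.0, 0.1, 0.1, 0.1, 0.1]]
permuted, sequence, best = progressive_channel_swap(original)
print(sum_after_2_to_4(original), best, sequence)
```

# Applying `sequence` to the columns of the original matrix reproduces the
# permuted matrix, which is why the sequence alone needs to be returned.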
from .permutation_utilities import *
################################################################################################################
# Exhaustive
# Try them all
# - order of columns within a group doesn't matter
# - order of groups doesn't matter
# - we can eliminate effective duplicates by defining a unique combination to be a sorted list of sorted groups
################################################################################################################
####################################################################
# generate unique permutations
####################################################################
# check if adding a column index to a current permutation would keep it in canonical form
# assumes that perm is in canonical form already!
def is_canonical(perm, col):
    # if it's a new group
    if len(perm) % 4 == 0:
        # every column ID < col needs to be in the permutation already
        for val in range(col):
            if val not in perm:
                return False
        # this new group needs to be sorted w.r.t. the previous group
        return col > perm[-4]
    # not a new group, just check to see if it will still be sorted
    return col > perm[-1]
# recursive: build a unique permutation one column index at a time
def generate_unique_combinations(built_permutation, remaining_columns, full_permutation_list, group_width):
    # base case: nothing else to add
    if len(remaining_columns) == 0:
        full_permutation_list.append(np.copy(built_permutation))
        if len(full_permutation_list) % 1000000 == 0:
            print(f"{len(full_permutation_list)} unique permutations found so far")
    # still more choices to make, so add each remaining column in turn if it keeps everything sorted
    else:
        for c in range(len(remaining_columns)):
            # to satisfy our invariants (values within groups are sorted, groups are globally sorted),
            # only add this column if either:
            #   it's starting a new group and is larger than the previous group's first entry
            #   OR
            #   it's larger than the last value in the built_permutation
            col_to_add = remaining_columns[c]
            if is_canonical(built_permutation, col_to_add):
                # add the column to the running permutation, remove it from remaining columns
                built_permutation.append(col_to_add)
                remaining_columns.pop(c)
                # recurse
                generate_unique_combinations(built_permutation, remaining_columns, full_permutation_list, group_width)
                # remove the most recent column and put it back on the remaining column list where we found it (sorted)
                remaining_columns.insert(c, built_permutation.pop(-1))
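The canonical-form rule enforced by `is_canonical` and `generate_unique_combinations` can be illustrated with a small standalone sketch. This is not the code from this commit: `canonical_groupings` is a hypothetical helper that applies the same two invariants (columns sorted within each group, groups ordered by their first column) as a generator, and it assumes the column count is a multiple of the group width.

```python
def canonical_groupings(num_columns, group_width=4):
    """Yield every split of the columns into unordered groups exactly once.

    Hypothetical helper for illustration; assumes num_columns % group_width == 0.
    """
    def extend(prefix, remaining):
        if not remaining:
            yield list(prefix)
            return
        if len(prefix) % group_width == 0:
            # Starting a new group: it must open with the smallest unused column,
            # which keeps the groups ordered by their first entry.
            candidates = remaining[:1]
        else:
            # Inside a group: only larger columns keep the group sorted.
            candidates = [c for c in remaining if c > prefix[-1]]
        for c in candidates:
            rest = [r for r in remaining if r != c]
            yield from extend(prefix + [c], rest)

    yield from extend([], list(range(num_columns)))


# Four columns in groups of two have exactly three distinct groupings.
small_case = list(canonical_groupings(4, group_width=2))
print(small_case)
```

Because every grouping is emitted in exactly one ordering, an exhaustive search only has to score each distinct grouping once instead of every column permutation.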
import pickle
import os.path
from os import path
master_unique_permutation_list = {}
def generate_all_unique_combinations(C, M, must_use_all_groups = False):
    global master_unique_permutation_list
    if len(master_unique_permutation_list) == 0 and path.exists("master_list.pkl"):
        with open("master_list.pkl","rb") as cache:
            master_unique_permutation_list = pickle.load(cache)
    if (C,M) not in master_unique_permutation_list:
        full_permutation_list = []
        generate_unique_combinations([0], [c for c in range(1,C)], full_permutation_list, M)
        master_unique_permutation_list[(C,M)] = full_permutation_list
        with open("master_list.pkl", "wb") as cache:
            pickle.dump(master_unique_permutation_list, cache)
    unique_permutations = master_unique_permutation_list[(C,M)]
    return unique_permutations
# analytical solution
import math
def predict_unique_combinations(C, M):
    assert(C%M==0)
    G = C // M
    # use exact integer arithmetic: float division and math.pow lose precision (or overflow) for large C
    return math.factorial(C) // (math.factorial(M)**G * math.factorial(G))
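The analytical count behind `predict_unique_combinations` is the number of ways to split C columns into G = C/M unordered groups of M unordered columns, that is C! / ((M!)^G * G!). The sketch below evaluates that formula with exact integers for a few sizes; `count_unique_groupings` is a hypothetical name used only for this illustration.

```python
import math


def count_unique_groupings(num_columns, group_width):
    """Number of distinct groupings: C! / ((M!)^G * G!), with G = C / M."""
    if num_columns % group_width != 0:
        raise ValueError("num_columns must be a multiple of group_width")
    num_groups = num_columns // group_width
    # Integer division is exact here because the denominator always divides C!.
    return math.factorial(num_columns) // (
        math.factorial(group_width) ** num_groups * math.factorial(num_groups)
    )


# The count grows very quickly with the number of columns.
for columns in (4, 8, 12, 16):
    print(columns, count_unique_groupings(columns, 4))
```

The rapid growth is why `search_matrix` refuses to search when the prediction exceeds 1e10 and why wider matrices are handled through stripe-group windows.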
#################################################################
# exhaustively try all unique permutations
#################################################################
# exhaustively search the entire matrix
def search_matrix(matrix, group_width):
    # give up quickly if we'd go on forever
    prediction = predict_unique_combinations(matrix.shape[1], group_width)
    best_permutation = [c for c in range(matrix.shape[1])]
    if prediction > 1e10:
        print(f"There are {prediction} unique combinations with {matrix.shape[1]} columns and a group width of {group_width}, not searching.")
        # callers unpack four values (result, seconds, permutation, improvement);
        # no search ran, so report zero elapsed time and zero improvement
        return matrix, 0.0, best_permutation, 0.0
    start_time = time.perf_counter()
    full_permutation_list = generate_all_unique_combinations(matrix.shape[1], group_width)
    # found them, now try them (entry 0 is the identity permutation, which base_sum already covers)
    best_improvement = 0.0
    base_sum = sum_after_2_to_4(matrix)
    for i in range(1,len(full_permutation_list)):
        permutation = full_permutation_list[i]
        permuted = matrix[:, permutation]
        cur_improvement = sum_after_2_to_4(permuted) - base_sum
        if (cur_improvement > best_improvement):
            best_improvement = cur_improvement
            best_permutation = permutation
    seconds = time.perf_counter() - start_time
    return matrix[:, best_permutation], seconds, best_permutation, best_improvement
#############
# Stripe group handling
#############
# gather stripes from a larger matrix into a single matrix
def collect_stripes(matrix, stripes, group_width):
    subset = np.zeros((matrix.shape[0], len(stripes)*group_width))
    #print("[Debug][collect_stripes] matrix shape info: {}".format(matrix.shape))
    #print("[Debug][collect_stripes] subset info: {}, {}, {}".format(matrix.shape[0], len(stripes), group_width))
    for s,stripe in enumerate(stripes):
        #print("[Debug][collect_stripes] s: {}, stripe: {}".format(s, stripe))
        subset[...,s*group_width:s*group_width+group_width] = matrix[...,stripe*group_width:stripe*group_width+group_width]
    return subset
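As a rough illustration of what `collect_stripes` gathers, the sketch below pulls the columns of selected stripes into one contiguous matrix using NumPy index arrays. `gather_stripes` is a hypothetical stand-in, not the function above: it keeps the input dtype, whereas the original copies into a float array created with `np.zeros`.

```python
import numpy as np


def gather_stripes(matrix, stripes, group_width=4):
    """Return the columns of the listed stripes, side by side (illustrative helper)."""
    # Stripe s covers columns [s * group_width, (s + 1) * group_width).
    cols = np.concatenate(
        [np.arange(s * group_width, (s + 1) * group_width) for s in stripes]
    )
    return matrix[:, cols]


matrix = np.arange(24).reshape(2, 12)   # 2 rows, 3 stripes of width 4
subset = gather_stripes(matrix, stripes=[0, 2])
print(subset)
```

Stripes 0 and 2 contribute columns 0-3 and 8-11, so the result has `len(stripes) * group_width` columns.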
# apply the stripe group permutation to the entire permutation
def apply_stripe_group_permutation(sgp, stripes, group_width, permutation):
    new_permutation = permutation.copy()
    for subset_idx in range(len(sgp)):
        dst_stripe_idx = stripes[int(subset_idx / group_width)]
        dst_col_idx = subset_idx % group_width
        subset_val = sgp[subset_idx]
        src_stripe_idx = stripes[int(subset_val / group_width)]
        src_col_idx = subset_val % group_width
        new_permutation[dst_stripe_idx*group_width + dst_col_idx] = permutation[src_stripe_idx*group_width + src_col_idx]
    return new_permutation
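The index mapping in `apply_stripe_group_permutation` translates positions inside a gathered stripe group back to columns of the full matrix. A minimal worked example, using a hypothetical helper `lift_local_permutation` that follows the same arithmetic:

```python
def lift_local_permutation(local_perm, stripes, group_width, global_perm):
    """Apply a permutation found on a stripe group to the full permutation list."""
    new_perm = list(global_perm)
    for local_dst, local_src in enumerate(local_perm):
        # Map a position inside the gathered subset back to a column of the full matrix.
        dst = stripes[local_dst // group_width] * group_width + local_dst % group_width
        src = stripes[local_src // group_width] * group_width + local_src % group_width
        new_perm[dst] = global_perm[src]
    return new_perm


identity = list(range(12))              # 3 stripes of width 4, not yet permuted
local = [0, 1, 2, 4, 3, 5, 6, 7]        # swap local columns 3 and 4 of the 2-stripe subset
lifted = lift_local_permutation(local, stripes=[0, 2], group_width=4, global_perm=identity)
print(lifted)
```

Local column 3 belongs to stripe 0 (global column 3) and local column 4 to stripe 2 (global column 8), so the swap exchanges global columns 3 and 8 and leaves stripe 1 untouched.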
# generate all possible stripe groups
def generate_stripe_groups(num_stripes, window_size):
    stripe_array = [[c] for c in range(num_stripes)]
    next_stripe_array = []
    for w in range(1, window_size):
        for g in range(len(stripe_array)):
            start_c = stripe_array[g][w-1]+1
            group = stripe_array[g]
            for c in range(start_c, num_stripes):
                new_group = group.copy()
                new_group.append(c)
                next_stripe_array.append(new_group)
        stripe_array = next_stripe_array
        next_stripe_array = []
    return set(tuple(stripe_array[g]) for g in range(len(stripe_array)))
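The set built by `generate_stripe_groups` contains every strictly increasing tuple of `window_size` stripe indices, which is the same set that `itertools.combinations` produces. The sketch below shows that equivalent formulation; `stripe_groups` is a hypothetical name, and `window_size` is counted in stripes, as it is after the division in `build_stripe_map`.

```python
from itertools import combinations


def stripe_groups(num_stripes, window_size):
    """All ways to choose window_size stripes out of num_stripes, as sorted tuples."""
    return set(combinations(range(num_stripes), window_size))


groups = stripe_groups(4, 2)
print(sorted(groups))
```

With 4 stripes and a window of 2 stripes there are 6 groups; the count is the binomial coefficient "num_stripes choose window_size", so it grows quickly for wide matrices.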
# It is not safe to just reset the stripe_set as None here.
# When calling the Exhaustive_Search in E2E search, the stripe_set will not be reset as None.
stripe_set = None
stripe_set_config = None
# build the stripe map
def build_stripe_map(matrix, group_width, window_size, stripe_map, stripe_ids, perm_map, used_stripes):
    global stripe_set, stripe_set_config
    #print("[Debug][build_stripe_map] Now the stripe_set value is: {}".format(stripe_set))
    window_size = int(window_size / group_width)
    if stripe_set is None or stripe_set_config is None or stripe_set_config != (group_width, window_size):
        num_stripes = int(matrix.shape[1] / group_width)
        assert(group_width * num_stripes == matrix.shape[1])
        stripe_set = generate_stripe_groups(num_stripes, window_size)
        #print("[Debug][build_stripe_map] Update stripe_set value as: {}".format(stripe_set))
        stripe_set_config = (group_width, window_size)
    # step through each, update the stripe_map/stripe_ids if necessary
    updates = 0
    use_cuda = use_gpu()
    gpu_list = []
    gpu_groups = []
    for i,s in enumerate(stripe_set):
        sg = [] # build the group of stripes, check if any members changed
        need_update = i >= len(stripe_map)
        for stripe in s:
            sg.append(stripe)
            if stripe in used_stripes:
                need_update = True
        # pre-populate if we're building fresh
        if i >= len(stripe_map):
            stripe_ids.append(sg)
            stripe_map.append(0.)
            perm_map.append([c for c in range(group_width * window_size)])
        # update entries if needed (only stripe_map and perm_map)
        if need_update:
            updates += 1
            if not use_cuda: # do the work here if using the CPU
                subset = collect_stripes(matrix, sg, group_width)
                sub_result, sub_duration, permutation, improvement = search_matrix(subset, group_width)
                stripe_map[i] = improvement
                perm_map[i] = permutation
            else: # otherwise, just track the work needed to farm off to the GPU
                gpu_groups.append(sg)
                gpu_list.append(i)
    if use_cuda: # if using the GPU, perform the work
        matrix_view = np.copy(matrix).astype(np.float32).flatten()
        all_permutations = generate_all_unique_combinations(window_size*group_width, group_width)
        num_permutations = len(all_permutations)
        permutation_view = np.copy(np.asarray(all_permutations)).astype(np.uint32).flatten()
        stripe_groups_view = np.asarray(gpu_groups).astype(np.uint32).flatten()
        num_gpu_groups = len(gpu_list)
        gpu_improvement = np.zeros((num_gpu_groups), dtype=np.float32).flatten()
        gpu_permutation = np.zeros((num_gpu_groups), dtype=np.uint32).flatten()
        result = permutation_search_cuda_kernels.build_permute_map(matrix_view,
                                                                   matrix.shape[0],
                                                                   matrix.shape[1],
                                                                   stripe_groups_view,
                                                                   num_gpu_groups,
                                                                   window_size,
                                                                   permutation_view,
                                                                   window_size * group_width,
                                                                   gpu_improvement,
                                                                   gpu_permutation)
        # put the data where python expects it
        for i in range(len(gpu_list)):
            stripe_map[gpu_list[i]] = gpu_improvement[i]
            perm_map[gpu_list[i]] = all_permutations[gpu_permutation[i]]
    return stripe_map, stripe_ids, perm_map
# start performing stripe checks
sm_perturbations = 0
sm_perturbation_limit = 0
def use_stripe_map(matrix, group_width, stripe_map, stripe_ids, perm_map, permutation):
    global sm_perturbations, sm_perturbation_limit
    used_stripes = []
    stripe_groups_optimized = 0
    improvement = 0.0
    # set the traversal order
    ix = np.flip(np.argsort(stripe_map)) # small to large --> large to small
    for i in range(len(ix)):
        stripe_group_id = ix[i]
        perm = perm_map[stripe_group_id].copy()
        if stripe_map[stripe_group_id] <= 0.0001:
            # perturbations
            if len(used_stripes) == 0 and sm_perturbations < sm_perturbation_limit:
                sm_perturbations += 1
                # use this permutation, but swap two channels from left/right halves to include two stripes, no matter the group size
                stripe_group_id = ix[np.random.randint(len(ix))]
                perm = perm_map[stripe_group_id].copy()
                # a little easier to escape from
                src = np.random.randint(int(len(perm)/2))
                dst = int(len(perm)/2) + np.random.randint(int(len(perm)/2))
                perm[src],perm[dst] = perm[dst],perm[src]
            else:
                break
        stripe_group = stripe_ids[stripe_group_id]
        # don't work on stripes we've already touched
        touched_stripe = False
        for stripe in stripe_group:
            if stripe in used_stripes:
                touched_stripe = True
        if touched_stripe:
            continue
        # apply the permutation we've already found to this stripe group
        subset = collect_stripes(matrix, stripe_group, group_width)
        sub_result = subset[...,perm]
        permutation = apply_stripe_group_permutation(perm, stripe_group, group_width, permutation)
        # scatter the results, track what changed
        for s,stripe in enumerate(stripe_group):
            # see if this group is in canonical form (entry 0 a multiple of 4, contiguous values)
            group = perm[s*group_width:s*group_width+group_width] # columns in this group of the used permutation
            changed = False
            if group[0] % 4 != 0:
                changed = True
            for c in range(1,group_width):
                if group[c] != group[c-1]+1:
                    changed = True
                    break
            # if it's not, then it changed
            if changed:
                used_stripes.append(stripe_group[s])
            matrix[...,stripe*group_width:stripe*group_width+group_width] = sub_result[...,s*group_width:s*group_width+group_width]
        improvement += stripe_map[stripe_group_id]
        stripe_groups_optimized += 1
    return matrix, stripe_groups_optimized, stripe_map, stripe_ids, used_stripes, improvement, permutation
# entry point for exhaustive searches - both the entire matrix, as well as stripe groups
def Exhaustive_Search(matrix, stripe_group_size=-1, escape_attempts=0, permutation=None):
    global sm_perturbation_limit, sm_perturbations
    sm_perturbations = 0
    sm_perturbation_limit = escape_attempts
    if permutation is None:
        permutation = [c for c in range(matrix.shape[1])]
    # It is much safer to reset the stripe_set as None in the entry point of Exhaustive_Search
    global stripe_set, stripe_set_config
    stripe_set = None
    stripe_set_config = None
    # only support N:4 for now
    group_width = 4
    result = np.copy(matrix)
    # if the matrix is too large for a window size of 12, subdivide, then fix up with a global optimization with a window size of 8
    if group_width==4 and stripe_group_size==12 and matrix.shape[1] > 512:
        stripe_split = int(matrix.shape[1]/2/group_width)
        col_split = stripe_split * group_width
        result[:,:col_split], durationL, permutation[:col_split] = Exhaustive_Search(result[:,:col_split], stripe_group_size=stripe_group_size, escape_attempts=escape_attempts, permutation=permutation[:col_split])
        result[:,col_split:], durationR, permutation[col_split:] = Exhaustive_Search(result[:,col_split:], stripe_group_size=stripe_group_size, escape_attempts=escape_attempts, permutation=permutation[col_split:])
        escape_attempts = max(escape_attempts, 100)*10
        result,duration,permutation = Exhaustive_Search(result, stripe_group_size=8, escape_attempts=escape_attempts, permutation=permutation)
        return result, durationL+durationR+duration, permutation
    # the matrix is wider than the search window: optimize stripe groups repeatedly until no stripe changes
    if stripe_group_size != -1 and stripe_group_size < matrix.shape[1]:
        stripe_map = []
        stripe_ids = []
        perm_map = []
        used_stripes = []
        optimized_groups_count = 0
        agg_improvement = 0.
        cur_total_sum = sum_after_2_to_4(result)
        # in practice, this work will be cached ahead of time; doing it now.
        # (Reading the cached list from disk can take several seconds, which shouldn't be counted against the search, but amortized over every layer in a network)
        generate_all_unique_combinations(stripe_group_size, group_width)
        start_time = time.perf_counter()
        while True:
            #print("[Debug][Exhaustive_Search] Before entering the build_stripe_map function.")
            #print("[Debug][Exhaustive_Search] Now the stripe_set value is: {}".format(stripe_set))
            stripe_map, stripe_ids, perm_map = build_stripe_map(result, group_width, stripe_group_size, stripe_map, stripe_ids, perm_map, used_stripes)
            result, stripe_groups_optimized, stripe_map, stripe_ids, used_stripes, improvement, permutation = use_stripe_map(result, group_width, stripe_map, stripe_ids, perm_map, permutation)
            # converged?
            if len(used_stripes) == 0:
                break
        duration = time.perf_counter() - start_time
    else: # no sliding window, single iteration over the entire matrix
        print(f"Matrix has {matrix.shape[1]} columns and the search window is only {stripe_group_size}: searching exhaustively")
        result, duration, permutation, improvement = search_matrix(matrix, group_width)
    return result, duration, permutation
import numpy as np
import time
import ctypes
import subprocess
import os
import math
gpus_tested = False
gpus_found = 0
kernels_found = True
try:
    import permutation_search_cuda as permutation_search_cuda_kernels
    print(f"Found permutation search CUDA kernels")
except ImportError:
    print(f"Could not find permutation search CUDA kernels, falling back to CPU path")
    kernels_found = False
def use_gpu(initial_override = True):
    global gpus_tested, gpus_found, kernels_found
    if not gpus_tested:
        if not initial_override:
            gpus_tested = True
            return False
        try:
            gpus_found = str(subprocess.check_output(["nvidia-smi", "-L"])).count('UUID')
            print(f"Found {gpus_found} gpus")
        except (OSError, subprocess.CalledProcessError):
            # nvidia-smi is missing or exited with an error
            gpus_found = 0
            print(f"Could not find nvidia-smi, please check your cuda installation")
        gpus_tested = True
    return gpus_found > 0 and kernels_found
##############################################################################################
# pruning utilities
##############################################################################################
## apply 2:4 to some matrix
def apply_2_to_4(matrix):
    for row in range(matrix.shape[0]):
        for col in range(0,matrix.shape[1],4):
            ix = np.argsort(np.abs(matrix[row,col:col+4]))
            matrix[row,col+ix[0]] = 0.0
            matrix[row,col+ix[1]] = 0.0
    return matrix
## find the sum of magnitudes if 2:4 were applied to a matrix
def sum_after_2_to_4(matrix):
    #matrix = np.copy(matrix)
    cur_sum = 0.0
    use_cuda = use_gpu()
    if not use_cuda:
        start_time = time.perf_counter()
        for row in range(matrix.shape[0]):
            for col in range(0,matrix.shape[1],4):
                ix = np.argsort(np.abs(matrix[row,col:col+4]))
                cur_sum += abs(matrix[row,col+ix[2]])
                cur_sum += abs(matrix[row,col+ix[3]])
        np_elapsed = time.perf_counter() - start_time
    else:
        matrix = matrix.astype(np.float32)
        cuda_sum = np.zeros((1), dtype=np.float32)
        start_time = time.perf_counter()
        matrix_view = np.copy(matrix).flatten()
        sum_view = cuda_sum.flatten()
        blocks = max(int(matrix.shape[1]/4/2), 1)
        threads = min(max(math.ceil(matrix.shape[0]/4), 1), 1024)
        result = permutation_search_cuda_kernels.sum_after_2_to_4(matrix_view,
                                                                  matrix.shape[0],
                                                                  matrix.shape[1],
                                                                  0,
                                                                  matrix.shape[1],
                                                                  blocks,
                                                                  threads,
                                                                  sum_view)
        cuda_elapsed = time.perf_counter() - start_time
        #print(cuda_sum, cuda_elapsed, cur_sum, np_elapsed, np_elapsed/cuda_elapsed)
        cur_sum = sum_view[0]
    return cur_sum
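The CPU branch of `sum_after_2_to_4` keeps the two largest magnitudes in every group of four columns and adds them up. A vectorized NumPy sketch of the same quantity follows; `sum_after_2_to_4_numpy` is a hypothetical name, it is not part of this commit, and it assumes the column count is a multiple of 4.

```python
import numpy as np


def sum_after_2_to_4_numpy(matrix):
    """Total magnitude that survives 2:4 pruning (illustrative, CPU only)."""
    rows, cols = matrix.shape
    # View each row as consecutive groups of 4 magnitudes.
    groups = np.abs(matrix).reshape(rows, cols // 4, 4)
    # After an ascending sort, the last two entries of each group are the ones kept.
    kept = np.sort(groups, axis=-1)[..., 2:]
    return float(kept.sum())


example = np.array([[1.0, -2.0, 3.0, -4.0, 0.5, 0.5, 0.1, 0.2]])
total = sum_after_2_to_4_numpy(example)
print(total)
```

In the example the first group keeps 3 and 4 and the second keeps 0.5 and 0.5, so the retained magnitude is 8.0. A permutation search tries to reorder columns so that this total is as large as possible.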
## try swapping columns and tracking magnitude after pruning
def try_swap(matrix, dst, src):
    src_base = sum_after_2_to_4(matrix[...,int(src/4)*4:int(src/4)*4+4])
    dst_base = sum_after_2_to_4(matrix[...,int(dst/4)*4:int(dst/4)*4+4])
    # swap
    matrix[...,[src,dst]] = matrix[...,[dst,src]]
    # check the Nx4 slices of the swapped columns
    src_sum = sum_after_2_to_4(matrix[...,int(src/4)*4:int(src/4)*4+4])
    dst_sum = sum_after_2_to_4(matrix[...,int(dst/4)*4:int(dst/4)*4+4])
    # swap back
    matrix[...,[src,dst]] = matrix[...,[dst,src]]
    return src_sum + dst_sum, (src_sum + dst_sum) - (src_base + dst_base)
##############################################################################################
# permutation utilities
##############################################################################################
## find the permutation needed to make matrix A look like matrix B
def find_permutation(A, B):
    permutation = []
    for col in range(A.shape[1]):
        Avals = A[...,col]
        for bcol in range(B.shape[1]):
            if np.all(Avals - B[...,bcol] == np.zeros(Avals.shape)):
                permutation.append(bcol)
                break
    return permutation
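To make the contract of `find_permutation` concrete: the returned list `p` satisfies `B[:, p] == A` when every column of A appears in B. The sketch below checks this on random data with a hypothetical re-implementation, `recover_permutation`. If B contains duplicate columns, the first match wins, so the recovered list may not be a true permutation.

```python
import numpy as np


def recover_permutation(A, B):
    """For each column of A, find the index of an identical column in B."""
    permutation = []
    for col in range(A.shape[1]):
        for bcol in range(B.shape[1]):
            if np.array_equal(A[:, col], B[:, bcol]):
                permutation.append(bcol)
                break
    return permutation


rng = np.random.default_rng(0)
B = rng.standard_normal((3, 8))
shuffle = [3, 0, 7, 1, 2, 6, 5, 4]
A = B[:, shuffle]                      # A is B with its columns reordered
recovered = recover_permutation(A, B)
print(recovered)
```

Indexing B with the recovered list reproduces A, which is the same convention `search_matrix` uses when it returns `matrix[:, best_permutation]`.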
...@@ -55,7 +55,7 @@ def main(args):
 step = train_loop(args, model, optimizer, step, args.num_dense_steps)
 # simulate sparsity by inserting zeros into existing dense weights
-ASP.enable_sparsity()
+ASP.compute_sparse_masks()
 # train for a few steps with sparse weights
 print("SPARSE :: ",one_ll)
......
...@@ -50,7 +50,7 @@ def main(step, args, model_state_dict, optimizer_state_dict):
 model.load_state_dict(model_state_dict)
 optimizer.load_state_dict(optimizer_state_dict)
-print("Model sparsity is %s" % ("enabled" if ASP.sparsity_is_enabled() else "disabled"))
+print("Model sparsity is %s" % ("enabled" if ASP.is_sparsity_enabled() else "disabled"))
 # train for a few steps with sparse weights
 print("SPARSE :: ",one_ll)
......
...@@ -59,7 +59,7 @@ def main(args):
 step = train_loop(args, model, optimizer, step, args.num_dense_steps)
 # simulate sparsity by inserting zeros into existing dense weights
-ASP.enable_sparsity()
+ASP.compute_sparse_masks()
 # train for a few steps with sparse weights
 print("SPARSE :: ",one_ll)
......
...@@ -295,6 +295,20 @@ if "--cuda_ext" in sys.argv:
 )
 )
+if "--permutation_search" in sys.argv:
+    sys.argv.remove("--permutation_search")
+    if CUDA_HOME is None:
+        raise RuntimeError("--permutation_search was requested, but nvcc was not found. Are you sure your environment has nvcc available? If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.")
+    else:
+        cc_flag = ['-Xcompiler', '-fPIC', '-shared']
+        ext_modules.append(
+            CUDAExtension(name='permutation_search_cuda',
+                          sources=['apex/contrib/sparsity/permutation_search_kernels/CUDA_kernels/permutation_search_kernels.cu'],
+                          include_dirs=[os.path.join(this_dir, 'apex', 'contrib', 'sparsity', 'permutation_search_kernels', 'CUDA_kernels')],
+                          extra_compile_args={'cxx': ['-O3'] + version_dependent_macros,
+                                              'nvcc':['-O3'] + version_dependent_macros + cc_flag}))
 if "--bnp" in sys.argv:
     sys.argv.remove("--bnp")
     raise_if_cuda_home_none("--bnp")
......