Commit 81f032ed authored by Zaida Zhou, committed by GitHub

[Docs] Update FAQ (#1481)

* [Docs] Update FAQ

* update faq

* polish the description

* update faq

* update faq

* improve the faq

* improve the faq

* improve the faq
parent 43b2f098
@@ -3,40 +3,89 @@
We list some common troubles faced by many users and their corresponding solutions here.
Feel free to enrich the list if you find any frequent issues and have ways to help others to solve them.
### Installation
- KeyError: "xxx: 'yyy is not in the zzz registry'"
The registry mechanism is triggered only when the file that defines the module is imported, so you need to import that file somewhere. More details can be found at https://github.com/open-mmlab/mmdetection/issues/5974.
- "No module named 'mmcv.ops'"; "No module named 'mmcv._ext'"
1. Uninstall existing mmcv in the environment using `pip uninstall mmcv`
2. Install mmcv-full following the [installation instruction](https://mmcv.readthedocs.io/en/latest/get_started/installation.html) or [Build MMCV from source](https://mmcv.readthedocs.io/en/latest/get_started/build.html)
- "invalid device function" or "no kernel image is available for execution"
1. Check the CUDA compute capability of your GPU
2. Run `python mmdet/utils/collect_env.py` to check whether PyTorch, torchvision, and MMCV are built for the correct GPU architecture. You may need to set `TORCH_CUDA_ARCH_LIST` and reinstall MMCV. This compatibility issue can occur with old GPUs, e.g., Tesla K80 (compute capability 3.7) on Colab
3. Check whether the running environment is the same as the one in which mmcv/mmdet was compiled. For example, you may have compiled mmcv with CUDA 10.0 but be running it in a CUDA 9.0 environment
- "undefined symbol" or "cannot open xxx.so"
1. If those symbols are CUDA/C++ symbols (e.g., libcudart.so or GLIBCXX), check whether the CUDA/GCC runtimes are the same as those used for compiling mmcv
2. If those symbols are PyTorch symbols (e.g., symbols containing caffe, aten, and TH), check whether the PyTorch version is the same as that used for compiling mmcv
3. Run `python mmdet/utils/collect_env.py` to check whether PyTorch, torchvision, and MMCV were built in and are running in the same environment
- "RuntimeError: CUDA error: invalid configuration argument"
This error may be caused by the poor performance of your GPU. Try to decrease the value of [THREADS_PER_BLOCK](https://github.com/open-mmlab/mmcv/blob/cac22f8cf5a904477e3b5461b1cc36856c2793da/mmcv/ops/csrc/common_cuda_helper.hpp#L10) and recompile mmcv.
- "RuntimeError: nms is not compiled with GPU support"
This error occurs because your CUDA environment is not installed correctly.
You may try to reinstall your CUDA environment and then delete the build/ folder before recompiling mmcv.
- "Segmentation fault"
1. Check your GCC version and use GCC >= 5.4. This error is usually caused by an incompatibility between PyTorch and the environment (e.g., GCC < 4.9 for PyTorch). We also recommend that users avoid GCC 5.5, because many reports indicate that GCC 5.5 causes "segmentation fault" and that simply changing to GCC 5.4 solves the problem
2. Check whether PyTorch is correctly installed and can use CUDA ops, e.g., type the following command in your terminal and see whether it outputs the result correctly
```shell
python -c 'import torch; print(torch.cuda.is_available())'
```
3. If PyTorch is correctly installed, check whether MMCV is correctly installed. If it is, the following command runs without errors
```shell
python -c 'import mmcv; import mmcv.ops'
```
4. If MMCV and PyTorch are correctly installed, you can use `ipdb` to set breakpoints or directly add `print` statements to debug and find out which part leads to the `segmentation fault`
- "libtorch_cuda_cu.so: cannot open shared object file"
`mmcv-full` depends on this shared object, but it cannot be found. Check whether the object exists in `~/miniconda3/envs/{environment-name}/lib/python3.7/site-packages/torch/lib`, or try to reinstall PyTorch.
- "fatal error C1189: #error: -- unsupported Microsoft Visual Studio version!"
If you are building mmcv-full on Windows and the version of CUDA is 9.2, you will probably encounter the error `"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2\include\crt/host_config.h(133): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions 2012, 2013, 2015 and 2017 are supported!"`, in which case you can use a lower version of Microsoft Visual Studio like vs2017.
- "error: member "torch::jit::detail::ModulePolicy::all_slots" may not be initialized"
If your version of PyTorch is 1.5.0 and you are building mmcv-full on Windows, you will probably encounter the error `- torch/csrc/jit/api/module.h(474): error: member "torch::jit::detail::ModulePolicy::all_slots" may not be initialized`. To fix it, replace every `static constexpr bool all_slots = false;` with `static bool all_slots = false;` in the file `https://github.com/pytorch/pytorch/blob/v1.5.0/torch/csrc/jit/api/module.h`. More details can be found at https://github.com/pytorch/pytorch/issues/39394.
- "error: a member with an in-class initializer must be const"
If your version of PyTorch is 1.6.0 and you are building mmcv-full on Windows, you will probably encounter the error `"- torch/include\torch/csrc/jit/api/module.h(483): error: a member with an in-class initializer must be const"`. To fix it, replace every `CONSTEXPR_EXCEPT_WIN_CUDA` with `const` in `torch/include\torch/csrc/jit/api/module.h`. More details can be found at https://github.com/open-mmlab/mmcv/issues/575.
- "error: member "torch::jit::ProfileOptionalOp::Kind" may not be initialized"
If your version of PyTorch is 1.7.0 and you are building mmcv-full on Windows, you will probably encounter the error `torch/include\torch/csrc/jit/ir/ir.h(1347): error: member "torch::jit::ProfileOptionalOp::Kind" may not be initialized`. Solving it requires modifying several local PyTorch files:
- delete `static constexpr Symbol Kind = ::c10::prim::profile;` and `static constexpr Symbol Kind = ::c10::prim::profile_optional;` at `torch/include\torch/csrc/jit/ir/ir.h`
- replace `explicit operator type&() { return *(this->value); }` with `explicit operator type&() { return *((type*)this->value); }` at `torch\include\pybind11\cast.h`
- replace all the `CONSTEXPR_EXCEPT_WIN_CUDA` with `const` at `torch/include\torch/csrc/jit/api/module.h`
More details can be found at https://github.com/pytorch/pytorch/pull/45956.
- Compatibility issue between MMCV and MMDetection; "ConvWS is already registered in conv layer"
Please install the correct version of MMCV for the version of your MMDetection following the [installation instruction](https://mmdetection.readthedocs.io/en/latest/get_started.html#installation).
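The compatibility item above amounts to checking that the installed MMCV version falls inside the range supported by the installed MMDetection. The following is a minimal sketch of such a range check, assuming plain dotted version strings. The names `parse_version` and `is_compatible` and the bounds `MMCV_MIN` and `MMCV_MAX` are hypothetical placeholders, not values taken from MMDetection; look up the real bounds in the installation instruction.

```python
def parse_version(version: str) -> tuple:
    """Turn '1.3.17' into (1, 3, 17), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split('.'):
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_compatible(installed: str, minimum: str, maximum: str) -> bool:
    """Return True when minimum <= installed <= maximum."""
    return (parse_version(minimum) <= parse_version(installed)
            <= parse_version(maximum))


# Hypothetical bounds, for illustration only.
MMCV_MIN = '1.3.8'
MMCV_MAX = '1.4.0'

print(is_compatible('1.3.17', MMCV_MIN, MMCV_MAX))
print(is_compatible('1.2.4', MMCV_MIN, MMCV_MAX))
```

Comparing tuples of integers rather than raw strings avoids the pitfall where `'1.10.0' < '1.9.0'` holds lexicographically.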
### Usage
- "RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one"
1. This error indicates that your module has parameters that were not used in producing the loss. This may be caused by running different branches of your code in DDP mode. More details at https://github.com/pytorch/pytorch/issues/55582
2. You can set `find_unused_parameters = True` in the config to solve the above problem, or find the unused parameters manually
- "RuntimeError: Trying to backward through the graph a second time"
`GradientCumulativeOptimizerHook` and `OptimizerHook` are both set, which causes `loss.backward()` to be called twice, so a `RuntimeError` is raised. Use only one of them. More details at https://github.com/open-mmlab/mmcv/issues/1379.
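Both usage fixes above are config-level changes. The following is a minimal sketch, assuming an MMDetection-style Python config file in which a dict named `optimizer_config` selects the optimizer hook; that key name and the value `cumulative_iters=4` are illustrative assumptions, so check them against your own config.

```python
# Fix for "Expected to have finished reduction in the prior iteration":
# let DDP tolerate parameters that receive no gradient in an iteration.
find_unused_parameters = True

# Fix for "Trying to backward through the graph a second time":
# configure exactly one optimizer hook. Here the gradient-accumulation
# hook replaces the default OptimizerHook instead of being added on top.
# cumulative_iters=4 is an arbitrary illustrative value.
optimizer_config = dict(
    type='GradientCumulativeOptimizerHook',
    cumulative_iters=4,
)
```

Setting `find_unused_parameters = True` adds some overhead per iteration, so locating the unused parameters manually may be preferable for long training runs.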
@@ -47,13 +47,16 @@ class Converter1(object):
        self.a = a
        self.b = b
```
The key step in using the registry to manage modules is to register the implemented module into the registry `CONVERTERS` through
`@CONVERTERS.register_module()` when you create the module. In this way, a mapping between a string and the class is built and maintained by `CONVERTERS`, as shown below:
```python
'Converter1' -> <class 'Converter1'>
```
```{note}
The registry mechanism will be triggered only when the file where the module is located is imported.
So you need to import that file somewhere. More details can be found at https://github.com/open-mmlab/mmdetection/issues/5974.
```
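To make the note concrete, the following is a minimal sketch of why registration depends on import. `MiniRegistry` is a simplified, hypothetical stand-in written for this example and is not MMCV's `Registry` implementation. The decorator body runs only when the class definition is executed, which for a real project happens when the defining file is imported.

```python
class MiniRegistry:
    """Simplified stand-in for a registry (not MMCV's implementation)."""

    def __init__(self, name):
        self.name = name
        self._module_dict = {}

    def register_module(self):
        # The returned decorator runs when the class statement executes,
        # i.e. when the file containing the class is imported.
        def _register(cls):
            self._module_dict[cls.__name__] = cls
            return cls
        return _register

    def get(self, key):
        if key not in self._module_dict:
            raise KeyError(f'{key} is not in the {self.name} registry')
        return self._module_dict[key]


CONVERTERS = MiniRegistry('converter')

# Before the class definition has been executed, the lookup fails.
try:
    CONVERTERS.get('Converter1')
except KeyError as err:
    lookup_error = err.args[0]


@CONVERTERS.register_module()
class Converter1:
    def __init__(self, a, b):
        self.a = a
        self.b = b


# After the definition has run, the string maps to the class.
converter = CONVERTERS.get('Converter1')(a=1, b=2)
```

In a real project the class lives in its own file, so the equivalent of executing the class definition is importing that file, for example from the package's `__init__.py`.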
If the module is successfully registered, you can use this converter through configs as
......
@@ -3,35 +3,89 @@
Here we list problems that users frequently encounter, together with their solutions. If you run into other common problems and know a solution that could help others,
feel free to enrich this list.
### Installation
- KeyError: "xxx: 'yyy is not in the zzz registry'"
The registry mechanism is triggered only when the file that contains the module is imported, so you need to import that file somewhere. More details can be found at https://github.com/open-mmlab/mmdetection/issues/5974
- "No module named 'mmcv.ops'"; "No module named 'mmcv._ext'"
1. Uninstall the mmcv in your environment using `pip uninstall mmcv`
2. Install mmcv-full following the [installation instruction](https://mmcv.readthedocs.io/en/latest/get_started/installation.html) or [Build MMCV from source](https://mmcv.readthedocs.io/en/latest/get_started/build.html)
- "invalid device function" or "no kernel image is available for execution"
1. Check the CUDA compute capability of your GPU
2. Run `python mmdet/utils/collect_env.py` to check whether PyTorch, torchvision, and MMCV are built for the correct GPU architecture. You may need to set `TORCH_CUDA_ARCH_LIST` and reinstall MMCV. This compatibility issue can occur with old GPUs, e.g., Tesla K80 (3.7) on Colab
3. Check whether the running environment is the same as the one in which mmcv/mmdet was compiled. For example, you may have compiled mmcv with CUDA 10.0 but be running it in a CUDA 9.0 environment
- "undefined symbol" or "cannot open xxx.so"
1. If the symbols are related to CUDA/C++ (e.g., libcudart.so or GLIBCXX), check whether the CUDA/GCC runtime versions are the same as those used for compiling mmcv
2. If the symbols are related to PyTorch (e.g., symbols containing caffe, aten, and TH), check whether the PyTorch runtime version is the same as that used for compiling mmcv
3. Run `python mmdet/utils/collect_env.py` to check whether PyTorch, torchvision, and MMCV were built in and are running in the same environment
- "RuntimeError: CUDA error: invalid configuration argument"
This error may be caused by the poor performance of your GPU. Try to decrease the value of [THREADS_PER_BLOCK](https://github.com/open-mmlab/mmcv/blob/cac22f8cf5a904477e3b5461b1cc36856c2793da/mmcv/ops/csrc/common_cuda_helper.hpp#L10) and recompile mmcv.
- "RuntimeError: nms is not compiled with GPU support"
This error occurs because your CUDA environment is not installed correctly.
You may try to reinstall your CUDA environment, then delete the mmcv/build folder and recompile mmcv.
- "Segmentation fault"
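1. Check the GCC version. This error is usually caused by a mismatch between the PyTorch version and the GCC version (e.g., GCC < 4.9). We recommend GCC 5.4 and advise against GCC 5.5, because there are reports that GCC 5.5 causes "segmentation fault" and that switching to GCC 5.4 solves the problem
2. Check whether the CUDA version of PyTorch is correctly installed. Type the following command and check whether it returns True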
```shell
python -c 'import torch; print(torch.cuda.is_available())'
```
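3. If `torch` is installed successfully, check whether MMCV is installed successfully. Type the following command; if no error is reported, mmcv-full is installed successfully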
```shell
python -c 'import mmcv; import mmcv.ops'
```
4. If both MMCV and PyTorch are installed successfully, you can use `ipdb` to set breakpoints or use the `print` function to find out which part of the code causes the `segmentation fault`
- "libtorch_cuda_cu.so: cannot open shared object file"
`mmcv-full` depends on the `libtorch_cuda_cu.so` file, but the program could not find it at runtime. Check whether the file exists in `~/miniconda3/envs/{environment-name}/lib/python3.7/site-packages/torch/lib`, or try to reinstall PyTorch.
- "fatal error C1189: #error: -- unsupported Microsoft Visual Studio version!"
If you are building mmcv-full on Windows and the version of CUDA is 9.2, you will probably encounter the error `"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2\include\crt/host_config.h(133): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions 2012, 2013, 2015 and 2017 are supported!"`, in which case you can try a lower version of Microsoft Visual Studio, such as vs2017.
- "error: member "torch::jit::detail::ModulePolicy::all_slots" may not be initialized"
If you are building mmcv-full on Windows and the version of PyTorch is 1.5.0, you will probably encounter the error `- torch/csrc/jit/api/module.h(474): error: member "torch::jit::detail::ModulePolicy::all_slots" may not be initialized`. To fix it, replace every `static constexpr bool all_slots = false;` with `static bool all_slots = false;` in the file `torch/csrc/jit/api/module.h`. More details can be found at https://github.com/pytorch/pytorch/issues/39394.
- "error: a member with an in-class initializer must be const"
If you are building mmcv-full on Windows and the version of PyTorch is 1.6.0, you will probably encounter the error `"- torch/include\torch/csrc/jit/api/module.h(483): error: a member with an in-class initializer must be const"`. To fix it, replace every `CONSTEXPR_EXCEPT_WIN_CUDA` with `const` in the file `torch/include\torch/csrc/jit/api/module.h`. More details can be found at https://github.com/open-mmlab/mmcv/issues/575.
- "error: member "torch::jit::ProfileOptionalOp::Kind" may not be initialized"
If you are building mmcv-full on Windows and the version of PyTorch is 1.7.0, you will probably encounter the error `torch/include\torch/csrc/jit/ir/ir.h(1347): error: member "torch::jit::ProfileOptionalOp::Kind" may not be initialized`. Solving it requires modifying several files in PyTorch:
- delete `static constexpr Symbol Kind = ::c10::prim::profile;` and `static constexpr Symbol Kind = ::c10::prim::profile_optional;` in `torch/include\torch/csrc/jit/ir/ir.h`
- replace `explicit operator type&() { return *(this->value); }` with `explicit operator type&() { return *((type*)this->value); }` in `torch\include\pybind11\cast.h`
- replace all the `CONSTEXPR_EXCEPT_WIN_CUDA` with `const` in `torch/include\torch/csrc/jit/api/module.h`
More details can be found at https://github.com/pytorch/pytorch/pull/45956.
- Compatibility issue between MMCV and MMDetection; "ConvWS is already registered in conv layer"
Please install the correct version of MMCV for the version of your MMDetection following the [installation instruction](https://mmdetection.readthedocs.io/en/latest/get_started.html#installation).
### Usage
- "RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one"
1. This error occurs because some parameters did not take part in computing the loss. The code may contain multiple branches, some of which do not contribute to the loss. More details at https://github.com/pytorch/pytorch/issues/55582.
2. You can set `find_unused_parameters` to `True` in DDP, or find the unused parameters manually.
- "RuntimeError: Trying to backward through the graph a second time"
`GradientCumulativeOptimizerHook` and `OptimizerHook` must not be set at the same time. Doing so causes `loss.backward()` to be called twice, so the program raises a `RuntimeError`. Set only one of them. More details at https://github.com/open-mmlab/mmcv/issues/1379.
@@ -48,7 +48,9 @@ class Converter1(object):
```python
'Converter1' -> <class 'Converter1'>
```
```{note}
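The registry mechanism is triggered only when the file that contains the module is imported, so you need to import that file somewhere. More details can be found at https://github.com/open-mmlab/mmdetection/issues/5974.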
```
If the module is successfully registered, you can use this converter through the config file, as shown below:
```python
......