projects_oss/detr/detr/models/position_encoding.py · 0a38f8c829a24e186180382af1d703149b270d7d · OpenDAS / d2go

clamp reference point max to 1.0 to avoid NaN in regressed bbox · 0a38f8c8

Zhicheng Yan authored Sep 01, 2021

Summary:
For training DF-DETR with swin-transformer backbone which uses large size_divisibility 224 (=32 * 7) and potentially has more zero-padding, we find the regressed box can contain NaN values and fail the assertion here (https://fburl.com/code/p27ztcce).

This issue might be caused by two potential reasons.
- Fix 1. In DF-DETR encoder, the reference points prepared by `get_reference_points()` can contain normalized x,y coordinates larger than 1 due to the rounding issues during mask interpolation across feature scales (specific examples can be given upon request LoL). Thus, we clamp max of x,y coordinates to 1.0.

- Fix 2. The MLP used in bbox_embed heads contains 3 FC layers, which might be too many. We introduce an argument `BBOX_EMBED_NUM_LAYERS` to allow users to configure the number of FC layers. This change is back-compatible.

Reviewed By: zhanghang1989

Differential Revision: D30661167

fbshipit-source-id: c7e94983bf1ec07426fdf1b9d363e5163637f21a

0a38f8c8

position_encoding.py 3.8 KB

Replace position_encoding.py