    clamp reference point max to 1.0 to avoid NaN in regressed bbox · 0a38f8c8
    Zhicheng Yan authored
    Summary:
    When training DF-DETR with a Swin Transformer backbone, which uses a large size_divisibility of 224 (= 32 * 7) and therefore potentially adds more zero-padding, we find that the regressed boxes can contain NaN values and fail the assertion here (https://fburl.com/code/p27ztcce).
    
    This issue has two potential causes, each addressed by a separate fix.
    - Fix 1. In the DF-DETR encoder, the reference points prepared by `get_reference_points()` can contain normalized x, y coordinates larger than 1 because of rounding during mask interpolation across feature scales (specific examples are available upon request). We therefore clamp the x, y coordinates to a maximum of 1.0.
    
    - Fix 2. The MLP used in the bbox_embed heads contains 3 FC layers, which might be too many. We introduce an argument `BBOX_EMBED_NUM_LAYERS` that lets users configure the number of FC layers. The change is backward compatible.
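
The two fixes can be illustrated with a minimal pure-Python sketch. It is not the code from this commit: the helper names (`reference_coords`, `clamp_max`, `mlp_layer_dims`) are hypothetical. It also rests on two assumptions: that reference coordinates are cell centers divided by the valid (unpadded) extent, and that the MLP has a DETR-style layout with equal-width hidden layers. The real `get_reference_points()` and bbox_embed heads operate on tensors.

```python
def reference_coords(size, valid_ratio):
    """Normalized cell centers along one axis (assumed formula).

    Dividing by the valid (unpadded) extent means a valid_ratio that is
    rounded slightly too small pushes the last coordinates above 1.0.
    """
    return [(i + 0.5) / (valid_ratio * size) for i in range(size)]


def clamp_max(coords, upper=1.0):
    """Fix 1 (sketch): cap every coordinate at `upper`."""
    return [min(c, upper) for c in coords]


def mlp_layer_dims(input_dim, hidden_dim, output_dim, num_layers=3):
    """Fix 2 (sketch): (in, out) sizes of each FC layer in the MLP.

    The default of 3 mirrors the previous fixed depth, so callers that
    do not pass num_layers get the same layout as before.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    hidden = [hidden_dim] * (num_layers - 1)
    return list(zip([input_dim] + hidden, hidden + [output_dim]))


# A feature row of width 4 whose valid ratio was estimated as 0.75.
raw = reference_coords(size=4, valid_ratio=0.75)
clamped = clamp_max(raw)

# Layer layouts for the default depth and for a shallower head.
default_head = mlp_layer_dims(256, 256, 4)
shallow_head = mlp_layer_dims(256, 256, 4, num_layers=1)
```

In this sketch the last raw coordinate exceeds 1.0 and is capped by `clamp_max`, while in-range coordinates pass through unchanged. Leaving `num_layers` at its default reproduces the three-layer layout, which is how a setting like `BBOX_EMBED_NUM_LAYERS` can stay backward compatible.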
    
    Reviewed By: zhanghang1989
    
    Differential Revision: D30661167
    
    fbshipit-source-id: c7e94983bf1ec07426fdf1b9d363e5163637f21a
position_encoding.py 3.8 KB