Resumen
Video object detection is an important research direction of computer vision. The task of video object detection is to detect and classify moving objects in a sequence of images. Based on the static image object detector, most of the existing video object detection methods use the unique temporal correlation of video to solve the problem of missed detection and false detection caused by moving object occlusion and blur. Another video object detection model guided by an optical flow network is widely used. Feature aggregation of adjacent frames is performed by estimating the optical flow field. However, there are many redundant computations for feature aggregation of adjacent frames. To begin with, this paper improved Faster RCNN by Feature Pyramid and Dynamic Region Aware Convolution. Then the S-SELSA module is proposed from the perspective of semantic and feature similarity. Feature similarity is obtained by a modified SSIM algorithm. The module can aggregate the features of frames globally to avoid redundancy. Finally, the experimental results on the ImageNet VID and DET datasets show that the mAP of the method proposed in this paper is 83.55%, which is higher than the existing methods.