Whole-Body Keypoint and Skeleton Augmented RGB Networks for Video Action Recognition

Zizhao Guo and Sancong Ying

Abstract

Incorporating multi-modality data is an effective way to improve action recognition performance. Based on this idea, we investigate a new data modality in which Whole-Body Keypoint and Skeleton (WKS) labels are used to capture refined body information. Rather than directly aggregating multiple modalities, we use distillation to transfer the feature-extraction ability of the WKS network to an RGB network, which classifies actions and is fed only RGB clips. Inspired by the success of transformers in vision tasks, we design an architecture that combines three-dimensional (3D) convolutional neural networks (CNNs) with the Swin transformer to extract spatiotemporal features. Furthermore, because the clips of a video are not equally discriminative, we present a new method for aggregating the clip-level classification results, which further improves performance. Experimental results show that our framework achieves an accuracy of 93.4% on the UCF-101 dataset with only RGB input.
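The abstract does not specify the aggregation function, so the following is only an illustrative sketch of the general idea that clips need not contribute equally to the video-level prediction. It compares plain averaging of clip-level probabilities with a confidence-weighted average. The weighting rule (each clip's maximum softmax probability) and the function names `aggregate_uniform` and `aggregate_confidence_weighted` are assumptions made for illustration, not the method proposed in the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_uniform(clip_logits):
    """Baseline: average the clip-level class probabilities with equal weight."""
    probs = [softmax(l) for l in clip_logits]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]

def aggregate_confidence_weighted(clip_logits):
    """Hypothetical alternative: weight each clip by its own confidence.

    The weight of a clip is its maximum class probability (an assumption
    for illustration only), so confident clips count more than ambiguous ones.
    """
    probs = [softmax(l) for l in clip_logits]
    weights = [max(p) for p in probs]
    total = sum(weights)
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) / total
        for c in range(n_classes)
    ]

# Three clips from one video, three classes: the first clip is confident
# about class 0, the other two are nearly uninformative.
clips = [[4.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.5, 0.0]]
uniform = aggregate_uniform(clips)
weighted = aggregate_confidence_weighted(clips)
```

In this toy input, the weighted scheme assigns more probability mass to class 0 than uniform averaging does, because the single confident clip is given a larger weight than the two ambiguous ones.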

Keywords

action recognition - aggregation function - multi-modality - Swin transformer

Access

ISSUE

Volume: 12 Part: 12 (2022)

SUBJECTS

ENGINEERING AND CIVIL CONSTRUCTION
TECHNOLOGY

DOI

https://doi.org/10.3390/app12126215
