Abstract
To alleviate the performance degradation caused by the differing sound durations of competing classes in sound event detection, we propose a method that uses multi-scale features. We add a feature-pyramid component to a deep neural network architecture based on the Transformer encoder, which recent studies have shown to model the temporal correlation of sound signals more effectively than conventional recurrent neural networks. Convolutional neural network layers produce two-dimensional acoustic features that are fed into the Transformer encoders. The outputs of the Transformer encoders at different levels of the network are combined into multi-scale features, which are passed to a fully connected feed-forward neural network that acts as the final classification layer. The method is motivated by the idea that multi-scale features make the network more robust to sound durations that vary by class. We also apply the proposed method to a mean-teacher model based on the Transformer encoder to demonstrate its effectiveness with a large set of unlabeled data. We conducted experiments on the DCASE 2019 Task 4 dataset to evaluate the proposed method. The experimental results show that the proposed architecture outperforms the baseline network without multi-scale features.
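As a rough illustration of what combining encoder outputs from different levels could look like, the sketch below fuses two feature sequences with different time resolutions. It is a minimal sketch and not the authors' implementation. The abstract does not say how the outputs are combined, so nearest-neighbour upsampling followed by per-frame concatenation is an assumption, as are the names `upsample_time` and `fuse_multiscale` and the toy values.

```python
from typing import List

# A feature sequence is a list of frames; each frame is a list of floats.
Frames = List[List[float]]


def upsample_time(frames: Frames, target_len: int) -> Frames:
    """Nearest-neighbour upsampling along the time axis (assumed scheme)."""
    src_len = len(frames)
    return [frames[min(t * src_len // target_len, src_len - 1)]
            for t in range(target_len)]


def fuse_multiscale(levels: Frames and List[Frames]) -> Frames:
    """Concatenate per-frame features from all levels at the finest resolution.

    Concatenation is an assumption; the source only says the outputs
    "are combined".
    """
    target_len = max(len(level) for level in levels)
    upsampled = [upsample_time(level, target_len) for level in levels]
    fused: Frames = []
    for t in range(target_len):
        frame: List[float] = []
        for level in upsampled:
            frame.extend(level[t])
        fused.append(frame)
    return fused


# Hypothetical encoder outputs: a fine level (4 frames, 2 dims)
# and a coarse level (2 frames, 1 dim).
fine = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
coarse = [[10.0], [20.0]]
fused = fuse_multiscale([fine, coarse])
```

In this toy case each fused frame holds the two fine-level values followed by the coarse-level value covering the same time span. A frame-wise classifier would then take each fused frame as input.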