Abstract
Recent research shows that the recurrent neural network transducer (RNN-T) architecture has become a mainstream approach for streaming speech recognition. In this work, we investigate the VGG2 network as the input layer to the RNN-T for streaming speech recognition. Specifically, before the input features are passed to the RNN-T, we introduce a gated-VGG2 block, which uses the first two layers of VGG16 to extract contextual information in the time domain and then applies an SENet-style gating mechanism to control which channel-wise information is propagated to the RNN-T. The results show that the RNN-T model with the proposed gated-VGG2 block yields a significant performance improvement over the existing RNN-T model, and it achieves lower latency and a lower character error rate than a Transformer-based model.
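As a rough illustration of the front-end described above, the following PyTorch sketch combines two VGG16-style 3x3 convolutions with a squeeze-and-excitation channel gate. The class name, channel counts, reduction ratio, and input shape are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class GatedVGG2Block(nn.Module):
    """Hypothetical sketch of a gated-VGG2 front-end: two VGG16-style
    3x3 conv layers for temporal context, followed by an SENet-style
    channel gate. All layer sizes here are assumptions."""

    def __init__(self, in_ch=1, out_ch=64, reduction=16):
        super().__init__()
        # First two conv layers of VGG16 (3x3, same padding) + pooling
        self.vgg2 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # SENet-style squeeze-and-excitation gate over channels
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # squeeze: global spatial context
            nn.Conv2d(out_ch, out_ch // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, kernel_size=1),
            nn.Sigmoid(),             # per-channel gate in (0, 1)
        )

    def forward(self, x):
        # x: (batch, 1, time, freq) log-mel features
        h = self.vgg2(x)
        # scale each channel before handing features to the RNN-T encoder
        return h * self.gate(h)

feats = torch.randn(4, 1, 100, 80)  # 4 utterances, 100 frames, 80 mel bins
out = GatedVGG2Block()(feats)
print(out.shape)
```

The 2x2 max-pooling halves both the time and frequency axes, so a (4, 1, 100, 80) input becomes a (4, 64, 50, 40) feature map; the sigmoid gate rescales each of the 64 channels before the features enter the transducer encoder.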