Abstract
Acoustic echo is a common problem in full-duplex telecommunication systems and can degrade the quality of the desired speech during double-talk periods. The problem is especially challenging in low signal-to-echo ratio (SER) scenarios, such as hands-free conversations over mobile phones when the loudspeaker volume is high. This paper proposes a two-stage deep-learning approach to residual echo suppression that focuses on the low-SER scenario. The first stage is a speech spectrogram masking model integrated with a double-talk detector (DTD). The second stage is a spectrogram refinement model optimized for speech quality by minimizing a loss function related to the perceptual evaluation of speech quality (PESQ). The proposed integration of the DTD with the masking model outperforms several other configurations based on previous studies. An ablation study shows the contribution of each part of the proposed system. We evaluate the system at several SERs and demonstrate its effectiveness in the challenging setting of a very low SER. Finally, the proposed approach outperforms competing methods on several residual echo suppression metrics. We conclude that the proposed system is well suited to low-SER residual echo suppression.
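To make the notion of a low SER concrete, the following minimal sketch computes the signal-to-echo ratio in decibels as the ratio of near-end speech power to echo power. This is the commonly used definition and is assumed here; the abstract does not state the exact formula used in the paper. The function name `ser_db` and the toy signals are hypothetical and serve only as illustration.

```python
import math


def ser_db(near_end_speech, echo):
    """Signal-to-echo ratio in dB: 10 * log10(speech power / echo power).

    Assumed definition (not taken from the paper): both inputs are
    equal-length sequences of time-domain samples.
    """
    speech_power = sum(x * x for x in near_end_speech)
    echo_power = sum(x * x for x in echo)
    return 10.0 * math.log10(speech_power / echo_power)


# Toy example: the echo has ten times the amplitude of the near-end speech,
# so its power is 100 times larger and the SER is about -20 dB.
speech = [0.1, -0.2, 0.3, -0.1]
echo = [10.0 * x for x in speech]
low_ser = ser_db(speech, echo)
```

A negative SER means that the echo is stronger than the near-end speech, which can occur in the hands-free, high-loudspeaker-volume setting described above.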