The Development of a Kazakh Speech Recognition Model Using a Convolutional Neural Network with Fixed Character Level Filters

Nurgali Kadyrbek    
Madina Mansurova    
Adai Shomanov and Gaukhar Makharova    

Abstract

This study addresses the transcription of Kazakh speech under dynamically changing conditions. It discusses key aspects of the phonetic structure of the Kazakh language, technical considerations in collecting a transcribed audio corpus, and the use of deep neural networks for speech modeling. A high-quality transcribed audio corpus of 554 hours was collected, providing statistics on letter and syllable frequencies as well as demographic attributes of the native speakers, such as gender, age, and region of residence. The corpus covers a broad vocabulary and serves as a valuable resource for developing speech-related modules. Machine learning experiments were conducted with the DeepSpeech2 model, a sequence-to-sequence architecture with an encoder, a decoder, and an attention mechanism. To increase the robustness of the model, convolutional filters initialized with character-level embeddings were introduced to reduce the dependence on precise positioning in the feature maps. Training jointly optimized the convolutional filters for spectrograms and for character-level features. The proposed approach, combining supervised and unsupervised learning, reduced the model size by 66.7% while preserving comparable accuracy. Evaluation on the test set showed a character error rate (CER) 7.6% lower than that of existing models, demonstrating state-of-the-art performance. The proposed architecture also enables deployment on resource-constrained platforms. Overall, this study presents a high-quality audio corpus, an improved speech recognition model, and promising results applicable to speech-related applications and to languages beyond Kazakh.
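The abstract describes convolutional filters whose weights are initialized from (and fixed to) character-level embeddings, so the convolution over a character sequence directly recovers embedding activations. The paper's actual architecture is not reproduced here; the following is a minimal pure-Python sketch of that initialization idea, assuming width-1 filters over one-hot character inputs. The alphabet subset, embedding dimension, and function names are illustrative only.

```python
import random

# Hypothetical sketch of "fixed character-level filters": each filter is
# frozen to one dimension of a character embedding table, so convolving
# one-hot characters with the filter bank reproduces the embeddings.
ALPHABET = list("аәбгғдежз")  # small illustrative subset of Kazakh letters
EMB_DIM = 4

random.seed(0)
# Character embedding table: one EMB_DIM-vector per character.
embeddings = {ch: [random.uniform(-1, 1) for _ in range(EMB_DIM)]
              for ch in ALPHABET}

def one_hot(ch):
    """One-hot encode a single character over ALPHABET."""
    return [1.0 if c == ch else 0.0 for c in ALPHABET]

# Fixed filter bank: filter k holds embedding dimension k for every character.
filters = [[embeddings[c][k] for c in ALPHABET] for k in range(EMB_DIM)]

def conv1x1(text):
    """Apply the fixed width-1 filters at each character position."""
    out = []
    for ch in text:
        x = one_hot(ch)
        out.append([sum(w * v for w, v in zip(f, x)) for f in filters])
    return out

feats = conv1x1("аға")
# Each position's feature vector equals that character's embedding,
# because the filters were initialized from the embedding table.
assert feats[0] == embeddings["а"]
```

In a real model these filters would be one frozen branch of the network (e.g. trained jointly with the spectrogram convolutions described in the abstract), rather than a standalone lookup as above.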

Similar articles

Eduardo Medeiros, Leonel Corado, Luís Rato, Paulo Quaresma and Pedro Salgueiro    
Automatic speech recognition (ASR), commonly known as speech-to-text, is the process of transcribing audio recordings into text, i.e., transforming speech into the respective sequence of words. This paper presents a deep learning ASR system optimization ... see more
Journal: Future Internet

 
Jacob Bushur and Chao Chen    
The introduction of artificial neural networks to speech recognition applications has sparked the rapid development and popularization of digital assistants. These digital assistants constantly monitor the audio captured by a microphone for a small set o... see more
Journal: Future Internet

 
Muhammad Atif and Valentina Franzoni    
Users of web or chat social networks typically use emojis (e.g., smilies, memes, hearts) to convey in their textual interactions the emotions underlying the context of the communication, aiming for better interpretability, especially for short polysemous... see more
Journal: Future Internet

 
Purushottam Sharma, Devesh Tulsian, Chaman Verma, Pratibha Sharma and Nancy Nancy    
Language plays a vital role in the communication of ideas, thoughts, and information to others. Hearing-impaired people also understand our thoughts using a language known as sign language. Every country has a different sign language which is based on th... see more
Journal: Future Internet

 
Shinnosuke Isobe, Satoshi Tamura, Satoru Hayamizu, Yuuto Gotoh and Masaki Nose    
Recently, automatic speech recognition (ASR) and visual speech recognition (VSR) have been widely researched owing to the development in deep learning. Most VSR research works focus only on frontal face images. However, assuming real scenes, it is obviou... see more
Journal: Future Internet