Resumen
The introduction of artificial neural networks to speech recognition applications has sparked the rapid development and popularization of digital assistants. These digital assistants constantly monitor the audio captured by a microphone for a small set of keywords. Upon recognizing a keyword, a larger audio recording is saved and processed by a separate, more complex neural network. Deep neural networks have become an effective tool for keyword spotting. Their implementation in low-cost edge devices, however, is still challenging due to limited resources on board. This research demonstrates the process of implementing, modifying, and training neural network architectures for keyword spotting. The trained models are also subjected to post-training quantization to evaluate its effect on model performance. The models are evaluated using metrics relevant to deployment on resource-constrained systems, such as model size, memory consumption, and inference latency, in addition to the standard comparisons of accuracy and parameter count. The process of deploying the trained and quantized models is also explored through configuring the microcontroller or FPGA onboard the edge devices. By selecting multiple architectures, training a collection of models, and comparing the models using the techniques demonstrated in this research, a developer can find the best-performing neural network for keyword spotting given the constraints of a target embedded system.