A Robust Morpheme Sequence and Convolutional Neural Network-Based Uyghur and Kazakh Short Text Classification

Sardar Parhat

Mijit Ablimit and Askar Hamdulla

Abstract

In this paper, we study short text classification for two similar low-resource languages, Uyghur and Kazakh, using a multilingual morphological analyzer. The online linguistic resources for these languages are generally noisy, so preprocessing is necessary and can significantly improve accuracy. Uyghur and Kazakh have derivational morphology, in which words are formed by concatenating stems with suffixes. In these languages, terms usually represent the content of a text, while functional parts are excluded as stop words. By extracting stems, we can collect the necessary terms and exclude stop words. The morpheme segmentation tool splits text into morphemes with 95% reliability. After preparing both word-based and morpheme-based training corpora, we apply a convolutional neural network (CNN) as the feature selection and text classification algorithm. Experimental results show that the morpheme-based approach outperforms the word-based approach. Word embedding is frequently used for text representation, both within neural network frameworks and as a numerical representation. It maps language units into a sequential vector space based on context, which makes it a natural way to extract and predict out-of-vocabulary (OOV) words from context information. Multilingual morphological analysis provides a convenient way to handle processing tasks for low-resource languages such as Uyghur and Kazakh.
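The pipeline described in the abstract (keep the stems, embed them, then apply convolution filters) can be illustrated with a minimal sketch. It is not the authors' implementation. The `+`-separated segmentation format, the placeholder tokens, the random embedding table, and the names `extract_stems`, `build_vocab`, `embed`, and `conv_max_pool` are all assumptions made for illustration. Only the forward pass of a convolution with max-over-time pooling is shown, with no training.

```python
import random

# Assumption: text is already segmented, morphemes joined by "+", stem first.
# This format and the placeholder tokens below are hypothetical; they are not
# the output of the morphological analyzer described in the abstract.
def extract_stems(segmented_text):
    """Keep the first morpheme (assumed to be the stem) of each token."""
    return [token.split("+")[0] for token in segmented_text.split()]

def build_vocab(documents):
    """Map each stem to an integer id; id 0 is reserved for unknown stems."""
    vocab = {"<unk>": 0}
    for doc in documents:
        for stem in doc:
            vocab.setdefault(stem, len(vocab))
    return vocab

def embed(stems, vocab, table):
    """Look up one embedding vector per stem (unknown stems map to row 0)."""
    return [table[vocab.get(stem, 0)] for stem in stems]

def conv_max_pool(embedded, filters, width):
    """Slide each filter over windows of `width` vectors, apply ReLU,
    and keep the maximum activation over all positions."""
    features = []
    for filt in filters:
        best = 0.0  # starting at 0 makes the max equivalent to ReLU + max-pool
        for start in range(len(embedded) - width + 1):
            window = [x for vec in embedded[start:start + width] for x in vec]
            activation = sum(w * x for w, x in zip(filt, window))
            best = max(best, activation)
        features.append(best)
    return features

# Toy usage with placeholder tokens and randomly initialised parameters.
rng = random.Random(0)
dim, width, n_filters = 4, 2, 3
docs = [extract_stems(t) for t in ["stem1+sfx1+sfx2 stem2+sfx3 stem3",
                                   "stem2+sfx1 stem4+sfx2"]]
vocab = build_vocab(docs)
table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(len(vocab))]
filters = [[rng.uniform(-1, 1) for _ in range(width * dim)]
           for _ in range(n_filters)]
features = conv_max_pool(embed(docs[0], vocab, table), filters, width)
```

In a full classifier, the pooled feature vector would feed a trained output layer. Dropping the suffixes before building the vocabulary is what keeps the stem vocabulary smaller than a word-level one.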

Keywords

Uyghur and Kazakh - text classification - CNN - morphology

ISSUE

Volume: 10 Part: 12 (2019)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY
