Applied Sciences, Vol. 12, Issue 20 (2022)
ARTICLE

The Multi-Hot Representation-Based Language Model to Maintain Morpheme Units

Ju-Sang Lee, Joon-Choul Shin and Cheol-Young Ock

Abstract

Language models have brought rapid improvements to Natural Language Processing (NLP) performance since the emergence of large-scale deep learning models. Previous language models have represented natural language with token units chosen to reduce the proportion of unknown tokens. However, tokenization raises language-specific issues. One key issue is that separating words by morphemes may distort the original meaning; it can also be difficult to apply the information associated with a word, such as its semantic network. We propose a multi-hot representation language model that maintains Korean morpheme units. This method represents a single morpheme as a group of syllable-based tokens when no matching morpheme token exists. The proposed model demonstrates performance comparable to existing models on various NLP tasks. By maintaining morpheme units, it preserves the minimum unit of meaning and can easily accommodate the extension of semantic information.
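The core idea, keeping a morpheme as one unit when it is in the vocabulary and otherwise representing it as a group of its syllable tokens, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy vocabulary, the <unk> handling, and the encode_morpheme helper are all assumptions made for the example.

# Minimal sketch of a multi-hot morpheme representation (illustrative only).
# An in-vocabulary morpheme keeps a single token id (one-hot vector);
# an out-of-vocabulary morpheme is represented by the set of its syllable
# tokens, giving a multi-hot vector over the same shared vocabulary.

import numpy as np

# Toy vocabulary: a few whole morphemes plus individual syllables.
vocab = {"<unk>": 0, "학교": 1, "에": 2, "학": 3, "교": 4, "생": 5}

def encode_morpheme(morpheme: str) -> np.ndarray:
    """Return a one-hot vector if the morpheme is in the vocabulary,
    otherwise a multi-hot vector over its syllable tokens."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    if morpheme in vocab:
        vec[vocab[morpheme]] = 1.0          # one-hot: morpheme kept as a unit
    else:
        for syllable in morpheme:           # fall back to syllable tokens
            vec[vocab.get(syllable, vocab["<unk>"])] = 1.0
    return vec

print(encode_morpheme("학교"))   # in-vocabulary morpheme -> one-hot
print(encode_morpheme("학생"))   # OOV morpheme -> multi-hot over 학 and 생

Under this scheme the morpheme boundary is never broken: an unknown morpheme is still addressed as a single unit whose representation happens to activate several syllable tokens at once.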