Home  /  Applied Sciences  /  Vol. 12, Issue 20 (2022)  /  Article
ARTICLE
TITLE

The Multi-Hot Representation-Based Language Model to Maintain Morpheme Units

Ju-Sang Lee, Joon-Choul Shin and Cheol-Young Ock

Abstract

Natural language models have driven rapid advances in Natural Language Processing (NLP) performance since the emergence of large-scale deep learning models. Language models have conventionally represented natural language with token units chosen to reduce the proportion of unknown tokens. However, tokenization in language models raises language-specific issues. A key issue is that splitting words into morphemes can distort their original meaning; it can also be difficult to apply information associated with a word, such as its semantic network. We propose a multi-hot representation language model that maintains Korean morpheme units. This method represents a single morpheme as a group of syllable-based tokens in cases where no matching token exists. The model demonstrates performance comparable to existing models on various natural language processing tasks. By maintaining morpheme units, the proposed model preserves the minimum unit of meaning and can easily accommodate extensions of semantic information.

Similar articles

Andrei Paraschiv, Teodora Andreea Ion and Mihai Dascalu    
The advent of online platforms and services has revolutionized communication, enabling users to share opinions and ideas seamlessly. However, this convenience has also brought about a surge in offensive and harmful language across various communication m...
Journal: Information

 
Jiaming Li, Ning Xie and Tingting Zhao    
In recent years, with the rapid advancements in Natural Language Processing (NLP) technologies, large models have become widespread. Traditional reinforcement learning algorithms have also started experimenting with language models to optimize training. ...
Journal: Algorithms

 
Fenfang Li, Zhengzhang Zhao, Li Wang and Han Deng    
Sentence Boundary Disambiguation (SBD) is crucial for building datasets for tasks such as machine translation, syntactic analysis, and semantic analysis. Currently, most automatic sentence segmentation in Tibetan adopts the methods of rule-based and stat...
Journal: Applied Sciences

 
Zhe Yang, Yi Huang, Yaqin Chen, Xiaoting Wu, Junlan Feng and Chao Deng    
Controllable Text Generation (CTG) aims to modify the output of a Language Model (LM) to meet specific constraints. For example, in a customer service conversation, responses from the agent should ideally be soothing and address the user's dissatisfactio...
Journal: Applied Sciences

 
Rongsheng Li, Jin Xu, Zhixiong Cao, Hai-Tao Zheng and Hong-Gee Kim    
In the realm of large language models (LLMs), extending the context window for long text processing is crucial for enhancing performance. This paper introduces SBA-RoPE (Segmented Base Adjustment for Rotary Position Embeddings), a novel approach designed...
Journal: Applied Sciences