JOURNAL
Information

ARTICLE

TITLE

A Study of Multilingual Toxic Text Detection Approaches under Imbalanced Sample Distribution

Guizhe Song, Degen Huang and Zhifeng Xiao

Abstract

Multilingual characteristics, lack of annotated data, and imbalanced sample distribution are the three main challenges for toxic comment analysis in a multilingual setting. This paper proposes a multilingual toxic text classifier that adopts a novel fusion strategy combining different loss functions and multiple pre-training models. Specifically, the proposed learning pipeline starts with a series of pre-processing steps, including translation, word segmentation, purification, text digitization, and vectorization, to convert word tokens to a vectorized form suitable for the downstream tasks. Two models, multilingual Bidirectional Encoder Representations from Transformers (MBERT) and XLM-RoBERTa (XLM-R), are employed for pre-training through Masked Language Modeling (MLM) and Translation Language Modeling (TLM), which incorporate semantic and contextual information into the models. We train six base models and fuse them into three fusion models, using the F1 scores as the weights. The models are evaluated on the Jigsaw Multilingual Toxic Comment dataset. Experimental results show that the best fusion model outperforms the two state-of-the-art models, MBERT and XLM-R, in F1 score by 5.05% and 0.76%, respectively, verifying the effectiveness and robustness of the proposed fusion strategy.
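The abstract states only that base models are fused "using the F1 scores as the weights"; it does not say whether the fusion acts on probabilities or logits, or how the weights are normalized. The following is a minimal sketch under those stated assumptions, not the authors' implementation: each base model is assumed to output a per-comment toxic probability, and the fused score is assumed to be the F1-weighted average of those probabilities. The function names `f1_score` and `fuse` are hypothetical.

```python
from typing import List, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive (toxic) class; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fuse(
    probs_per_model: Sequence[Sequence[float]],
    f1_weights: Sequence[float],
) -> List[float]:
    """F1-weighted average of per-model toxic probabilities.

    Assumption: weights are normalized to sum to 1, so the fused value
    stays in [0, 1]. The source does not specify the normalization.
    """
    total = sum(f1_weights)
    if total <= 0:
        raise ValueError("at least one model needs a positive F1 weight")
    n_samples = len(probs_per_model[0])
    return [
        sum(w * probs[i] for w, probs in zip(f1_weights, probs_per_model)) / total
        for i in range(n_samples)
    ]


# Toy usage: two hypothetical base models scored on a validation set,
# then fused on two unseen comments.
weights = [0.8, 0.4]  # illustrative validation F1 scores, not from the paper
fused = fuse([[0.9, 0.2], [0.5, 0.4]], weights)
```

Because F1 focuses on the positive class, using it as the weight gives more say to base models that handle the rare toxic class well, which is the relevant property under an imbalanced sample distribution.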

Keywords

toxic comment - imbalanced positive and negative samples - pre-training models - multilingual classification - XLM-RoBERTa - MBERT

ISSUE

Volume: 12 Issue: 5 (2021)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY

SIMILAR JOURNALS

Applied Sciences
Information
Algorithms

DOI

https://doi.org/10.3390/info12050205

Similar articles

On Isotropy of Multimodal Embeddings

Kirill Tyshchuk, Polina Karpikova, Andrew Spiridonov, Anastasiia Prutianova, Anton Razzhigaev and Alexander Panchenko

Embeddings, i.e., vector representations of objects, such as texts, images, or graphs, play a key role in deep learning methodologies nowadays. Prior research has shown the importance of analyzing the isotropy of textual embeddings for transformer-based ...

Journal: Information

Knowledge Distillation-Based Multilingual Code Retrieval

Wen Li, Junfei Xu and Qi Chen

Semantic code retrieval is the task of retrieving relevant codes based on natural language queries. Although it is related to other information retrieval tasks, it needs to bridge the gaps between the language used in the code (which is usually syntax-sp...

Journal: Algorithms

Extrapolation of Human Estimates of the Concreteness/Abstractness of Words by Neural Networks of Various Architectures

Valery Solovyev and Vladimir Ivanov

In a great deal of theoretical and applied cognitive and neurophysiological research, it is essential to have more vocabularies with concreteness/abstractness ratings. Since creating such dictionaries by interviewing informants is labor-intensive, consid...

Journal: Applied Sciences

Fake News Spreaders Detection: Sometimes Attention Is Not All You Need

Marco Siino, Elisa Di Nuovo, Ilenia Tinnirello and Marco La Cascia

Guided by a corpus linguistics approach, in this article we present a comparative evaluation of State-of-the-Art (SotA) models, with a special focus on Transformers, to address the task of Fake News Spreaders (i.e., users that share Fake News) detection....

Journal: Information

Employing a Multilingual Transformer Model for Segmenting Unpunctuated Arabic Text

Abdullah M. Alshanqiti, Sami Albouq, Ahmad B. Alkhodre, Abdallah Namoun and Emad Nabil

Long unpunctuated texts containing complex linguistic sentences are a stumbling block to processing any low-resource languages. Thus, approaches that attempt to segment lengthy texts with no proper punctuation into simple candidate sentences are a vitall...

Journal: Applied Sciences
