Accuracy analysis of machine learning models using vectorization methods for heterogeneous text data classification tasks

A.N. Alpatov

K.S. Popov

A.N. Chesalin

Abstract

This paper addresses natural language processing with machine learning, specifically the classification of unstructured, heterogeneous text data sets. It presents a comparative analysis of several widely used supervised machine learning models for multi-class classification of heterogeneous text sources, combined with different feature extraction methods. The paper examines how the accuracy of class prediction depends on the quality of the text corpora used when different vectorization methods are applied to the preprocessed source data. Based on this analysis, it proposes a generalized scheme for software that implements an algorithm for building a classification model for unstructured texts, organized as a pipeline for processing the text corpus and managing the machine learning models. The experiment showed that classifier accuracy differed between corpora whose source text data differed in quality: the classifiers performed worse on the corpus of musical composition texts and better on the corpus of news summaries. The paper also shows that, under certain conditions, techniques intended to improve classification quality, such as stacking and adding extra classification features, can degrade class prediction rather than improve it, which ultimately lowers the final accuracy of the resulting model.
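The vectorize-then-classify pipeline described above can be illustrated with a minimal sketch. The abstract does not name the vectorization methods or models the authors used, so everything here is an assumption: TF-IDF stands in for "a vectorization method", a cosine nearest-centroid rule stands in for "a classifier", and the four training sentences and their labels are hypothetical toy data. Only the Python standard library is used.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplistic preprocessing (assumption): lowercase and split on whitespace.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency for every term seen in training.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    # Sparse, L2-normalized TF-IDF vector; terms unseen in training are ignored.
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def fit_centroids(docs, labels, idf):
    # Sum the document vectors of each class; cosine similarity ignores scale,
    # so the sum serves as the class centroid.
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for term, value in tfidf(doc, idf).items():
            sums[label][term] += value
    return {label: dict(vec) for label, vec in sums.items()}


def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def predict(text, idf, centroids):
    # Multi-class prediction: pick the class whose centroid is most similar.
    vec = tfidf(text, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy corpus, not data from the paper.
train_docs = [
    "stock markets fell sharply today",
    "the central bank raised interest rates",
    "the team won the football match",
    "the striker scored two goals",
]
train_labels = ["economy", "economy", "sport", "sport"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)

print(predict("bank rates fell", idf, centroids))
print(predict("the team scored goals", idf, centroids))
```

Swapping `tfidf` for another vectorizer while keeping `predict` unchanged is one way to compare vectorization methods on the same corpus, which is the kind of comparison the abstract describes.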

PAGES

pp. 47 - 53

ISSUE

Volume: 10 Number: 7 Part: 0 (2022)

SUBJECTS

ENGINEERING AND CIVIL CONSTRUCTION
TECHNOLOGY
