Redirigiendo al acceso original de articulo en 18 segundos...
ARTÍCULO
TITULO

Comparative Analysis of the Accuracy of Methods for Visualizing the Structure of a Text Collection

Fedor Krasnov    

Resumen

Visualization of multidimensional data is the most important stage of data research. Often, decisions on the further stages of the study are made from the flat view of the data based on "rough proportions". High visibility and persuasiveness of representation on the plane of multidimensional vectors with the preservation of distances is used in models of distributive semantics (Word2Vec, GloVe, NaVec) successfully. On the other hand, the inaccuracy of the two-dimensional projection can lead to time being spent searching for non-existent multidimensional structures. The author set the task to evaluate the accuracy of dimensionality reduction methods with the following limitations: multi-dimensionality arises as a result of vector representation of text documents, dimensionality reduction is aimed at visualization on the plane. In numerous methods of dimension reduction, there is no separate class of approaches specifically for visualization. To measure the accuracy, an approach was chosen using marked-up data and quantifying the preservation of the markup while reducing the dimension. The author investigated 12 methods of reducing the dimension on two labeled data sets in Russian and English. Using the Silhouette Coefficient metric, the most accurate visualization method for text data was determined as UMAP with the Hellinger distance as the metric.

 Artículos similares

       
 
Kristina Mazur, Mischa Saleh and Mirko Hornung    
Early and rapid environmental assessment of newly developed aircraft concepts is eminent in today?s climate debate. This can shorten the decision-making process and thus accelerate the entry into service of climate-friendly technologies. A holistic appro... ver más
Revista: Aerospace

 
Maryan Rizinski, Andrej Jankov, Vignesh Sankaradas, Eugene Pinsky, Igor Mishkovski and Dimitar Trajanov    
The task of company classification is traditionally performed using established standards, such as the Global Industry Classification Standard (GICS). However, these approaches heavily rely on laborious manual efforts by domain experts, resulting in slow... ver más
Revista: Information

 
George Westergaard, Utku Erden, Omar Abdallah Mateo, Sullaiman Musah Lampo, Tahir Cetin Akinci and Oguzhan Topsakal    
Automated Machine Learning (AutoML) tools are revolutionizing the field of machine learning by significantly reducing the need for deep computer science expertise. Designed to make ML more accessible, they enable users to build high-performing models wit... ver más
Revista: Information

 
Hamed Taherdoost and Mitra Madanchian    
Blockchain technology has become a powerful disruptive force that upends established ideas in several industries. A fascinating point of convergence is that of blockchain technology and Business Process Management (BPM), where the distributed and immutab... ver más
Revista: Information

 
Carolina Bona-Sánchez, Heidi Salokangas and Kaisa Sorsa    
This study explores the complexities of cost behavior in the textile industry, conducting a comparative analysis between firms in the Nordic countries and Spain. Our main goal is to examine how distinct economic and corporate governance models impact the... ver más
Revista: Applied Sciences