Future Internet, Vol. 14, No. 10 (2022)

Towards Reliable Baselines for Document-Level Sentiment Analysis in the Czech and Slovak Languages

Ján Mojžiš
Peter Krammer
Marcel Kvassay
Lenka Skovajsová and Ladislav Hluchý

Abstract

This article helps establish reliable baselines for document-level sentiment analysis in highly inflected languages such as Czech and Slovak. We revisit an earlier study, the first comprehensive formulation of such baselines for Czech, and show that some of its reported results need significant revision. Specifically, we show that its online product review dataset contained more than 18% non-trivial duplicates, which inflated its macro F1-measure results by more than 19 percentage points. We also establish that part-of-speech-related features have no damaging effect on machine learning algorithms (contrary to the claim made in that study) and rehabilitate the Chi-squared metric for feature selection as being on par with the best-performing metrics such as Information Gain. We demonstrate that in feature selection experiments with the Information Gain and Chi-squared metrics, the top 10% of ranked unigram and bigram features suffice for the best results on online product and movie reviews, while the top 5% of ranked unigram and bigram features are optimal for the Facebook dataset. Finally, we reiterate an important but often ignored warning by George Forman and Martin Scholz: in cross-validation studies of highly unbalanced datasets, different ways of averaging the F1-measure can lead to results differing by more than 10 percentage points. This can invalidate comparisons of F1-measure results across studies that average F1 in incompatible ways.
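
The warning about averaging F1 can be illustrated with a minimal sketch. It compares three common ways of aggregating F1 over cross-validation folds: averaging the per-fold F1 values, computing F1 from fold-averaged precision and recall, and computing F1 once from counts pooled over all folds. The per-fold counts below are hypothetical and chosen only to mimic a rare class; they are not taken from the article's datasets. The sketch covers a single class, whereas the article reports macro F1 over classes, and the convention of scoring an undefined precision, recall, or F1 as 0.0 is an assumption.

```python
from statistics import mean


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when the fold has no positive examples (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical (tp, fp, fn) counts for one rare class in each of five folds.
folds = [(3, 1, 0), (0, 1, 2), (2, 0, 1), (0, 0, 3), (1, 2, 1)]

# 1) Mean of the per-fold F1 values.
f1_avg = mean(f1_from_counts(tp, fp, fn) for tp, fp, fn in folds)

# 2) F1 computed from fold-averaged precision and fold-averaged recall.
p_avg = mean(precision(tp, fp) for tp, fp, _ in folds)
r_avg = mean(recall(tp, fn) for tp, _, fn in folds)
f1_pr_re = 2 * p_avg * r_avg / (p_avg + r_avg) if (p_avg + r_avg) else 0.0

# 3) F1 computed once from counts pooled over all folds.
tp_sum = sum(tp for tp, _, _ in folds)
fp_sum = sum(fp for _, fp, _ in folds)
fn_sum = sum(fn for _, _, fn in folds)
f1_pooled = f1_from_counts(tp_sum, fp_sum, fn_sum)

print(f"mean of per-fold F1:      {f1_avg:.3f}")
print(f"F1 of averaged P and R:   {f1_pr_re:.3f}")
print(f"F1 of pooled counts:      {f1_pooled:.3f}")
```

With these made-up counts the three aggregates disagree noticeably, because folds with few or no true positives pull the per-fold averages down while contributing little to the pooled counts. When comparing F1 across studies, it therefore helps to check which of these aggregation schemes each study used.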
