Abstract
This article helps establish reliable baselines for document-level sentiment analysis in highly inflected languages such as Czech and Slovak. We revisit an earlier study representing the first comprehensive formulation of such baselines in Czech and show that some of its reported results need to be significantly revised. More specifically, we show that its online product review dataset contained more than 18% non-trivial duplicates, which incorrectly inflated its macro F1-measure results by more than 19 percentage points. We also establish that part-of-speech-related features have no damaging effect on machine learning algorithms (contrary to the claim made in the study) and rehabilitate the Chi-squared metric for feature selection as being on par with the best-performing metrics such as Information Gain. We demonstrate that in feature selection experiments with the Information Gain and Chi-squared metrics, the top 10% of ranked unigram and bigram features suffice for the best results on the online product and movie review datasets, while the top 5% of ranked unigram and bigram features are optimal for the Facebook dataset. Finally, we reiterate an important but often ignored warning by George Forman and Martin Scholz that different possible ways of averaging the F1-measure in cross-validation studies of highly unbalanced datasets can lead to results differing by more than 10 percentage points. This can invalidate comparisons of F1-measure results across studies if incompatible averaging methods are used.