Building a text corpus for automatic biographical facts extraction from Russian texts

A.V. Glazkova

Abstract

Tasks in computational linguistics and machine learning related to natural language processing (NLP) often require text corpora. A text corpus is a specially prepared collection of documents with markup that carries morphological, syntactic, semantic, or other information. Data obtained from text corpora is used in supervised machine learning to build classifiers for natural-language texts, as well as in other NLP and computational linguistics tasks. The kind of information represented in a corpus, and the type of texts it contains, are determined by the aims and tasks of the particular study. This article presents a tool for building a corpus of biographical texts in Russian. Building a text corpus involves two stages: collecting the texts and marking them up. In the first stage, we collected texts suitable for markup: the corpus includes biographical articles freely available on Wikipedia. For this purpose, we developed an automatic parser based on open Python libraries. The second stage is the semantic markup of sentences and the selection of biographical facts, which was carried out in a semi-automatic mode. The article describes the features of the corpus-building process, the taxonomy of biographical facts used in our work, the software implementation for text collection and markup, the representation of texts in the corpus, and the characteristics of the prepared corpus.
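The two stages described in the abstract can be illustrated with a minimal sketch. It is not the parser or the taxonomy from the article: the abstract does not name the libraries used or list the fact categories, so this example relies only on the Python standard library, and the names `ParagraphExtractor`, `split_sentences`, `FACT_PATTERNS`, and `build_records`, as well as the three labels `birth`, `education`, and `death`, are hypothetical. The sentence splitter is deliberately naive, and the keyword rules only propose candidate labels that a human annotator would then confirm or correct, which is one plausible reading of "semi-automatic".

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Stage 1 (sketch): collect the visible text of <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            text = "".join(self._chunks).strip()
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        # Text inside nested inline tags (links, emphasis) is kept as well.
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    return parser.paragraphs


def split_sentences(text):
    # Naive rule: split after ., ! or ? when the next word is capitalized.
    return re.split(r"(?<=[.!?])\s+(?=[А-ЯЁA-Z])", text.strip())


# Hypothetical fact labels and trigger words; NOT the article's taxonomy.
FACT_PATTERNS = {
    "birth": re.compile(r"\bродил(ся|ась)\b", re.IGNORECASE),
    "education": re.compile(r"\b(окончила?|учил(ся|ась))\b", re.IGNORECASE),
    "death": re.compile(r"\b(умер(ла)?|скончал(ся|ась))\b", re.IGNORECASE),
}


def pre_tag(sentence):
    """Stage 2 (sketch): propose candidate labels for manual review."""
    return [label for label, rx in FACT_PATTERNS.items() if rx.search(sentence)]


def build_records(html):
    """Return one record per sentence with its candidate fact labels."""
    records = []
    for paragraph in extract_paragraphs(html):
        for sentence in split_sentences(paragraph):
            records.append({"sentence": sentence, "candidates": pre_tag(sentence)})
    return records
```

Calling `build_records` on the HTML of a biographical page yields a list of sentence records; sentences with an empty `candidates` list would be treated as carrying no biographical fact until an annotator decides otherwise.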

PAGES

pp. 97 - 103

ISSUE

Volume: 7 Issue: 1 Part: 0 (2019)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY


