Web Scraping Scientific Repositories for Augmented Relevant Literature Search Using CRISP-DM

Hossam El-Din Hassanien

Abstract

Scientific web repositories are central online locations where academic papers are stored and maintained. Because the information and metadata in these repositories are unstructured or semi-structured, literature analysis for scholarly writing becomes a challenge. Applying CRISP-DM offers a way to address this challenge by formulating an augmented process for relevant literature search. However, almost no repository offers a straightforward method for extracting metadata for the preliminary data processing that the CRISP-DM process requires. Additionally, most repositories do not follow open access standards. Up to the time this paper was published, the topic of augmented relevant literature search had seen methodological progress only; the underlying methods could not be applied on a larger scale because of data access constraints on open access repositories. The aim of this paper is to propose CRISP-DM as an augmented research methodology, with a focus on web scraping as part of the data processing step. To substantiate the proposed methodology, a role-play case study is conducted. The work aims to alleviate these restrictions and to encourage wider adoption of the augmented analysis process for relevant literature search within the research community.

Keywords

web scraping - web crawling - CRISP-DM - text mining - relevant literature search - research methodology

ISSUE

Volume: 2 Part: 4 (2019)

SUBJECTS

ENGINEERING AND CIVIL CONSTRUCTION
TECHNOLOGY
