Journal: Information, Vol. 13, Issue 12 (2022)
ARTICLE
TITLE

Incremental Entity Blocking over Heterogeneous Streaming Data

Tiago Brasileiro Araújo    
Kostas Stefanidis    
Carlos Eduardo Santos Pires    
Jyrki Nummenmaa and Thiago Pereira da Nóbrega    

Abstract

Web systems have become a valuable source of semi-structured and streaming data. In this sense, Entity Resolution (ER) has become a key solution for integrating multiple data sources or identifying similarities between data items, namely entities. To avoid the quadratic costs of the ER task and improve efficiency, blocking techniques are usually applied. Beyond the traditional challenges faced by ER and, consequently, by the blocking techniques, there are also challenges related to streaming data, incremental processing, and noisy data. To address them, we propose a schema-agnostic blocking technique capable of handling noisy and streaming data incrementally through a distributed computational infrastructure. To the best of our knowledge, there is a lack of blocking techniques that address these challenges simultaneously. This work proposes two strategies (attribute selection and top-n neighborhood entities) to minimize resource consumption and improve blocking efficiency. Moreover, this work presents a noise-tolerant algorithm, which minimizes the impact of noisy data (e.g., typos and misspellings) on blocking effectiveness. In our experimental evaluation, we use real-world pairs of data sources, including a case study that involves data from Twitter and Google News. The proposed technique achieves better results regarding effectiveness and efficiency compared to the state-of-the-art technique (metablocking). More precisely, the application of the two strategies over the proposed technique alone improves efficiency by 56%, on average.
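
To make the idea of incremental, schema-agnostic blocking more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes plain token blocking (every token found in any attribute value becomes a block key), two data sources labeled 0 and 1, and a simple top-n pruning that ranks candidates by the number of shared tokens. The class name `IncrementalTokenBlocker`, its methods, and the ranking rule are hypothetical choices made for illustration; noise tolerance, attribute selection, and distributed execution are not covered.

```python
import re
from collections import defaultdict


class IncrementalTokenBlocker:
    """Toy schema-agnostic token blocking over two sources (hypothetical sketch)."""

    def __init__(self, top_n=2):
        self.top_n = top_n
        # token -> {source id: set of entity ids seen so far}
        self.blocks = defaultdict(lambda: {0: set(), 1: set()})

    @staticmethod
    def tokens(entity):
        # Schema-agnostic: attribute names are ignored, only values are tokenized.
        text = " ".join(str(value) for value in entity.values())
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, source, entity_id, entity):
        """Index one arriving entity; return its top-n candidates from the other source."""
        shared = defaultdict(int)
        for token in self.tokens(entity):
            block = self.blocks[token]
            # Compare only against entities that share this block (token).
            for other_id in block[1 - source]:
                shared[other_id] += 1
            block[source].add(entity_id)
        # Assumed pruning rule: keep the n neighbors with the most shared tokens.
        ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
        return ranked[: self.top_n]


blocker = IncrementalTokenBlocker(top_n=1)
blocker.add(0, "a1", {"name": "Entity Blocking Survey", "year": 2020})
blocker.add(0, "a2", {"title": "Streaming Data Systems"})
candidates = blocker.add(1, "b1", {"headline": "survey of entity blocking"})
print(candidates)
```

Each arriving entity is compared only with entities from the other source that share at least one token with it, which is how blocking avoids comparing every pair, and the index is updated in place so earlier increments never need reprocessing.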

Similar articles

       
 
Sereysethy Touch and Jean-Noël Colin    
To proactively defend computer systems against cyber-attacks, a honeypot system, purposely designed to be prone to attacks, is commonly used to detect attacks, discover new vulnerabilities, exploits or malware before they actually do real damage to real sy...
Journal: Applied Sciences

 
Cosmin Trif, Dragos Paul Mihai, Anca Zanfirescu and George Mihai Nitulescu    
The fatty acid amide hydrolase (FAAH) is an enzyme responsible for the degradation of anandamide, an endocannabinoid. Pharmacologically blocking this target can lead to anxiolytic effects; therefore, new inhibitors can improve therapy in this field. In o...
Journal: AI

 
Yan Chen, Chunxiang Gao and Wuli Chu    
In order to prolong the service life of multistage axial compressors, it is increasingly important to study the influence of blade surface roughness on the compressor performance. In this paper, a five-stage axial compressor of a real aero-engine was sel...
Journal: Aerospace

 
Xiaoxue Shen, Ruili Li, Jie Du, Xianchenghao Jiang and Guoyu Qiu    
Reliable quantitative information regarding sediment sources is essential for target mitigation, particularly in settings with a large number of loose provenances caused by earth disasters. The lakes in the Jiuzhaigou World Natural Heritage Site (WNHS) a...
Journal: Water

 
Lalit Garg, Sally McClean, Brian Meenan, Maria Barton, Ken Fullerton, Sandra C. Buttigieg and Alexander Micallef    
The problem of hospital patients' delayed discharge or "bed blocking" has long been a challenge for healthcare managers and policymakers. It negatively affects the hospital performance metrics and has other severe consequences for the healthcare system, ...
Journal: Algorithms