Future Internet, Vol. 6, Issue 3 (2014)

ARCOMEM Crawling Architecture

Vassilis Plachouras    
Florent Carpentier    
Muhammad Faheem    
Julien Masanès    
Thomas Risse    
Pierre Senellart    
Patrick Siehndel and Yannis Stavrakas    

Abstract

The World Wide Web is the largest information repository available today. However, this information is very volatile and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limitations and to provide flexible, adaptive and intelligent content acquisition, relying on social media to create topical Web archives. In this article, we focus on ARCOMEM's crawling architecture. We introduce the overall architecture and we describe its modules, such as the online analysis module, which computes a priority for the Web pages to be crawled, and the Application-Aware Helper which takes into account the type of Web sites and applications to extract structure from crawled content. We also describe a large-scale distributed crawler that has been developed, as well as the modifications we have implemented to adapt Heritrix, an open source crawler, to the needs of the project. Our experimental results from real crawls show that ARCOMEM's crawling architecture is effective in acquiring focused information about a topic and leveraging the information from social media.
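
The idea of an online analysis module that assigns a priority to each Web page before it is crawled can be illustrated with a small priority-ordered crawl frontier. The sketch below is only an illustration under stated assumptions: the `Frontier` class, the `score` function, and the term-overlap scoring rule are hypothetical and are not taken from ARCOMEM or Heritrix, whose actual prioritization is not detailed in this abstract.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class Frontier:
    """Hypothetical crawl frontier: pop() returns the highest-priority URL."""
    _heap: list = field(default_factory=list)
    _seen: set = field(default_factory=set)
    _counter: itertools.count = field(default_factory=itertools.count)

    def add(self, url: str, priority: float) -> None:
        # Ignore URLs that were already queued once.
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated; the counter
        # keeps insertion order among URLs with equal priority.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def score(anchor_text: str, topic_terms: set) -> float:
    """Assumed scoring rule: fraction of topic terms found in the anchor text."""
    if not topic_terms:
        return 0.0
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


if __name__ == "__main__":
    topic = {"election", "debate"}
    frontier = Frontier()
    frontier.add("http://example.org/a", score("weather report", topic))
    frontier.add("http://example.org/b", score("Election debate live", topic))
    frontier.add("http://example.org/c", score("election results", topic))
    while len(frontier):
        print(frontier.pop())
```

With these assumed inputs, the page whose anchor text matches both topic terms is fetched first and the off-topic page last. A real focused crawler would replace `score` with richer signals, such as the social media evidence mentioned in the abstract.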
