Information, Vol. 5, Issue 4 (2014)
ARTICLE
TITLE

Deep Web Search Interface Identification: A Semi-Supervised Ensemble Approach

Hong Wang, Qingsong Xu and Lifeng Zhou

Abstract

To surface the Deep Web, one crucial task is to predict whether a given web page has a search interface (a searchable HyperText Markup Language (HTML) form) or not. Previous studies have focused on supervised classification with labeled examples. However, labeled data are scarce and hard to obtain, and producing them requires tedious manual work, while unlabeled HTML forms are abundant and easy to obtain. In this research, we consider the plausibility of using both labeled and unlabeled data to train better models that identify search interfaces more effectively. We present a semi-supervised co-training ensemble learning approach that uses both neural networks and decision trees to address the search interface identification problem. We show that the proposed model outperforms previous methods that use only labeled data. We also show that adding unlabeled data improves the effectiveness of the proposed model.
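
The general idea of co-training two different learners on labeled and unlabeled forms can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a single-view variant in which each learner pseudo-labels the unlabeled examples it is most confident about and hands them to the other learner. A depth-1 decision tree and a single sigmoid unit stand in for the decision trees and neural networks, and the three form features (number of text inputs, presence of a "search" keyword, number of password fields) and all data values are hypothetical.

```python
import math

def train_stump(X, y):
    """Depth-1 decision tree; returns a function giving P(label = 1)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] < t]
            right = [lab for row, lab in zip(X, y) if row[f] >= t]
            # Laplace-smoothed share of positives in each leaf.
            p_left = (sum(left) + 1) / (len(left) + 2)
            p_right = (sum(right) + 1) / (len(right) + 2)
            errors = sum((p_left >= 0.5) != bool(lab) for lab in left)
            errors += sum((p_right >= 0.5) != bool(lab) for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, p_left, p_right)
    _, f, t, p_left, p_right = best
    return lambda row: p_left if row[f] < t else p_right

def train_neuron(X, y, epochs=200, lr=0.5):
    """Single sigmoid unit trained by SGD; returns a function giving P(label = 1)."""
    w, b = [0.0] * len(X[0]), 0.0

    def prob(row):
        z = sum(wi * xi for wi, xi in zip(w, row)) + b
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() in range
        return 1.0 / (1.0 + math.exp(-z))

    for _ in range(epochs):
        for row, lab in zip(X, y):
            g = prob(row) - lab  # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, row)]
            b -= lr * g
    return prob

def co_train(X_lab, y_lab, X_unlab, rounds=3, k=2):
    """Each learner pseudo-labels its k most confident pool items for the other."""
    trainers = [train_stump, train_neuron]
    sets = [(list(X_lab), list(y_lab)), (list(X_lab), list(y_lab))]
    pool = list(X_unlab)
    for _ in range(rounds):
        if not pool:
            break
        models = [tr(X, y) for tr, (X, y) in zip(trainers, sets)]
        picked = set()
        for i, model in enumerate(models):
            ranked = sorted(range(len(pool)),
                            key=lambda j: abs(model(pool[j]) - 0.5),
                            reverse=True)[:k]
            X_other, y_other = sets[1 - i]
            for j in ranked:
                X_other.append(pool[j])
                y_other.append(int(model(pool[j]) >= 0.5))
                picked.add(j)
        pool = [row for j, row in enumerate(pool) if j not in picked]
    models = [tr(X, y) for tr, (X, y) in zip(trainers, sets)]
    # Ensemble: average the two probability estimates.
    return lambda row: int(sum(m(row) for m in models) / len(models) >= 0.5)

# Hypothetical features: [text inputs, has "search" keyword, password fields]
X_lab = [[1, 1, 0], [1, 0, 1], [2, 1, 0], [2, 0, 1]]
y_lab = [1, 0, 1, 0]  # 1 = search interface, 0 = other form
X_unlab = [[1, 1, 0], [3, 1, 0], [1, 0, 1], [2, 0, 2], [1, 1, 0], [2, 0, 1]]
classify = co_train(X_lab, y_lab, X_unlab)
```

Calling `classify([1, 1, 0])` returns the ensemble's label for a new form. Averaging probabilities rather than counting votes avoids ties between the two learners; a real system would use full decision trees, multi-layer networks, and features extracted from actual HTML forms.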
