ARTICLE
TITLE

Development of a document classification method by using geodesic distance to calculate similarity of documents

Hung Vo-Trung    

Abstract

The Internet now gives people quick and convenient access to human knowledge through channels such as Web pages, social networks, digital libraries, and portals. However, because information is exchanged and updated so quickly, the volume of stored information (in the form of digital documents) is increasing rapidly. We therefore face challenges in representing, storing, sorting, and classifying documents.

In this paper, we present a new approach to text classification based on semi-supervised machine learning and the Support Vector Machine (SVM). The novelty of the study is that, instead of calculating the distance between vectors with Euclidean distance, we use geodesic distance. To do this, each text must first be expressed as an n-dimensional vector, so that it is represented by one point in the n-dimensional vector space. Geodesic distance is then used to calculate the distance from each point to its nearby points, and the points are connected into a graph. Classification is based on calculating the shortest path between vertices of the graph through a kernel function.

We conducted experiments on articles taken from Reuters on five topics: Business, Markets, World, Politics, and Technology. To evaluate the proposed method, we ran both the traditional SVM based on Euclidean distance and the proposed method based on geodesic distance on the same data set. The results showed that the proposed method achieves a better correct classification rate than the traditional SVM based on Euclidean distance (by 3.2% on average).
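The pipeline described in the abstract (points, neighbourhood graph, shortest paths, kernel) can be illustrated with a minimal sketch. The abstract does not give the paper's actual graph construction, kernel function, or semi-supervised step, so everything below is an assumption made for illustration: neighbours are chosen by Euclidean distance, each point is linked to its `k` nearest neighbours, shortest paths are found with Dijkstra's algorithm, and the kernel is a Gaussian of the geodesic distance. The names `knn_graph`, `geodesic_from`, `geodesic_kernel`, `k`, and `sigma` are hypothetical.

```python
import heapq
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_graph(points, k):
    """Link each point to its k nearest neighbours (assumed construction)."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d  # keep the graph undirected
    return graph


def geodesic_from(graph, source):
    """Dijkstra: shortest-path length from source to every vertex."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def geodesic_kernel(points, k=2, sigma=1.0):
    """Gaussian of geodesic distance (an assumed kernel, not the paper's)."""
    graph = knn_graph(points, k)
    n = len(points)
    kernel = []
    for i in range(n):
        d = geodesic_from(graph, i)
        # Unreachable vertices have infinite distance and kernel value 0.0.
        kernel.append(
            [math.exp(-d[j] ** 2 / (2 * sigma ** 2)) for j in range(n)]
        )
    return kernel


# Toy data: eight points on a U-shaped curve. The two ends of the U are
# close in a straight line but far apart along the curve.
points = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (3, 2), (3, 1), (3, 0)]
graph = knn_graph(points, k=2)
print(euclidean(points[0], points[7]))  # 3.0
print(geodesic_from(graph, 0)[7])       # 7.0
```

On this toy data the two ends of the U are 3 units apart in a straight line but 7 units apart along the graph, which is the kind of difference the geodesic approach is meant to capture. A matrix like the one returned by `geodesic_kernel` could be passed to an SVM implementation that accepts precomputed kernels. Note that a Gaussian of geodesic distance is not guaranteed to be positive semi-definite, so a real implementation might need a correction step.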

 Similar articles

       
 
Marhanum Che Mohd Salleh, Rizal Mohd Nor, Faizal Yusof and Md Amiruzzaman    
The aim of this research is to discuss the groundwork of building an Islamic Banking Document Screening Prototype based on a serverless architecture framework. This research first forms an algorithm for document matching based Vector Space Model (VCM) an...
Journal: Computers

 
Hongfeng Sang, Liyi Ma and Nan Ma    
A three-dimensional MOOC analysis framework was developed, focusing on platform design, organizational mechanisms, and course construction. This framework aims to investigate the current situation of big data MOOCs in the intelligent era, particularly fr...
Journal: Information

 
Tahira Niazi, Teerath Das, Ghufran Ahmed, Syed Muhammad Waqas, Sumra Khan, Suleman Khan, Ahmed Abdelaziz Abdelatif and Shaukat Wasi    
Code comments are considered an efficient way to document the functionality of a particular block of code. Code commenting is a common practice among developers to explain the purpose of the code in order to improve code comprehension and readability. Re...
Journal: Algorithms

 
Ahmed Dourhri, Mohamed Hanine and Hassan Ouahmane    
The growth of structured, semi-structured, and unstructured data produced by the new applications is a result of the development and expansion of social networks, the Internet of Things, web technology, mobile devices, and other technologies. However, as...
Journal: Information

 
Archana Tikayat Ray, Bjorn F. Cole, Olivia J. Pinon Fischer, Ryan T. White and Dimitri N. Mavris    
The system complexity that characterizes current systems warrants an integrated and comprehensive approach to system design and development. This need has brought about a paradigm shift towards Model-Based Systems Engineering (MBSE) approaches to system ...
Journal: Aerospace