Informatics, Vol. 10, Issue 4 (2023)
ARTICLE
TITLE

Analyzing Indo-European Language Similarities Using Document Vectors

Samuel R. Schrader and Eren Gultepe    

Abstract

The evaluation of similarities between natural languages often relies on prior knowledge of the languages being studied. We describe three methods for building phylogenetic trees and clustering languages without using language-specific information. The input to our methods is a set of document vectors trained on a corpus of parallel translations of the Bible into 22 Indo-European languages, representing 4 language families: Indo-Iranian, Slavic, Germanic, and Romance. The corpus consists of 532,092 Bible verses in total: the same 24,186 verses translated into each of the 22 languages. The methods are (A) hierarchical clustering using the distance between language vector centroids, (B) hierarchical clustering using a network-derived distance measure, and (C) Deep Embedded Clustering (DEC) of language vectors. We evaluate our methods against a ground-truth tree and the language families derived from that tree. All three achieve clustering F-scores above 0.9 on the Indo-Iranian and Slavic families; most confusion is between the Germanic and Romance families. The mean F-scores across all families are 0.864 (centroid clustering), 0.953 (network partitioning), and 0.763 (DEC). This shows that document vectors can be used to capture and compare linguistic features of multilingual texts, and thus could help extend research on language similarity and other translation studies.
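As an illustration of method (A), the sketch below averages the document vectors of each language into a centroid and then applies hierarchical clustering to the pairwise centroid distances. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the distance metric or linkage rule, so cosine distance and average linkage are assumed here, the function names are hypothetical, and the toy vectors are synthetic stand-ins for trained document vectors.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def language_centroids(vectors, languages):
    """Average the document vectors of each language into one centroid."""
    names = sorted(set(languages))
    lang_array = np.asarray(languages)
    centroids = np.vstack(
        [vectors[lang_array == name].mean(axis=0) for name in names]
    )
    return names, centroids


def cluster_languages(vectors, languages, n_clusters,
                      metric="cosine", method="average"):
    """Hierarchically cluster languages by the distance between centroids.

    The metric and linkage method are assumptions made for this sketch.
    """
    names, centroids = language_centroids(vectors, languages)
    distances = pdist(centroids, metric=metric)   # condensed distance matrix
    tree = linkage(distances, method=method)      # (n_languages - 1) x 4 merges
    cluster_ids = fcluster(tree, t=n_clusters, criterion="maxclust")
    return dict(zip(names, cluster_ids.tolist())), tree


# Synthetic toy data: four hypothetical languages in two hypothetical families.
rng = np.random.default_rng(0)
family_dirs = {
    "fam_a": np.array([1.0, 0.0, 0.0]),
    "fam_b": np.array([0.0, 1.0, 0.0]),
}
toy_languages = {"lang1": "fam_a", "lang2": "fam_a",
                 "lang3": "fam_b", "lang4": "fam_b"}

blocks, labels = [], []
for lang, fam in toy_languages.items():
    # 50 noisy "verse vectors" scattered around the family direction
    blocks.append(family_dirs[fam] + 0.05 * rng.standard_normal((50, 3)))
    labels.extend([lang] * 50)
toy_vectors = np.vstack(blocks)

assignment, tree = cluster_languages(toy_vectors, labels, n_clusters=2)
print(assignment)
```

The `tree` array returned by `linkage` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the resulting tree, while cutting it with `fcluster` gives flat clusters of the kind that could be scored against known language families.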
