Informatics, Vol. 10, Issue 4 (2023)
ARTICLE
TITLE

Analyzing Indo-European Language Similarities Using Document Vectors

Samuel R. Schrader and Eren Gultepe    

Abstract

The evaluation of similarities between natural languages often relies on prior knowledge of the languages being studied. We describe three methods for building phylogenetic trees and clustering languages without the use of language-specific information. The input to our methods is a set of document vectors trained on a corpus of parallel translations of the Bible into 22 Indo-European languages, representing 4 language families: Indo-Iranian, Slavic, Germanic, and Romance. This text corpus consists of a set of 532,092 Bible verses, with 24,186 identical verses translated into each language. The methods are (A) hierarchical clustering using distance between language vector centroids, (B) hierarchical clustering using a network-derived distance measure, and (C) Deep Embedded Clustering (DEC) of language vectors. We evaluate our methods using a ground-truth tree and language families derived from said tree. All three achieve clustering F-scores above 0.9 on the Indo-Iranian and Slavic families; most confusion is between the Germanic and Romance families. The mean F-scores across all families are 0.864 (centroid clustering), 0.953 (network partitioning), and 0.763 (DEC). This shows that document vectors can be used to capture and compare linguistic features of multilingual texts, and thus could help extend language similarity and other translation studies research.
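As a rough illustration of method (A), the sketch below averages each language's verse vectors into a centroid and then merges the closest languages agglomeratively. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not name the distance metric or linkage rule, so cosine distance and average linkage are assumed here, and the three-dimensional `toy_vectors` are hypothetical stand-ins for trained document vectors.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity (assumed metric; not stated in the abstract)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def average_linkage(centroids, k):
    """Merge the two closest clusters repeatedly until only k remain."""
    clusters = [[name] for name in centroids]

    def dist(c1, c2):
        # Average pairwise distance between members (assumed linkage rule).
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(cosine_distance(centroids[a], centroids[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return [sorted(c) for c in clusters]


# Hypothetical toy data: two "verse vectors" per language.
toy_vectors = {
    "spanish": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "italian": [[0.85, 0.15, 0.05], [0.9, 0.2, 0.0]],
    "german": [[0.1, 0.9, 0.1], [0.2, 0.8, 0.0]],
    "dutch": [[0.15, 0.85, 0.05], [0.1, 0.9, 0.2]],
}

centroids = {lang: centroid(vecs) for lang, vecs in toy_vectors.items()}
families = average_linkage(centroids, k=2)
print(families)
```

On this toy input the two recovered clusters pair the Romance languages and the Germanic languages. Recording the order of merges instead of stopping at `k` clusters would give the tree structure that a phylogenetic comparison needs.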

Similar articles

       
 
Felipe Coelho de Abreu Pinna, Victor Takashi Hayashi, João Carlos Néto, Rosangela de Fátima Pereira Marquesone, Maísa Cristina Duarte, Rodrigo Suzuki Okada and Wilson Vicente Ruggiero    
Complex and long interactions (e.g., a change of topic during a conversation) justify the use of dialog systems to develop task-oriented chatbots and intelligent virtual assistants. The development of dialog systems requires considerable effort and takes...
Journal: Applied Sciences

 
Leon Kopitar, Iztok Fister, Jr. and Gregor Stiglic    
Introduction: Type 2 diabetes mellitus is a major global health concern, but interpreting machine learning models for diagnosis remains challenging. This study investigates combining association rule mining with advanced natural language processing to im...
Journal: Information

 
Diana Bratic, Marko Šapina, Denis Jurecic and Jana Žiljak Gršic    
This paper addresses the challenges associated with the centralized storage of educational materials in the context of a fragmented and disparate database. In response to the increasing demands of modern education, efficient and accessible retrieval of m...

 
Piotr Sliz    
Purpose: The advancements in deep learning and AI technologies have led to the development of such language models, in 2022, as OpenAI's ChatGPT. The primary objective of this paper is to thoroughly examine the capabilities of ChatGPT within the realm of...

 
Giuliana Favara, Martina Barchitta, Andrea Maugeri, Roberta Magnano San Lio and Antonella Agodi    
Background: Natural language processing, such as ChatGPT, demonstrates growing potential across numerous research scenarios, also raising interest in its applications in public health and epidemiology. Here, we applied a bibliometric analysis for a syste...
Journal: Informatics