Analyzing Indo-European Language Similarities Using Document Vectors

Samuel R. Schrader and Eren Gultepe

Abstract

The evaluation of similarities between natural languages often relies on prior knowledge of the languages being studied. We describe three methods for building phylogenetic trees and clustering languages without the use of language-specific information. The input to our methods is a set of document vectors trained on a corpus of parallel translations of the Bible into 22 Indo-European languages, representing 4 language families: Indo-Iranian, Slavic, Germanic, and Romance. The corpus contains 532,092 Bible verses in total: the same 24,186 verses translated into each of the 22 languages. The methods are (A) hierarchical clustering using distance between language vector centroids, (B) hierarchical clustering using a network-derived distance measure, and (C) Deep Embedded Clustering (DEC) of language vectors. We evaluate our methods against a ground-truth tree and the language families derived from that tree. All three methods achieve clustering F-scores above 0.9 on the Indo-Iranian and Slavic families; most confusion is between the Germanic and Romance families. The mean F-scores across all families are 0.864 (centroid clustering), 0.953 (network partitioning), and 0.763 (DEC). These results indicate that document vectors can capture and compare linguistic features of multilingual texts, and thus could help extend research on language similarity and other translation studies.
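To make method (A) more concrete, the following is a minimal sketch of hierarchical clustering on language centroids. It is not the authors' implementation: the abstract does not state the distance metric or linkage rule, so cosine distance and average linkage are assumptions here, and the function names (`centroid`, `cosine_distance`, `cluster_languages`) and the two-dimensional toy vectors are hypothetical placeholders for real document vectors.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity (assumed metric; not specified in the abstract)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def cluster_languages(doc_vectors, k):
    """Merge language centroids bottom-up until k clusters remain.

    doc_vectors maps a language name to its list of document vectors.
    Average linkage is an assumption made for this sketch.
    """
    cents = {lang: centroid(vs) for lang, vs in doc_vectors.items()}
    # Pairwise distances between language centroids.
    dist = {
        frozenset(pair): cosine_distance(cents[pair[0]], cents[pair[1]])
        for pair in combinations(cents, 2)
    }

    def linkage(a, b):
        # Mean centroid distance over all cross-cluster language pairs.
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [frozenset([lang]) for lang in cents]
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Hypothetical toy vectors, two "verses" per language.
toy = {
    "german": [[1.0, 0.1], [0.9, 0.2]],
    "dutch": [[0.95, 0.15], [1.0, 0.2]],
    "spanish": [[0.1, 1.0], [0.2, 0.9]],
    "italian": [[0.15, 0.95], [0.2, 1.0]],
}
families = cluster_languages(toy, k=2)
print(sorted(sorted(f) for f in families))
```

On the toy data the two recovered groups are German with Dutch and Spanish with Italian. With real data, each language would contribute one vector per verse, and recording the order of merges rather than stopping at `k` clusters would give a tree that could be compared against a reference phylogeny.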

Keywords

language families - language phylogeny - multilingual - document vectors - machine learning - deep learning - partitioning - clustering - community detection

Issue

Volume 10, Issue 4 (2023)

Subjects

Civil Engineering and Construction
Technology

DOI

https://doi.org/10.3390/informatics10040076
