JOURNAL
Informatics

ARTICLE

TITLE

Analyzing Indo-European Language Similarities Using Document Vectors

Samuel R. Schrader and Eren Gultepe

Abstract

The evaluation of similarities between natural languages often relies on prior knowledge of the languages being studied. We describe three methods for building phylogenetic trees and clustering languages without the use of language-specific information. The input to our methods is a set of document vectors trained on a corpus of parallel translations of the Bible into 22 Indo-European languages, representing 4 language families: Indo-Iranian, Slavic, Germanic, and Romance. This text corpus consists of 532,092 Bible verses: the same 24,186 verses translated into each of the 22 languages. The methods are (A) hierarchical clustering using the distance between language vector centroids, (B) hierarchical clustering using a network-derived distance measure, and (C) Deep Embedded Clustering (DEC) of language vectors. We evaluate our methods using a ground-truth tree and the language families derived from that tree. All three methods achieve clustering F-scores above 0.9 on the Indo-Iranian and Slavic families; most confusion is between the Germanic and Romance families. The mean F-scores across all families are 0.864 (centroid clustering), 0.953 (network partitioning), and 0.763 (DEC). This shows that document vectors can be used to capture and compare linguistic features of multilingual texts, and thus could help extend language similarity and other translation studies research.
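
Method (A) can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the cosine distance, the average-linkage rule, the three-dimensional toy vectors, and the placeholder names `lang_a` to `lang_d` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length document vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity (assumed metric; the abstract does not name one)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage(centroids, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns the final clusters and the ordered list of merges, which
    together describe a tree over the languages.
    """
    clusters = [frozenset([name]) for name in centroids]
    merges = []

    def cluster_distance(a, b):
        # Average pairwise distance between members (assumed linkage rule).
        total = sum(cosine_distance(centroids[x], centroids[y])
                    for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b)))
    return clusters, merges


# Hypothetical toy data: two document vectors per placeholder language.
doc_vectors = {
    "lang_a": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
    "lang_b": [[0.9, 0.0, 0.1], [1.0, 0.1, 0.1]],
    "lang_c": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
    "lang_d": [[0.1, 1.0, 0.0], [0.0, 0.9, 0.2]],
}

centroids = {lang: centroid(vecs) for lang, vecs in doc_vectors.items()}
clusters, merges = average_linkage(centroids, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

For a real corpus, a library routine such as `scipy.cluster.hierarchy.linkage` applied to the matrix of language centroids would be the more practical choice; the hand-written loop above only makes the merge order explicit.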

Keywords

language families - language phylogeny - multilingual - document vectors - machine learning - deep learning - partitioning - clustering - community detection

ISSUE

Volume: 10, Issue: 4 (2023)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY

SIMILAR JOURNALS

Applied Sciences
Information
Applied System Innovation

DOI

https://doi.org/10.3390/informatics10040076

Similar articles

Centralized Database Access: Transformer Framework and LLM/Chatbot Integration-Based Hybrid Model

Diana Bratić, Marko Šapina, Denis Jurečić and Jana Žiljak Gršić

This paper addresses the challenges associated with the centralized storage of educational materials in the context of a fragmented and disparate database. In response to the increasing demands of modern education, efficient and accessible retrieval of m...

Journal: Applied System Innovation

The Role of ChatGPT in Elevating Customer Experience and Efficiency in Automotive After-Sales Business Processes

Piotr Sliz

Purpose: The advancements in deep learning and AI technologies have led to the development of such language models, in 2022, as OpenAI's ChatGPT. The primary objective of this paper is to thoroughly examine the capabilities of ChatGPT within the realm of...

Journal: Applied System Innovation

Aiding ICD-10 Encoding of Clinical Health Records Using Improved Text Cosine Similarity and PLM-ICD

Hugo Silva, Vítor Duque, Mário Macedo and Mateus Mendes

The International Classification of Diseases, 10th edition (ICD-10), has been widely used for the classification of patient diagnostic information. This classification is usually performed by dedicated physicians with specific coding training, and it is ...

Journal: Algorithms

Using Generative AI to Improve the Performance and Interpretability of Rule-Based Diagnosis of Type 2 Diabetes Mellitus

Leon Kopitar, Iztok Fister, Jr. and Gregor Stiglic

Introduction: Type 2 diabetes mellitus is a major global health concern, but interpreting machine learning models for diagnosis remains challenging. This study investigates combining association rule mining with advanced natural language processing to im...

Journal: Information

A Survey of AI Techniques in IoT Applications with Use Case Investigations in the Smart Environmental Monitoring and Analytics in Real-Time IoT Platform

Yohanes Yohanie Fridelin Panduman, Nobuo Funabiki, Evianita Dewi Fajrianti, Shihao Fang and Sritrusta Sukaridhoto

In this paper, we have developed the SEMAR (Smart Environmental Monitoring and Analytics in Real-Time) IoT application server platform for fast deployments of IoT application systems. It provides various integration capabilities for the collection, displ...

Journal: Information
