Algorithms, Vol. 15, Issue 1 (2022)
ARTICLE
TITLE

Knowledge Distillation-Based Multilingual Code Retrieval

Wen Li, Junfei Xu and Qi Chen

Abstract

Semantic code retrieval is the task of retrieving relevant code based on natural language queries. Although it is related to other information retrieval tasks, it must bridge the gap between the language used in code (which is usually syntax-specific and logic-specific) and natural language, which is better suited to describing ambiguous concepts and ideas. Existing approaches study code retrieval from natural language for one specific programming language; in multilingual scenarios this is unwieldy and often requires a large corpus for each language. Using knowledge distillation from six existing monolingual Teacher Models to train one Student Model, MPLCS (Multi-Programming Language Code Search), this paper proposes a method that supports multi-programming-language code search tasks. MPLCS can incorporate multiple languages into one model with low corpus requirements. It can learn the commonality between different programming languages and improve recall accuracy for programming languages with small datasets. For Ruby, as used in this paper, MPLCS improved the MRR score by 20 to 25%. In addition, MPLCS can compensate for the low recall accuracy of monolingual models when they perform retrieval on other programming languages. In some cases, MPLCS's recall accuracy can even outperform that of monolingual models performing retrieval on their own language.
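
The two ideas named in the abstract, distilling several monolingual teachers into one student and scoring retrieval with MRR, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not state the loss function, so the mean-squared-error objective is an assumption, and the names `distillation_loss`, `mean_reciprocal_rank`, `teachers`, and `student` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[str], Vector]


def distillation_loss(
    samples: Sequence[Tuple[str, str]],
    teachers: Dict[str, Encoder],
    student: Encoder,
) -> float:
    """Mean squared error between student and teacher embeddings.

    Each sample is a (language, code) pair. The target embedding comes from
    the monolingual teacher for that language (assumed objective, not
    confirmed by the source).
    """
    total, count = 0.0, 0
    for language, code in samples:
        target = teachers[language](code)   # language-specific teacher
        predicted = student(code)           # single multilingual student
        total += sum((p - t) ** 2 for p, t in zip(predicted, target))
        count += len(target)
    return total / count


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """MRR over queries, given the 1-based rank of the correct snippet."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)
```

For example, if the correct snippet is ranked 1st, 2nd, and 4th for three queries, `mean_reciprocal_rank([1, 2, 4])` averages 1, 1/2, and 1/4. In training, the student's parameters would be updated to reduce `distillation_loss` across samples from all programming languages.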

Similar articles

Mikel Penagarikano, Amparo Varona, Germán Bordel and Luis Javier Rodriguez-Fuentes    
In this paper, a semisupervised speech data extraction method is presented and applied to create a new dataset designed for the development of fully bilingual Automatic Speech Recognition (ASR) systems for Basque and Spanish. The dataset is drawn from an...
Journal: Applied Sciences

 
Yong Fang, Fangzheng Zhou, Yijia Xu and Zhonglin Liu    
Code cloning is a common practice in software development, where developers reuse existing code to accelerate programming speed and enhance work efficiency. Existing clone-detection methods mainly focus on code clones within a single programming language...
Journal: Applied Sciences

 
Md. Mostafizer Rahman and Yutaka Watanobe    
In recent years, the rise of advanced artificial intelligence technologies has had a profound impact on many fields, including education and research. One such technology is ChatGPT, a powerful large language model developed by OpenAI. This technology of...
Journal: Applied Sciences

 
Sergiu Zaharia, Traian Rebedea and Stefan Trausan-Matu    
The research presented in the paper aims at increasing the capacity to identify security weaknesses in programming languages that are less supported by specialized security analysis tools, based on the knowledge gathered from securing the popular ones, f...
Journal: Applied Sciences

 
Aleksandr Romanov, Anna Kurtukova, Anastasia Fedotova and Alexander Shelupanov    
This article is part of a series aimed at determining the authorship of source codes. Analyzing binary code is a crucial aspect of cybersecurity, software development, and computer forensics, particularly in identifying malware authors. Any program is ma...
Journal: Information