JOURNAL
Future Internet

ARTICLE

TITLE

Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification

Panagiotis Skondras, Panagiotis Zervas and Giannis Tzimas

Abstract

In this article, we investigate synthetic resumes as a means of rapidly generating training data and assess their effectiveness for data augmentation, especially in categories with sparse samples. The widespread adoption of machine learning algorithms in natural language processing (NLP) has notably streamlined resume classification, saving time and cost for hiring organizations. However, the performance of these algorithms depends on the abundance of training data. While selecting the right model architecture is essential, it is equally crucial to ensure the availability of a robust, well-curated dataset. For many categories in the job market, data sparsity remains a challenge. To address it, we used the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetic resumes were cleaned, preprocessed and then used to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. Both models were evaluated on the multiclass resume classification task. When trained on an augmented dataset containing 60 percent real data (from the Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model achieved exceptional accuracy, while the FFNN, as expected, achieved lower accuracy. These findings highlight the value of augmenting real-world data with ChatGPT-generated synthetic resumes, especially when training data are limited. The suitability of the BERT model for such classification tasks further supports this conclusion.
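
The 60/40 augmentation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `mix_real_and_synthetic`, its parameters, the toy labels and the choice to fix the ratio over the whole dataset (rather than per category) are all assumptions made for illustration. The abstract does not say how the mix was actually constructed.

```python
import random
from collections import Counter


def mix_real_and_synthetic(real, synthetic, total, real_fraction=0.6, seed=0):
    """Sample a shuffled training set with the requested share of real examples.

    `real` and `synthetic` are lists of (text, label) pairs. Returns a list of
    (text, label, source) triples, where source is "real" or "synthetic".
    Hypothetical helper: the ratio is applied to the dataset as a whole.
    """
    if not 0.0 < real_fraction < 1.0:
        raise ValueError("real_fraction must be strictly between 0 and 1")
    n_real = round(total * real_fraction)
    n_synthetic = total - n_real
    if n_real > len(real) or n_synthetic > len(synthetic):
        raise ValueError("not enough examples to reach the requested mix")

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    mixed = [(text, label, "real") for text, label in rng.sample(real, n_real)]
    mixed += [
        (text, label, "synthetic")
        for text, label in rng.sample(synthetic, n_synthetic)
    ]
    rng.shuffle(mixed)  # avoid feeding all real examples before synthetic ones
    return mixed


# Toy stand-ins for scraped and generated resumes (illustrative only).
real_pool = [(f"real resume {i}", "nurse" if i % 2 else "welder") for i in range(12)]
synthetic_pool = [(f"synthetic resume {i}", "welder") for i in range(8)]

train = mix_real_and_synthetic(real_pool, synthetic_pool, total=10)
print(Counter(source for _, _, source in train))
```

Keeping the `source` tag on every example makes it possible to check the realized ratio and, later, to report accuracy separately on real and synthetic samples.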

Keywords

metadata extraction - resumes - CV - big data - multiclass classification - ChatGPT - large language models - deep learning - embeddings - labor market analysis

ISSUE

Volume: 15 Issue: 11 (2023)

SUBJECTS

INFRASTRUCTURE

SIMILAR JOURNALS

Big Data and Cognitive Computing
Infrastructures
Water

DOI

https://doi.org/10.3390/fi15110363

Similar articles

DP-CSM: Efficient Differentially Private Synthesis for Human Mobility Trajectory with Coresets and Staircase Mechanism

Xin Yao, Juan Yu, Jianmin Han, Jianfeng Lu, Hao Peng, Yijia Wu and Xiaoqian Cao

Generating differentially private synthetic human mobility trajectories from real trajectories is a commonly used approach for privacy-preserving trajectory publishing. However, existing synthetic trajectory generation methods suffer from the drawbacks o...

Journal: ISPRS International Journal of Geo-Information

Generating Synthetic Training Data for Supervised De-Identification of Electronic Health Records

Claudia Alessandra Libbi, Jan Trienes, Dolf Trieschnigg and Christin Seifert

A major hurdle in the development of natural language processing (NLP) methods for Electronic Health Records (EHRs) is the lack of large, annotated datasets. Privacy concerns prevent the distribution of EHRs, and the annotation of data is known to be cos...

Journal: Future Internet

A Data-Driven Framework for Walkability Measurement with Open Data: A Case Study of Triple Cities, New York

Chengbin Deng, Xiaoyu Dong, Huihai Wang, Weiying Lin, Hao Wen, John Frazier, Hung Chak Ho and Louisa Holmes

Walking is the most common, environment-friendly, and inexpensive type of physical activity. To perform in-depth walkability analysis, one option is to objectively evaluate different aspects of built environment related to walkability. In this study, we ...

Journal: ISPRS International Journal of Geo-Information

Fusion of Multi-Sensor-Derived Heights and OSM-Derived Building Footprints for Urban 3D Reconstruction

Hossein Bagheri, Michael Schmitt and Xiaoxiang Zhu

So-called prismatic 3D building models, following the level-of-detail (LOD) 1 of the OGC City Geography Markup Language (CityGML) standard, are usually generated automatically by combining building footprints with height values. Typically, high-resolutio...

Journal: ISPRS International Journal of Geo-Information

Traffic Sign Recognition based on Synthesised Training Data

Alexandros Stergiou, Grigorios Kalliatakis and Christos Chrysoulas

To deal with the richness in visual appearance variation found in real-world data, we propose to synthesise training data capturing these differences for traffic sign recognition. The use of synthetic training data, created from road traffic sign templat...

Journal: Big Data and Cognitive Computing
