Applied System Innovation, Vol. 4, Issue 1 (2021)
ARTICLE
TITLE

SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features

Mimi Mukherjee and Matloob Khushi    

Abstract

Real-world datasets are often heavily skewed, with some classes significantly outnumbered by others. In these situations, machine learning algorithms fail to achieve substantial efficacy when predicting the underrepresented instances. To solve this problem, many variations of the synthetic minority oversampling technique (SMOTE) have been proposed to balance datasets with continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based oversampling technique available to balance the data. In this paper, we present a novel minority oversampling method, SMOTE-ENC (SMOTE-Encoded Nominal and Continuous), in which nominal features are encoded as numeric values, and the difference between two such numeric values reflects the amount of change of association with the minority class. Our experiments show that classification models using SMOTE-ENC offer better prediction than models using SMOTE-NC when the dataset has a substantial number of nominal features and also when there is some association between the categorical features and the target class. Additionally, our proposed method addresses one of the major limitations of the SMOTE-NC algorithm: SMOTE-NC can be applied only to mixed datasets containing both continuous and nominal features and cannot function if all the features of the dataset are nominal. Our method has been generalized to apply to both mixed datasets and nominal-only datasets.
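The encoding idea in the abstract can be illustrated with a minimal sketch. It assumes one plausible scoring rule: each category is mapped to (observed - expected) / expected, where "observed" is the number of minority-class rows carrying that category and "expected" is the number that would appear if the category had no association with the class. The function name `encode_nominal`, the toy data, and this exact formula are illustrative assumptions; the abstract does not state the formula, so this should not be read as the authors' implementation.

```python
from collections import Counter


def encode_nominal(values, labels, minority_label):
    """Encode each category by its association with the minority class.

    Hypothetical scoring rule (an assumption, not taken from the abstract):
    score = (observed - expected) / expected, where `expected` is the
    minority count a category would have if it were independent of the class.
    """
    n = len(values)
    minority_total = sum(1 for y in labels if y == minority_label)
    if n == 0 or minority_total == 0:
        raise ValueError("need at least one minority-class row")
    minority_rate = minority_total / n

    totals = Counter(values)
    minority_counts = Counter(
        v for v, y in zip(values, labels) if y == minority_label
    )

    encoding = {}
    for category, total in totals.items():
        expected = total * minority_rate
        observed = minority_counts[category]
        encoding[category] = (observed - expected) / expected
    return encoding


# Toy data: 10 rows, 2 of them minority (label 1), both in category "a".
values = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
encoding = encode_nominal(values, labels, minority_label=1)
print(encoding)
```

Under this rule, a category that is over-represented in the minority class gets a positive score and one that is under-represented gets a negative score, so the numeric distance between two categories tracks how differently they associate with the minority class. Once nominal columns are numeric in this way, a distance-based oversampler can in principle treat them alongside continuous columns, or on their own when no continuous columns exist.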

 Similar articles

       
 
Xiaodong Cui, Zhuofan He, Yangtao Xue, Keke Tang, Peican Zhu and Jing Han    
Underwater Acoustic Target Recognition (UATR) plays a crucial role in underwater detection devices. However, due to the difficulty and high cost of collecting data in the underwater environment, UATR still faces the problem of small datasets. Few-shot le...

 
Bahaa Yamany, Mahmoud Said Elsayed, Anca D. Jurcut, Nashwa Abdelbaki and Marianne A. Azer    
Ransomware is a type of malicious software that encrypts a victim's files and demands payment in exchange for the decryption key. It is a rapidly growing and evolving threat that has caused significant damage and disruption to individuals and organizatio...
Journal: Information

 
Tomasz Walczyna and Zbigniew Piotrowski    
The proliferation of "Deep fake" technologies, particularly those facilitating face-swapping in images or videos, poses significant challenges and opportunities in digital media manipulation. Despite considerable advancements, existing methodologies ofte...
Journal: Applied Sciences

 
Saima Bhatti, Asif Ali Shaikh, Asif Mansoor and Murtaza Hussain    
Machinery components undergo wear and tear over time due to regular usage, necessitating the establishment of a robust prognosis framework to enhance machinery health and avert catastrophic failures. This study focuses on the collection and analysis of v...
Journal: Applied Sciences

 
Fang Gui, Jiaoyun Yang, Yiming Tang, Hongtu Chen and Ning An    
The life stories of older adults encapsulate an array of personal experiences that reflect their care needs. However, due to inherent fuzzy features, fragmented natures, repetition, and redundancies, the practical application of the life story approach p...
Journal: Applied Sciences