Infrastructures, Vol. 5, No. 7 (2020)

Handling Imbalanced Data in Road Crash Severity Prediction by Machine Learning Algorithms

Nicholas Fiorentini and Massimo Losa    

Abstract

Crash severity is a fundamental aspect of a crash event. Although machine learning algorithms for predicting crash severity have recently gained interest in the academic community, there is a marked tendency to neglect the fact that crash datasets are acutely imbalanced. Overlooking this fact generally leads to classifiers that are weak at predicting the minority class (crashes with higher severity). In this paper, to handle imbalanced accident datasets and provide a better prediction for the minority class, the random undersampling of the majority class (RUMC) technique is used. By employing an imbalanced training set and a RUMC-based balanced training set, we propose the calibration, validation, and evaluation of four crash severity predictive models: random tree, k-nearest neighbor, logistic regression, and random forest. Accuracy, true positive rate (recall), false positive rate, true negative rate, precision, F1-score, and the confusion matrix have been calculated to assess performance. Outcomes show that RUMC-based models make the classifiers more reliable at detecting fatal crashes and those causing injury. Indeed, in the imbalanced models, the true positive rate for predicting fatal crashes and those causing injury ranges from 0% (logistic regression) to 18.3% (k-nearest neighbor), while for the RUMC-based models it ranges from 52.5% (RUMC-based logistic regression) to 57.2% (RUMC-based k-nearest neighbor). Organizations and decision-makers could use RUMC and machine learning algorithms to predict the severity of a crash occurrence, manage the present, and plan the future of their works.
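
As an illustration of the two ideas in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes a binary target (1 = fatal or injury crash, 0 = other), assumes that "balanced" means reducing every class to the size of the rarest one, and uses made-up toy data. The function names `random_undersample_majority` and `binary_metrics` are hypothetical.

```python
import random
from collections import Counter


def random_undersample_majority(rows, labels, seed=0):
    """Randomly drop rows so each class keeps as many rows as the rarest class.

    Assumption: a 1:1 class ratio is the balancing target.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        kept = rng.sample(members, target)  # sampling without replacement
        balanced.extend((row, label) for row in kept)
    rng.shuffle(balanced)  # avoid class-ordered training data
    return [row for row, _ in balanced], [label for _, label in balanced]


def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts and the metrics listed in the abstract."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "confusion_matrix": {"tp": tp, "fn": fn, "fp": fp, "tn": tn},
        "accuracy": ratio(tp + tn, len(pairs)),
        "tpr": recall,
        "fpr": ratio(fp, fp + tn),
        "tnr": ratio(tn, fp + tn),
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy data (invented): 90 majority-class rows and 10 minority-class rows.
toy_rows = list(range(100))
toy_labels = [0] * 90 + [1] * 10

balanced_rows, balanced_labels = random_undersample_majority(toy_rows, toy_labels, seed=42)
print(Counter(balanced_labels))

# A classifier that always predicts the majority class looks accurate
# on the imbalanced data but never detects the minority class.
always_majority = [0] * len(toy_labels)
print(binary_metrics(toy_labels, always_majority))
```

On the toy data, the always-majority predictor reaches high accuracy with a true positive rate of zero, which is the kind of failure the abstract reports for the imbalanced logistic regression model. In a setup like the one described, undersampling would be applied to the training set only, so that evaluation still reflects the original class distribution.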

Similar articles

Ali Mirzaei, Hossein Bagheri and Iman Khosravi    
Crop classification using remote sensing data has emerged as a prominent research area in recent decades. Studies have demonstrated that fusing synthetic aperture radar (SAR) and optical images can significantly enhance the accuracy of classification. Ho...

 
Wenjin Hu, Yukun Chen, Lifang Wu, Ge Shi and Meng Jian    
Hamming space retrieval is a hot area of research in deep hashing because it is effective for large-scale image retrieval. Existing hashing algorithms have not fully used the absolute boundary to discriminate the data inside and outside the Hamming ball,...
Journal: Applied Sciences

 
Maya Hilda Lestari Louk and Bayu Adhi Tama    
Classifier ensembles have been utilized in the industrial cybersecurity sector for many years. However, their efficacy and reliability for intrusion detection systems remain questionable in current research, owing to the particularly imbalanced data issu...

 
Muhammad Adil, Nadeem Javaid, Umar Qasim, Ibrar Ullah, Muhammad Shafiq and Jin-Ghoo Choi    
The electrical losses in power systems are divided into non-technical losses (NTLs) and technical losses (TLs). NTL is more harmful than TL because it includes electricity theft, faulty meters and billing errors. It is one of the major concerns in the po...
Journal: Applied Sciences