JOURNAL
Algorithms

ARTICLE

TITLE

On the Development of Descriptor-Based Machine Learning Models for Thermodynamic Properties: Part 2 - Applicability Domain and Outliers

Cindy Trinh

Silvia Lasala

Olivier Herbinet and Dimitrios Meimaroglou

Abstract

This article investigates the applicability domain (AD) of machine learning (ML) models trained on high-dimensional data, for the prediction of the ideal gas enthalpy of formation and entropy of molecules via descriptors. The AD is crucial as it describes the space of chemical characteristics in which the model can make predictions with a given reliability. This work studies the AD definition of an ML model throughout its development procedure: during data preprocessing, model construction and model deployment. Three AD definition methods, commonly used for outlier detection in high-dimensional problems, are compared: isolation forest (iForest), random forest prediction confidence (RF confidence) and k-nearest neighbors in the 2D projection of descriptor space obtained via t-distributed stochastic neighbor embedding (tSNE2D/kNN). These methods compute an anomaly score that can be used instead of the distance metrics of classical low-dimension AD definition methods, the latter being generally unsuitable for high-dimensional problems. Typically, a molecule is considered to lie within the AD if its distance from the training domain (in low-dimensional problems) or its anomaly score (in high-dimensional problems) is below a given threshold. During data preprocessing, the three AD definition methods are used to identify outlier molecules and the effect of their removal is investigated. A more significant improvement of model performance is observed when outliers identified with RF confidence are removed (e.g., for a removal of 30% of outliers, the MAE (Mean Absolute Error) of the test dataset is divided by 2.5, 1.6 and 1.1 for RF confidence, iForest and tSNE2D/kNN, respectively). While these three methods identify X-outliers, the effect of other types of outliers, namely Model-outliers and y-outliers, is also investigated. In particular, the elimination of X-outliers followed by that of Model-outliers enables us to divide MAE and RMSE (Root Mean Square Error) by 2 and 3, respectively, while reducing overfitting. The elimination of y-outliers does not display a significant effect on the model performance. During model construction and deployment, the AD serves to verify the position of the test data and of different categories of molecules with respect to the training data and associate this position with their prediction accuracy. For the data that are found to be close to the training data, according to RF confidence, and display high prediction errors, tSNE 2D representations are deployed to identify the possible sources of these errors (e.g., representation of the chemical information in the training data).
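
The thresholding rule described in the abstract (a molecule lies within the AD when its anomaly score is below a given threshold) can be illustrated with a minimal sketch. It assumes that 2D coordinates are already available, for instance from a t-SNE projection, and uses the mean distance to the k nearest training points as the anomaly score. The function names, the toy coordinates, the value of k and the threshold are all illustrative assumptions and do not come from the article; the MAE and RMSE helpers follow the standard definitions of these metrics.

```python
import math

def knn_anomaly_score(point, training_points, k=2):
    """Mean Euclidean distance from `point` to its k nearest training points."""
    distances = sorted(math.dist(point, t) for t in training_points)
    return sum(distances[:k]) / k

def in_applicability_domain(point, training_points, threshold, k=2):
    """True if the anomaly score of `point` is below the chosen threshold."""
    return knn_anomaly_score(point, training_points, k) < threshold

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical 2D coordinates of training molecules (e.g. after projection).
training_2d = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Two hypothetical query molecules: one near the training data, one far away.
near_point = (0.5, 0.5)
far_point = (5.0, 5.0)

threshold = 1.0  # illustrative value; in practice it would be tuned
near_inside = in_applicability_domain(near_point, training_2d, threshold)
far_inside = in_applicability_domain(far_point, training_2d, threshold)

# Toy predictions: the third one carries a large error.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 6.0]
error_mae = mae(y_true, y_pred)
error_rmse = rmse(y_true, y_pred)
```

In this toy setting `near_point` falls inside the AD and `far_point` falls outside it. Because RMSE squares the errors, the single large error raises it more than it raises MAE.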

Keywords

machine learning - QSPR/QSAR - high-dimensional data - descriptors - thermodynamic properties - applicability domain - outlier detection

ISSUE

Volume: 16 Issue: 12 (2023)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY

DOI

https://doi.org/10.3390/a16120573
