Abstract
The increasing prevalence of marine pollution over the past few decades has motivated research into ways of mitigating it. Conventional water quality assessment requires continuous monitoring of water and sediments at remote locations, followed by labour-intensive laboratory tests to determine the degree of pollution. We propose an automated water quality assessment framework in which a machine learning model infers water quality and the level of pollution from collected water and sediment samples. Because sampling locations are sparse, the number of water and sediment samples is limited and the dataset is incomplete. We therefore conduct an extensive investigation of how various data imputation methods perform on water and sediment datasets with different missing data rates, and use the best-performing method to fill in the missing data. Each sample is then labelled with one of four pollution levels based on existing guidelines, and a classification model is trained to learn the relationship between the measurements and the assigned level. The model's predictions are compared against the actual labels to evaluate its accuracy. Finally, we offer recommendations for improvement based on the results of the model-building stage. Empirically, our best model achieves an accuracy of 75% after accounting for 57% missing data. These results indicate that the model can assist in automated water quality screening based on possibly incomplete real-world data.
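The imputation comparison described above can be illustrated with a minimal sketch. It is not the procedure used in this work: the data are synthetic, the two candidate methods (column mean and column median) are stand-ins chosen for brevity, and the helper names `mask_values`, `impute`, and `rmse_on_masked` are hypothetical. The idea is to hide known values at a given missing rate, impute them, and score each method by its error on the hidden entries.

```python
import random
import statistics

def mask_values(rows, rate, rng):
    """Return a copy of rows with roughly `rate` of the entries set to None."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def impute(rows, fill):
    """Fill None entries column by column using `fill` (e.g. mean or median)."""
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column lost every value (an arbitrary choice).
        fills.append(fill(observed) if observed else 0.0)
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def rmse_on_masked(truth, masked, imputed):
    """Root-mean-square error computed only over the entries that were masked."""
    errs = [(t - i) ** 2
            for t_row, m_row, i_row in zip(truth, masked, imputed)
            for t, m, i in zip(t_row, m_row, i_row) if m is None]
    return (sum(errs) / len(errs)) ** 0.5 if errs else 0.0

rng = random.Random(0)
# Synthetic stand-in for measurements; NOT real water or sediment data.
truth = [[rng.gauss(10, 2), rng.lognormvariate(0, 1), rng.uniform(0, 5)]
         for _ in range(200)]

rates = (0.1, 0.3, 0.57)
methods = (("mean", statistics.mean), ("median", statistics.median))

results = {}
for rate in rates:
    masked = mask_values(truth, rate, rng)
    for name, fill in methods:
        results[(rate, name)] = rmse_on_masked(truth, masked, impute(masked, fill))

# Report the lower-error method at each missing rate.
for rate in rates:
    best = min((name for name, _ in methods), key=lambda n: results[(rate, n)])
    print(f"missing rate {rate:.0%}: best = {best}")
```

Scoring only the masked entries keeps the comparison fair, since observed entries are copied through unchanged by every method. In practice the candidate set would include stronger imputers, and the selected method would then feed the four-level classifier.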