S.S. Dijkstra
Restoration of Missing Data using a Human Adaptive Framework
The Cleansing Algorithm
Improving data quality is of the utmost importance for any data-driven company, as data quality is unmistakably tied to business analytics and processes. One way to improve data quality is to restore missing and incorrect data entries.
The goal of this research is to construct an algorithm that restores missing and incorrect data entries while making use of a human adaptive framework. The algorithm has been constructed in a modular fashion and consists of three main modules: Data Transformation, Data Structure Analysis and Model Selection. Data Transformation converts raw data into data types and forms that the other modules can use.
Data Structure Analysis has been designed to deal with correctly missing data and dichotomy in the target feature by making use of three clustering algorithms: DBSCAN, K-Means and Diffusion Maps. DBSCAN is used both to determine whether clustering is necessary and to initialise the K-Means algorithm. K-Means is used to cluster the one-dimensional target feature, and Diffusion Maps to cluster the two-dimensional input-target feature pairs. Data Structure Analysis has further been designed to perform feature selection through three filter methods: CorrCoef, FCBF and Treelet.
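As a rough illustration of how DBSCAN could both decide whether clustering is needed and seed K-Means on a one-dimensional target feature, consider the minimal sketch below. It is not the thesis implementation. The rule "cluster only if DBSCAN finds at least two dense groups", the function names, and the values of `eps` and `min_samples` are all assumptions made for this example.

```python
from statistics import mean


def dbscan_1d(values, eps, min_samples):
    """Plain DBSCAN on 1-D points; returns one label per point, -1 marks noise."""
    n = len(values)
    # Neighbourhoods include the point itself.
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_samples for nb in neighbours]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:  # only core points expand the cluster
                        stack.append(q)
        cluster += 1
    return labels


def kmeans_1d(values, centres, iterations=20):
    """Lloyd's algorithm in one dimension, started from the given centres."""
    centres = list(centres)
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda k: abs(v - centres[k]))
            groups[nearest].append(v)
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    return centres


def cluster_target(values, eps=1.0, min_samples=3):
    """Assumed rule: cluster only if DBSCAN finds at least two dense groups."""
    labels = dbscan_1d(values, eps, min_samples)
    found = sorted(set(labels) - {-1})
    if len(found) < 2:
        return None  # clustering deemed unnecessary
    # Seed K-Means with the mean of each DBSCAN group.
    seeds = [mean(v for v, l in zip(values, labels) if l == c) for c in found]
    return kmeans_1d(values, seeds)
```

Calling `cluster_target` on a target feature with two well-separated groups returns two centres, while a single dense group returns `None`, signalling that clustering can be skipped.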
Model Selection proposes a novel approach to selecting the best model from a candidate set through the optimisation of a conditional model ranking strategy based on the prior construction of theoretical testing. Our candidate set consisted of Expectation Maximisation, K-Means, Multi-Layer Perceptron, Nearest Neighbor, Random Forest, Linear Regression, Polynomial Regression and ElasticNet Regression.
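The conditional ranking strategy itself is not specified here, so the sketch below swaps in a much simpler stand-in to convey the general idea of picking a model from a candidate set: known entries are hidden, each candidate tries to restore them, and the candidate that restores the largest fraction within a tolerance wins. The three toy candidates, the hold-out scheme, the tolerance `tol` and all function names are assumptions for illustration only.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the known targets."""
    m = mean(ys)
    return lambda x: m


def fit_nearest_neighbour(xs, ys):
    """Candidate 2: copy the target of the closest known input."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_linear(xs, ys):
    """Candidate 3: ordinary least squares with a single input feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


CANDIDATES = {
    "mean": fit_mean,
    "nearest_neighbour": fit_nearest_neighbour,
    "linear": fit_linear,
}


def restored_fraction(model, xs, ys, tol):
    """Share of hidden entries restored to within `tol` of the true value."""
    hits = sum(abs(model(x) - y) <= tol for x, y in zip(xs, ys))
    return hits / len(xs)


def select_model(xs, ys, holdout_every=3, tol=0.5):
    """Hide every `holdout_every`-th entry, then rank candidates on restoring it."""
    rows = list(zip(xs, ys))
    train = [r for i, r in enumerate(rows) if i % holdout_every != 0]
    hidden = [r for i, r in enumerate(rows) if i % holdout_every == 0]
    train_x, train_y = zip(*train)
    hidden_x, hidden_y = zip(*hidden)
    scores = {
        name: restored_fraction(fit(train_x, train_y), hidden_x, hidden_y, tol)
        for name, fit in CANDIDATES.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

For data that follows a straight line, `select_model` ranks the linear candidate first. A fuller version would replace the toy candidates with the models listed above and the single hold-out split with the ranking strategy.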
In terms of restorability, it was shown that the optimal configuration of the Cleansing Algorithm for the restoration of missing data was obtained by not using clustering, using a custom alteration of the Treelet algorithm for feature selection, and making use of the model selection strategy. This not only led to the greatest restorability of 56.90% on Aegon data sets, which was an improvement of 44.83% when compared to not using the Cleansing Algorithm, but also to the reduction of computation time by over 400%. A more realistic restorability, given the presence of correctly missing data, was obtained with the same configuration combined with one-dimensional output clustering. This resulted in a restorability on Aegon data sets of 43.10%. As such, it was deemed possible to restore missing data on Aegon data sets.
With respect to the human adaptive framework, it was determined that the construction of the algorithm should be modular, in the sense that any alternative feature selection or clustering approach can be implemented with ease. Furthermore, the model selection module allows us to customise the theoretical testing and the choice of regression or classification models for the restoration of missing data. In doing so, the algorithm has laid the foundations for human adaptivity of the Cleansing Algorithm.