Abstract

Many data analysis applications, such as data mining, web mining, and information retrieval systems, require various forms of data preparation. Most of these operate on the assumption that the data they work with is complete, but in practice this is rarely true. In data preparation, one takes the data in its raw form, removes as much noise, redundancy, and incompleteness as possible, and extracts the core for further processing. Indeed, data preparation is often a less glamorous but more critical step than the others in data analysis applications: even minor data quality problems can lead to wrong interpretations that deteriorate the overall effectiveness of techniques such as data mining. The input to any interpretation algorithm is assumed to follow a well-behaved distribution, containing no missing, inconsistent, or incorrect values. In real-world databases, however, information is often missing, incomplete, imprecise, or incorrect. For such cases, we introduce the novel idea of conceptual reconstruction, in which we create effective conceptual representations on which data mining algorithms can be directly applied. The attraction of conceptual reconstruction is that it uses the correlation structure of the data in order to express it in terms of concepts rather than the original dimensions. As a result, the reconstruction procedure estimates only those conceptual aspects of the data which can be mined from the incomplete data set, rather than forcing errors created by extrapolation. We demonstrate the effectiveness of the approach on a variety of real data sets. This paper describes the theory and implementation of a new PCA filter for the WEKA workbench, which estimates a complete dataset from an incomplete one.
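The central idea, estimating each record's coordinates along the principal "concepts" using only its observed dimensions, can be sketched as follows. This is a minimal NumPy illustration on synthetic data, not the actual WEKA filter: the data set, the masking rate, and all variable names are hypothetical, and the covariance estimate from available values is one of several reasonable choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data (hypothetical stand-in for a real data set).
n, d, k = 200, 5, 2
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latent @ mixing + 0.05 * rng.normal(size=(n, d))

# Knock out 20% of the entries to simulate an incomplete data set.
mask = rng.random(X.shape) < 0.2
X_incomplete = X.copy()
X_incomplete[mask] = np.nan

# Estimate mean and covariance from the available values only.
col_mean = np.nanmean(X_incomplete, axis=0)
centered = X_incomplete - col_mean
cov = np.ma.cov(np.ma.masked_invalid(centered), rowvar=False).data

# Principal components ("concepts") of the estimated covariance.
eigvals, eigvecs = np.linalg.eigh(cov)
components = eigvecs[:, ::-1][:, :k]  # top-k eigenvectors

# For each record, solve for its conceptual coordinates using only
# the observed dimensions (least squares on the available entries).
concepts = np.empty((n, k))
for i in range(n):
    obs = ~np.isnan(centered[i])
    A = components[obs, :]            # component rows for observed dims
    b = centered[i, obs]
    concepts[i], *_ = np.linalg.lstsq(A, b, rcond=None)

# The conceptual representation can now be fed to a mining
# algorithm directly, without extrapolating the missing entries.
print(concepts.shape)  # (200, 2)
```

Because each record contributes only its observed dimensions to the least-squares fit, no missing value is ever extrapolated; only the conceptual coordinates supported by the available data are estimated.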
