

Hybrid Clustering Approach for Concept Generation


K. Thammi Reddy, M. Shashi, L. Pratap Reddy


Vol. 7  No. 4  pp. 62-69


Information retrieval is a major research area because of the huge volume of information accumulating in digital form. Many information retrieval techniques rest on the observation that the terms present in a document, together with their frequencies of occurrence, signify the semantics of the document. Recent approaches to finding the documents relevant to a context represent each document in a Latent Semantic Indexing (LSI) model as a document-term vector holding the term weight of every index term in that document. Because the number of index terms is enormous, this leads to a high-dimensionality problem. The dimensionality can be reduced based on the observation that groups of terms associated with related concepts tend to occur together in a document, or to be absent together, depending on whether the document is relevant to that concept. Such a group of terms is identified as a Concept and can be viewed as a single dimension in a rough-set-based information retrieval system. In this paper we present a hybrid clustering approach for forming equivalence classes of terms associated with related concepts. It uses the outcome of hierarchical clustering to provide seed points for the Incremental K-means algorithm. Owing to the sparsity of the term vectors, the cosine similarity estimate is found to be less effective for term clustering. Another promising proximity estimate commonly used in information retrieval is the Euclidean distance, but it is biased towards changes in term frequencies in larger documents when the term weights are represented by term frequency-inverse document frequency (tf-idf) estimates. We therefore propose a new term weight estimate, term probability-inverse document frequency (tp-idf), for representing a term as a vector before clustering the terms.


Hierarchical clustering, Partitional clustering, Text mining, Dimensionality reduction, Proximity estimate.
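
The hybrid scheme summarized in the abstract (hierarchical clustering supplying seed points for K-means over term vectors) can be illustrated with a minimal sketch. Several parts are assumptions, since the abstract does not give the formulas: tp-idf is taken here to be (term count / document length) × log(N / df), the hierarchical stage uses centroid linkage, and a plain batch K-means stands in for the paper's Incremental K-means. All function names and the toy counts are hypothetical.

```python
import math

def tp_idf(term_doc_counts):
    """term_doc_counts[t][d] is the raw count of term t in document d.
    ASSUMPTION: term probability = count / document length, idf = log(N / df)."""
    n_docs = len(term_doc_counts[0])
    doc_len = [sum(row[d] for row in term_doc_counts) for d in range(n_docs)]
    vectors = []
    for row in term_doc_counts:
        df = sum(1 for c in row if c > 0)          # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0
        vectors.append([(c / doc_len[d]) * idf if doc_len[d] else 0.0
                        for d, c in enumerate(row)])
    return vectors

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def hierarchical_seeds(vectors, k):
    """Agglomerative clustering (centroid linkage, an assumption) merged down
    to k clusters; the k cluster centroids are returned as seed points."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]    # merge the closest pair
        del clusters[j]
    return [centroid(c) for c in clusters]

def kmeans(vectors, seeds, max_iter=100):
    """Batch K-means started from the given seeds (stand-in for Incremental K-means)."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: euclidean(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:                   # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers

# Toy term-by-document counts (hypothetical): two terms per underlying concept.
counts = [
    [3, 2, 0, 0],   # "cluster"
    [2, 3, 0, 0],   # "kmeans"
    [0, 0, 4, 1],   # "retrieval"
    [0, 0, 2, 3],   # "query"
]
vectors = tp_idf(counts)
labels, centers = kmeans(vectors, hierarchical_seeds(vectors, k=2))
```

In this toy input the first two terms and the last two terms end up in separate clusters, each cluster playing the role of one Concept dimension. Seeding from the hierarchical stage makes the K-means result deterministic, which is one plausible motivation for combining the two methods.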