Author : M Mintu
Date of Publication : 7th October 2015
Abstract: Text mining is the process of deriving high-quality information from text. In many text mining applications, side information is available along with the text documents. This side information can carry a large amount of information that improves clustering. However, using it can be ineffective when some of it is erroneous: incorporating it into the mining process is then risky, because it may either degrade the quality of the data collection used for mining or add noise to the process. The mining process therefore needs a principled and efficient way to use side information so that its advantages are retained. In addition to existing kinds of side information, such as links in the document, user-access behavior from web logs, and metadata, this paper proposes a method to mine text data using TF-IDF (term frequency-inverse document frequency), a numerical statistic intended to reflect how important a word is to a document in a corpus. We design a distance-based clustering algorithm in a vector space in order to create an efficient clustering approach using TF-IDF. We then show how this approach can be extended to the classification problem.
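As an illustration of the ideas named in the abstract, the following is a minimal Python sketch of TF-IDF weighting followed by distance-based clustering in a vector space. It is not the paper's algorithm, which the abstract does not specify. The function names (tf_idf_vectors, cosine_distance, cluster_documents), the unsmoothed TF-IDF variant, the choice of cosine distance, and the k-means-style update with caller-supplied seed documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for whitespace-tokenised documents.

    Assumed variant: tf = term count / document length,
    idf = ln(N / document frequency), with no smoothing.
    """
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenised for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append([
            (counts[term] / total) * math.log(n_docs / df[term]) if total else 0.0
            for term in vocab
        ])
    return vocab, vectors


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def cluster_documents(vectors, seeds, iterations=10):
    """k-means-style clustering under cosine distance (an assumed scheme).

    `seeds` lists the indices of the documents used as initial centroids.
    Returns one cluster label per document.
    """
    centroids = [list(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: cosine_distance(vec, centroids[c]))
            for vec in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    sample = [
        "cat dog pet",
        "dog pet animal",
        "stock market trade",
        "market trade price",
    ]
    _, sample_vectors = tf_idf_vectors(sample)
    print(cluster_documents(sample_vectors, seeds=[0, 2]))
```

In this sketch a term that occurs in every document receives an IDF of zero and so contributes nothing to the distance, which matches the intent of TF-IDF to down-weight uninformative words. The resulting cluster labels could serve as features or training groups for a classifier, in the spirit of the extension the abstract mentions.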
Reference :