We describe a new feature selection method for text categorization systems that uses Topic Signature and co-occurrence words. A co-occurrence word is a pair of words that occur within a window in the same document. We classify documents with co-occurrence words instead of single words, because we hypothesize that a word pair carries a more specific meaning than a single word and therefore discriminates between categories better. Topic Signature is a feature selection method based on the log-likelihood ratio; it was previously applied to finding topic words in text summarization. To achieve high performance, we use TF-Topic Signature and give additional weight to features that occur within titles. We use a Naive Bayesian classifier for text classification.
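The two ingredients named above can be illustrated with a minimal sketch. It assumes unordered word pairs, a hypothetical window size, and the standard log-likelihood ratio statistic on a 2x2 contingency table; the function names and parameters are illustrative and are not taken from the proposed system. The TF weighting and title weighting are not shown, because their exact form is not given here.

```python
import math


def cooccurrence_pairs(tokens, window=3):
    """Return unordered word pairs whose members lie within `window` tokens.

    Assumption: pairs are unordered and a word is not paired with itself.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # Look only at the next (window - 1) tokens to the right.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                pairs.append(tuple(sorted((left, right))))
    return pairs


def log_likelihood_ratio(o11, o12, o21, o22):
    """Log-likelihood ratio statistic (-2 log lambda) for a 2x2 table.

    o11: occurrences of the feature in documents of the target category
    o12: occurrences of the feature in all other documents
    o21: occurrences of other features in the target category
    o22: occurrences of other features in all other documents
    """
    observed = ((o11, o12), (o21, o22))
    total = o11 + o12 + o21 + o22
    row_sums = (o11 + o12, o21 + o22)
    col_sums = (o11 + o21, o12 + o22)
    statistic = 0.0
    for r in range(2):
        for c in range(2):
            count = observed[r][c]
            if count > 0:  # a zero cell contributes nothing to the sum
                expected = row_sums[r] * col_sums[c] / total
                statistic += count * math.log(count / expected)
    return 2.0 * statistic


if __name__ == "__main__":
    print(cooccurrence_pairs(["oil", "price", "rise", "sharply"], window=2))
    print(log_likelihood_ratio(30, 1, 10, 59))
```

In such a sketch, features would be ranked by the statistic for each category and the highest-scoring pairs kept; a pair distributed evenly across categories scores near zero.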
We evaluate the proposed system on the Reuters-21578 data collection, a standard collection for evaluating English text categorization systems. Because the collection is standard, we can compare the proposed system objectively with previous systems. The experimental results show that the proposed system performs well compared with previous systems.
The proposed system has a weak point: generating features from co-occurrence words produces a large number of features. Our future work will focus on solving this problem. Nevertheless, the proposed method shows good potential, and we expect our results to contribute to future research.