This paper presents a Usenet retrieval system. Two new methods are used to reflect the specific characteristics of Usenet in the system: an improved term weighting method and a document re-ranking method that reflects domain-specific knowledge.
The term weighting method combines tf*idf with the icf (inverted category frequency). The icf, which reflects how a term is distributed across newsgroups, is applied to improve the accuracy of document retrieval. It can do so because it captures domain-specific knowledge, whereas tf*idf captures domain-independent knowledge. The approach can be applied to the retrieval of any documents that are classified into categories.
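As an illustration only, the combined weight can be sketched as follows. The abstract does not give the exact formulas, so the logarithmic definitions of idf and icf, the function name `combined_weights`, and the tuning parameters `alpha` and `beta` are all assumptions made for this sketch.

```python
import math
from collections import Counter


def combined_weights(docs, alpha=1.0, beta=1.0):
    """Return one {term: weight} dict per document.

    docs  -- list of (category, tokens) pairs; the category stands in
             for the newsgroup an article was posted to.
    alpha -- hypothetical weight applied to the idf component.
    beta  -- hypothetical weight applied to the icf component.
    """
    n_docs = len(docs)
    categories = {category for category, _ in docs}

    doc_freq = Counter()   # number of documents containing each term
    term_categories = {}   # set of categories in which each term occurs
    for category, tokens in docs:
        for term in set(tokens):
            doc_freq[term] += 1
            term_categories.setdefault(term, set()).add(category)

    weights = []
    for _, tokens in docs:
        term_freq = Counter(tokens)
        doc_weights = {}
        for term, tf in term_freq.items():
            # Assumed definitions: log-scaled inverse frequencies.
            idf = math.log(n_docs / doc_freq[term])
            icf = math.log(len(categories) / len(term_categories[term]))
            doc_weights[term] = tf * (alpha * idf + beta * icf)
        weights.append(doc_weights)
    return weights
```

Under these assumptions, a term that appears in only a few categories receives a higher weight than a term with the same document frequency that is spread across many categories. Setting `beta` to 0 reduces the weight to plain tf*idf.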
The document re-ranking method uses domain-specific knowledge such as newsthreads and duplicated documents. Through the TE (NewsThread Effect), semantic information can be obtained indirectly, without semantic analysis. Removing duplicated documents gives an opportunity to retrieve other relevant documents and also serves as a spam filter.
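The duplicate-removal step might look like the minimal sketch below. The abstract does not say how duplicates are detected, so this sketch assumes two articles are duplicates when their bodies match after case and whitespace normalization. The function name `drop_duplicates` is hypothetical.

```python
import hashlib


def drop_duplicates(ranked):
    """Keep only the highest-ranked copy of each article body.

    ranked -- list of (doc_id, body) pairs in descending score order.
    Returns the surviving doc_ids, still in rank order.
    """
    seen = set()
    kept = []
    for doc_id, body in ranked:
        # Assumed duplicate criterion: identical text once case and
        # runs of whitespace are normalized.
        normalized = " ".join(body.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(doc_id)
    return kept
```

Each dropped duplicate frees a position in the result list, so a different article can move up into it. The same pass also suppresses repeated copies of a spam message.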
Our experiment shows that adjusting the two weight values, that of the idf and that of the icf, influences retrieval efficiency. The method that combines tf*(idf+icf) with the domain-specific knowledge is 13.5% more efficient than the tf*idf method.