Two or more sentences that convey similar meanings through different expressions are called paraphrases, and they are essential learning material for machines to better understand human language. Because the recognition of varied paraphrase expressions directly affects the performance of natural language processing (NLP) application systems, its importance is increasing. Improving the performance of such systems requires a high-quality corpus for training the model. However, publicly released Korean paraphrase corpora are scarce, and open-source paraphrase corpora are difficult to keep up to date with new paraphrase expressions. A further limitation is that a refinement process must be performed repeatedly until the final paraphrase sentence pairs are found.
Therefore, this paper proposes a new methodology, a keyphrase dataset for paraphrase extraction, that makes it easy to add varied paraphrase expressions and minimizes the refinement process. The keyphrase dataset combines two ideas: extracting paraphrases based on named entities, and the observation that sentences in a paraphrase relationship share the same or similar keyphrases. The dataset is organized as a hierarchy consisting of a first-classification named entity, a second-classification named entity, and a third-classification keyphrase. In this paper, article text was used as the source: the first- and second-classification named entities were selected in consideration of their semantic relationship, and TextRank, LDA, and Kr-WordRank were used to construct the third-classification keyphrases. Paraphrases were extracted by combining the first, second, and third classifications in the keyphrase dataset, and the extracted sentence pairs were collected to construct a paraphrase corpus. To validate the proposed keyphrase dataset methodology, a paraphrase evaluation was performed in which the similarity between sentences was calculated using a Doc2Vec model. The results confirmed that the paraphrase extraction method based on the keyphrase dataset was effective in finding sentence pairs with high semantic similarity.
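
The hierarchical structure and the extraction step described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the toy dataset, the English example sentences, and the function names `keyphrase_paths`, `candidate_pairs`, and `cosine_similarity` are all hypothetical. It also assumes that a sentence matches a path when it contains both the second-classification entity and the keyphrase, with the first classification kept only as a grouping label, and it substitutes a simple bag-of-words cosine similarity for the Doc2Vec model.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy keyphrase dataset (not taken from the paper):
# first-classification entity -> second-classification entity -> keyphrases.
KEYPHRASE_DATASET = {
    "company": {
        "Samsung": ["earnings", "smartphone"],
    },
}


def keyphrase_paths(dataset):
    """Yield every (first, second, keyphrase) combination in the hierarchy."""
    for first, seconds in dataset.items():
        for second, keyphrases in seconds.items():
            for keyphrase in keyphrases:
                yield first, second, keyphrase


def candidate_pairs(sentences, dataset):
    """Pair sentences that mention the same entity and keyphrase.

    Assumption: substring matching on the second-classification entity and
    the keyphrase; the first classification is carried along as a label.
    """
    pairs = []
    for first, second, keyphrase in keyphrase_paths(dataset):
        hits = [
            s for s in sentences
            if second.lower() in s.lower() and keyphrase.lower() in s.lower()
        ]
        for a, b in combinations(hits, 2):
            pairs.append(((first, second, keyphrase), a, b))
    return pairs


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity; a simple stand-in for Doc2Vec."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


sentences = [
    "Samsung reported record earnings this quarter",
    "Record quarterly earnings were announced by Samsung",
    "Samsung unveiled a new smartphone",
    "The weather was sunny today",
]

pairs = candidate_pairs(sentences, KEYPHRASE_DATASET)
for path, a, b in pairs:
    print(path, round(cosine_similarity(a, b), 3), "|", a, "|", b)
```

Because candidate pairs are generated only from sentences that share a path in the hierarchy, adding a new keyphrase or entity to the dataset is enough to start collecting pairs for it, which reflects the ease of extension that motivates the keyphrase dataset.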