RISS Search - Thesis Details

Multilingual Abstract

Two or more sentences that convey similar meanings through different linguistic expressions are called paraphrases, and they are an essential learning resource for machines to better understand human language. Because the ability to recognize diverse paraphrase expressions is directly tied to the performance of natural language processing (NLP) applications, the importance of paraphrases is growing. Improving the performance of such applications requires a high-quality corpus for training the model. However, publicly available Korean paraphrase corpora are scarce, and open paraphrase corpora are difficult to keep up to date with new paraphrase expressions. A further limitation is that refinement must be applied repeatedly until the final paraphrase sentence pairs are found.
Therefore, this paper proposes a new methodology for paraphrase extraction, called the keyphrase dataset, that makes it easy to add diverse paraphrase expressions and minimizes the multi-stage refinement process. The keyphrase dataset combines two ideas: named-entity-based paraphrase extraction, and the assumption that sentences in a paraphrase relationship share the same or similar keyphrases. It is represented as a hierarchy consisting of a first-level named entity classification, a second-level named entity classification, and a third-level keyphrase classification. In this paper, working with news articles, articles were first selected by named entity; the first- and second-level named entity classifications were then chosen in consideration of semantic relationships, and TextRank, LDA, and Kr-WordRank were used to construct the third-level keyphrase classification. Paraphrases were extracted by combining the first-, second-, and third-level classifications in the keyphrase dataset, and the extracted sentence pairs were collected to build a paraphrase corpus. To validate the proposed keyphrase dataset methodology, a paraphrase verification step computed the similarity between sentences using a Doc2Vec model. The results confirmed that paraphrase extraction based on the keyphrase dataset was effective at finding sentence pairs with high semantic similarity.
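The three-level hierarchy and the combination step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the entity names, keyphrases, sentences, and the `extract_candidate_pairs` helper are all hypothetical, and plain substring matching stands in for whatever matching procedure the author used (Korean text would normally require morphological analysis first).

```python
from itertools import combinations

# Hypothetical toy keyphrase dataset (all names invented for illustration):
# first-level named entity -> second-level named entity -> third-level keyphrases.
KEYPHRASE_DATASET = {
    "Samsung Electronics": {
        "Galaxy": ["launch", "price"],
    },
}

# Hypothetical article sentences.
SENTENCES = [
    "Samsung Electronics announced the launch of the new Galaxy phone.",
    "The new Galaxy launch was confirmed by Samsung Electronics today.",
    "Samsung Electronics reported quarterly earnings.",
    "The Galaxy price was revealed by Samsung Electronics.",
]


def extract_candidate_pairs(dataset, sentences):
    """Pair up sentences that share one (level 1, level 2, keyphrase) combination.

    Matching is naive substring search, which is an assumption of this sketch.
    """
    pairs = []
    for first, seconds in dataset.items():
        for second, keyphrases in seconds.items():
            for keyphrase in keyphrases:
                matched = [
                    s for s in sentences
                    if first in s and second in s and keyphrase in s
                ]
                # Every pair of matching sentences is a paraphrase candidate.
                pairs.extend(combinations(matched, 2))
    return pairs


candidate_pairs = extract_candidate_pairs(KEYPHRASE_DATASET, SENTENCES)
for left, right in candidate_pairs:
    print(left, "<->", right)
```

In this toy data only the two "launch" sentences share all three levels, so they form the single candidate pair. Adding a new paraphrase expression amounts to adding a keyphrase under the relevant entities, which is the extensibility the abstract emphasizes.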


Korean Abstract

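Two or more sentences that convey similar meanings using different linguistic expressions are called paraphrases, and they must be built up as a learning resource if machines are to understand human language better. Because the recognition of diverse paraphrase expressions is directly tied to the performance of natural language processing applications, their importance continues to grow. Improving the performance of such applications requires a high-quality corpus for training the model. However, Korean paraphrase corpora are very scarce, and publicly released paraphrase corpora are difficult to keep updated with information on new paraphrase expressions. There is also the limitation that refinement must be carried out repeatedly until the final paraphrase sentence pairs are found.
This paper proposes a new methodology, a keyphrase dataset for paraphrase extraction, that makes it easy to add diverse paraphrase expressions and minimizes the multi-stage refinement process. The keyphrase dataset combines named-entity-based paraphrase extraction with the idea that sentences in a paraphrase relationship share similar keyphrases. It is represented as a hierarchy consisting of a first-level named entity classification, a second-level named entity classification, and a third-level keyphrase classification. Working with news articles, this paper first selected articles by named entity, then chose the first- and second-level named entity classifications in consideration of semantic relationships, and used TextRank, LDA, and Kr-WordRank to construct the third-level keyphrase classification. Paraphrases were extracted by combining the first-, second-, and third-level classifications in the keyphrase dataset, and the extracted sentence pairs were collected to build a paraphrase resource. To validate the proposed keyphrase dataset methodology, a paraphrase verification step computed the similarity between sentences using a Doc2Vec model. The results confirmed that the paraphrase extraction method based on the keyphrase dataset was effective at finding sentence pairs with high semantic similarity.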
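The verification step, which scores each extracted pair by sentence similarity, can be sketched as follows. The thesis uses a Doc2Vec model to obtain sentence vectors. To keep this example self-contained it substitutes a bag-of-words count vector, which is an assumption of the sketch and not the author's method. The `embed` and `verify_pairs` helpers and the 0.5 threshold are likewise hypothetical.

```python
import math
from collections import Counter


def embed(sentence):
    """Stand-in embedding: a bag-of-words count vector (NOT Doc2Vec)."""
    return Counter(sentence.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify_pairs(pairs, threshold=0.5):
    """Keep the pairs whose similarity reaches the (hypothetical) threshold."""
    verified = []
    for left, right in pairs:
        similarity = cosine_similarity(embed(left), embed(right))
        if similarity >= threshold:
            verified.append((left, right, similarity))
    return verified


# Hypothetical candidate pairs: one near-paraphrase and one unrelated pair.
pairs = [
    ("the company raised the price", "the company raised the price today"),
    ("the company raised the price", "heavy rain is expected tomorrow"),
]
for left, right, similarity in verify_pairs(pairs):
    print(f"{similarity:.2f}  {left} <-> {right}")
```

Swapping `embed` for vectors inferred by a trained Doc2Vec model would leave the cosine scoring and filtering logic unchanged. Only the source of the sentence vectors differs.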


A Study on Paraphrase Extraction and Verification Based on a Keyphrase Dataset
