RISS Search - Domestic Journal Article Details

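Korean Abstract (Abstract)

Topic models are text mining techniques that are actively applied across many domains to explore the themes latent in large-scale text data. Yet despite their wide adoption, studies comparing their performance remain scarce. For a comprehensive performance comparison of topic models, this study focuses on 1~N-gram topic representations in addition to unigrams. Most prior studies have used unigram topic representations, but 1~N-gram topic representations have the advantage of making topics easier to understand. However, it is difficult to compute topic coherence for 1~N-gram topic representations with existing general-purpose Python libraries. This study therefore first identifies the cause of this difficulty, then derives and implements a solution. Next, it models unigram and 1~3-gram topic representations on Korean online articles and KCI paper data using major traditional BOW-based and recent deep learning-based topic models, and compares their coherence and diversity. The results offer multifaceted implications for effective topic modeling: BERTopic and semi-supervised BERTopic generally outperform NMF and LDA in coherence and diversity for both unigrams and 1~3-grams, and, compared with LDA or NMF, the coherence of BERTopic and semi-supervised BERTopic for 1~3-gram topic representations tends to increase when the number of topics is larger, the window size is longer, and the given text is longer.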

Multilingual Abstract

Topic models are text mining techniques actively applied in various domains to explore the underlying themes in large-scale text data. Despite their popularity, however, few studies compare their performance. To provide a comprehensive comparison, this study focuses on 1~N-gram topic representations in addition to unigrams. While most existing studies use unigram topic representations, 1~N-gram topic representations make topics easier to interpret. However, it is difficult to calculate topic coherence for 1~N-gram topic representations using existing general-purpose Python libraries. We therefore first identify the causes of this difficulty, then propose and implement a solution. Next, we apply major traditional BOW-based and recent deep learning-based topic models to Korean online articles and KCI paper data, produce unigram and 1~3-gram topic representations, and compare their coherence and diversity. The results yield several insights for effective topic modeling. BERTopic and semi-supervised BERTopic generally outperform NMF and LDA in coherence and diversity, for both unigrams and 1~3-grams. In addition, relative to LDA or NMF, the coherence of BERTopic and semi-supervised BERTopic for 1~3-gram topic representations tends to increase with a larger number of topics, a longer window size, and longer input texts.
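The two metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' implementation, so everything below is an assumption made for illustration: diversity is taken as the share of unique terms among all topics' top terms, and coherence as mean NPMI over term pairs, with co-occurrence counted in sliding windows. To let multi-word terms such as "topic model" be counted at all, each window is expanded into its 1..max_n-grams before matching. The function names and parameters (`topic_diversity`, `npmi_coherence`, `window`, `max_n`) are hypothetical.

```python
import math
from itertools import combinations


def topic_diversity(topics):
    """Share of unique terms among the top terms of all topics (1.0 = no overlap)."""
    terms = [term for topic in topics for term in topic]
    return len(set(terms)) / len(terms)


def ngram_terms(tokens, max_n):
    """Set of all 1..max_n-grams in a token list, joined with single spaces."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def sliding_windows(tokens, size):
    """Sliding windows of `size` tokens; a short document is one window."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]


def npmi_coherence(topic, docs, window=10, max_n=3):
    """Mean NPMI over all term pairs of one topic (assumed definition).

    `topic` is a list of terms (unigrams or space-joined n-grams);
    `docs` is a list of tokenized documents (lists of strings).
    """
    # Expand every window into its n-gram set so multi-word terms can match.
    window_terms = [
        ngram_terms(w, max_n)
        for doc in docs if doc
        for w in sliding_windows(doc, window)
    ]
    total = len(window_terms)
    if total == 0:
        raise ValueError("docs must contain at least one non-empty document")

    def prob(*terms):
        # Fraction of windows that contain every given term.
        return sum(all(t in w for t in terms) for w in window_terms) / total

    scores = []
    for a, b in combinations(topic, 2):
        p_ab = prob(a, b)
        if p_ab == 0.0:
            scores.append(-1.0)  # never co-occur: NPMI lower bound
        elif p_ab == 1.0:
            scores.append(1.0)   # co-occur in every window: upper bound
        else:
            pmi = math.log(p_ab / (prob(a) * prob(b)))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores) if scores else float("nan")
```

Expanding windows into n-gram sets keeps the sketch short but is quadratic-ish in window size; a real evaluation over a large corpus would likely precompute term occurrences once instead.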

토픽 표현의 N-그램 변화에 따른 토픽 모델 평가: 응집도와 다양성을 중심으로 = Evaluation of Topic Models with regard to N-gram Changes in Topic Representations: Focusing on Coherence and Diversity
