Multilingual Abstract

Dimensionality reduction is one of the methods for handling big data in text mining. When reducing dimensionality, we should consider the density of the data, which has a significant influence on the performance of sentence classification. Higher-dimensional data require more computation, which can lead to high computational cost and overfitting in the model. A dimension reduction process is therefore necessary to improve model performance. Diverse methods have been proposed, ranging from merely reducing noise in the data, such as misspellings or informal text, to incorporating semantic and syntactic information.
In addition, how text features are represented and selected affects the performance of the classifier in sentence classification, which is one of the fields of Natural Language Processing. The common goal of dimension reduction is to find a latent space that is representative of the raw data in the observation space. Existing methods use various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition to these algorithms, word embeddings, which learn low-dimensional vector space representations of words and can capture semantic and syntactic information from data, are also used. To improve performance, recent studies have suggested methods that modify the word dictionary according to the positive and negative scores of pre-defined words.
The basic idea of this study is that similar words have similar vector representations. Once the feature selection algorithm identifies words that are not important, we assume that words similar to them also have no impact on sentence classification. This study proposes two ways to achieve more accurate classification, both of which conduct selective word elimination under specific rules and construct word embeddings based on Word2Vec. To select words of low importance from the text, we use the information gain algorithm to measure importance and cosine similarity to search for similar words. First, we eliminate words that have comparatively low information gain values from the raw text and build the word embedding. Second, we additionally select words that are similar to the words with low information gain values and build the word embedding. Finally, the filtered text and word embeddings are applied to two deep learning models: a Convolutional Neural Network and an Attention-Based Bidirectional LSTM.
This study uses customer reviews of Kindle on Amazon.com, IMDB, and Yelp as datasets and classifies each dataset using the deep learning models. Reviews that received more than five helpful votes and whose ratio of helpful votes was over 70% were classified as helpful reviews. Yelp, however, shows only the number of helpful votes, so we extracted 100,000 reviews by random sampling from the 750,000 reviews that received more than five helpful votes. Minimal preprocessing, such as removing numbers and special characters from the text, was applied to each dataset. To evaluate the proposed methods, we compared their performance with that of Word2Vec and GloVe word embeddings that used all the words.
We showed that one of the proposed methods outperforms the embeddings that use all the words. Removing unimportant words improves performance; however, removing too many words lowers it. Future research should consider diverse preprocessing approaches and an in-depth analysis of word co-occurrence for measuring similarity among words. Also, we applied the proposed method only with Word2Vec. Other embedding methods such as GloVe, fastText, and ELMo could be combined with the proposed methods, which would make it possible to identify effective combinations of word embedding methods and elimination methods.
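The two elimination variants described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: information gain is computed from document-level word presence, and the function names, the `vectors` dictionary (a stand-in for trained Word2Vec vectors), and both thresholds are hypothetical, since the abstract does not state the actual rules.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (bits) of a collection of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """Return {word: IG}, using presence/absence of each word per document."""
    n = len(docs)
    class_counts = Counter(labels)
    base = entropy(class_counts.values())
    present = {}  # word -> Counter of labels over documents containing it
    for tokens, y in zip(docs, labels):
        for w in set(tokens):
            present.setdefault(w, Counter())[y] += 1
    gains = {}
    for w, with_w in present.items():
        n_with = sum(with_w.values())
        without_w = [class_counts[c] - with_w[c] for c in class_counts]
        cond = (n_with / n) * entropy(with_w.values()) \
            + ((n - n_with) / n) * entropy(without_w)
        gains[w] = base - cond
    return gains


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def words_to_remove(gains, vectors, ig_threshold, sim_threshold=None):
    """Variant 1 (sim_threshold=None): words with low information gain.
    Variant 2: additionally, words whose vector is close to a low-IG word.
    Both thresholds are assumptions made for this sketch."""
    low = {w for w, g in gains.items() if g < ig_threshold}
    if sim_threshold is None:
        return low
    anchors = [vectors[w] for w in low if w in vectors]
    extra = {w for w, v in vectors.items()
             if w not in low and any(cosine(v, a) >= sim_threshold for a in anchors)}
    return low | extra


def filter_docs(docs, removed):
    """Drop the selected words from every tokenized document."""
    return [[t for t in tokens if t not in removed] for tokens in docs]
```

In this sketch, the filtered documents returned by `filter_docs` would then be used to train the word embedding and the classifier. The pairwise similarity search is quadratic in vocabulary size, which is acceptable for illustration but would need an index or a library nearest-neighbor query at realistic scale.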
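The labeling and sampling steps in this paragraph could look something like the sketch below. The `Review` record, the function names, and the regular expression are assumptions made for illustration. The abstract leaves the exact boundary conditions open ("more than five" versus "at least five", "over 70%" versus "at least 70%"), so the thresholds are parameters, and the inclusive comparisons used here are only one reading.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class Review:
    text: str
    helpful_votes: int
    total_votes: int


def is_helpful(review, min_helpful=5, min_ratio=0.7):
    """Label a review as helpful from its vote counts.
    Inclusive thresholds are an assumption; the source is ambiguous."""
    if review.total_votes == 0:
        return False
    ratio = review.helpful_votes / review.total_votes
    return review.helpful_votes >= min_helpful and ratio >= min_ratio


def sample_reviews(reviews, k, seed=0):
    """Uniform random sample without replacement, reproducible via seed."""
    rng = random.Random(seed)
    return rng.sample(reviews, min(k, len(reviews)))


def clean_text(text):
    """Minimal preprocessing: drop digits and special characters,
    then collapse runs of whitespace."""
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    return " ".join(letters_only.split())
```

For a source such as Yelp, which exposes only the helpful-vote count, the ratio test cannot be applied, so a filter on `helpful_votes` alone followed by `sample_reviews` would correspond to the sampling step described above.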

Korean Abstract

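In sentence classification, which determines whether text data belongs to a particular category, how the features of a sentence are represented and which features are selected greatly affect classifier performance. The goal of feature selection is to find a way to describe the data well even after the dimensionality is reduced. Various methods have been proposed. Selecting features with algorithms such as Fisher Score or Information Gain, and reducing dimensionality by representing words as vectors learned with the Word2Vec model, which carries the semantic and syntactic information of the context, have been actively studied. Modifying word embeddings according to the positive and negative scores of pre-defined words has also been attempted.
This study proposes a way to improve sentence classification accuracy by performing selective word elimination and then applying embeddings. There are two approaches: one removes words with low information gain values from the text data and applies word embedding; the other additionally selects neighboring words that have high cosine similarity to the words with low information gain, removes them from the text data as well, and reconstructs the word embedding.
The data used to carry out the proposed approach were customer reviews of 'Kindle' products on Amazon.com, movie reviews from IMDB, and user reviews from Yelp. For the Amazon.com review data, a review was judged helpful if it had at least five helpful votes and helpful votes made up at least 70% of all votes. For Yelp, 100,000 reviews were randomly sampled from about 750,000 reviews with at least five helpful votes. The deep learning models used for training were a CNN and an Attention-Based Bidirectional LSTM, and the word embeddings were Word2Vec and GloVe.
We tested for statistical significance by comparing the case in which Word2Vec and GloVe embeddings were applied without word elimination against the case proposed in this study, in which selective word elimination was performed and Word2Vec embedding was applied.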

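Date	History type	Details
2027	Evaluation scheduled	Eligible to apply for re-accreditation evaluation (re-accreditation)
2021-01-01	Evaluation	Registered journal status maintained (re-accreditation)
2018-01-01	Evaluation	Registered journal status maintained (registration maintained)
2015-03-25	Society name change	English name: unregistered -> Korea Intelligent Information Systems Society
2015-03-17	Journal name change	Foreign-language name: unregistered -> Journal of Intelligence and Information Systems
2015-01-01	Evaluation	Registered journal status maintained (registration maintained)
2011-01-01	Evaluation	Registered journal status maintained (registration maintained)
2009-01-01	Evaluation	Registered journal status maintained (registration maintained)
2008-02-11	Journal name change	Korean name: 한국지능정보시스템학회 논문지 -> 지능정보연구
2007-01-01	Evaluation	Registered journal status maintained (registration maintained)
2004-01-01	Evaluation	Selected as registered journal (candidate, 2nd round)
2003-01-01	Evaluation	Candidate 1st round PASS (candidate, 1st round)
2001-07-01	Evaluation	Selected as candidate journal (new evaluation)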

Base year	WOS-KCI integrated IF (2-year)	KCIF (2-year)	KCIF (3-year)	KCIF (4-year)	KCIF (5-year)	Centrality index (3-year)	Immediacy index
2016	1.51	1.51	1.99	1.78	1.54	2.674	0.38

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity
