Multilingual Abstract

A part-of-speech (POS) tagged corpus is a collection of electronic text in which each word is annotated with its POS tag, and it is widely used as training data for natural language processing. Training data are generally assumed to be error-free, but in reality they contain various types of errors, which degrade the performance of systems trained on them. To alleviate this problem, we propose a novel method for detecting errors in an existing POS-tagged corpus using an XGBoost classifier and cross-validation. We first train a POS tagger with XGBoost on the POS-tagged corpus, errors included, and then use cross-validation to detect errors in that same corpus. An ordinary classifier cannot detect the errors directly, however, because no training data annotated with POS tagging errors exist. We therefore detect errors by comparing the classifier's outputs (the POS probabilities) while adjusting hyperparameters. The hyperparameters are estimated on a small error-tagged corpus, whose text is sampled from the POS-tagged corpus and in which experts have marked the POS errors. We use precision and recall, which are widely used in information retrieval, as evaluation metrics. Because not every detected error candidate can be checked by hand, we validate the proposed method by comparing the error distributions of the sample (the error-tagged corpus) and the population (the POS-tagged corpus). In future work, we plan to apply the proposed method to dependency-structure-tagged and semantic-role-tagged corpora.
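The detection procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a count-based word-to-tag model stands in for the XGBoost tagger so that the example runs on the standard library alone, and the names `train`, `predict_proba`, and `detect_error_candidates`, the index-based fold assignment, and the `threshold` rule (flag a token when the held-out probability of its annotated tag is low) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(tokens):
    """Count-based stand-in for the XGBoost tagger: tag counts per word."""
    counts = defaultdict(Counter)
    for word, tag in tokens:
        counts[word][tag] += 1
    return counts


def predict_proba(model, word):
    """Return {tag: probability} for a word; empty dict for unseen words."""
    tag_counts = model.get(word)
    if not tag_counts:
        return {}
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()}


def detect_error_candidates(corpus, k=5, threshold=0.2):
    """Flag tokens whose annotated tag gets a low held-out probability.

    corpus: list of (word, tag) pairs. Each token is scored by a model
    trained on the other k-1 folds, so it never votes for its own tag.
    The threshold is the hyperparameter that would be tuned on a small
    expert-marked sample (hypothetical rule, see lead-in).
    """
    candidates = []
    for fold in range(k):
        train_tokens = [corpus[i] for i in range(len(corpus)) if i % k != fold]
        model = train(train_tokens)
        for i in range(fold, len(corpus), k):  # held-out tokens of this fold
            word, tag = corpus[i]
            proba = predict_proba(model, word)
            # Unseen words give no evidence either way, so they are skipped.
            if proba and proba.get(tag, 0.0) < threshold:
                candidates.append(i)
    return sorted(candidates)


# Toy corpus: token 10 is "the" mis-tagged as NOUN.
corpus = [("the", "DET")] * 10 + [("the", "NOUN")] + [("dog", "NOUN")] * 9
print(detect_error_candidates(corpus))  # -> [10]
```

With a real tagger, `predict_proba` would be replaced by the class probabilities of the trained classifier, and the features would cover more than the word form alone.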

Korean Abstract

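A POS-tagged corpus is a corpus annotated with part-of-speech information and is used as training data for a variety of tasks in natural language processing. Training corpora are generally assumed to be free of errors, but in practice they contain various errors, and these errors degrade the performance of the systems trained on them. To mitigate this problem, this paper proposes a method for detecting errors in an already-built POS-tagged corpus using XGBoost and cross-validation. The proposed method first trains a POS tagger with XGBoost on the POS-tagged corpus, which contains errors, and then detects POS errors using cross-validation. However, because no training corpus annotated with errors exists, errors cannot be detected with an ordinary classifier. This paper therefore detects errors by comparing the outputs of the trained POS tagger while adjusting its parameters. To adjust the parameters, it uses a small error-annotated corpus, which is randomly sampled from the full corpus targeted for error detection and in which experts have marked the errors. Precision and recall, which are widely used in information retrieval, serve as the evaluation measures. In addition, because not every error candidate in the population can be checked manually, the validity of the approach is shown by comparing the error distributions of the sample and the population. The authors plan to apply the method to dependency-structure-tagged corpora and semantic-role-tagged corpora in the future.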
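The parameter adjustment on a small expert-marked sample, evaluated with precision and recall, might look like the sketch below. It is an illustration under stated assumptions rather than the paper's procedure: the function names, the candidate `thresholds`, and the use of F1 to pick a single threshold are hypothetical choices, and `tag_probs` is assumed to hold the held-out probability that the tagger assigns to each sampled token's annotated tag.

```python
def precision_recall(flagged, gold_errors):
    """Precision and recall of flagged token indices against expert marks."""
    flagged, gold_errors = set(flagged), set(gold_errors)
    true_positives = len(flagged & gold_errors)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(gold_errors) if gold_errors else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumed selection criterion)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def tune_threshold(tag_probs, gold_errors, thresholds):
    """Pick the threshold with the best F1 on the expert-marked sample.

    tag_probs: probability of the annotated tag for each sampled token.
    gold_errors: indices of tokens that the experts marked as errors.
    Returns (threshold, f1, precision, recall); ties keep the first threshold.
    """
    best = None
    for t in thresholds:
        flagged = [i for i, p in enumerate(tag_probs) if p < t]
        precision, recall = precision_recall(flagged, gold_errors)
        score = f1_score(precision, recall)
        if best is None or score > best[1]:
            best = (t, score, precision, recall)
    return best


# Hypothetical sample of six tokens; experts marked tokens 1 and 3 as errors.
tag_probs = [0.9, 0.05, 0.6, 0.1, 0.8, 0.3]
gold_errors = {1, 3}
print(tune_threshold(tag_probs, gold_errors, [0.02, 0.2, 0.5, 0.95]))
# -> (0.2, 1.0, 1.0, 1.0)
```

A lower threshold favors precision and a higher one favors recall, so a project that can afford more manual checking could select on recall instead of F1.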

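Date	History type	Details
2027	Evaluation scheduled	Eligible to apply for recertification evaluation (recertification)
2021-01-01	Evaluation	Registered journal status maintained (recertification)
2018-01-01	Evaluation	Registered journal status maintained (status maintained)
2015-01-01	Evaluation	Registered journal status maintained (continuing evaluation)
2012-10-31	Journal name change	Korean name: 소프트웨어 및 데이터 공학 -> 정보처리학회논문지. 소프트웨어 및 데이터 공학
2012-10-10	Journal name change	Korean name: 정보처리학회논문지B -> 소프트웨어 및 데이터 공학; foreign-language name: The KIPS Transactions : Part B -> KIPS Transactions on Software and Data Engineering
2010-01-01	Evaluation	Registered journal status maintained (status maintained)
2008-01-01	Evaluation	Registered journal status maintained (status maintained)
2006-01-01	Evaluation	Registered journal status maintained (status maintained)
2003-01-01	Evaluation	Selected as registered journal (candidate, second round)
2002-01-01	Evaluation	Candidate first round PASS (candidate, first round)
2000-07-01	Evaluation	Selected as candidate journal (new evaluation)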

Base year	WOS-KCI combined IF (2-year)	KCIF (2-year)	KCIF (3-year)	KCIF (4-year)	KCIF (5-year)	Centrality index (3-year)	Immediacy index
2016	0.35	0.35	0.28	0.23	0.19	0.511	0.06

XGBoost와 교차검증을 이용한 품사부착말뭉치에서의 오류 탐지 = Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation
