RISS (Research Information Sharing Service)

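      • KCI-listed

        Error Detection in a POS-Tagged Corpus Using XGBoost and Cross-Validation

        최민석, 김창현, 박호민, 천민아, 윤호, 남궁영, 김재균, 김재훈. Korea Information Processing Society, 2020. KIPS Transactions on Software and Data Engineering (정보처리학회논문지. 소프트웨어 및 데이터 공학), Vol.9 No.7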

        A part-of-speech (POS) tagged corpus is a collection of electronic text in which each word is annotated with its POS tag, and it is widely used as training data for natural language processing. Training data is generally assumed to be error-free, but in reality it contains various types of errors, which degrade the performance of systems trained on it. To alleviate this problem, we propose a method for detecting errors in an existing POS-tagged corpus using an XGBoost classifier and cross-validation. We first train a POS tagger with XGBoost on the POS-tagged corpus, errors included, and then detect POS errors using cross-validation. However, because no training corpus annotated with POS errors exists, errors cannot be detected with an ordinary classifier. We therefore detect errors by comparing the outputs (POS probabilities) of the trained tagger while adjusting hyperparameters. The hyperparameters are estimated from a small error-tagged corpus, which is randomly sampled from the full POS-tagged corpus and whose POS errors are marked up by experts. We use precision and recall, which are widely used in information retrieval, as evaluation metrics. Because not all error candidates in the population can be checked by hand, we show the validity of the proposed method by comparing the error distributions of the sample (the error-tagged corpus) and the population (the POS-tagged corpus). In the near future, we will apply the proposed method to a dependency-tagged corpus and a semantic-role-tagged corpus.
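        The cross-validation idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a per-word tag-frequency model stands in for the XGBoost tagger so the example needs no third-party library, and the names `detect_errors`, `threshold`, and `precision_recall` are hypothetical. Each token is scored by a model trained on the other folds, and tokens whose annotated tag receives a low probability are flagged as error candidates.

```python
from collections import Counter, defaultdict


def train(tokens):
    """Count tag frequencies per word (assumed stand-in for an XGBoost tagger)."""
    counts = defaultdict(Counter)
    for word, tag in tokens:
        counts[word][tag] += 1
    return counts


def tag_probability(model, word, tag):
    """Estimate P(tag | word); unseen words get 1.0 so they are never flagged."""
    seen = model.get(word)
    if not seen:
        return 1.0
    return seen[tag] / sum(seen.values())


def detect_errors(corpus, k=5, threshold=0.3):
    """Return indices of tokens whose annotated tag looks unlikely
    to a model trained on the other k-1 folds."""
    suspects = []
    for fold in range(k):
        model = train([tok for i, tok in enumerate(corpus) if i % k != fold])
        for i, (word, tag) in enumerate(corpus):
            if i % k == fold and tag_probability(model, word, tag) < threshold:
                suspects.append(i)
    return sorted(suspects)


def precision_recall(detected, gold):
    """Score detected error indices against expert-marked error indices."""
    detected, gold = set(detected), set(gold)
    hits = len(detected & gold)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

        Under these assumptions, `threshold` plays the role of the adjustable parameter: it could be chosen by running `detect_errors` on the small expert-annotated sample and keeping the value with the best precision and recall trade-off.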

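      • KCI-listed

        A Method for Correcting Errors in the Sejong Corpus

        Jae-Hoon Kim (김재훈), Hyung-Won Seo (서형원), Kil-Ho Jeon (전길호), Myung-Gil Choi (최명길). Korean Society of Marine Engineering, 2010. Proceedings of the Korean Society of Marine Engineering Conference (한국마린엔지니어링학회 학술대회 논문집), Vol.2010 No.4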

        The Sejong corpus is a Korean corpus annotated with various kinds of linguistic information. Depending on the annotation, it comprises a raw corpus, a part-of-speech (POS) tagged corpus, a syntactic treebank, and so on. This paper concerns the POS-tagged corpus, which is annotated with POS information and is used to develop natural language processing (NLP) systems such as information retrieval and information extraction. The Sejong POS-tagged corpus was built by the National Institute of the Korean Language over 9 years and consists of 10.6 million words. However, the corpus is hard to use for developing some NLP systems because it contains various types of errors. We address errors in which the original word does not match the concatenation of its tagged morphemes. In this paper, we present a method for detecting and correcting these errors, together with our results. First, error detection finds string mismatches between original words and the concatenation of their analyzed morphemes. The mismatches are error candidates, and they include some valid forms produced by irregular or phonemic conjugation. We developed a program to filter out the valid forms. The remaining mismatches were corrected according to error type as follows: 1) unnecessarily inserted or deleted words were corrected with manually written regular expressions; 2) special symbols that annotators had not recognized correctly were corrected manually; 3) the remaining errors, which account for a very small portion, were also corrected manually. As a result of this effort, the Sejong POS-tagged corpus has been improved enough to be useful for some applications.
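        The detection step described above, comparing each original word with the concatenation of its tagged morphemes, might look like the following sketch. The `morph/TAG+morph/TAG` notation and the function names are assumptions made for illustration, and the naive split assumes that no morpheme contains `+` or `/`, which would not hold for symbol tokens in a real corpus.

```python
import unicodedata


def parse_analysis(analysis):
    """Split 'morph/TAG+morph/TAG' into (morph, tag) pairs.
    Naive: assumes morphemes contain neither '+' nor '/'."""
    pairs = []
    for chunk in analysis.split("+"):
        morph, tag = chunk.rsplit("/", 1)
        pairs.append((morph, tag))
    return pairs


def normalize(text):
    """Normalize to NFC so equal strings in different encodings compare equal."""
    return unicodedata.normalize("NFC", text)


def is_mismatch(word, analysis):
    """True when the word differs from the concatenation of its morphemes."""
    joined = "".join(morph for morph, _ in parse_analysis(analysis))
    return normalize(word) != normalize(joined)


def find_candidates(rows):
    """Return the (word, analysis) rows that are error candidates."""
    return [(word, analysis) for word, analysis in rows if is_mismatch(word, analysis)]
```

        For example, `("갔다", "가/VV+았/EP+다/EF")` is reported as a candidate even though it is a valid contracted form, which is why the abstract describes a separate filtering step before any correction.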

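      • KCI-listed

        A Three-Step Probabilistic Model for Korean Morphological Analysis

        Jae Sung Lee (이재성). Korean Institute of Information Scientists and Engineers, 2011. Journal of KIISE: Software and Applications (정보과학회논문지 : 소프트웨어 및 응용), Vol.38 No.5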

        A morphological analyzer based on a probabilistic model can directly learn the various language phenomena and tagging principles found in a morpheme-tagged corpus, so it adapts well to various domains. In this paper, we propose a three-step probabilistic model for Korean morphological analysis, consisting of an original-form restoration step, a morpheme segmentation step, and a morpheme tagging step. By handling the three steps as independent modules, the model reduces processing complexity compared with the existing two-step probabilistic model. In addition, processing in jaso units rather than syllable units and using morpheme transition probabilities for morpheme segmentation allow the model to learn various POS tagging principles. The model was evaluated on the written and spoken morpheme-tagged corpora developed in the Sejong project and compared with existing methods.
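        To make "processing in jaso units" concrete, the sketch below decomposes precomposed Hangul syllables into their initial, medial, and final jaso using the standard Unicode arithmetic for the Hangul Syllables block. It only illustrates the unit of processing; the function name `to_jaso` is hypothetical and nothing here reproduces the paper's probabilistic model.

```python
HANGUL_BASE = 0xAC00       # first precomposed syllable, '가'
SYLLABLE_COUNT = 11172     # 19 initials * 21 medials * 28 finals

CHO = [chr(0x1100 + i) for i in range(19)]            # initial consonants
JUNG = [chr(0x1161 + i) for i in range(21)]           # medial vowels
JONG = [""] + [chr(0x11A8 + i) for i in range(27)]    # finals, index 0 = none


def to_jaso(text):
    """Decompose each Hangul syllable into conjoining jaso; keep other characters."""
    out = []
    for ch in text:
        index = ord(ch) - HANGUL_BASE
        if 0 <= index < SYLLABLE_COUNT:
            cho, rest = divmod(index, 21 * 28)
            jung, jong = divmod(rest, 28)
            out.append(CHO[cho] + JUNG[jung] + JONG[jong])
        else:
            out.append(ch)
    return "".join(out)
```

        At the jaso level the final consonant of 갔 becomes a separate symbol, so a boundary such as 가 + 았 can fall inside what is a single syllable on the surface.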

      • KCI-listed (excellent)

        Syllable-Based Probabilistic Models for Korean Morphological Analysis

        Kwangseob Shim (심광섭). Korean Institute of Information Scientists and Eng, 2014. Journal of KIISE (정보과학회논문지), Vol.41 No.9

        This paper proposes three probabilistic models for syllable-based Korean morphological analysis and reports their performance. The probabilities for the models are acquired from a POS-tagged corpus. In 10-fold cross-validation experiments, the models achieve an answer inclusion rate of 98.3% when trained on the Sejong POS-tagged corpus of 10 million eojeols. In our models, POS tags are assigned to each syllable before spelling recovery and morpheme generation, which enables more efficient morphological analysis than previous probabilistic models, where spelling recovery is performed in the first stage. This efficiency yields a speed-up in morphological analysis. Experiments show that morphological analysis runs at a rate of 147K eojeols per second, which is almost 174 times faster than the previous probabilistic models for Korean morphology.
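        The abstract does not define the answer inclusion rate. The sketch below assumes the usual reading: the fraction of eojeols for which the gold analysis appears among the candidate analyses the analyzer produces. The function name and the string representation of analyses are hypothetical.

```python
def answer_inclusion_rate(candidates_per_eojeol, gold_analyses):
    """Fraction of eojeols whose gold analysis is among the produced candidates.

    candidates_per_eojeol: list of candidate-analysis lists, one per eojeol
    gold_analyses: list of gold analyses, aligned with the candidates
    """
    if not gold_analyses:
        return 0.0
    hits = sum(
        gold in candidates
        for candidates, gold in zip(candidates_per_eojeol, gold_analyses)
    )
    return hits / len(gold_analyses)
```

        In a 10-fold setting, this rate would be computed on each held-out fold and then averaged across the folds.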

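      • KCI candidate

        Improving the Performance of a Statistical Korean Morphological Analyzer

        Kwangseob Shim (심광섭). Sungshin Women's University, Institute of Humanities (성신여자대학교 인문과학연구소), 2016. 人文科學硏究, Vol.34 No.-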

        Statistical Korean morphological analysis is a new approach in that it does not require a manually built machine-readable morphological dictionary. Instead, it uses statistical information acquired from a POS-tagged corpus. The acquisition of this information is fully automated, so no human intervention is required. This is the strength of the statistical approach to Korean morphological analysis. Its weakness is low precision, meaning that the number of false positives is relatively high. To improve precision, this paper proposes a method for filtering false positives. The proposed method introduces two types of dictionaries, a one-syllable-morpheme dictionary and a josa-eomi dictionary, which are constructed automatically when statistical information is collected from the POS-tagged corpus. To evaluate the proposed method, 10-fold cross-validation is performed on the 10-million-eojeol Sejong POS-tagged corpus. The experimental results show that precision improves by 5%.
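        The abstract does not spell out how the two dictionaries are applied, so the following is only one plausible sketch of dictionary-based filtering. It assumes that a candidate analysis is rejected when it contains a josa or eomi missing from the josa-eomi dictionary, or a one-syllable morpheme missing from the one-syllable-morpheme dictionary. All function names are hypothetical; the only outside fact used is that Sejong-style josa tags begin with `J` and eomi tags with `E`.

```python
JOSA_EOMI_PREFIXES = ("J", "E")  # Sejong-style josa tags (J*) and eomi tags (E*)


def keep_analysis(analysis, one_syllable_dict, josa_eomi_dict):
    """Assumed rule: reject an analysis containing an unattested josa/eomi
    or an unattested one-syllable morpheme."""
    for morph, tag in analysis:
        if tag.startswith(JOSA_EOMI_PREFIXES):
            if (morph, tag) not in josa_eomi_dict:
                return False
        elif len(morph) == 1 and (morph, tag) not in one_syllable_dict:
            return False
    return True


def filter_candidates(candidates, one_syllable_dict, josa_eomi_dict):
    """Keep only the candidate analyses that pass both dictionary checks."""
    return [
        analysis
        for analysis in candidates
        if keep_analysis(analysis, one_syllable_dict, josa_eomi_dict)
    ]
```

        Both dictionaries are modeled here as sets of `(morpheme, tag)` pairs, which could be collected in the same pass over the training corpus that gathers the statistical information.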
