RISS Search - Thesis Details

Multilingual Abstract

This paper proposes an automatic word-spacing method for Korean text that uses word unigram and syllable bigram statistics. The statistics are extracted from a large body of processed corpora containing 33,643,884 word tokens.
Although this method efficiently mitigates data sparseness by using syllable bigram statistics and large corpora, the problem of processing unseen words remains, and it can hardly be overcome even with a huge corpus. Therefore, this study supplements the stochastic approach by expanding the candidate words with both a stochastic method and a rule-based (knowledge-based) method. The system first splits an input sentence into a candidate-word sequence using the stochastic method. It then expands the candidate-word list using longest-radix selection among the morphemes proposed by the morphological analyzer. Combining the two methods increases the system's accuracy. On the spacing test data, the system achieved an encouraging average precision of 98.26% in word-unit correction.
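The first step described in the abstract, splitting an unspaced sentence into a candidate-word sequence from word statistics, can be illustrated with a small dynamic-programming sketch. This is not the thesis's algorithm: it uses word unigram scores only and leaves out the syllable bigram statistics and the morphological analyzer. The counts, the names `WORD_COUNTS`, `MAX_WORD_LEN`, `UNSEEN_LOGP`, `word_logp`, and `space_words`, and the flat penalty for unseen words are all assumptions made for illustration.

```python
import math

# Toy word-unigram counts (hypothetical; the thesis draws its statistics
# from corpora of 33,643,884 word tokens).
WORD_COUNTS = {"아버지가": 5, "아버지": 6, "가방에": 2, "방에": 4, "들어가신다": 3}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = 6               # assumed cap on candidate length, in syllables
UNSEEN_LOGP = math.log(1e-6)   # assumed flat per-syllable penalty for unseen words


def word_logp(word):
    """Log-probability of a candidate word under the toy unigram model."""
    count = WORD_COUNTS.get(word)
    if count is None:
        # Unseen candidate: penalize in proportion to its length.
        return UNSEEN_LOGP * len(word)
    return math.log(count / TOTAL)


def space_words(text):
    """Return the highest-scoring split of an unspaced string."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + word_logp(text[start:end])
            if score > best[end]:
                best[end], back[end] = score, start
    # Follow the back-pointers from the end of the string to recover the words.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return " ".join(reversed(words))


print(space_words("아버지가방에들어가신다"))  # → 아버지가 방에 들어가신다
```

With these toy counts, the split "아버지가 방에 들어가신다" outscores "아버지 가방에 들어가신다". Any candidate missing from `WORD_COUNTS` receives only the flat penalty, which is the unseen-word weakness the abstract says a corpus alone cannot remove.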

Table of Contents

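Contents
1. Introduction
2. Related Work
3. Automatic Word Spacing Using Word and Syllable Statistics
3.1. Word and Syllable Statistics
3.1.1. Corpora Used to Extract the Statistics
3.1.2. Extracted Statistics
3.1.3. Probability Information Derived from the Statistical Data
3.2. Automatic Word-Spacing Algorithm Using Statistics
3.2.1. Word-Spacing Algorithm Using Word and Syllable Statistics
3.2.2. Evaluation and Limitations of the Word-Based Automatic Word-Spacing Method
4. Improving Automatic Word Spacing by Addressing Data Sparseness
4.1. Corpus Expansion
4.2. Statistics-Based Approach to Data Sparseness
4.3. Rule-Based Approach to Data Sparseness
4.3.1. Using the Longest Morpheme Combination
4.3.2. Addressing Data Sparseness Using Morpheme Frequencies
5. Experiments and Evaluation
5.1. Test Data Composition and Accuracy Measurement
5.1.1. Test Data
5.1.2. Accuracy Measurement Method
5.2. Performance Evaluation of the Automatic Word-Spacing Methods
5.2.1. Experimental Results by Training Corpus
5.2.2. Experimental Results by Word-Spacing Method
6. Conclusion and Future Work
References
Abstract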

Automatic Korean Word Spacing Using Word and Syllable Statistics (어절과 음절 통계 정보를 이용한 한국어 자동 띄어쓰기)