This paper proposes an automatic word-spacing method for Korean text that uses word unigram and syllable bigram statistics. The statistics are extracted from a large body of processed corpora containing 33,643,884 word tokens.
Although this method efficiently resolves problems due to data sparseness by using syllable bigram statistics and large corpora, the problem of processing unseen words remains, and it can hardly be overcome even with a huge corpus. Therefore, this study compensates for the limits of the purely stochastic approach by expanding the candidate words with a combination of the stochastic method and a rule- and knowledge-based method. The system first splits an input sentence into a candidate-word sequence using the stochastic method. It then expands the candidate-word list using longest-radix selection among the morphemes proposed by the morphological analyzer. Combining the two methods increases the system’s accuracy. Encouraging results were obtained: an average precision of 98.26% in word-unit correction on the spacing test data.
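The stochastic splitting step can be illustrated with a minimal sketch. It is not the system described above: it covers only a word-unigram model searched by dynamic programming, and it leaves out the syllable bigram statistics and the morphological-analyzer expansion. The function name `segment`, the per-syllable penalty `unk_logp`, the length limit `max_len`, and the toy probabilities are all assumptions made for illustration.

```python
import math

def segment(text, word_logp, unk_logp=-20.0, max_len=8):
    """Split a space-free string into the word sequence with the highest
    total log-probability under a word-unigram model (hypothetical sketch)."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            cand = text[j:i]
            # Unseen candidates receive an assumed penalty per syllable.
            score = best[j] + word_logp.get(cand, unk_logp * len(cand))
            if score > best[i]:
                best[i], back[i] = score, j
    # Follow the back-pointers from the end of the string.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Toy unigram table; the probabilities are invented, not corpus statistics.
toy_logp = {
    "아버지가": math.log(0.02),
    "아버지": math.log(0.03),
    "방에": math.log(0.01),
    "가방에": math.log(0.002),
    "들어가신다": math.log(0.005),
}

print(" ".join(segment("아버지가방에들어가신다", toy_logp)))
```

With these toy values the split "아버지가 방에 들어가신다" scores higher than "아버지 가방에 들어가신다". A string made of unseen syllables is still segmented, but only through the fallback penalty, which is the weakness that the candidate-word expansion is meant to address.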