This study compiles and refines collated, punctuated Classical Chinese texts accumulated through prior research and projects into a database of approximately 3.4 million items (≈420 million characters). Building on this resource, we develop a punctuation inference model specialized for Korean Classical Chinese by fine-tuning the pretrained language model Chinese-RoBERTa as a multi-label token classifier. The training corpus, which covers eight genres including annals, collected works, and diaries, was preprocessed and standardized to seven punctuation marks (, 。 · ? ! 《 》). The final model achieves an overall F1 score of 0.9050 on held-out validation data. On unseen corpora that contain only traditional ring-dot punctuation (Hanguk Munjip Chonggan and Ilseongnok), it attains punctuation-position matching F1 scores of 0.8784 and 0.9065, respectively. By punctuation type, question marks, commas, periods, and middle dots perform well, whereas book-title brackets (《》), which require long-range dependencies across paired marks, and exclamation marks, which are sparse in the data, show lower recall. We release an open-source integrated system (model weights, training data, source code, and GUI/CLI batch processing) to support records and information services as well as research workflows that rely on natural-language analysis, such as text preprocessing, indexing and search, translation preprocessing, and OCR postprocessing. Future work includes a dual-path architecture for paired punctuation, genre-adaptive modules, and multi-task integration with sentence-structure analysis and named-entity recognition.
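To make the notion of punctuation-position matching concrete, the following minimal sketch shows one way such a score could be computed when the reference text carries only a single ring-dot style mark. It is an illustration under stated assumptions, not the evaluation code of the released system: the function names `boundary_positions` and `position_f1` are hypothetical, the mark set adds full-width variants of the comma, question mark, and exclamation mark as an assumption, and every mark (including 《 and 》) is treated as a plain boundary regardless of type.

```python
# Hypothetical illustration; names and the exact mark set are assumptions,
# not taken from the released system.
PUNCT = set(",。·?!《》,?!")  # the seven standardized marks plus full-width variants


def boundary_positions(text, marks=PUNCT):
    """Return offsets into the unpunctuated text at which a mark occurs.

    Offset k means "a mark sits right after the k-th non-mark character".
    """
    positions = set()
    n_chars = 0
    for ch in text:
        if ch in marks:
            positions.add(n_chars)
        else:
            n_chars += 1
    return positions


def position_f1(predicted, reference, marks=PUNCT):
    """F1 over punctuation positions only, ignoring which mark was used."""
    pred = boundary_positions(predicted, marks)
    ref = boundary_positions(reference, marks)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "天地玄黃。宇宙洪荒。"   # reference with a single boundary mark
    predicted = "天地玄黃,宇宙洪荒。"   # prediction with typed marks
    print(position_f1(predicted, reference))
```

In this sketch a predicted comma at the same offset as a reference ring dot counts as a match, which reflects why type-level scores cannot be reported on corpora that record boundaries with one mark only.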