RISS (Research Information Sharing Service)


      Graph-based decoding network using WFST and Its use in the integration of end-to-end speech recognition and language model : WFST를 이용한 그래프 기반의 디코딩 네트워크 및 이를 이용한 End-to-End 음성인식과 언어 모델의 통합


      https://www.riss.kr/link?id=T16795334

      • Author
      • Publication

        Seoul: Sogang University Graduate School, 2023

      • Thesis Information

        Thesis (Ph.D.) -- Sogang University Graduate School: Department of Computer Science and Engineering, 2023.8

      • Year of Publication

        2023

      • Language

        English

      • Country (City)

        South Korea

      • Physical Description

        vii, 103 p. : ill. ; 26 cm.

      • General Note

        Advisor: Kim Ji-hwan.
        Includes bibliographical references.

      • UCI Identifier

        I804:11029-000000076342

      • Holding Institution
        • Sogang University Library

      Additional Information

      Multilingual Abstract

      Most modern automatic speech recognition (ASR) systems are developed in an end-to-end manner. The main feature of end-to-end ASR is that it overcomes the dependence on the forced alignments of hidden Markov models in conventional deep neural network (DNN)-weighted finite-state transducer (WFST)-based ASR, using methods such as connectionist temporal classification (CTC) and attention mechanisms. It also enables the acoustic and language models to be trained in a single DNN. However, only paired audio and text are used in training. Rich linguistic knowledge from large-scale text data can be exploited in many natural language processing tasks, but collecting paired speech-text data is more expensive than collecting text or speech alone. Consequently, the quality and quantity of paired speech-text data determine the achievable performance of end-to-end speech recognition systems. In contrast, in DNN-WFST-based speech recognition, large-scale text-only data can be used to train the language model and thereby improve recognition performance. Moreover, compared to paired speech-text data, text-only data are easier to collect and less costly.
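      The advantage of text-only data that the abstract describes can be illustrated with a toy language model. The sketch below is not the thesis's method (the thesis uses graph-based language models in a WFST); it is a minimal add-one-smoothed bigram model, with hypothetical function names, showing that an LM needs nothing but raw text to train.

```python
from collections import Counter
import math

def train_bigram_lm(sentences):
    """Train an add-one-smoothed bigram LM from text-only data.

    No paired audio is needed: counts come entirely from raw sentences.
    """
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])          # contexts
        bigrams.update(zip(toks[:-1], toks[1:]))
    v = len(vocab)

    def logprob(prev, word):
        # Add-one smoothing so unseen bigrams get nonzero probability.
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + v))

    return logprob

logprob = train_bigram_lm(["the cat sat", "the cat ran"])
# "the cat" was observed twice, "the sat" never, so the former scores higher.
```

      In the thesis's setting, a far larger text corpus plays the role of `sentences`, and the resulting model is compiled into a graph rather than queried as a Python closure.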

      Decoding networks have been explored for combining end-to-end ASR with external language models trained on text-only data. They come in two types. A partially expanded decoding network passes the probabilities from the end-to-end ASR model to the language model, and the probabilities from the two models are combined to compute the probability of the entire sequence. Partially expanded decoding networks allow both models to be trained independently, permit decoding with external information, and can be applied regardless of model type (input form and topology) because they only consider the probabilities of the models' outputs. However, they have the drawback of requiring a model call at every time step and requiring the end-to-end and language models to share the same word set. They also suffer from high memory usage because the graph must store all paths. This work proposes a fully expanded decoding network because it 1) can use off-the-shelf pre-trained end-to-end ASR and language models without modification and 2) outperforms partially expanded methods.
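      The score combination that a partially expanded network performs can be sketched as a shallow-fusion-style interpolation: at each step, the ASR log-probability is added to a weighted LM log-probability, and the hypothesis with the best total wins. This is a minimal illustration under assumed names (`fuse_scores`, `lm_weight`), not the thesis's implementation; note how it presumes both models score the same token sequence over the same word set, which is exactly the constraint criticized above.

```python
def fuse_scores(asr_logprobs, lm_logprobs, lm_weight=0.3):
    """Combine per-token log-probabilities from an end-to-end ASR model
    and an external LM (shallow-fusion-style linear interpolation).

    Both lists must cover the same tokens, i.e. the models must share a
    word set -- one drawback of partially expanded decoding networks.
    """
    assert len(asr_logprobs) == len(lm_logprobs)
    return sum(a + lm_weight * l for a, l in zip(asr_logprobs, lm_logprobs))

# Hypothetical per-token scores for two candidate hypotheses.
hyp_a = fuse_scores([-0.1, -0.2], [-1.0, -1.2])  # ASR likes it, LM less so
hyp_b = fuse_scores([-0.1, -0.9], [-0.5, -0.4])  # LM likes it, ASR less so
best = max([("a", hyp_a), ("b", hyp_b)], key=lambda kv: kv[1])[0]
```

      The per-step cost is visible here too: every hypothesis extension requires a fresh LM score, which is the repeated model-call overhead that motivates the fully expanded alternative.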

      A WFST implements the fully expanded decoding network, as in many previous studies; WFSTs have been used statically to incorporate fixed lattices in discriminative sequence criteria. To incorporate a graph-based language model into ASR, we propose an enhanced CTC transducer for collapsing output labels in CTC-based ASR. The new CTC transducer, closely aligned with the standard CTC algorithm, reduces the word error rate (WER). Subsequently, we introduce a tokenization transducer for non-speech processing, showing that, with adequately designed schemes for processing non-speech symbols, the tokenization transducer significantly improves decoding performance compared to a vanilla WFST-based decoder.
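      The label collapsing that the CTC transducer encodes follows the standard CTC rule: merge consecutive repeated labels, then remove blanks. The sketch below shows that mapping as a plain function, not as a transducer; the thesis compiles the same relation into a WFST so it can be composed with the graph-based language model.

```python
def ctc_collapse(labels, blank=0):
    """Collapse a frame-level CTC label sequence into output tokens.

    Standard CTC rule: merge consecutive repeats, then drop blanks.
    A blank between two identical labels keeps them distinct.
    """
    out, prev = [], None
    for label in labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Frame labels (0 = blank): [1,1,0,2,2,0,2] -> tokens [1,2,2];
# the blank between the runs of 2 preserves the repeated token.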
      번역하기

      Most modern automatic speech recognition (ASR) systems are developed in an end-to-end manner. The main feature of end-to-end ASR is to overcome the dependence of hidden Markov models on forced alignments of conventional deep neural network(DNN)-weight...

      Most modern automatic speech recognition (ASR) systems are developed in an end-to-end manner. The main feature of end-to-end ASR is to overcome the dependence of hidden Markov models on forced alignments of conventional deep neural network(DNN)-weighted finite-state transducer(WFST)-based ASR using methods such as connectionist temporal classification (CTC) and attention mechanisms. It also enables the training of acoustic and speech models in a single DNN. However, only paired audio and text are used in training. Rich linguistic knowledge from large-scale text data can be used in many natural language processing tasks. However, collecting paired speech-text data is more expensive than collecting only text or speech data. Consequently, the quality and quantity of paired speech-text data determine how to improve the performance of end-to-end based speech recognition systems. In contrast, for DNN-WFST-based speech recognition, large-scale text-only data can train the language model to improve speech recognition performance. Moreover, compared to paired speech-text data, text-only data are easier to collect and less costly.

      Decoding networks are exploratory for combining end-to-end ASR with external language models from text-only data. Decoding networks have two types. The partially expanded decoding network passes the probabilities from the end-to-end ASR model to the language model, and the probabilities from the language model are combined to calculate the probability of the entire sequence. partially expanded decoding networks can learn both models independently. Decoding is possible using external information. It can also be applied regardless of the model type (input form and topology) because it only considers the probability of the model's output. However, it has the drawback of requiring the same model call operation to be processed at every instant in time and requiring the same word set to be used by both the end-to-end and language models. It also has the disadvantage of high memory usage because a graph must store all paths. This work proposes a fully expanded decoding network because it 1) can use off-the-shelf pre-trained end-to-end ASR and language models without modification and 2) outperforms partially expanded methods.

      A WFST implements a fully expanded decoding network used in many previous studies. WFSTs are used statically to incorporate fixed lattices in discriminative sequence criteria. To incorporate a graph-based language model into ASR, we propose an enhanced CTC transducer for collapsing output labels in CTC-based ASR. The new CTC transducer, tightly developed in a standard CTC algorithm, reduces the word error rate (WER). Subsequently, we introduce a tokenization transducer for non-speech processing, showing that with adequately designed schemes of processing non-speech symbols. We found that the tokenizing transducer significantly improves decoding performance compared to a vanilla WFST-based decoder.

      더보기

      참고문헌 (Reference) 논문관계도

      1 A. Ghoshal, Y. Qian, P. Schwarz., P. Motlicek, O. Glembek, N. Goel, M. Hannemann, L. Burget, G. Boulianne, D. Povey, "The kaldi speech recognition toolkit", Proceedings of IEEE 2011 workshop on automatic speech recognition and understanding, 2011

      2 A. Ghoshal, Y. Qian., S. Kombrink, P. Motlíček, M. Karafiát, M. Janda, M. Hannemann, L. Burget, G. Boulianne, D. Povey, "Generating exact lattices in the WFST framework", Proceedings of 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012

      3 K. Heafield, "KenLM: Faster and smaller language model queries", Proceedings of the sixth workshop on statistical machine translation, pp. 187–197, 2011

      4 A. Pawlak, M. Jarkiewicz, L. Okruszek, J. Szymanowska, J. Sarzynska-Wawer, I. Stefaniak, A. Wawer, "Detecting formal thought disorder by deep contextualized word representations,", vol. 304, pp. 114– 135, 2021

      5 A. Tripathi, S. Kumar, S. Koo, Q. Zhang, H. Sak, H. Lu, E. McDermott, "Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,", Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7829–7833, 2020

      6 T. Nakatani, "Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,", Proceedings of Interspeech, 2019

      1 A. Ghoshal, Y. Qian, P. Schwarz., P. Motlicek, O. Glembek, N. Goel, M. Hannemann, L. Burget, G. Boulianne, D. Povey, "The kaldi speech recognition toolkit", Proceedings of IEEE 2011 workshop on automatic speech recognition and understanding, 2011

      2 A. Ghoshal, Y. Qian., S. Kombrink, P. Motlíček, M. Karafiát, M. Janda, M. Hannemann, L. Burget, G. Boulianne, D. Povey, "Generating exact lattices in the WFST framework", Proceedings of 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012

      3 K. Heafield, "KenLM: Faster and smaller language model queries", Proceedings of the sixth workshop on statistical machine translation, pp. 187–197, 2011

      4 A. Pawlak, M. Jarkiewicz, L. Okruszek, J. Szymanowska, J. Sarzynska-Wawer, I. Stefaniak, A. Wawer, "Detecting formal thought disorder by deep contextualized word representations,", vol. 304, pp. 114– 135, 2021

      5 A. Tripathi, S. Kumar, S. Koo, Q. Zhang, H. Sak, H. Lu, E. McDermott, "Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,", Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7829–7833, 2020

      6 T. Nakatani, "Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,", Proceedings of Interspeech, 2019

      더보기

      분석정보

      View

      상세정보조회

      0

      Usage

      원문다운로드

      0

      대출신청

      0

      복사신청

      0

      EDDS신청

      0

      동일 주제 내 활용도 TOP

      더보기

      주제

      연도별 연구동향

      연도별 활용동향

      연관논문

      연구자 네트워크맵

      공동연구자 (7)

      유사연구자 (20) 활용도상위20명

      이 자료와 함께 이용한 RISS 자료

      나만을 위한 추천자료

      해외이동버튼