RISS (Research Information Sharing Service)


      Graph-based decoding network using WFST and Its use in the integration of end-to-end speech recognition and language model : WFST를 이용한 그래프 기반의 디코딩 네트워크 및 이를 이용한 End-to-End 음성인식과 언어 모델의 통합


      https://www.riss.kr/link?id=T16795334

      • Author
      • Publication

        Seoul: Sogang University Graduate School, 2023

      • Thesis Information

        Thesis (Ph.D.) -- Sogang University Graduate School: Department of Computer Science and Engineering, 2023.8

      • Year of Publication

        2023

      • Language

        English

      • Country (City)

        South Korea

      • Physical Description

        vii, 103 p. : ill. ; 26 cm.

      • General Note

        Advisor: Kim Ji-hwan.
        Includes bibliographical references.

      • UCI Identifier

        I804:11029-000000076342

      • Holding Institution
        • Sogang University Library

      Additional Information

      Multilingual Abstract

      Most modern automatic speech recognition (ASR) systems are developed in an end-to-end manner. The main feature of end-to-end ASR is that it overcomes the dependence on the forced alignments of hidden Markov models in conventional deep neural network (DNN)-weighted finite-state transducer (WFST)-based ASR, using methods such as connectionist temporal classification (CTC) and attention mechanisms. It also enables the acoustic and language models to be trained in a single DNN. However, only paired audio and text are used in training. Rich linguistic knowledge from large-scale text data can be exploited in many natural language processing tasks, but collecting paired speech-text data is more expensive than collecting text or speech alone. Consequently, the quality and quantity of paired speech-text data determine the achievable performance of end-to-end speech recognition systems. In contrast, in DNN-WFST-based speech recognition, large-scale text-only data can be used to train the language model and thereby improve recognition performance. Moreover, compared to paired speech-text data, text-only data are easier to collect and less costly.
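      The advantage of text-only data that the abstract describes can be illustrated with a toy language model. The sketch below is not the thesis's method (the thesis uses graph-based language models in a WFST); it is a minimal add-one-smoothed bigram model, with hypothetical function names, showing that an LM needs nothing but raw text to train.

```python
from collections import Counter
import math

def train_bigram_lm(sentences):
    """Train an add-one-smoothed bigram LM from text-only data.

    No paired audio is needed: counts come entirely from raw sentences.
    """
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])          # contexts
        bigrams.update(zip(toks[:-1], toks[1:]))
    v = len(vocab)

    def logprob(prev, word):
        # Add-one smoothing so unseen bigrams get nonzero probability.
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + v))

    return logprob

logprob = train_bigram_lm(["the cat sat", "the cat ran"])
# "the cat" was observed twice, "the sat" never, so the former scores higher.
```

      In the thesis's setting, a far larger text corpus plays the role of `sentences`, and the resulting model is compiled into a graph rather than queried as a Python closure.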

      Decoding networks have been explored for combining end-to-end ASR with external language models trained on text-only data. They come in two types. A partially expanded decoding network passes the probabilities from the end-to-end ASR model to the language model, and the probabilities from the two models are combined to compute the probability of the entire sequence. Partially expanded decoding networks allow both models to be trained independently, permit decoding with external information, and can be applied regardless of model type (input form and topology) because they only consider the probabilities of the models' outputs. However, they have the drawback of requiring a model call at every time step and requiring the end-to-end and language models to share the same word set. They also suffer from high memory usage because the graph must store all paths. This work proposes a fully expanded decoding network because it 1) can use off-the-shelf pre-trained end-to-end ASR and language models without modification and 2) outperforms partially expanded methods.
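      The score combination that a partially expanded network performs can be sketched as a shallow-fusion-style interpolation: at each step, the ASR log-probability is added to a weighted LM log-probability, and the hypothesis with the best total wins. This is a minimal illustration under assumed names (`fuse_scores`, `lm_weight`), not the thesis's implementation; note how it presumes both models score the same token sequence over the same word set, which is exactly the constraint criticized above.

```python
def fuse_scores(asr_logprobs, lm_logprobs, lm_weight=0.3):
    """Combine per-token log-probabilities from an end-to-end ASR model
    and an external LM (shallow-fusion-style linear interpolation).

    Both lists must cover the same tokens, i.e. the models must share a
    word set -- one drawback of partially expanded decoding networks.
    """
    assert len(asr_logprobs) == len(lm_logprobs)
    return sum(a + lm_weight * l for a, l in zip(asr_logprobs, lm_logprobs))

# Hypothetical per-token scores for two candidate hypotheses.
hyp_a = fuse_scores([-0.1, -0.2], [-1.0, -1.2])  # ASR likes it, LM less so
hyp_b = fuse_scores([-0.1, -0.9], [-0.5, -0.4])  # LM likes it, ASR less so
best = max([("a", hyp_a), ("b", hyp_b)], key=lambda kv: kv[1])[0]
```

      The per-step cost is visible here too: every hypothesis extension requires a fresh LM score, which is the repeated model-call overhead that motivates the fully expanded alternative.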

      A WFST implements the fully expanded decoding network, as in many previous studies; WFSTs have been used statically to incorporate fixed lattices in discriminative sequence criteria. To incorporate a graph-based language model into ASR, we propose an enhanced CTC transducer for collapsing output labels in CTC-based ASR. The new CTC transducer, closely aligned with the standard CTC algorithm, reduces the word error rate (WER). Subsequently, we introduce a tokenization transducer for non-speech processing, showing that, with adequately designed schemes for processing non-speech symbols, the tokenization transducer significantly improves decoding performance compared to a vanilla WFST-based decoder.
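      The label collapsing that the CTC transducer encodes follows the standard CTC rule: merge consecutive repeated labels, then remove blanks. The sketch below shows that mapping as a plain function, not as a transducer; the thesis compiles the same relation into a WFST so it can be composed with the graph-based language model.

```python
def ctc_collapse(labels, blank=0):
    """Collapse a frame-level CTC label sequence into output tokens.

    Standard CTC rule: merge consecutive repeats, then drop blanks.
    A blank between two identical labels keeps them distinct.
    """
    out, prev = [], None
    for label in labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Frame labels (0 = blank): [1,1,0,2,2,0,2] -> tokens [1,2,2];
# the blank between the runs of 2 preserves the repeated token.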
      번역하기

      Most modern automatic speech recognition (ASR) systems are developed in an end-to-end manner. The main feature of end-to-end ASR is to overcome the dependence of hidden Markov models on forced alignments of conventional deep neural network(DNN)-weight...

      Most modern automatic speech recognition (ASR) systems are developed in an end-to-end manner. The main feature of end-to-end ASR is to overcome the dependence of hidden Markov models on forced alignments of conventional deep neural network(DNN)-weighted finite-state transducer(WFST)-based ASR using methods such as connectionist temporal classification (CTC) and attention mechanisms. It also enables the training of acoustic and speech models in a single DNN. However, only paired audio and text are used in training. Rich linguistic knowledge from large-scale text data can be used in many natural language processing tasks. However, collecting paired speech-text data is more expensive than collecting only text or speech data. Consequently, the quality and quantity of paired speech-text data determine how to improve the performance of end-to-end based speech recognition systems. In contrast, for DNN-WFST-based speech recognition, large-scale text-only data can train the language model to improve speech recognition performance. Moreover, compared to paired speech-text data, text-only data are easier to collect and less costly.

      Decoding networks are exploratory for combining end-to-end ASR with external language models from text-only data. Decoding networks have two types. The partially expanded decoding network passes the probabilities from the end-to-end ASR model to the language model, and the probabilities from the language model are combined to calculate the probability of the entire sequence. partially expanded decoding networks can learn both models independently. Decoding is possible using external information. It can also be applied regardless of the model type (input form and topology) because it only considers the probability of the model's output. However, it has the drawback of requiring the same model call operation to be processed at every instant in time and requiring the same word set to be used by both the end-to-end and language models. It also has the disadvantage of high memory usage because a graph must store all paths. This work proposes a fully expanded decoding network because it 1) can use off-the-shelf pre-trained end-to-end ASR and language models without modification and 2) outperforms partially expanded methods.

      A WFST implements a fully expanded decoding network used in many previous studies. WFSTs are used statically to incorporate fixed lattices in discriminative sequence criteria. To incorporate a graph-based language model into ASR, we propose an enhanced CTC transducer for collapsing output labels in CTC-based ASR. The new CTC transducer, tightly developed in a standard CTC algorithm, reduces the word error rate (WER). Subsequently, we introduce a tokenization transducer for non-speech processing, showing that with adequately designed schemes of processing non-speech symbols. We found that the tokenizing transducer significantly improves decoding performance compared to a vanilla WFST-based decoder.

      더보기

      참고문헌 (Reference) 논문관계도

      1 A. Ghoshal, Y. Qian, P. Schwarz., P. Motlicek, O. Glembek, N. Goel, M. Hannemann, L. Burget, G. Boulianne, D. Povey, "The kaldi speech recognition toolkit", Proceedings of IEEE 2011 workshop on automatic speech recognition and understanding, 2011

      2 A. Ghoshal, Y. Qian., S. Kombrink, P. Motlíček, M. Karafiát, M. Janda, M. Hannemann, L. Burget, G. Boulianne, D. Povey, "Generating exact lattices in the WFST framework", Proceedings of 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012

      3 K. Heafield, "KenLM: Faster and smaller language model queries", Proceedings of the sixth workshop on statistical machine translation, pp. 187–197, 2011

      4 A. Pawlak, M. Jarkiewicz, L. Okruszek, J. Szymanowska, J. Sarzynska-Wawer, I. Stefaniak, A. Wawer, "Detecting formal thought disorder by deep contextualized word representations,", vol. 304, pp. 114– 135, 2021

      5 A. Tripathi, S. Kumar, S. Koo, Q. Zhang, H. Sak, H. Lu, E. McDermott, "Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,", Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7829–7833, 2020

      6 T. Nakatani, "Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,", Proceedings of Interspeech, 2019

      1 A. Ghoshal, Y. Qian, P. Schwarz., P. Motlicek, O. Glembek, N. Goel, M. Hannemann, L. Burget, G. Boulianne, D. Povey, "The kaldi speech recognition toolkit", Proceedings of IEEE 2011 workshop on automatic speech recognition and understanding, 2011

      2 A. Ghoshal, Y. Qian., S. Kombrink, P. Motlíček, M. Karafiát, M. Janda, M. Hannemann, L. Burget, G. Boulianne, D. Povey, "Generating exact lattices in the WFST framework", Proceedings of 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012

      3 K. Heafield, "KenLM: Faster and smaller language model queries", Proceedings of the sixth workshop on statistical machine translation, pp. 187–197, 2011

      4 A. Pawlak, M. Jarkiewicz, L. Okruszek, J. Szymanowska, J. Sarzynska-Wawer, I. Stefaniak, A. Wawer, "Detecting formal thought disorder by deep contextualized word representations,", vol. 304, pp. 114– 135, 2021

      5 A. Tripathi, S. Kumar, S. Koo, Q. Zhang, H. Sak, H. Lu, E. McDermott, "Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss,", Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7829–7833, 2020

      6 T. Nakatani, "Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,", Proceedings of Interspeech, 2019

      더보기

      분석정보

      View

      상세정보조회

      0

      Usage

      원문다운로드

      0

      대출신청

      0

      복사신청

      0

      EDDS신청

      0

      동일 주제 내 활용도 TOP

      더보기

      주제

      연도별 연구동향

      연도별 활용동향

      연관논문

      연구자 네트워크맵

      공동연구자 (7)

      유사연구자 (20) 활용도상위20명

      이 자료와 함께 이용한 RISS 자료

      나만을 위한 추천자료

      해외이동버튼