Learning Self-Informed Feature Contribution for Deep Learning-Based Acoustic Modeling
Kim, Younggwan; Kim, Myungjong; Goo, Jahyun; Kim, Hoirin. IEEE, 2018. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.26, No.11
<P>In this paper, we introduce a new feature engineering approach for deep learning-based acoustic modeling, which utilizes input feature contributions. For this purpose, we propose an auxiliary deep neural network (DNN) called a feature contribution network (FCN) whose output layer is composed of sigmoid-based contribution gates. In our framework, the FCN tries to learn element-level discriminative contributions of input features, and an acoustic model network (AMN) is trained on gated features generated by element-wise multiplication between the contribution gate outputs and the input features. In addition, we propose a regularization method for the FCN, which encourages the FCN to activate as few gates as possible. The proposed methods were evaluated on the TED-LIUM release 1 corpus. We applied the proposed methods to DNN- and long short-term memory-based AMNs. Experimental results showed that AMNs with FCNs consistently improved recognition performance compared with AMN-only frameworks.</P>
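The core mechanism of the abstract above, sigmoid contribution gates multiplied element-wise with the input features before they reach the acoustic model, can be sketched in numpy as follows. This is a minimal illustration, not the paper's implementation: the single-hidden-layer FCN, its dimensions, and all weight initializations are hypothetical, and the paper's sparsity regularizer is only noted in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical single-hidden-layer feature contribution network (FCN).
# Its output layer consists of sigmoid "contribution gates", one per
# input feature element. Sizes are illustrative.
feat_dim, hidden_dim = 40, 64
W1 = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
W2 = rng.standard_normal((hidden_dim, feat_dim)) * 0.1
b2 = np.zeros(feat_dim)

def fcn_gates(x):
    """Element-level contribution gates in (0, 1) for one feature frame."""
    h = np.tanh(x @ W1 + b1)
    return sigmoid(h @ W2 + b2)

def gated_features(x):
    """Element-wise product of gate outputs and the input features;
    this gated vector is what the acoustic model network (AMN) sees.
    Training would add a sparsity penalty on the gate activations so
    that as few gates as possible stay open."""
    return fcn_gates(x) * x

x = rng.standard_normal(feat_dim)  # one frame of acoustic features
g = fcn_gates(x)
y = gated_features(x)
```

In a real system both networks would be trained jointly by backpropagating the AMN's loss through the element-wise product into the FCN.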
SVM Based Speaker Verification Using Sparse Maximum A Posteriori Adaptation
Kim, Younggwan; Roh, Jaeyoung; Kim, Hoirin. The Institute of Electronics and Information Engineers, 2013. IEIE Transactions on Smart Processing & Computing, Vol.2, No.5
Modern speaker verification systems based on support vector machines (SVMs) use Gaussian mixture model (GMM) supervectors as their input feature vectors, and maximum a posteriori (MAP) adaptation is a conventional method for generating speaker-dependent GMMs by adapting a universal background model (UBM). MAP adaptation requires a sufficient amount of input speech because of the number of model parameters to be estimated. With limited utterances, MAP adaptation can be unreliable and introduce adaptation noise, even though the Bayesian priors used in MAP adaptation smooth the movement between the UBM and the speaker-dependent GMMs. This paper proposes a sparse MAP adaptation method, which is known to perform well in the automatic speech recognition area. By introducing sparse MAP adaptation to the GMM-SVM-based speaker verification system, the adaptation noise can be mitigated effectively. The proposed method utilizes the L0 norm as a regularizer to induce sparsity. Experimental results on the TIMIT database showed that the sparse MAP-based GMM-SVM speaker verification system yields a 42.6% relative reduction in the equal error rate with few additional computations.
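A mean-only relevance-MAP update with a sparsification step can be sketched as follows. This is a simplified illustration, not the paper's method: the L0 regularizer is approximated here by hard-thresholding small mean shifts (a common L0 surrogate), and the relevance factor and threshold values are assumptions.

```python
import numpy as np

def sparse_map_means(ubm_means, counts, first_order_stats,
                     relevance=16.0, threshold=0.05):
    """Mean-only MAP adaptation of a GMM from a UBM, followed by an
    L0-inspired sparsification: mean shifts smaller than `threshold`
    are zeroed, so those dimensions stay at their UBM values and
    adaptation noise from weakly observed components is suppressed.

    ubm_means:          (K, D) UBM component means
    counts:             (K,)   zeroth-order (occupancy) statistics n_k
    first_order_stats:  (K, D) first-order statistics sum_t gamma_k(t) x_t
    """
    counts = counts[:, None]
    # Posterior-expected means E_k[x]; guard against empty components.
    ex = first_order_stats / np.maximum(counts, 1e-8)
    alpha = counts / (counts + relevance)   # data-dependent interpolation
    shift = alpha * (ex - ubm_means)        # MAP mean shift from the UBM
    shift[np.abs(shift) < threshold] = 0.0  # L0-style hard threshold
    return ubm_means + shift

# Toy example: component 0 is well observed, component 1 barely at all.
ubm = np.zeros((2, 3))
counts = np.array([100.0, 1e-6])
stats = np.array([[100.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0]])
adapted = sparse_map_means(ubm, counts, stats)
```

Only the strongly supported mean dimension moves away from the UBM; the tiny shifts are zeroed, which is the intended sparsity effect.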
Kim, Myung Jong; Kim, Younggwan; Kim, Hoirin. IEEE, 2015. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.23, No.4
<P>This paper presents a new method for automatically assessing the speech intelligibility of patients with dysarthria, which is a motor speech disorder impeding the physical production of speech. The proposed method consists of two main steps: feature representation and prediction. In the feature representation step, the speech utterance is converted into a phone sequence using an automatic speech recognition technique and is then aligned with a canonical phone sequence from a pronunciation dictionary using a weighted finite state transducer to capture the pronunciation mappings such as match, substitution, and deletion. The histograms of the pronunciation mappings on a pre-defined word set are used as features. Next, in the prediction step, a structured sparse linear model incorporating phonological knowledge is proposed, which simultaneously addresses phonologically structured sparse feature selection and intelligibility prediction. Evaluation of the proposed method on a database of 109 speakers, consisting of 94 dysarthric and 15 control speakers, yielded a root mean square error of 8.14 against subjectively rated scores in the range of 0 to 100. This promising performance suggests that the system can be applied to help speech therapists diagnose the degree of speech disorder.</P>
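The histogram features above count alignment operations between a recognized and a canonical phone sequence. The paper uses a weighted finite state transducer for this alignment; the sketch below substitutes a plain Levenshtein dynamic-programming alignment, which yields the same kind of match/substitution/deletion/insertion histogram. All names here are illustrative, not from the paper.

```python
from collections import Counter

def pronunciation_mappings(recognized, canonical):
    """Align two phone sequences by edit-distance dynamic programming and
    count match / substitution / deletion / insertion operations
    (deletion = a canonical phone missing from the recognized sequence)."""
    n, m = len(recognized), len(canonical)
    # dp[i][j] = edit distance between recognized[:i] and canonical[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if recognized[i - 1] == canonical[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match/substitution
                           dp[i - 1][j] + 1,         # insertion (extra phone)
                           dp[i][j - 1] + 1)         # deletion (missing phone)
    # Backtrace to recover the operation histogram.
    counts = Counter()
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (recognized[i - 1] != canonical[j - 1])):
            counts["match" if recognized[i - 1] == canonical[j - 1]
                   else "substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts["insertion"] += 1
            i -= 1
        else:
            counts["deletion"] += 1
            j -= 1
    return counts

# e.g. a dysarthric speaker dropping the final /t/ of "cat":
c = pronunciation_mappings(["k", "ae"], ["k", "ae", "t"])
```

Accumulating such counters over a pre-defined word set and normalizing gives the histogram feature vector fed to the prediction model.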
Regularized Speaker Adaptation of KL-HMM for Dysarthric Speech Recognition
Myungjong Kim; Younggwan Kim; Joohong Yoo; Jun Wang; Hoirin Kim. IEEE, 2017. IEEE Transactions on Neural Systems and Rehabilitation Engineering, Vol.25, No.9
<P>This paper addresses the problem of recognizing speech uttered by patients with dysarthria, which is a motor speech disorder impeding the physical production of speech. Patients with dysarthria have articulatory limitations, and therefore they often have trouble pronouncing certain sounds, resulting in undesirable phonetic variation. Modern automatic speech recognition systems designed for regular speakers are ineffective for dysarthric sufferers due to this phonetic variation. To capture the phonetic variation, a Kullback-Leibler divergence-based hidden Markov model (KL-HMM) is adopted, where the emission probability of each state is parameterized by a categorical distribution over phoneme posterior probabilities obtained from a deep neural network-based acoustic model. To further reflect speaker-specific phonetic variation patterns, a speaker adaptation method based on a combination of L2 regularization and confusion-reducing regularization is proposed, which enhances discriminability between the categorical distributions of the KL-HMM states while preserving speaker-specific information. Evaluation of the proposed speaker adaptation method on a database of several hundred words for 30 speakers, consisting of 12 mildly dysarthric, 8 moderately dysarthric, and 10 non-dysarthric control speakers, showed that the proposed approach significantly outperformed the conventional deep neural network-based speaker-adapted system on dysarthric as well as non-dysarthric speech.</P>
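The structure of the adaptation objective described above (a KL data term plus an L2 term toward the speaker-independent model plus a confusion-reducing term between state distributions) can be written down in a toy form. This is a sketch of the objective's shape under assumed weights, not the paper's exact formulation; the regularizer weights and the symmetric-KL form of the confusion term are assumptions.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two categorical distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def adaptation_objective(states, si_states, posteriors, assignments,
                         l2_weight=0.1, cr_weight=0.1):
    """Regularized adaptation cost for KL-HMM state categorical distributions.

    states:      (S, P) adapted categorical distributions (one per state)
    si_states:   (S, P) speaker-independent distributions
    posteriors:  (T, P) DNN phoneme posterior frames
    assignments: (T,)   state index aligned to each frame
    """
    # Data term: KL between each frame posterior and its state distribution.
    data = sum(kl(posteriors[t], states[assignments[t]])
               for t in range(len(assignments)))
    # L2 regularizer: stay close to the speaker-independent model.
    l2 = np.sum((states - si_states) ** 2)
    # Confusion-reducing regularizer: reward separation between state pairs
    # (negated symmetric KL, so minimizing the total increases
    # discriminability between states).
    confusion = 0.0
    S = len(states)
    for a in range(S):
        for b in range(a + 1, S):
            confusion -= kl(states[a], states[b]) + kl(states[b], states[a])
    return data + l2_weight * l2 + cr_weight * confusion

# Well-separated adapted states score better than collapsed ones.
si = np.array([[0.7, 0.3], [0.3, 0.7]])
posteriors = np.array([[0.9, 0.1], [0.1, 0.9]])
assignments = np.array([0, 1])
obj_sep = adaptation_objective(np.array([[0.9, 0.1], [0.1, 0.9]]),
                               si, posteriors, assignments)
obj_col = adaptation_objective(np.array([[0.5, 0.5], [0.5, 0.5]]),
                               si, posteriors, assignments)
```

Minimizing this objective over the state distributions (with a simplex constraint) is what the adaptation step would do; the toy comparison shows the confusion-reducing term favoring discriminable states.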
Myung Jong Kim; Hoirin Kim. IEEE, 2012. IEEE Transactions on Multimedia, Vol.14, No.5
<P>In this paper, the problem of detecting objectionable sounds, such as sexual screaming or moaning, to classify and block objectionable multimedia content is addressed. Objectionable sounds show distinctive characteristics, such as large temporal variations and fast spectral transitions, which differ from general audio signals such as speech and music. To represent these characteristics, segment-based two-dimensional Mel-frequency cepstral coefficients and histograms of gradient directions are used as a feature set to characterize the time-frequency dynamics within a long-range segment of the target signal. After extraction, the features are transformed to a lower dimension while preserving discriminative information, using linear discriminant analysis based on a combination of global and local Fisher criteria. A Gaussian mixture model is adopted to statistically represent objectionable and non-objectionable sounds, and test sounds are classified using a likelihood ratio test. Evaluation of the proposed feature extraction method on a database of several hundred objectionable and non-objectionable sound clips yielded a precision/recall breakeven point of 91.25%, a promising performance showing that the system can be applied alongside image-based approaches to block such multimedia content.</P>
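The final classification step above, scoring a clip against an objectionable and a non-objectionable GMM and thresholding the log-likelihood ratio, can be sketched with diagonal-covariance GMMs in numpy. The model parameters, the zero threshold, and the single-component toy models are illustrative assumptions, not the paper's trained models.

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of feature frames X under a
    diagonal-covariance GMM (weights (K,), means/variances (K, D))."""
    X = np.atleast_2d(X)
    D = X.shape[1]
    # Per-component log densities, shape (N, K).
    diff2 = (X[:, None, :] - means[None]) ** 2 / variances[None]
    log_comp = (np.log(weights)[None]
                - 0.5 * (D * np.log(2 * np.pi)
                         + np.sum(np.log(variances), axis=1))[None]
                - 0.5 * np.sum(diff2, axis=2))
    # Log-sum-exp over components, then average over frames.
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.mean(m.squeeze(1)
                         + np.log(np.sum(np.exp(log_comp - m), axis=1))))

def is_objectionable(X, gmm_obj, gmm_bg, threshold=0.0):
    """Likelihood ratio test: accept the objectionable hypothesis when the
    average log-likelihood ratio exceeds the decision threshold."""
    return diag_gmm_loglik(X, *gmm_obj) - diag_gmm_loglik(X, *gmm_bg) > threshold

# Toy single-component models in a 2-D feature space.
gmm_obj = (np.array([1.0]), np.array([[3.0, 3.0]]), np.array([[1.0, 1.0]]))
gmm_bg = (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]]))
```

In practice the threshold would be tuned on held-out data to the desired precision/recall operating point rather than fixed at zero.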