An important aspect of a speech-tutoring talking-head system (STTS) is the accuracy of the articulatory movements it produces. Little work has been done on evaluating articulatory movement accuracy (AMA) in STTSs. Although subjective evaluation is reliable, it is time-consuming and inconvenient. The traditional objective evaluation compares the motion of several points on the surface of the synthetic articulator with electromagnetic articulography (EMA) data, which describe the motion of the corresponding points on a speaker's articulatory surface. EMA data are too sparse to describe the full shape change of deformable articulators during speech. To solve this problem, we propose a substantially different objective evaluation method based on a separately recorded medical video. The synthetic articulatory shapes produced during speech are compared with the corresponding shapes tracked from the medical video. The method is invariant to translation, rotation, and scaling, which allows shapes from the synthetic tongue to be compared with shapes from the medical images. The timing mismatch between the synthesis results and the medical video is resolved by incorporating dynamic time warping (DTW) into the method. Experimental results demonstrate that our method can evaluate the accuracy of shape deformation over an entire articulation process. The comparison results suggest that our method is more accurate than the traditional method, especially for deformable articulators.
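The two ingredients named above, a shape comparison that ignores translation, rotation, and scale, and a DTW alignment of two sequences with different timing, can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: each frame is a 2-D contour given as an array of corresponding points, the invariant comparison is done with a Procrustes distance, and the function names `procrustes_distance` and `dtw_shape_cost` are hypothetical. The actual shape descriptor and cost used by the method may differ.

```python
import numpy as np


def procrustes_distance(a, b):
    """Distance between two point sets of shape (n_points, 2).

    Assumption: Procrustes alignment stands in for the invariant
    shape comparison; the points in a and b correspond row by row.
    """
    # Remove translation by centering each contour on its centroid.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # Remove scale by normalizing to unit Frobenius norm.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # Remove rotation: the best rotation follows from the SVD of a^T b.
    u, s, vt = np.linalg.svd(a.T @ b)
    # Flip the smallest singular value if the optimum would be a reflection.
    s[-1] *= np.sign(np.linalg.det(u @ vt))
    similarity = s.sum()
    return float(np.sqrt(max(0.0, 1.0 - similarity ** 2)))


def dtw_shape_cost(synth, video):
    """Accumulated DTW cost between two sequences of contours.

    synth and video are lists of (n_points, 2) arrays; the sequences
    may have different lengths and different timing.
    """
    n, m = len(synth), len(video)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = procrustes_distance(synth[i - 1], video[j - 1])
            # Standard DTW recursion: extend the cheapest of the three
            # admissible predecessor alignments.
            acc[i, j] = d + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])
```

In this sketch, a synthetic sequence and a rotated, scaled, shifted, and time-stretched copy of it yield a cost near zero, while a sequence with genuinely different deformation yields a larger cost. Dividing the accumulated cost by the warping path length would make scores comparable across utterances of different durations; that normalization is omitted here for brevity.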