This paper studies how knowledge transfer by feature matching with an L1 loss affects speech emotion recognition. Feature matching is one method for transferring knowledge from a vision transformer (ViT) teacher, which includes a relatively large-scale CNN, to a relatively small-scale ViT student that excludes even the positional embedding. The student network that first went through a feature matching step, in which it mimics the features of the teacher network, performed significantly better than the same student network trained for classification from scratch. The teacher network reached an accuracy of 94.17%, while the student trained with feature matching reached 94.65%, achieving higher accuracy with a smaller structure.
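As a rough illustration of what feature matching with an L1 loss computes, the following minimal sketch takes the mean absolute difference between a student feature map and a teacher feature map of the same shape. The function names `l1_feature_loss` and `l1_feature_grad`, the tokens-by-dimensions list layout, and the toy values are assumptions made for this example; they are not taken from the paper, which does not specify its implementation here.

```python
from typing import List, Sequence


def l1_feature_loss(student: Sequence[Sequence[float]],
                    teacher: Sequence[Sequence[float]]) -> float:
    """Mean absolute difference between two (tokens x dims) feature maps.

    Hypothetical helper: the layout and the mean reduction are assumptions.
    """
    if len(student) != len(teacher):
        raise ValueError("feature maps must have the same number of tokens")
    total = 0.0
    count = 0
    for s_row, t_row in zip(student, teacher):
        if len(s_row) != len(t_row):
            raise ValueError("feature maps must have the same feature dimension")
        for s, t in zip(s_row, t_row):
            total += abs(s - t)
            count += 1
    if count == 0:
        raise ValueError("feature maps must not be empty")
    return total / count


def l1_feature_grad(student: Sequence[Sequence[float]],
                    teacher: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subgradient of the loss with respect to the student features.

    Each entry is sign(s - t) / N, where N is the number of elements.
    """
    n = sum(len(row) for row in student)
    return [[((s > t) - (s < t)) / n for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(student, teacher)]


# Toy feature maps (made-up values): 2 tokens, 2 dimensions each.
teacher_features = [[1.0, 2.0], [3.0, 4.0]]
student_features = [[0.5, 2.5], [3.0, 2.0]]

loss = l1_feature_loss(student_features, teacher_features)
grad = l1_feature_grad(student_features, teacher_features)
print(loss)
print(grad)
```

In a training loop of this kind, the teacher features would be held fixed and only the student would be updated to reduce the loss, after which the student could be trained for classification. That ordering follows the description above; the details of the update are left out of this sketch.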