Video anomaly detection has advanced significantly since the introduction of the multiple instance learning (MIL) approach [1], and has recently expanded to incorporate both visual and textual features of videos [3]. The underlying idea is that text features also carry frame-specific information that can complement visual features. In this work, we aim to enhance video anomaly detection by introducing a contrastive approach that robustly aligns visual and textual features. To this end, we propose a loss function that increases the similarity between frame-level visual features and their corresponding textual features [5]. Experimental results demonstrate that this approach is effective when applied to existing algorithms.
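The contrastive alignment of frame-level visual and textual features described above could be realized with a symmetric InfoNCE-style objective. The following is a minimal sketch, not the paper's actual implementation; the function name, temperature value, and NumPy formulation are illustrative assumptions:

```python
import numpy as np

def contrastive_alignment_loss(visual, text, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning frame-level visual and
    textual features. `visual` and `text` are (N, D) arrays whose
    i-th rows correspond to the same frame (a positive pair); all
    other pairings in the batch serve as negatives."""
    # L2-normalize so dot products become cosine similarities
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (N, N) similarity matrix

    def xent(l):
        # Cross-entropy with the diagonal (matching pairs) as targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average over both visual-to-text and text-to-visual directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each frame's visual embedding toward its paired text embedding while pushing it away from the text embeddings of other frames, which is one standard way to achieve the robust alignment described above.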