The text-to-video retrieval (TVR) task aims to align text and video in a joint embedding space so as to model the relationship between the two modalities. The typical solution is to directly align deterministic embeddings in the joint embedding space with contrastive learning. However, real-world datasets exhibit a one-to-many matching problem, where one video (text) corresponds to multiple matching texts (videos). In this case, when the model encounters hard negative samples that are difficult to distinguish, it may overfit to specific embeddings and lose the generalization acquired from rich pretraining. To address this issue, we propose the Probabilistic Embedding Variational AutoEncoder (PEVAE), which introduces a probabilistic embedding for each modality, providing a margin for hard negative samples. We demonstrate the effectiveness of our model through comparisons with existing models on the MSRVTT and LSMDC benchmark datasets. Additionally, our model is designed in a plug-and-play manner and can be applied to various models without complex modifications.
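The core idea of a probabilistic embedding, replacing a single deterministic point with a sampled Gaussian, can be sketched as below. This is a minimal illustration using the standard reparameterization trick; the linear heads `w_mu` and `w_logvar`, the feature shapes, and the function name are assumptions for the sketch, not the paper's actual PEVAE architecture:

```python
import numpy as np

def probabilistic_embed(features, w_mu, w_logvar, rng):
    """Map deterministic features to a diagonal-Gaussian embedding and
    draw a sample via the reparameterization trick.

    Note: the linear heads here are illustrative assumptions; PEVAE's
    exact encoder heads are not specified in this abstract.
    """
    mu = features @ w_mu              # mean of the embedding distribution
    logvar = features @ w_logvar      # log-variance (diagonal covariance)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # sampled probabilistic embedding
    return z, mu, logvar

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 16))          # e.g. 4 video (or text) features
w_mu = rng.standard_normal((16, 8)) * 0.1
w_logvar = rng.standard_normal((16, 8)) * 0.1
z, mu, logvar = probabilistic_embed(feat, w_mu, w_logvar, rng)
print(z.shape)  # (4, 8)
```

Because each forward pass draws a different sample around the mean, a hard negative that sits close to a positive is not forced onto a single fixed point, which is one way the "margin" for hard negatives described above can arise.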