1 Nam, "Zero-shot natural language video localization" 1450-1459, 2021
2 Mingfei Gao, "WSLLN: Weakly supervised natural language localization networks" 2019
3 Zhijie Lin, "Weakly-supervised video moment retrieval via semantic completion network" 2020
4 Piotr Bojanowski, "Weakly-supervised alignment of video with text" 4462-4470, 2015
5 Niluthpol Chowdhury Mithun, "Weakly supervised video moment retrieval from text queries" 11584-11593, 2019
6 Xuguang Duan, "Weakly supervised dense event captioning in videos" 2018
7 Mithun, "Weakly supervised video moment retrieval from text queries" 11584-11593, 2019
8 Dahua Lin, "Visual semantic search: Retrieving videos via complex textual queries" 2657-2664, 2014
9 Chen Sun, "VideoBERT: A joint model for video and language representation learning" 7463-7472, 2019
10 Andrei Barbu, "Video in sentences out" 2012
11 Hyolim Kang, "UBoCo: Unsupervised boundary contrastive learning for generic event boundary detection"
12 Subhashini Venugopalan, "Translating videos to natural language using deep recurrent neural networks"
13 Yitian Yuan, "To find where you talk: Temporal sentence localization in video with attention based location regression" 2019
14 Jingyuan Chen, "Temporally grounding natural sentence in video" 2018
15 J. Gao, "TALL: Temporal activity localization via language query" 5277-5285, 2017
16 Qi Zheng, "Syntax-aware action targeting for video captioning" 13093-13102, 2020
17 S. Buch, "SST: Single-stream temporal action proposals" 6373-6382, 2017
18 W. Liu, "SSD: Single shot multibox detector" 2016
19 Tianwei Lin, "Single shot temporal action detection" 2017
20 Zhe Gan, "Semantic compositional networks for visual captioning" 1141-1150, 2017
21 João Carreira, "Quo vadis, action recognition? a new model and the kinetics dataset" 4724-4733, 2017
22 Cristian Rodriguez-Opazo, "Proposal-free temporal moment localization of a natural-language query in video using guided attention" 2020
23 Atsuhiro Kojima, "Natural language description of human activities from video images based on concept hierarchy of actions" 50 : 171-184, 2004
24 Satanjeev Banerjee, "Meteor: An automatic metric for mt evaluation with improved correlation with human judgments" 2005
25 Jonghwan Mun, "MarioQA: Answering questions by watching gameplay videos" 2017
26 Xuelong Li, "MAM-RNN: Multi-level attention model based RNN for video captioning" 2017
27 Marcella Cornia, "M2: Meshed-memory transformer for image captioning"
28 Zhenfang Chen, "Look closer to ground better: Weakly-supervised temporal grounding of sentence in video"
29 Jeff Donahue, "Long-term recurrent convolutional networks for visual recognition and description" 2625-2634, 2015
30 Lisa Anne Hendricks, "Localizing moments in video with natural language" 5804-5813, 2017
31 Pascal Mettes, "Localizing actions from video labels and pseudo-annotations"
32 Jonghwan Mun, "Local-Global Video-Text Interactions for Temporal Grounding" 2020
33 Yangyu Chen, "Less is more: Picking informative frames for video captioning" 2018
34 Yangyu Chen, "Less is more: Picking informative frames for video captioning" 2018
35 Junyu Gao, "Learning video moment retrieval without a single annotated video" 32 : 1646-1657, 2022
36 Alec Radford, "Learning transferable visual models from natural language supervision" 2021
37 Du Tran, "Learning spatiotemporal features with 3d convolutional networks" 4489-4497, 2015
38 Chuming Lin, "Learning salient boundary feature for anchor-free temporal action localization" 3319-3328, 2021
39 Otani, "Learning joint representations of videos and sentences with web image search" 2016
40 Guoshun Nan, "Interventional video grounding with dual contrastive learning" 2764-2774, 2021
41 Christian Szegedy, "Inception-v4, Inception-ResNet and the impact of residual connections on learning" 2017
42 Sennrich, "Improving neural machine translation models with monolingual data"
43 Ziyang Ma, "Hierarchical deep residual reasoning for temporal moment localization" 2021
44 Jin-Hwa Kim, "Hadamard product for low-rank bilinear pooling" 2017
45 Michaela Regneri, "Grounding action descriptions in videos" 1 : 25-36, 2013
46 Yonghui Wu, "Google's neural machine translation system: Bridging the gap between human and machine translation"
47 Fuchen Long, "Gaussian temporal awareness networks for action localization" 344-353, 2019
48 Shaoqing Ren, "Faster R-CNN: Towards real-time object detection with region proposal networks" 39 : 1137-1149, 2015
49 Fabian Caba Heilbron, "Fast temporal activity proposals for efficient detection of human actions in untrimmed videos" 1914-1923, 2016
50 Cristian Rodriguez-Opazo, "Discovering object relationships for moment localization of a natural language query in a video" 1078-1087, 2021
51 Justin Johnson, "DenseCap: Fully convolutional localization networks for dense captioning" 4565-4574, 2016
52 Krishna, "Dense-captioning events in videos" 706-715, 2017
53 Kaiming He, "Deep residual learning for image recognition" 770-778, 2016
54 Victor Escorcia, "DAPs: Deep action proposals for action understanding" 2016
55 Daizong Liu, "Context-aware biaffine localizing network for temporal sentence grounding" 11235-11244, 2021
56 Richard Socher, "Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora" 966-973, 2010
57 Ziwei Yang, "Catching the temporal regions-of-interest for video captioning" 2017
58 Hongwei Xue, "CLIP-ViP: Adapting pre-trained image-text model to video-language representation alignment"
59 Tianwei Lin, "BSN: Boundary sensitive network for temporal action proposal generation" 2018
60 Peter Anderson, "Bottom-up and top-down attention for image captioning and visual question answering" 6077-6086, 2018
61 Jacob Devlin, "BERT: Pre-training of deep bidirectional transformers for language understanding"
62 Mike Lewis, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension" 2020
63 Sainbayar Sukhbaatar, "Augmenting self-attention with persistent memory"
64 Shuning Chang, "Augmented transformer with adaptive graph for temporal action proposal generation"
65 Liu, "Attentive moment retrieval in videos" 2018
66 Adina Williams, "A broad-coverage challenge corpus for sentence understanding through inference" 2018