RISS Academic Research Information Service

KCI Excellent Accredited Journal (KCI우수등재)

      외부 지식 활용 자연어 기반 비디오 탐색 알고리즘 = Utilizing External Knowledge in Natural Language Video Localization

      https://www.riss.kr/link?id=A108380353

Additional Information

Multilingual Abstract (다국어 초록)

State-of-the-art Natural Language Video Localization (NLVL) models are mostly trained on existing labels. Both full supervision and weak supervision require costly annotations, which makes them impractical for real-world NLVL problems. In this study, we therefore propose External Knowledge-based Natural Language Video Localization (EK-NLVL), a framework that generates pseudo-supervision with a captioning model: it produces sentences from the given frames and summarizes them to ground the video event. We further propose a data-augmentation scheme that uses the pre-trained multi-modal representation learning model CLIP for visual-aligned sentence filtering, yielding pseudo-sentences of higher quality. Finally, we propose a new model, the Query-Attentive on Segmentations Network (QAS), which effectively uses external knowledge for the NLVL task. Experiments on the Charades-STA dataset demonstrate the efficacy of our method compared with existing models.
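
The visual-aligned sentence filtering step described above — scoring candidate pseudo-sentences against video frames with CLIP and discarding poorly aligned ones — might look roughly like the following sketch. The Hugging Face checkpoint, the single-frame scoring, and the 0.25 similarity cutoff are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of CLIP-based visual-aligned sentence filtering.
# Assumptions: openai/clip-vit-base-patch32 weights and a 0.25 cosine
# cutoff; the paper's actual checkpoint and threshold are not given here.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_pseudo_sentences(frame: Image.Image,
                            sentences: list[str],
                            threshold: float = 0.25) -> list[str]:
    """Keep only the pseudo-sentences whose CLIP similarity to the
    frame exceeds the threshold."""
    inputs = processor(text=sentences, images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the scaled cosine similarity between the
    # frame and each sentence; divide out CLIP's learned logit scale to
    # recover the raw cosine values.
    sims = outputs.logits_per_image[0] / model.logit_scale.exp()
    return [s for s, sim in zip(sentences, sims.tolist()) if sim > threshold]
```

In the paper's pipeline the score would presumably be aggregated over the frames of a candidate segment rather than taken from a single frame; one frame is used here only to keep the sketch short.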

Korean Abstract (국문 초록)

Most recent studies on natural language video localization rely on fully supervised or weakly supervised algorithms built on datasets that reuse pre-existing labels. Constructing such datasets is costly, however, and the approach is poorly suited to real-world settings where labels are hard to obtain. This study therefore proposes External Knowledge-based Natural Language Video Localization (EK-NLVL), a framework that supplies the model with effective pseudo-supervision through a pre-trained captioning model and an unsupervised video-segment detection technique. It further proposes Visual-Aligned Sentence Filtering (VAF), a data-filtering technique that builds on back-translation, an established text-augmentation method, and uses CLIP, a multi-modal representation learning model pre-trained on a large-scale dataset, to align visual and textual information and thereby improve the quality of the pseudo-sentences. Finally, it proposes Query-Attentive on Segmentation (QAS), a model that effectively exploits the data generated from this external knowledge, and experiments on the Charades-STA dataset demonstrate the effectiveness of the EK-NLVL approach.
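
The back-translation step mentioned above can be illustrated with off-the-shelf translation models. Below is a minimal sketch using MarianMT checkpoints with English-German as the pivot pair; the pivot language and model choices are assumptions for illustration, since the paper's exact augmentation setup is not specified here.

```python
# A minimal back-translation sketch (English -> German -> English).
# The Helsinki-NLP checkpoints and the German pivot are illustrative
# choices, not the paper's configuration.
from transformers import MarianMTModel, MarianTokenizer

tok_fwd = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
mt_fwd = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
tok_bwd = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
mt_bwd = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-de-en")

def back_translate(sentence: str) -> str:
    """Round-trip a caption through German to obtain a paraphrase
    usable as an augmented pseudo-sentence."""
    de_ids = mt_fwd.generate(**tok_fwd(sentence, return_tensors="pt"))
    de_text = tok_fwd.decode(de_ids[0], skip_special_tokens=True)
    en_ids = mt_bwd.generate(**tok_bwd(de_text, return_tensors="pt"))
    return tok_bwd.decode(en_ids[0], skip_special_tokens=True)

print(back_translate("A person is opening the refrigerator in the kitchen."))
```

Each paraphrase produced this way would then pass through the VAF filter sketched above, so that only paraphrases still aligned with the visual content survive as training pseudo-sentences.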

