1 Nam, "Zero-shot natural language video localization" 1450-1459, 2021
2 Mingfei Gao, "WSLLN: Weakly supervised natural language localization networks" 2019
3 Zhijie Lin, "Weakly-supervised video moment retrieval via semantic completion network" 2020
4 Piotr Bojanowski, "Weakly-supervised alignment of video with text" 4462-4470, 2015
5 Niluthpol Chowdhury Mithun, "Weakly supervised video moment retrieval from text queries" 11584-11593, 2019
6 Xuguang Duan, "Weakly supervised dense event captioning in videos" 2018
7 Mithun, "Weakly supervised video moment retrieval from text queries" 11584-11593, 2019
8 Dahua Lin, "Visual semantic search: Retrieving videos via complex textual queries" 2657-2664, 2014
9 Chen Sun, "VideoBERT: A joint model for video and language representation learning" 7463-7472, 2019
10 Andrei Barbu, "Video in sentences out" 2012
11 Hyolim Kang, "UBoCo: Unsupervised boundary contrastive learning for generic event boundary detection"
12 Subhashini Venugopalan, "Translating videos to natural language using deep recurrent neural networks"
13 Yitian Yuan, "To find where you talk: Temporal sentence localization in video with attention based location regression" 2019
14 Jingyuan Chen, "Temporally grounding natural sentence in video" 2018
15 J. Gao, "TALL: Temporal activity localization via language query" 5277-5285, 2017
16 Qi Zheng, "Syntax-aware action targeting for video captioning" 13093-13102, 2020
17 S. Buch, "SST: Single-stream temporal action proposals" 6373-6382, 2017
18 W. Liu, "SSD: Single shot multibox detector" 2016
19 Tianwei Lin, "Single shot temporal action detection" 2017
20 Zhe Gan, "Semantic compositional networks for visual captioning" 1141-1150, 2017
21 João Carreira, "Quo vadis, action recognition? a new model and the kinetics dataset" 4724-4733, 2017
22 Cristian Rodriguez-Opazo, "Proposal-free temporal moment localization of a natural-language query in video using guided attention" 2020
23 Atsuhiro Kojima, "Natural language description of human activities from video images based on concept hierarchy of actions" 50 : 171-184, 2004
24 Satanjeev Banerjee, "Meteor: An automatic metric for mt evaluation with improved correlation with human judgments" 2005
25 Jonghwan Mun, "MarioQA: Answering questions by watching gameplay videos" 2017
26 Xuelong Li, "MAM-RNN: Multi-level attention model based RNN for video captioning" 2017
27 Marcella Cornia, "M2: Meshed-memory transformer for image captioning"
28 Zhenfang Chen, "Look closer to ground better: Weakly-supervised temporal grounding of sentence in video"
29 Jeff Donahue, "Long-term recurrent convolutional networks for visual recognition and description" 2625-2634, 2015
30 Lisa Anne Hendricks, "Localizing moments in video with natural language" 5804-5813, 2017
31 Pascal Mettes, "Localizing actions from video labels and pseudo-annotations"
32 Jonghwan Mun, "Local-Global Video-Text Interactions for Temporal Grounding" 2020
33 Yangyu Chen, "Less is more: Picking informative frames for video captioning" 2018
34 Yangyu Chen, "Less is more: Picking informative frames for video captioning" 2018
35 Junyu Gao, "Learning video moment retrieval without a single annotated video" 32 : 1646-1657, 2022
36 Alec Radford, "Learning transferable visual models from natural language supervision" 2021
37 Du Tran, "Learning spatiotemporal features with 3d convolutional networks" 4489-4497, 2015
38 Chuming Lin, "Learning salient boundary feature for anchor-free temporal action localization" 3319-3328, 2021
39 Otani, "Learning joint representations of videos and sentences with web image search" 2016
40 Guoshun Nan, "Interventional video grounding with dual contrastive learning" 2764-2774, 2021
41 Christian Szegedy, "Inception-v4, Inception-ResNet and the impact of residual connections on learning" 2017
42 Sennrich, "Improving neural machine translation models with monolingual data"
43 Ziyang Ma, "Hierarchical deep residual reasoning for temporal moment localization" 2021
44 Jin-Hwa Kim, "Hadamard product for low-rank bilinear pooling" 2017
45 Michaela Regneri, "Grounding action descriptions in videos" 1 : 25-36, 2013
46 Yonghui Wu, "Google's neural machine translation system: Bridging the gap between human and machine translation"
47 Fuchen Long, "Gaussian temporal awareness networks for action localization" 344-353, 2019
48 Shaoqing Ren, "Faster R-CNN: Towards real-time object detection with region proposal networks" 39 : 1137-1149, 2015
49 Fabian Caba Heilbron, "Fast temporal activity proposals for efficient detection of human actions in untrimmed videos" 1914-1923, 2016
50 Cristian Rodriguez-Opazo, "Discovering object relationships for moment localization of a natural language query in a video" 1078-1087, 2021
51 Justin Johnson, "DenseCap: Fully convolutional localization networks for dense captioning" 4565-4574, 2016
52 Krishna, "Dense-captioning events in videos" 706-715, 2017
53 Kaiming He, "Deep residual learning for image recognition" 770-778, 2016
54 Victor Escorcia, "DAPs: Deep action proposals for action understanding" 2016
55 Daizong Liu, "Context-aware biaffine localizing network for temporal sentence grounding" 11235-11244, 2021
56 Richard Socher, "Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora" 966-973, 2010
57 Ziwei Yang, "Catching the temporal regions-of-interest for video captioning" 2017
58 Hongwei Xue, "CLIP-ViP: Adapting pre-trained image-text model to video-language representation alignment"
59 Tianwei Lin, "BSN: Boundary sensitive network for temporal action proposal generation" 2018
60 Peter Anderson, "Bottom-up and top-down attention for image captioning and visual question answering" 6077-6086, 2018
61 Jacob Devlin, "BERT: Pre-training of deep bidirectional transformers for language understanding"
62 Mike Lewis, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension" 2020
63 Sainbayar Sukhbaatar, "Augmenting self-attention with persistent memory"
64 Shuning Chang, "Augmented transformer with adaptive graph for temporal action proposal generation"
65 Liu, "Attentive moment retrieval in videos" 2018
66 Adina Williams, "A broad-coverage challenge corpus for sentence understanding through inference" 2018