Dense-Captioning Events in Videos
Info:
project page: http://cs.stanford.edu/people/ranjaykrishna/densevid/
The paper makes the following contributions:
- a new model that:
  - identifies all events in a single pass of the video
  - describes the detected events with natural language
  - uses a variant of an existing proposal module, designed to capture both short events and long events that span minutes
  - captures dependencies between events: a new captioning module uses contextual information from past and future events to jointly describe all events
- a new dataset: ActivityNet Captions
Dense-captioning events model
Goal: design an architecture that
- jointly localizes temporal proposals of interest
- and then describes each with natural language.
Input: a sequence of video frames
Output: a set of sentences, each with a start and end time
Event proposal module
Framework: the video frames are first fed into C3D to extract features, which are passed to the proposal module (DAPs); the module outputs proposals, each with start/end times, a score, and a hidden representation.
Changes to DAPs: the training of DAPs is not modified; only inference changes, with the model outputting K proposals at every time step, each proposing an event with offsets.
While traditional DAPs uses non-maximum suppression to eliminate overlapping outputs, here they are all kept and treated as individual events.
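The inference-time change above can be sketched roughly as follows. This is an illustration only, not the authors' code: `daps_step` is a hypothetical stand-in for one recurrent step of the pre-trained proposal network, and the shapes and default K are assumptions.

```python
def propose_events(c3d_features, daps_step, hidden, K=2):
    """Run proposal inference over a video.

    c3d_features: list of per-timestep C3D feature vectors.
    daps_step: hypothetical function for one recurrent step; returns the
               updated hidden state plus K start offsets and K scores.
    At every time step, all K proposals are kept (no non-maximum
    suppression), so overlapping events survive as separate proposals.
    """
    proposals = []
    for t, feat in enumerate(c3d_features):
        hidden, offsets, scores = daps_step(feat, hidden)
        for k in range(K):
            proposals.append({
                "start": max(0, t - offsets[k]),  # proposal ends at time t
                "end": t,
                "score": scores[k],
                "hidden": hidden,  # representation passed to the captioner
            })
    return proposals
```

Keeping overlapping proposals matters because in dense captioning two true events can overlap in time, so suppressing them would discard captions.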
Captioning module with context
Information from temporal context: for a given event, all other events are split into two categories, past and future. A concurrent event is classified as past if it ends before the current event ends, and as future otherwise. The past and future representations are as follows:
The final feature representation:
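The past/future split can be sketched as follows. This is illustrative only: the paper pools the hidden representations with learned weights, which is simplified here to a plain average, and all names are assumptions.

```python
def context_features(events, i):
    """Build (past, current, future) features for event i.

    Every other event j counts as 'past' if it ends before event i ends
    (this covers concurrent events too), otherwise as 'future'. Each
    group's context feature is a simple mean of its hidden vectors here;
    the paper uses a learned weighted pooling instead.
    """
    dim = len(events[i]["hidden"])

    def mean(vecs):
        if not vecs:
            return [0.0] * dim
        return [sum(xs) / len(vecs) for xs in zip(*vecs)]

    past = [e["hidden"] for j, e in enumerate(events)
            if j != i and e["end"] < events[i]["end"]]
    future = [e["hidden"] for j, e in enumerate(events)
              if j != i and e["end"] >= events[i]["end"]]
    # The captioning module then consumes past context, the current
    # event's representation, and future context together.
    return mean(past), events[i]["hidden"], mean(future)
```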
Implementation details
Loss: two losses, one for the proposal module and another for the captioning model; the total loss is a weighted combination of the two.
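The combined objective can be sketched as a weighted sum; the lambda weights below are placeholders, not values reported in the paper.

```python
def total_loss(loss_prop, loss_cap, lam_prop=1.0, lam_cap=1.0):
    """Combined objective: weighted sum of the proposal loss and the
    captioning loss. lam_prop / lam_cap are hypothetical weighting
    coefficients (the paper's actual values are not reproduced here)."""
    return lam_prop * loss_prop + lam_cap * loss_cap
```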
Training and optimization:
- train the full dense-captioning model by alternating between training the language model and the proposal module every 500 iterations.
- first train the captioning module with all neighboring events masked for 10 epochs, before adding in the context features.
- initialize all weights using a Gaussian with standard deviation 0.01.
- train with stochastic gradient descent with momentum 0.9.
- learning rate: 0.01 for the language model and 0.001 for the proposal module.
- For efficiency, we do not finetune the C3D feature extraction.
- training batch size is set to 1.
- all sentences are capped at a maximum length of 30 words.
Implemented in PyTorch 0.1.10.
One mini-batch runs in approximately 15.84 ms on a Titan X GPU, and the model takes 2 days to converge.
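The alternating schedule from the training notes (switch modules every 500 iterations, SGD with momentum 0.9, per-module learning rates, batch size 1) can be sketched as follows; module names and the generator itself are assumptions, not the authors' code.

```python
def training_schedule(num_iters, switch_every=500):
    """Yield which module to update at each iteration: the language
    model and the proposal module are trained alternately, switching
    every `switch_every` iterations."""
    for it in range(num_iters):
        phase = (it // switch_every) % 2
        yield "language_model" if phase == 0 else "proposal_module"

# Hyperparameters from the notes above: SGD with momentum 0.9,
# batch size 1, and a separate learning rate per module.
LEARNING_RATES = {"language_model": 0.01, "proposal_module": 0.001}
MOMENTUM = 0.9
BATCH_SIZE = 1
```

In a real training loop, each iteration would step only the optimizer of the module yielded for that iteration while freezing the other.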