We are stumbling across a video tsunami flooding our communication channels.
The ubiquity of digital cameras and social networks has increased the amount of visual
media content generated and shared by people, in particular videos. Cisco projects
that 82% of internet traffic will be in the form of video by 2022. The computer
vision community has embraced this challenge by offering the first building blocks to
translate the visual data in segmented video clips into semantic tags. However, users
usually need to go beyond tagging at the video level. For example, someone may
want to retrieve important moments such as the “first steps of her child” from a large
collection of untrimmed videos, or retrieve all the instances of a home run from an
unsegmented baseball video. In the face of this data deluge, it becomes crucial
to develop efficient and scalable algorithms that can intelligently localize semantic
visual content in untrimmed videos.
In this work, I address three different challenges in the localization of actions in
videos. First, I develop deep learning-based action proposal and detection models that
take a video and generate action-agnostic and class-specific temporal segments, respectively.
These models retrieve temporal locations with high accuracy and run efficiently,
faster than real time. Second, I propose the new task of retrieving and localizing temporal
moments from a collection of videos given a natural language query. To tackle this
challenge, I introduce an efficient and effective model that aligns the text query to
individual clips of fixed length while still retrieving moments that span multiple clips.
This approach not only allows smooth interactions with users via natural language
queries but also reduces the index size and search time for retrieving the moments.
Lastly, I introduce the concept of actor-supervision, which exploits the inherent
compositionality of actions, in terms of transformations of actors, to achieve spatiotemporal
localization of actions without the need for action box annotations. By designing
efficient models that scan a single video in real time and that retrieve and localize
moments of interest from multiple videos, along with an effective strategy to localize
actions without resorting to action box annotations, this thesis provides insights that
put us closer to the goal of general video understanding.
|Date of Award||Jul 2019|
|Original language||English (US)|
- Computer, Electrical and Mathematical Science and Engineering
|Supervisor||Bernard Ghanem (Supervisor)|
- action localization
- video understanding
- human activities understanding
- computer vision
- vision-language models
- deep learning