Deep Learning for Broadcast Videos and Multimedia

1 Deep Learning for Broadcast Videos and Multimedia
Lorenzo Baraldi, University of Modena and Reggio Emilia

2 Deep Learning
State of the art in image classification, object detection, semantic object segmentation and action recognition.
It is general! It can be applied to images, videos and multimodal data.
At Imagelab: Città Educante project. Develop and test new DL algorithms for Temporal Video Segmentation and Concept Detection, with C. Grana and R. Cucchiara.
Two hardware grants for Deep Learning projects:
NVIDIA Hardware Grant, with the donation of one Tesla K40 GPU.
Italian Supercomputing Resource Allocation (ISCRA) Grant from CINECA, which gives access to the Galileo HPC Platform.

3 Città Educante
Almaviva SpA, RAI, UniTN, UNIMORE, Reggio Children, CNR, ATI Città Educante (13 SMEs)
exo Platform (Almaviva), RAI server, RAI metadata, Neuralstory Web interface (ATI)
OR 3.2: Knowledge Extraction. Video annotation, Temporal Video Segmentation, Deep learning engine (UNIMORE)

4 Broadcast videos
Audio (speech to text): words, then sentences, in increasing level of abstraction.
Visual (shot detection): frames, then shots, in increasing level of abstraction.
Video annotation of basic units (shots and sentences) is a necessary step for dividing a video into complex segments, like storyboards.
During Antarctica winter, emperor penguins endure four months of darkness.
In the Arctic, polar bear cubs take their first steps into a world of rapidly thawing ice.
In northern Canada, 3 million caribou complete an overland migration.
The forests of eastern Russia are home to the Amur leopard.
In the tropics, the jungle that covers 3% of the planet's surface supports 50% of its species.
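The visual branch of this slide (frames grouped into shots) can be illustrated with a minimal sketch. The slides do not say which shot detector is used, so the histogram-difference rule below is an assumption chosen for simplicity; `histogram_distance`, `detect_shots`, the threshold value and the toy histograms are all hypothetical.

```python
def histogram_distance(h1, h2):
    """Half the L1 distance between two normalized histograms (range 0..1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shots(frame_histograms, threshold=0.5):
    """Return (start, end) frame index pairs, end inclusive.

    A new shot starts whenever two consecutive frames differ by more
    than `threshold`. This is an illustrative rule, not the method
    used in the slides.
    """
    if not frame_histograms:
        return []
    shots = []
    start = 0
    for i in range(1, len(frame_histograms)):
        previous, current = frame_histograms[i - 1], frame_histograms[i]
        if histogram_distance(previous, current) > threshold:
            shots.append((start, i - 1))
            start = i
    shots.append((start, len(frame_histograms) - 1))
    return shots


# Toy 3-bin color histograms for five frames (made-up values).
frames = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],  # abrupt change: a cut is expected here
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
shots = detect_shots(frames)
print(shots)
```

A fixed threshold only catches abrupt cuts; gradual transitions such as fades would need a different rule.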

5 Video story detection
Group adjacent shots according to semantic coherence.
Some story boundaries cannot be identified with visual features; others can be identified with visual features only.
Multi-modal features are needed!
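As a rough illustration of grouping adjacent shots by coherence, the sketch below splits a shot sequence wherever two neighbouring shots have low cosine similarity. This is not the deep multi-modal method of the slides; the per-shot feature vectors, the threshold and the function names are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_adjacent_shots(features, threshold=0.8):
    """Group consecutive shots into stories.

    A shot joins the current story when it is similar enough to the
    previous shot; otherwise it opens a new story.
    """
    if not features:
        return []
    stories = [[0]]
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) >= threshold:
            stories[-1].append(i)
        else:
            stories.append([i])
    return stories


# Hypothetical per-shot feature vectors; in practice these would
# concatenate visual, audio and textual descriptors.
shot_features = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 1.0],
]
stories = group_adjacent_shots(shot_features)
print(stories)
```

Because only adjacent shots are compared, every story is a contiguous run of shots, which matches the temporal segmentation setting described here.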

6 Perceptual multi-modal features
Visual appearance: 1000 visual concepts (ILSVRC-12: 1.2 million images) and 205 scenes (Places: 2.5 million images)
Short-term audio spectrum features
POS tagger
Quantity of speech
Time

7 Semantic multi-modal features
Textual semantic: cluster words in the transcript using a Word2Vec embedding space, where words with similar semantics lie close together (textual concept space).
"The deciduous forests of America begin to shut down, losing their leaves in preparation for the dark cold months ahead."
Visual semantic: a Visual Word2Vec, where words in the transcript are visually confirmed using the entire Imagenet dataset ( categories) (visual concept space).
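The textual clustering step can be sketched with a small k-means over word vectors. The two-dimensional vectors below are made up for illustration and stand in for real Word2Vec embeddings; the slides do not state which clustering algorithm is used, so k-means and all names here are assumptions.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, initial_centroids, iterations=10):
    """Plain k-means: assign each vector to its nearest centroid, then update."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = mean(members)
    return labels, centroids


# Toy embeddings (hypothetical values) for words from the example sentence.
toy_embeddings = {
    "forests": [0.90, 0.10],
    "leaves": [0.80, 0.20],
    "deciduous": [0.85, 0.15],
    "dark": [0.20, 0.80],
    "cold": [0.10, 0.90],
    "months": [0.15, 0.85],
}
words = list(toy_embeddings)
vectors = [toy_embeddings[w] for w in words]
labels, _ = kmeans(vectors, initial_centroids=[vectors[0], vectors[3]])

clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
print(clusters)
```

Each cluster then acts as one textual concept, and a shot can be described by how many of its transcript words fall into each cluster.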

8 A Deep Multi-modal architecture

9 Retrieval: merging semantics and aesthetics
Scene-based: retrieve parts of videos instead of videos.
Semantic: thumbnail should represent the query.
Aesthetic: thumbnail should be aesthetically pleasant.
Low- and high-level activations from a CNN + max-margin linear ranking.
Less data, no DNN training!
L. Baraldi, C. Grana, R. Cucchiara, "Scene-driven Retrieval in Edited Videos using Aesthetic and Semantic Deep Features", ICMR 2016, New York
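To make the max-margin linear ranking idea concrete, here is a minimal sketch that learns a weight vector from preference pairs with a pairwise hinge loss. It is not the formulation of the cited paper: the regularization term is omitted for brevity, and the two-dimensional features (a semantic score and an aesthetic score), the pairs and the function names are hypothetical stand-ins for CNN activations.

```python
def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def train_pairwise_ranker(pairs, dim, lr=0.5, max_epochs=1000):
    """Learn w so that dot(w, better) >= dot(w, worse) + 1 for every pair.

    Each violated pair triggers a subgradient step on the hinge loss
    max(0, 1 - dot(w, better - worse)). Training stops early once no
    pair violates the margin.
    """
    w = [0.0] * dim
    for _ in range(max_epochs):
        updated = False
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            if dot(w, diff) < 1.0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
                updated = True
        if not updated:
            break
    return w


# Hypothetical thumbnails described by [semantic score, aesthetic score].
thumbnails = {
    "a": [0.9, 0.8],
    "b": [0.9, 0.2],
    "c": [0.2, 0.9],
    "d": [0.1, 0.1],
}
# Preference pairs (better, worse), assumed to come from annotations.
preferences = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
pairs = [(thumbnails[x], thumbnails[y]) for x, y in preferences]

w = train_pairwise_ranker(pairs, dim=2)
ranking = sorted(thumbnails, key=lambda name: dot(w, thumbnails[name]), reverse=True)
print(ranking)
```

Only a linear model is trained on top of fixed features, which is in the spirit of the "less data, no DNN training" remark on this slide.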

10 Retrieval: merging semantics and aesthetics
Query: penguin and calf. Same video, different scenes.
Query: ant and spider. Same scene, different thumbnails.

11 Evaluation
Synthetic: YFCC100M-Stories, built using YFCC100M videos.
TV Series: AllyMcBeal dataset for scene detection, first four episodes of the first season.
Documentaries: BBC Planet Earth, 11 episodes from a BBC educational TV Series, 4900 shots and 670 segments.
[Results table: Our method (Deep) vs. State of the art on YFCC100M-Stories, AllyMcBeal and BBC Planet Earth]

12 Visualization Automatically generated stories can be visualized in a timeline fashion. Visual concepts enhance navigation and search inside the archive.

13 Visualization Video re-use!

14 Thank you Any questions?
