Fraunhofer IAIS Audio Mining Solution for Broadcast Archiving. Dr. Joachim Köhler LT-Innovate Brussels

Transcription:

Fraunhofer IAIS Audio Mining Solution for Broadcast Archiving
Dr. Joachim Köhler, LT-Innovate, Brussels, 22.11.2016

Outline
- Speech Technology in the Broadcast World
- Deep Learning Speech Technologies
- Fraunhofer Audio Mining System
- Audio Mining Architecture
- Audio Mining User Interface
- Outlook
From Research Prototype to Productive Use!

Fraunhofer IAIS Audio Mining: B2B Speech Recognition Solution for the Media Industry
Key Facts
- Large Vocabulary Continuous Speech Recognition (1,000,000 words) optimized for media content
- Automatic structuring of audio-visual content
Applications along the Media Asset Chain
- Archive: Indexing and transcription of media archive content
- Online: Search functionalities for media portals (e.g. InClip-Search) and content-based recommendation
- Subtitling of TV content
- SocialTV: Second Screen information enrichment
- Advertising/Marketing: Video Search Engine Optimization (VSEO) and contextualized advertisement
Joachim.Koehler@iais.fraunhofer.de

Deep Learning in Speech Recognition for Content Mining
- Massive usage of DNN technologies for pattern recognition and machine learning
- Usage of very large annotated databases
- Significant reduction of word error rates
Scenarios
- Image recognition, IBM Watson: Visual Search in AREMA
- Speech recognition (Siri, Cortana, Google Now, Audio Mining)

Speech Recognition Results and Improvements
- Gaussian Mixture Models (GMM) => Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN)
- Extension of the vocabulary size from 200,000 to 510,000 words
- Extension of the training database from 105 to 1,000 hours for acoustic training
- Extension of the data size for LM training from 40 million words to 1 billion words

| Year | System | Acoustic model | Language model | Training data [h] | WER [%], planned speech | WER [%], spontaneous speech |
|---|---|---|---|---|---|---|
| 2012/13 | 1 | GMM | 3-gram, 200k | 105 | 26.4 | 33.5 |
| 2013/14 | 2 | GMM | 3-gram, 200k | 323 | 24.0 | 31.1 |
| 2014 | 3 | DNN | 3-gram, 200k | 323 | 18.4 | 22.6 |
| 2015 | 4 | DNN | 5-gram, 510k | 1005 | 13.3 | 16.5 |
| 2016 | 5 | RNN | 5-gram, 510k | 1005 | 11.9 | 14.5 |
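The WER figures above are word error rates: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used for these results; the function names and the example sentences are illustrative assumptions. It also computes the relative improvement implied by the reported figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Fraction by which the error rate dropped, relative to the old value."""
    return (old_wer - new_wer) / old_wer


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the evaluation data.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Figures from the results table: 2012/13 GMM system vs. 2016 RNN system.
    print(round(relative_reduction(26.4, 11.9), 3))  # planned speech
    print(round(relative_reduction(33.5, 14.5), 3))  # spontaneous speech
```

On the reported figures, `relative_reduction(26.4, 11.9)` is about 0.55 and `relative_reduction(33.5, 14.5)` is about 0.57, so the error rate dropped by a little more than half on both planned and spontaneous speech between 2012/13 and 2016.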

Fraunhofer IAIS Audio Mining: Technology
- Speaker diarization to structure recordings automatically (e.g. speaker information)
- ASR system based on the Kaldi open source package
  - Using Deep Neural Networks
  - Completely speaker independent
  - Real-time processing
- Trained on a large-scale German broadcast database of 1,000 hours
  - Variety of different TV channels
  - All speech material manually transcribed
- Service-oriented architecture to control and run the recognition engine
- Highly adaptable for specific domains and scenarios
- Performance, open domain:
  - 11.9% word error rate on planned speech
  - 14.5% word error rate on spontaneous speech

Speech Recognition for Media Archiving powered by Fraunhofer IAIS
Customer: WDR, German broadcaster (Archive Department)
Project facts:
- Integration of the Fraunhofer IAIS Audio Mining system into the WDR IT environment (ARCHIMEDES and IVZ)
- Content mining of large amounts of AV data, immediately!
- Better navigation and segmentation of radio and TV material
- Search in spoken utterances
- Full transcription and keyword generation
Technology provided by Fraunhofer:
- Broadcast speech recognition
- Automatic speech segmentation
[Figure labels: Speech Recognition, Structured Preparation, Structured Segmentation]

Audio Mining Solution ARCHITECTURE

Audio Mining System architecture
[Diagram shown on four slides; recoverable labels only]
- Components: Clients (e.g. AREMA), Web interface, Audio Mining core, Audio Mining Monitor, ifinder
- Communication: Web services, Messaging
- Message flows: import, analysis, status and deletion requests; asset status and details; analysis priorities; processing updates and deletion updates; analysis requests and analysis results
- Storage: file system, database, search index
- Processing: Structural Analysis and Automatic Speech Recognition, each drawn as multiple instances
- Servers: HTTP server, messaging server, database

Fraunhofer Audio Mining: Search Interface (I)
https://nm-prototype.iais.fraunhofer.de/demo/
- Search functionality: Find audio and video files with specific keywords, specific words in the title or the transcript, or with a specific series name.

Fraunhofer Audio Mining: Search Interface (II)
https://nm-prototype.iais.fraunhofer.de/demo/
- Preview functionality: Select a media file from the right-hand side to watch it or listen to it.
- Subtitles: Audio Mining creates subtitles based on the transcript and the structural analysis results.
- Segmentation/Speaker clustering: Audio Mining detects whenever the speaker changes and divides the media file into multiple segments. Jump to a specific segment by clicking the timeline.
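As an illustration of how subtitles could be derived from a transcript that has already been divided into timed segments, the sketch below writes segments in the widely used SubRip (SRT) layout. The slides do not state which subtitle format or data model Audio Mining uses, so the `Segment` class, its fields, and the choice of SRT are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Hypothetical segment: one speaker turn with its transcript text."""
    start: float  # seconds from the beginning of the media file
    end: float    # seconds from the beginning of the media file
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm as used in SRT files."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for number, seg in enumerate(segments, start=1):
        cues.append(
            f"{number}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    # Made-up example segments, one per detected speaker turn.
    demo = [
        Segment(0.0, 2.5, "Guten Abend, meine Damen und Herren."),
        Segment(2.5, 6.0, "Hier sind die Nachrichten."),
    ]
    print(to_srt(demo))
```

Using one cue per speaker segment keeps the example short; a production subtitler would also need to split long segments into readable lines.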

Fraunhofer Audio Mining: Search Interface (III)
https://nm-prototype.iais.fraunhofer.de/demo/
- Advanced search functionality: You can also search for a specific word inside the transcript.
- Word occurrences: Marks indicate the occurrences of the search term. Click on a mark to jump to the corresponding position in the video.
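Jumping from a search term to positions in the video requires that each recognized word carries a timestamp. A minimal sketch of such a lookup, assuming the transcript is available as (word, start time) pairs; this data layout and the function names are illustrative and not taken from the Audio Mining system:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_occurrence_index(
    timed_words: Iterable[Tuple[str, float]]
) -> Dict[str, List[float]]:
    """Map each lower-cased word to the start times where it was spoken."""
    index: Dict[str, List[float]] = defaultdict(list)
    for word, start in timed_words:
        index[word.lower()].append(start)
    return dict(index)


def find_occurrences(index: Dict[str, List[float]], term: str) -> List[float]:
    """Return the timeline positions (in seconds) for a search term."""
    return index.get(term.lower(), [])


if __name__ == "__main__":
    # Made-up transcript fragment with word start times in seconds.
    transcript = [("Heute", 0.4), ("in", 0.9), ("Köln", 1.1), ("heute", 5.2)]
    index = build_occurrence_index(transcript)
    print(find_occurrences(index, "heute"))
```

Each returned time corresponds to one mark on the timeline; clicking a mark would seek the player to that position.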

Fraunhofer Audio Mining: Search Interface (IV)
https://nm-prototype.iais.fraunhofer.de/demo/
- Keywords: Audio Mining generates keywords for every media file, based on particularly relevant words in the transcript. Again, marks indicate the occurrences of the keyword. Click on a mark to jump to the corresponding position in the video.
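The slides do not say how relevance is measured. One common way to pick words that are relevant for a single transcript is TF-IDF weighting, which favours words that are frequent in one document but rare across the collection. The sketch below uses TF-IDF purely as a stand-in; the function name, the whitespace tokenization, and the toy transcripts are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def top_keywords(transcripts: Dict[str, str], doc_id: str, k: int = 3) -> List[str]:
    """Rank the words of one transcript by TF-IDF and return the top k."""
    tokenized = {d: text.lower().split() for d, text in transcripts.items()}
    # Document frequency: in how many transcripts does each word occur?
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n_docs = len(tokenized)
    term_counts = Counter(tokenized[doc_id])
    scores = {
        word: count * math.log(n_docs / doc_freq[word])
        for word, count in term_counts.items()
    }
    # Sort by descending score; break ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy transcripts, invented for illustration.
    demo = {
        "news": "wahl wahl bonn heute",
        "weather": "wetter heute regen",
        "sport": "sport heute tor",
    }
    print(top_keywords(demo, "news", k=2))
```

A word such as "heute" that appears in every transcript gets a score of zero and is therefore never chosen over document-specific words.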

Speech Recognition for ARD Mediathek
Artificial Intelligence application powered by Fraunhofer
- Content analytics of 200,000 media assets
- Advanced search and retrieval capabilities
- Full transcription of multimedia content
- Daily processing of 2000 new media assets from radio and TV
- Core technology for recommendation and personalization services
Link: http://www.ardmediathek.de

DNN-based Speech Recognition for Live Subtitling

Outlook
- Audio Mining is accepted in the broadcast domain (Germany)
  - Projects in the archive environment are under way
  - First projects with large media portals (ARD) have been realized
- Future domains in the media (research prototypes)
  - Subtitling (a huge amount of TV data is subtitled, either manually or by respeaking)
  - VSEO
  - Second Screen applications
- Transfer to other languages and media organisations outside Germany
- Transfer to other domains outside the media world
- Adding more language technologies (TTS, MT, ...)

Contact
Fraunhofer Institut IAIS
Schloss Birlinghoven
53754 St. Augustin
www.iais.fraunhofer.de
Dr. Joachim Köhler
Tel: +49 (0) 2241 14 1900
Mail: joachim.koehler@iais.fraunhofer.de
Link: https://www.iais.fraunhofer.de/audiomining.html