Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers
Transcription
1 Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers Johann Hauswald, Michael A. Laurenzano, Yunqi Zhang, Cheng Li, Austin Rovinski, Arjun Khurana, Ron Dreslinski, Trevor Mudge, Vinicius Petrucci, Lingjia Tang, Jason Mars University of Michigan Ann Arbor, MI
2 Intelligent Personal Assistants (IPAs)
3 Rise of the Wearables [Figure callouts: 40%, $80bn]
4-7 Scaling Current Datacenters
[Figure: compute resources required versus the ratio of IPA to web search queries (x-axis ticks at 50% and 100%). Callouts across the build: 10% IPA: 16x machines; IPA: 80x machines (percentage label lost in extraction); 100% IPA: 160x machines.]
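The two fully labeled callouts on these slides (16x machines at a 10% ratio, 160x at 100%) are consistent with a straight line through the origin. The sketch below assumes that linear relationship; the constant and function name are hypothetical and do not come from the slides.

```python
# Assumption: the machine count grows linearly with the IPA-to-web-search
# query ratio, anchored on the slide's "100% IPA: 160x machines" callout.
MACHINES_AT_FULL_RATIO = 160.0


def ipa_machine_multiplier(ipa_to_search_ratio: float) -> float:
    """Machines needed for IPA queries, as a multiple of the web-search fleet."""
    if ipa_to_search_ratio < 0:
        raise ValueError("ratio must be non-negative")
    return MACHINES_AT_FULL_RATIO * ipa_to_search_ratio


if __name__ == "__main__":
    for ratio in (0.1, 0.5, 1.0):
        print(f"{ratio:.0%} IPA queries -> {ipa_machine_multiplier(ratio):.0f}x machines")
```

Under this assumption the midpoint of the axis (a 50% ratio) lands on the 80x callout shown in the build.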
8 The Challenge: Redesign the datacenter for intelligent personal assistants. No open source IPA. Investigate future datacenter designs.
9 Open source intelligent personal assistant. Benchmark suite. Ported suite across accelerator platforms. Investigate future datacenter designs for IPAs.
10 Sirius
11-22 [Figure: Sirius end-to-end pipeline, built up over several slides. Mobile users send voice and image data to the server. Voice passes through Automatic Speech-Recognition and then a Query Classifier, which separates commands to execute from questions. A command such as "Set my alarm for 6am" goes to Execute. Questions go to Question-Answering, which searches a database: "What is the capital of Turkey?" returns "Ankara", and "How tall is the eiffel tower?" returns "300 meters". Images go to Image Matching against an image database. The answer is displayed on the user's device.]
Sirius: full end-to-end with inputs, pre-trained models, and databases
Sirius-suite: 7 kernels with inputs to study each service individually
sirius.clarity-lab.org
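The routing in the pipeline figure can be sketched as a small dispatcher. This is a toy illustration only: the keyword-based classifier, the service labels, and the function names are hypothetical stand-ins, and the order in which image matching and question answering run is an assumption rather than something the slides state.

```python
from typing import List, Optional

# Hypothetical stand-in for the query classifier: a leading question word
# or a trailing question mark marks the transcript as a question.
QUESTION_WORDS = ("what", "who", "where", "when", "why", "how", "which")


def classify(transcript: str) -> str:
    """Return 'question' or 'action' for a speech-recognition transcript."""
    words = transcript.lower().split()
    first = words[0] if words else ""
    if first in QUESTION_WORDS or transcript.strip().endswith("?"):
        return "question"
    return "action"


def route(transcript: str, image: Optional[bytes] = None) -> List[str]:
    """List the services a voice query (optionally with an image) would visit."""
    services = ["ASR"]            # every voice input is transcribed first
    if classify(transcript) == "action":
        services.append("Execute")
        return services
    if image is not None:
        services.append("IMM")    # assumption: image matching runs before QA
    services.append("QA")
    return services


if __name__ == "__main__":
    print(route("Set my alarm for 6am"))
    print(route("What is the capital of Turkey?"))
    print(route("How tall is the eiffel tower?", image=b"placeholder"))
```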
23 How does Sirius work?
Query taxonomy: Voice Command (VC), Voice Query (VQ), Voice-Image Query (VIQ)
IPA services: Automatic Speech Recognition (ASR), Question Answering (QA), Image Matching (IMM)
Open source tools: CMU Sphinx
Tasks: Signal Processing, Natural Language Processing, Image Processing
24 Sirius-suite
25-32 Sirius-suite kernels by IPA service
Automatic Speech Recognition (ASR): Gaussian Mixture Model, GMM (85%); Deep Neural Network, DNN (78%)
Question Answering (QA): Stemmer (46%); Regex (22%); Conditional Random Fields, CRF (17%)
Image Matching (IMM): Feature Extraction, FE (41%); Feature Description, FD (56%)
7 kernels: 92% total execution of Sirius
Suite entirely written in C/C++/CUDA
Release includes inputs and models
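If each percentage above is read as the kernel's share of its service's execution time (an assumption; the slide does not spell this out), Amdahl's law gives the service-level speedup from accelerating that kernel alone. The function name and the 10x kernel speedup below are hypothetical.

```python
def service_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only one kernel is accelerated.

    kernel_fraction: share of service time spent in the kernel (0..1).
    kernel_speedup:  how many times faster the kernel runs after porting.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


if __name__ == "__main__":
    # Shares taken from the slide; the 10x kernel speedup is made up.
    for name, share in (("GMM", 0.85), ("DNN", 0.78), ("Stemmer", 0.46)):
        print(f"{name}: {service_speedup(share, 10.0):.2f}x service speedup")
```

The bound as the kernel speedup grows without limit is 1 / (1 - kernel_fraction), so a kernel that covers a small share of a service caps the achievable gain no matter how fast the accelerator is.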
33 Future Datacenter Design
34 How must current datacenters be upgraded to meet demand? What is the efficiency of the upgraded datacenter?
35 Upgrading Datacenters with COTS Systems
Platform       Model                 Clock      Threads
Multicore CPU  Intel Xeon E V        GHz        8
GPU            NVIDIA GTX            GHz
Intel Phi      Phi 5110P             1.05 GHz   240
FPGA           Xilinx Virtex-6 ML    MHz        N/A
36 Upgrading Datacenters with COTS Systems
Platform       Advantage          Disadvantage
Multicore CPU  Minor SW changes   Limited speedup
GPU            Many threads       Programmability
Intel Phi      Manycore           Limited compiler support
FPGA           Flexible           New implementation
37-38 Acceleration Overview
[Table: one column per kernel (GMM, DNN, Stemmer, Regex, CRF, FE, FD) and one row per platform (CMP, GPU, Intel Phi, FPGA). Most values were lost in extraction. Surviving entries: GPU: *, 3.8*; FPGA: *, *, 7.5*, 34.6*, 75.5*.]
Custom Porting: % of the Implementations
39-43 Acceleration Results
[Figure: speedup per service (Speech Recognition, Question Answering, Image Matching). Callouts across the build: ~6x, ~5x, ~52x, 120x, ~99x, 169x.]
44-48 Service Latency Improvement
[Figure: latency (s) per platform. Callouts: 2.8s; using 8 threads.]
Average Latency Reduction: FPGA: 16x, GPU: 10x
49-50 Performance improvements increase throughput. Reduce the number of servers. What is the Total Cost of Ownership of an accelerator-upgraded datacenter?
51 TCO Model Parameters [1]
Parameter                    Value
Server Price                 $2,102
Server Power                 164 W
PUE                          1.1
DC Depreciation              12 years
Server Depreciation          3 years
Average Server Utilization   45%
Electricity Cost             $0.067/kWh
Datacenter Price             $10/W
Datacenter Opex              $0.04/W
Server Opex                  5% of Capex/year
[1] Barroso, Luiz André, et al. "The datacenter as a computer: An introduction to the design of warehouse-scale machines."
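A simplified per-server monthly TCO can be sketched from the parameters above. This is not the paper's model. It assumes straight-line depreciation with no interest, reads Datacenter Opex as dollars per watt per month, provisions the datacenter for the listed server power, and scales energy use with average utilization (which ignores idle power). All names in the code are hypothetical.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 8760 / 12  # average month length in hours


@dataclass
class TcoParams:
    # Values from the slide's parameter table.
    server_price: float = 2102.0
    server_power_w: float = 164.0
    pue: float = 1.1
    dc_depreciation_years: float = 12.0
    server_depreciation_years: float = 3.0
    avg_utilization: float = 0.45
    electricity_per_kwh: float = 0.067
    dc_price_per_w: float = 10.0
    dc_opex_per_w_month: float = 0.04       # assumption: the unit is per month
    server_opex_frac_per_year: float = 0.05


def monthly_tco_per_server(p: TcoParams) -> dict:
    """Return the monthly cost components (USD) and their total for one server."""
    server_capex = p.server_price / (p.server_depreciation_years * 12)
    server_opex = p.server_price * p.server_opex_frac_per_year / 12
    dc_capex = p.dc_price_per_w * p.server_power_w / (p.dc_depreciation_years * 12)
    dc_opex = p.dc_opex_per_w_month * p.server_power_w
    # Assumption: average draw = listed power * utilization, inflated by PUE.
    avg_kw = p.server_power_w / 1000 * p.avg_utilization * p.pue
    electricity = avg_kw * HOURS_PER_MONTH * p.electricity_per_kwh
    parts = {
        "server_capex": server_capex,
        "server_opex": server_opex,
        "dc_capex": dc_capex,
        "dc_opex": dc_opex,
        "electricity": electricity,
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    for name, usd in monthly_tco_per_server(TcoParams()).items():
        print(f"{name:>13}: ${usd:.2f}/month")
```

To compare an accelerated server, construct a second `TcoParams` with the higher price and power and divide each total by that server's query throughput, which gives a cost per query for each design.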
52-53 TCO Query Level Results
[Figure: TCO improvement per query type.]
Average TCO improvement: GPU: 2.6x, FPGA: 1.4x
54 Other topics included in the paper:
- Real System Analysis
- Question Variability Analysis
- Accelerator Porting Methodology
- FPGA Implementation
- Accelerator Details
- Performance per Watt
- Throughput Improvement at Various Load Levels
- Homogeneous/Heterogeneous Datacenter Design
55 Sirius: full application
Sirius-suite: 7 kernels to study each service
sirius.clarity-lab.org
56 Thank you
More informationGASPP: A GPU- Accelerated Stateful Packet Processing Framework
GASPP: A GPU- Accelerated Stateful Packet Processing Framework Giorgos Vasiliadis, FORTH- ICS, Greece Lazaros Koromilas, FORTH- ICS, Greece Michalis Polychronakis, Columbia University, USA So5ris Ioannidis,
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationTowards Energy-Proportional Datacenter Memory with Mobile DRAM
Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University
More informationWhite Paper Assessing FPGA DSP Benchmarks at 40 nm
White Paper Assessing FPGA DSP Benchmarks at 40 nm Introduction Benchmarking the performance of algorithms, devices, and programming methodologies is a well-worn topic among developers and research of
More informationHigh Performance Computing
High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason
More informationAccelerating molecular docking on multi- and manycore computer architectures
Accelerating molecular docking on multi- and manycore computer architectures Simon McIntosh-Smith University of Bristol, UK simonm@cs.bris.ac.uk 1 ! Power-limited regimes Processor power consumption now
More informationECE 486/586. Computer Architecture. Lecture # 3
ECE 486/586 Computer Architecture Lecture # 3 Spring 2014 Portland State University Lecture Topics Measuring, Reporting and Summarizing Performance Execution Time and Throughput Benchmarks Comparing and
More information(software agnostic) Computational Considerations
(software agnostic) Computational Considerations The Issues CPU GPU Emerging - FPGA, Phi, Nervana Storage Networking CPU 2 Threads core core Processor/Chip Processor/Chip Computer CPU Threads vs. Cores
More informationΕΠΛ372 Παράλληλη Επεξεργάσια
ΕΠΛ372 Παράλληλη Επεξεργάσια Warehouse Scale Computing and Services Γιάννος Σαζεϊδης Εαρινό Εξάμηνο 2014 READING 1. Read Barroso The Datacenter as a Computer http://www.morganclaypool.com/doi/pdf/10.2200/s00193ed1v01y200905cac006?cookieset=1
More informationCray XC Scalability and the Aries Network Tony Ford
Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationBurrows-Wheeler Short Read Aligner on AWS EC2 F1 Instances
University of Virginia High-Performance Low-Power Lab Prof. Dr. Mircea Stan Burrows-Wheeler Short Read Aligner on AWS EC2 F1 Instances Smith-Waterman Extension on FPGA(s) Sergiu Mosanu, Kevin Skadron and
More informationad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors
ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng
More informationAccelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte
More information