Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Computing
Rajesh Bordawekar and Ruchir Puri, IBM T. J. Watson Research Center

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Computing
Rajesh Bordawekar and Ruchir Puri, IBM T. J. Watson Research Center
3/17/2015

Outline
- IBM OpenPOWER Platform
- Accelerating Analytics using GPUs
- Case Studies of GPU-accelerated workloads: Cognitive/Graph Analytics, In-memory OLAP, Deep Learning, Financial Modeling
- Summary

OpenPOWER: An Open Development Community
- Implementation / HPC / Research
- System / Software / Integration
- I/O / Storage / Acceleration
- Chip / SOC
- Boards / Systems
Complete member list at www.openpowerfoundation.org

The OpenPOWER Foundation creates a pipeline of continued innovation and extends POWER8 capabilities:
- Opening the architecture and innovating across the full hardware & software stack
- Driving an expansion of enterprise-class hardware and software
- Building a complete server ecosystem delivering maximum client flexibility
More information at the OpenPOWER Summit today and tomorrow

IBM and NVIDIA deliver new acceleration capabilities for analytics, big data, and Java
- Runs pattern-extraction analytic workloads 8x faster
- Provides new acceleration capability for analytics, big data, Java, and other technical computing workloads
- Delivers faster results and lower energy costs by accelerating processor-intensive applications
Power System S824L: up to 24 POWER8 cores, up to 1 TB of memory, up to 2 NVIDIA K40 GPU accelerators, Ubuntu Linux running bare metal
POWER8 provides high memory bandwidth and a large number of concurrent threads, ideal for scaling memory-bound workloads such as OLTP and OLAP. GPUs provide high memory and compute bandwidth, and NVLink will provide very high host-to-GPU bandwidth, leading to further performance, efficiency, and simplified multi-GPU programming. Together, POWER8 + GPUs are ideal for accelerating analytics, HPC, and data management workloads.

Analytics, HPC, and Data Management Landscape
[Diagram: overlapping Analytics, HPC, and Data Management domains covering Time-Series Processing, Rule Mining, Modeling/Simulation, Graph Analysis, Regression Analysis, Mathematical Programming, Clustering, Decision-tree Learning, OLAP/BI, Neural Networks, Support Vector Machines, Visualization, Nearest-neighbor Search, RDF, Text Analytics, Streams, Transaction Processing, and Data Integration]

Acceleration Opportunities for GPUs
Analytics model → computational patterns suitable for GPU acceleration:
- Regression Analysis: Cholesky factorization, matrix inversion, transpose
- Clustering: cost-based iterative convergence
- Nearest-neighbor Search: distance calculations, singular value decomposition, hashing
- Neural Networks: matrix multiplications, convolutions, FFTs, pair-wise dot-products
- Support Vector Machines: linear solvers, dot-product
- Association Rule Mining: set operations (intersection, union)
- Recommender Systems: matrix factorizations, dot-product
- Time-series Processing: FFT, distance and smoothing functions
- Text Analytics: matrix multiplication, factorization, set operations, string computations, distance functions, pattern matching
- Monte Carlo Methods: random number generators, probability distribution generators
- Mathematical Programming: linear solvers, dynamic programming
- OLAP/BI: aggregation, sorting, hash-based grouping, user-defined functions
- Graph Analytics: matrix multiplications, path traversals

Case Study: Concept Graph Analytics
Human-like natural language understanding via identifying and ranking related concepts from massively crowd-sourced knowledge sources such as Wikipedia. A query for "GPU" would return documents first on GPUs, then on multi-cores and FPGAs (GPUs, multi-cores, and FPGAs are related concepts).
- Reading text means linking it to the graph
- Understanding text means seeing how it connects to all other concepts
[Figure: Wikipedia Concept Graph]
Goal: use GPUs to implement an instantaneous concept discovery system that can scale to millions of documents

Concept Graph Analysis Basics
- Operates on a corpus of documents (e.g., webpages)
- Relationships in the external knowledge base are represented as values in an N x N sparse matrix, where N is the total number of concepts
- Each document has a set of concepts that get mapped to sparse N-element vectors, one vector per concept
- Two core operations:
  - Indexing, which relates concepts from the current document corpus to the knowledge base (a throughput-oriented operation)
  - Real-time querying, which relates the few concepts generated by a query (e.g., gpu, multi-core, FPGA) to documents (a latency-sensitive operation)
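
As a purely illustrative picture of this layout, the following NumPy/SciPy sketch builds a toy N x N concept-relationship matrix in CSR form and maps one document's concepts to sparse N-element vectors; the concept IDs, weights, and sizes are made-up assumptions, not the actual Wikipedia data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy knowledge base with N concepts; edge weights encode relationship strength.
# Concept IDs and weights are hypothetical, not real Wikipedia data.
N = 5                                  # e.g. {GPU, multi-core, FPGA, CUDA, OLAP}
rows = [0, 0, 1, 2, 3]                 # source concept IDs
cols = [1, 3, 2, 0, 0]                 # related concept IDs
vals = [0.8, 0.9, 0.5, 0.4, 0.7]
knowledge = csr_matrix((vals, (rows, cols)), shape=(N, N))

# A document maps to sparse N-element vectors, one per concept it mentions.
doc_concepts = [0, 3]                  # this document mentions concepts 0 and 3
doc_vectors = csr_matrix(
    (np.ones(len(doc_concepts)), (doc_concepts, list(range(len(doc_concepts))))),
    shape=(N, len(doc_concepts)),
)

# Indexing multiplies the knowledge matrix with many such vectors at once;
# querying does the same for the handful of concepts in a query.
print(knowledge.shape, doc_vectors.shape)   # (5, 5) (5, 2)
```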

Implementation of Concept Graph Analysis
- Core computation: a Markov-chain-based iterative personalized PageRank algorithm to calculate relationships between the concepts of a knowledge graph
- Implemented as a sparse matrix-dense matrix multiplication (cuSPARSE csrmm2) over a sparse matrix representing the concept graph and a dense matrix representing the concepts from the documents
- Usually requires 5 iterations to converge
- Indexing involves multiplication over a large number of concept vectors; querying involves multiplication over a small number of concept vectors. GPUs are well suited to accelerating indexing.
- On a Wikipedia knowledge graph, sequential indexing takes around 59 sec; the GPU implementation takes around 2.31 sec
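
The computational pattern can be sketched on the CPU as follows: personalized PageRank run as a few rounds of sparse-matrix times dense-matrix multiplication, which is the step the GPU implementation maps onto cuSPARSE's csrmm2. The damping factor, sizes, and normalization details below are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def personalized_pagerank(A, R0, alpha=0.85, iters=5):
    """A:  N x N column-stochastic sparse transition matrix.
    R0: N x d dense matrix of restart (personalization) vectors,
        one column per concept being indexed or queried."""
    R = R0.copy()
    for _ in range(iters):                         # the slide notes ~5 iterations suffice
        R = alpha * (A @ R) + (1.0 - alpha) * R0   # SpMM plus restart term
    return R

# Toy sizes; the real graph has millions of concepts.
N, d = 1000, 8
A = sparse_random(N, N, density=0.01, format="csr")
col_sums = np.asarray(A.sum(axis=0)).ravel()
col_sums[col_sums == 0] = 1.0
A = A.multiply(1.0 / col_sums).tocsr()             # normalize columns to sum to 1

R0 = np.zeros((N, d))
R0[np.arange(d), np.arange(d)] = 1.0               # one seed concept per column
scores = personalized_pagerank(A, R0)
print(scores.shape)   # (1000, 8): relatedness of every concept to each seed concept
```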

Case Studies: Deep Learning
- Deep learning usually exploits neural network architectures such as multiple layers of fully-connected networks (deep neural networks, DNNs) or convolutional neural networks (CNNs)
  - DNNs are very good at classification
  - CNNs are very good at translation-invariant feature extraction
- Both approaches are heavily compute-bound and make extensive use of matrix multiplications
- Our focus: exploiting GPUs to accelerate speech-to-text processing using a joint CNN-DNN architecture
- Our system uses a native implementation of CNN and DNN operations using only cuBLAS functions; performance is competitive with (sometimes better than) cuDNN
- Talk on this work: Friday 9 am, 210A (S5231)
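
As background on why a cuBLAS-only implementation is feasible, here is a minimal NumPy sketch (not the authors' system) of the standard im2col lowering, which turns a convolution into a single matrix multiplication; shapes and data are illustrative.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold all kh x kw patches of a (C, H, W) input into columns."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, idx] = x[:, i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols, out_h, out_w

# Toy convolution layer: 4 filters of size 3x3 over a 3-channel 8x8 input.
x = np.random.randn(3, 8, 8)
filters = np.random.randn(4, 3, 3, 3)    # (out_channels, in_channels, kh, kw)

cols, out_h, out_w = im2col(x, 3, 3)
W = filters.reshape(4, -1)               # each filter becomes one GEMM row
y = (W @ cols).reshape(4, out_h, out_w)  # one matrix multiply performs the convolution
print(y.shape)                           # (4, 6, 6)
```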

Case Studies: In-memory Relational OLAP
- Relational OLAP is characterized by both compute-bound and memory-bound operations: aggregation operations are compute-bound; join, grouping, and ordering are memory-bound
- Query optimizers use either sort-based or hash-based approaches; both sorting and hashing can exploit the GPU's high memory bandwidth
- Data needs to fit in the device memory
- Two projects: a GPU-optimized new hash table (GTC 2014), and exploiting GPU-accelerated hashing for OLAP group-by operations in DB2 BLU (talk and demo on Tuesday, S5835 and S5229)
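
For readers unfamiliar with the pattern, here is a minimal single-threaded sketch of hash-based group-by aggregation with linear probing; it only illustrates the idea and stands in neither for the GPU hash table from GTC 2014 nor for DB2 BLU.

```python
import numpy as np

def hash_group_by_sum(keys, values, table_size=1 << 10):
    """Group-by-sum using an open-addressing hash table with linear probing."""
    slot_key = np.full(table_size, -1, dtype=np.int64)   # -1 marks an empty slot
    slot_sum = np.zeros(table_size)
    for k, v in zip(keys, values):
        h = hash(int(k)) % table_size
        while slot_key[h] != -1 and slot_key[h] != k:    # probe onward on collision
            h = (h + 1) % table_size
        slot_key[h] = k
        slot_sum[h] += v
    filled = slot_key != -1
    return dict(zip(slot_key[filled].tolist(), slot_sum[filled].tolist()))

keys = np.array([3, 7, 3, 1, 7, 7])
values = np.array([10.0, 1.0, 5.0, 2.0, 4.0, 4.0])
print(hash_group_by_sum(keys, values))   # sums per key: 1 -> 2.0, 3 -> 15.0, 7 -> 9.0
```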

Case Studies: In-memory Multi-dimensional OLAP
- Sparse data, representing values (measures) of multi-dimensional attributes (dimensions) with complex hierarchies (e.g., IBM Cognos TM1)
- Logically viewed as a multi-dimensional cube (MOLAP), accessed by a non-SQL query language
- Unlike relational OLAP, MOLAP is characterized by non-contiguous memory accesses, similar to accessing multi-dimensional arrays
- Most queries involve aggregation computations
- A GPU-accelerated MOLAP prototype has been implemented, with up to 4x improvement over multi-threaded SIMDized execution
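
A small NumPy sketch of the access pattern: a sparse cube stored as coordinate lists, aggregated with scattered (non-contiguous) accumulation. The dimensions and data are made up; this is not the prototype itself.

```python
import numpy as np

# Sparse 3-D cube (product, region, month) stored as coordinate lists plus measures.
product = np.array([0, 0, 1, 2, 1, 2])
region  = np.array([0, 1, 0, 1, 1, 0])
month   = np.array([1, 1, 2, 2, 3, 3])
sales   = np.array([5.0, 3.0, 7.0, 2.0, 6.0, 4.0])

# Roll-up: total sales per region, aggregating away product and month.
n_regions = region.max() + 1
totals = np.zeros(n_regions)
np.add.at(totals, region, sales)       # scattered (non-contiguous) accumulation
print(totals)                          # [16. 11.]

# Slice-and-aggregate: sales for product 1 only, per region.
mask = product == 1
per_region_p1 = np.zeros(n_regions)
np.add.at(per_region_p1, region[mask], sales[mask])
print(per_region_p1)                   # [7. 6.]
```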

Case Studies: Financial Modeling via Monte Carlo Simulation
- Monte Carlo simulation is used extensively in financial modeling, particularly for pricing exotic options when there is no analytical solution; these typically account for 10-20% of the pricing functions in a portfolio
- Low-I/O, high-compute workload: suitable for accelerators such as FPGAs and GPUs
- Key computational functions: random number generators (e.g., Sobol), generation of probability distributions (e.g., inverse normal), and pricing functions
- Current focus on exploiting low-power GPUs
- Talk on this work: Thursday 10:30 am, 210C (S5227)
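
A minimal Python sketch of this pipeline: Sobol quasi-random numbers, an inverse-normal transform, and a pricing function, here pricing a plain European call under geometric Brownian motion. The option type, parameters, and use of SciPy are illustrative assumptions, not the talk's implementation.

```python
import numpy as np
from scipy.stats import norm, qmc

# Illustrative market/contract parameters (assumptions, not from the talk).
S0, K, r, sigma, T = 100.0, 105.0, 0.01, 0.2, 1.0
n_paths = 1 << 14

# 1) Quasi-random numbers (Sobol), 2) inverse-normal transform, 3) pricing function.
u = qmc.Sobol(d=1, scramble=True).random(n_paths)          # uniforms in (0, 1)
z = norm.ppf(u).ravel()                                     # standard normal draws
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
payoff = np.maximum(ST - K, 0.0)                            # European call payoff
price = np.exp(-r * T) * payoff.mean()
print(f"Monte Carlo price: {price:.3f}")
```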

Other Analytics Opportunities
- Sparse Matrix Factorization: Watson Sparse Matrix Package on GPU (presentation S5232 on Thursday, 9 am, 210G)
- Streaming Analytics: GPUs to accelerate functions on data streams (e.g., aggregation, join)
- Graph Analytics: identification of concurrent cycles in an fMRI image graph
- Mathematical Programming: accelerating linear programming libraries

Summary
- The OpenPOWER consortium creates an open community to spur innovation using POWER-based platforms
- Demonstrated the advantages of GPUs via different analytics workloads: cognitive computing, OLAP, deep learning, and financial modeling
- POWER8 + GPU systems are ideal for accelerating analytics, HPC, and data management workloads