Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package


High Performance Machine Learning Workshop
Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package
Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas
24/09/2018

Agenda
- Introduction
- Background: K-Means Clustering and the CPU-FPGA platform
- Proposed Implementation
- Experimental Methodology
- Results and Conclusion

Introduction

Introduction
- The demand for HPC: massively parallel architectures, large-scale multi-cores
- Challenge: power consumption
- Heterogeneous computing: GPUs and co-processors (Xeon Phi)
- New challenge: data movement between devices, especially in a Big Data scenario

Introduction
- Field Programmable Gate Arrays (FPGAs): are they a good alternative for Big Data processing?
- Run specific tasks in hardware (no general-purpose units)
- Existing K-Means implementations [3][4]
- Consume about 10% of a GPU's power [5]
- Scale better than CPUs and GPUs when used properly [6]

Introduction
- Challenge: data movement between host and device (low bandwidth)
- Multi-Chip Packaging (MCP) [7]: integrates heterogeneous devices or chips in a single package
- Intel launched its MCP platform: Xeon Broadwell CPU + Arria 10 FPGA

Introduction
- Back to Big Data: consider the ever-growing amount of data to be processed
- Can hybrid MCPs accelerate Machine Learning applications?
- Our goal: propose and evaluate a K-Means implementation for the Intel Broadwell + Arria 10 platform
- Contributions:
  - An OpenCL-based implementation for the novel platform
  - A comparative analysis against other platforms

Background

Background: K-Means Clustering
- Unsupervised Machine Learning algorithm
- Partitions and groups data according to their features, using the Euclidean distance
- Given: a set of n data points in a real d-dimensional space, and k clusters

Background: K-Means Clustering (figure slide)

Background: CPU-FPGA Platform
- Intel hybrid MCP
- Host: 14-core Xeon Broadwell @ 2.4 GHz
- Device: Arria 10 GX1150 FPGA
- Interconnection: 2 PCIe links (16 GB/s combined) + 1 QPI link (12.8 GB/s)

Background: CPU-FPGA Platform
- FIU: FPGA Interface Unit
- AFU: Accelerated Function Unit
- CCI: Core Cache Interface
- Communication is performed in a cache-coherent manner

Proposed Parallel K-Means

Proposed Parallel K-Means
- Baseline: OpenMP version from CAP Bench [17], a benchmark suite for performance and energy evaluation of low-power many-core processors
- Target: the Intel hybrid and heterogeneous MCP platform
- Host: C/C++ code to initialize the system and data structures, and to synchronize
- Device: OpenCL kernel to perform the desired computation

Proposed Parallel K-Means (figure slide)

Proposed Parallel K-Means
- Centroids are updated at the host (C/C++), because this step has:
  - Few complex arithmetic operations
  - Conditional branches (not suitable for the FPGA)
  - Little data movement (only the centroids need to be offloaded)
- Parallelized with OpenMP across the 14 cores

Proposed Parallel K-Means
- Distance computation uses an NDRange kernel:
  - No gains were observed with a single-task (pipelined) kernel
  - High data parallelism (no dependencies between data points)
  - Number of work-items equal to the number of data points

Proposed Parallel K-Means
- Square root removed from the Euclidean distance:
  - Does not change the point-to-centroid mapping
- Loop over the features unrolled:
  - Features are independent of each other in the first step (sum of squared differences) of the Euclidean distance

Experimental Methodology

Experimental Methodology
- Compilers and tools:
  - Intel FPGA SDK for OpenCL compiler (AOCL)
  - Quartus 16.0.2 (PowerPlay included)
  - GCC 4.9.2 (OpenMP 4.0)
  - PAPI (for measuring CPU energy consumption)

Experimental Methodology
- Other platforms:
  - CPU: Intel Xeon E5-2620 Sandy Bridge (12 cores @ 2.10 GHz)
  - Intel Xeon Phi Knights Corner (61 cores)
  - 8-node Raspberry Pi 2 Model B cluster (32 cores)
  - Kalray MPPA-256 many-core (256 cores)
- 20 runs per test (standard error below 8%)

Experimental Methodology
Workload sizes:

Workload  | Data points | Centroids | Features | Iterations*
Tiny      | 4096        | 256       | 16       | 13
Small     | 8192        | 512       | 16       | 15
Standard  | 16384       | 1024      | 16       | 14
Large     | 32768       | 1024      | 16       | 25
Huge      | 65536       | 1024      | 16       | 48

* Measured after the tests

Results Evaluation

Results: Resource Usage

Resource          | Available | AFU
Logic utilization | 1150      | 10% (115)
ALUTs             | 427200    | 4% (17088)
Registers         | 1708800   | 6% (102528)
Memory blocks     | 67244     | 12% (8069)
DSP blocks        | 1518      | 5% (76)

There is still space for other kernels.

Results: Time to Solution
- Centroid computation accounts for little of the total time
- Host/device reads and writes account for, at most, 1.22% of the overall time
- Time increases proportionally to input size and iteration count
- The CPU scales slightly better; however, the FPGA is faster

Results: Time to Solution
- FPGA improvements:
  - Tiny: 68.21%
  - Small: 36.57%
  - Standard: 14.03%
  - Large: 23.59%
  - Huge: 10.82%

Results: Energy Efficiency
- From Tiny to Standard the CPU improves its efficiency, but it starts to decrease for larger inputs
- CPU+FPGA is more energy efficient than the CPU alone, as the FPGA consumes less power
- Improvements:
  - Lowest: 70.71% (Standard)
  - Highest: 85.92% (Tiny)

Results: Other Platforms
- Xeon Phi: CPU+FPGA is 7.2x more energy efficient
- Raspberry Pi cluster: CPU+FPGA is 21.5x more energy efficient
- Kalray MPPA-256: CPU+FPGA is 3.8x more energy efficient


Conclusion
- Energy-efficiency evaluation of K-Means on the novel Intel hybrid MCP platform (Xeon Broadwell CPU + Arria 10 FPGA)
- Up to 85.92% more energy efficient than the CPU-only version
- More energy efficient than the other evaluated platforms

Future Work
- Further K-Means evaluations: real datasets, other platforms (e.g. GPUs)
- Other Machine Learning algorithms: Deep Neural Networks, Support Vector Machines, Decision Trees
- Load balancing that considers the system heterogeneity

Acknowledgements


Related Work

Related Work
- FPGA-based solutions for Deep Neural Networks
- Memory restrictions are a key problem:
  - Stream buffers to reduce data exchange [18]
  - A pipeline of kernels to increase data reuse [19]
  - Floating-point precision reduction [20]

Related Work
- K-Means clustering: hardware implementations vs. software ones
  - Single FPGA: 95% less energy consumption [4]
  - Single FPGA: speedup of 10x [22]
  - Multiple FPGAs: speedup of 330x [24]

References
[3] Q. Y. Tang et al.: Acceleration of k-means Algorithm Using Altera SDK for OpenCL
[4] L. A. Maciel et al.: Projeto e Avaliação de uma Arquitetura do Algoritmo de Clusterização K-Means em VHDL e FPGA
[5] J. Cong et al.: Understanding performance differences of FPGAs and GPUs
[6] K. O'Brien et al.: Towards exascale computing with heterogeneous architectures
[7] R. Tummala et al.: Heterogeneous and homogeneous package integration technologies at device and system levels
[17] M. A. Souza et al.: CAP Bench: a benchmark suite for performance and energy evaluation of low-power many-core processors
[18] U. Aydonat et al.: An OpenCL deep learning accelerator on Arria 10
[19] D. Wang et al.: PipeCNN: An OpenCL-based FPGA accelerator for large-scale convolution neuron networks
[20] J. Zhang et al.: Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network
[22] J. Canilho et al.: Multi-core for K-means clustering on FPGA
[24] M. Vidal et al.: Implementation of the K-means algorithm on heterogeneous devices: a use case based on an industrial dataset

Background: K-Means Clustering (figure slide)

Results: Resource Usage

Resource          | Available | FIU + CCI    | AFU
Logic utilization | 1150      | 49% (563)    | 10% (115)
ALUTs             | 427200    | 23% (98256)  | 4% (17088)
Registers         | 1708800   | 27% (461376) | 6% (102528)
Memory blocks     | 67244     | 23% (15466)  | 12% (8069)
DSP blocks        | 1518      | 12% (182)    | 5% (76)

The FIU + CCI consume a considerable amount of resources, but there is still space for other kernels.