Fast Knowledge Discovery in Time Series with GPGPU on Genetic Programming


Fast Knowledge Discovery in Time Series with GPGPU on Genetic Programming
Sungjoo Ha, Byung-Ro Moon
Optimization Lab, Computer Science, Seoul National University
GECCO 2015, July 13th, 2015

Outline
- Motivation
- Knowledge discovery in time series data
- Knowledge discovery engine
  - Discovery engine
  - Query engine
- CUDA execution/memory model
- Details of the parallel framework
- Experiments & results
- Conclusion

Knowledge Discovery
- The process of finding interesting patterns or structures in given data

Time Series Data
- A sequence of records, each containing a set of attributes and a timestamp
- May be collected from multiple sources that share common characteristics
- Examples of time series data: stock markets, scientific computing, monitoring metrics, IoT

Knowledge Discovery in Time Series
- Finding precursors to events of interest
- "If yesterday's closing price is less than yesterday's five-day moving average, then it is going to be a bull market tomorrow."
- "If the volume of a certain hashtag doubles in 10 minutes, then that hashtag becomes a new trend on Twitter."

Knowledge Discovery Engine
- Discovery engine: proposes candidate patterns (genetic programming)
- Query engine: matches patterns against the data (GPGPU)

Pattern
- A pattern is a conjunction of Boolean expressions
- An expression is a combination of constants, comparison/arithmetic/logical operators, and attributes
- May contain domain-specific functions
- Example: 1.1 * p_o(-1) < p_c(0) AND MA_20(0) > MA_60(0)
- Find patterns that indicate an occurrence of some event
- A pattern may not yield the same result every time; deal with the uncertainty by taking the expectation over the events

Discovery Engine
- Standard GP; any additional features appropriate for the domain may be included
- The fitness of a pattern should reflect how interesting the events associated with the pattern are
- Example: profitability of a technical pattern r
- Expected earning rate of r after k trading days:
  E_k[r] = (1 / |R(r)|) * sum over (i, j) in R(r) of p_c(i, j+k) / p_c(i, j)
- R(r) = {(i, j) : r matches company i on trading day j}
- p_c(i, j) is the closing price of company i on trading day j
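The expected earning rate above can be sketched on the CPU as follows; the data layout (a per-company list of closing prices) and the function name are illustrative assumptions, not the paper's implementation.

```python
def expected_earning_rate(matches, closing, k):
    """E_k[r]: mean ratio p_c(i, j+k) / p_c(i, j) over all matches.

    matches: the set R(r) of (company, day) pairs where pattern r matched.
    closing: closing[i][j] = closing price of company i on day j.
    k: holding period in trading days.
    """
    # Illustrative sketch; assumes every match still has a price k days later.
    total = sum(closing[i][j + k] / closing[i][j] for (i, j) in matches)
    return total / len(matches)

# Usage: one hypothetical company whose price goes 100 -> 110 over k=2 days.
closing = {"IBM": [100.0, 105.0, 110.0]}
rate = expected_earning_rate({("IBM", 0)}, closing, k=2)  # 110/100 = 1.1
```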

Parallel GP
- Only parallelize the fitness evaluation
  - Low complexity, simpler code, flexible GP
- Fitness evaluation is the most time-consuming part, especially for hybrid GP

CUDA Execution Model
- A kernel is executed in parallel by many threads
- Threads form blocks, and blocks form a grid
(Figure: a grid of blocks, each block containing multiple threads.)
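The block/thread hierarchy gives every thread a unique global index; a minimal Python sketch of the standard 1-D index computation (the names mirror CUDA's built-in variables):

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """CUDA's 1-D global index: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

# Thread 5 inside block 2, with 256 threads per block:
gid = global_thread_id(2, 256, 5)  # 517
```

In the query engine, such an index would select which (company, day) position a thread evaluates.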

CUDA Memory Model
- Per-thread local memory: registers (fast) and local memory (resides in global memory, slow)
- Threads within a block can communicate through shared memory
- For global synchronization, launch multiple kernels
- Global memory and constant memory are visible to all threads
(Figure: per-thread local memory, per-block shared memory, and grid-wide global memory.)

Parallel Framework: Big Picture
(Figure: the discovery engine produces a pattern such as "p_c(0) p_o(-1) >", which the query engine holds in constant memory; opening and closing prices live in global memory, and each thread evaluates the pattern at its own time index using a small stack.)

Parallel Framework: Storing the Pattern
- The discovery engine generates a candidate pattern
- The pattern is stored in constant memory in postfix notation
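Flattening an expression tree into the postfix token stream that gets copied to constant memory can be sketched like this; the tuple-based tree encoding is an assumption for illustration.

```python
def to_postfix(node):
    """Flatten an expression tree into a postfix token list.

    A node is either a leaf token (attribute name or numeric constant)
    or a tuple (operator, left_subtree, right_subtree).
    """
    if not isinstance(node, tuple):
        return [node]  # constant or attribute leaf
    op, left, right = node
    return to_postfix(left) + to_postfix(right) + [op]

# The example sub-pattern 1.1 * p_o(-1) < p_c(0) becomes:
tree = ("<", ("*", 1.1, "p_o(-1)"), "p_c(0)")
tokens = to_postfix(tree)  # [1.1, 'p_o(-1)', '*', 'p_c(0)', '<']
```

Postfix order lets the GPU evaluate the pattern with a simple stack and no recursion.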

Parallel Framework: Storing the Time Series
- The time series data is stored in global memory

Memory Utilization
- Each time series is stored as a 1-D array per field, arranged along the time axis
- Threads jump along the time axis as they process the data
- This yields a better memory access pattern
- Example: field 1 = opening prices, field 2 = closing prices; first index = company, second index = day
(Figure: Field1 and Field2 laid out as contiguous per-company rows, e.g. Field1[0][0..3] holds IBM's opening prices for days 0-3, Field1[1][*] APPL's, Field1[2][*] MSFT's.)
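The per-field 1-D layout can be sketched with a flat array and an index helper; the sizes and helper name are hypothetical, but they show how one company's values sit contiguously along the time axis.

```python
NUM_DAYS = 4  # hypothetical length of the time axis

def idx(company, day):
    """Flat index into a per-field 1-D array: first index selects the
    company, second the day, contiguous along the time axis."""
    return company * NUM_DAYS + day

# field1 = opening prices for 3 companies x 4 days, one flat array.
field1 = [0.0] * (3 * NUM_DAYS)
field1[idx(0, 2)] = 101.5  # opening price of company 0 on day 2

# A thread assigned to company 0 walks contiguous memory along time:
row = [idx(0, d) for d in range(NUM_DAYS)]  # [0, 1, 2, 3]
```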

Parallel Framework: Evaluating the Pattern
- The postfix pattern is evaluated using a stack
- Shared memory vs. local memory for the stack
- Local memory may lead to better overall performance, even though shared memory is much faster
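A CPU sketch of the stack-based postfix evaluation each GPU thread performs for one (company, day) position; the token and attribute representations are illustrative assumptions, and only a few operators are shown.

```python
def eval_postfix(tokens, attrs):
    """Evaluate a postfix pattern with an explicit stack.

    tokens: postfix list of numeric constants, attribute names, operators.
    attrs: attribute values at this position, e.g. {'p_c(0)': 11.0}.
    """
    ops = {
        "+": lambda a, b: a + b,
        "*": lambda a, b: a * b,
        "<": lambda a, b: a < b,
        ">": lambda a, b: a > b,
        "and": lambda a, b: a and b,
    }
    stack = []
    for tok in tokens:
        if tok in ops:                 # operator: pop two, push result
            b = stack.pop()
            a = stack.pop()
            stack.append(ops[tok](a, b))
        elif isinstance(tok, str):
            stack.append(attrs[tok])   # attribute lookup
        else:
            stack.append(tok)          # numeric constant
    return stack.pop()

# "p_c(0) p_o(-1) >": does today's close exceed yesterday's open?
result = eval_postfix(["p_c(0)", "p_o(-1)", ">"],
                      {"p_c(0)": 11.0, "p_o(-1)": 10.0})  # True
```

On the GPU the `stack` array would live in local or shared memory, which is exactly the trade-off measured in the results.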

Parallel Framework: Per-Block Partial Results
- Per-block partial results are computed in shared memory using parallel reduction
- If multiple input sources are needed to calculate the partial results:
  - Accumulate the temporary values in global memory
  - Launch a separate kernel for the reduction
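The shared-memory parallel reduction can be sketched sequentially as a halving loop; on the GPU the inner additions run across threads with a barrier between steps. The function below is an illustrative Python model, not the CUDA kernel itself.

```python
def block_reduce(values):
    """Tree-style sum of one block's partial results.

    Mirrors the shared-memory halving loop: at each step, thread t adds
    element t + stride into element t, then the stride halves.
    Assumes len(values) is a power of two (pad with zeros otherwise).
    """
    vals = list(values)
    stride = len(vals) // 2
    while stride > 0:
        for t in range(stride):   # these additions run in parallel on the GPU
            vals[t] += vals[t + stride]
        stride //= 2              # a barrier (__syncthreads) separates steps
    return vals[0]

block_total = block_reduce([1, 2, 3, 4, 5, 6, 7, 8])  # 36
```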

Parallel Framework: Recap
(Figure: the big-picture diagram repeated — pattern in constant memory, prices in global memory, per-thread evaluation stack.)

Experiments
- Stock market technical pattern mining: find patterns with high profitability
- Korean stock market data from 2000 to 2014
- Genetic programming setup:
  - Steady state, 500 individuals
  - Random initialization
  - Roulette-wheel selection
  - Crossover: swap randomly chosen subtrees
  - Mutation: replace a random subtree
  - Local optimization: traverse each node and adjust the constants
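The subtree-swap crossover can be sketched as below; the tuple-based tree encoding and helper names are illustrative assumptions, not the authors' code.

```python
import random

def subtrees(node, path=()):
    """List all (path, subtree) pairs; internal nodes are (op, left, right)."""
    out = [(path, node)]
    if isinstance(node, tuple):
        _, left, right = node
        out += subtrees(left, path + (1,))
        out += subtrees(right, path + (2,))
    return out

def replace(node, path, new):
    """Return a copy of the tree with the subtree at `path` replaced."""
    if not path:
        return new
    op, left, right = node
    if path[0] == 1:
        return (op, replace(left, path[1:], new), right)
    return (op, left, replace(right, path[1:], new))

def crossover(a, b, rng=random):
    """Swap randomly chosen subtrees between two parent trees."""
    pa, sa = rng.choice(subtrees(a))
    pb, sb = rng.choice(subtrees(b))
    return replace(a, pa, sb), replace(b, pb, sa)

parent1 = ("and", ("<", "p_o(-1)", "p_c(0)"), (">", "MA_20(0)", "MA_60(0)"))
parent2 = (">", "volume(0)", 1000.0)
child1, child2 = crossover(parent1, parent2)
```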

Results
- A typical GP run of 50 generations using 8 CUDA devices takes 120 to 180 seconds
- Roughly 17,000 to 21,000 fitness evaluations are performed
- Local optimization takes more than 97% of the time

Results: Single Fitness Evaluation
(Figure: two plots against the number of devices, from the CPU baseline up to 8 GPUs — relative speedup over the CPU, and mean execution time in ms for the GPU-only runs.)
- Strong linear relationship between the relative speed and the number of GPU devices
- Roughly an 80% gain per additional device

Results: Local vs. Shared Memory for the Stack

Relative speed (mean and standard deviation):

  # Devices   Local (mean, sd)   Shared (mean, sd)
  CPU            1.0,  0.0          1.0,  0.0
  1 GPU         56.9,  3.8         35.8,  2.8
  2 GPU        101.7,  7.2         65.0,  5.5
  3 GPU        132.7, 11.1         88.0,  8.1
  4 GPU        157.1, 15.5        107.2, 10.7
  5 GPU        191.7, 19.9        130.3, 13.7
  6 GPU        222.0, 24.5        151.5, 16.6
  7 GPU        247.5, 29.3        172.2, 19.5
  8 GPU        277.0, 36.8        188.7, 22.4

- Using local memory for the stack performs better than shared memory for this task
- Freeing up shared memory leads to higher occupancy of the multiprocessors
- The high latency of global memory is hidden

Results: Tree Sizes
(Figure: execution time vs. number of nodes, for the CPU and for 1, 4, and 8 GPUs.)
- Parallelization allows us to build increasingly complex trees
- On the CPU, execution time nearly doubles as the tree grows from 31 to 55 nodes
- Using more devices leads to a slower increase in execution time

Conclusion
- Proposed a knowledge discovery framework combining genetic programming and GPGPU
  - GP-based discovery engine
  - GPGPU-based query engine
- Performs well on a real-world task
- Shows a strong linear relationship between speedup and the number of devices