Fast Knowledge Discovery in Time Series with GPGPU on Genetic Programming


Fast Knowledge Discovery in Time Series with GPGPU on Genetic Programming
Sungjoo Ha, Byung-Ro Moon
Optimization Lab, Computer Science, Seoul National University
GECCO 2015, July 13th, 2015

Outline
- Motivation
- Knowledge discovery in time series data
- Knowledge discovery engine
  - Discovery engine
  - Query engine
- CUDA execution/memory model
- Details of the parallel framework
- Experiments & results
- Conclusion

Knowledge Discovery
- The process of finding interesting patterns or structures in given data

Time Series Data
- A sequence of records, each containing a set of attributes and a timestamp
- May be collected from multiple sources that share common characteristics
- Examples of time series data: stock markets, scientific computing, monitoring metrics, IoT

Knowledge Discovery in Time Series
- Finding precursors to events of interest
- "If yesterday's closing price is less than yesterday's five-day moving average, then it is going to be a bull market tomorrow."
- "If the volume of a certain hashtag doubles in 10 minutes, then that hashtag becomes a new trend on Twitter."

Knowledge Discovery Engine
- Discovery engine: proposes candidate patterns (genetic programming)
- Query engine: matches patterns against the data (GPGPU)

Pattern
- A pattern is a conjunction of Boolean expressions
- An expression is a combination of constants, comparison/arithmetic/logical operators, and attributes
- May contain domain-specific functions
- Example: 1.1 * p_o(-1) < p_c(0) AND MA_20(0) > MA_60(0)
- Find patterns that indicate an occurrence of some event
- A pattern may not yield the same result every time; deal with the uncertainty by taking the expectation over the events

Discovery Engine
- Standard GP; any additional features appropriate for the domain may be included
- The fitness of a pattern should reflect how interesting the events associated with the pattern are
- Example: profitability of a technical pattern r
- Expected earning rate of r after k trading days:
  E_k[r] = (1 / |R(r)|) * sum over (i, j) in R(r) of p_c(i, j+k) / p_c(i, j)
- R(r) = {(i, j) : r matches company i on trading day j}
- p_c(i, j) is the closing price of company i on trading day j
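The expected earning rate above can be sketched on the CPU as follows; the data layout (a per-company list of closing prices) and the function name are illustrative assumptions, not the paper's implementation.

```python
def expected_earning_rate(matches, closing, k):
    """E_k[r]: mean ratio p_c(i, j+k) / p_c(i, j) over all matches.

    matches: the set R(r) of (company, day) pairs where pattern r matched.
    closing: closing[i][j] = closing price of company i on day j.
    k: holding period in trading days.
    """
    # Illustrative sketch; assumes every match still has a price k days later.
    total = sum(closing[i][j + k] / closing[i][j] for (i, j) in matches)
    return total / len(matches)

# Usage: one hypothetical company whose price goes 100 -> 110 over k=2 days.
closing = {"IBM": [100.0, 105.0, 110.0]}
rate = expected_earning_rate({("IBM", 0)}, closing, k=2)  # 110/100 = 1.1
```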

Parallel GP
- Only parallelize the fitness evaluation
  - Low complexity, simpler code, flexible GP
- Fitness evaluation is the most time-consuming part, especially for hybrid GP

CUDA Execution Model
- A kernel is executed in parallel by many threads
- Threads form blocks, and blocks form a grid
(Figure: a grid of blocks, each block containing multiple threads.)
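The block/thread hierarchy gives every thread a unique global index; a minimal Python sketch of the standard 1-D index computation (the names mirror CUDA's built-in variables):

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """CUDA's 1-D global index: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

# Thread 5 inside block 2, with 256 threads per block:
gid = global_thread_id(2, 256, 5)  # 517
```

In the query engine, such an index would select which (company, day) position a thread evaluates.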

CUDA Memory Model
- Per-thread local memory: registers (fast) and local memory (resides in global memory, slow)
- Threads within a block can communicate through shared memory
- For global synchronization, launch multiple kernels
- Global memory and constant memory are visible to all threads
(Figure: per-thread local memory, per-block shared memory, and grid-wide global memory.)

Parallel Framework: Big Picture
(Figure: the discovery engine produces a pattern such as "p_c(0) p_o(-1) >", which the query engine holds in constant memory; opening and closing prices live in global memory, and each thread evaluates the pattern at its own time index using a small stack.)

Parallel Framework: Storing the Pattern
- The discovery engine generates a candidate pattern
- The pattern is stored in constant memory in postfix notation
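Flattening an expression tree into the postfix token stream that gets copied to constant memory can be sketched like this; the tuple-based tree encoding is an assumption for illustration.

```python
def to_postfix(node):
    """Flatten an expression tree into a postfix token list.

    A node is either a leaf token (attribute name or numeric constant)
    or a tuple (operator, left_subtree, right_subtree).
    """
    if not isinstance(node, tuple):
        return [node]  # constant or attribute leaf
    op, left, right = node
    return to_postfix(left) + to_postfix(right) + [op]

# The example sub-pattern 1.1 * p_o(-1) < p_c(0) becomes:
tree = ("<", ("*", 1.1, "p_o(-1)"), "p_c(0)")
tokens = to_postfix(tree)  # [1.1, 'p_o(-1)', '*', 'p_c(0)', '<']
```

Postfix order lets the GPU evaluate the pattern with a simple stack and no recursion.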

Parallel Framework: Storing the Time Series
- The time series data is stored in global memory

Memory Utilization
- Each time series is stored as a 1-D array per field, arranged along the time axis
- Threads jump along the time axis as they process the data
- This yields a better memory access pattern
- Example: field 1 = opening prices, field 2 = closing prices; first index = company, second index = day
(Figure: Field1 and Field2 laid out as contiguous per-company rows, e.g. Field1[0][0..3] holds IBM's opening prices for days 0-3, Field1[1][*] APPL's, Field1[2][*] MSFT's.)
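The per-field 1-D layout can be sketched with a flat array and an index helper; the sizes and helper name are hypothetical, but they show how one company's values sit contiguously along the time axis.

```python
NUM_DAYS = 4  # hypothetical length of the time axis

def idx(company, day):
    """Flat index into a per-field 1-D array: first index selects the
    company, second the day, contiguous along the time axis."""
    return company * NUM_DAYS + day

# field1 = opening prices for 3 companies x 4 days, one flat array.
field1 = [0.0] * (3 * NUM_DAYS)
field1[idx(0, 2)] = 101.5  # opening price of company 0 on day 2

# A thread assigned to company 0 walks contiguous memory along time:
row = [idx(0, d) for d in range(NUM_DAYS)]  # [0, 1, 2, 3]
```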

Parallel Framework: Evaluating the Pattern
- The postfix pattern is evaluated using a stack
- Shared memory vs. local memory for the stack
- Local memory may lead to better overall performance, even though shared memory is much faster
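A CPU sketch of the stack-based postfix evaluation each GPU thread performs for one (company, day) position; the token and attribute representations are illustrative assumptions, and only a few operators are shown.

```python
def eval_postfix(tokens, attrs):
    """Evaluate a postfix pattern with an explicit stack.

    tokens: postfix list of numeric constants, attribute names, operators.
    attrs: attribute values at this position, e.g. {'p_c(0)': 11.0}.
    """
    ops = {
        "+": lambda a, b: a + b,
        "*": lambda a, b: a * b,
        "<": lambda a, b: a < b,
        ">": lambda a, b: a > b,
        "and": lambda a, b: a and b,
    }
    stack = []
    for tok in tokens:
        if tok in ops:                 # operator: pop two, push result
            b = stack.pop()
            a = stack.pop()
            stack.append(ops[tok](a, b))
        elif isinstance(tok, str):
            stack.append(attrs[tok])   # attribute lookup
        else:
            stack.append(tok)          # numeric constant
    return stack.pop()

# "p_c(0) p_o(-1) >": does today's close exceed yesterday's open?
result = eval_postfix(["p_c(0)", "p_o(-1)", ">"],
                      {"p_c(0)": 11.0, "p_o(-1)": 10.0})  # True
```

On the GPU the `stack` array would live in local or shared memory, which is exactly the trade-off measured in the results.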

Parallel Framework: Per-Block Partial Results
- Per-block partial results are computed in shared memory using parallel reduction
- If multiple input sources are needed to calculate the partial results:
  - Accumulate the temporary values in global memory
  - Launch a separate kernel for the reduction
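The shared-memory parallel reduction can be sketched sequentially as a halving loop; on the GPU the inner additions run across threads with a barrier between steps. The function below is an illustrative Python model, not the CUDA kernel itself.

```python
def block_reduce(values):
    """Tree-style sum of one block's partial results.

    Mirrors the shared-memory halving loop: at each step, thread t adds
    element t + stride into element t, then the stride halves.
    Assumes len(values) is a power of two (pad with zeros otherwise).
    """
    vals = list(values)
    stride = len(vals) // 2
    while stride > 0:
        for t in range(stride):   # these additions run in parallel on the GPU
            vals[t] += vals[t + stride]
        stride //= 2              # a barrier (__syncthreads) separates steps
    return vals[0]

block_total = block_reduce([1, 2, 3, 4, 5, 6, 7, 8])  # 36
```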

Parallel Framework: Recap
(Figure: the big-picture diagram repeated — pattern in constant memory, prices in global memory, per-thread evaluation stack.)

Experiments
- Stock market technical pattern mining: find patterns with high profitability
- Korean stock market data from 2000 to 2014
- Genetic programming setup:
  - Steady state, 500 individuals
  - Random initialization
  - Roulette-wheel selection
  - Crossover: swap randomly chosen subtrees
  - Mutation: replace a random subtree
  - Local optimization: traverse each node and adjust the constants
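The subtree-swap crossover can be sketched as below; the tuple-based tree encoding and helper names are illustrative assumptions, not the authors' code.

```python
import random

def subtrees(node, path=()):
    """List all (path, subtree) pairs; internal nodes are (op, left, right)."""
    out = [(path, node)]
    if isinstance(node, tuple):
        _, left, right = node
        out += subtrees(left, path + (1,))
        out += subtrees(right, path + (2,))
    return out

def replace(node, path, new):
    """Return a copy of the tree with the subtree at `path` replaced."""
    if not path:
        return new
    op, left, right = node
    if path[0] == 1:
        return (op, replace(left, path[1:], new), right)
    return (op, left, replace(right, path[1:], new))

def crossover(a, b, rng=random):
    """Swap randomly chosen subtrees between two parent trees."""
    pa, sa = rng.choice(subtrees(a))
    pb, sb = rng.choice(subtrees(b))
    return replace(a, pa, sb), replace(b, pb, sa)

parent1 = ("and", ("<", "p_o(-1)", "p_c(0)"), (">", "MA_20(0)", "MA_60(0)"))
parent2 = (">", "volume(0)", 1000.0)
child1, child2 = crossover(parent1, parent2)
```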

Results
- A typical GP run of 50 generations using 8 CUDA devices takes 120 to 180 seconds
- Roughly 17,000 to 21,000 fitness evaluations are performed
- Local optimization takes more than 97% of the time

Results: Single Fitness Evaluation
(Figure: two plots against the number of devices, from the CPU baseline up to 8 GPUs — relative speedup over the CPU, and mean execution time in ms for the GPU-only runs.)
- Strong linear relationship between the relative speed and the number of GPU devices
- Roughly an 80% gain per additional device

Results: Local vs. Shared Memory for the Stack

Relative speed (mean and standard deviation):

  # Devices   Local (mean, sd)   Shared (mean, sd)
  CPU            1.0,  0.0          1.0,  0.0
  1 GPU         56.9,  3.8         35.8,  2.8
  2 GPU        101.7,  7.2         65.0,  5.5
  3 GPU        132.7, 11.1         88.0,  8.1
  4 GPU        157.1, 15.5        107.2, 10.7
  5 GPU        191.7, 19.9        130.3, 13.7
  6 GPU        222.0, 24.5        151.5, 16.6
  7 GPU        247.5, 29.3        172.2, 19.5
  8 GPU        277.0, 36.8        188.7, 22.4

- Using local memory for the stack performs better than shared memory for this task
- Freeing up shared memory leads to higher occupancy of the multiprocessors
- The high latency of global memory is hidden

Results: Tree Sizes
(Figure: execution time vs. number of nodes, for the CPU and for 1, 4, and 8 GPUs.)
- Parallelization allows us to build increasingly complex trees
- On the CPU, execution time nearly doubles as the tree grows from 31 to 55 nodes
- Using more devices leads to a slower increase in execution time

Conclusion
- Proposed a knowledge discovery framework combining genetic programming and GPGPU
  - GP-based discovery engine
  - GPGPU-based query engine
- Performs well on a real-world task
- Shows a strong linear relationship between speedup and the number of devices