Speeding Up Computation in Probabilistic Graphical Models using GPGPUs
1 Speeding Up Computation in Probabilistic Graphical Models using GPGPUs. Lu Zheng & Ole J. Mengshoel. GPU Technology Conference, San Jose, CA, March 20, 2013.
2 CMU in Silicon Valley Established 2002 Significant growth in the past 10 years
3 Overview. Bayesian networks (BNs): representation of a joint probability distribution; computation of marginals and most probable explanations; compilation to junction trees. GPU: speed up junction tree computation. Why? Parallel opportunities in junction trees: element-wise and arithmetic parallelism; thread optimization using machine learning. Increasing importance of parallel computing: speed matters (real-time requirements in applications); increasing size of data sets in machine learning (Big Data); junction tree computation can be used in the inner loop of EM.
4 Bayesian Networks and Probability. A Bayesian network is a compact representation of a joint probability distribution function P(V_1, ..., V_n). The representation follows the Bayesian network structure: P(V_1, ..., V_n) = P(V_n | V_1, ..., V_{n-1}) ··· P(V_2 | V_1) P(V_1) = Π_{i=1}^n P(V_i | pa(V_i)), where pa(V_i) gives the parents of V_i in the graph. For our example (A → B, A → C, {B, C} → D, D → E): P(A, B, C, D, E) = P(E | A, B, C, D) P(D | A, B, C) P(C | A, B) P(B | A) P(A) = P(E | D) P(D | B, C) P(C | A) P(B | A) P(A).
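The factorization above can be sketched in a few lines of code. This is an illustrative toy, not the talk's implementation: the structure comes from the example network, but all CPT numbers are made up.

```python
import itertools

# Example network A -> B, A -> C, {B, C} -> D, D -> E; binary variables.
# CPT keys are (child_value, parent_values...); all numbers are invented.
parents = {'A': (), 'B': ('A',), 'C': ('A',), 'D': ('B', 'C'), 'E': ('D',)}
cpts = {
    'A': {(0,): 0.6, (1,): 0.4},
    'B': {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8},
    'C': {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.1, (1, 1): 0.9},
    'D': {(0, 0, 0): 0.99, (1, 0, 0): 0.01, (0, 0, 1): 0.4, (1, 0, 1): 0.6,
          (0, 1, 0): 0.3, (1, 1, 0): 0.7, (0, 1, 1): 0.05, (1, 1, 1): 0.95},
    'E': {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7},
}

def joint(assignment):
    """P(v1,...,vn) = prod_i P(vi | pa(vi)) for one full assignment."""
    p = 1.0
    for var, table in cpts.items():
        key = (assignment[var],) + tuple(assignment[u] for u in parents[var])
        p *= table[key]
    return p
```

Summing `joint` over all 32 assignments returns 1.0, confirming the factored product is a valid joint distribution.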
5 Bayesian Network Applications. Diagnosis and monitoring (electrical power systems); spam filtering (Naïve Bayes); error correction coding; biological and medical research; computer security; sensor and information fusion; probabilistic risk assessment (fault trees); natural language understanding; intelligent tutoring systems.
6 Stereo Diagnosis - Structure. System hidden state: underlying causes of failure. System observable state: sensors or tests. If the CD does not spin, then there might be bad or no sound. If there's a media fault, then the CD might not spin.
7 Stereo Diagnosis - Probabilities Each BN node comes with a conditional probability table (CPT).
8 Stereo Diagnosis - Marginals. Marginal: posterior distribution P(X | e), where e is the evidence (input). Input: everything is wrong. Output: marginals suggest there is a power failure; everything else is OK. Input: LED is lighting; everything else is wrong. Output: marginals suggest there is a media failure; everything else is OK.
9 Stereo Diagnosis - Compilation. The Bayesian network over LEDFault, PowerFault (P), AmpFault (A), SpeakerFault (S), MediaFault (M), NoLight (G), BadSound (B) and NoSpin (N) is compiled into a junction tree. Separator: a table that is the intersection of its neighboring cliques, e.g. P between cliques containing PowerFault. Clique: a table exponential in its number of nodes, e.g. (P, N). Root clique: a designated clique. (Figure: the stereo BN and its compiled junction tree.)
10 Probabilistic Diagnosis Approach. Each health variable has at least two states ("healthy" and "faulty"), thus admitting the diagnosis of an arbitrary number of discrete faults. Off-line phase: offline generation of the Bayesian network (BN) from the system specification, then offline compilation into a junction tree (JT). On-line phase: online inference over sensor readings and commands; diagnosis via MLV, MPE, MAP. For diagnosis, we perform message passing in the junction tree to answer a probabilistic query over the health variables, to understand which components and/or sensors are in non-healthy states. (Figure: the stereo BN and its junction tree.)
11 Background Previous Work Parallel Message Passing Tuning GPU Parameters Summary. Junction Tree Algorithm: Bayesian network inference with a junction tree. Generate the junction tree (moralization and triangulation). Propagate the belief values. Infer the desired random variable B from the posterior distribution Pr(B | e). (Figure: junction tree with cliques (A,B,D), (B,C), (D,E), separators B and D, and query B = ?)
12 Computational Challenge for Junction Tree. Graph structure: a densely connected BN can result in big cliques in the junction tree; computational complexity increases exponentially with the treewidth, i.e. the size of the largest clique in an optimal triangulation. Random variables' value sets: high cardinality of the source discrete BN nodes. Algorithm: junction tree inference forms the inner loop of iterative algorithms like the EM algorithm. Thus: develop parallel computing techniques for junction tree inference.
13 Outline: 1 Background. 2 Previous Work. 3 Parallel Message Passing (Techniques, Experimental Results). 4 Tuning GPU Parameters (Techniques, Experimental Results). 5 Summary.
14 Related Work: Parallel Junction Tree. Graph-level parallelism: GraphLab [Low et al., 2010]; parallel Lauritzen-Spiegelhalter algorithm [Kozlov & Singh, 1996]. Node-level parallelism: pointer jumping [Namasivayam & Prasanna, 2006]; parallel node-level table operation primitives [Xia & Prasanna, 2007].
16 Message Passing. Marginalization: the potential table φ_{S_ik} of the separator is updated to φ'_{S_ik} by marginalizing the clique potential table φ_{X_i}: φ'_{S_ik} = Σ_{X_i \ S_ik} φ_{X_i}. Scattering: the potential table of C_k is updated using both the old and the new separator table: φ'_{X_k} = φ_{X_k} · φ'_{S_ik} / φ_{S_ik}.
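One message pass can be sketched with array operations. This is a minimal illustration on the deck's example tree (A,B,D) - [B] - (B,C) with binary variables; the table values are invented, and the old separator is taken to be consistent with the receiving clique.

```python
import numpy as np

phi_ABD = np.arange(1.0, 9.0).reshape(2, 2, 2) / 36.0  # clique (A, B, D)
phi_BC = np.array([[0.2, 0.3], [0.1, 0.4]])            # clique (B, C)
phi_B_old = phi_BC.sum(axis=1)                         # current separator table

# Marginalization: phi'_S = sum over X_i \ S (sum out A and D, keep B).
phi_B_new = phi_ABD.sum(axis=(0, 2))

# Scattering: phi'_Xk = phi_Xk * phi'_S / phi_S, broadcast over C.
phi_BC_new = phi_BC * (phi_B_new / phi_B_old)[:, None]
```

After the pass, the receiving clique agrees with the new separator: summing `phi_BC_new` over C recovers `phi_B_new`, which is exactly the consistency that belief propagation maintains.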
17 Parallelism in Message Passing. Contributions: identified the parallel opportunities in message passing (element-wise parallelism and arithmetic parallelism); proposed and implemented a parallel belief propagation algorithm; proposed a high-level technique, clique merging, to increase the parallel opportunities. (Figure: message passing over the junction tree (A,B,D), (B,C), (D,E), distributed across threads 1..N.)
18 Element-wise Parallelism. Message passing: (A, B, D) → (B) → (B, C). Potential table operations: marginalization φ'(B) = Σ_{A,D} φ(A, B, D); scattering φ'(B, C) = φ(B, C) · φ'(B) / φ(B). (Figure: index mappings from cliques ABD and BC to the separator B.)
19 Element-wise Parallelism: Details. The sepset-cluster mapping table [Huang & Darwiche, 1996] records the indices of the cells in the clique potential table that map to the same separator potential table cell, providing data separation. All sepset-cluster mapping tables can be processed in parallel.
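A minimal sketch of how such a mapping table can be built, assuming row-major flat indexing of the clique table; the variable names and cardinalities are illustrative, not the paper's data structures.

```python
from itertools import product

def mapping_table(clique_vars, sep_vars, card):
    """Group flat clique-table indices by the separator cell they map to."""
    groups = {}
    for flat, assign in enumerate(product(*(range(card[v]) for v in clique_vars))):
        sep_key = tuple(assign[clique_vars.index(v)] for v in sep_vars)
        groups.setdefault(sep_key, []).append(flat)
    return groups

# Clique (A, B, D) mapped onto separator (B): two groups of four cells each.
# Each group can be reduced (marginalized) by an independent set of threads,
# which is the data separation that enables element-wise parallelism.
groups = mapping_table(('A', 'B', 'D'), ('B',), {'A': 2, 'B': 2, 'D': 2})
```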
20 Arithmetic Parallelism. Parallelize the summation in marginalization; parallelize the multiplication in scattering. (Figure: lookup tables relating separator cells to the cells of cliques i and k; the sum over clique i's table and the multiplication into clique k's table each run in parallel.)
21 Analysis of Message Passing Time. Consider a message pass from clique i to clique k. Let φ_{X_i}, φ_{X_k} and φ_S be the potential tables of clique i, clique k and the separator, respectively; |·| is the cardinality (table size) operator; p is the number of threads assigned to the multiplication parallelism. The time complexity of this message pass is log|φ_{X_i}| − log|φ_S| + 2|φ_{X_k}| / (|φ_S| · p).
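A hedged reading of this cost model: the first term is the depth of a parallel reduction over the |φ_{X_i}|/|φ_S| clique cells mapping to each separator cell, and the second is the scatter work per separator cell split across p threads. A toy evaluation (function name and inputs are illustrative):

```python
import math

def message_cost(size_xi, size_xk, size_s, p):
    """t = log|phi_Xi| - log|phi_S| + 2*|phi_Xk| / (|phi_S| * p)."""
    return math.log2(size_xi) - math.log2(size_s) + 2.0 * size_xk / (size_s * p)

# E.g. a 1024-cell sending clique, 512-cell receiving clique, 64-cell
# separator, and 8 multiplication threads: 10 - 6 + 2*512/512 = 6 steps.
cost = message_cost(1024, 512, 64, 8)
```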
22 Analysis of Propagation Time. Let Ne(C_i) be the neighbors of C_i in the junction tree, and suppose the kernel overhead is a constant τ. The time complexity of belief propagation over n cliques is 2(n − 1)τ + Σ_i Σ_{k ∈ Ne(C_i)} ( log|φ_{X_i}| − log|φ_S| + 2|φ_{X_k}| / (|φ_S| · p) ).
23 Parallelism Challenges. When φ_{S_ik} and the mapping tables μ_{X_i,S_ik}(n) are small, there is not enough parallel opportunity to occupy the concurrency resources provided by the computing platform. (Figure: histograms of separator table sizes for the Pigs and Munin networks, examples of Bayesian networks that consist of small cliques and separators.)
24 Clique Merging: Handling Challenges. Theorem: two neighboring cliques C_i and C_j in a junction tree J with separator S_ij can be merged into an equivalent new clique C_ij with the potential function φ(X_{C_ij}) = φ(X_{C_i}) · φ(X_{C_j}) / φ(X_{S_ij}), while keeping all other parts of the junction tree unchanged. (Figure: cliques C and D merged into a single clique C&D.)
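The merging theorem can be checked numerically on a toy pair of cliques (A,B) and (B,C) with separator (B). The tables below are invented but chosen consistent (both cliques marginalize to the same separator), so the merged potential should marginalize back to each original clique.

```python
import numpy as np

phi_AB = np.array([[0.1, 0.4], [0.3, 0.2]])  # clique (A, B)
phi_BC = np.array([[0.3, 0.1], [0.2, 0.4]])  # clique (B, C)
phi_B = phi_AB.sum(axis=0)                   # separator, consistent with both

# phi(X_Cij) = phi(X_Ci) * phi(X_Cj) / phi(X_Sij), broadcast over (A, B, C).
phi_ABC = phi_AB[:, :, None] * phi_BC[None, :, :] / phi_B[None, :, None]
```

Summing the merged table over C recovers φ(A, B), and summing over A recovers φ(B, C), so the merged clique is equivalent to the original pair.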
25 Experimental Setup. Data sets: implementations are tested on a number of Bayesian networks from different domains, with varying structures and state spaces. Platform: NVIDIA GeForce GTX 460 (336 processing cores; 48 KB shared memory per block; 785 MB global memory; 90 GB/s peak memory bandwidth) and Intel Core 2 Quad CPU (4 cores; 2.5 GHz clock; 8 MB cache; 9 GB memory).
26 CPU versus GPU: Experiments. Performance comparison using element-wise parallelism, arithmetic parallelism and clique merging. The speedups for the different Bayesian networks range from 1.82x to 11.94x, with an average of 5.54x. (Figure: CPU versus GPU execution times for Pigs, Munin2, Munin3, Munin4, Mildew, Water, Barley and Diabetes.)
27 Results with AP, EP and CM (1). Datasets: Mildew, Water, Barley, Diabetes. Reported per dataset: number of JT nodes before and after merging; average CPT size before and after merging; average SPT size before and after merging; GPU time before and after merging [ms]; merging speedup (1.02x, 1.01x, 1.01x, 1.02x); CPU time [ms]; and CPU-versus-GPU speedup (8.59x, 7.35x, 11.94x, 5.81x).
28 Results with AP, EP and CM (2). Datasets: Pigs, Munin2, Munin3, Munin4. Average CPT size in the original JT: 1,972; 5,653; 3,443; 16,444. Average CPT size after merging: 5,393; 10,191; 7,374; 26,720. Merging speedup: 1.36x, 1.14x, 1.14x, 1.10x. CPU-versus-GPU speedup: 2.26x, 2.43x, 1.82x, 3.35x.
30 Difficulty in GPU Parameter Tuning (PT). Matching platform concurrency with the parallel opportunity in message passing. Irregular topology of the junction tree: high variability in workload and usage patterns. Limited thread availability for the two dimensions of parallelism. The GPU architecture evolves rapidly.
31 GPU Performance for Large Separators. Large separators: great potential for many sepset-cluster mapping tables, hence large element-wise parallel opportunity. (Figure, Example I: execution time while varying thread block size (TBS), AP and EP.)
32 GPU Performance for Small Separators. Small separators: little potential for many sepset-cluster mapping tables, hence small element-wise parallel opportunity. (Figure, Example II: varying TBS, AP, EP.)
33 GPU Performance for Medium Separators. Medium separators? (Figure, Example III: varying TBS, AP, EP.)
34 Regression Models for GPU. Goal: build regression models to emulate the GPU performance space, and use the regression models to optimize the parameter setting. Features used in the regression models: |φ_S|, size of the separator potential table; |φ_X|, size of the clique potential table; K, size of the GPU thread block; p_a, number of threads for arithmetic parallelism.
35 Polynomial Regression. Polynomial model + lasso: β̂_lasso = argmin_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p β_j Poly_j(x_i) )², subject to Σ_{j=1}^p |β_j| ≤ t. Equivalent Lagrangian form: β̂_lasso = argmin_β Σ_{i=1}^N ( y_i − β_0 − Σ_{j=1}^p β_j Poly_j(x_i) )² + λ Σ_{j=1}^p |β_j|.
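A small illustration of the lasso penalty's effect (not the talk's actual model): in the special case of an orthonormal design, the Lagrangian lasso solution is obtained by soft-thresholding the least-squares coefficients with a threshold proportional to λ, which is what drives coefficients exactly to zero.

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Shrink least-squares coefficients toward zero; clip small ones to 0."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# Coefficients smaller in magnitude than the threshold are zeroed out,
# producing the sparse models selected by larger lambda values.
shrunk = soft_threshold(np.array([3.0, -0.5, 1.2]), 1.0)
```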
36 Support Vector Regression (SVR). Model: f(x, ω) = Σ_{j=1}^m ω_j g_j(x) + b. Loss function (ε-insensitive): L_ω(y, f(x, ω)) = 0 if |y − f(x, ω)| ≤ ε, and |y − f(x, ω)| − ε otherwise.
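The ε-insensitive loss above is easy to state directly in code: errors inside the ε-tube around the prediction cost nothing, and errors outside it grow linearly.

```python
def eps_insensitive_loss(y, f, eps=0.1):
    """SVR epsilon-insensitive loss: zero inside the tube, linear outside."""
    r = abs(y - f)
    return 0.0 if r <= eps else r - eps
```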
37 Metrics for Regression Model Quality. Traditional metric, Residual Squared Sum (RSS): RSS = Σ_{i=1}^n (y_i − f(x_i))². Squared Deviance (SD) from the minimum-variance (best) setting: SD = Σ_{|φ_S|, |φ_X|} ( T(|φ_S|, |φ_X|, K̂, p̂_a) − T* )², where T* is the minimum execution time. Miss Rate (MR): MR = (1/N) Σ_{|φ_S|, |φ_X|} 1( K̂ ≠ K or p̂_a ≠ p_a ), where K and p_a are the optimal parameter values.
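A sketch of the miss-rate metric: the fraction of configurations where the regression-predicted GPU parameters differ from the empirically optimal ones. The data below is made up for illustration; the parameter pairs stand in for (thread block size K, arithmetic-parallelism threads p_a).

```python
def miss_rate(predicted, optimal):
    """MR = fraction of (K_hat, pa_hat) pairs that miss the optimal (K, pa)."""
    misses = sum(1 for (k_hat, pa_hat), (k, pa) in zip(predicted, optimal)
                 if k_hat != k or pa_hat != pa)
    return misses / len(predicted)

# One of three predictions picks the wrong arithmetic-parallelism setting.
mr = miss_rate([(32, 4), (64, 8), (128, 2)],
               [(32, 4), (64, 4), (128, 2)])
```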
38 Regression Model Evaluation. Models evaluated: Lasso(λ = 0), Lasso(λ = 1se) and SVR, each fitted to GPU reduction ("Reduce") and scattering ("Scatter") execution times, and compared by Residual Squared Sum (RSS), Squared Deviance (SD) and Miss Rate (MR).
39 Regression Example: Large Separator. (Figure, Example I, marginalization: actual, lasso and SVR execution times as a function of p_tot and p_e.)
40 Regression Example: Small Separator. (Figure, Example II, marginalization: actual, lasso and SVR execution times as a function of p_tot and p_e.)
41 Regression Example: Medium Separator. (Figure, Example III, marginalization: actual, lasso and SVR execution times as a function of p_tot and p_e.)
42 GPU Execution Times. GPU execution times with EP + AP + PT, comparing 0-threshold GPU parameters, 1-threshold GPU parameters, GPU parameters with the lasso model (λ = 1se), GPU parameters with the lasso model (λ = 0), and GPU parameters with SVR, on Pigs, Water, Munin2, Munin3, Munin4, Mildew, Barley and Diabetes. SVR works best in these experiments.
43 Results with Regression Parameter Tuning (1). Manual parameter tuning versus regression parameter tuning for Pigs, Water, Munin2, Munin3 and Munin4. Reported per dataset: number of JT nodes; average CPT and SPT sizes; execution times [ms] for the 0-threshold, 1-threshold, SVR, Lasso(λ = 0) and Lasso(λ = 1se) settings; GPU and CPU times [ms]; and SVR/CPU speedup (3.72x, 12.68x, 4.80x, 3.89x, 7.04x).
44 Results with Regression Parameter Tuning (2). Manual parameter tuning versus regression parameter tuning for Mildew, Barley and Diabetes. Reported per dataset: number of JT nodes; average CPT and SPT sizes; execution times [ms] for the 0-threshold, 1-threshold, SVR, Lasso(λ = 0) and Lasso(λ = 1se) settings; GPU and CPU times [ms]; and SVR/CPU speedup (19.90x, 21.19x, 12.36x).
46 Summary. Parallel belief propagation: element-wise parallelism; arithmetic parallelism; clique merging. Parameter tuning.
47 Thanks for listening! Questions?
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Bayesian Networks Directed Acyclic Graph (DAG) Bayesian Networks General Factorization Bayesian Curve Fitting (1) Polynomial Bayesian
More informationAv. Prof. Mello Moraes, 2231, , São Paulo, SP - Brazil
" Generalizing Variable Elimination in Bayesian Networks FABIO GAGLIARDI COZMAN Escola Politécnica, University of São Paulo Av Prof Mello Moraes, 31, 05508-900, São Paulo, SP - Brazil fgcozman@uspbr Abstract
More informationMachine Learning A W 1sst KU. b) [1 P] For the probability distribution P (A, B, C, D) with the factorization
Machine Learning A 708.064 13W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence a) [1 P] For the probability distribution P (A, B, C, D) with the factorization P (A, B,
More informationBuilding Classifiers using Bayesian Networks
Building Classifiers using Bayesian Networks Nir Friedman and Moises Goldszmidt 1997 Presented by Brian Collins and Lukas Seitlinger Paper Summary The Naive Bayes classifier has reasonable performance
More informationK-Means and Gaussian Mixture Models
K-Means and Gaussian Mixture Models David Rosenberg New York University June 15, 2015 David Rosenberg (New York University) DS-GA 1003 June 15, 2015 1 / 43 K-Means Clustering Example: Old Faithful Geyser
More informationECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning
ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning Topics Markov Random Fields: Inference Exact: VE Exact+Approximate: BP Readings: Barber 5 Dhruv Batra
More informationComputational Intelligence
Computational Intelligence A Logical Approach Problems for Chapter 10 Here are some problems to help you understand the material in Computational Intelligence: A Logical Approach. They are designed to
More information6 : Factor Graphs, Message Passing and Junction Trees
10-708: Probabilistic Graphical Models 10-708, Spring 2018 6 : Factor Graphs, Message Passing and Junction Trees Lecturer: Kayhan Batmanghelich Scribes: Sarthak Garg 1 Factor Graphs Factor Graphs are graphical
More informationVisual Analysis of Lagrangian Particle Data from Combustion Simulations
Visual Analysis of Lagrangian Particle Data from Combustion Simulations Hongfeng Yu Sandia National Laboratories, CA Ultrascale Visualization Workshop, SC11 Nov 13 2011, Seattle, WA Joint work with Jishang
More informationMachine Learning. Chao Lan
Machine Learning Chao Lan Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian
More informationScene Grammars, Factor Graphs, and Belief Propagation
Scene Grammars, Factor Graphs, and Belief Propagation Pedro Felzenszwalb Brown University Joint work with Jeroen Chua Probabilistic Scene Grammars General purpose framework for image understanding and
More informationMachine Learning A WS15/16 1sst KU Version: January 11, b) [1 P] For the probability distribution P (A, B, C, D) with the factorization
Machine Learning A 708.064 WS15/16 1sst KU Version: January 11, 2016 Exercises Problems marked with * are optional. 1 Conditional Independence I [3 P] a) [1 P] For the probability distribution P (A, B,
More informationChallenges Simulating Real Fuel Combustion Kinetics: The Role of GPUs
Challenges Simulating Real Fuel Combustion Kinetics: The Role of GPUs M. J. McNenly and R. A. Whitesides GPU Technology Conference March 27, 2014 San Jose, CA LLNL-PRES-652254! This work performed under
More informationTreewidth: Preprocessing and Kernelization
Treewidth: Preprocessing and Kernelization Hans L. Bodlaender Joint work with Arie Koster, Frank van den Eijkhof, Bart Jansen, Stefan Kratsch, Vincent Kreuzen 1 This talk Survey of work on preprocessing
More informationCS 188: Artificial Intelligence
CS 188: Artificial Intelligence Bayes Nets: Inference Instructors: Dan Klein and Pieter Abbeel --- University of California, Berkeley [These slides were created by Dan Klein and Pieter Abbeel for CS188
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University April 1, 2019 Today: Inference in graphical models Learning graphical models Readings: Bishop chapter 8 Bayesian
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationScene Grammars, Factor Graphs, and Belief Propagation
Scene Grammars, Factor Graphs, and Belief Propagation Pedro Felzenszwalb Brown University Joint work with Jeroen Chua Probabilistic Scene Grammars General purpose framework for image understanding and
More informationUnsupervised Learning. CS 3793/5233 Artificial Intelligence Unsupervised Learning 1
Unsupervised CS 3793/5233 Artificial Intelligence Unsupervised 1 EM k-means Procedure Data Random Assignment Assign 1 Assign 2 Soft k-means In clustering, the target feature is not given. Goal: Construct
More informationLearning the Structure of Sum-Product Networks. Robert Gens Pedro Domingos
Learning the Structure of Sum-Product Networks Robert Gens Pedro Domingos w 20 10x O(n) X Y LL PLL CLL CMLL Motivation SPN Structure Experiments Review Learning Graphical Models Representation Inference
More informationScalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009
Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Reasoning Explicit State machine, serial Implicit Compute intensive Fits SIMT well Collision avoidance Motivation
More informationUtilizing Device Behavior in Structure-Based Diagnosis
Utilizing Device Behavior in Structure-Based Diagnosis Adnan Darwiche Cognitive Systems Laboratory Department of Computer Science University of California Los Angeles, CA 90024 darwiche @cs. ucla. edu
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 2, 2012 Today: Graphical models Bayes Nets: Representing distributions Conditional independencies
More informationCS 664 Flexible Templates. Daniel Huttenlocher
CS 664 Flexible Templates Daniel Huttenlocher Flexible Template Matching Pictorial structures Parts connected by springs and appearance models for each part Used for human bodies, faces Fischler&Elschlager,
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationEfficient Caching in Elimination Trees
Efficient Caching in Elimination Trees Kevin Grant and Michael C. Horsch epartment Of Computer Science University of Saskatchewan 176 Thorvaldson Building, 110 Science Place Saskatoon, Saskatchewan S7N
More informationParallel Evidence Propagation on Multicore Processors
Parallel Evidence Propagation on Multicore Processors Yinglong Xia 1, Xiaojun Feng 3 and Viktor K. Prasanna 2,1 Computer Science Department 1 and Department of Electrical Engineering 2 University of Southern
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 18, 2015 Today: Graphical models Bayes Nets: Representing distributions Conditional independencies
More informationA Comparison of Lauritzen-Spiegelhalter, Hugin, and Shenoy-Shafer Architectures for Computing Marginals of Probability Distributions
Appeared in: G. F. Cooper & S. Moral (eds.), Uncertainty in Artificial Intelligence, Vol. 14, 1999, pp. 328--337, Morgan Kaufmann, San Francisco, CA. A Comparison of Lauritzen-Spiegelhalter, Hugin, and
More informationAlgorithms for Markov Random Fields in Computer Vision
Algorithms for Markov Random Fields in Computer Vision Dan Huttenlocher November, 2003 (Joint work with Pedro Felzenszwalb) Random Field Broadly applicable stochastic model Collection of n sites S Hidden
More informationPreface to the Second Edition. Preface to the First Edition. 1 Introduction 1
Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches
More informationThe exam is closed book, closed notes except your one-page (two-sided) cheat sheet.
CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or
More informationTHE estimated cyber-security market is expected to grow
1 Exact Inference Techniques for the Analysis of Bayesian Attack Graphs Luis Muñoz-González, Daniele Sgandurra, Martín Barrère, and Emil C. Lupu arxiv:1510.02427v2 [cs.cr] 4 Nov 2016 Abstract Attack graphs
More information