Large Scale Parallel Monte Carlo Tree Search on GPU


1 Large Scale Parallel Monte Carlo Tree Search on GPU. Kamil Rocki, The University of Tokyo, Graduate School of Information Science and Technology, Department of Computer Science

2 Tree search. Finding a solution by visiting the nodes of a tree/graph data structure in a systematic way. Typical search problems: shortest path (e.g. Dijkstra, A*), traveling salesman problem, min-max game-tree search (depth in plies), SAT-solver branching. complexity = f(branching, depth)

3 Typical Applications. Planning, scheduling. Discrete optimization. Database search. Games/puzzles (AI). Games and puzzles are a good testbed for tree-search algorithms: - complexity - easy to measure results - the algorithm should work in other domains

4 Tree search (games) depth branching complexity = f(branching, depth) 4

5 Problem statement. Game tree search. Nodes represent <state, actions> pairs. Look ahead to find the best move (minimizing cost / maximizing reward). Search strategy - defines the way the nodes are expanded. Class of problems - decision making

6 Analyzed problem: Reversi. The game tree represents a sequence of moves. Non-uniform tree (variable number of children). Depth up to 60. Average number of leaves. (Figures: initial position (A), example final position (B).)

7 Problem complexity. Exhaustive search, assuming only 1 ns per state (in Reversi): 11 moves ahead ~ 21 s, 12 moves ahead ~ 3 m, 13 moves ahead ~ 27 m, 14 moves ahead ~ 4 h, 15 moves ahead ~ 1.45 d
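The growth above can be sketched with a back-of-the-envelope calculation. The branching factor b = 9 below is an assumption (the slide's own timings imply roughly 8-9 for Reversi), not a figure from the talk.

```python
# Hypothetical estimate: exhaustive d-ply search visits on the order of b^d
# states; at 1 ns per state, the wall time explodes with depth. b = 9 is an
# assumed average branching factor, not a value from the slides.
def exhaustive_search_seconds(branching, depth, ns_per_state=1.0):
    states = sum(branching ** d for d in range(1, depth + 1))
    return states * ns_per_state * 1e-9

for depth in range(11, 16):
    print(depth, "moves ahead:", round(exhaustive_search_seconds(9, depth)), "s")
```

Each extra ply multiplies the cost by roughly the branching factor, which is why the slide's times jump from seconds to days over four plies.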

8 Parallel systems - CPU vs GPU. CPU clock speed has been fairly static recently - 3~3.4 GHz. Due to the rise of multi-core machines, the computational power has still been increasing. GPUs are far ahead of CPUs in terms of FLOPS

9 CPU vs GPU - the GPU is dedicated to computation

10 Goal. Utilize highly parallel (GPU / CPU+GPU) systems to search trees efficiently. Increase in the share of CPU+GPU systems (TOP500 list). But GPU programming is challenging

11 Taxonomy. SISD - Single Instruction, Single Data. SIMD - Single Instruction, Multiple Data. MISD - Multiple Instruction, Single Data. MIMD - Multiple Instruction, Multiple Data

12 SIMD processing - threads in SIMD groups. Flynn's taxonomy (1966). PU - Processing Unit

13 SIMD processing - why a problem? Threads execute in lockstep: a[i] = b[i] + c[i]; if (a[i] > 0) a[i] = a[i]*a[i]; else a[i] = 0; - threads on the branch not currently executed must WAIT. Tree search is much more complex than this
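The cost of that WAIT can be shown with a toy cost model (plain Python, not GPU code, and a deliberate simplification): lanes of a warp run in lockstep, so when they disagree on a branch, the warp executes both paths serially.

```python
# Toy model of warp divergence: if all lanes agree, only one path runs;
# if they disagree, the warp pays for both paths back to back.
def warp_branch_cost(lane_conditions, then_cost, else_cost):
    if all(lane_conditions):        # coherent warp: only the "then" path runs
        return then_cost
    if not any(lane_conditions):    # coherent warp: only the "else" path runs
        return else_cost
    return then_cost + else_cost    # divergent warp: both paths serialized

coherent = warp_branch_cost([True] * 32, 5, 1)
divergent = warp_branch_cost([i % 2 == 0 for i in range(32)], 5, 1)
print(coherent, divergent)  # the divergent warp pays for both branches
```

Tree search produces many data-dependent branches of very different lengths, which is why it maps so poorly onto this execution model.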

14 Thread divergence problem. (Figure, shown at the 2nd, 5th and 6th iteration: two threads execute inner loops of different lengths, so one thread waits while the other keeps iterating.)

17 GPU Processing. 1. Data needs to be transferred from the host's memory to the GPU 2. The program (kernel) is executed 3. Output data is transferred back from the GPU to the host's memory. Each transfer adds latency

20 Hardware. Global memory (big - GBs, visible to all threads): slow. Local memory (max 512 kB per thread): slow. Registers (per thread - max 32k per MP): fastest. Shared memory (per block - max 48 kB per MP): fast. New GPUs (2010-) also have cache. Architecture awareness is required

21 Software Model. Host = CPU, Device = GPU. Thread = basic execution unit = SP. Warp = scheduling unit = SIMD (1 warp = 32 threads, currently). Block = batch of threads sharing an MP. Grid = batch of blocks = MIMD. Programmability: Thread, Warp, Block, Grid

22 GPU programming difficulties. Easy to write a simple program - just port the code from the CPU - but the performance is bad. Algorithms need to be rethought and reimplemented - the implementation itself can be challenging (GPU hardware, CUDA software). Hard to achieve good performance (high parallelism). Memory hierarchy/constraints. SIMD processing - warp divergence. Limited communication. Latency. Architecture awareness. Programmability

23 Monte Carlo Tree Search 23

24 What is Monte Carlo Tree Search (MCTS)? A method for making optimal decisions in AI problems. Can theoretically be applied to any domain described by {state, action}. Games, e.g. Go, Reversi + some other difficult games - Coulom (2006, 2008). Optimization - Gaudel, Sebag (2010). Decision support systems - Chaslot et al. (2006). Alternative for dynamic programming, Markov models - Teytaud et al. (2008). Alternative for A*/IDA* algorithms - Single-Player MCTS - Schadd et al. (2008). Constraint satisfaction problems - Baba et al. (2011)

25 MCTS - Coulom (2006), UCT - Kocsis and Szepesvári (2006). The basic MCTS algorithm is simple. Repeated X times: Selection - the selection function is applied recursively until a leaf of the tree is reached; Expansion - one (or more) leaf nodes are created; Simulation - one simulated game is played; Backpropagation - the result of this game is backpropagated in the tree. Standard UCT formula (selection step): mean value of the node (i.e. success/loss ratio) + C * sqrt( ln(total simulations) / number of times the node was visited ), where C is a tunable exploitation/exploration ratio factor
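The selection step above can be sketched in a few lines. The (wins, visits) pair per child and the value C = 1.4 are hypothetical stand-ins, not the thesis implementation.

```python
import math

# Sketch of UCT selection (Kocsis and Szepesvari 2006): pick the child
# maximizing mean value plus the exploration bonus C * sqrt(ln N / n).
def uct_score(wins, visits, parent_visits, c=1.4):
    if visits == 0:
        return float("inf")                  # try unvisited children first
    mean = wins / visits                     # exploitation term
    explore = c * math.sqrt(math.log(parent_visits) / visits)  # exploration term
    return mean + explore

def select_child(children, parent_visits, c=1.4):
    # children: list of (wins, visits); returns the index with the best UCT value
    return max(range(len(children)),
               key=lambda i: uct_score(*children[i], parent_visits, c))

print(select_child([(3, 5), (2, 3), (1, 3)], parent_visits=11))
```

Applied recursively from the root, this rule balances revisiting promising nodes against sampling neglected ones.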

26 Exploitation vs exploration High exploitation (greedy) High exploration Quickly gets to a local optimum Greater chance of finding the global optimum 26

27 MCTS - 2 parts. Tree building - stored in the memory. Simulating - 1. temporary, not remembered 2. done by the CPU or GPU 3. the results are used to affect the tree's expansion strategy. Final result: 0 or 1. (Figure: tree with win/visit counts such as 3/6, 3/5, 2/3, 1/3.)

28 MCTS components. Tree construction - selection: choose the child with the highest evaluation of mean value of the node (i.e. success/loss ratio) + C * sqrt( ln(total simulations) / number of times the node was visited ); C - exploitation/exploration ratio factor, tunable. (Figure: tree with counts 3/5, 3/4, 1/3, 2/3.)

29 MCTS. Tree construction - expansion: new nodes are initialized with the value 1 win / 2 simulations. (Figure: tree with counts 3/5, 3/4, 1/3, 2/3 and the new node 1/2.)

30 MCTS. Simulation: a game is played out from the new node; only the result (here: 0) is remembered

31 MCTS. Backpropagation: the simulation result (0 successes / 1 simulation) is added to every node on the path back to the root. (Figure: 0/1 at the new node; counts on the path updated, the root becoming 3/6.)

32 MCTS - Coulom (2006), UCT - Kocsis and Szepesvári (2006). After X iterations - the decision-making step: choose the best option from the root's children, typically based on the average score (number of wins / number of simulations)

33 The features of MCTS. Aheuristic - no intermediate state evaluation is needed. Asymmetric tree growth. More iterations - stronger algorithm. Easy to parallelize

34 Parallel MCTS Schemes Chaslot et al. (2008) Complex, not efficient Easy Efficient 34

35 Parallel MCTS Schemes Performance in GO Chaslot et al. (2008) 35

36 Sequential/leaf-parallel MCTS. Seen as an optimization problem

37 Root-parallel MCTS - many starting points. Greater chance of reaching the global optimum

38 Parallel MCTS on GPU. Leaf parallelism - not efficient, but easy to implement on a GPU: no divergence problem, high throughput (simulation speed). Root parallelism - very efficient on a CPU, but would be inefficient on a GPU: one thread per tree means thousands of trees (where to store them? how to manage them?) and the divergence problem again. Solution?

39 Proposed solution: Block parallelism 39

40 Block parallelism. (a) + (b) = (c). a. Leaf parallelism - n simulations at once. b. Root parallelism - n trees. c. Block parallelism - n = blocks (trees) x threads (simulations at once). Advantage: works well with SIMD hardware, improves the overall result on 2 levels of parallelization. Weakness: sequential tree-management part (proportional to the number of trees)
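The structure of one block-parallel iteration can be sketched as below. This is a toy stand-in (random coin-flip playouts, a one-node "tree"), not the CUDA implementation: each tree plays the role of one GPU block, and each block runs block_size playouts per iteration.

```python
import random

# Structural sketch of block parallelism: n = trees x block_size simulations
# per iteration, with per-tree management sequential and playouts batched.
def random_playout(rng):
    return rng.random() < 0.5          # hypothetical win/loss playout

def block_parallel_iteration(trees, block_size, rng):
    for tree in trees:
        # sequential part (the CPU in the talk): per-tree selection/expansion
        node = tree.setdefault("root", {"wins": 0, "visits": 0})
        # parallel part (one GPU block): block_size playouts at once
        results = [random_playout(rng) for _ in range(block_size)]
        # backpropagation: the whole batch updates the selected node
        node["wins"] += sum(results)
        node["visits"] += block_size

rng = random.Random(0)
trees = [{} for _ in range(4)]         # 4 "blocks" x 32 "threads"
block_parallel_iteration(trees, 32, rng)
print(sum(t["root"]["visits"] for t in trees))  # 128
```

The outer loop over trees is exactly the sequential overhead the next slides analyze: it grows with the number of trees while the batched playouts stay parallel.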

41 Sequential overhead. CPU: stores and manages the trees. GPU: simulates. Repeated X times: Selection, Expansion, Simulation, Backpropagation. Processing more trees takes more time: going from 4 to 8 trees, the sequential parts grow while the parallel parts stay parallel

42 Sequential overhead. More trees: more time spent on the sequential parts (selection, expansion, backpropagation) - 1 CPU thread iterates over the trees. Parallel part: simulation. Possible solution: more CPU threads

43 Analysis. Leaf parallelism vs block parallelism vs CPU root parallelism. More simulations/s = better score? Block parallelism: number of trees vs their management cost

44 Proposed solution - Block parallelism. GPU block parallelism vs leaf parallelism - speed. (Plot: simulations/second (x 10^5) vs number of threads; leaf parallelism (block size = 64), block parallelism (block sizes 32 and 128); 112 and 448 trees marked; average over 2000 games vs 1 CPU thread.) The GPU is much faster than 1 CPU

45 Proposed solution - Block parallelism. GPU block parallelism vs leaf parallelism - result. (Plot: win ratio vs number of threads; leaf parallelism (block size = 64), block parallelism (block sizes 32 and 128); 112 and 448 trees marked; average over 2000 games vs 1 CPU thread.)

46 Proposed solution - Block parallelism. Number of trees / speed / results: more trees = higher score, more simulations = higher score, but more trees = fewer simulations per tree, so the block size needs to be adjusted. 1 GPU is worth multiple CPUs in AI strength

47 Proposed solution - Block parallelism. Weakness no. 1: a part of the algorithm still relies on weak leaf parallelism. Solution: variance-based error estimation

48 Variance-based error estimation. Calculation based on all gathered samples; works better for a larger number of samples - it utilizes leaf parallelism and needs many samples to work. Applied to the decision-making step (final selection): instead of the best average, choose the best lower estimate. (Figure: possible decisions vs values; solid lines - averages, dotted lines - lower confidence bounds; possible errors marked around the exact value.)
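A minimal sketch of that final selection rule, assuming a lower confidence estimate of the form mean - z * stddev / sqrt(n); the value z = 2.0 is a tunable assumption here, not a figure from the slides.

```python
import math

# Variance-based final selection: rank the root's children by a lower
# confidence estimate of their value rather than by the plain average.
def lower_estimate(outcomes, z=2.0):
    n = len(outcomes)
    mean = sum(outcomes) / n
    var = sum((o - mean) ** 2 for o in outcomes) / n
    return mean - z * math.sqrt(var / n)     # mean minus z standard errors

def final_choice(children_outcomes, z=2.0):
    return max(range(len(children_outcomes)),
               key=lambda i: lower_estimate(children_outcomes[i], z))

well_sampled = [1] * 60 + [0] * 40     # mean 0.60 from 100 playouts
noisy        = [1, 1, 1, 0]            # mean 0.75 from only 4 playouts
print(final_choice([well_sampled, noisy]))  # 0: the well-sampled child wins
```

A plain average would pick the noisy child; the lower bound penalizes its small sample, which is exactly why the method benefits from the many samples leaf parallelism provides.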

49 Variance-based error estimation - results. (Chart: points scored by leaf basic, block basic, leaf + error estimation, block + error estimation; 1 Tesla C2050 (8192 threads = 64 blocks x 128 threads, 64 trees), TSUBAME.)

50 Proposed solution - Block parallelism. Weakness no. 2 - the GPU's latency, shallow trees. On the CPU, 1 simulation takes ~0.1 ms = 10 sims/ms; the GPU reaches ~300 sims/ms, but it takes at least 50 ms to execute any number of simulations on the GPU. CPU MCTS iterations are therefore much faster: more iterations, more tree nodes, deeper trees, more look-ahead. Repeated X times: Selection, Expansion, Simulation, Backpropagation
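Using the slide's figures (10 sims/ms on the CPU, 300 sims/ms on the GPU, at least 50 ms of latency per GPU call), the break-even batch size follows directly:

```python
# Break-even estimate from the slide's figures: the GPU only pays off once a
# batch is large enough to amortize its fixed ~50 ms call latency.
def cpu_ms(n_sims):
    return n_sims / 10.0               # CPU: ~10 simulations per ms

def gpu_ms(n_sims):
    return 50.0 + n_sims / 300.0       # GPU: ~300 sims/ms after 50 ms latency

# smallest batch for which one GPU call beats the CPU
break_even = next(n for n in range(1, 10**6) if gpu_ms(n) < cpu_ms(n))
print(break_even)  # 518
```

Below roughly 500 simulations per call the CPU wins outright, which motivates the simultaneous CPU/GPU scheme on the next slide.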

51 Simultaneous CPU/GPU simulating. While the GPU runs a kernel, the CPU can work too. Increases the tree depth, improves the overall result. (Timeline: kernel execution call; CPU control - the CPU can work here!; kernel execution; GPU ready event; both sides alternate between processing and simulating until finished.)

113 Simultaneous CPU/GPU simulating - results. (Plots vs game step: average point difference (score), and average tree depth.)

114 MCTS - possible scalability limitations - analysis

115 MPI Parallel Scheme. N processes; process number 0 controls the game. Root process (id = 0): receive the opponent's move (input data), send the current state of the game to all n-1 processes (broadcast data), accumulate the results (collect data, reduce), choose the best move and send it to the opponent (output data). The other n-1 processes (possibly on other machines, e.g. Core i7/Fedora, Phenom/Ubuntu) think/simulate over the network; all simulations are independent

116 Multi-GPU results - first conclusions. Simulations/second: no communication bottleneck, ~20 mln sim/s (...376 threads). Average point difference: the improvement gets worse as the number of GPUs grows (112 blocks x 64 threads each)

117 Scalability analysis. Findings: weak scaling of the algorithm - the problem's complexity affects the scalability; exploitation/exploration ratio - higher exploitation is needed for more trees; no communication bottleneck; much more efficient than the CPU version

118 Sampling with replacement. Probability of having exactly x distinct simulations after n simulations, where m is the total number of possible combinations: P(m,n,x) = C(m,x) * C(n-1,n-x) / C(m+n-1,n). Expected number of distinct samples: D(m,n) = sum over x >= 1 of x * P(m,n,x). Small problem size: increased number of repeated samples
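The formulas can be checked numerically. The transcription below follows the slide's counting (n draws from m possibilities treated as unordered multisets); the values m = 20, n = 10 are illustrative assumptions.

```python
from math import comb

# P(m, n, x): probability of exactly x distinct samples among n draws,
# per the slide's formula; D(m, n): expected number of distinct samples.
def p_distinct(m, n, x):
    return comb(m, x) * comb(n - 1, n - x) / comb(m + n - 1, n)

def expected_distinct(m, n):
    return sum(x * p_distinct(m, n, x) for x in range(1, min(m, n) + 1))

# sanity check: the probabilities over x sum to 1 (Vandermonde's identity)
total = sum(p_distinct(20, 10, x) for x in range(1, 11))
print(round(total, 12), expected_distinct(20, 10))
```

As m shrinks relative to n, D(m, n) flattens out well below n, which is the quantitative form of the "small problem size means repeated samples" point above.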

119 Sampling with replacement. If the state-space is small, the impact of the parallelism will be diminished. Low problem complexity: the problem is simple itself (e.g. small instances of SameGame, TicTacToe), or in the ending phases of games/puzzles (few steps ahead) the problem's size decreases. (Plot: depth vs game step.)

120 Scalability analysis. Score difference (1p vs 2p): 256 GPUs (3,670,016 GPU threads) and 2048 CPU threads vs sequential MCTS, on TSUBAME 2.0 TESLA GPUs. (Plot: point difference (parallel - sequential) vs game step; losing if below 0.)

121 Exploration/exploitation in parallel MCTS. (Figure: the individual search trees and their sum, under high exploitation vs high exploration.)

122 Problems / proposed solutions (my contribution). GPU architecture - block parallelism, fast random sequence generation (not presented here). Latency, slow iterations - hybrid CPU-GPU processing. Low-efficiency leaf-parallel part of block parallelism - variance-based error estimation for decision making. Scalability unknown - MPI and GPU implementation and analysis

123 How universal and important is the MCTS block-parallel algorithm? MCTS has many applications already and new ones are appearing. The GPU architecture is likely to follow the current trend in the future. Programming GPUs may become easier, rather not harder

124 Current challenges. Further investigation of large-scale tree search. New applications of MCTS, e.g. TSP. New AI and optimization algorithms for GPUs. Multiple CPU threads per GPU in block parallelism. GTC 2010 presentation: Playing Zero-Sum Games on the GPU, NVIDIA - tic-tac-toe, Sudoku


126 Publications. Kamil Rocki, Reiji Suda, "Parallel Minimax Tree Searching on GPU", PPAM 2009, Eighth International Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, 13-16 Sep. 2009. Kamil Rocki, Reiji Suda, "Massively Parallel Monte Carlo Tree Search", VECPAR 2010, Berkeley, CA, USA, June 22-25, 2010. Kamil Rocki, Reiji Suda, "Improving the parallel Monte Carlo Tree Search performance by the standard deviation based error estimation", 3rd International Conference on Machine Learning and Computing (ICMLC 2011), Singapore, February 26-28, 2011. Kamil Rocki, Reiji Suda, "MPI-GPU Monte Carlo Tree Search", IEEE 2011 International Conference on Information and Computer Applications (ICICA), Dubai, UAE, March 2011. Kamil Rocki, "Large-Scale Parallel Monte Carlo Tree Search on GPU", PhD Forum, 25th IEEE IPDPS, May 16-20, 2011, Anchorage, USA. Kamil Rocki, Reiji Suda, "Parallel Monte Carlo Tree Search on GPU", 11th Scandinavian Conference on Artificial Intelligence, Norwegian University of Science and Technology, Trondheim, Norway, May 24-26, 2011 (nominated for the best paper, automatically submitted to a journal, in progress). Kamil Rocki, Reiji Suda, "Parallel Monte Carlo Tree Search Scalability Discussion", 24th Australasian Joint Conference on Artificial Intelligence, December 5-8, 2011, Perth, Australia


More information

Chapter 1. Introduction: Part I. Jens Saak Scientific Computing II 7/348

Chapter 1. Introduction: Part I. Jens Saak Scientific Computing II 7/348 Chapter 1 Introduction: Part I Jens Saak Scientific Computing II 7/348 Why Parallel Computing? 1. Problem size exceeds desktop capabilities. Jens Saak Scientific Computing II 8/348 Why Parallel Computing?

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming January 14, 2015 www.cac.cornell.edu What is Parallel Programming? Theoretically a very simple concept Use more than one processor to complete a task Operationally

More information

Algorithm Design Techniques (III)

Algorithm Design Techniques (III) Algorithm Design Techniques (III) Minimax. Alpha-Beta Pruning. Search Tree Strategies (backtracking revisited, branch and bound). Local Search. DSA - lecture 10 - T.U.Cluj-Napoca - M. Joldos 1 Tic-Tac-Toe

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Fundamentals of Computer Design

Fundamentals of Computer Design Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

HEURISTIC SEARCH. 4.3 Using Heuristics in Games 4.4 Complexity Issues 4.5 Epilogue and References 4.6 Exercises

HEURISTIC SEARCH. 4.3 Using Heuristics in Games 4.4 Complexity Issues 4.5 Epilogue and References 4.6 Exercises 4 HEURISTIC SEARCH Slide 4.1 4.0 Introduction 4.1 An Algorithm for Heuristic Search 4.2 Admissibility, Monotonicity, and Informedness 4.3 Using Heuristics in Games 4.4 Complexity Issues 4.5 Epilogue and

More information

Fast Tridiagonal Solvers on GPU

Fast Tridiagonal Solvers on GPU Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based

More information

Monte Carlo Tree Search

Monte Carlo Tree Search Monte Carlo Tree Search Branislav Bošanský PAH/PUI 2016/2017 MDPs Using Monte Carlo Methods Monte Carlo Simulation: a technique that can be used to solve a mathematical or statistical problem using repeated

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

SHARED MEMORY VS DISTRIBUTED MEMORY

SHARED MEMORY VS DISTRIBUTED MEMORY OVERVIEW Important Processor Organizations 3 SHARED MEMORY VS DISTRIBUTED MEMORY Classical parallel algorithms were discussed using the shared memory paradigm. In shared memory parallel platform processors

More information

Trees, Trees and More Trees

Trees, Trees and More Trees Trees, Trees and More Trees August 9, 01 Andrew B. Kahng abk@cs.ucsd.edu http://vlsicad.ucsd.edu/~abk/ How You ll See Trees in CS Trees as mathematical objects Trees as data structures Trees as tools for

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

High Performance Computing in C and C++

High Performance Computing in C and C++ High Performance Computing in C and C++ Rita Borgo Computer Science Department, Swansea University Announcement No change in lecture schedule: Timetable remains the same: Monday 1 to 2 Glyndwr C Friday

More information

Algorithm Design Techniques. Hwansoo Han

Algorithm Design Techniques. Hwansoo Han Algorithm Design Techniques Hwansoo Han Algorithm Design General techniques to yield effective algorithms Divide-and-Conquer Dynamic programming Greedy techniques Backtracking Local search 2 Divide-and-Conquer

More information

Block-Parallel IDA* for GPUs

Block-Parallel IDA* for GPUs Proceedings of the Tenth International Symposium on Combinatorial Search (SoCS 2017) Block-Parallel IDA* for GPUs Satoru Horie, Alex Fukunaga Graduate School of Arts and Sciences The University of Tokyo

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Fundamentals of Computers Design

Fundamentals of Computers Design Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2

More information

Lect. 2: Types of Parallelism

Lect. 2: Types of Parallelism Lect. 2: Types of Parallelism Parallelism in Hardware (Uniprocessor) Parallelism in a Uniprocessor Pipelining Superscalar, VLIW etc. SIMD instructions, Vector processors, GPUs Multiprocessor Symmetric

More information

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

By: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,

More information

Top500 Supercomputer list

Top500 Supercomputer list Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information

PIPELINE AND VECTOR PROCESSING

PIPELINE AND VECTOR PROCESSING PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

BlueGene/L (No. 4 in the Latest Top500 List)

BlueGene/L (No. 4 in the Latest Top500 List) BlueGene/L (No. 4 in the Latest Top500 List) first supercomputer in the Blue Gene project architecture. Individual PowerPC 440 processors at 700Mhz Two processors reside in a single chip. Two chips reside

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming David Lifka lifka@cac.cornell.edu May 23, 2011 5/23/2011 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor or computer to complete

More information

Introduction to GPU hardware and to CUDA

Introduction to GPU hardware and to CUDA Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Foundations of Artificial Intelligence

Foundations of Artificial Intelligence Foundations of Artificial Intelligence 45. AlphaGo and Outlook Malte Helmert and Gabriele Röger University of Basel May 22, 2017 Board Games: Overview chapter overview: 40. Introduction and State of the

More information

State Space Search. Many problems can be represented as a set of states and a set of rules of how one state is transformed to another.

State Space Search. Many problems can be represented as a set of states and a set of rules of how one state is transformed to another. State Space Search Many problems can be represented as a set of states and a set of rules of how one state is transformed to another. The problem is how to reach a particular goal state, starting from

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Computer parallelism Flynn s categories

Computer parallelism Flynn s categories 04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories

More information

CMU-Q Lecture 2: Search problems Uninformed search. Teacher: Gianni A. Di Caro

CMU-Q Lecture 2: Search problems Uninformed search. Teacher: Gianni A. Di Caro CMU-Q 15-381 Lecture 2: Search problems Uninformed search Teacher: Gianni A. Di Caro RECAP: ACT RATIONALLY Think like people Think rationally Agent Sensors? Actuators Percepts Actions Environment Act like

More information

Click to edit Master title style Approximate Models for Batch RL Click to edit Master subtitle style Emma Brunskill 2/18/15 2/18/15 1 1

Click to edit Master title style Approximate Models for Batch RL Click to edit Master subtitle style Emma Brunskill 2/18/15 2/18/15 1 1 Approximate Click to edit Master titlemodels style for Batch RL Click to edit Emma Master Brunskill subtitle style 11 FVI / FQI PI Approximate model planners Policy Iteration maintains both an explicit

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

44.1 Introduction Introduction. Foundations of Artificial Intelligence Monte-Carlo Methods Sparse Sampling 44.4 MCTS. 44.

44.1 Introduction Introduction. Foundations of Artificial Intelligence Monte-Carlo Methods Sparse Sampling 44.4 MCTS. 44. Foundations of Artificial ntelligence May 27, 206 44. : ntroduction Foundations of Artificial ntelligence 44. : ntroduction Thomas Keller Universität Basel May 27, 206 44. ntroduction 44.2 Monte-Carlo

More information

Potential Midterm Exam Questions

Potential Midterm Exam Questions Potential Midterm Exam Questions 1. What are the four ways in which AI is usually viewed? Which of the four is the preferred view of the authors of our textbook? 2. What does each of the lettered items

More information

TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM

TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Norwegian University of Science and Technology Department of Computer and Information Science Page 1 of 13 Contact: Magnus Jahre (952 22 309) TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Monday 4. June Time:

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

Introduction to GPU programming with CUDA

Introduction to GPU programming with CUDA Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU

More information

EE/CSCI 451 Midterm 1

EE/CSCI 451 Midterm 1 EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming

More information

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC

Automatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Cartoon parallel architectures; CPUs and GPUs

Cartoon parallel architectures; CPUs and GPUs Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD

More information

GPU Background. GPU Architectures for Non-Graphics People. David Black-Schaffer David Black-Schaffer 1

GPU Background. GPU Architectures for Non-Graphics People. David Black-Schaffer David Black-Schaffer 1 GPU Architectures for Non-Graphics People GPU Background David Black-Schaffer david.black-schaffer@it.uu.se David Black-Schaffer 1 David Black-Schaffer 2 GPUs: Architectures for Drawing Triangles Fast!

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information

Uninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall

Uninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall Midterm Exam Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall Covers topics through Decision Trees and Random Forests (does not include constraint satisfaction) Closed book 8.5 x 11 sheet with notes

More information

Parallel Programming Programowanie równoległe

Parallel Programming Programowanie równoległe Parallel Programming Programowanie równoległe Lecture 1: Introduction. Basic notions of parallel processing Paweł Rzążewski Grading laboratories (4 tasks, each for 3-4 weeks) total 50 points, final test

More information

10th August Part One: Introduction to Parallel Computing

10th August Part One: Introduction to Parallel Computing Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer

More information

Introduction to High Performance Computing

Introduction to High Performance Computing Introduction to High Performance Computing Gregory G. Howes Department of Physics and Astronomy University of Iowa Iowa High Performance Computing Summer School University of Iowa Iowa City, Iowa 25-26

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Parallel Monte-Carlo Tree Search

Parallel Monte-Carlo Tree Search Parallel Monte-Carlo Tree Search Guillaume M.J-B. Chaslot, Mark H.M. Winands, and H. Jaap van den Herik Games and AI Group, MICC, Faculty of Humanities and Sciences, Universiteit Maastricht, Maastricht,

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

Monte Carlo Methods; Combinatorial Search

Monte Carlo Methods; Combinatorial Search Monte Carlo Methods; Combinatorial Search Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 22, 2012 CPD (DEI / IST) Parallel and

More information

Parallel Monte Carlo Tree Search from Multi-core to Many-core Processors

Parallel Monte Carlo Tree Search from Multi-core to Many-core Processors Parallel Monte Carlo Tree Search from Multi-core to Many-core Processors S. Ali Mirsoleimani, Aske Plaat, Jaap van den Herik and Jos Vermaseren Leiden Centre of Data Science, Leiden University Niels Bohrweg

More information

Processor Architecture and Interconnect

Processor Architecture and Interconnect Processor Architecture and Interconnect What is Parallelism? Parallel processing is a term used to denote simultaneous computation in CPU for the purpose of measuring its computation speeds. Parallel Processing

More information

Optimized Scientific Computing:

Optimized Scientific Computing: Optimized Scientific Computing: Coding Efficiently for Real Computing Architectures Noah Kurinsky SASS Talk, November 11 2015 Introduction Components of a CPU Architecture Design Choices Why Is This Relevant

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15: Dataflow

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

CME 213 SPRING Eric Darve

CME 213 SPRING Eric Darve CME 213 SPRING 2017 Eric Darve MPI SUMMARY Point-to-point and collective communications Process mapping: across nodes and within a node (socket, NUMA domain, core, hardware thread) MPI buffers and deadlocks

More information

Artificial Intelligence CS 6364

Artificial Intelligence CS 6364 Artificial Intelligence CS 6364 Professor Dan Moldovan Section 4 Informed Search and Adversarial Search Outline Best-first search Greedy best-first search A* search Heuristics revisited Minimax search

More information

LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS

LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS Department of Computer Science University of Babylon LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS By Faculty of Science for Women( SCIW), University of Babylon, Iraq Samaher@uobabylon.edu.iq

More information

Parallel Computing Introduction

Parallel Computing Introduction Parallel Computing Introduction Bedřich Beneš, Ph.D. Associate Professor Department of Computer Graphics Purdue University von Neumann computer architecture CPU Hard disk Network Bus Memory GPU I/O devices

More information

tree follows. Game Trees

tree follows. Game Trees CPSC-320: Intermediate Algorithm Design and Analysis 113 On a graph that is simply a linear list, or a graph consisting of a root node v that is connected to all other nodes, but such that no other edges

More information

High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem

High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem Kamil Rocki, PhD Department of Computer Science Graduate School of Information Science and Technology The University of

More information

Parallel Programming. Parallel algorithms Combinatorial Search

Parallel Programming. Parallel algorithms Combinatorial Search Parallel Programming Parallel algorithms Combinatorial Search Some Combinatorial Search Methods Divide and conquer Backtrack search Branch and bound Game tree search (minimax, alpha-beta) 2010@FEUP Parallel

More information

Chapter 2 Classical algorithms in Search and Relaxation

Chapter 2 Classical algorithms in Search and Relaxation Chapter 2 Classical algorithms in Search and Relaxation Chapter 2 overviews topics on the typical problems, data structures, and algorithms for inference in hierarchical and flat representations. Part

More information

Programmable Graphics Hardware (GPU) A Primer

Programmable Graphics Hardware (GPU) A Primer Programmable Graphics Hardware (GPU) A Primer Klaus Mueller Stony Brook University Computer Science Department Parallel Computing Explained video Parallel Computing Explained Any questions? Parallelism

More information