Large Scale Parallel Monte Carlo Tree Search on GPU
1 Large Scale Parallel Monte Carlo Tree Search on GPU - Kamil Rocki, The University of Tokyo, Graduate School of Information Science and Technology, Department of Computer Science 1
2 Tree search Finding a solution by visiting each node in a tree/graph data structure in a systematic way. Typical search problems: shortest path (i.e. Dijkstra, A*), the traveling salesman problem, min-max game trees (depth measured in plies), SAT solvers. complexity = f(branching, depth) 2
3 Typical applications Planning, scheduling; discrete optimization; database search; games/puzzles (AI). Games and puzzles are a good testbed for tree-search algorithms: well-defined complexity, easy to measure results, and the algorithm should work in other domains 3
4 Tree search (games) depth branching complexity = f(branching, depth) 4
5 Problem statement Game tree search. Nodes represent <state, actions> pairs. Look ahead to find the best move (minimizing cost / maximizing reward). Search strategy: defines the way the nodes are expanded. Class of problems: decision making 5
6 Analyzed problem: Reversi The game tree represents a sequence of moves. Non-uniform tree (variable number of children per node). Depth up to 60. Average number of leaves (figures: initial position (A), example final position (B)) 6
7 Problem complexity Exhaustive search, assuming only 1 ns per state (in Reversi): 11 moves ahead ~ 21 s, 12 moves ahead ~ 3 m, 13 moves ahead ~ 27 m, 14 moves ahead ~ 4 h, 15 moves ahead ~ 1.45 d 7
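The growth above follows directly from complexity = f(branching, depth). A minimal sketch of that relation (the branching factor b = 10 below is an assumed round value for illustration, not a figure from the slides):

```python
# Sketch: the cost of exhaustive look-ahead at 1 ns per state grows as
# branching**depth, so each extra ply multiplies the search time by the
# branching factor (b = 10 here is an assumed value, not from the slides).
def lookahead_seconds(branching, depth, ns_per_state=1e-9):
    return branching ** depth * ns_per_state

# One extra ply costs one more factor of the branching factor.
assert abs(lookahead_seconds(10, 11) - 100.0) < 1e-6
assert abs(lookahead_seconds(10, 12) / lookahead_seconds(10, 11) - 10.0) < 1e-9
```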
8 Parallel systems - CPU vs GPU Clock speed has been fairly static recently (~3-3.4 GHz). Due to the rise of multi-core machines, computational power has still been increasing. GPUs are far ahead of CPUs in terms of FLOPS 8
9 CPU vs GPU - the GPU is dedicated to computation 9
10 Goal Utilize highly parallel (GPU / CPU+GPU) systems to search trees efficiently. The share of CPU+GPU systems is increasing (TOP500 list). But GPU programming is challenging 10
11 Taxonomy SISD: Single Instruction, Single Data. SIMD: Single Instruction, Multiple Data. MISD: Multiple Instruction, Single Data. MIMD: Multiple Instruction, Multiple Data 11
12 SIMD processing - threads execute in SIMD groups. Flynn's taxonomy (1966). PU - Processing Unit 12
13 SIMD processing - why is it a problem? All threads execute: a[i] = b[i] + c[i]; if (a[i] > 0) a[i] = a[i]*a[i]; else a[i] = 0; Threads taking one branch must WAIT while the others execute theirs. Tree search is much more complex 13
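The branch on the slide can be modeled with a toy lock-step execution (a Python stand-in, not GPU code): in SIMD execution a warp pays for every branch path that any of its threads takes, instead of only the longest one.

```python
# Sketch: a toy model of SIMD lock-step execution. A "warp" evaluates
# both sides of a divergent branch; threads whose predicate is false are
# masked off, so divergence costs the SUM of both paths, not the max.
def warp_branch_cost(predicates, then_cost, else_cost):
    took_then = any(predicates)          # some thread enters the if-branch
    took_else = not all(predicates)      # some thread enters the else-branch
    return then_cost * took_then + else_cost * took_else

# All 32 threads agree: only one path is executed.
assert warp_branch_cost([True] * 32, 10, 5) == 10
# A single diverging thread makes the whole warp pay for both paths.
assert warp_branch_cost([True] * 31 + [False], 10, 5) == 15
```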
14 Thread divergence problem (figures, slides 14-16): threads 0 and 1 progress through the inner loop at different rates (snapshots at the 2nd, 5th and 6th iterations), so the whole SIMD group runs until the slowest thread finishes 16
17 GPU processing 1. Input data is transferred from the host (CPU) memory to the GPU. 2. The program (kernel) is executed on the GPU. 3. Output data is transferred back from the GPU to the host's memory. The transfers add latency 19
20 Hardware Registers (per thread - max 32k per MP): fastest. Shared memory (per block - max 48 kB per MP): fast. Global memory (big - GBs, visible to all threads): slow. Local memory (max 512 kB per thread): slow. New GPUs (2010-) also have a cache. Architecture awareness is required 20
21 Software model Host = CPU, Device = GPU. Thread = basic execution unit = SP. Block = batch of threads sharing an MP. Grid = batch of blocks = MIMD. Warp = scheduling unit = SIMD (1 warp = 32 threads currently) 21
22 GPU programming difficulties It is easy to write a simple program - just port the code from the CPU - but the performance is bad. Algorithms need to be rethought and reimplemented, and the implementation itself can be challenging (GPU hardware, CUDA software). It is hard to achieve good performance (high parallelism): memory hierarchy/constraints, SIMD processing - warp divergence, limited communication, latency, architecture awareness, programmability 22
23 Monte Carlo Tree Search 23
24 What is Monte Carlo Tree Search (MCTS)? A method for making optimal decisions in AI problems. It can theoretically be applied to any domain described by {state, action}. Games, e.g. Go, Reversi and other difficult games - Coulom (2006, 2008). Optimization - Gaudel, Sebag (2010). Decision support systems - Chaslot et al. (2006). Alternative for dynamic programming, Markov models - Teytaud et al. (2008). Alternative for A*/IDA* algorithms - Single Player MCTS - Schadd et al. (2008). Constraint satisfaction problems - Baba et al. (2011) 24
25 MCTS - Coulom (2006); UCT - Kocsis and Szepesvári (2006) The basic MCTS algorithm is simple. Repeated X times: Selection - the selection function is applied recursively until a leaf of the tree is reached; Expansion - one (or more) leaf nodes are created; Simulation - one simulated game is played; Backpropagation - the result of this game is backpropagated in the tree. The standard UCT selection formula: node value (mean value of the node, i.e. success/loss ratio) + C * sqrt(ln(total number of simulations) / (number of times the node was visited)), where C is a tunable exploitation/exploration ratio factor 25
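The UCT selection step named above can be sketched directly from its formula (a minimal sketch, not the author's implementation; the child statistics below are illustrative):

```python
import math

# Sketch: standard UCT selection. C trades off exploitation (mean value)
# against exploration (less-visited nodes get a bonus).
def uct_score(wins, visits, parent_visits, C=1.414):
    if visits == 0:
        return float("inf")          # unvisited children are tried first
    return wins / visits + C * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, C=1.414):
    # children: list of (wins, visits) pairs; pick the highest UCT score
    return max(range(len(children)),
               key=lambda i: uct_score(*children[i], parent_visits, C))

# A better-performing but less-visited child beats the greedy choice.
assert select_child([(3, 6), (2, 3)], 9) == 1
```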
26 Exploitation vs exploration High exploitation (greedy): quickly gets to a local optimum. High exploration: greater chance of finding the global optimum 26
27 MCTS consists of 2 parts (figure: a tree annotated with win/visit counts such as 3/6, 3/5, 2/3, 1/3): 1. Tree building - stored in memory. 2. Simulating - temporary, not remembered; done by the CPU or the GPU; the results (final result of a simulation: 0 or 1) are used to affect the tree's expansion strategy 27
28 MCTS components Tree construction: selection (figure: node values 3/5, 3/4, 1/3, 2/3). C - exploitation/exploration ratio factor, tunable; mean value of the node (i.e. success/loss ratio); total number of simulations; number of times the node was visited. Choose the child with the highest evaluation 28
29 MCTS Tree construction: expansion (figure: node values 3/5, 3/4, 1/3, 2/3). New nodes are initialized with the value 1 win / 2 simulations (1/2) 29
30 MCTS Simulation (figure: a playout from the new 1/2 node). A game is simulated from the new leaf; only the result is remembered 30
31 MCTS Backpropagation The simulation result (here 0 successes / 1 simulation, i.e. 0/1) is propagated back up the path to the root, updating each node's statistics (figure: 3/5 becomes 3/6) 31
32 MCTS - Coulom (2006); UCT - Kocsis and Szepesvári (2006) After X iterations comes the decision-making step: choose the best option among the root's children, typically based on the average score (number of wins / number of simulations). The best choice is played 32
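The four steps plus the final decision can be combined into a compact loop. A sketch on a toy one-player domain (the domain, node layout and names are illustrative assumptions, not the author's Reversi code):

```python
import math, random

# Sketch: the four MCTS steps (selection, expansion, simulation,
# backpropagation) followed by the decision on the root's children.
class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.wins, self.visits = [], 0, 0

def uct(node, C=1.414):
    if node.visits == 0:
        return float("inf")
    return (node.wins / node.visits
            + C * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts(root_state, moves, result, iterations=200):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while node.children:                     # 1. selection
            node = max(node.children, key=uct)
        for m in moves(node.state):              # 2. expansion
            node.children.append(Node(m, node))
        leaf = random.choice(node.children) if node.children else node
        outcome = result(leaf.state)             # 3. simulation (playout)
        while leaf:                              # 4. backpropagation
            leaf.visits += 1
            leaf.wins += outcome
            leaf = leaf.parent
    # decision step: best average score among the root's children
    return max(root.children, key=lambda c: c.wins / c.visits).state

# Toy domain: a "state" is a bias; a playout succeeds with that probability.
random.seed(0)
best = mcts(0.0, lambda s: [0.1, 0.9] if s == 0.0 else [],
            lambda s: random.random() < s)
assert best == 0.9
```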
33 The features of MCTS Aheuristic; asymmetric tree growth; more iterations = stronger algorithm; easy to parallelize; no intermediate state evaluation is needed 33
34 Parallel MCTS schemes - Chaslot et al. (2008) (figure: the schemes range from complex and not efficient to easy and efficient) 34
35 Parallel MCTS schemes - performance in Go, Chaslot et al. (2008) 35
36 Sequential/leaf parallel MCTS Seen as an optimization problem 36
37 Root parallel MCTS - many starting points Greater chance of reaching the global optimum 37
38 Parallel MCTS on GPU Leaf parallelism - not efficient, but easy to implement on the GPU: no divergence problem, high throughput (simulation speed). Root parallelism - very efficient on the CPU, but would be inefficient on the GPU: one thread per tree means thousands of trees (where to store them? how to manage them?) and the divergence problem again. Solution? 38
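The two baseline schemes can be sketched side by side (a CPU-side Python stand-in for the GPU kernels; the function names are hypothetical):

```python
import random

# Sketch: leaf parallelism runs many independent playouts from ONE leaf
# at once and aggregates them into a single statistics update; root
# parallelism runs many independent trees and sums their votes per move.
def leaf_parallel_update(playout, leaf_state, n_threads):
    # all "threads" simulate the same state: no divergence, one update
    wins = sum(playout(leaf_state) for _ in range(n_threads))
    return wins, n_threads                     # (wins, simulations) delta

def root_parallel_vote(search_one_tree, state, n_trees):
    # each tree searches independently; votes are summed per move
    votes = {}
    for _ in range(n_trees):
        move = search_one_tree(state)
        votes[move] = votes.get(move, 0) + 1
    return max(votes, key=votes.get)

random.seed(1)
wins, sims = leaf_parallel_update(lambda s: random.random() < s, 0.7, 64)
assert sims == 64 and 0 <= wins <= 64
```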
39 Proposed solution: Block parallelism 39
40 Block parallelism (a) Leaf parallelism + (b) root parallelism = (c) block parallelism: n trees with n simulations at once, where n = blocks (trees) x threads (simulations per tree). Advantage: works well with SIMD hardware, improves the overall result on 2 levels of parallelization. Weakness: the sequential tree-management part (proportional to the number of trees) 40
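The blocks-times-threads mapping can be sketched as follows (a hypothetical CPU-side stand-in, not the author's CUDA kernel):

```python
import random

# Sketch: block parallelism keeps one tree per GPU block and runs one
# playout per thread, so a single GPU pass yields blocks x threads
# simulations, returned as one (wins, simulations) delta per tree.
def block_parallel_pass(tree_leaf_states, threads_per_block, playout):
    results = []
    for leaf_state in tree_leaf_states:         # one block per tree
        wins = sum(playout(leaf_state)          # one playout per thread
                   for _ in range(threads_per_block))
        results.append((wins, threads_per_block))
    return results

random.seed(2)
deltas = block_parallel_pass([0.2, 0.8, 0.5], 32,
                             lambda s: random.random() < s)
assert len(deltas) == 3                        # one update per tree
assert sum(sims for _, sims in deltas) == 96   # 3 blocks x 32 threads
```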
41 CPU sequential overhead The CPU stores and manages the trees; the GPU simulates. Repeated X times: Selection, Expansion, Simulation, Backpropagation. Processing more trees takes more time on the sequential parts (figure: sequential vs parallel parts for 4 trees and 8 trees) 41
42 CPU sequential overhead More trees: more time spent on the sequential part - 1 CPU thread iterates over the trees during selection, expansion and backpropagation; the parallel part is the simulation. Possible solution: more CPU threads per GPU 42
43 Analysis Leaf parallelism vs block parallelism vs CPU root parallelism. Do more simulations/s mean a better score? Block parallelism: number of trees vs their management cost 43
44 Proposed solution - block parallelism vs leaf parallelism: speed (plot: simulations/second x 10^5 vs number of threads; series: leaf parallelism with block size 64, block parallelism with block sizes 32 and 128; 112-tree and 448-tree configurations marked). Average over 2000 games vs 1 CPU thread - the GPU is much faster 44
45 Proposed solution - block parallelism vs leaf parallelism: win ratio (plot: win ratio vs number of threads; series: leaf parallelism with block size 64, block parallelism with block sizes 32 and 128; 112-tree and 448-tree configurations marked). Average over 2000 games vs 1 CPU thread 45
46 Proposed solution - block parallelism Number of trees vs speed vs results: more trees = higher score; more simulations = higher score; but more trees = fewer simulations per tree, so the block size needs to be adjusted. 1 GPU is comparable to multiple CPUs (AI strength) 46
47 Proposed solution - block parallelism Weakness no. 1: part of the algorithm still relies on weak leaf parallelism. Solution: variance-based error estimation 47
48 Variance-based error estimation Calculation based on all gathered samples; works better for a larger number of samples, so it utilizes leaf parallelism (needs many samples to work). Applied to the decision-making step (final selection): instead of the best average, choose the best lower estimate (figure: possible decisions' values around the exact value - solid lines are averages, dotted lines are lower confidence bounds showing possible errors) 48
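One way to realize "best lower estimate" is a mean-minus-scaled-standard-error bound (an assumed form for illustration; the slides give the idea, not an explicit formula):

```python
import math, statistics

# Sketch: pick the move with the best LOWER confidence bound
# (mean - k * standard error) instead of the best raw average.
def best_lower_estimate(samples_per_move, k=1.0):
    def lower_bound(samples):
        mean = statistics.mean(samples)
        stderr = statistics.stdev(samples) / math.sqrt(len(samples))
        return mean - k * stderr
    return max(range(len(samples_per_move)),
               key=lambda i: lower_bound(samples_per_move[i]))

# A slightly lower but far more certain average wins under the bound.
noisy = [0.0, 1.0] * 20          # mean 0.5, high variance
steady = [0.45, 0.46] * 20       # mean 0.455, tiny variance
assert best_lower_estimate([noisy, steady]) == 1
```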
49 Variance-based error estimation - results (plot: points scored by 1 Tesla C2050 GPU - 64 blocks x 128 threads = 8192 threads, 64 trees - vs 1 CPU on TSUBAME; series: leaf basic, block basic, leaf + error estimation, block + error estimation) 49
50 Proposed solution - block parallelism Weakness no. 2: the GPU's latency leads to shallow trees. On the CPU, 1 simulation takes ~0.1 ms (10 sims/ms); on the GPU a batch takes ~100 ms (~300 sims/ms), but it takes at least 50 ms to execute any number of simulations on the GPU. CPU MCTS iterations (Selection, Expansion, Simulation, Backpropagation, repeated X times) are therefore much faster: more iterations, more tree nodes, deeper trees, more look-ahead 50
51 Simultaneous CPU/GPU simulating While the GPU runs a kernel, the CPU can work too. This increases the tree depth and improves the overall result (timeline figure: kernel execution call, CPU control - "CPU can work here!" - kernel execution, GPU-ready event) 51
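The overlap can be sketched with a background thread standing in for the asynchronous kernel (a CPU-side analogy with hypothetical names, not CUDA stream code):

```python
import threading, time

# Sketch: while the "GPU" batch runs asynchronously, the CPU keeps doing
# MCTS iterations of its own instead of blocking on the kernel.
def overlap(gpu_batch, cpu_iteration, gpu_time):
    result = {}
    def kernel():                        # stands in for the async kernel
        time.sleep(gpu_time)
        result["gpu"] = gpu_batch()
    t = threading.Thread(target=kernel)
    t.start()                            # non-blocking "kernel launch"
    cpu_iters = 0
    while t.is_alive():                  # work until the GPU-ready event
        cpu_iteration()
        cpu_iters += 1
    t.join()
    return result["gpu"], cpu_iters

sims, iters = overlap(lambda: 8192, lambda: time.sleep(0.001), 0.05)
assert sims == 8192
assert iters > 0                         # the CPU got useful work done
```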
113 Simultaneous CPU/GPU simulating - results (plots: average point difference/score of 1 GPU vs 128 CPUs per game step, and average tree depth per game step) 52
114 MCTS: analysis of possible scalability limitations 53
115 MPI parallel scheme N processes are initialized; root process number 0 controls the game. The root receives the opponent's move (input data) and broadcasts the current state of the game to all n-1 processes over the network (the machines may differ, i.e. Core i7/Fedora, Phenom/Ubuntu). Each process thinks/simulates - all simulations are independent. The results are accumulated (collected with a reduce), and the root chooses the best move and sends it to the opponent (output data) 54
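The broadcast-simulate-reduce scheme can be sketched without MPI, using threads as stand-ins for the processes (names and the toy state format are illustrative assumptions):

```python
from concurrent.futures import ThreadPoolExecutor
import random

# Sketch: the root "broadcasts" the game state to every worker, each
# worker runs independent simulations, and the per-move win counts are
# reduced (summed) before the root picks the best move.
def worker(state, n_sims, seed):
    rng = random.Random(seed)            # independent stream per worker
    wins = [0] * len(state["moves"])
    for _ in range(n_sims):
        m = rng.randrange(len(state["moves"]))       # pick a move to try
        wins[m] += rng.random() < state["moves"][m]  # simulate its outcome
    return wins

def root_search(state, n_procs=4, n_sims=500):
    with ThreadPoolExecutor(n_procs) as pool:        # broadcast + simulate
        results = pool.map(worker, [state] * n_procs,
                           [n_sims] * n_procs, range(n_procs))
    totals = [sum(col) for col in zip(*results)]     # reduce: sum counts
    return totals.index(max(totals))                 # decide: best move

assert root_search({"moves": [0.1, 0.9, 0.3]}) == 1
```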
116 Multi-GPU results - first conclusions (plot: simulations/second and average point difference vs number of GPUs, 112 blocks x 64 threads each). No communication bottleneck, ~20 mln sim/s; the improvement gets worse as GPUs are added 55
117 Scalability analysis Findings: weak scaling of the algorithm - the problem's complexity affects the scalability; exploitation/exploration ratio - higher exploitation is needed for more trees; no communication bottleneck; much more efficient than the CPU version 56
118 Sampling with replacement Probability of having exactly x distinct simulations after n simulations, where m is the total number of possible combinations: P(m, n, x) = C(m, x) * C(n-1, n-x) / C(m+n-1, n), and D(m, n) = sum over x of x * P(m, n, x). D(m, n) - expected number of distinct samples. Small problem size: increased number of repeated samples 57
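The formula on the slide can be evaluated directly (a transcription of that formula into code; the sanity checks below are additions):

```python
from math import comb

# The slide's formula: probability of exactly x distinct outcomes among
# n draws with replacement from m possibilities, and D(m, n), the
# expected number of distinct samples.
def P(m, n, x):
    return comb(m, x) * comb(n - 1, n - x) / comb(m + n - 1, n)

def D(m, n):
    return sum(x * P(m, n, x) for x in range(1, min(m, n) + 1))

# Sanity checks: the probabilities sum to 1, and a single draw is
# always exactly one distinct sample.
assert abs(sum(P(5, 3, x) for x in range(1, 4)) - 1.0) < 1e-12
assert D(5, 1) == 1.0
```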
119 Sampling with replacement If the state space is small, the impact of the parallelism will be diminished. Low problem complexity: the problem is simple itself (i.e. small instances of SameGame, Tic-tac-toe), or in the ending phases of games/puzzles (few steps ahead) the problem's size decreases (plot: tree depth vs game step) 58
120 Scalability analysis (plot: score/point difference of parallel minus sequential MCTS vs game step, player 1 vs player 2; losing if below 0). 256 GPUs - 3,670,016 threads - and 2048 CPU threads vs sequential MCTS; TSUBAME 2.0 Tesla GPUs 59
121 Exploration/exploitation in parallel MCTS (figure: trees with high exploitation vs trees with high exploration, and their SUM) 60
122 Problems / proposed solutions (my contributions): GPU architecture - block parallelism and fast random-sequence generation (not presented here); latency and slow GPU iterations - hybrid CPU-GPU processing; low-efficiency leaf-parallel part of block parallelism - variance-based error estimation for decision making; unknown scalability - MPI and GPU implementation and analysis 61
123 How universal and important is the MCTS block-parallel algorithm? MCTS has many applications already and new ones are appearing. The GPU architecture is likely to follow the current trend in the future, and programming GPUs may become easier, rather than harder 62
124 Current challenges Further investigation of large-scale tree search; new applications of MCTS, i.e. TSP; new AI and optimization algorithms for GPUs; multiple CPU threads per GPU in block parallelism. GTC 2010 presentation: "Playing Zero-Sum Games on the GPU", NVIDIA (Tic-tac-toe, Sudoku) 63
126 Publications Kamil Rocki, Reiji Suda, "Parallel Minimax Tree Searching on GPU", PPAM 2009, Eighth International Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, 13-16 Sep. 2009. Kamil Rocki, Reiji Suda, "Massively Parallel Monte Carlo Tree Search", VECPAR 2010, Berkeley, CA (USA), June 22-25, 2010. Kamil Rocki, Reiji Suda, "Improving the parallel Monte Carlo Tree Search performance by the standard deviation based error estimation", 3rd International Conference on Machine Learning and Computing (ICMLC 2011), Singapore, February 26-28, 2011. Kamil Rocki, Reiji Suda, "MPI-GPU Monte Carlo Tree Search", IEEE 2011 International Conference on Information and Computer Applications (ICICA), Dubai, UAE, March 2011. Kamil Rocki, "Large-Scale Parallel Monte Carlo Tree Search on GPU", PhD Forum, 25th IEEE IPDPS, May 16-20, 2011, Anchorage, USA. Kamil Rocki, Reiji Suda, "Parallel Monte Carlo Tree Search on GPU", 11th Scandinavian Conference on Artificial Intelligence, Norwegian University of Science and Technology, Trondheim, Norway, May 24-26, 2011 (nominated for the best paper, automatically submitted to a journal, in progress). Kamil Rocki, Reiji Suda, "Parallel Monte Carlo Tree Search Scalability Discussion", 24th Australasian Joint Conference on Artificial Intelligence, 5-8 December 2011, Perth, Australia 65
Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationHEURISTIC SEARCH. 4.3 Using Heuristics in Games 4.4 Complexity Issues 4.5 Epilogue and References 4.6 Exercises
4 HEURISTIC SEARCH Slide 4.1 4.0 Introduction 4.1 An Algorithm for Heuristic Search 4.2 Admissibility, Monotonicity, and Informedness 4.3 Using Heuristics in Games 4.4 Complexity Issues 4.5 Epilogue and
More informationFast Tridiagonal Solvers on GPU
Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based
More informationMonte Carlo Tree Search
Monte Carlo Tree Search Branislav Bošanský PAH/PUI 2016/2017 MDPs Using Monte Carlo Methods Monte Carlo Simulation: a technique that can be used to solve a mathematical or statistical problem using repeated
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationSHARED MEMORY VS DISTRIBUTED MEMORY
OVERVIEW Important Processor Organizations 3 SHARED MEMORY VS DISTRIBUTED MEMORY Classical parallel algorithms were discussed using the shared memory paradigm. In shared memory parallel platform processors
More informationTrees, Trees and More Trees
Trees, Trees and More Trees August 9, 01 Andrew B. Kahng abk@cs.ucsd.edu http://vlsicad.ucsd.edu/~abk/ How You ll See Trees in CS Trees as mathematical objects Trees as data structures Trees as tools for
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationHigh Performance Computing in C and C++
High Performance Computing in C and C++ Rita Borgo Computer Science Department, Swansea University Announcement No change in lecture schedule: Timetable remains the same: Monday 1 to 2 Glyndwr C Friday
More informationAlgorithm Design Techniques. Hwansoo Han
Algorithm Design Techniques Hwansoo Han Algorithm Design General techniques to yield effective algorithms Divide-and-Conquer Dynamic programming Greedy techniques Backtracking Local search 2 Divide-and-Conquer
More informationBlock-Parallel IDA* for GPUs
Proceedings of the Tenth International Symposium on Combinatorial Search (SoCS 2017) Block-Parallel IDA* for GPUs Satoru Horie, Alex Fukunaga Graduate School of Arts and Sciences The University of Tokyo
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More informationFundamentals of Computers Design
Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2
More informationLect. 2: Types of Parallelism
Lect. 2: Types of Parallelism Parallelism in Hardware (Uniprocessor) Parallelism in a Uniprocessor Pipelining Superscalar, VLIW etc. SIMD instructions, Vector processors, GPUs Multiprocessor Symmetric
More informationParallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor
Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationBy: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,
More informationTop500 Supercomputer list
Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity
More informationGPU for HPC. October 2010
GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,
More informationPIPELINE AND VECTOR PROCESSING
PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates
More informationLecture 2: CUDA Programming
CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationBlueGene/L (No. 4 in the Latest Top500 List)
BlueGene/L (No. 4 in the Latest Top500 List) first supercomputer in the Blue Gene project architecture. Individual PowerPC 440 processors at 700Mhz Two processors reside in a single chip. Two chips reside
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming David Lifka lifka@cac.cornell.edu May 23, 2011 5/23/2011 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor or computer to complete
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationCUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD
More informationFoundations of Artificial Intelligence
Foundations of Artificial Intelligence 45. AlphaGo and Outlook Malte Helmert and Gabriele Röger University of Basel May 22, 2017 Board Games: Overview chapter overview: 40. Introduction and State of the
More informationState Space Search. Many problems can be represented as a set of states and a set of rules of how one state is transformed to another.
State Space Search Many problems can be represented as a set of states and a set of rules of how one state is transformed to another. The problem is how to reach a particular goal state, starting from
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationComputer parallelism Flynn s categories
04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories
More informationCMU-Q Lecture 2: Search problems Uninformed search. Teacher: Gianni A. Di Caro
CMU-Q 15-381 Lecture 2: Search problems Uninformed search Teacher: Gianni A. Di Caro RECAP: ACT RATIONALLY Think like people Think rationally Agent Sensors? Actuators Percepts Actions Environment Act like
More informationClick to edit Master title style Approximate Models for Batch RL Click to edit Master subtitle style Emma Brunskill 2/18/15 2/18/15 1 1
Approximate Click to edit Master titlemodels style for Batch RL Click to edit Emma Master Brunskill subtitle style 11 FVI / FQI PI Approximate model planners Policy Iteration maintains both an explicit
More informationMulti-Processors and GPU
Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More information44.1 Introduction Introduction. Foundations of Artificial Intelligence Monte-Carlo Methods Sparse Sampling 44.4 MCTS. 44.
Foundations of Artificial ntelligence May 27, 206 44. : ntroduction Foundations of Artificial ntelligence 44. : ntroduction Thomas Keller Universität Basel May 27, 206 44. ntroduction 44.2 Monte-Carlo
More informationPotential Midterm Exam Questions
Potential Midterm Exam Questions 1. What are the four ways in which AI is usually viewed? Which of the four is the preferred view of the authors of our textbook? 2. What does each of the lettered items
More informationTDT4260/DT8803 COMPUTER ARCHITECTURE EXAM
Norwegian University of Science and Technology Department of Computer and Information Science Page 1 of 13 Contact: Magnus Jahre (952 22 309) TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Monday 4. June Time:
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationHigh Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA
High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it
More informationGPU programming. Dr. Bernhard Kainz
GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling
More informationIntroduction to GPU programming with CUDA
Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU
More informationEE/CSCI 451 Midterm 1
EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming
More informationAutomatic Compiler-Based Optimization of Graph Analytics for the GPU. Sreepathi Pai The University of Texas at Austin. May 8, 2017 NVIDIA GTC
Automatic Compiler-Based Optimization of Graph Analytics for the GPU Sreepathi Pai The University of Texas at Austin May 8, 2017 NVIDIA GTC Parallel Graph Processing is not easy 299ms HD-BFS 84ms USA Road
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationCartoon parallel architectures; CPUs and GPUs
Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD
More informationGPU Background. GPU Architectures for Non-Graphics People. David Black-Schaffer David Black-Schaffer 1
GPU Architectures for Non-Graphics People GPU Background David Black-Schaffer david.black-schaffer@it.uu.se David Black-Schaffer 1 David Black-Schaffer 2 GPUs: Architectures for Drawing Triangles Fast!
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationUninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall
Midterm Exam Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall Covers topics through Decision Trees and Random Forests (does not include constraint satisfaction) Closed book 8.5 x 11 sheet with notes
More informationParallel Programming Programowanie równoległe
Parallel Programming Programowanie równoległe Lecture 1: Introduction. Basic notions of parallel processing Paweł Rzążewski Grading laboratories (4 tasks, each for 3-4 weeks) total 50 points, final test
More information10th August Part One: Introduction to Parallel Computing
Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer
More informationIntroduction to High Performance Computing
Introduction to High Performance Computing Gregory G. Howes Department of Physics and Astronomy University of Iowa Iowa High Performance Computing Summer School University of Iowa Iowa City, Iowa 25-26
More informationUnit 9 : Fundamentals of Parallel Processing
Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing
More informationParallel Monte-Carlo Tree Search
Parallel Monte-Carlo Tree Search Guillaume M.J-B. Chaslot, Mark H.M. Winands, and H. Jaap van den Herik Games and AI Group, MICC, Faculty of Humanities and Sciences, Universiteit Maastricht, Maastricht,
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationMonte Carlo Methods; Combinatorial Search
Monte Carlo Methods; Combinatorial Search Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 22, 2012 CPD (DEI / IST) Parallel and
More informationParallel Monte Carlo Tree Search from Multi-core to Many-core Processors
Parallel Monte Carlo Tree Search from Multi-core to Many-core Processors S. Ali Mirsoleimani, Aske Plaat, Jaap van den Herik and Jos Vermaseren Leiden Centre of Data Science, Leiden University Niels Bohrweg
More informationProcessor Architecture and Interconnect
Processor Architecture and Interconnect What is Parallelism? Parallel processing is a term used to denote simultaneous computation in CPU for the purpose of measuring its computation speeds. Parallel Processing
More informationOptimized Scientific Computing:
Optimized Scientific Computing: Coding Efficiently for Real Computing Architectures Noah Kurinsky SASS Talk, November 11 2015 Introduction Components of a CPU Architecture Design Choices Why Is This Relevant
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationComputer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15: Dataflow
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationCME 213 SPRING Eric Darve
CME 213 SPRING 2017 Eric Darve MPI SUMMARY Point-to-point and collective communications Process mapping: across nodes and within a node (socket, NUMA domain, core, hardware thread) MPI buffers and deadlocks
More informationArtificial Intelligence CS 6364
Artificial Intelligence CS 6364 Professor Dan Moldovan Section 4 Informed Search and Adversarial Search Outline Best-first search Greedy best-first search A* search Heuristics revisited Minimax search
More informationLECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS
Department of Computer Science University of Babylon LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS By Faculty of Science for Women( SCIW), University of Babylon, Iraq Samaher@uobabylon.edu.iq
More informationParallel Computing Introduction
Parallel Computing Introduction Bedřich Beneš, Ph.D. Associate Professor Department of Computer Graphics Purdue University von Neumann computer architecture CPU Hard disk Network Bus Memory GPU I/O devices
More informationtree follows. Game Trees
CPSC-320: Intermediate Algorithm Design and Analysis 113 On a graph that is simply a linear list, or a graph consisting of a root node v that is connected to all other nodes, but such that no other edges
More informationHigh Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem
High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem Kamil Rocki, PhD Department of Computer Science Graduate School of Information Science and Technology The University of
More informationParallel Programming. Parallel algorithms Combinatorial Search
Parallel Programming Parallel algorithms Combinatorial Search Some Combinatorial Search Methods Divide and conquer Backtrack search Branch and bound Game tree search (minimax, alpha-beta) 2010@FEUP Parallel
More informationChapter 2 Classical algorithms in Search and Relaxation
Chapter 2 Classical algorithms in Search and Relaxation Chapter 2 overviews topics on the typical problems, data structures, and algorithms for inference in hierarchical and flat representations. Part
More informationProgrammable Graphics Hardware (GPU) A Primer
Programmable Graphics Hardware (GPU) A Primer Klaus Mueller Stony Brook University Computer Science Department Parallel Computing Explained video Parallel Computing Explained Any questions? Parallelism
More information