Region-Based Memory Management for Task Dataflow Models


1 Nikolopoulos, D. (2012). Region-Based Memory Management for Task Dataflow Models. First Joint ENCORE & PEPPHER Workshop on Programmability and Portability for Emerging Architectures (EPoPPEA), held in conjunction with the 7th International Conference on High Performance and Embedded Architectures and Compilers (HiPEAC). Queen's University Belfast Research Portal.

2 Region-Based Memory Management for Task Dataflow Models
Dimitrios S. Nikolopoulos
Joint work with: Spyros Lyberis and Polyvios Pratikakis
Computer Architecture and VLSI Systems Laboratory (CARV)
Institute of Computer Science (ICS)
Foundation for Research and Technology Hellas (FORTH)
EPoPPEA, January 24, 2012

3 Outline
1. Introduction (Contribution)
2. Memory Management with Nested Regions
3. Task and Region Semantics (By example; Formal definition)
4. Distributed Memory Regions
5. Early Experimental Results
6. Conclusions

4 The Problem: Manycore Processors and Shared Memory
Shared memory can be hard to scale to many cores
  Cache coherence becomes expensive
  Causes excessive communication and memory contention
Manycore processors relax the shared-memory abstraction
  AMD Opteron: 12 cores, NUMA, MMU per core
  Cell Broadband Engine: 8+1 cores, no shared memory
  GPUs: thousands of cores, small shared memories between few cores
  Intel Single-Chip Cloud Computer (SCC): 48 cores, no shared memory
Increased demand for parallel software development
  Threads and shared memory: accessible, but error-prone
  Message passing: tedious, experts-only
  Both are non-deterministic: difficult to debug and understand

5 Task Parallelism for Distributed and Shared Memory
Available in several contemporary programming models (Sequoia, Cilk, OMPSs)
Recursive task parallelism
  Parallel programs are composed from nested parallel tasks
  Annotate each task with its memory footprint
  The runtime system performs dependency analysis, scheduling, and all required data transfers
Support for region-based memory management
  Easier to express irregular task footprints
  Enables the use of dynamic data structures
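When the footprint is a few contiguous blocks, the annotation is straightforward. The following is a minimal sketch of the idea using OpenMP 4.0 task depend clauses as a stand-in for the footprint annotations of the models above (the deck's own spawn/footprint syntax appears in the later examples); the producer/reduce functions are made up for illustration.

#include <stdio.h>

#define N 1024

static void produce(double *a, int n) {
    for (int i = 0; i < n; i++) a[i] = i;
}

static double reduce(const double *a, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

int main(void) {
    double a[N];
    double sum = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Each task names the contiguous data it reads or writes; the runtime
           derives the producer/consumer dependence from these footprints. */
        #pragma omp task depend(out: a[0:N]) shared(a)
        produce(a, N);

        #pragma omp task depend(in: a[0:N]) shared(a, sum)
        sum = reduce(a, N);
    }   /* implicit barrier: both tasks have completed here */

    printf("sum = %f\n", sum);
    return 0;
}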

6 Benefits
Shared-memory abstraction
  Resembles shared-memory programming
  The runtime system hides data transfers among core memories
  Memory footprints are used to transfer the required data locally before starting a task
  Alternatively, schedule the task near its data
Message-passing execution semantics
  Tasks compute on local data
  Hierarchical scheduling, easier to scale to more cores
  Removes the shared-memory bottleneck
  No need for a complicated SDSM protocol on every access
Deterministic
  Implicit, correct synchronization
  Repeatable behavior
  Formal proof

8 Motivation for Regions
Task memory footprints are easy to express if they are composed of a few objects, contiguous in memory
What if the task memory footprint is:
  A linked list, or part of it?
  A tree, or part of it?
  A graph, or part of it?
Hard to use many common linked data structures
Hard to express irregular algorithms and applications

9 Example: Region-based Memory Management

region G, L, R;
Object *v0;

init () {
    G = newregion();
    L = newsubregion(G);
    R = newsubregion(G);
    v0        = ralloc(G, sizeof(Object));
    v0->left  = ralloc(L, sizeof(Object));
    v0->right = ralloc(R, sizeof(Object));
}

[Figure: region G contains subregions L and R. Object v0 is allocated in G, its left child v1 in region L, and its right child v2 in region R.]
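A self-contained sketch of how such a region API can be implemented on a single node: each region keeps a list of its allocations and of its subregions, so deleting a region reclaims everything inside it at once. This is an illustrative emulation on top of malloc, not the Myrmics implementation; the function names follow the example above, and delregion is a hypothetical name for the bulk-free operation.

#include <stdlib.h>

typedef struct block  { struct block *next; } block_t;
typedef struct region { struct region *child, *sibling; block_t *blocks; } region_t;

/* Create a new top-level region. */
region_t *newregion(void) {
    return calloc(1, sizeof(region_t));
}

/* Create a region nested inside parent; it is reclaimed when parent is. */
region_t *newsubregion(region_t *parent) {
    region_t *r = newregion();
    r->sibling = parent->child;
    parent->child = r;
    return r;
}

/* Allocate an object inside region r; a pointer-sized header links it to r
   (pointer alignment is assumed adequate for this sketch). */
void *ralloc(region_t *r, size_t size) {
    block_t *b = malloc(sizeof(block_t) + size);
    b->next = r->blocks;
    r->blocks = b;
    return b + 1;            /* payload starts right after the header */
}

/* Delete a region: recursively frees all subregions and every object
   allocated in the region, in a single operation. */
void delregion(region_t *r) {
    for (region_t *c = r->child; c != NULL; ) {
        region_t *next = c->sibling;
        delregion(c);
        c = next;
    }
    for (block_t *b = r->blocks; b != NULL; ) {
        block_t *next = b->next;
        free(b);
        b = next;
    }
    free(r);
}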

20 Example: Tasks and Regions

f(Object *v0) {
    if (done) return;
    spawn f(v0->left)  [inout region L];
    spawn f(v0->right) [inout region R];
    spawn h(v0->left)  [inout v0->left];
    spawn h(v0->right) [inout v0->right];
}

[Figure: task f[G] runs on region G and spawns f[L] on region L, f[R] on region R, and h[v1], h[v2] on the individual objects v1 and v2; a spawned task's footprint determines which earlier tasks it must wait for.]
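A region footprint stands for every object allocated in that region or in any of its subregions, which is what makes h[v1] above wait for f[L]: the object v1 lives inside region L. Below is a minimal sketch of this containment test, assuming each object records its home region and each region its parent; this metadata layout is hypothetical, not the runtime's actual one.

#include <stdbool.h>

typedef struct region { struct region *parent; } region_t;
typedef struct object { region_t *home; /* payload omitted */ } object_t;

/* Is region r the region anc itself, or nested (transitively) inside it? */
static bool region_within(const region_t *r, const region_t *anc) {
    for (; r != NULL; r = r->parent)
        if (r == anc)
            return true;
    return false;
}

/* An object footprint overlaps a region footprint when the object's home
   region lies in the named region's subtree; overlapping footprints are
   serialized in spawn order. */
bool object_in_region_footprint(const object_t *o, const region_t *r) {
    return region_within(o->home, r);
}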

26 Deterministic Scheduler: Operational Semantics
Define a small-step operational semantics: T, D, S, R →_p T', D', S', R'
Memory model:
  Global address space: the store S includes all memory addresses
  Distributed memory implementation: each task only accesses memory locations declared in its footprint
Scheduling algorithm:
  Dependency metadata D maintains a task queue per memory location
  It is always possible to spawn a task
  But a task can only run when it is at the head of the queue for all locations in its footprint
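A minimal sketch of this queue discipline, assuming a FIFO of pending tasks per memory location (hypothetical structures, not the runtime's metadata): spawning appends the task to the queue of every location in its footprint, and the task becomes runnable only once it is at the head of all of them.

#include <stdbool.h>
#include <stddef.h>

#define QCAP 64                       /* illustrative fixed capacity */

typedef struct task task_t;

/* One FIFO of pending tasks per memory location that appears in a footprint. */
typedef struct {
    task_t *fifo[QCAP];
    size_t  head, tail;               /* tail - head = number of queued tasks */
} loc_queue_t;

/* Spawning always succeeds: append the task to every location it touches. */
void spawn_enqueue(task_t *t, loc_queue_t *locs[], size_t nlocs) {
    for (size_t i = 0; i < nlocs; i++) {
        loc_queue_t *q = locs[i];
        q->fifo[q->tail++ % QCAP] = t;
    }
}

/* The task may run only when it is at the head of the queue of every
   location in its footprint. */
bool can_run(const task_t *t, loc_queue_t *const locs[], size_t nlocs) {
    for (size_t i = 0; i < nlocs; i++) {
        const loc_queue_t *q = locs[i];
        if (q->head == q->tail || q->fifo[q->head % QCAP] != t)
            return false;
    }
    return true;
}

/* When the task finishes, pop it from each of its queues. */
void retire(loc_queue_t *locs[], size_t nlocs) {
    for (size_t i = 0; i < nlocs; i++)
        locs[i]->head++;
}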

27 Determinism Proof Technique (Pratikakis et al., MSPC '11)
Define a sequential operational semantics: S, e →_s S', e'
Prove sequential equivalence on execution traces
  Intuitively: every parallel execution produces the same value and memory state as the sequential execution
  More precisely: given a program e and a finite (terminating) parallel execution trace for e that produces a value v and a memory state S, we can always construct a sequential execution trace for e that also returns v and produces S
Proof by induction on the parallel trace
  We can always reorder steps in the parallel trace to bring the next sequential step to the front of the trace
  The proof does not require scoped parallelism (sync)
  Similar to a confluence proof
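Restated with the two semantics side by side (a paraphrase of the statement above, not the exact formulation of the MSPC '11 paper):

% Parallel configurations step as  T, D, S, R ->_p T', D', S', R'
% Sequential configurations step as  S, e ->_s S', e'
\begin{theorem}[Sequential equivalence, paraphrased]
  For every program $e$ and every finite parallel trace
  $\langle T_0, D_0, S_0, R_0\rangle \rightarrow_p \cdots \rightarrow_p
   \langle T_n, D_n, S_n, R_n\rangle$
  of $e$ that terminates with value $v$ and store $S$, there exists a
  sequential trace
  $\langle S_0, e\rangle \rightarrow_s \cdots \rightarrow_s \langle S, v\rangle$
  producing the same value $v$ and the same store $S$.
\end{theorem}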

29 Distributed Memory Management API
Object allocation:
  void *sys_alloc(size_t size, rid_t region);
  void sys_balloc(size_t size, rid_t region, int num_objects, void **objects);
  void sys_free(void *ptr);
  void *sys_realloc(void *ptr, size_t size, rid_t rgn);
Region allocation:
  rid_t sys_ralloc(rid_t parent, char level_hint);
  void sys_rfree(rid_t region);
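A hedged usage sketch of this API, recreating the earlier tree example. The rid_t definition, the level_hint value of 0, and the caller-supplied parent region are assumptions made for illustration; the signatures are the ones listed above.

#include <stddef.h>

typedef int rid_t;                                 /* assumed handle type */

/* Signatures as listed above, assumed to come from the runtime header. */
void  *sys_alloc(size_t size, rid_t region);
rid_t  sys_ralloc(rid_t parent, char level_hint);
void   sys_free(void *ptr);
void   sys_rfree(rid_t region);

typedef struct Object { struct Object *left, *right; } Object;

void build_tree(rid_t parent /* assumed: region handle obtained elsewhere */) {
    rid_t G = sys_ralloc(parent, 0);               /* region G */
    rid_t L = sys_ralloc(G, 0);                    /* subregion L of G */
    rid_t R = sys_ralloc(G, 0);                    /* subregion R of G */

    Object *v0 = sys_alloc(sizeof(Object), G);     /* v0 lives in G */
    v0->left   = sys_alloc(sizeof(Object), L);     /* v1 lives in L */
    v0->right  = sys_alloc(sizeof(Object), R);     /* v2 lives in R */

    sys_rfree(G);                                  /* reclaims v0, v1, v2, L, R */
}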

30 Distributed Memory Implementation by Example

A = malloc(10);
B = malloc(20);
C = malloc(5);
D = malloc(80);

# task in A, out B
task1(A, B);

# task in C, out D
task2(C, D);

# task in B, inout D
task3(B, D);

task2(C, D) {
    ...
    # task inout D
    task4(D);
    # wait on D
}

[Figure: the slides step through this program on three cores. Initially core0 owns A and C, core1 owns B, and core2 owns D. Each spawned task's arguments are entered into the alloc() hash, a dep waiting counter tracks arguments not yet available, and tasks whose counter reaches zero enter the ready queue. Ownership follows execution: A is fetched to core1 (which holds B) to run task1, C to core2 (which holds D) to run task2, and B later to core2 to run task3.]

39 Distributed Memory Implementation by Example: task pipeline
Decode:  Enqueue the task arguments into the alloc() hash (from the left/right side); count the non-owned arguments in dep waiting.
Issue:   If dep waiting reaches 0, put the task in the ready queue along with its affinity statistics.
Prepare: Decide where to run; fetch non-local data.
Run:     Execute the task code when the fetch has finished.
Collect: Update ownership in the alloc() hash; update dep waiting for these objects; possibly go back to Issue.
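A minimal sketch of the Issue and Collect bookkeeping described above, assuming a per-task counter of footprint arguments that are not yet available (the dep waiting count) and a simple ready list; the alloc() hash, data fetches, and affinity statistics are left out.

#include <stddef.h>

typedef struct task {
    int dep_waiting;                  /* footprint arguments not yet available */
    struct task *next_ready;
} task_t;

static task_t *ready_list;            /* tasks whose footprints are satisfied */

static void make_ready(task_t *t) {
    t->next_ready = ready_list;
    ready_list = t;
}

/* Issue: after decoding, record how many arguments are not yet owned locally
   (or not yet produced); a task with nothing to wait for is ready at once. */
void issue(task_t *t, int unavailable_args) {
    t->dep_waiting = unavailable_args;
    if (t->dep_waiting == 0)
        make_ready(t);
}

/* Collect: when a finished task releases an object, every task waiting on
   that object drops its counter; tasks reaching zero move to the ready list. */
void collect(task_t *waiters[], size_t nwaiters) {
    for (size_t i = 0; i < nwaiters; i++) {
        if (--waiters[i]->dep_waiting == 0)
            make_ready(waiters[i]);
    }
}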

41 Hierarchical scheduler topology
[Figure: a tree of schedulers, with a single level-0 scheduler at the root, level-1 schedulers beneath it, level-2 schedulers (2A-2F) below those, and workers at level 3.]

42 Splitting region trees for hierarchical scheduling
[Figure: a region tree (root region with subregions L, M, R; children L1, L2, Ma, Mb, R1, Ra; grandchildren Ma1, Mb1, Mb2, Ra1; objects at the leaves) is split across the scheduler hierarchy: the L0 scheduler owns the upper part of the tree, while L1 and L2 schedulers own the subtrees beneath it.]

43 Barnes-Hut, 1.1M bodies, single scheduler
[Figure: execution time (seconds) versus number of workers, broken down into computation, worker communication, scheduler communication, and barrier wait, for the simulate, load-balancing, new-local-tree, and fetch-remote-tree phases; the serial time is shown for reference.]

44 Barnes-Hut, 1.1M bodies, two-level scheduler hierarchy
[Figure: execution time (seconds) versus number of workers (number of L1 schedulers) from 1(1), 2(1), 4(1), 8(1), 16(2), 32(4), 64(8), 128(16), 256(32) up to 512(64), with the same time breakdown as the single-scheduler plot; the serial time is shown for reference.]

45 UPC comparison, 16 worker cores
[Figure: execution time (seconds) versus number of objects for UPC and Myrmics, with 248-B, 504-B, 1016-B, and 2040-B objects.]

46 UPC comparison, 15, B objects
[Figure: execution time (seconds) versus number of workers for UPC, Myrmics with a single scheduler, and Myrmics with hierarchical schedulers.]

48 Conclusions
Regions as a language construct:
  Arbitrary task memory footprints, better productivity
  Shared-memory programming abstractions
  Distributed-memory execution semantics
  Scalability
Work in progress:
  Distributed-memory task-based runtime (Myrmics)
  Compiler support for regions (SCOOP)
  Implementation and testing on an FPGA prototype (Formic)

49 CARV (FORTH-ICS): Formic 512-core FPGA prototype
