Region-Based Memory Management for Task Dataflow Models
1 Region-Based Memory Management for Task Dataflow Models
Nikolopoulos, D. (2012). Region-Based Memory Management for Task Dataflow Models. First Joint ENCORE & PEPPHER Workshop on Programmability and Portability for Emerging Architectures (EPoPPEA), held in conjunction with the 7th International Conference on High Performance and Embedded Architectures and Compilers (HiPEAC). Queen's University Belfast Research Portal.
2 Region-Based Memory Management for Task Dataflow Models Dimitrios S. Nikolopoulos Joint work with: Spyros Lyberis and Polyvios Pratikakis Computer Architecture and VLSI Systems Laboratory (CARV) Institute of Computer Science (ICS) Foundation for Research and Technology Hellas (FORTH) EPoPPEA, January 24, 2012 Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA / 25
3 Outline
1. Introduction: Contribution
2. Memory Management with Nested Regions
3. Task and Region Semantics: By example; Formal definition
4. Distributed Memory Regions
5. Early Experimental Results
6. Conclusions
4 The Problem: Manycore Processors and Shared Memory
Shared memory can be hard to scale to many cores: cache coherence becomes expensive, causing excessive communication and memory contention.
Manycore processors that relax the shared-memory abstraction:
- AMD Opteron: 12 cores, NUMA, MMU per core
- Cell Broadband Engine: cores, no shared memory
- GPUs: thousands of cores, small shared memories between few cores
- Intel Single Chip Cloud: 48 cores, no shared memory
Increased demand for parallel software development:
- Threads and shared memory: accessible, but error-prone
- Message passing: tedious, experts-only
- Both are non-deterministic: difficult to debug and understand
5 Task-parallelism for distributed and shared memory
Available in several contemporary programming models (Sequoia, Cilk, OmpSs).
Recursive task-parallelism:
- Parallel programs are composed from nested parallel tasks
- Each task is annotated with its memory footprint
- The runtime system performs dependency analysis, scheduling, and all required data transfers
Support for region-based memory management:
- Easier to express irregular task footprints
- Enables the use of dynamic data structures
6 Benefits
Shared-memory abstraction:
- Resembles shared-memory programming
- The runtime system hides data transfers among core memories
- Memory footprints let the runtime transfer the required data locally before starting a task, or, alternatively, schedule the task near its data
Message-passing execution semantics:
- Tasks compute on local data
- Hierarchical scheduling, easier to scale to more cores
- Removes the shared-memory bottleneck: no need for a complicated SDSM protocol on every access
Deterministic:
- Implicit, correct synchronization
- Repeatable behavior
- Formal proof
8 Motivation for Regions
Task memory footprints are easy to express if they are composed of a few objects, contiguous in memory. But what if the task memory footprint is:
- A linked list, or part of it?
- A tree, or part of it?
- A graph, or part of it?
Without that, it is hard to use many common linked data structures, and hard to express irregular algorithms and applications.
9 Example: Region-based Memory Management

region G, L, R;
Object v0;

init () {
    G = newregion();
    L = newsubregion(G);
    R = newsubregion(G);
    v0        = ralloc(G, sizeof(Object));
    v0->left  = ralloc(L, sizeof(Object));  /* v1, placed in region L */
    v0->right = ralloc(R, sizeof(Object));  /* v2, placed in region R */
}

The slides build up the corresponding picture step by step: region G contains subregions L and R; object v0 lives directly in G, while v1 (v0->left) lives in L and v2 (v0->right) lives in R.
20 Example: Tasks and Regions

f(object v0) {
    if (done) return;
    spawn f(v0->left)  [inout region L];
    spawn f(v0->right) [inout region R];
    spawn h(v0->left)  [inout v0->left];
    spawn h(v0->right) [inout v0->right];
}

The slides animate the spawned task instances over the region tree from the previous example: f[G] runs on the whole of region G, f[L] and f[R] on subregions L and R, and h[v1] and h[v2] on the individual objects v1 and v2.
26 Deterministic Scheduler: Operational Semantics
Define a small-step operational semantics ⟨T, D, S, R⟩ →p ⟨T′, D′, S′, R′⟩.
Memory model:
- Global address space: the store S includes all memory addresses
- Distributed-memory implementation: each task only accesses the memory locations declared in its footprint
Scheduling algorithm:
- The dependency metadata D maintains a task queue per memory location
- It is always possible to spawn a task
- But a task can only run when it is at the head of the queue for all locations in its footprint
27 Deterministic: Proof Technique (Pratikakis et al., MSPC '11)
- Define a sequential operational semantics ⟨S, e⟩ →s ⟨S′, e′⟩
- Prove sequential equivalence on execution traces
- Intuitively: every parallel execution produces the same value and memory state as the sequential execution
- More precisely: given a program e and a finite (terminating) parallel execution trace for e that produces a value v and a memory state S, we can always construct a sequential execution trace for e that also returns v and produces S
- Proof by induction on the parallel trace: we can always reorder steps in the parallel trace to bring the next sequential step to the front
- The proof does not require scoped parallelism (sync); it is similar to a confluence proof
29 Distributed Memory Management API

Object allocation:
    void *sys_alloc(size_t size, rid_t region);
    void  sys_balloc(size_t size, rid_t region, int num_objects, void **objects);
    void  sys_free(void *ptr);
    void *sys_realloc(void *ptr, size_t size, rid_t rgn);

Region allocation:
    rid_t sys_ralloc(rid_t parent, char level_hint);
    void  sys_rfree(rid_t region);
30 Distributed Memory Implementation by Example

A = malloc(10);
B = malloc(20);
C = malloc(5);
D = malloc(80);

# task in A, out B
task1(A, B);
# task in C, out D
task2(C, D);
# task in B, inout D
task3(B, D);

task2(C, D) {
    ...
    # task inout D
    task4(D);
    # wait on D
}

The slides step through the runtime state for this program: an alloc() hash records each object's size and owning core, a dep-waiting counter tracks each task's outstanding dependencies, and a ready queue holds runnable tasks, while the objects A-D migrate between core0, core1, and core2 as tasks are scheduled near their data.

Task lifecycle:
- Decode: enqueue the task arguments into the alloc() hash from the left/right side
- Issue: count the non-ownership arguments in dep-waiting; when dep-waiting reaches 0, put the task in the ready queue along with affinity stats
- Prepare: decide where to run; fetch non-local data
- Run: execute the code when the fetch is finished
- Collect: update alloc() hash ownership and the dep-waiting counts for these objects; maybe go back to Issue
41 Hierarchical schedulers topology
[Figure: a scheduler tree with a level-0 root scheduler, level-1 schedulers below it, level-2 schedulers (2A-2F) below those, and workers at level 3.]
42 Splitting region trees for hierarchical scheduling
[Figure: a region tree (root region with children L, M, R; subregions L1, L2, Ma, Mb, R1, Ra; further subregions Ma1, Mb1, Mb2, Ra1; objects at the leaves) split across schedulers: the L0 scheduler owns the top of the tree, while L1 and L2 schedulers each own a subtree.]
43 Barnes-Hut, 1.1M bodies, single scheduler
[Figure: execution time (seconds) vs. number of workers, broken down into computation, worker communication, scheduler communication, and barrier wait, across the simulate, load-balancing, new-local-tree, fetch-remote-tree, and serial phases.]
44 Barnes-Hut, 1.1M bodies, two-level scheduler hierarchy
[Figure: execution time (seconds) vs. number of workers from 1(1) to 512(64), where the parenthesized value is the number of L1 schedulers, with the same computation/communication/barrier-wait breakdown across the same phases.]
45 UPC comparison, 16 worker cores
[Figure: execution time (seconds) vs. number of objects, comparing UPC and Myrmics at object sizes of 248 B, 504 B, 1016 B, and 2040 B.]
46 UPC comparison, 15, B objects
[Figure: execution time (seconds) vs. number of workers for UPC, Myrmics with a single scheduler, and Myrmics with hierarchical schedulers.]
48 Conclusions
Regions as a language construct give:
- arbitrary task memory footprints and better productivity
- shared-memory programming abstractions
- distributed-memory execution semantics
- scalability
Work in progress:
- Distributed-memory task-based runtime (Myrmics)
- Compiler support for regions (SCOOP)
- Implementation and testing on an FPGA prototype (Formic)
Acknowledgments:
49 CARV (FORTH-ICS): Formic 512-core FPGA Prototype
More informationDynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle
Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation
More informationGrassroots ASPLOS. can we still rethink the hardware/software interface in processors? Raphael kena Poss University of Amsterdam, the Netherlands
Grassroots ASPLOS can we still rethink the hardware/software interface in processors? Raphael kena Poss University of Amsterdam, the Netherlands ASPLOS-17 Doctoral Workshop London, March 4th, 2012 1 Current
More informationIntel Thread Building Blocks
Intel Thread Building Blocks SPD course 2017-18 Massimo Coppola 23/03/2018 1 Thread Building Blocks : History A library to simplify writing thread-parallel programs and debugging them Originated circa
More informationSequential Consistency & TSO. Subtitle
Sequential Consistency & TSO Subtitle Core C1 Core C2 data = 0, 1lag SET S1: store data = NEW S2: store 1lag = SET L1: load r1 = 1lag B1: if (r1 SET) goto L1 L2: load r2 = data; Will r2 always be set to
More informationOpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs
www.bsc.es OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs Hugo Pérez UPC-BSC Benjamin Hernandez Oak Ridge National Lab Isaac Rudomin BSC March 2015 OUTLINE
More informationLecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)
Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1) 1 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed the misses for an infinite cache
More informationShared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP
Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor
More informationCILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY
CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY 1 OUTLINE CILK and CILK++ Language Features and Usages Work stealing runtime CILK++ Reducers Conclusions 2 IDEALIZED SHARED MEMORY ARCHITECTURE Hardware
More informationApplying Models of Computation to OpenCL Pipes for FPGA Computing. Nachiket Kapre + Hiren Patel
Applying Models of Computation to OpenCL Pipes for FPGA Computing Nachiket Kapre + Hiren Patel nachiket@uwaterloo.ca Outline Models of Computation and Parallelism OpenCL code samples Synchronous Dataflow
More informationMulti Agent Navigation on GPU. Avi Bleiweiss
Multi Agent Navigation on GPU Avi Bleiweiss Reasoning Explicit Implicit Script, storytelling State machine, serial Compute intensive Fits SIMT architecture well Navigation planning Collision avoidance
More informationEnergy Efficiency through Significance-Based Computing
Energy Efficiency through Significance-Based Computing Nikolopoulos, D. S., Vandierendonck, H., Bellas, N., Antonopoulos, C. D., Lalis, S., Karakonstantis, G.,... Naumann, U. (2014). Energy Efficiency
More informationMemory Hierarchies && The New Bottleneck == Cache Conscious Data Access. Martin Grund
Memory Hierarchies && The New Bottleneck == Cache Conscious Data Access Martin Grund Agenda Key Question: What is the memory hierarchy and how to exploit it? What to take home How is computer memory organized.
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationCS 5220: Shared memory programming. David Bindel
CS 5220: Shared memory programming David Bindel 2017-09-26 1 Message passing pain Common message passing pattern Logical global structure Local representation per processor Local data may have redundancy
More informationFoundations of the C++ Concurrency Memory Model
Foundations of the C++ Concurrency Memory Model John Mellor-Crummey and Karthik Murthy Department of Computer Science Rice University johnmc@rice.edu COMP 522 27 September 2016 Before C++ Memory Model
More informationNOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.
Memory Consistency Model Background for Debate on Memory Consistency Models CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley for a SAS specifies constraints on the order in which
More informationIBM Power Multithreaded Parallelism: Languages and Compilers. Fall Nirav Dave
6.827 Multithreaded Parallelism: Languages and Compilers Fall 2006 Lecturer: TA: Assistant: Arvind Nirav Dave Sally Lee L01-1 IBM Power 5 130nm SOI CMOS with Cu 389mm 2 2GHz 276 million transistors Dual
More informationLecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)
Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,
More informationBarbara Chapman, Gabriele Jost, Ruud van der Pas
Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology
More informationShared memory programming model OpenMP TMA4280 Introduction to Supercomputing
Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started
More informationDesigning Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve
Designing Memory Consistency Models for Shared-Memory Multiprocessors Sarita V. Adve Computer Sciences Department University of Wisconsin-Madison The Big Picture Assumptions Parallel processing important
More informationMessage-Passing Shared Address Space
Message-Passing Shared Address Space 2 Message-Passing Most widely used for programming parallel computers (clusters of workstations) Key attributes: Partitioned address space Explicit parallelization
More informationCall Paths for Pin Tools
, Xu Liu, and John Mellor-Crummey Department of Computer Science Rice University CGO'14, Orlando, FL February 17, 2014 What is a Call Path? main() A() B() Foo() { x = *ptr;} Chain of function calls that
More informationModel-Driven Verifying Compilation of Synchronous Distributed Applications
Model-Driven Verifying Compilation of Synchronous Distributed Applications Sagar Chaki, James Edmondson October 1, 2014 MODELS 14, Valencia, Spain Copyright 2014 Carnegie Mellon University This material
More informationGLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)
DE: A Scalable Framework for Efficient Analytics Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) Big Data Analytics Big Data Storage is cheap ($100 for 1TB disk) Everything
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationStream Computing using Brook+
Stream Computing using Brook+ School of Electrical Engineering and Computer Science University of Central Florida Slides courtesy of P. Bhaniramka Outline Overview of Brook+ Brook+ Software Architecture
More informationMultithreaded Programming in. Cilk LECTURE 1. Charles E. Leiserson
Multithreaded Programming in Cilk LECTURE 1 Charles E. Leiserson Supercomputing Technologies Research Group Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
More informationPRACE Autumn School Basic Programming Models
PRACE Autumn School 2010 Basic Programming Models Basic Programming Models - Outline Introduction Key concepts Architectures Programming models Programming languages Compilers Operating system & libraries
More informationEfficient Smith-Waterman on multi-core with FastFlow
BioBITs Euromicro PDP 2010 - Pisa Italy - 17th Feb 2010 Efficient Smith-Waterman on multi-core with FastFlow Marco Aldinucci Computer Science Dept. - University of Torino - Italy Massimo Torquati Computer
More informationPOSIX Threads and OpenMP tasks
POSIX Threads and OpenMP tasks Jimmy Aguilar Mena February 16, 2018 Introduction Pthreads Tasks Two simple schemas Independent functions # include # include void f u n c t i
More informationOpenMP 3.0 Tasking Implementation in OpenUH
Open64 Workshop @ CGO 09 OpenMP 3.0 Tasking Implementation in OpenUH Cody Addison Texas Instruments Lei Huang University of Houston James (Jim) LaGrone University of Houston Barbara Chapman University
More informationWritten Presentation: JoCaml, a Language for Concurrent Distributed and Mobile Programming
Written Presentation: JoCaml, a Language for Concurrent Distributed and Mobile Programming Nicolas Bettenburg 1 Universitaet des Saarlandes, D-66041 Saarbruecken, nicbet@studcs.uni-sb.de Abstract. As traditional
More informationto automatically generate parallel code for many applications that periodically update shared data structures using commuting operations and/or manipu
Semantic Foundations of Commutativity Analysis Martin C. Rinard y and Pedro C. Diniz z Department of Computer Science University of California, Santa Barbara Santa Barbara, CA 93106 fmartin,pedrog@cs.ucsb.edu
More informationTable of Contents. Cilk
Table of Contents 212 Introduction to Parallelism Introduction to Programming Models Shared Memory Programming Message Passing Programming Shared Memory Models Cilk TBB HPF Chapel Fortress Stapl PGAS Languages
More informationTask-parallel reductions in OpenMP and OmpSs
Task-parallel reductions in OpenMP and OmpSs Jan Ciesko 1 Sergi Mateo 1 Xavier Teruel 1 Vicenç Beltran 1 Xavier Martorell 1,2 1 Barcelona Supercomputing Center 2 Universitat Politècnica de Catalunya {jan.ciesko,sergi.mateo,xavier.teruel,
More informationThe Hierarchical SPMD Programming Model
The Hierarchical SPMD Programming Model Amir Ashraf Kamil Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2011-28 http://www.eecs.berkeley.edu/pubs/techrpts/2011/eecs-2011-28.html
More informationScalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009
Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Reasoning Explicit State machine, serial Implicit Compute intensive Fits SIMT well Collision avoidance Motivation
More informationTypes. Type checking. Why Do We Need Type Systems? Types and Operations. What is a type? Consensus
Types Type checking What is a type? The notion varies from language to language Consensus A set of values A set of operations on those values Classes are one instantiation of the modern notion of type
More informationThe Java Memory Model
Jeremy Manson 1, William Pugh 1, and Sarita Adve 2 1 University of Maryland 2 University of Illinois at Urbana-Champaign Presented by John Fisher-Ogden November 22, 2005 Outline Introduction Sequential
More informationSINGLE-SIDED PGAS COMMUNICATIONS LIBRARIES. Basic usage of OpenSHMEM
SINGLE-SIDED PGAS COMMUNICATIONS LIBRARIES Basic usage of OpenSHMEM 2 Outline Concept and Motivation Remote Read and Write Synchronisation Implementations OpenSHMEM Summary 3 Philosophy of the talks In
More informationMichael Kinsner, Dirk Seynhaeve IWOCL 2018
Michael Kinsner, Dirk Seynhaeve IWOCL 2018 Topics 1. FPGA overview 2. Motivating application classes 3. Host pipes 4. Some data 2 FPGA: Fine-grained Massive Parallelism Intel Stratix 10 FPGA: Over 5 Million
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationExploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores (a.k.a. Field Programmable Core Array or FPCA)
Exploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores (a.k.a. Field Programmable Core Array or FPCA) Sponsored by SRC and NSF as a Part of Multicore Chip Design
More informationShared Memory Consistency Models: A Tutorial
Shared Memory Consistency Models: A Tutorial By Sarita Adve, Kourosh Gharachorloo WRL Research Report, 1995 Presentation: Vince Schuster Contents Overview Uniprocessor Review Sequential Consistency Relaxed
More informationECE 574 Cluster Computing Lecture 10
ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular
More information