Advanced Multiprocessor Programming Project Topics and Requirements

1 Advanced Multiprocessor Programming: Project Topics and Requirements
Jesper Larsson Träff, TU Wien, May 5th, 2017
J. L. Träff, AMP SS17, Projects 1 / 21

2 Projects
Goal: Gain practical, first-hand experience with concurrent algorithms and data structures, including their performance and the obstacles to obtaining the performance that may naively be expected.
- Projects are done in one- or two-person groups
- Content and effort of the project are independent of group size
- Each group selects (only!) one project from the list
- Implementation of material from the lecture
- But allowed to be creative, and bring in own ideas
- Implementation in C, C++ (with native threads or pthreads), or other
- But if you choose other, you are on your own
- Allowed (and encouraged) to study and use additional papers (as well as internet information), but state your sources clearly!

3 Timeline
- Today: Announcement of project topics and rules
- We can have regular Q&A (Mondays, Thursdays)
- Project status presentation, all groups: Thursday
- Deadline for hand-in: Monday, , at midnight
- Exam from June 30th to July 4th (sign up in TISS)

4 All projects
- Test for correctness first, use assertions where possible, start sequentially, then gradually increase the number of threads
- Define a good benchmark to measure latency (time per operation), throughput (number of operations in some given time slot), and fairness, each as a function of the number of threads
- Compare to a well-chosen baseline
- Use good experimental practice to be able to make well-founded claims that some implementation is better than another (see HPC lecture). Repeat each experiment a large number of times (> 30), report averages with confidence intervals
- State properties of the algorithms and give worst-case bounds where possible
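
The "averages with confidence intervals" requirement can be sketched as follows. This is only a minimal illustration, assuming a normal approximation (z = 1.96) for a 95% interval; the names `Summary` and `summarize` are hypothetical and not taken from the lecture.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of repeated measurements: sample mean and the half-width of an
// approximate 95% confidence interval (assumption: normal approximation).
struct Summary {
    double mean;
    double half_width;
};

// Assumes at least two samples. The interval is mean +/- half_width.
Summary summarize(const std::vector<double>& samples) {
    const std::size_t n = samples.size();
    assert(n >= 2);
    double sum = 0.0;
    for (double x : samples) sum += x;
    const double mean = sum / static_cast<double>(n);
    double squares = 0.0;
    for (double x : samples) squares += (x - mean) * (x - mean);
    // Sample standard deviation (divides by n - 1).
    const double stddev = std::sqrt(squares / static_cast<double>(n - 1));
    return {mean, 1.96 * stddev / std::sqrt(static_cast<double>(n))};
}
```

With more than 30 repetitions the normal approximation is a common choice; for fewer repetitions a Student t quantile would be the more careful option.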

5 Benchmarking
Many papers use throughput as the measure of performance. Throughput is hoped/expected to scale (linearly?) with the number of threads/cores.
For some benchmarks and ideas, see Vincent Gramoli: More than you ever wanted to know about synchronization: Synchrobench, measuring the impact of the synchronization on concurrent algorithms. PPOPP 2015: 1-10
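
A throughput measurement of the kind described here might be structured as below. This is a sketch under stated assumptions, not the Synchrobench harness: the function name `measure_throughput` and its parameters are hypothetical, and the per-thread counts it returns can also serve as a rough fairness indicator.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Runs `op(thread_id)` repeatedly on `num_threads` threads for roughly
// `duration` and returns the number of completed operations per thread.
std::vector<std::uint64_t> measure_throughput(
        int num_threads, std::chrono::milliseconds duration,
        const std::function<void(int)>& op) {
    std::atomic<bool> start{false};
    std::atomic<bool> stop{false};
    std::vector<std::uint64_t> counts(num_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Simple start barrier so that all threads begin together.
            while (!start.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            std::uint64_t local = 0;  // thread-private to avoid false sharing
            while (!stop.load(std::memory_order_relaxed)) {
                op(t);
                ++local;
            }
            counts[t] = local;  // written once, read only after join()
        });
    }
    start.store(true, std::memory_order_release);
    std::this_thread::sleep_for(duration);
    stop.store(true, std::memory_order_relaxed);
    for (auto& w : workers) w.join();
    return counts;
}
```

Throughput is then the sum of the counts divided by the duration; repeating the run for each thread count gives the scaling curve.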

6 Expectations and requirements
- Efficient and correct (!) implementation
- Good theoretical analysis (not necessarily formal proofs): invariants, linearizability, progress guarantees
- Good benchmark analysis
- Short document (English or German), 6-10 pages excluding plots and source code:
  - Statement of problem and expectations
  - Description of data structure
  - Theoretical analysis
  - Benchmark (results, how obtained)
- Code must be available, part of hand-in (explain how to compile and run)
- Project status presentation, short, all (on Thursday)
- Final project presentation, examination (per group)

7 Machines
- For benchmarks, use the TU Wien Parallel Computing group systems: saturn (48-core AMD), possibly (by need) mars (80-core Intel) and ceres (64-core, 8-way = 512 thread Fujitsu/Sparc)
- Develop gradually, start with own system at home if possible
- Good practice so that others can reproduce findings: State properties of machine, compiler, environment (required!). E.g., gcc version (Debian )

8 What to hand in
- Your solution/hand-in consists of the report describing problems and solutions and the benchmark analysis with plots/tables, and the source code, including, if necessary, Makefile, README and other things necessary to compile and run the code
- Either the report or the README should give instructions for compiling and running
- The hand-in must be uploaded in TUWEL as a single .zip, .tgz, or .tar.gz file
- Name the file clearly: your names followed by _amp_project and the number of the project you choose

9 Project 1: Register Locks
Implement:
- Filter-lock (generalized Peterson)
- Tournament tree of 2-thread Peterson locks (see exercise)
- Lamport Bakery, Herlihy-Shavit version
- Lamport Bakery, Lamport's original version
Which is better? For baseline performance, compare to the following locks: pthreads or native C11 locks, simple test-and-set lock, simple test-and-test-and-set lock.
Challenge: Memory behavior. Ensure that memory (register) updates become visible in the required order! Explain what happens if not (Peterson).
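
The memory-behavior challenge can be illustrated with a two-thread Peterson lock. The sketch below assumes C++11 `std::atomic` with the default sequentially consistent ordering, which keeps the store to `flag_` from being reordered after the load of the other thread's flag; with plain variables or relaxed ordering, both threads may enter the critical section. The class and helper names are hypothetical, and the `yield()` in the spin loop is a courtesy for oversubscribed machines rather than part of the algorithm.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Two-thread Peterson lock; thread ids must be 0 and 1.
class PetersonLock {
    std::atomic<bool> flag_[2];
    std::atomic<int> victim_{0};
public:
    PetersonLock() {
        flag_[0].store(false);
        flag_[1].store(false);
    }
    void lock(int me) {
        const int other = 1 - me;
        flag_[me].store(true);  // announce interest (seq_cst by default)
        victim_.store(me);      // let the other thread go first on a tie
        while (flag_[other].load() && victim_.load() == me) {
            std::this_thread::yield();
        }
    }
    void unlock(int me) { flag_[me].store(false); }
};

// Correctness check: two threads increment a plain counter under the lock.
long peterson_count(long iters) {
    PetersonLock lock;
    long counter = 0;
    auto work = [&](int id) {
        for (long i = 0; i < iters; ++i) {
            lock.lock(id);
            ++counter;
            lock.unlock(id);
        }
    };
    std::thread a(work, 0), b(work, 1);
    a.join();
    b.join();
    return counter;
}
```

If mutual exclusion holds, `peterson_count(n)` returns exactly `2 * n`; lost updates show up as a smaller value.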

10 Project 2: Bounded-timestamp register locks (I)
Implement:
- Aravind's solution (Alex A. Aravind: Yet Another Simple Solution for the Concurrent Programming Control Problem. IEEE Trans. Parallel Distrib. Syst. 22(6): , 2011)
- Black-white Bakery (Gadi Taubenfeld: The Black-White Bakery Algorithm and Related Bounded-Space, Adaptive, Local-Spinning and FIFO Algorithms. DISC 2004: 56-70)
- Lamport's Bakery (lecture version or original)
Which is better? For baseline performance, compare to the following locks: pthreads or native C11 locks, simple test-and-set lock, simple test-and-test-and-set lock.
Challenge: Memory behavior. Ensure that memory (register) updates become visible in the required order!
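
The two simple baseline locks named above might look as follows. This is a sketch, assuming `std::atomic<bool>::exchange` as the test-and-set primitive; the class names and the helper `count_with_lock` are hypothetical, and the `yield()` calls are only there to behave well when threads outnumber cores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Test-and-set lock: every attempt is an atomic read-modify-write.
class TASLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Test-and-test-and-set lock: spin on a plain read first and only attempt
// the exchange when the lock looks free.
class TTASLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Correctness check: `threads` threads each add `iters` to a shared counter.
template <class Lock>
long count_with_lock(int threads, long iters) {
    Lock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (long i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

The same `count_with_lock` pattern can be reused to drive the register locks, provided they are wrapped to supply the thread id.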

11 Project 3: Bounded-timestamp register locks (II)
Implement:
- Szymanski's solution (Boleslaw K. Szymanski: A simple solution to Lamport's concurrent programming problem with linear wait. ICS 1988: )
- Black-white Bakery (Gadi Taubenfeld: The Black-White Bakery Algorithm and Related Bounded-Space, Adaptive, Local-Spinning and FIFO Algorithms. DISC 2004: 56-70)
- Lamport's Bakery (lecture version or original)
Which is better? For baseline performance, compare to the following locks: pthreads or native C11 locks, simple test-and-set lock, simple test-and-test-and-set lock.
Challenge: Memory behavior. Ensure that memory (register) updates become visible in the required order!
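
For reference, a Bakery lock in the style of the Herlihy-Shavit textbook presentation could be sketched as below. It assumes sequentially consistent atomics and uses unbounded 64-bit labels, which is exactly the limitation the bounded-timestamp algorithms in this project address. The names `BakeryLock` and `bakery_count` are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

class BakeryLock {
    const int n_;
    std::vector<std::atomic<bool>> flag_;
    std::vector<std::atomic<std::uint64_t>> label_;
public:
    explicit BakeryLock(int n) : n_(n), flag_(n), label_(n) {
        for (int i = 0; i < n; ++i) {
            flag_[i].store(false);
            label_[i].store(0);
        }
    }
    void lock(int me) {
        flag_[me].store(true);  // doorway: announce interest
        std::uint64_t max = 0;
        for (int k = 0; k < n_; ++k) max = std::max(max, label_[k].load());
        const std::uint64_t mine = max + 1;
        label_[me].store(mine);  // take a ticket
        for (int k = 0; k < n_; ++k) {
            if (k == me) continue;
            // Wait while k is interested and (label[k], k) < (mine, me).
            while (flag_[k].load()) {
                const std::uint64_t other = label_[k].load();
                const bool k_first = other < mine || (other == mine && k < me);
                if (!k_first) break;
                std::this_thread::yield();
            }
        }
    }
    void unlock(int me) { flag_[me].store(false); }
};

// Correctness check: threads increment a plain counter under the lock.
long bakery_count(int threads, long iters) {
    BakeryLock lock(threads);
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&, t] {
            for (long i = 0; i < iters; ++i) {
                lock.lock(t);
                ++counter;
                lock.unlock(t);
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Each acquisition reads all n labels, so the cost per lock operation grows linearly with the number of threads even without contention.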

12 Project 4: Double word CAS
The universal hardware instruction CAS (compare-and-swap) can be used to implement generalized CAS operations that work on several locations atomically. Implement either:
- A k-cas from Victor Luchangco, Mark Moir, Nir Shavit: Nonblocking k-compare-single-swap. Theory Comput. Syst. 44(1): (2009)
- An n-cas from Timothy L. Harris, Keir Fraser, Ian A. Pratt: A Practical Multi-word Compare-and-Swap Operation. DISC 2002:
or combinations thereof. Compare performance to the baseline CAS instruction of your machine. How does the performance change with increasing number of threads?
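
The single-word baseline could be exercised with a CAS retry loop like the one below. This sketch covers only the baseline, not the k-cas or n-cas constructions from the cited papers; `cas_increment` and `cas_count` are hypothetical names, and counting failed attempts is just one possible way to expose contention.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Increments `target` with a CAS retry loop and returns the number of
// failed attempts, which grows with contention.
std::uint64_t cas_increment(std::atomic<std::uint64_t>& target) {
    std::uint64_t failures = 0;
    std::uint64_t expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_strong(expected, expected + 1)) {
        ++failures;  // `expected` now holds the value seen by the failed CAS
    }
    return failures;
}

// Each of `threads` threads performs `iters` CAS-based increments;
// returns the final counter value.
std::uint64_t cas_count(int threads, long iters) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (long i = 0; i < iters; ++i) cas_increment(counter);
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

Timing `cas_count` for an increasing number of threads gives a reference curve against which a multi-word CAS can be compared.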

13 Project 5: Queue locks
Implement:
- Array lock
- CLH Lock
- MCS Lock
from the lectures. Which is better? For baseline performance, compare to the following locks: pthreads or native C11 locks, simple test-and-set lock, simple test-and-test-and-set lock.
Challenge: Memory management!
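
As one example of a queue lock, an MCS lock might be sketched as follows. It sidesteps the memory-management challenge by having each caller supply its own queue node (here a thread-local stack variable), which is an assumption of this sketch and not necessarily how the lecture version handles nodes. All atomics use the default sequentially consistent ordering for simplicity; the names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class MCSLock {
    std::atomic<MCSNode*> tail_{nullptr};
public:
    void lock(MCSNode& me) {
        me.next.store(nullptr);
        me.locked.store(true);
        MCSNode* pred = tail_.exchange(&me);  // enqueue at the tail
        if (pred != nullptr) {
            pred->next.store(&me);            // link behind the predecessor
            while (me.locked.load()) {        // spin on our own node only
                std::this_thread::yield();
            }
        }
    }
    void unlock(MCSNode& me) {
        MCSNode* succ = me.next.load();
        if (succ == nullptr) {
            MCSNode* expected = &me;
            // No visible successor: try to reset the queue to empty.
            if (tail_.compare_exchange_strong(expected, nullptr)) return;
            // A successor is enqueueing; wait until it has linked itself.
            while ((succ = me.next.load()) == nullptr) {
                std::this_thread::yield();
            }
        }
        succ->locked.store(false);  // hand the lock over
    }
};

// Correctness check: threads increment a plain counter under the lock.
long mcs_count(int threads, long iters) {
    MCSLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            MCSNode node;  // reused across acquisitions by this thread
            for (long i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;
                lock.unlock(node);
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

A CLH lock differs in that a thread spins on its predecessor's node and takes that node over after release, which is where node recycling needs more care.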

14 Project 6: Hierarchical locks
Implement, on your own, either:
- Milind Chabbi, John M. Mellor-Crummey: Contention-conscious, locality-preserving locks. PPOPP 2016: 22:1-22:14
- David Dice, Virendra J. Marathe, Nir Shavit: Lock Cohorting: A General Technique for Designing NUMA Locks. TOPC 1(2): 13 (2015)
Which is better? For baseline performance, compare to the following locks: pthreads or native C11 locks, simple test-and-set lock, simple test-and-test-and-set lock.

15 Project 7: Readers-writers locks
Be creative: Find a good recent suggestion from the literature for a readers-writers lock, and implement it. Compare to native C11 or pthreads baseline.
An example: Irina Calciu, David Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, Nir Shavit: NUMA-aware reader-writer locks. PPOPP 2013:
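
A baseline for this project could look like the sketch below. It assumes C++17 `std::shared_mutex` as a stand-in for the native baseline (a `pthread_rwlock_t` would serve the same purpose); `SharedCounter` and `rw_baseline` are hypothetical names.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// A counter protected by a readers-writers lock: many concurrent readers,
// one writer at a time.
class SharedCounter {
    mutable std::shared_mutex mutex_;
    long value_ = 0;
public:
    long read() const {
        std::shared_lock<std::shared_mutex> guard(mutex_);  // shared mode
        return value_;
    }
    void increment() {
        std::unique_lock<std::shared_mutex> guard(mutex_);  // exclusive mode
        ++value_;
    }
};

// Runs `readers` reading threads and `writers` writing threads, each doing
// `iters` operations, and returns the final counter value.
long rw_baseline(int readers, int writers, long iters) {
    SharedCounter counter;
    std::vector<std::thread> workers;
    for (int r = 0; r < readers; ++r) {
        workers.emplace_back([&] {
            for (long i = 0; i < iters; ++i) (void)counter.read();
        });
    }
    for (int w = 0; w < writers; ++w) {
        workers.emplace_back([&] {
            for (long i = 0; i < iters; ++i) counter.increment();
        });
    }
    for (auto& t : workers) t.join();
    return counter.read();
}
```

Varying the ratio of readers to writers is one natural axis for the benchmark, since readers-writers locks are usually designed for read-dominated workloads.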

16 Project 8: List-based set (I)
Implement (from the lecture):
- List-based set with fine-grained locks
- List-based set with optimistic synchronization
- List-based set with lazy synchronization
- Lock-free list-based set
Compare to a baseline implementation with a global lock on all set operations. Which are better, for which operations? Discuss possible experimental designs.
Challenge: Memory management!
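
The global-lock baseline might be sketched as a sorted singly linked list guarded by one `std::mutex`. The class name `CoarseListSet` and the choice of `long` keys are assumptions of this sketch. Because every operation holds the lock, removed nodes can be deleted immediately, which is precisely what becomes difficult in the fine-grained and lock-free variants.

```cpp
#include <cassert>
#include <mutex>

class CoarseListSet {
    struct Node {
        long key;
        Node* next;
    };
    Node* head_;        // sentinel node; its key is never compared
    std::mutex mutex_;  // single global lock for all operations

    // Returns the last node whose successor has a key smaller than `key`.
    Node* find_pred(long key) const {
        Node* pred = head_;
        while (pred->next != nullptr && pred->next->key < key) pred = pred->next;
        return pred;
    }
public:
    CoarseListSet() : head_(new Node{0, nullptr}) {}
    ~CoarseListSet() {
        while (head_ != nullptr) {
            Node* next = head_->next;
            delete head_;
            head_ = next;
        }
    }
    bool add(long key) {
        std::lock_guard<std::mutex> guard(mutex_);
        Node* pred = find_pred(key);
        if (pred->next != nullptr && pred->next->key == key) return false;
        pred->next = new Node{key, pred->next};
        return true;
    }
    bool remove(long key) {
        std::lock_guard<std::mutex> guard(mutex_);
        Node* pred = find_pred(key);
        Node* curr = pred->next;
        if (curr == nullptr || curr->key != key) return false;
        pred->next = curr->next;
        delete curr;  // safe: no other thread can hold a reference
        return true;
    }
    bool contains(long key) {
        std::lock_guard<std::mutex> guard(mutex_);
        Node* curr = find_pred(key)->next;
        return curr != nullptr && curr->key == key;
    }
};
```

Keeping the same `add`/`remove`/`contains` interface across all variants makes it straightforward to run one benchmark driver against each of them.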

17 Project 9: List-based set (II)
Implement:
- List-based set with fine-grained locks
- Binary tree with fine-grained locks (with wait-free contains?) (develop your own algorithm, be creative)
Compare to a baseline implementation with a global lock, and to an efficient, sequential implementation. Compare this baseline also to the performance of the list-based set. Is there an advantage to using a good, sequential data structure?
Look in the literature for fine-grained or lock-free algorithms for tree data structures.
Bonus: Implement what is useful or looks interesting

18 Project 10: Queues and stacks
Implement:
- Unbounded, lock-free queue
- Unbounded, lock-free stack
- Elimination back-off stack
Creative alternative: Lock-free queue by Morrison (Adam Morrison, Yehuda Afek: Fast concurrent queues for x86 processors. PPOPP 2013: )
Compare to baseline implementations with a global lock (or more fine-grained locking, if you can come up with a good implementation; for the queue use the two-lock implementation from the lecture).
Challenge: Memory management!
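
An unbounded lock-free stack in the classic Treiber style could be sketched as below. To keep the example short and safe, popped nodes are deliberately never freed: this avoids both use-after-free and the ABA problem, at the price of a memory leak, so it is an assumption of the sketch and not a solution to the memory-management challenge. The class name is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <optional>

template <class T>
class TreiberStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> top_{nullptr};
public:
    void push(const T& value) {
        Node* node = new Node{value, top_.load()};
        // On failure, compare_exchange_weak reloads the current top into
        // node->next, so the loop simply retries with the fresh value.
        while (!top_.compare_exchange_weak(node->next, node)) {}
    }
    std::optional<T> pop() {
        Node* old = top_.load();
        // On failure, `old` is refreshed with the current top.
        while (old != nullptr &&
               !top_.compare_exchange_weak(old, old->next)) {}
        if (old == nullptr) return std::nullopt;  // stack was empty
        // Leak `old` on purpose: another thread may still be reading it.
        return old->value;
    }
};
```

A real implementation would add a reclamation scheme (for example hazard pointers or epochs) in place of the leak, and the elimination back-off variant would wrap the failing CAS paths.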

19 Project 11: Wait-free stack?
Implement:
- Unbounded, lock-free stack
- The wait-free stack from Seep Goel, Pooja Aggarwal, Smruti R. Sarangi: A Wait-Free Stack. ICDCIT 2016:
Compare to baseline implementations with a global lock (or more fine-grained locking, if you can come up with a good implementation; for the queue use the two-lock implementation from the lecture).
Challenge: Memory management!

20 Project 12: Efficient FAA queue
Implement an efficient queue using fetch-and-add (and CAS, why?) from Chaoran Yang, John M. Mellor-Crummey: A wait-free queue as fast as fetch-and-add. PPOPP 2016: 16:1-16:13
Compare to baseline implementations with a global lock or more fine-grained locking.
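
The appeal of fetch-and-add can be seen in isolation with the small sketch below: unlike a CAS retry loop, every `fetch_add` succeeds and hands its caller a distinct index. This illustrates only the primitive, not the queue from the cited paper; `faa_tickets_unique` is a hypothetical helper that checks that no index is handed out twice.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each thread claims `per_thread` indices via fetch-and-add and marks them.
// Returns true if every index in [0, threads * per_thread) was claimed
// exactly once.
bool faa_tickets_unique(int threads, int per_thread) {
    const std::size_t total = static_cast<std::size_t>(threads) * per_thread;
    std::atomic<std::size_t> tail{0};
    std::vector<std::atomic<int>> claimed(total);
    for (auto& c : claimed) c.store(0);

    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                // FAA never fails: the returned value is this thread's slot.
                const std::size_t slot = tail.fetch_add(1);
                claimed[slot].fetch_add(1);
            }
        });
    }
    for (auto& w : workers) w.join();

    for (auto& c : claimed) {
        if (c.load() != 1) return false;
    }
    return true;
}
```

Claiming a slot is only half of a queue operation; coordinating an enqueuer and a dequeuer that meet at the same slot is one place where CAS comes back in.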

21 Project 13: Skip-lists
Implement:
- Lock-based lazy skip-list
- Lock-free skip-list
For baseline performance, compare to a sequential skip-list (own implementation!) with global locks on all set operations.
Challenge: Memory management!
