Milind Kulkarni Research Statement
With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers have struggled to ease the burden of writing parallel programs. While these efforts have met with success in some domains (dense linear algebra and SQL programming being two well-known examples), writing efficient parallel code is still largely the purview of expert programmers. One of the great challenges facing the programming languages community is to make parallel programming accessible and effective for average programmers. I believe the key to making parallel programming accessible is to hide as much complexity as possible behind intuitive abstractions that capture important information about parallelism and locality. These abstractions can then be exploited by reusable libraries and run-time systems written by expert programmers, allowing most programmers to write algorithms in an intuitive, nearly sequential style. My research has focused on discovering useful and natural abstractions for writing irregular programs (programs that manipulate pointer-based data structures such as trees and graphs) and on developing the compiler techniques and run-time systems needed to exploit those abstractions. This research has opened up a number of additional directions that I would like to explore: (1) building new systems that exploit program semantics to enhance parallelism and locality; (2) developing modeling tools that allow programmers to better understand the parallelism in algorithms and systems; (3) using autotuning techniques to allow parallel programs to adapt to novel architectures; and (4) exploring new application domains that offer new challenges for parallel programming. Tackling these problems is an important step toward solving the problem of parallel programming. 
Research approach My research tends to adopt the following pattern: (1) study interesting problems arising in important real-world applications; (2) find general patterns in those problems; (3) develop abstractions that capture those general patterns; (4) produce efficient implementations of those abstractions. This approach has served me well throughout my research, as demonstrated by my development of the Galois system for optimistic parallelization [1]: (1) Study: I begin projects by carefully studying important applications from a variety of domains. By understanding specific applications, I can find out where current approaches for optimization or parallelization fall short, and why. To drive this search for interesting problems, I have collaborated with researchers from a variety of application domains, in fields ranging from computational geometry to graphics to data mining. The genesis of the Galois system came from studying two real-world algorithms from these domains, Delaunay mesh refinement and agglomerative clustering. (2) Generalize: Armed with a deep understanding of applications, and having identified particular problems to solve, the next step is to generalize. For example, computation in Delaunay mesh refinement is structured as processing elements from a worklist in an arbitrary order, a pattern that appears in a variety of irregular applications. While worklist items often exhibit a complex pattern of dependences, there is nevertheless parallelism to be exploited by processing
independent worklist elements concurrently. I call this pattern of parallelism amorphous data parallelism. This type of parallelism is an ideal target for speculative parallelization. (3) Abstract: These generalizations allow me to develop abstractions that capture important program behavior. Good abstractions possess two key properties: they should be intuitive, and they should expose useful program semantics. For example, amorphous data parallelism can be expressed through the use of optimistic iterators, which highlight the opportunity for parallelism in a program. Despite having simple sequential semantics, these iterators expose ordering properties that are otherwise hidden and provide a hint that speculative parallelization can be profitable. The abstractions I develop are heavily informed by my experience with real-world applications. For example, I realized that existing speculative parallelization techniques such as thread-level speculation would detect a number of benign dependences in irregular applications. Because violating a benign dependence does not affect correctness, parallelism can be improved by ignoring such dependences, provided they can be identified. I used the notion of semantic commutativity, an abstraction that precisely exposes the object semantics required for exact dependence checking, to produce object libraries that allow significant amounts of concurrency during speculative execution. (4) Implement: The final step of a research project is to produce efficient implementations of the abstractions. This can encompass any number of techniques, from compiler transformations to run-time systems. The Galois system comprises a software run-time that can parallelize programs written using optimistic iterators, and object libraries that leverage semantic commutativity to perform precise dependence checking. 
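The idea behind semantic commutativity can be illustrated with a minimal Python sketch. This is not the actual Galois API; the class and function names are purely illustrative. The point is that conflicts between speculative iterations are decided by method semantics rather than by raw memory accesses:

```python
# Hypothetical sketch of semantic commutativity for a set object.
# Two method invocations conflict only when they do NOT commute,
# i.e., when reordering them could change a result either one sees.

class CommutativitySet:
    def __init__(self):
        self.items = set()

    def add(self, x):
        self.items.add(x)

    def contains(self, x):
        return x in self.items

def commutes(op1, op2):
    """Decide whether two operations (method name, argument) commute.

    add(a) commutes with add(b) even when a == b: both orders leave
    the set identical and neither call returns a value, so concurrent
    speculative iterations that insert elements never need to be
    rolled back against each other. add(a) and contains(a) do not
    commute, because the answer contains(a) returns depends on order.
    """
    (m1, a1), (m2, a2) = op1, op2
    if m1 == "add" and m2 == "add":
        return True
    if {m1, m2} == {"add", "contains"}:
        return a1 != a2      # conflict only on the same element
    return True              # contains/contains always commutes

# A memory-level checker (as in thread-level speculation) would flag
# add(5) vs. add(5) as a write-write conflict and trigger a rollback;
# the semantic check correctly allows it.
print(commutes(("add", 5), ("add", 5)))        # True
print(commutes(("add", 5), ("contains", 5)))   # False
print(commutes(("add", 5), ("contains", 7)))   # True
```

The benign dependence mentioned above is exactly the add/add case: the operations touch the same memory, but no observable result depends on their order.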
This implementation achieved low overhead and good scalability, demonstrating that the abstractions I developed could be efficiently supported and exploited. Prior Research My past research reflects the approach outlined above. The abstractions for expressing and exploiting amorphous data parallelism described above formed the basis of the Galois system. This initial work showed that it was possible to write irregular programs in a straightforward, nearly sequential manner and still achieve useful parallelism. Two further contributions of my research have been developing locality abstractions for irregular data structures [2] and scheduling abstractions for amorphous data parallelism [3], and integrating these abstractions into the Galois system. This research has met with significant industry interest (Intel and IBM have contributed funding), and continuing to collaborate with industry is a key point in my research agenda. Locality Abstractions Truly unlocking the potential parallelism in amorphous data-parallel programming requires attending to locality. Achieving locality is the key to high-performance parallel programs. Naive parallel implementations of irregular algorithms can suffer from poor cache locality (e.g., because computations scheduled for a single processor access data from all regions of a data structure), and, similarly, naively running locality-preserving sequential implementations in parallel may result in high contention (e.g., because computations scheduled simultaneously on multiple processors access the same region of a data structure). This interplay between locality and parallelism is especially problematic in irregular programs, as there is no well-defined notion of locality in irregular data structures such as graphs or trees. The problem is obvious: how can a programmer exploit locality in a data structure that doesn't seem to have any?
I answered this question in [2] by proposing an abstraction that captures semantic locality in irregular data structures. Semantic locality refers to locality that arises in an irregular data structure due to the semantics of its access patterns. For example, in a graph, a node is semantically local to its neighbors, as from a given node you can access its neighbors. Note that this locality is preserved regardless of the implementation of the graph. To capture semantic locality, I introduced the abstraction of partitioning: irregular data structures are logically partitioned, with the property that regions of the data structure in the same partition are semantically local (and vice versa). I showed how to exploit this partitioning to improve parallelism, by scheduling computations affecting different partitions on different processors; to improve locality, by scheduling computations within a single partition to exploit temporal locality; and to reduce the overhead of conflict detection, by replacing precise conflict detection with locks on partitions. Thus, I showed that a simple, intuitive abstraction can allow locality to be successfully exploited for irregular data structures. Scheduling Abstractions My experience with the Galois system and partition-based scheduling made it clear that scheduling, the assignment of work to processors, has an enormous effect on performance. Unfortunately, the behavior of a given schedule is highly application dependent, and the space of possible schedules is vast. In [3] I developed a scheduling framework for describing computation schedules for amorphous data-parallel programs. This framework is built around three abstractions, which together fully describe a given schedule: (i) clustering, which specifies chunks of work that should be executed on a single processor; (ii) labeling, which assigns clusters of work to particular processors; and (iii) ordering, which determines the order in which a processor executes its assigned work. 
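The three abstractions can be sketched as three functions that jointly map a pool of work onto processors. This is a hedged illustration, not the framework's actual interface; the function names and signatures are assumptions made for exposition:

```python
# Hypothetical sketch: a schedule is fully described by three functions.
#   cluster(item)    -> cluster id   (work that should run together)
#   label(cluster_id)-> processor id (who runs each cluster)
#   order(items)     -> list         (execution order on one processor)
from collections import defaultdict

def schedule(work, cluster, label, order, num_procs):
    clusters = defaultdict(list)
    for item in work:
        clusters[cluster(item)].append(item)
    per_proc = defaultdict(list)
    for cid, items in clusters.items():
        per_proc[label(cid) % num_procs].extend(items)
    return {p: order(items) for p, items in per_proc.items()}

# Instantiating the framework as an OpenMP-style "static" schedule:
# contiguous chunks, round-robin assignment, original order preserved.
work = list(range(8))
static = schedule(work,
                  cluster=lambda i: i // 2,   # chunks of 2 items
                  label=lambda cid: cid,      # round-robin over procs
                  order=sorted,
                  num_procs=2)
print(static)   # {0: [0, 1, 4, 5], 1: [2, 3, 6, 7]}
```

Swapping in a different clustering function (e.g., one that maps each item to the data-structure partition it touches) yields the partition-based schedules of [2] within the same three-function vocabulary.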
I showed that this framework is general: it can be instantiated to produce all the schedules used in data-parallel frameworks such as OpenMP, as well as the partition-based computation scheduling used in [2]. I also showed that the framework is useful: I gave instantiations of the framework that produced novel schedules with greater performance than existing ones. This work demonstrates that a small number of simple abstractions suffice to describe the vast space of schedules that can be applied to parallel, irregular applications. Future Research Short term My short-term research goals fall into two categories: (1) finding new ways to exploit program semantics to produce efficient parallel programs, and (2) giving programmers new tools to model the parallelism in algorithms and profile parallel implementations of those algorithms. Along the first line of inquiry, partitioning information may be valuable when assigning threads to cores in a hierarchical architecture. Intuitively, threads that are likely to communicate with one another should be assigned to cores that enjoy low-latency communication through mechanisms such as shared caches. Partitioning information, as well as partition-aware scheduling, can allow a system to make intelligent assumptions about communication patterns, even for irregular programs. In managed languages such as Java, it may be possible to use partitioning information during garbage collection to re-lay out data structures, turning the exposed semantic locality into spatial locality. The scheduling framework I developed in [3] is descriptive; it does not mandate a particular implementation of schedules. I plan to develop a language for schedules that will allow programmers to specify in a declarative manner the scheduling properties they want, as in systems like OpenMP. Given this specification, a compiler can generate a scheduler which will be used within the Galois run-time. 
This compiler machinery will enable autotuning: a meta-compiler can automatically generate a number of potential schedulers for an application and evaluate them on a test input, choosing the schedule that performs best for a given architecture. I have also recently become interested in modeling the behavior of amorphous data-parallel programs. I wrote a tool called ParaMeter to begin investigating the parallelism available in such programs [4]. ParaMeter estimates parallelism by finding a maximal independent set of work at each step in a computation, providing an upper bound on the amount of parallelism in a program. The current version of ParaMeter makes simplifying assumptions about how long work takes (each piece of work takes unit time) and communication costs (no cost). I plan to extend ParaMeter to provide more accurate models of parallelism by accounting for work irregularity and communication behavior. I believe this will be a useful tool not only to the programming languages and systems community, but also to the algorithms community, as it will provide insight into the expected parallel performance of irregular algorithms. Long term An interesting pattern that I have noticed in my work is that many of the abstractions I developed allow irregular programs to be transformed in much the same way that regular, dense-matrix programs are. The optimistic iterators I proposed are the basic parallel loop construct, analogous to DO-ALL loops in languages like Fortran, and techniques like partition-aware scheduling are analogous to loop tiling in matrix codes. It may be possible to lift other high-level program transformations from the world of regular programs to the world of irregular programs. For example, consider loop interchange, which can improve locality in matrix codes by changing traversal order. What does loop interchange mean when applied to an irregular program consisting of repeated traversals of an irregular data structure (a pattern that appears in, e.g., n-body codes)? 
A transformation analogous to loop interchange, when applied to such a code, might produce a reordered sequence of partial traversals that can be grouped together to promote locality. Are these types of transformations always legal? Is there a general way to express such transformations? Autotuning techniques may be more broadly applicable in programs written at a suitable level of abstraction. As long as the abstractions are well defined, it may be possible to automatically search a space of possible instantiations of those abstractions to choose the best concrete implementation of a program. I am especially interested in dynamic autotuning, where the parameters of a program are changed at run time in response to input characteristics or run-time behavior. To me, the key question when thinking about future research is identifying new and exciting application domains. Several emerging areas are the focus of substantial research and will require substantial amounts of parallelism. Computational biology brings software analysis to bear on massive data sets. A number of algorithms common in computational biology are irregular in nature, such as Survey Propagation for solving SAT problems. What new techniques will be needed to parallelize and optimize irregular algorithms that work with vast amounts of data? How can speculation techniques like Galois be brought to bear on applications that require distributed-memory architectures? On a lighter note, games are always on the cutting edge of the performance curve, and the algorithms underlying high-performance games will hence require parallelism. While some tasks, such as shading, are inherently parallelizable, many are more difficult. Maintaining game state requires tracking the position and behavior of thousands of objects, each of which can interact with the others; simultaneously updating the states of these game objects fits naturally into the framework of amorphous data parallelism. 
What sorts of systems are needed to parallelize game algorithms while adhering to real-time constraints? Can hardware usually devoted to graphics be leveraged to improve the performance of other gaming algorithms?
References

[1] Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic Parallelism Requires Abstractions. In Programming Language Design and Implementation, June.
[2] Milind Kulkarni, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala, and L. Paul Chew. Optimistic Parallelism Benefits From Data Partitioning. In Architectural Support for Programming Languages and Operating Systems, March.
[3] Milind Kulkarni, Patrick Carribault, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala, and L. Paul Chew. Scheduling Strategies for Optimistic Parallelization of Irregular Programs. In Symposium on Parallelism in Algorithms and Architectures, June.
[4] Milind Kulkarni, Martin Burtscher, R. Inkulu, Keshav Pingali, and Calin Cascaval. How Much Parallelism is There in Irregular Applications? In Principles and Practices of Parallel Programming, February 2009 (to appear).
KNOWLEDGENT INSIGHTS volume 1 no. 5 October 7, 2011 Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value Today s growing commercial, operational and regulatory
More informationTransactions These slides are a modified version of the slides of the book Database System Concepts (Chapter 15), 5th Ed
Transactions These slides are a modified version of the slides of the book Database System Concepts (Chapter 15), 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan. Original slides are available
More informationRelaxed Memory-Consistency Models
Relaxed Memory-Consistency Models [ 9.1] In Lecture 13, we saw a number of relaxed memoryconsistency models. In this lecture, we will cover some of them in more detail. Why isn t sequential consistency
More informationFractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures
Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin
More informationSAT, SMT and QBF Solving in a Multi-Core Environment
SAT, SMT and QBF Solving in a Multi-Core Environment Bernd Becker Tobias Schubert Faculty of Engineering, Albert-Ludwigs-University Freiburg, 79110 Freiburg im Breisgau, Germany {becker schubert}@informatik.uni-freiburg.de
More informationUNIT I. Introduction
UNIT I Introduction Objective To know the need for database system. To study about various data models. To understand the architecture of database system. To introduce Relational database system. Introduction
More informationICOM 5016 Database Systems. Chapter 15: Transactions. Transaction Concept. Chapter 15: Transactions. Transactions
ICOM 5016 Database Systems Transactions Chapter 15: Transactions Amir H. Chinaei Department of Electrical and Computer Engineering University of Puerto Rico, Mayagüez Slides are adapted from: Database
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationRubicon: Scalable Bounded Verification of Web Applications
Joseph P. Near Research Statement My research focuses on developing domain-specific static analyses to improve software security and reliability. In contrast to existing approaches, my techniques leverage
More informationis easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology
Preface The idea of improving software quality through reuse is not new. After all, if software works and is needed, just reuse it. What is new and evolving is the idea of relative validation through testing
More informationStatement of Research for Taliver Heath
Statement of Research for Taliver Heath Research on the systems side of Computer Science straddles the line between science and engineering. Both aspects are important, so neither side should be ignored
More informationChapter 13: Transactions
Chapter 13: Transactions Transaction Concept Transaction State Implementation of Atomicity and Durability Concurrent Executions Serializability Recoverability Implementation of Isolation Transaction Definition
More informationRATCOP: Relational Analysis Tool for Concurrent Programs
RATCOP: Relational Analysis Tool for Concurrent Programs Suvam Mukherjee 1, Oded Padon 2, Sharon Shoham 2, Deepak D Souza 1, and Noam Rinetzky 2 1 Indian Institute of Science, India 2 Tel Aviv University,
More informationto automatically generate parallel code for many applications that periodically update shared data structures using commuting operations and/or manipu
Semantic Foundations of Commutativity Analysis Martin C. Rinard y and Pedro C. Diniz z Department of Computer Science University of California, Santa Barbara Santa Barbara, CA 93106 fmartin,pedrog@cs.ucsb.edu
More informationJIVE: Dynamic Analysis for Java
JIVE: Dynamic Analysis for Java Overview, Architecture, and Implementation Demian Lessa Computer Science and Engineering State University of New York, Buffalo Dec. 01, 2010 Outline 1 Overview 2 Architecture
More informationNotes and Comments for [1]
Notes and Comments for [1] Zhang Qin July 14, 007 The purpose of the notes series Good Algorithms, especially for those natural problems, should be simple and elegant. Natural problems are those with universal
More informationTransactions. Lecture 8. Transactions. ACID Properties. Transaction Concept. Example of Fund Transfer. Example of Fund Transfer (Cont.
Transactions Transaction Concept Lecture 8 Transactions Transaction State Implementation of Atomicity and Durability Concurrent Executions Serializability Recoverability Implementation of Isolation Chapter
More informationStory so far. Parallel Data Structures. Parallel data structure. Working smoothly with Galois iterators
Story so far Parallel Data Structures Wirth s motto Algorithm + Data structure = Program So far, we have studied parallelism in regular and irregular algorithms scheduling techniques for exploiting parallelism
More informationAUTOMATIC VECTORIZATION OF TREE TRAVERSALS
AUTOMATIC VECTORIZATION OF TREE TRAVERSALS Youngjoon Jo, Michael Goldfarb and Milind Kulkarni PACT, Edinburgh, U.K. September 11 th, 2013 Youngjoon Jo 2 Commodity processors and SIMD Commodity processors
More informationParallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads)
Parallel Programming Models Parallel Programming Models Shared Memory (without threads) Threads Distributed Memory / Message Passing Data Parallel Hybrid Single Program Multiple Data (SPMD) Multiple Program
More informationDesign of Parallel Algorithms. Models of Parallel Computation
+ Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes
More informationBusiness Rules Extracted from Code
1530 E. Dundee Rd., Suite 100 Palatine, Illinois 60074 United States of America Technical White Paper Version 2.2 1 Overview The purpose of this white paper is to describe the essential process for extracting
More information20762B: DEVELOPING SQL DATABASES
ABOUT THIS COURSE This five day instructor-led course provides students with the knowledge and skills to develop a Microsoft SQL Server 2016 database. The course focuses on teaching individuals how to
More informationParallel Programming Must Be Deterministic by Default
Parallel Programming Must Be Deterministic by Default Robert L. Bocchino Jr., Vikram S. Adve, Sarita V. Adve and Marc Snir University of Illinois at Urbana-Champaign {bocchino,vadve,sadve,snir}@illinois.edu
More informationAdaptive Assignment for Real-Time Raytracing
Adaptive Assignment for Real-Time Raytracing Paul Aluri [paluri] and Jacob Slone [jslone] Carnegie Mellon University 15-418/618 Spring 2015 Summary We implemented a CUDA raytracer accelerated by a non-recursive
More informationAdaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >
Adaptive Lock Madhav Iyengar < miyengar@andrew.cmu.edu >, Nathaniel Jeffries < njeffrie@andrew.cmu.edu > ABSTRACT Busy wait synchronization, the spinlock, is the primitive at the core of all other synchronization
More informationOracle Developer Studio 12.6
Oracle Developer Studio 12.6 Oracle Developer Studio is the #1 development environment for building C, C++, Fortran and Java applications for Oracle Solaris and Linux operating systems running on premises
More informationQuestion 1: What is a code walk-through, and how is it performed?
Question 1: What is a code walk-through, and how is it performed? Response: Code walk-throughs have traditionally been viewed as informal evaluations of code, but more attention is being given to this
More informationNovel Lossy Compression Algorithms with Stacked Autoencoders
Novel Lossy Compression Algorithms with Stacked Autoencoders Anand Atreya and Daniel O Shea {aatreya, djoshea}@stanford.edu 11 December 2009 1. Introduction 1.1. Lossy compression Lossy compression is
More informationMorph Algorithms on GPUs
Morph Algorithms on GPUs Rupesh Nasre 1 Martin Burtscher 2 Keshav Pingali 1,3 1 Inst. for Computational Engineering and Sciences, University of Texas at Austin, USA 2 Dept. of Computer Science, Texas State
More informationFADA : Fuzzy Array Dataflow Analysis
FADA : Fuzzy Array Dataflow Analysis M. Belaoucha, D. Barthou, S. Touati 27/06/2008 Abstract This document explains the basis of fuzzy data dependence analysis (FADA) and its applications on code fragment
More informationAOSA - Betriebssystemkomponenten und der Aspektmoderatoransatz
AOSA - Betriebssystemkomponenten und der Aspektmoderatoransatz Results obtained by researchers in the aspect-oriented programming are promoting the aim to export these ideas to whole software development
More informationParallel Programming Concepts. Parallel Algorithms. Peter Tröger
Parallel Programming Concepts Parallel Algorithms Peter Tröger Sources: Ian Foster. Designing and Building Parallel Programs. Addison-Wesley. 1995. Mattson, Timothy G.; S, Beverly A.; ers,; Massingill,
More informationMemory Hierarchy Management for Iterative Graph Structures
Memory Hierarchy Management for Iterative Graph Structures Ibraheem Al-Furaih y Syracuse University Sanjay Ranka University of Florida Abstract The increasing gap in processor and memory speeds has forced
More information8. Hardware-Aware Numerics. Approaching supercomputing...
Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 48 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum
More information1 Publishable Summary
1 Publishable Summary 1.1 VELOX Motivation and Goals The current trend in designing processors with multiple cores, where cores operate in parallel and each of them supports multiple threads, makes the
More informationINTRODUCTION... 2 FEATURES OF DARWIN... 4 SPECIAL FEATURES OF DARWIN LATEST FEATURES OF DARWIN STRENGTHS & LIMITATIONS OF DARWIN...
INTRODUCTION... 2 WHAT IS DATA MINING?... 2 HOW TO ACHIEVE DATA MINING... 2 THE ROLE OF DARWIN... 3 FEATURES OF DARWIN... 4 USER FRIENDLY... 4 SCALABILITY... 6 VISUALIZATION... 8 FUNCTIONALITY... 10 Data
More information8. Hardware-Aware Numerics. Approaching supercomputing...
Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 22 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More information