Milind Kulkarni Research Statement
With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers have struggled to ease the burden of writing parallel programs. While these efforts have met with success in some domains (dense linear algebra and SQL programming being two well-known examples), writing efficient parallel code is still largely the purview of expert programmers. One of the great challenges facing the programming languages community is to make parallel programming accessible and effective for average programmers. I believe the key to making parallel programming accessible is to hide as much complexity as possible behind intuitive abstractions that capture important information about parallelism and locality. These abstractions can then be exploited by reusable libraries and run-time systems written by expert programmers, allowing most programmers to write algorithms in an intuitive, nearly sequential style. My research has focused on discovering useful and natural abstractions for writing irregular programs (programs that manipulate pointer-based data structures such as trees and graphs) and on developing the compiler techniques and run-time systems needed to exploit those abstractions. This research has opened up a number of additional directions that I would like to explore: (1) building new systems that exploit program semantics to enhance parallelism and locality; (2) developing modeling tools that allow programmers to better understand the parallelism in algorithms and systems; (3) using autotuning techniques to allow parallel programs to adapt to novel architectures; and (4) exploring new application domains that offer new challenges for parallel programming. Tackling these problems is an important step toward solving the problem of parallel programming. 
Research approach My research tends to adopt the following pattern: (1) study interesting problems arising in important real-world applications; (2) find general patterns in those problems; (3) develop abstractions that capture those general patterns; (4) produce efficient implementations of those abstractions. This approach has served me well throughout my research, as demonstrated by my development of the Galois system for optimistic parallelization [1]: (1) Study: I begin projects by carefully studying important applications from a variety of domains. By understanding specific applications, I can find out where current approaches for optimization or parallelization fall short, and why. To drive this search for interesting problems, I have collaborated with researchers from a variety of application domains, in fields ranging from computational geometry to graphics to data mining. The genesis of the Galois system came from studying two real-world algorithms from these domains, Delaunay mesh refinement and agglomerative clustering. (2) Generalize: Armed with a deep understanding of applications, and having identified particular problems to solve, the next step is to generalize. For example, computation in Delaunay mesh refinement is structured as processing elements from a worklist in an arbitrary order, a pattern that appears in a variety of irregular applications. While worklist items often exhibit a complex pattern of dependences, there is nevertheless parallelism to be exploited by processing
independent worklist elements concurrently. I call this pattern of parallelism amorphous data parallelism. This type of parallelism is an ideal target for speculative parallelization. (3) Abstract: These generalizations allow me to develop abstractions that capture important program behavior. Good abstractions possess two key properties: they should be intuitive, and they should expose useful program semantics. For example, amorphous data parallelism can be expressed through the use of optimistic iterators, which highlight the opportunity for parallelism in a program. Despite having simple sequential semantics, these iterators expose ordering properties that are otherwise hidden and provide a hint that speculative parallelization can be profitable. The abstractions I develop are heavily informed by my experience with real-world applications. For example, I realized that existing speculative parallelization techniques such as thread-level speculation would detect a number of benign dependences in irregular applications. Because violating a benign dependence does not affect correctness, parallelism can be improved by ignoring such dependences, provided they can be identified. I used the notion of semantic commutativity, an abstraction that precisely exposes the object semantics required for exact dependence checking, to produce object libraries that allow significant amounts of concurrency during speculative execution. (4) Implement: The final step of a research project is to produce efficient implementations of the abstractions. This can encompass any number of techniques, from compiler transformations to run-time systems. The Galois system comprises a software run-time that can parallelize programs written using optimistic iterators, and object libraries that leverage semantic commutativity to perform precise dependence checking. 
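The idea behind semantic commutativity can be illustrated with a minimal Python sketch. This is not the actual Galois API; the class and function names are purely illustrative. The point is that conflicts between speculative iterations are decided by method semantics rather than by raw memory accesses:

```python
# Hypothetical sketch of semantic commutativity for a set object.
# Two method invocations conflict only when they do NOT commute,
# i.e., when reordering them could change a result either one sees.

class CommutativitySet:
    def __init__(self):
        self.items = set()

    def add(self, x):
        self.items.add(x)

    def contains(self, x):
        return x in self.items

def commutes(op1, op2):
    """Decide whether two operations (method name, argument) commute.

    add(a) commutes with add(b) even when a == b: both orders leave
    the set identical and neither call returns a value, so concurrent
    speculative iterations that insert elements never need to be
    rolled back against each other. add(a) and contains(a) do not
    commute, because the answer contains(a) returns depends on order.
    """
    (m1, a1), (m2, a2) = op1, op2
    if m1 == "add" and m2 == "add":
        return True
    if {m1, m2} == {"add", "contains"}:
        return a1 != a2      # conflict only on the same element
    return True              # contains/contains always commutes

# A memory-level checker (as in thread-level speculation) would flag
# add(5) vs. add(5) as a write-write conflict and trigger a rollback;
# the semantic check correctly allows it.
print(commutes(("add", 5), ("add", 5)))        # True
print(commutes(("add", 5), ("contains", 5)))   # False
print(commutes(("add", 5), ("contains", 7)))   # True
```

The benign dependence mentioned above is exactly the add/add case: the operations touch the same memory, but no observable result depends on their order.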
This implementation achieved low overhead and good scalability, demonstrating that the abstractions I developed could be efficiently supported and exploited. Prior Research My past research reflects the approach outlined above. The abstractions for expressing and exploiting amorphous data parallelism described above formed the basis of the Galois system. This initial work showed that it was possible to write irregular programs in a straightforward, nearly sequential manner and still achieve useful parallelism. Two further contributions of my research have been developing locality abstractions for irregular data structures [2] and scheduling abstractions for amorphous data parallelism [3], and integrating these abstractions into the Galois system. This research has met with significant industry interest (Intel and IBM have contributed funding), and continuing to collaborate with industry is a key point in my research agenda. Locality Abstractions Truly unlocking the potential parallelism in amorphous data-parallel programming requires attending to locality. Achieving locality is the key to high-performance parallel programs. Naive parallel implementations of irregular algorithms can suffer from poor cache locality (e.g., because computations scheduled for a single processor access data from all regions of a data structure), and, similarly, naively running locality-preserving sequential implementations in parallel may result in high contention (e.g., because computations scheduled simultaneously on multiple processors access the same region of a data structure). This interplay between locality and parallelism is especially problematic in irregular programs, as there is no well-defined notion of locality in irregular data structures such as graphs or trees. The problem is obvious: how can a programmer exploit locality in a data structure that doesn't seem to have any?
I answered this question in [2] by proposing an abstraction that captures semantic locality in irregular data structures. Semantic locality refers to locality that arises in an irregular data structure due to the semantics of its access patterns. For example, in a graph, a node is semantically local to its neighbors, as from a given node you can access its neighbors. Note that this locality is preserved regardless of the implementation of the graph. To capture semantic locality, I introduced the abstraction of partitioning: irregular data structures are logically partitioned, with the property that regions of the data structure in the same partition are semantically local (and vice versa). I showed how to exploit this partitioning to improve parallelism, by scheduling computations affecting different partitions on different processors; to improve locality, by scheduling computations within a single partition to exploit temporal locality; and to reduce the overhead of conflict detection, by replacing precise conflict detection with locks on partitions. Thus, I showed that a simple, intuitive abstraction can allow locality to be successfully exploited for irregular data structures. Scheduling Abstractions My experience with the Galois system and partition-based scheduling made it clear that scheduling, the assignment of work to processors, has an enormous effect on performance. Unfortunately, the behavior of a given schedule is highly application dependent, and the space of possible schedules is vast. In [3] I developed a scheduling framework for describing computation schedules for amorphous data-parallel programs. This framework is built around three abstractions, which together fully describe a given schedule: (i) clustering, which specifies chunks of work that should be executed on a single processor; (ii) labeling, which assigns clusters of work to particular processors; and (iii) ordering, which determines the order in which a processor executes its assigned work. 
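The three abstractions can be sketched as three functions that jointly map a pool of work onto processors. This is a hedged illustration, not the framework's actual interface; the function names and signatures are assumptions made for exposition:

```python
# Hypothetical sketch: a schedule is fully described by three functions.
#   cluster(item)    -> cluster id   (work that should run together)
#   label(cluster_id)-> processor id (who runs each cluster)
#   order(items)     -> list         (execution order on one processor)
from collections import defaultdict

def schedule(work, cluster, label, order, num_procs):
    clusters = defaultdict(list)
    for item in work:
        clusters[cluster(item)].append(item)
    per_proc = defaultdict(list)
    for cid, items in clusters.items():
        per_proc[label(cid) % num_procs].extend(items)
    return {p: order(items) for p, items in per_proc.items()}

# Instantiating the framework as an OpenMP-style "static" schedule:
# contiguous chunks, round-robin assignment, original order preserved.
work = list(range(8))
static = schedule(work,
                  cluster=lambda i: i // 2,   # chunks of 2 items
                  label=lambda cid: cid,      # round-robin over procs
                  order=sorted,
                  num_procs=2)
print(static)   # {0: [0, 1, 4, 5], 1: [2, 3, 6, 7]}
```

Swapping in a different clustering function (e.g., one that maps each item to the data-structure partition it touches) yields the partition-based schedules of [2] within the same three-function vocabulary.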
I showed that this framework is general: it can be instantiated to produce all the schedules used in data-parallel frameworks such as OpenMP, as well as the partition-based computation scheduling used in [2]. I also showed that the framework is useful: I gave instantiations of the framework that produced novel schedules with greater performance than existing ones. This work demonstrates that a small number of simple abstractions suffice to describe the vast space of schedules that can be applied to parallel, irregular applications. Future Research Short term My short-term research goals fall into two categories: (1) finding new ways to exploit program semantics to produce efficient parallel programs, and (2) giving programmers new tools to model the parallelism in algorithms and profile parallel implementations of those algorithms. Along the first line of inquiry, partitioning information may be valuable when assigning threads to cores in a hierarchical architecture. Intuitively, threads that are likely to communicate with one another should be assigned to cores that enjoy low-latency communication through mechanisms such as shared caches. Partitioning information, as well as partition-aware scheduling, can allow a system to make intelligent assumptions about communication patterns, even for irregular programs. In managed languages such as Java, it may be possible to use partitioning information during garbage collection to re-lay out data structures, turning the exposed semantic locality into spatial locality. The scheduling framework I developed in [3] is descriptive; it does not mandate a particular implementation of schedules. I plan to develop a language for schedules that will allow programmers to specify in a declarative manner the scheduling properties they want, as in systems like OpenMP. Given this specification, a compiler can generate a scheduler which will be used within the Galois run-time. 
This compiler machinery will enable autotuning: a meta-compiler can automatically generate a number of potential schedulers for an application and evaluate them on a test input, choosing the schedule that performs best for a given architecture. I have also recently become interested in modeling the behavior of amorphous data-parallel programs. I wrote a tool called ParaMeter to begin investigating the parallelism available in such programs [4]. ParaMeter estimates parallelism by finding a maximal independent set of work at each step in a computation, providing an upper bound on the amount of parallelism in a program. The current version of ParaMeter makes simplifying assumptions about how long work takes (each piece of work takes unit time) and communication costs (no cost). I plan to extend ParaMeter to provide more accurate models of parallelism by accounting for work irregularity and communication behavior. I believe this will be a useful tool not only to the programming languages and systems community, but also to the algorithms community, as it will provide insight into the expected parallel performance of irregular algorithms. Long term An interesting pattern that I have noticed in my work is that many of the abstractions I developed allow irregular programs to be transformed in much the same way that regular, dense-matrix programs are. The optimistic iterators I proposed are the basic parallel loop construct, analogous to DO-ALL loops in languages like Fortran, and techniques like partition-aware scheduling are analogous to loop tiling in matrix codes. It may be possible to lift other high-level program transformations from the world of regular programs to the world of irregular programs. For example, consider loop interchange, which can improve locality in matrix codes by changing traversal order. What does loop interchange mean when applied to an irregular program consisting of repeated traversals of an irregular data structure (a pattern that appears in, e.g., n-body codes)? 
A transformation analogous to loop interchange, when applied to such a code, might produce a reordered sequence of partial traversals that can be grouped together to promote locality. Are these types of transformations always legal? Is there a general way to express such transformations? Autotuning techniques may be more broadly applicable in programs written at a suitable level of abstraction. As long as the abstractions are well defined, it may be possible to automatically search a space of possible instantiations of those abstractions to choose the best concrete implementation of a program. I am especially interested in dynamic autotuning, where the parameters of a program are changed at run time in response to input characteristics or run-time behavior. To me, the key question when thinking about future research is identifying new and exciting application domains. Several emerging areas are the focus of substantial research and will require substantial amounts of parallelism. Computational biology brings software analysis to bear on massive data sets. A number of algorithms common in computational biology are irregular in nature, such as Survey Propagation for solving SAT problems. What new techniques will be needed to parallelize and optimize irregular algorithms that work with vast amounts of data? How can speculation techniques like Galois be brought to bear on applications that require distributed-memory architectures? On a lighter note, games are always on the cutting edge of the performance curve, and the algorithms underlying high-performance games will hence require parallelism. While some tasks, such as shading, are inherently parallelizable, many are more difficult. Maintaining game state requires tracking the position and behavior of thousands of objects, each of which can interact with the others; simultaneously updating the states of these game objects fits naturally into the framework of amorphous data parallelism. 
What sorts of systems are needed to parallelize game algorithms while adhering to real-time constraints? Can hardware usually devoted to graphics be leveraged to improve the performance of other gaming algorithms?
References

[1] Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic Parallelism Requires Abstractions. In Programming Language Design and Implementation, June.
[2] Milind Kulkarni, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala, and L. Paul Chew. Optimistic Parallelism Benefits From Data Partitioning. In Architectural Support for Programming Languages and Operating Systems, March.
[3] Milind Kulkarni, Patrick Carribault, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala, and L. Paul Chew. Scheduling Strategies for Optimistic Parallelization of Irregular Programs. In Symposium on Parallelism in Algorithms and Architectures, June.
[4] Milind Kulkarni, Martin Burtscher, R. Inkulu, Keshav Pingali, and Calin Cascaval. How Much Parallelism is There in Irregular Applications? In Principles and Practices of Parallel Programming, February 2009 (to appear).
KNOWLEDGENT INSIGHTS volume 1 no. 5 October 7, 2011 Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value Today s growing commercial, operational and regulatory
More informationTransactions These slides are a modified version of the slides of the book Database System Concepts (Chapter 15), 5th Ed
Transactions These slides are a modified version of the slides of the book Database System Concepts (Chapter 15), 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan. Original slides are available
More informationRelaxed Memory-Consistency Models
Relaxed Memory-Consistency Models [ 9.1] In Lecture 13, we saw a number of relaxed memoryconsistency models. In this lecture, we will cover some of them in more detail. Why isn t sequential consistency
More informationFractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures
Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin
More informationSAT, SMT and QBF Solving in a Multi-Core Environment
SAT, SMT and QBF Solving in a Multi-Core Environment Bernd Becker Tobias Schubert Faculty of Engineering, Albert-Ludwigs-University Freiburg, 79110 Freiburg im Breisgau, Germany {becker schubert}@informatik.uni-freiburg.de
More informationUNIT I. Introduction
UNIT I Introduction Objective To know the need for database system. To study about various data models. To understand the architecture of database system. To introduce Relational database system. Introduction
More informationICOM 5016 Database Systems. Chapter 15: Transactions. Transaction Concept. Chapter 15: Transactions. Transactions
ICOM 5016 Database Systems Transactions Chapter 15: Transactions Amir H. Chinaei Department of Electrical and Computer Engineering University of Puerto Rico, Mayagüez Slides are adapted from: Database
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationRubicon: Scalable Bounded Verification of Web Applications
Joseph P. Near Research Statement My research focuses on developing domain-specific static analyses to improve software security and reliability. In contrast to existing approaches, my techniques leverage
More informationis easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology
Preface The idea of improving software quality through reuse is not new. After all, if software works and is needed, just reuse it. What is new and evolving is the idea of relative validation through testing
More informationStatement of Research for Taliver Heath
Statement of Research for Taliver Heath Research on the systems side of Computer Science straddles the line between science and engineering. Both aspects are important, so neither side should be ignored
More informationChapter 13: Transactions
Chapter 13: Transactions Transaction Concept Transaction State Implementation of Atomicity and Durability Concurrent Executions Serializability Recoverability Implementation of Isolation Transaction Definition
More informationRATCOP: Relational Analysis Tool for Concurrent Programs
RATCOP: Relational Analysis Tool for Concurrent Programs Suvam Mukherjee 1, Oded Padon 2, Sharon Shoham 2, Deepak D Souza 1, and Noam Rinetzky 2 1 Indian Institute of Science, India 2 Tel Aviv University,
More informationto automatically generate parallel code for many applications that periodically update shared data structures using commuting operations and/or manipu
Semantic Foundations of Commutativity Analysis Martin C. Rinard y and Pedro C. Diniz z Department of Computer Science University of California, Santa Barbara Santa Barbara, CA 93106 fmartin,pedrog@cs.ucsb.edu
More informationJIVE: Dynamic Analysis for Java
JIVE: Dynamic Analysis for Java Overview, Architecture, and Implementation Demian Lessa Computer Science and Engineering State University of New York, Buffalo Dec. 01, 2010 Outline 1 Overview 2 Architecture
More informationNotes and Comments for [1]
Notes and Comments for [1] Zhang Qin July 14, 007 The purpose of the notes series Good Algorithms, especially for those natural problems, should be simple and elegant. Natural problems are those with universal
More informationTransactions. Lecture 8. Transactions. ACID Properties. Transaction Concept. Example of Fund Transfer. Example of Fund Transfer (Cont.
Transactions Transaction Concept Lecture 8 Transactions Transaction State Implementation of Atomicity and Durability Concurrent Executions Serializability Recoverability Implementation of Isolation Chapter
More informationStory so far. Parallel Data Structures. Parallel data structure. Working smoothly with Galois iterators
Story so far Parallel Data Structures Wirth s motto Algorithm + Data structure = Program So far, we have studied parallelism in regular and irregular algorithms scheduling techniques for exploiting parallelism
More informationAUTOMATIC VECTORIZATION OF TREE TRAVERSALS
AUTOMATIC VECTORIZATION OF TREE TRAVERSALS Youngjoon Jo, Michael Goldfarb and Milind Kulkarni PACT, Edinburgh, U.K. September 11 th, 2013 Youngjoon Jo 2 Commodity processors and SIMD Commodity processors
More informationParallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads)
Parallel Programming Models Parallel Programming Models Shared Memory (without threads) Threads Distributed Memory / Message Passing Data Parallel Hybrid Single Program Multiple Data (SPMD) Multiple Program
More informationDesign of Parallel Algorithms. Models of Parallel Computation
+ Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes
More informationBusiness Rules Extracted from Code
1530 E. Dundee Rd., Suite 100 Palatine, Illinois 60074 United States of America Technical White Paper Version 2.2 1 Overview The purpose of this white paper is to describe the essential process for extracting
More information20762B: DEVELOPING SQL DATABASES
ABOUT THIS COURSE This five day instructor-led course provides students with the knowledge and skills to develop a Microsoft SQL Server 2016 database. The course focuses on teaching individuals how to
More informationParallel Programming Must Be Deterministic by Default
Parallel Programming Must Be Deterministic by Default Robert L. Bocchino Jr., Vikram S. Adve, Sarita V. Adve and Marc Snir University of Illinois at Urbana-Champaign {bocchino,vadve,sadve,snir}@illinois.edu
More informationAdaptive Assignment for Real-Time Raytracing
Adaptive Assignment for Real-Time Raytracing Paul Aluri [paluri] and Jacob Slone [jslone] Carnegie Mellon University 15-418/618 Spring 2015 Summary We implemented a CUDA raytracer accelerated by a non-recursive
More informationAdaptive Lock. Madhav Iyengar < >, Nathaniel Jeffries < >
Adaptive Lock Madhav Iyengar < miyengar@andrew.cmu.edu >, Nathaniel Jeffries < njeffrie@andrew.cmu.edu > ABSTRACT Busy wait synchronization, the spinlock, is the primitive at the core of all other synchronization
More informationOracle Developer Studio 12.6
Oracle Developer Studio 12.6 Oracle Developer Studio is the #1 development environment for building C, C++, Fortran and Java applications for Oracle Solaris and Linux operating systems running on premises
More informationQuestion 1: What is a code walk-through, and how is it performed?
Question 1: What is a code walk-through, and how is it performed? Response: Code walk-throughs have traditionally been viewed as informal evaluations of code, but more attention is being given to this
More informationNovel Lossy Compression Algorithms with Stacked Autoencoders
Novel Lossy Compression Algorithms with Stacked Autoencoders Anand Atreya and Daniel O Shea {aatreya, djoshea}@stanford.edu 11 December 2009 1. Introduction 1.1. Lossy compression Lossy compression is
More informationMorph Algorithms on GPUs
Morph Algorithms on GPUs Rupesh Nasre 1 Martin Burtscher 2 Keshav Pingali 1,3 1 Inst. for Computational Engineering and Sciences, University of Texas at Austin, USA 2 Dept. of Computer Science, Texas State
More informationFADA : Fuzzy Array Dataflow Analysis
FADA : Fuzzy Array Dataflow Analysis M. Belaoucha, D. Barthou, S. Touati 27/06/2008 Abstract This document explains the basis of fuzzy data dependence analysis (FADA) and its applications on code fragment
More informationAOSA - Betriebssystemkomponenten und der Aspektmoderatoransatz
AOSA - Betriebssystemkomponenten und der Aspektmoderatoransatz Results obtained by researchers in the aspect-oriented programming are promoting the aim to export these ideas to whole software development
More informationParallel Programming Concepts. Parallel Algorithms. Peter Tröger
Parallel Programming Concepts Parallel Algorithms Peter Tröger Sources: Ian Foster. Designing and Building Parallel Programs. Addison-Wesley. 1995. Mattson, Timothy G.; S, Beverly A.; ers,; Massingill,
More informationMemory Hierarchy Management for Iterative Graph Structures
Memory Hierarchy Management for Iterative Graph Structures Ibraheem Al-Furaih y Syracuse University Sanjay Ranka University of Florida Abstract The increasing gap in processor and memory speeds has forced
More information8. Hardware-Aware Numerics. Approaching supercomputing...
Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 48 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum
More information1 Publishable Summary
1 Publishable Summary 1.1 VELOX Motivation and Goals The current trend in designing processors with multiple cores, where cores operate in parallel and each of them supports multiple threads, makes the
More informationINTRODUCTION... 2 FEATURES OF DARWIN... 4 SPECIAL FEATURES OF DARWIN LATEST FEATURES OF DARWIN STRENGTHS & LIMITATIONS OF DARWIN...
INTRODUCTION... 2 WHAT IS DATA MINING?... 2 HOW TO ACHIEVE DATA MINING... 2 THE ROLE OF DARWIN... 3 FEATURES OF DARWIN... 4 USER FRIENDLY... 4 SCALABILITY... 6 VISUALIZATION... 8 FUNCTIONALITY... 10 Data
More information8. Hardware-Aware Numerics. Approaching supercomputing...
Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 22 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum
More informationOn the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters
1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk
More information