
Girija J. Narlikar
CMU Computer Science Dept., 5000 Forbes Avenue, Pittsburgh, PA 15213
girija@cs.cmu.edu, http://www.cs.cmu.edu/~girija
Phone: (412) 268-3337 [O], (412) 802-0459 [H]; Fax: (412) 268-5576

Research Interests

Multithreading, distributed systems and applications, scheduling, parallel languages and systems, design and implementation of practical algorithms, performance modeling, information retrieval systems (for mining, web indexing, and searching).

Education

Carnegie Mellon University, August 1993 - present.
Ph.D. in Computer Science, expected February 1999. Dissertation: Space-Efficient Multithreading. Research advisor: Guy Blelloch. Thesis committee: Guy Blelloch, Thomas Gross, Bruce Maggs, and Charles Leiserson (MIT).
M.S. in Computer Science, received May 1995.

Indian Institute of Technology, Bombay, August 1989 - May 1993.
B.Tech. in Computer Science and Engineering, received May 1993.

Research Experience

Carnegie Mellon University, September 1993 - present.
Graduate Student Researcher in the SCANDAL project with Prof. Guy Blelloch. The project develops applicative, portable parallel languages such as NESL, as well as fast parallel algorithms. My contributions have focused mainly on space-efficient multithreading, the NESL implementation, and N-body algorithms:
- Designed provably space- and time-efficient scheduling algorithms for multithreading systems as part of my Ph.D. thesis, and built fast runtime systems based on these algorithms. A summary of my thesis research appears in the Summary of Research Activities below.
- Built a portable interpreter for the NESL programming language. Designed and implemented a Java-based graphics library to make NESL programs available as applets over the web.
- Analyzed, implemented, and compared three popular N-body algorithms for different levels of accuracy, numbers of particles, and types of forces.
- Conducted an experimental comparison of message passing and distributed shared memory on a network of workstations for a variety of applications, as part of an operating systems project.

NEC Research Institute, June 1998 and Summer 1997.
Consultant and Summer Intern with Satish Rao. Designed and implemented a distributed object system on a cluster of PCs with a fast interconnect. The goal of the library was to combine the convenience of a global name space for objects with the efficiency and predictability of bulk-synchronous programming.

DEC Systems Research Center, Summer 1995.
Summer Intern with Chandramohan Thekkath. Designed and implemented low-overhead runtime optimizations for a distributed shared memory library on a network of Alphas, and evaluated their effectiveness on a set of applications.

Indian Institute of Technology, Bombay, August 1992 - April 1993.
Senior Thesis with Dhananjay Dhamdhere. Formalized an algorithm for incremental dataflow analysis.

Center for Development of Advanced Computing (CDAC), India, Summer 1992.
Summer Intern. Ported gcc to a transputer platform (T-800) and compared its performance with that of the native compiler.

Indian Institute of Technology, Bombay, Spring 1992.
Designed and implemented a query database to computerize the Central Stores Purchasing System at IIT Bombay.

Teaching Experience

Carnegie Mellon University, Fall 1995.
Teaching Assistant for the senior-level Operating Systems class. Guided students in the implementation of a kernel (including virtual memory management) and a file system; helped create and grade exams and projects.

Carnegie Mellon University, Spring 1994.
Teaching Assistant for the undergraduate Data Structures and Complexity class. Taught four recitation classes a week; helped create and grade exams.

Indian Institute of Technology, Bombay, Fall 1991.
Teaching Assistant for the introductory programming class. Conducted recitations and graded exams and projects.

Refereed Publications

- G. Narlikar and G. Blelloch. Space-efficient implementation of nested parallelism. Accepted for publication in ACM Transactions on Programming Languages and Systems (TOPLAS); an earlier version appeared in Proc. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), June 1997.
- G. Narlikar. Scheduling threads for low space requirement and good locality. To appear in Proc. ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1999.
- M. Goudreau, K. Lang, G. Narlikar, and S. Rao. BOS is Boss: A Case for Bulk-Synchronous Object Systems. To appear in Proc. ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1999.
- G. Narlikar and G. Blelloch. Pthreads for dynamic and irregular parallelism. Proc. SC98: High Performance Networking and Computing, November 1998. (Best Student Paper Award.)
- J. Hardwick, G. Narlikar, and J. Sipelstein. Interactive Simulations on the Web: Compiling NESL into Java. Concurrency: Practice and Experience, 9(11):1075-1089, November 1997. Also appeared in Proc. ACM Workshop on Java for Science and Engineering Computation, June 1997.
- G. Blelloch, P. Gibbons, Y. Matias, and G. Narlikar. Space-efficient scheduling of parallelism with synchronization variables. Proc. ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1997.
- G. Blelloch and G. Narlikar. A practical comparison of N-body algorithms. Parallel Algorithms, Series in Discrete Mathematics and Theoretical Computer Science, Volume 30, 1997.

Other Publications

- G. Narlikar. A Parallel, Multithreaded Decision Tree Builder. Tech. Report CMU-CS-98-184, December 1998.
- G. Narlikar and G. Blelloch. Pthreads for dynamic parallelism. Tech. Report CMU-CS-98-114 (expanded version of the SC98 paper), April 1998.
- G. Narlikar and G. Blelloch. A framework for space and time efficient scheduling of parallelism. Tech. Report CMU-CS-96-197, December 1996.
- G. Blelloch and G. Narlikar. A comparison of two N-body algorithms. Proc. DIMACS Implementation Challenge Workshop, October 1994.

Technical Presentations

- Pthreads for irregular and dynamic parallelism. SC98: High Performance Networking and Computing, November 1998. Also presented at Sun Microsystems, March 1998, and the NOW Project Retreat, January 1998.
- Space-efficient multithreading. Yale Multithreading Workshop, June 1998.
- Interactive Simulations on the Web: Compiling NESL into Java. ACM 1997 Workshop on Java for Science and Engineering Computation, June 1997.
- Space-efficient implementation of nested parallelism. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), June 1997, Las Vegas. Also presented at DEC Systems Research Center, October 1997.
- Parallel N-body methods. Guest lecture for the graduate-level Parallel Computing class at Carnegie Mellon University, March 1997.
- A comparison of two N-body algorithms. DIMACS Implementation Challenge Workshop, October 1994, Rutgers University.

Honors

- Best Student Paper Award at SC98: High Performance Networking and Computing, November 1998.
- Graduate Student Member of the Sigma Xi Scientific Research Society, June 1997 - present.
- Graduate Research Fellowship, CMU Computer Science Department, August 1993 - present.
- Deshmukh Gold Medal for graduating with the highest GPA in the Computer Science and Engineering Department at IIT Bombay, May 1993.
- National Talent Scholarship awarded by the Government of India, 1987-1993.

Other Activities

- Member of the Technical Papers Committee, SC99.
- Reviewer for the Journal of the ACM; Theory of Computing Systems; Concurrency: Practice and Experience; IPPS; PPoPP; SPAA.
- Student member of the ACM, IEEE, and IEEE Computer Society.
- Volunteer for International Graduate Student Orientation at CMU, 1998.
- Organized free classical music concerts as a coordinator of SPIC-MACAY, 1996-1997.
- Enjoy hiking, tennis, racquetball, running, and cooking.

References

Prof. Guy Blelloch
CMU Computer Science Dept.
5000 Forbes Avenue
Pittsburgh, PA 15213
(412) 268-6245
Guy.Blelloch@cs.cmu.edu

Prof. Bruce Maggs
MIT Lab. for Computer Science
545 Technology Square
Cambridge, MA 02139
(617) 253-6035
bmm@lcs.mit.edu

Dr. Satish Rao
NEC Research Institute
4 Independence Way
Princeton, NJ 08540
(609) 951-2714
satish@research.nj.nec.com

Prof. Thomas Gross
CMU Computer Science Dept.
5000 Forbes Avenue
Pittsburgh, PA 15213
(412) 268-7661
Thomas.Gross@cs.cmu.edu

Summary of Research Activities

1. Thesis research: space-efficient multithreading

Programs with irregular or dynamic parallelism benefit from the use of lightweight, fine-grained threads. Lightweight threads significantly simplify the programmer's job, allow for automatic load balancing, and dynamically adapt to a changing number of processors. However, the program depends heavily on the threads implementation for high performance: unless the thread scheduler is carefully designed, the program may use too much space, or suffer poor performance due to excessive memory allocation, high scheduling overheads, or poor locality. My thesis research has focused on two aspects of thread scheduling: the design and analysis of provably space-efficient scheduling algorithms for fine-grained multithreaded programs, and the engineering and evaluation of fast multithreading runtime systems based on these algorithms.

(a) Space-efficient scheduling algorithms

I designed two online, asynchronous scheduling algorithms for multithreading systems that support nested parallelism. The algorithms are provably space and time efficient; that is, they provide low upper bounds on the space and time requirements of a multithreaded computation. The algorithms guarantee that a nested parallel program with serial space requirement S1 and critical path length (depth) D requires at most S1 + O(p·D) space on p processors. This is a significant improvement over previous asynchronous, space-efficient schedulers, which guarantee an upper bound of p·S1, since D is small compared to S1 for most parallel programs. My algorithms maintain this low space bound by prioritizing threads according to their sequential execution order, and by temporarily stalling big allocations of space. To ensure scalability of both algorithms, I also parallelized the scheduler itself, and analyzed the space and time requirements including scheduling overheads. Both algorithms provide a user-adjustable trade-off between space and time performance, which I analyzed and experimentally verified. The second algorithm improves on the first by also trying to schedule threads that are close in the computation graph on the same processor, yielding good locality and low scheduling contention.

In joint work with Phillip Gibbons and Yossi Matias of Bell Labs, and my advisor Guy Blelloch, we designed the first asynchronous, space-efficient scheduling algorithm for parallel languages with synchronization variables. Such languages include languages with futures, such as Multilisp or Cool, as well as other languages like Id.

(b) Multithreading runtime systems

To determine how useful my space-efficient scheduling algorithms are in practice for realistic benchmarks, I implemented them in the context of two separate runtime systems. First, I built a runtime system that executes C programs, with the threads written in continuation-passing style, on an SGI Power Challenge; it uses one of my asynchronous scheduling algorithms to schedule the threads. Results from executing parallel programs on this system show that my scheduling algorithm significantly reduces memory usage compared to previous space-efficient techniques, without compromising performance. Second, although POSIX threads (Pthreads) have become a popular standard for shared memory programming, I found the current Pthread schedulers unsuited to executing programs with fine-grained, dynamic parallelism.
I therefore implemented both of my space-efficient scheduling algorithms as part of a popular Pthreads package on Solaris. Although the algorithms guarantee space bounds only for nested parallel programs, they also appear to work well in practice for programs with arbitrary synchronizations, such as locks or condition variables. This is the first space-efficient system that supports functionality as general as that of Pthreads. I used several parallel programs, including numerical codes, physical simulations, and a data classifier, to evaluate the schedulers. The programs, written in a fine-grained style, were simpler to code than their hand-partitioned, coarse-grained counterparts, yet provided equivalent performance; however, my space-efficient schedulers were required in place of the Pthread library's original scheduler to achieve this performance.
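To make the two scheduling mechanisms concrete, the toy sketch below keeps ready threads in a priority queue ordered by their sequential execution order, and defers any allocation larger than a threshold by re-enqueuing the requesting thread. This is an illustrative simplification, not the thesis runtime: the Thread fields, the THRESHOLD value, and the single-processor loop are all invented for this example, and the real schedulers serve many processors from a parallelized queue.

```c
#include <stdio.h>
#include <stdlib.h>

#define THRESHOLD 1024   /* stand-in for the user-adjustable space/time knob */

/* A ready thread, tagged with its rank in the serial execution order. */
typedef struct Thread {
    int  seq;            /* position in the depth-first (serial) schedule */
    long next_alloc;     /* bytes its next step wants to allocate */
    int  was_stalled;    /* set once the allocation has been deferred */
    struct Thread *next;
} Thread;

static Thread *ready;    /* priority queue: sorted by seq, lowest first */

static void enqueue(Thread *t) {
    Thread **p = &ready;
    while (*p && (*p)->seq < t->seq)
        p = &(*p)->next;
    t->next = *p;
    *p = t;
}

static Thread *dequeue(void) {
    Thread *t = ready;
    if (t) ready = t->next;
    return t;
}

int main(void) {
    /* three threads; the one latest in serial order wants a big block */
    Thread a = {0, 100, 0, NULL}, b = {1, 200, 0, NULL}, c = {2, 4096, 0, NULL};
    enqueue(&c); enqueue(&a); enqueue(&b);

    Thread *t;
    while ((t = dequeue()) != NULL) {
        if (t->next_alloc > THRESHOLD && !t->was_stalled) {
            /* Stall the big allocation: put the thread back so that any
             * earlier (higher-priority) threads run first; this is what
             * keeps total live space close to the serial schedule's. */
            t->was_stalled = 1;
            enqueue(t);
            continue;
        }
        printf("run thread %d, allocating %ld bytes\n", t->seq, t->next_alloc);
    }
    return 0;
}
```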

2. Distributed systems

Although my thesis research has focused on parallel computing on shared memory systems, I have also worked on various aspects of distributed computing.

(a) Bulk-synchronous distributed objects

Bulk-synchronous libraries provide a simple and efficient model for parallel programming with predictable performance. However, the programmer has to manage all the data movement at a low level, in a style similar to message passing. In collaboration with Satish Rao (NEC Research Institute), I designed and implemented a distributed object layer over a bulk-synchronous library on a network of PCs. The object layer provides a global object name space that simplifies the task of programming, and uses software caching to minimize communication and programming effort. Since it is built over a bulk-synchronous library, the object system can also provide efficient and predictable performance. We have implemented a number of applications using the object layer to demonstrate its applicability and efficiency.

(b) Distributed shared memory

I worked with Chandramohan Thekkath (DEC Systems Research Center) on adding runtime optimizations to an existing distributed shared memory (DSM) library called SAM. SAM uses objects to share data between processors, and provides primitives for asynchronous access to these shared objects. Although the programming model is fairly user-friendly, performance can suffer under a large number of asynchronous requests for remote data. We designed runtime optimizations to reduce communication and to overlap communication with computation, taking advantage of the fact that the data access patterns of many applications change slowly over time. For example, we dynamically kept track of the set of active readers for each DSM object, so that on a write to the object, the copies in the active readers' caches could be updated before the next read request was made. Other optimizations included modifying the cache coherence policy according to an object's access patterns. The effectiveness of these optimizations was experimentally verified using existing parallel benchmarks written in SAM.

3. N-body algorithms

This joint work with Guy Blelloch compares three popular algorithms for the 3D N-body problem: the Barnes-Hut algorithm, Greengard's Fast Multipole Method (FMM), and the Parallel Multipole Tree Algorithm (PMTA), to determine which of the algorithms performs best in practice. Although the FMM has a better asymptotic running time (O(n) instead of O(n log n) for uniform distributions), the algorithm is more complicated, and it is not immediately clear above what values of n it performs better in practice. This is the first research that analyzes and compares the relative performance of these three popular N-body algorithms.

We first experimentally studied how accuracy depends on the variable parameters of the three algorithms. We then mathematically analyzed and compared the floating point operation counts of the algorithms at similar levels of accuracy, for both charged and uncharged random distributions. At a high level of accuracy, we found that the FMM performed the fewest operations for sufficiently large n, for both charged and uncharged distributions of points. However, at a lower level of accuracy, for uncharged distributions, the FMM did not outperform Barnes-Hut even at large n. For charged distributions of particles, the FMM and PMTA were comparable at low accuracy.
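Of the three algorithms, Barnes-Hut is the simplest to state; the sketch below shows only its opening-angle test, the accuracy parameter whose effect on operation counts the study above measures. The cell layout is hypothetical and invented for this example, and the FMM and PMTA, which rely on multipole expansions, are considerably more involved than this.

```c
#include <math.h>
#include <stddef.h>

/* An octree cell: either a leaf body or an internal node summarizing
 * the bodies inside it by total mass and center of mass. */
typedef struct Cell {
    double mass;           /* total mass in this cell */
    double cm[3];          /* center of mass */
    double side;           /* side length of the cubical cell */
    struct Cell *child[8]; /* NULL for empty octants; unused in leaves */
    int is_leaf;
} Cell;

/* Accumulate the acceleration (with G = 1) on a point at pos[].
 * A distant cell is approximated by a single point mass whenever
 * side/distance < theta; a smaller theta gives higher accuracy at the
 * cost of more operations, which is the trade-off measured above. */
void bh_accel(const Cell *c, const double pos[3], double theta, double acc[3])
{
    if (c == NULL || c->mass == 0.0)
        return;
    double d[3] = { c->cm[0] - pos[0], c->cm[1] - pos[1], c->cm[2] - pos[2] };
    double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2] + 1e-12; /* softened */
    double r  = sqrt(r2);
    if (c->is_leaf || c->side / r < theta) {
        double f = c->mass / (r2 * r);   /* m/r^3; times d gives m*d/r^3 */
        acc[0] += f * d[0];
        acc[1] += f * d[1];
        acc[2] += f * d[2];
    } else {
        for (int i = 0; i < 8; i++)      /* cell too close: open it up */
            bh_accel(c->child[i], pos, theta, acc);
    }
}
```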
4. Compilers

(a) Compiling NESL into Java

The SCANDAL project at CMU is involved in the implementation of a high-level, applicative language called NESL. NESL currently uses a C-based vector library as its backend; porting NESL to a new platform therefore involves porting this vector library. NESL is well suited to coding algorithm animations; however, since the NESL implementation was limited to X11-based Unix systems, an insecure X11 connection was required to make the animations available over the web. In joint work with Jonathan Hardwick and Jay Sipelstein at CMU, we designed and implemented an alternate system for compiling NESL into Java. In addition to increasing the portability of NESL, this system has enabled us to make existing simulations and algorithm animations available in applet form on the web. We compared the performance of the new system using a set of benchmarks on both PCs and workstations; current Java virtual machines running the generated code achieve about half the performance of a native NESL implementation. We found that using Java as an intermediate language is a viable way to improve the portability of existing high-level programming languages for scientific computation and visualization.

(b) Incremental dataflow analysis

As part of my senior thesis under the guidance of Dhananjay Dhamdhere (IIT), I worked on an algorithm for incremental dataflow analysis. The goal was to minimize the time required by the optimization phase of a compiler when small modifications are made to the program.
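For context, the sketch below shows the classic exhaustive worklist formulation of iterative dataflow analysis (here, liveness as a backward bit-vector problem); an incremental algorithm in the spirit of the one described above would, after a small program edit, seed the worklist with only the affected nodes instead of all of them. The control-flow graph and use/def sets are made-up examples, not taken from the thesis.

```c
#include <stdio.h>

/* Tiny CFG: node 0 -> 1, node 1 -> {2,3}, node 2 -> 1, node 3 is the exit. */
#define N 4
#define MAXSUCC 2
static const int succ[N][MAXSUCC] = {{1,-1}, {2,3}, {1,-1}, {-1,-1}};

typedef unsigned Set;                             /* one bit per variable */
static const Set use[N] = {0x1, 0x2, 0x4, 0x2};   /* variables read */
static const Set def[N] = {0x2, 0x0, 0x1, 0x0};   /* variables written */

int main(void)
{
    Set in[N] = {0};
    int work[256], top = 0;
    for (int n = 0; n < N; n++)
        work[top++] = n;              /* seed: every node starts dirty */

    while (top > 0) {
        int n = work[--top];
        Set out = 0;                  /* live-out = union of succs' live-in */
        for (int s = 0; s < MAXSUCC && succ[n][s] >= 0; s++)
            out |= in[succ[n][s]];
        Set newin = use[n] | (out & ~def[n]);
        if (newin != in[n]) {         /* changed: predecessors must rerun */
            in[n] = newin;
            for (int m = 0; m < N; m++)
                for (int s = 0; s < MAXSUCC && succ[m][s] >= 0; s++)
                    if (succ[m][s] == n && top < 256)
                        work[top++] = m;
        }
    }
    for (int n = 0; n < N; n++)
        printf("node %d: live-in = 0x%x\n", n, in[n]);
    return 0;
}
```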