SEJITS: High Performance for Productivity Applications

Size: px

Start display at page:

Download "SEJITS: High Performance for Productivity Applications"

Annis Walsh
5 years ago
Views:

1 SEJITS: High Performance for Productivity Applications Shoaib Kamil & many others Parlab End of Project Celebration May 30, 2013

2 Correctness Where Did We Start? Personal Health Image Retrieval Hearing, Music Dwarfs Speech Parallel Browser Composition & Coordination Language (C&CL) C&CL Compiler/Interpreter Static Verification Parallel Libraries Parallel Frameworks Type Systems Efficiency Sketching Languages Autotuners Legacy Communication & Schedulers Code Synch. Primitives Efficiency Language Compilers OS Libraries & Services Legacy OS Hypervisor Intel Multicore/GPGPU RAMP Manycore 2 Directed Testing Dynamic Checking Debugging with Replay 2

3 Serial Callee Parallel Callee The Big Problems Serial Caller Parallel Caller? Knew how to coordinate serial caller with serial or parallel callee How can we do the other two, in a way understandable to non-expert programmers? 3

4 The Big Problems Layering is good for engineering, bad for performance App 1 App 2 App 3 Dense Sparse Graph Trav. Multicore GPU Cloud 4

Inspiration: Auto-tuned DSL for Stencils Began with a large productivity-programmer-created codebase: new climate code Use code transformation to go from productive to

5 Inspiration: Auto-tuned DSL for Stencils Began with a large productivity-programmer-created codebase: new climate code Use code transformation to go from productive to efficient code.f95 Engines Generators Engines.f95.c.cu.c.h Reference Implementation Parse Internal Abstract Syntax Tree Representation Strategy Serial Parallel Victoria Falls GTX280 Code FORTRAN C with pthreads CUDA Transformation & Code Generation with high-level knowledge Myriad of equivalent, optimized, implementations (plus test harness) Search in context of specific problem Best performing implementation and configuration parameters OpenMP Comparison Expected Peak serial reference [IPDPS10] 5

6 Inspiration: ActiveRecord Product.find_by_price(95) Product.all.where( quantity >?, 0).where(:color => red ) Ruby on Rails ORM Complex queries generated from simple chains of method invocations Productivity programmer doesn t need to know SQL 6

7 Execute Phase Compile Phase SEJITS In A Nutshell Program Non-DSL Code Code in DSL A Code in DSL B Data Code in DSL A DSL Codegen Data Interpreter External Compiler Dynamic Link Library Result 7

8 Case Study: Sepya Stencils Embedded in Python with Auto-tuning Simple, declarative source for fast stencil computations on multicore/multisocket CPUs class My1DKernel(StencilKernel): def init (self): super(mykernel, self). init () # set the neighbors set 1 as the point # before and the point after self.set_neighbors(1, [[1],[-1]]) # the actual kernel function def kernel(self, out_grid, in_grid): for x in out_grid.interior_points(): for y in self.neighbors(in_grid, x, 1): out_grid[x] += 0.5 * in_grid[y] 8

9 Case Study: Sepya 93% of Roofline Peak 2.2x faster than state-of-the-art 9

10 Asp is SEJITS for Python Library for building auto-tuned DSELs Introspection of Python Code generation via templates or tree transformations Auto-tuning support Data structure interoperability Automatic compiler detection & demand-driven compilation with caching Ongoing development & support Screencasts available [SciPy11, PPoPP12] 10

11 Copperhead Data parallel compiler for Python subset from copperhead import * import numpy as def axpy(a, x, y): return [a * xi + yi for xi, yi in zip(x, y)] x = np.arange(100, dtype=np.float64) y = np.arange(100, dtype=np.float64) with places.gpu0: gpu = axpy(2.0, x, y) with places.openmp: cpu = axpy(2.0, x, y) Bryan Catanzaro et al (NVIDIA Research) [GTC12] 11

12 Ongoing Work Knowledge Discovery Toolbox (John Gilbert, Adam Lugowski, and Aydin Buluc, UCSB & LBNL) [IPDPS13] PyCASP: Python-based Content Analysis Using Specialization (Katya Gonina) [ASRU11] Bag of Little Bootstraps (Peter Birsinger, Richard Xia, et al) [SciPy12] Communication-Avoiding Recursive Algorithms (David Eliahu, Omer Spillinger, Oded Schwartz, Ben Lipshitz, Jim Demmel) [IPDPS13] Communication-Avoiding Solvers (Jeffrey Morlan) [iwapt12] Three Fingered Jack (David Sheffield) [HotPar13] ASPIRE Project 12

13 SEJITS Participants Bryan Catanzaro, Kurt Keutzer, Katya Gonina, Jeffrey Morlan, Armando Fox, Richard Xia, Henry Cook, Peter Birsinger, David Patterson, Katherine Yelick, James Demmel, Tim Mattson, Henry Gabb, David Sheffield, Michael Anderson, Juan Vargas, Jessica Miller, Scott Beamer, Omer Spillinger, David Eliahu, Aakash Prasad, Oded Schwartz, Ben Lipshitz, David Howard, Derrick Coetzee, Tayfun Elmas, Koushik Sen 13

14 Demo & Testimonial Demo: Derrick Coetzee showing a simple image processing kernel Testimonial: Tim Mattson, Intel 14

Code Generators for Stencil Auto-tuning

Code Generators for Stencil Auto-tuning Shoaib Kamil with Cy Chan, Sam Williams, Kaushik Datta, John Shalf, Katherine Yelick, Jim Demmel, Leonid Oliker Diagnosing Power/Performance Correctness Where this