LORE: Mapping and Optimization Framework
1 LORE: Mapping and Optimization Framework
Michael Wolf, MIT Lincoln Laboratory
11 September 2012
This work is sponsored by the Defense Advanced Research Projects Agency (DARPA) under Air Force contract FA C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government. Distribution Statement A: Approved for public release, distribution is unlimited. (9/10/2012). NOT APPROVED FOR PUBLIC RELEASE
2 Overview of Mapping and Optimization Challenges
Mapping and optimization spans many architectures (FPGAs, chips, small hybrid systems, clusters, supercomputers, data warehouses) and applications (SAR, secure communication, cyber, database operations)
Challenges:
- Realistic simulations of applications
- Support for diverse languages/numerical libraries
- Support for diverse devices and architectures
3 LORE
LORE is MIT Lincoln Laboratory's Mapping and Optimization Runtime Environment: a parallel framework/environment for
- Optimizing data-to-processor mapping for parallel applications
- Simulating and optimizing new (and existing) architectures
- Generating performance data (runtime, power, etc.)
- Code generation and execution for target architectures
Three generations of mapping and optimization:
- pmapper: Matlab, dense operations, simple architecture model, executor, serial
- SMaRT/MORE: Matlab, sparse operations (limited data size), architecture model, no executor, serial
- LORE: multiple language support, sparse/dense operations, architecture model, executor, parallel, robust software
pmapper patent issued: "Method and apparatus performing automatic mapping for multiprocessor system"
4 Key Features of LORE
- Support for multiple languages and numerical libraries
- Ability to solve large problems: written in C++ and runs in parallel, fitting larger problems into memory and reducing time to solution
- Support for dense and sparse linear algebra operations
- Production-quality research software: easy-to-use interfaces, designed to support future algorithms/packages/languages
LORE is NOT an autoparallelizing compiler:
- It will not generate optimized parallel code for an arbitrary set of (serial or parallel) instructions
- Data layouts are optimized in the context of maps
5 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
6 LORE Framework Overview
LORE input: user code, LORE parameters, architectural model
LORE output (one or more of):
1. Set of optimized maps, or
2. Performance data, or
3. Optimized architecture(s), or
4. Generated code, or
5. Results from a run on the target architecture
Key/novel features:
- Application code to simulator
- Data mapping optimization for user code
- Support for multiple languages and libraries
- Ability to solve large problems
- Production-quality software
7 LORE Design Overview
LORE input → Parse Manager (Parser, Language Interface) → AST → Map Manager (Map Converter, Map Builder) → mapped AST → Analyzer and Optimizer (core functionality) → Code Generator, Runtime Engine, Machine Interface → LORE output
AST = abstract syntax tree
8 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
9 LORE Usage 1: Map Optimization
- LORE produces a set of optimized maps for the parallel variables specified in the user code
- Matrix-vector product example (y = A x): LORE computes a map for the dense matrix and a map for each of the two vectors
- LORE optimizes data mapping to improve the parallel performance of key computational kernels
10 Map Optimization: Input (Dense Matrix-Vector Product)
Input from application: user code for the dense matrix-vector product y = A x
[Pipeline: user code → Parser → Mapper → Analyzer and Optimizer → optimized maps]
11 Map Optimization: Representation (Dense Matrix-Vector Product)
The parser converts the user code into an abstract syntax tree (AST), which is input language/numerical library neutral
[AST: SB node → MV node → PVar: y, PVar: A, PVar: x]
12 Map Optimization: Mapping (Dense Matrix-Vector Product)
LORE computes a map for each parallel variable in the AST:
- PVar y, vector map: P0: {0,1}, P1: {2,3}
- PVar A, matrix map: P0: {0,1}, P1: {2,3}
- PVar x, vector map: P0: {0,1}, P1: {2,3}
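The block maps above can be read as an assignment of index blocks to processor ranks. As a minimal sketch (plain Python for illustration; `block_map` is a hypothetical helper, not LORE's C++ interface), such a map might be represented as a dictionary from rank to the indices it owns:

```python
# Hypothetical sketch of a block "map" like the ones shown above:
# each processor rank owns a contiguous block of indices.
def block_map(n_indices, n_procs):
    """Return {rank: [owned indices]} distributing n_indices over n_procs."""
    block = (n_indices + n_procs - 1) // n_procs  # ceiling division
    return {p: list(range(p * block, min((p + 1) * block, n_indices)))
            for p in range(n_procs)}

vector_map = block_map(4, 2)
# vector_map == {0: [0, 1], 1: [2, 3]}, matching "P0: {0,1}, P1: {2,3}"
```

The same structure serves for the matrix map by interpreting the owned indices as row blocks.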
13 Map Optimization: Output (Dense Matrix-Vector Product)
- LORE output: optimized maps for the parallel variables
- The new maps are used to redistribute the vector and matrix data
- The optimized matrix-vector product y = A x is then calculated with the new data distributions
14 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
15 LORE Usage 2: Performance Evaluation
LORE simulates user code (e.g., y = A x) on a specified architecture to produce performance evaluation metrics
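A performance evaluator of this kind can be approximated with a toy analytic model. The sketch below is an illustrative assumption, not LORE's actual simulator: it estimates the runtime of a block-row-distributed matrix-vector product from compute, memory, and network terms (`matvec_time` and its parameters are hypothetical names).

```python
# Toy analytic performance model (illustrative assumption, not LORE's
# simulator): per-processor compute + memory + network time for y = A x
# with the n x n matrix A distributed by block rows.
def matvec_time(n, n_procs, flop_rate, mem_bw, net_bw, net_latency):
    """Estimated seconds for a block-row-distributed matrix-vector product."""
    rows = n / n_procs                 # rows owned per processor
    flops = 2.0 * rows * n             # multiply-add per local matrix entry
    mem_bytes = 8.0 * rows * n         # stream the local matrix block once
    net_bytes = 8.0 * (n - rows)       # gather remote pieces of x
    return (flops / flop_rate
            + mem_bytes / mem_bw
            + net_latency + net_bytes / net_bw)

t1 = matvec_time(4096, 1, 1e9, 1e10, 1e9, 1e-6)
t16 = matvec_time(4096, 16, 1e9, 1e10, 1e9, 1e-6)
# More processors shrink the compute and memory terms but add communication.
```

Sweeping such a model over processor counts and network parameters is the flavor of trade-off study the slides describe, at far lower fidelity.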
16 Performance Evaluation: Input (Dense Matrix-Vector Product)
Input from application: user code for the dense matrix-vector product y = A x and an architecture model
[Pipeline: user code + architecture model → Parser → Mapper → Analyzer and Optimizer → MI code generator → MI code → performance evaluator/simulator → performance data]
MI = machine independent
17 Performance Evaluation: Representation (Dense Matrix-Vector Product)
The parser converts the user code into an abstract syntax tree (AST), which is input language/numerical library neutral
[AST: SB node → MV node → PVar: y, PVar: A, PVar: x]
18 Performance Evaluation: Mapping (Dense Matrix-Vector Product)
LORE computes a map for each parallel variable in the AST:
- PVar y, vector map: P0: {0,1}, P1: {2,3}
- PVar A, matrix map: P0: {0,1}, P1: {2,3}
- PVar x, vector map: P0: {0,1}, P1: {2,3}
19 Performance Evaluation: Machine Independent Code
The mapped AST and the architecture model are used to generate machine independent (MI) code:
- P0: read A00, x0, y0; y0 = A00 x0; send x0; read A01; y0 += A01 x1; write y0
- P1: read A11, x1, y1; y1 = A11 x1; send x1; read A10; y1 += A10 x0; write y1
20 Machine Independent Code
Machine independent code for the matrix-vector product (2 row blocks, 2 processors), as a flow graph of computation, memory access, and communication operations:
- P0: read A00, x0, y0 → y0 = A00 x0 → send x0 → read A01 → y0 += A01 x1 → write y0
- P1: read A11, x1, y1 → y1 = A11 x1 → send x1 → read A10 → y1 += A10 x0 → write y1
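The schedule above is a 2x2-blocked matrix-vector product: each processor multiplies its diagonal block by its local piece of x, exchanges x blocks, then accumulates the off-diagonal contribution. A minimal serial sketch of that schedule in plain Python (illustrative only; `blocked_matvec` is a hypothetical name):

```python
# Illustrative serial execution of the machine-independent schedule for a
# 2x2-blocked matrix-vector product (not LORE code).
def blocked_matvec(A, x):
    """y = A x following the MI schedule: diagonal blocks first,
    then off-diagonal accumulation after exchanging the x blocks."""
    def mv(B, v):  # dense block times vector
        return [sum(B[i][j] * v[j] for j in range(len(v)))
                for i in range(len(B))]
    (A00, A01), (A10, A11) = A
    x0, x1 = x
    y0 = mv(A00, x0)                              # P0: y0 = A00 x0 (local)
    y1 = mv(A11, x1)                              # P1: y1 = A11 x1 (local)
    # After "send x0" / "send x1", accumulate the off-diagonal blocks.
    y0 = [a + b for a, b in zip(y0, mv(A01, x1))]  # P0: y0 += A01 x1
    y1 = [a + b for a, b in zip(y1, mv(A10, x0))]  # P1: y1 += A10 x0
    return y0 + y1
```

On real hardware the two block rows run concurrently; the dependence structure (local block before remote block) is exactly the flow graph on the slide.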
21 Performance Evaluation: Output
The MI code is passed to the performance evaluator, which simulates the user code on the target architecture and produces performance data
22 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
23 POEM (Photonically Optimized Embedded Microprocessor)
- Photonic innovation: photonic switching technology, ring resonator technology, photonics layered on silicon
- Architecture design: architecture parameter characterization, proposed photonic architectures
- Application analysis: simulation, mapping, and optimization framework; processing chains for military-critical applications
- Fielded systems: handheld DNA sequencing device, UAV surveillance, handheld PDA field devices
POEM will bridge the gap between innovations in chip-scale photonics and fielded military-critical systems. It will offer a complete architecture design and analysis for numerous real-world military-critical applications.
24 LORE's Role in the POEM Program
LORE provides POEM's simulation, mapping, and optimization framework for application analysis: it is used to study chip-scale photonics and its impact on applications
25 LORE and POEM: Applications
- LORE provides a framework for analyzing POEM applications (SAR, secure communication, cyber, database operations)
- LORE supports many key numerical kernels (FFT, sparse matrix-vector product, vector updates, etc.); applications are supported through composition of these kernels
- Easy to extend to analyze new applications
- Initially analyzing a synthetic aperture radar (SAR) application
LORE enables the analysis of many applications on many different architectures (existing and proposed)
26 LORE and POEM: Architectures
- LORE supports simulation of applications on POEM architectures (e.g., electronic mesh and photonic bus)
- Framework for simulating user code: the LORE simulator for understanding big-picture trends, plus interfaces to third-party simulators (e.g., PhoenixSim) for higher-fidelity performance data
LORE enables the analysis of many applications on many different architectures (existing and proposed)
27 LORE and POEM: Optimization of Maps
- A good application-data-to-processor mapping is crucial to achieving peak parallel performance on target machines
- Not difficult for SAR applications (simple maps are sufficient)
- Challenging for applications with irregular communication (DNA sequence analysis and sparse matrix computations)
LORE provides automatic map optimization
28 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
29 LORE and Performance Evaluation
LORE used to produce performance evaluation data (time in seconds, GFLOPS) for the 2D FFT, an important kernel in the SAR processing chain
[Pipeline: user code + architectural model → Parser → Mapper → Analyzer and Optimizer → MI code generator → MI code → performance evaluator/simulator → performance data]
30 POEM Architecture Models
- Two architectures modeled: electronic mesh and photonic bus, each with 16 processors/cores (P1-P16, with local memory) and 4 shared memories (M1-M4)
- Processors and shared memory are the same in both models
- Network parameters (latency, bandwidth) set to allow an apples-to-apples comparison between the networks
The LORE framework allows direct comparison of the electronic mesh and photonic bus architectures
31 Preliminary POEM Simulations
Initially simulated the 2D FFT kernel, an important kernel for SAR applications:
- Electronic mesh (EM): FFT, CT, FFT. A 1D FFT for each row of the dense matrix; then a corner turn (CT: read, transposed write) for data locality of the second FFT; then a 1D FFT for each column.
- Photonic bus (PB): FFT, SCA, FFT. The corner turn is replaced by a synchronous coalesced access (SCA) write.
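The row-FFT/corner-turn/row-FFT decomposition can be checked in a few lines of Python. The sketch below uses a naive 1D DFT as an illustrative stand-in for a fast FFT and a transpose for the corner turn; the final transpose restores the conventional output orientation (function names are hypothetical):

```python
import cmath

def dft(v):
    """Naive 1D DFT (illustrative stand-in for a fast FFT)."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * k * j / n)
                for j in range(n)) for k in range(n)]

def transpose(m):
    """Corner turn: matrix row-to-column reorganization."""
    return [list(row) for row in zip(*m)]

def fft2d(m):
    """2D DFT as: 1D DFT of each row, corner turn, 1D DFT of each
    (former) column; a last transpose restores row-major orientation."""
    step1 = [dft(row) for row in m]   # FFT of each row
    ct = transpose(step1)             # corner turn
    step2 = [dft(row) for row in ct]  # FFT of each column of step1
    return transpose(step2)
```

The corner turn is pure data reorganization, which is why replacing it with a cheaper SCA write leaves the numerical result unchanged while removing a communication bottleneck.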
32 Photonic Synchronous Coalesced Access Network (PSCAN)
- PSCAN leverages the differences between electronic and photonic interconnect to achieve large efficiency gains in critical operations
- Utilizes distance independence to rapidly reorganize spatially separate data; large gains in efficiency even when bandwidth is equalized
- Synthesizes a matrix row-to-column transpose/write into a single SCA-write transaction by interleaving data from spatially separate data producers
- A novel ISA construct enabled by a highly synchronous photonic waveguide
[Timing diagram: processors DRIVE0-DRIVE3 interleave data onto the waveguide against the photonic clock (P-CLK); address info driven to DRAM]
Credit: Dave Whelihan, MITLL
33 Preliminary LORE Results
[Plot: GOPS and % of ideal for the 2D FFT operation vs. number of cores, for electronic, PSCAN, and ideal]
Experiment:
- Number of memory controllers: 4
- Matrix size: 4096 x 4096
- Bandwidth to memory equalized in the photonic and electronic models
- Traditional 2D FFT with block transpose run on the electronic mesh; 2D FFT with SCA run on PSCAN
- Number of processors scaled to examine the performance of photonics and electronics in an increasingly work-starved environment
The PSCAN architecture with SCA yields significant speed-up over the electronic mesh architecture using block transpose
34 Data Reorganization Bottleneck in Modern Manycore Systems
[Plot: percentage of time spent reorganizing data for the 2D FFT (block transpose vs. SCA) vs. number of cores]
- As the number of cores in modern CMPs grows, data reorganization becomes an increasing performance bottleneck: performance is limited by the inability to keep processors supplied with data
- The SCA operation performs the data reorganization step of the 2D FFT more efficiently as the number of processors grows, alleviating the traditional block transpose bottleneck
35 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
36 LORE Next Steps
- Extend the architecture model and LORE simulator: support the memory hierarchy; model network contention more accurately
- Support for external simulators, e.g., PhoenixSim (Columbia), SST (Sandia); needed for higher-fidelity simulations
- Better power modeling (e.g., dynamic power)
- Additional parallel numerical library/language support: additional languages (Matlab, Python), libraries (e.g., VSIPL++/PVTOL), and kernels (e.g., sparse matrix operations)
- Code generator and runtime engine for execution/emulation on target architectures
37 Summary
- LORE: parallel framework/environment for optimizing data-to-processor mapping for parallel applications and for simulating and optimizing new (and existing) architectures
- LORE used to compare photonic and electronic architectures of interest to the POEM project: supports many applications (through composition of kernels) and different architectures
- Preliminary results for the 2D FFT kernel support the thesis that photonics can improve performance for these applications; the SCA instruction mitigates the performance impact of the corner turn in the 2D FFT operation
LORE allows exploration of the architecture design space in the context of real application constraints
38 Acknowledgements
LORE: Michelle Beard, Anna Klein, Sanjeev Mohindra, Julie Mullen, Eric Robinson, Nadya Bliss (manager), Minna Song (MIT student intern)
POEM: Dave Whelihan, Jeff Hughes, Scott Sawyer
More informationCHERI A Hybrid Capability-System Architecture for Scalable Software Compartmentalization
CHERI A Hybrid Capability-System Architecture for Scalable Software Compartmentalization Robert N.M. Watson *, Jonathan Woodruff *, Peter G. Neumann, Simon W. Moore *, Jonathan Anderson, David Chisnall
More informationHigh-Performance Linear Algebra Processor using FPGA
High-Performance Linear Algebra Processor using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract With recent advances in FPGA (Field Programmable Gate Array) technology it is now feasible
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer
More informationRadar Tomography of Moving Targets
On behalf of Sensors Directorate, Air Force Research Laboratory Final Report September 005 S. L. Coetzee, C. J. Baker, H. D. Griffiths University College London REPORT DOCUMENTATION PAGE Form Approved
More informationWhite Paper Assessing FPGA DSP Benchmarks at 40 nm
White Paper Assessing FPGA DSP Benchmarks at 40 nm Introduction Benchmarking the performance of algorithms, devices, and programming methodologies is a well-worn topic among developers and research of
More informationScientific Applications. Chao Sun
Large Scale Multiprocessors And Scientific Applications Zhou Li Chao Sun Contents Introduction Interprocessor Communication: The Critical Performance Issue Characteristics of Scientific Applications Synchronization:
More informationECE468 Computer Organization and Architecture. Memory Hierarchy
ECE468 Computer Organization and Architecture Hierarchy ECE468 memory.1 The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Control Input Datapath Output Today s Topic:
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationWhy Multiprocessors?
Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software
More informationBeyond the PDP-11: Architectural support for a memory-safe C abstract machine
Trustworthy Systems Research and CTSRDCRASH-worthy Development Beyond the PDP-11: Architectural support for a memory-safe C abstract machine David Chisnall, Colin Rothwell, Brooks Davis, Robert N.M. Watson,
More informationOrganic Computing DISCLAIMER
Organic Computing DISCLAIMER The views, opinions, and/or findings contained in this article are those of the author(s) and should not be interpreted as representing the official policies, either expressed
More informationUNCLASSIFIED. R-1 ITEM NOMENCLATURE PE D8Z: Data to Decisions Advanced Technology FY 2012 OCO
Exhibit R-2, RDT&E Budget Item Justification: PB 2012 Office of Secretary Of Defense DATE: February 2011 BA 3: Advanced Development (ATD) COST ($ in Millions) FY 2010 FY 2011 Base OCO Total FY 2013 FY
More informationEE482C, L1, Apr 4, 2002 Copyright (C) by William J. Dally, All Rights Reserved. Today s Class Meeting. EE482S Lecture 1 Stream Processor Architecture
1 Today s Class Meeting EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J Dally Computer Systems Laboratory Stanford University billd@cslstanfordedu What is EE482C? Material covered
More informationEE482S Lecture 1 Stream Processor Architecture
EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J. Dally Computer Systems Laboratory Stanford University billd@csl.stanford.edu 1 Today s Class Meeting What is EE482C? Material covered
More informationGraphBLAS: A Programming Specification for Graph Analysis
GraphBLAS: A Programming Specification for Graph Analysis Scott McMillan, PhD GraphBLAS: A Programming A Specification for Graph Analysis 2016 October Carnegie 26, Mellon 2016University 1 Copyright 2016
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationConstruction Change Order analysis CPSC 533C Analysis Project
Construction Change Order analysis CPSC 533C Analysis Project Presented by Chiu, Chao-Ying Department of Civil Engineering University of British Columbia Problems of Using Construction Data Hybrid of physical
More informationVirtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar
Virtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar David Bueno, Adam Leko, Chris Conger, Ian Troxel, and Alan D. George HCS Research Laboratory College
More informationTRACT: Threat Rating and Assessment Collaboration Tool
TRACT: Threat Rating and Assessment Collaboration Tool Robert Hollinger and Doran Smestad Advised by: George Heineman (WPI), Philip Marquardt (MIT/LL) Worcester Polytechnic Institute Major Qualifying Project
More informationCan FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.
Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:
More informationVirtuoso: Narrowing the Semantic Gap in Virtual Machine Introspection
Virtuoso: Narrowing the Semantic Gap in Virtual Machine Introspection Brendan Dolan-Gavitt *, Tim Leek, Michael Zhivich, Jonathon Giffin *, and Wenke Lee * * Georgia Institute of Technology MIT Lincoln
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD
More informationCenter for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
More informationDesign methodology for multi processor systems design on regular platforms
Design methodology for multi processor systems design on regular platforms Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline
More informationCovert and Anomalous Network Discovery and Detection (CANDiD)
Covert and Anomalous Network Discovery and Detection (CANDiD) Rajmonda S. Caceres, Edo Airoldi, Garrett Bernstein, Edward Kao, Benjamin A. Miller, Raj Rao Nadakuditi, Matthew C. Schmidt, Kenneth Senne,
More informationMulti-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M
More informationVisualizing Attack Graphs, Reachability, and Trust Relationships with NAVIGATOR*
Visualizing Attack Graphs, Reachability, and Trust Relationships with NAVIGATOR* Matthew Chu, Kyle Ingols, Richard Lippmann, Seth Webster, Stephen Boyer 14 September 2010 9/14/2010-1 *This work is sponsored
More informationMulti-core Computing Lecture 2
Multi-core Computing Lecture 2 MADALGO Summer School 2012 Algorithms for Modern Parallel and Distributed Models Phillip B. Gibbons Intel Labs Pittsburgh August 21, 2012 Multi-core Computing Lectures: Progress-to-date
More informationInterconnection Networks
Lecture 18: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Credit: many of these slides were created by Michael Papamichael This lecture is partially
More informationChallenges and Advances in Parallel Sparse Matrix-Matrix Multiplication
Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Aydin Buluc John R. Gilbert University of California, Santa Barbara ICPP 2008 September 11, 2008 1 Support: DOE Office of Science,
More informationLUNAR TEMPERATURE CALCULATIONS ON A GPU
LUNAR TEMPERATURE CALCULATIONS ON A GPU Kyle M. Berney Department of Information & Computer Sciences Department of Mathematics University of Hawai i at Mānoa Honolulu, HI 96822 ABSTRACT Lunar surface temperature
More informationSpiral. Computer Generation of Performance Libraries. José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team. Performance.
Spiral Computer Generation of Performance Libraries José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team Platforms Performance Applications What is Spiral? Traditionally Spiral Approach Spiral
More informationCompilers and Compiler-based Tools for HPC
Compilers and Compiler-based Tools for HPC John Mellor-Crummey Department of Computer Science Rice University http://lacsi.rice.edu/review/2004/slides/compilers-tools.pdf High Performance Computing Algorithms
More informationParallel Matlab: RTExpress on 64-bit SGI Altix with SCSL and MPT
Parallel Matlab: RTExpress on -bit SGI Altix with SCSL and MPT Cosmo Castellano Integrated Sensors, Inc. Phone: 31-79-1377, x3 Email Address: castellano@sensors.com Abstract By late, RTExpress, a compiler
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationIntroducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method
Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method G. Wellein, T. Zeiser, G. Hager HPC Services Regional Computing Center A. Nitsure, K. Iglberger, U. Rüde Chair for System
More information