EA-Based Generation of Compiler Heuristics for Polymorphous Computing Architectures


Laurence D. Merkle, Michael C. McClurg, Matt G. Ellis, Tyler G. Hicks-Wright
Department of Computer Science and Software Engineering
Rose-Hulman Institute of Technology
5500 Wabash Ave, CM-103, Terre Haute, IN
merkle@rose-hulman.edu

This work was supported by the Advanced Computing Architectures Branch, Information Directorate, Air Force Research Laboratory, Air Force Materiel Command, USAF, under grant FA8750-05-1-0019.

MSAEC, July 8, 2006, Seattle, WA

Outline
- Research Objective
- PCA Background
  - DARPA PCA Program
  - MIT Raw Project
  - UT Austin TRIPS Project
- Technical Progress
  - Parallel EAs for PCAs
  - Taxonomy of EAs in Compilers
  - EA Generation of TRIPS Compiler Heuristics
- Future Directions

Research Objective
Enhance the Raw and TRIPS compilers to produce more efficient executable code.
Approach:
- Adopt an optimization view of compilation
- Hybridize evolutionary computation (EC) techniques with existing compiler algorithms

PCA Objectives
GOAL: PCA technology supports agile processing as a function of dynamic mission objectives and constraints.

Processing Diversity
1. Breadth of processing (four classes: signal/data processing, information, knowledge, intelligence)
2. Uniform performance (competitive with best-in-class)

Mission Agility
1. In mission (retargetable): msec
2. Mission to mission: days
3. New threat scenario: months

Architectural Flexibility
1. High-efficiency selectable virtual machines (two+ major morph states, N minor morph states)
2. Mission adaptation (constraints, e.g. energy)
3. Portability/evolvability (two-layer morphware)

[Figure: PCA morph space, plotting performance against breadth of the architecture space, with selectable virtual machines spanning specialized, DSP, PPC, and server classes]

Tile-based Architectures
Context:
- Ongoing improvements in manufacturing technology
- Wire delays becoming more significant relative to gate delays
Leading PCA efforts achieve dynamic responsiveness and scalability through the use of tile-based architectures.

PCA Hardware Overview
- ISI/Raytheon/Mercury: MONARCH/MCHIP
- MIT: Raw (early prototype)
- Stanford: Smart Memories
- MIT/LL: early testbed
- University of Texas/IBM: TRIPS (prototype)

Raw Architecture (Agarwal, et al.)
- Raw Architecture Workstation (Raw) fully exposes low-level details of the architecture to the compiler
- Allows the compiler or software to optimize resource allocation for each application
- Compiler generates traditional machine instructions plus switch instructions for each tile
- Two orders of magnitude better performance than traditional processors in simulations of certain applications

The Raw Architecture (presented at PCA PI Meeting, 18 Aug 04, Monterey, CA)
- Divide the silicon into an array of identical, programmable tiles
- A signal can get through a small amount of logic and to the next tile in one cycle
- Tiles connected by a software-exposed on-chip interconnect, the Scalar Operand Network [HPCA 03]
[Figure: tile detail showing a compute processor with fetch unit and data cache, plus a static router with its own fetch unit]

TRIPS Architecture (Burger, et al.)
Grid Processor Architectures
- Composed of a tightly coupled array of ALUs connected by a thin network
- Producer instruction outputs are delivered directly as consumer instruction inputs
- An example of an Explicit Data Graph Execution (EDGE) architecture
Tera-op Reliable and Intelligently Adaptive Processing System (TRIPS)
- One or more grid processors working in parallel
- A sensor network monitors application behavior and feeds back to the runtime system, application, and compiler

TRIPS Chip Floorplan
[Figure: TRIPS chip floorplan; presented at PCA PI Meeting, 18 Aug 04, Monterey, CA]

Compiling for the TRIPS Architecture (Burger, et al.)
A difficult multicriteria optimization problem. The compiler must be able to:
- Identify basic blocks
- Partition basic blocks into hyperblocks
- Map hyperblocks to tiles
- Map each operation to an execution unit
Spatial scheduling affects both concurrency and communication delays. The compiler currently employs a greedy approximation algorithm (see the sketch below).
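
The slides do not spell out the greedy algorithm; the following is a minimal sketch of the kind of greedy spatial scheduler described, under assumed simplifications (operations arrive in dependence order, one operation per ALU, and Manhattan distance as a stand-in for operand routing delay). All names here are hypothetical, not the TRIPS scheduler's own.

    /* Sketch of a greedy spatial scheduler: place each operation on the
     * free ALU that minimizes total routing distance to its producers.
     * All names and the cost model are hypothetical simplifications. */
    #include <stdio.h>
    #include <stdlib.h>

    #define GRID_DIM   4          /* 4x4 grid of ALUs, as in the TRIPS prototype */
    #define MAX_INPUTS 2

    typedef struct {
        int num_inputs;
        int input_op[MAX_INPUTS]; /* indices of producer operations */
        int row, col;             /* assigned ALU, filled in by the scheduler */
    } Op;

    /* Manhattan distance approximates operand routing delay on the grid. */
    static int route_cost(const Op *producer, int row, int col)
    {
        return abs(producer->row - row) + abs(producer->col - col);
    }

    /* Greedily place each operation, in dependence order. */
    static void greedy_schedule(Op *ops, int num_ops)
    {
        int used[GRID_DIM][GRID_DIM] = {{0}};

        for (int i = 0; i < num_ops; i++) {
            int best_r = -1, best_c = -1, best_cost = 1 << 30;

            for (int r = 0; r < GRID_DIM; r++) {
                for (int c = 0; c < GRID_DIM; c++) {
                    if (used[r][c])
                        continue;
                    int cost = 0;
                    for (int k = 0; k < ops[i].num_inputs; k++)
                        cost += route_cost(&ops[ops[i].input_op[k]], r, c);
                    if (cost < best_cost) {
                        best_cost = cost;
                        best_r = r;
                        best_c = c;
                    }
                }
            }
            ops[i].row = best_r;
            ops[i].col = best_c;
            used[best_r][best_c] = 1;
        }
    }

    int main(void)
    {
        /* Tiny dataflow graph: op2 consumes the outputs of op0 and op1. */
        Op ops[3] = {
            { 0, {0, 0}, -1, -1 },
            { 0, {0, 0}, -1, -1 },
            { 2, {0, 1}, -1, -1 },
        };
        greedy_schedule(ops, 3);
        for (int i = 0; i < 3; i++)
            printf("op%d -> ALU(%d,%d)\n", i, ops[i].row, ops[i].col);
        return 0;
    }

Each operation is simply placed on the free ALU closest to its producers, which illustrates the multicriteria tension on the slide: packing consumers near producers shortens routes but can serialize work onto neighboring ALUs, trading concurrency against communication delay.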

Technical Progress: Parallel EAs for PCAs
Enabling steps:
- Toolchains obtained from developers
- Correct installations verified
Parallel evolutionary algorithms:
- Island-model implementations designed and implemented for both architectures (sketched below)
- Empirical evaluations in progress
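
As a rough illustration of the island model named above: independent subpopulations evolve in isolation and periodically migrate their best individual to a neighbor. The genome, fitness function, and all parameters below are placeholders; in the actual work, fitness would come from compiling and running code through the Raw or TRIPS toolchain.

    /* Minimal island-model EA sketch: several subpopulations evolve
     * independently and periodically send their best individual to the
     * next island in a ring.  Fitness and parameters are placeholders. */
    #include <stdio.h>
    #include <stdlib.h>

    #define ISLANDS 4
    #define POP     20
    #define GENES   8
    #define GENS    100
    #define MIGRATE 10   /* migration interval, in generations */

    typedef struct { double gene[GENES]; double fitness; } Ind;

    /* Placeholder objective; the real system would compile and time a
     * benchmark here. Maximized, with its optimum at the origin. */
    static double evaluate(const Ind *x) {
        double s = 0.0;
        for (int g = 0; g < GENES; g++) s -= x->gene[g] * x->gene[g];
        return s;
    }

    static int best_of(const Ind *pop) {
        int b = 0;
        for (int i = 1; i < POP; i++)
            if (pop[i].fitness > pop[b].fitness) b = i;
        return b;
    }

    int main(void) {
        static Ind isl[ISLANDS][POP];

        for (int i = 0; i < ISLANDS; i++)
            for (int j = 0; j < POP; j++) {
                for (int g = 0; g < GENES; g++)
                    isl[i][j].gene[g] = 2.0 * rand() / RAND_MAX - 1.0;
                isl[i][j].fitness = evaluate(&isl[i][j]);
            }

        for (int gen = 1; gen <= GENS; gen++) {
            /* Each island: a toy mutate-and-replace step on a copy of
             * its current best individual. */
            for (int i = 0; i < ISLANDS; i++) {
                Ind child = isl[i][best_of(isl[i])];
                int g = rand() % GENES;
                child.gene[g] += 0.1 * (2.0 * rand() / RAND_MAX - 1.0);
                child.fitness = evaluate(&child);
                int victim = rand() % POP;
                if (child.fitness > isl[i][victim].fitness)
                    isl[i][victim] = child;
            }
            /* Ring migration: island i sends its best to island i+1. */
            if (gen % MIGRATE == 0)
                for (int i = 0; i < ISLANDS; i++)
                    isl[(i + 1) % ISLANDS][rand() % POP] =
                        isl[i][best_of(isl[i])];
        }

        for (int i = 0; i < ISLANDS; i++)
            printf("island %d best fitness: %f\n",
                   i, isl[i][best_of(isl[i])].fitness);
        return 0;
    }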

Technical Progress: Taxonomy of EAs in Compilers
- Identified and described opportunities for additional compiler optimizations
- Developed a general classification of methods for applying EC to compilation
- Raw: there is no automatic scheduling to optimize; the programmer must partition the source code, so EC is no more and no less applicable to Raw than to any MIPS compiler
- TRIPS: the compiler partitions instructions into hyperblocks

Classification of Methods for Application of EC to Compilation

Method              | Effectiveness (application execution time)     | Efficiency (compiler execution time)                  | Reproducibility (w/o controlling random number generator)
--------------------|------------------------------------------------|-------------------------------------------------------|----------------------------------------------------------
Compiler algorithms | Unconstrained search for best overall compiler | Can be explicitly considered as an objective function | Compilation
Compiler parameters | Regular and nearly unconstrained search space  | Can be explicitly considered as an objective function | Compilation
Compile time        | Unconstrained search for fastest executable    | Too slow for use in a development environment         | Execution
Schedule time       | Dynamic search for fastest execution           | Essentially unchanged                                 | None

Technical Progress: EA Generation of TRIPS Compiler Optimization Heuristics
EA-based generation of more effective TRIPS compiler heuristics through integration of:
- Finch meta-optimization framework
- TRIPS C compiler
Metrics (see the measurement sketch below):
- Hyperblock count
- Instructions per hyperblock
- Ratio of hyperblocks to instructions per hyperblock
- Static vs. dynamic variants
Benchmarks:
- SPEC2000 benchmarks
- PCA community benchmarks
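
For the static variants of these metrics, hyperblock structure can be read directly off the TRIPS assembly the compiler emits, since each hyperblock is bracketed by .bbegin/.bend directives (see the example code later in these slides). A minimal measurement sketch, assuming one instruction per non-empty line between the markers; the project's actual metric tooling is not shown in the slides:

    /* Count hyperblocks and instructions in a TRIPS .s file by scanning
     * for .bbegin/.bend markers; non-empty lines between them are
     * counted as instructions.  A rough static approximation only. */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file.s\n", argv[0]);
            return 1;
        }
        FILE *f = fopen(argv[1], "r");
        if (!f) { perror(argv[1]); return 1; }

        char line[512];
        long blocks = 0, insts = 0;
        int in_block = 0;

        while (fgets(line, sizeof line, f)) {
            char *s = line + strspn(line, " \t");   /* skip indentation */
            if (strncmp(s, ".bbegin", 7) == 0) { blocks++; in_block = 1; }
            else if (strncmp(s, ".bend", 5) == 0) in_block = 0;
            else if (in_block && *s != '\n' && *s != '\0') insts++;
        }
        fclose(f);

        printf("hyperblocks:                 %ld\n", blocks);
        if (blocks > 0)
            printf("instructions per hyperblock: %.2f\n",
                   (double)insts / blocks);
        return 0;
    }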

Finch Meta-optimization Framework (Stephenson, et al.)
- The compiler heuristic to be optimized is replaced by indirect invocation of an EA-provided heuristic (see the sketch below)
- The EA searches the space of heuristics that have the required signature
- Candidate heuristics are evaluated by invoking the compiler, then using an objective function to evaluate the compiler output
- The EA executes on the master processor; the compiler and objective function execute on slave processors
[Data-flow diagram: Finch (evolutionary algorithm) passes compiler flags (e.g. -inl 10) and a compiler component (e.g. an in-lining priority function) to the compiler (e.g. Scale), which compiles benchmarks (e.g. GMTI); the compiler output (e.g. assembly code) is scored by the objective function (e.g. instructions per hyperblock) and fed back to Finch]
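
The central mechanism here is indirection: the compiler's built-in heuristic is replaced by a call through a hook that the EA can rebind to any candidate with the required signature. A minimal sketch of that pattern, using a hypothetical in-lining priority function as the heuristic; it does not reproduce Finch's or Scale's actual interfaces:

    /* Sketch of replacing a compiler heuristic with an indirect call to
     * an EA-provided candidate.  Names and signatures are hypothetical. */
    #include <stdio.h>

    /* Signature every candidate must match: score a call site. */
    typedef double (*inline_priority_fn)(int callee_size, int call_count,
                                         int depth);

    /* The compiler's original, hand-written heuristic. */
    static double default_priority(int callee_size, int call_count, int depth) {
        return (double)call_count / (callee_size * (depth + 1));
    }

    /* Hook through which the compiler invokes the heuristic.  A
     * Finch-style framework rebinds this to each evolved candidate,
     * recompiles the benchmark, and scores the output. */
    static inline_priority_fn priority_hook = default_priority;

    /* Stand-in for an EA-generated candidate: anything with the right
     * signature can be plugged in. */
    static double evolved_candidate(int callee_size, int call_count, int depth) {
        return 2.0 * call_count - 0.5 * callee_size - depth;
    }

    static void compile_with_heuristic(const char *name) {
        /* The compiler consults the hook wherever it used to call its
         * built-in heuristic. */
        printf("%s: priority(40, 12, 2) = %f\n", name,
               priority_hook(40, 12, 2));
    }

    int main(void) {
        compile_with_heuristic("default heuristic");
        priority_hook = evolved_candidate;   /* EA swaps in a candidate */
        compile_with_heuristic("evolved heuristic");
        return 0;
    }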

TRIPS C Compiler
- Scalable Compiler for Analytical Experiments (Scale), developed by the Department of Computer Science, University of Massachusetts, Amherst
- TRIPS compilation implemented by shell script wrappers around the Scale compiler
- Modular compiler, easily modified for different output platforms
- Emphasis on hyperblock formation for TRIPS and other platforms

[Figure: Scale compiler data flow diagram]

Example Code

C source (test.tcc.c):

    #include <stdio.h>

    int main(int argc, char** argv) {
        printf("hello world\n");
        return 0;
    }

Generated TRIPS assembly:

    .app-file "test.tcc.c"
    .data
    .align 8
    _V2: .ascii "Hello world\n\000"
    .text
    .global main
    .bbegin main
        read   $t0, $g1
        read   $t1, $g2
        addi   $t2, $t0, -96
        sd     -88($t0), $t1 S[0]
        entera $t3, _V2
        enterb $t4, main$1
        callo  printf
        write  $g1, $t2
        write  $g2, $t4
        write  $g3, $t3
    .bend
    .bbegin main$1
        read   $t0, $g1
        movi   $t1, 0
        ld     $t2, 8($t0) L[0]
        addi   $t3, $t0, 96
        ret    $t2
        write  $g1, $t3
        write  $g2, $t2
        write  $g3, $t1
    .bend

Application of Finch to Scale
- Ported Scale to the Finch environment
- Converted Finch to MPI-based communication (see the sketch below)
- Identified candidate heuristic functions:
  - (Re-)construction of the abstract syntax tree
  - Construction of the control flow graph
  - Standard post-translation optimizations
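
A minimal sketch of the MPI-based master/slave arrangement described two slides earlier, with rank 0 running the EA and the remaining ranks standing in for compile-and-score workers. The genome encoding, message protocol, and fitness stub are all assumptions, not Finch's actual interface:

    /* Sketch of MPI master/slave candidate evaluation: rank 0 farms
     * candidate heuristics out to slaves, which in the real system would
     * invoke the Scale compiler and an objective function. */
    #include <mpi.h>
    #include <stdio.h>

    #define GENES      8
    #define CANDIDATES 32
    #define TAG_WORK   1
    #define TAG_DONE   2

    /* Slave-side stand-in for "compile benchmark, measure objective". */
    static double evaluate(const double *genome) {
        double s = 0.0;
        for (int g = 0; g < GENES; g++) s += genome[g];
        return s;
    }

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            fprintf(stderr, "needs at least one slave rank\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        if (rank == 0) {                       /* master: the EA */
            double genome[GENES], fitness;
            for (int c = 0; c < CANDIDATES; c++) {
                for (int g = 0; g < GENES; g++)
                    genome[g] = (double)(c + g);   /* stand-in for EA output */
                int slave = 1 + c % (size - 1);
                MPI_Send(genome, GENES, MPI_DOUBLE, slave, TAG_WORK,
                         MPI_COMM_WORLD);
                MPI_Recv(&fitness, 1, MPI_DOUBLE, slave, TAG_DONE,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("candidate %d: fitness %f\n", c, fitness);
            }
            for (int s = 1; s < size; s++)   /* zero-length shutdown msg */
                MPI_Send(NULL, 0, MPI_DOUBLE, s, 0, MPI_COMM_WORLD);
        } else {                               /* slaves: compile + score */
            double genome[GENES], fitness;
            MPI_Status st;
            for (;;) {
                MPI_Recv(genome, GENES, MPI_DOUBLE, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG != TAG_WORK) break;
                fitness = evaluate(genome);
                MPI_Send(&fitness, 1, MPI_DOUBLE, 0, TAG_DONE,
                         MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }

Run with, e.g., mpirun -np 4 for one master and three workers. For clarity the master here waits for each result before dispatching the next candidate; a real implementation would keep all slaves busy concurrently.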

In-Lining Technique Results: GMTI Benchmark
[Plot: instructions per hyperblock (0 to 20) vs. ln(max code bloat) (0 to 7)]

References

Agarwal, A. Raw Computation. Scientific American, August 1999.

Burger, D., S. Keckler, K. McKinley, M. Dahlin, L. John, C. Lin, C. Moore, J. Burrill, R. McDonald, W. Yoder, and the TRIPS Team. Scaling to the End of Silicon with EDGE Architectures. IEEE Computer, July 2004, pp. 44-55.

Goldberg, D. The Design of Innovation. Kluwer Academic Publishers, Boston, 2002.

Stephenson, M., S. Amarasinghe, M. Martin, and U.-M. O'Reilly. Meta Optimization: Improving Compiler Heuristics with Machine Learning. Proceedings of PLDI, 2003. MIT.

Scale Compiler Group, Department of Computer Science, University of Massachusetts, Amherst. Scale project website, http://osl-www.cs.umass.edu/scale.

Taylor, M., J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs. IEEE Micro, Mar/Apr 2002.