ThinLTO. A Fine-Grained Demand-Driven Infrastructure. Teresa Johnson, Xinliang David Li


ThinLTO: A Fine-Grained Demand-Driven Infrastructure. Teresa Johnson, Xinliang David Li ({tejohnson,davidxl}@google.com)

Outline
- CMO Background
- ThinLTO Motivation and Overview
- ThinLTO Details
- Build System Integration
- LLVM Prototype Status
- Preliminary Experimental Results
- Next Steps

Cross-Module Optimization (CMO) Background
Monolithic LTO (Link-Time Optimization): [diagram: x.c/y.c/z.c -> x.ir/y.ir/z.ir -> Cross Module Optimizations and Real Linking (lto.o) -> a.out]
- Frontend parsing and IR generation are all done in parallel
- Cross-module optimization is done at link time via a linker plugin
- The CMO pass consumes lots of memory and is generally not parallel
- CMO is done in the linker process
- Doesn't scale to very large applications
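For concreteness, a monolithic LTO build with Clang looks roughly like the following sketch (the flags are the standard ones; the linker-plugin setup details vary by toolchain):

    # Frontends run in parallel, each emitting LLVM bitcode instead of a native object
    clang -O2 -flto -c x.c -o x.o
    clang -O2 -flto -c y.c -o y.o
    clang -O2 -flto -c z.c -o z.o

    # All cross-module optimization then happens serially inside this one link step
    clang -O2 -flto x.o y.o z.o -o a.out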

CMO Background (cont.)
Monolithic LTO with Parallel BE: [diagram: x.c/y.c/z.c -> x.ir/y.ir/z.ir -> Cross Module Optimizations and Real Linking, then parallel BE jobs -> lto.o -> a.out]
- CMO is still performed in serial
- Intramodule optimization and code generation done in parallel (thread or process level)
- Example: HP-UX Itanium compiler (SYZYGY framework)

CMO Background (cont.)
LTO with WHOPR (gcc): [diagram: x.c/y.c/z.c -> x.ir/y.ir/z.ir -> IPA, partition, IR rewrite, BE invocation and real link -> temp1.ir/temp2.ir -> parallel BEs -> temp1.o/temp2.o -> a.out]
- Frontend parsing and IR generation are all done in parallel
- Backend compilations are done in parallel
- Inline decisions/analysis made in serial IPA; inline transformations done in parallel within partitions during backend compilations
- Requires materializing clones into partitions, increasing serial IR I/O overhead
- Summary-based IPA is done in serial
- Partitioning is done in serial
- IR reading and rewriting are done in serial
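GCC's WHOPR mode is driven by the same -flto option; a rough sketch (the job count is illustrative):

    # Frontends emit GIMPLE bytecode in parallel
    gcc -O2 -flto -c x.c y.c z.c

    # WHOPR link: serial summary-based IPA and partitioning, then
    # up to 8 parallel backend (LTRANS) compilations of the partitions
    gcc -O2 -flto=8 x.o y.o z.o -o a.out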

Parallel CMO: What if we can...
Fully Parallel CMO with Module Importing: [diagram: x.c/y.c/z.c -> importing + BE, in parallel per module -> x.o/y.o/z.o -> linking -> a.out]
- End-to-end compilation in parallel
- Cross-module optimization is enabled for each compilation
- Each module is extended with other modules as if they were from headers
- (Note: most of the performance comes from cross-module inlining)

Lightweight IPO (LIPO) approach
Fully Parallel CMO in LIPO mode (gcc/google): [diagram: profile data + x.c/y.c/z.c -> parse with auxiliary modules + BE, in parallel per module -> x.o/y.o/z.o -> linking -> a.out]
- Source importing is pre-computed (using the dynamic call graph from profile data)
- Source compilations are extended with auxiliary modules during parsing
- Module groups are usually small or capped
- Captures most of the full LTO gain
- Compilations are fully parallel!

Problems with LIPO mode
- Needs profile data to compute module groups
- Importing is done at module level, which limits scalability and performance headroom
- Duplicated parsing/compilation of auxiliary modules
- Doesn't tolerate stale profiles

Fully Parallel CMO: A New Model
[diagram: x.c/y.c/z.c -> x.ir/y.ir/z.ir -> importing + BE, in parallel per module -> x.o/y.o/z.o -> linking -> a.out]
- Delay importing until IR files have been generated
- Allows fine-grained importing at function level, greatly increasing the number of useful functions that can be imported -- liberating more performance
- No more duplicated parsing
- (But how do the parallel backends synchronize & communicate?)

ThinLTO: An Implementation of the New Model
[pipeline diagram: x.c/y.c/z.c (+ optional Profile Data) -> x.ir+func / y.ir+func / z.ir+func -> Super Thin Linker Plugin + Linker (Func Map, optional Global Analysis Summary) -> x.o/y.o/z.o -> a.out]
- IR includes summary data and function position (can be in the ELF symtab)
- To enable demand-driven importing in backend compilation, the ThinLTO plugin simply aggregates a global function map
- Enables the backend to do importing at function granularity, minimizing memory footprint and I/O/networking overhead
- Function importing is based on the function summary, optional global analysis summary, and profile data, with a priority queue to maximize benefit

ThinLTO: An Implementation of the New Model (cont.)
- The ThinLTO plugin does very minimal work: no IPA by default; no IR reading, partitioning, or IR rewriting, so minimal I/O
- It can scale to programs of any size, allowing builds on machines with tiny memory footprints and without significantly increasing build time (requirements for enabling by default)
- For a single-node ThinLTO build, the parallel BE processes will be launched in the linker process by the plugin

ThinLTO Advantages
- Highly scalable: the thin plugin layer does not require large memory and is extremely fast; fully parallelizable backend
- Transparent: similar to classic LTO, via linker plugin; doesn't require a profile (unlike LIPO)
- High performance: close to full LTO; peak optimization can use profile data and/or more heavyweight IPA
- Flexible: friendly to both single-machine and distributed build systems
- Possible to enable by default at -O2!

ThinLTO Phase 1: IR and Function Summary Generation
[same pipeline diagram as above, highlighting the parallel per-module compiles that produce x.ir+func, y.ir+func, z.ir+func]

ThinLTO Phase 1: IR and Function Summary Generation
- Generate IR files in parallel, e.g. bitcode as in a normal LLVM -flto -c compile
- Generate a function table that maps functions to their offsets in the bitcode file
- Generate function summary data to guide later import decisions, included in the function table, e.g. function attributes such as size, branch count, etc.
- Optional profile data summary (e.g. function hotness)
- How to represent the function/summary table? Metadata? LLVM IR? ELF section? Discussed later...
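For reference, this phase is what later shipped in Clang as -flto=thin, which emits bitcode with the function summary embedded in each object; a minimal sketch:

    # Each compile emits LLVM bitcode plus its ThinLTO function summary, in parallel
    clang -O2 -flto=thin -c x.c -o x.o
    clang -O2 -flto=thin -c y.c -o y.o
    clang -O2 -flto=thin -c z.c -o z.o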

ThinLTO Phase 2: Thin Linker Plugin Layer
[same pipeline diagram, highlighting the super-thin linker plugin that aggregates the Func Map and optional Global Analysis Summary]

ThinLTO Phase 2: Thin Linker Plugin Layer
- Launched transparently via the linker plugin (as with LTO)
- Simply combines the per-module function maps into a thin archive: an on-disk hash table, or AR format if the function maps are in ELF
- Optional heavier-weight IPA added for (non-default) peak optimization levels and saved in the combined summary
- Excludes functions from the map if they are unlikely to benefit from inlining, e.g. very large, unlikely/cold, or duplicate COMDAT copies
- Can aggressively prune import candidates (in LIPO only ~5-10% of functions in hot imported modules are actually needed)
- Generates the backend compile Makefile; for a single-node build, invokes parallel make and resumes the final link (see the sketch below)
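As a point of reference, the ThinLTO implementation that eventually shipped handles the single-node case with in-process backend threads in the linker rather than a generated Makefile; with lld, for example (the job count is illustrative):

    # Thin link plus 8 in-process parallel ThinLTO backend jobs, then the real link
    clang -O2 -flto=thin -fuse-ld=lld -Wl,--thinlto-jobs=8 x.o y.o z.o -o a.out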

ThinLTO Phase 3: Parallel BE with Demand-Driven Importing
[same pipeline diagram, highlighting the parallel backend compiles that produce x.o, y.o, z.o before the final a.out link]

ThinLTO Phase 3: Parallel BE with Demand-Driven Importing
- Iterative lazy function importing, with priority determined by summary and callsite context
- Uses the combined function map to import functions efficiently
- Global value symbol linking as functions are imported
- Static promotion and renaming (similar to LIPO): uniqued names after linking must be consistent across modules; minimize required promotions with a careful import strategy
- Lazy debug import
- IPA/CMO on the extended module; afterwards discard non-inlined imported functions, except referenced COMDATs and referenced non-promoted statics
- Late global and machine-specific optimizations, code generation

ThinLTO Function Import Decisions
- Ideally import a close superset of the functions that are profitable to inline; use minimal summary information and callsite context to estimate this
- Achieve some of LTO's internalization benefit by importing functions with a single static callsite per the summary (i.e. the call-once, local-linkage inline benefit heuristic)
- More accurate with optional profile data (particularly with indirect call profiles enabling promotion in the first-stage compile)
- Minimize required static promotions: always import referenced statics, so promote only address-taken statics (which may still be referenced by another exported function)

IR and Function Map Format
Two main options for the IR and function/summary map:
1. Bitcode: Stage 1 per-module function map represented with new metadata; the combined function map could be encoded as an on-disk hash table (a la profile data); tools like $AR, $RANLIB, $NM and $LD -r must be invoked with a plugin to handle IR
2. ELF wrapper: leverages recent support for handling ELF-wrapped LLVM bitcode; Stage 1 per-module function maps encoded in a special ELF section; combined function map ELF sections can simply use $AR format; tools like $AR, $RANLIB, $NM and $LD -r just work
The transparency with standard tools is a big advantage of the ELF wrapper, but encoding the function map as metadata is the simplest/fastest route to implementing within LLVM. Looking for feedback from the community on this aspect.
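To illustrate the tool-transparency tradeoff: with the raw bitcode encoding, the binutils tools only understand IR objects when given an LTO plugin (the plugin path below is hypothetical), whereas ELF-wrapped bitcode needs nothing special:

    # Bitcode objects: archive/symbol tools need the LTO plugin loaded
    ar --plugin /path/to/LLVMgold.so rcs libxyz.a x.o y.o z.o
    nm --plugin /path/to/LLVMgold.so x.o

    # ELF-wrapped bitcode: the same tools just work
    ar rcs libxyz.a x.o y.o z.o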

ThinLTO Distributed Build Integration: Challenges
- How/where should the BE jobs be distributed/dispatched to the build cluster?
- BE build input set preparation/staging: ThinLTO works best when build nodes share a network file system, where lazy importing can minimize the network traffic needed for CMO; otherwise the BE compile dependencies need to be precomputed for file staging from local disks or network storage (computed in the plugin layer from profile data, or from the symbol table and heuristics at O2)
- Incremental builds: incremental compiles for IR files work with any build system; using the BE compile dependency list, the BE compilations can be incremental as well

ThinLTO Distributed Build Integration: Example
[diagram: x.ir/y.ir/z.ir + profile data -> Action 1: pre-link with plugin (original ld cmd) emitting Func Map, Module Map, Imports Files -> Action 2: parallel BE invocations producing x.o/y.o/z.o -> Action 3: real linking step]
Split the ThinLTO link step into 3 actions:
1. A pre-link step with the ThinLTO plugin producing global data:
   a. Function Map, Module Map (from IR to real object), Imports Files (for file staging)
   b. Exits without producing real objects
2. Backend invocation action:
   a. Uses the map files from step 1 to create backend command lines and schedule jobs
3. Real linking:
   a. Fix up the original command line using the module map and perform the real link
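For reference, the distributed flow that later shipped in Clang/lld follows this same three-action split; a minimal sketch:

    # Action 1: thin link only; writes per-module index (*.thinlto.bc) and
    # import lists (*.imports) for file staging, but no native objects
    clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only \
          -Wl,--thinlto-emit-imports-files x.o y.o z.o

    # Action 2: independent backend jobs, one per module, schedulable on any node
    clang -O2 -x ir x.o -c -fthinlto-index=x.o.thinlto.bc -o x.native.o

    # Action 3: the real link over the native objects
    clang x.native.o y.native.o z.native.o -o a.out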

LLVM Prototype Status
- Implemented a prototype within LLVM
- Support for stage 2 (thin plugin layer) and stage 3 (BE with importing) in both the gold plugin and the llvm-lto driver, under options
- Function map/summary and thin archive in separate side text files for now
- Testing with SPEC CPU2006 C/C++ benchmarks

LLVM Prototype Details
- ThinLTO importing implemented as an SCC pass before inlining
- Leverages LTO module linking, with ThinLTO-specific modifications
- Initial heuristics for guiding import decisions (e.g. function size, profile, call count from the function summary)
- Implemented minimized static promotion: rename statics by appending an MD5 hash of the module name/path
- Enhanced GlobalDCE to remove unneeded imported functions
- Lazy debug import as a post-pass during ThinLTO pass finalization: only import the DISubprogram metadata needed by imported functions; uses bookkeeping to track temporary metadata on imported instructions to enable fixup as debug metadata is imported/linked

Preliminary Experimental Data
- Very little tuning, preliminary import heuristics: a limit on the number of instructions (computed at parse time); try allowing more aggressive import/inline when there is a single use
- No results with profile data yet; ThinLTO is negatively impacted by lack of indirect call visibility (needs to be addressed with value profiling/promotion during stage 1)
- Runtime system configuration: Intel Core i7 6-core hyperthreaded CPU; 32K L1I, 32K L1D, 256K L2, 12M shared L3
- SPEC CPU2006 C/C++ benchmarks; results averaged across 3 runs

Performance Comparison
[chart: SPEC CPU2006 performance, LTO vs. ThinLTO configurations]
- ThinLTO 1: import iff < 100 instructions (recorded in the function summary during parse)
- ThinLTO 2: above, but also import if there is only one use (currently requires additional parsing/analysis during stage 2), and also apply the inliner's called-once + internal-linkage preferential treatment
- Indirect call profiling/promotion will help close the gap

Build Overhead Comparisons
- Build system configuration: dual Intel 10-core Xeon CPUs, 64GB quad-channel memory
- SPEC CPU2006 C/C++ benchmarks (small compared to many real-world C++ applications!)
- Compare maximum LTO+BE time and memory; exclude parsing to compare optimization time; all use the same clang -O2 -flto -c bitcode files as input
- For ThinLTO: max BE (including ThinLTO importing) time/memory; for a distributed build the BE jobs will run in parallel
- For O2: max optimization time/memory measured with llc -O2
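A simple way to take this kind of per-module measurement (illustrative; GNU time's -v output includes peak resident memory):

    # Wall time and peak memory of one -O2 backend compile of a bitcode file
    /usr/bin/time -v llc -O2 x.o -o x.s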

Compile Time Comparison (no -g) [chart]

Memory Comparison (no -g) [chart]

Compile Time Comparison (-g) [chart]

Memory Comparison (-g) [chart]

Next Steps
- Decide on file formats
- Fine-tune import heuristics
- Profile feedback integration
- Distributed build system integration
- Decide how best to stage upstream patches

Thank you! Teresa Johnson (tejohnson@google.com) Xinliang David Li (davidxl@google.com)

Backup

Text Size Comparison (GC = -ffunction-sections -fdata-sections -Wl,--gc-sections) [chart]

Performance Effect of Internalization
[chart]
- ThinLTO 1: import iff < 100 instructions (recorded in the function summary during parse)
- ThinLTO 2: above, but also import if there is only one use (currently requires additional parsing/analysis during stage 2), and also apply the inliner's called-once + internal-linkage preferential treatment
- LTO-Internalization: LTO with internalization disabled