Programming Larrabee: Beyond Data Parallelism. Dr. Larry Seiler Intel Corporation

Dr. Larry Seiler, Intel Corporation, 2009

Agenda: What do parallel applications need? How does Larrabee enable parallelism? How does Larrabee support data parallelism? How does Larrabee support task parallelism? How can applications mix parallel methods? 2

Architecture Convergence [Chart: programmability (fixed function, partially programmable, fully programmable) vs. throughput performance. Single-thread IA cores are highly programmable; GPUs are highly parallel but only partially programmable; multi-core designs and Larrabee converge toward both.] CPU programmability & GPU parallelism 3

Programming Larrabee Larrabee supports "braided parallelism": GPU-style data parallelism, mixed with multi-core task parallelism, mixed with streaming pipe parallelism, mixed with sequential code (still often necessary). Larrabee can submit work to itself. Larrabee supports the IA architecture. Larrabee lets applications mix the kinds of parallelism they need 4

Agenda What do parallel applications need? How does Larrabee enable parallelism? How does Larrabee support data parallelism? How does Larrabee support task parallelism? How can applications mix parallel methods? 5

Larrabee Block Diagram [Diagram: multiple multi-threaded wide-SIMD cores, each with its own I$ and D$, connected through a partitioned L2 cache on a ring, plus memory controllers, fixed-function texture logic, a display interface, and a system interface.] Cores communicate on a wide ring bus Fast access to memory and fixed function blocks Fast access for cache coherency L2 cache is partitioned among the cores Provides high aggregate bandwidth Allows data replication & sharing 6

Processor Core Block Diagram [Diagram: instruction decode feeds a scalar unit with scalar registers and a vector unit with vector registers; L1 I-cache and D-cache; a 256KB local subset of the L2 cache; a connection to the ring.] Separate scalar and vector units with separate registers Vector unit: 16 32-bit ops/clock In-order instruction execution Short execution pipelines Fast access from L1 cache Direct connection to each core's subset of the L2 cache Prefetch instructions load L1 and L2 caches 7

Vector Unit Block Diagram [Diagram: mask registers and vector registers feed a 16-wide vector ALU, with replicate, reorder, and numeric-convert paths between the vector registers and the L1 data cache.] Vector-complete instruction set Scatter/gather for vector load/store Mask registers select lanes to write, which allows data-parallel flow control Masks also support data compaction Vector instructions support: memory operands for most instructions (full speed when data is in the L1 cache), free numeric type conversion and data replication while reading from memory, rearranging the lanes on register read, fused multiply-add (three arguments), and Int32, Float32 and Float64 data 8
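The mask mechanism can be sketched in plain C (a hypothetical scalar simulation, not actual Larrabee intrinsics): both arms of a conditional execute across all 16 lanes, and a mask selects which lanes each arm is allowed to write.

```c
#include <stdint.h>

#define LANES 16

/* A 16-wide vector, modeled as a plain array. */
typedef struct { float v[LANES]; } vec16;

/* Per-lane comparison producing a mask: bit i is set if a.v[i] < b.v[i]. */
uint16_t vcmp_lt(vec16 a, vec16 b) {
    uint16_t mask = 0;
    for (int i = 0; i < LANES; i++)
        if (a.v[i] < b.v[i]) mask |= (uint16_t)(1u << i);
    return mask;
}

/* Masked add/subtract: lanes whose mask bit is clear keep their old value. */
vec16 vadd_masked(vec16 dst, vec16 a, vec16 b, uint16_t mask) {
    for (int i = 0; i < LANES; i++)
        if (mask & (1u << i)) dst.v[i] = a.v[i] + b.v[i];
    return dst;
}

vec16 vsub_masked(vec16 dst, vec16 a, vec16 b, uint16_t mask) {
    for (int i = 0; i < LANES; i++)
        if (mask & (1u << i)) dst.v[i] = a.v[i] - b.v[i];
    return dst;
}

/* Data-parallel "if (x < y) x += y; else x -= y;":
 * both sides run over all lanes, writes selected by m and ~m. */
vec16 cond_add_sub(vec16 x, vec16 y) {
    uint16_t m = vcmp_lt(x, y);
    x = vadd_masked(x, x, y, m);              /* then-side lanes */
    x = vsub_masked(x, x, y, (uint16_t)~m);   /* else-side lanes */
    return x;
}
```

On the real hardware the masking happens in the mask registers at full vector width; the per-lane loops here only make the semantics explicit.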

Key Differences from Typical GPUs Each Larrabee core is a complete Intel processor Context switching & pre-emptive multi-tasking Virtual memory and page swapping, even in texture logic Fully coherent caches at all levels of the hierarchy Efficient inter-block communication Ring bus for full inter-processor communication Low-latency, high-bandwidth L1 and L2 caches Fast synchronization between cores and caches Larrabee: the programmability of IA with the parallelism of graphics processors 9

Agenda What do parallel applications need? How does Larrabee enable parallelism? How does Larrabee support data parallelism? How does Larrabee support task parallelism? How can applications mix parallel methods? 10

Data Parallelism Overview Key ideas of data parallel programming Define an array (grid) of data elements Run the same program (kernel) over all the elements Share memory between groups within the grid Basically just nested parallel for loops Strengths Good utilization of SIMD processing elements (VPU) Hides long latencies (memory or texture accesses) Limitations Works best for large, homogeneous problems Work efficiency drops with irregular conditional execution Work efficiency drops with irregular data structures 11
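The "nested parallel for loops" view can be sketched as follows (a hypothetical example; the loops run serially here, whereas a data-parallel runtime would spread the iterations across cores, threads, and VPU lanes):

```c
#include <stddef.h>

/* Hypothetical data-parallel launch: run the same kernel over every
 * element of a 2D grid. Each iteration is independent, which is what
 * lets the hardware execute them in any order and in parallel. */
typedef void (*kernel_fn)(float *grid, size_t w, size_t x, size_t y);

void launch2d(float *grid, size_t w, size_t h, kernel_fn k) {
    for (size_t y = 0; y < h; y++)       /* outer "parallel for" */
        for (size_t x = 0; x < w; x++)   /* inner "parallel for" */
            k(grid, w, x, y);
}

/* Example kernel: each element is computed without touching neighbors. */
void scale_bias(float *grid, size_t w, size_t x, size_t y) {
    grid[y * w + x] = grid[y * w + x] * 2.0f + 1.0f;
}
```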

Larrabee Data Parallelism: Strands [Diagram: 16-wide vector units.] Each strand is the kernel running on one data element Each strand maps onto one (or more) VPU lanes Strands must support different conditional execution 12

Larrabee Data Parallelism: Fibers Fiber: SW-managed context (hides long predictable latencies) [Diagram: 16-wide vector units.] More fibers running (typically 2-10, depending on latency to cover) SW cycles among fibers (groups of strands) to cover latency 13
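A minimal sketch of the fiber idea, with assumed names (`fiber_t`, `run_fibers`): each fiber alternates between issuing a long-latency load and consuming its result, and the software scheduler round-robins among fibers instead of stalling on the load.

```c
#include <stdbool.h>

#define NFIBERS 4

/* Hypothetical fiber: a SW-managed context modeled as a small state
 * machine. ISSUE starts a long-latency load without waiting for it;
 * by the time the scheduler comes back around, the data has arrived
 * and CONSUME can use it. Cycling among fibers hides the latency. */
typedef enum { ISSUE, CONSUME, DONE } phase_t;

typedef struct {
    phase_t phase;
    int     items_left;   /* data elements this fiber still owns */
    int     work_done;
} fiber_t;

/* Advance fiber f by one phase, then yield back to the scheduler. */
static void step(fiber_t *f) {
    switch (f->phase) {
    case ISSUE:                 /* start the load, don't wait */
        f->phase = CONSUME;
        break;
    case CONSUME:               /* load has completed by now */
        f->work_done++;
        f->items_left--;
        f->phase = (f->items_left > 0) ? ISSUE : DONE;
        break;
    case DONE:
        break;
    }
}

/* Round-robin scheduler: while one fiber's load is in flight, the
 * other fibers make progress. Returns the total work completed. */
int run_fibers(int items_per_fiber) {
    fiber_t fibers[NFIBERS];
    for (int i = 0; i < NFIBERS; i++)
        fibers[i] = (fiber_t){ ISSUE, items_per_fiber, 0 };

    bool any_live = true;
    while (any_live) {
        any_live = false;
        for (int i = 0; i < NFIBERS; i++) {
            if (fibers[i].phase != DONE) {
                step(&fibers[i]);
                any_live = true;
            }
        }
    }
    int total = 0;
    for (int i = 0; i < NFIBERS; i++) total += fibers[i].work_done;
    return total;
}
```

Real fibers would carry register state and switch at compiler-inserted points; the state machine above only shows the scheduling pattern.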

Larrabee Data Parallelism: Threads Cores: each runs multiple threads Thread: HW-managed context (hides short unpredictable latencies) Fiber: SW-managed context (hides long predictable latencies) [Diagram: 16-wide vector units.] More fibers running (typically 2-10, depending on latency to cover) More threads (up to 4 per core, share memory via L1 & L2 caches) 14

Data Parallelism Summary Implementation hierarchy (your names may vary) Strand: runs a data kernel in one (or more) VPU lane(s) Fiber: SW-managed group of strands, like co-routines Thread: HW-managed, swaps fibers to cover long latencies Core: runs multiple threads to cover short latencies Note: we use thread and core in the micro-processor sense Comparison to GPU data parallelism Same mechanisms as used in GPUs, except Larrabee allows SW scheduling (except for HW threads) Larrabee: uses many-core IA architecture to execute data parallel workloads 15

Agenda What do parallel applications need? How does Larrabee enable parallelism? How does Larrabee support data parallelism? How does Larrabee support task parallelism? How can applications mix parallel methods? 16

Task Parallelism Overview Think of a task as an asynchronous function call: Do X at some point in the future Optionally after Y is done Lightweight, often in user space Strengths Copes well with heterogeneous workloads Doesn't require 1000s of strands Scales well with core count Limitations No automatic support for latency hiding Must explicitly write SIMD code 17

Larrabee Task Parallelism: Tasks Task: SW-managed context Scalar code Auto-vectorizing compiler, C++ vector classes, or vector intrinsics 16-wide vector unit More tasks enqueued, waiting to run 18

Larrabee Task Parallelism: Threads Cores: each runs multiple threads Thread: HW-managed context (hides short unpredictable latencies) Task: SW-managed context Scalar code Auto-vectorizing compiler, C++ vector classes, or explicit vector code 16-wide vector unit More tasks enqueued, waiting to run More threads (up to 4 per core, share memory via L1 & L2 caches) 19

Examples of Using Tasks Applications Scene traversal and culling Procedural geometry synthesis Physics contact group solve Data parallel strand groups Distribute across threads/cores using task system Exploit core resources with fibering/simd Larrabee can submit work to itself! Tasks can spawn other tasks Exposed in Larrabee Native programming interface 20
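A single-threaded sketch of such a task system (all names here are hypothetical, not the Larrabee Native API): tasks are queued function calls, and a running task may spawn further tasks into the same queue, which is the "submit work to itself" pattern.

```c
#include <stddef.h>

#define MAX_TASKS 64

/* Hypothetical task system: a task is an asynchronous function call.
 * Single-threaded sketch; a real runtime would drain the queue from
 * many HW threads across many cores. */
typedef struct task_queue task_queue;
typedef void (*task_fn)(task_queue *q, int arg);

struct task_queue {
    task_fn fn[MAX_TASKS];   /* circular buffer of pending calls */
    int     arg[MAX_TASKS];
    int     head, tail;
};

void spawn(task_queue *q, task_fn fn, int arg) {
    q->fn[q->tail]  = fn;
    q->arg[q->tail] = arg;
    q->tail = (q->tail + 1) % MAX_TASKS;
}

/* Drain the queue; tasks spawned while draining also get run. */
void run_all(task_queue *q) {
    while (q->head != q->tail) {
        task_fn fn = q->fn[q->head];
        int    arg = q->arg[q->head];
        q->head = (q->head + 1) % MAX_TASKS;
        fn(q, arg);
    }
}

/* Example: a task that spawns two children per level, like a scene
 * traversal enqueuing one task per subtree. */
int visited;

void visit(task_queue *q, int depth) {
    visited++;
    if (depth > 0) {
        spawn(q, visit, depth - 1);
        spawn(q, visit, depth - 1);
    }
}
```

A production scheduler would add work stealing and per-core queues; the point here is only that spawning from inside a task needs no host round-trip.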

Larrabee Native Programming Internal projects include tools and compilers C and C++: std C libs: printf(), malloc(), etc. Data pointers, function pointers, recursion, etc. Threading: spawn P-Threads or use tasks Full software control when needed Thread affinity, assembly code, etc. Access to fixed function units Larrabee: supports fully general purpose task parallel programming 21

Agenda What do parallel applications need? How does Larrabee enable parallelism? How does Larrabee support data parallelism? How does Larrabee support task parallelism? How can applications mix parallel methods? 22

Applications Can Be Complex May need many kinds of parallel code GPU-style data-parallelism Multi-core task parallelism Streaming pipe parallelism And even some sequential code May also need Self-scheduling (data-dependent new work) Irregular data structures Example? The rendering pipeline All in software in Larrabee, except for texture filtering Works due to support for irregular computation 23

Larrabee's Binning Renderer [Diagram: the command stream feeds front-end cores that each process a primitive set into a bin set stored in off-chip memory; back-end cores each render a tile into the frame buffer.] Binning pipeline Reduces synchronization Front end processes vertices Back end processes pixels Bin FIFO between them Multi-tasking by cores Each orange box in the diagram is a core Cores run independently Other cores can run other tasks, e.g. physics Intel Microarchitecture (Larrabee) 24
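The front-end binning step can be sketched as follows (a hypothetical, counts-only example): each primitive's screen-space bounding box determines which tile bins it lands in, so that each back-end core can later render one tile with no cross-tile synchronization.

```c
#include <stddef.h>

#define TILE 64   /* assumed 64x64-pixel screen tiles */

/* Screen-space bounding box of a primitive. */
typedef struct { float x0, y0, x1, y1; } bbox;

/* Increment bin_counts[ty * tiles_x + tx] for every tile the box
 * overlaps. A real binner would append the primitive itself to each
 * tile's bin FIFO; counting keeps the sketch short. */
void bin_primitive(bbox b, int tiles_x, int tiles_y, int *bin_counts) {
    int tx0 = (int)(b.x0 / TILE), ty0 = (int)(b.y0 / TILE);
    int tx1 = (int)(b.x1 / TILE), ty1 = (int)(b.y1 / TILE);
    if (tx0 < 0) tx0 = 0;                 /* clamp to the screen */
    if (ty0 < 0) ty0 = 0;
    if (tx1 >= tiles_x) tx1 = tiles_x - 1;
    if (ty1 >= tiles_y) ty1 = tiles_y - 1;
    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            bin_counts[ty * tiles_x + tx]++;
}
```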

Back-End Rendering a Tile [Diagram: sub-bins for the tile feed a setup thread (interpolate & scoreboard) and three work threads (early & late Z, pixel shader & alpha blend), all sharing the L2 cache for the tile.] Orange boxes represent work on separate HW threads Three work threads do Z, pixel shader, and blending Setup thread reads from bins and does pre-processing Combines task parallel, data parallel, and sequential 25

Pipeline Can Be Changed Parts can move between front end & back end Vertex shading, tessellation, rasterization, etc. Allows balancing computation vs. bandwidth New features can be added Transparency, shadowing, ray tracing, etc. Each of these needs irregular data structures They benefit from a mix of tasks and data parallelism Also helps to be able to repack the data Graphics pipeline is just an example These methods apply to many parallel applications 26

Application Scaling Studies 60 50 Production Fluid Marching Cubes Video Cast Indexing Foreground Estimation BLAS3 Game Cloth Game Rigid Body Production Face Sports Video analysis Text Indexing 3D-FFT Game Fluid GJK Collision Detect, 1K objs Sweep & Prune, Broad Phase 40 30 20 10 0 Parallel Speedup Number of Larrabee Cores 8 16 24 32 40 48 56 64 Data in graph from Seiler, L., Carmean, D., et al. 2008. Larrabee: A many-core x86 architecture for visual computing. SIGGRAPH 08: ACM SIGGRAPH 2008 Papers, ACM Press, New York, NY 27

Choose the Right Tool Use the 3D rendering pipeline How you always have (or create your own) Use data parallelism Parallel for loop Hide latency by switching to work from other elements Use task parallelism Asynchronous function call Hide latency with in-task reuse and async memory ops Use sequential code Tie it all together Communicate with the host at a higher level of abstraction Flexible & programmable for many applications 28

Summary Larrabee lets applications mix the kinds of parallelism they need Larrabee provides the programmability of IA with the parallelism of GPUs Larrabee uses the many-core IA architecture to execute data parallel workloads Larrabee supports fully general purpose task parallel programming Larrabee is flexible & programmable for many applications Intel Microarchitecture (Larrabee) 29

Larrabee: Questions? See also s08.idav.ucdavis.edu for slides from a SIGGRAPH 2008 course titled Beyond Programmable Shading. www.intel.com/idf -- click content catalog & search for Larrabee to find a Larrabee presentation. Seiler, L., Carmean, D., et al. 2008. Larrabee: A many-core x86 architecture for visual computing. SIGGRAPH '08: ACM SIGGRAPH 2008 Papers, ACM Press, New York, NY. Fatahalian, K., Houston, M. GPUs: a closer look. Communications of the ACM, October 2008, vol. 51, no. 10. graphics.stanford.edu/~kayvonf/papers/fatahaliancacm.pdf. Data in the scaling graph is from Seiler et al. 2008, above. 30

Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Larrabee and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Intel, Intel Inside and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others. Copyright 2009 Intel Corporation. 31

Risk Factors This presentation contains forward-looking statements that involve a number of risks and uncertainties. These statements do not reflect the potential impact of any mergers, acquisitions, divestitures, investments or other similar transactions that may be completed in the future. The information presented is accurate only as of today's date and will not be updated. In addition to any factors discussed in the presentation, the important factors that could cause actual results to differ materially include the following: Demand could be different from Intel's expectations due to factors including changes in business and economic conditions, including conditions in the credit market that could affect consumer confidence; customer acceptance of Intel's and competitors' products; changes in customer order patterns, including order cancellations; and changes in the level of inventory at customers. Intel's results could be affected by the timing of closing of acquisitions and divestitures. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of new Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel's response to such actions; Intel's ability to respond quickly to technological developments and to incorporate new features into its products; and the availability of sufficient supply of components from suppliers to meet demand. 
The gross margin percentage could vary significantly from expectations based on changes in revenue levels; product mix and pricing; capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; excess or obsolete inventory; manufacturing yields; changes in unit costs; impairments of long-lived assets, including manufacturing, assembly/test and intangible assets; and the timing and execution of the manufacturing ramp and associated costs, including start-up costs. Expenses, particularly certain marketing and compensation expenses, vary depending on the level of demand for Intel's products, the level of revenue and profits, and impairments of long-lived assets. Intel is in the midst of a structure and efficiency program that is resulting in several actions that could have an impact on expected expense levels and gross margin. Intel's results could be impacted by adverse economic, social, political and physical/infrastructure conditions in the countries in which Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. A detailed discussion of these and other factors that could affect Intel's results is included in Intel's SEC filings, including the report on Form 10-Q for the quarter ended June 28, 2008. 32