Computer Architecture and Structured Parallel Programming James Reinders, Intel

Similar documents
Intel Xeon Phi coprocessor (codename Knights Corner) George Chrysos Senior Principal Engineer Hot Chips, August 28, 2012

Intel Xeon Phi Coprocessor

Knights Corner: Your Path to Knights Landing

Achieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013

Introduction to Xeon Phi. Bill Barth January 11, 2013

Intel Many Integrated Core (MIC) Architecture

Intel Xeon Phi Coprocessor

Alexei Katranov. IWOCL '16, April 21, 2016, Vienna, Austria

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

HPC code modernization with Intel development tools

Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ,

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

Overview. CS 472 Concurrent & Parallel Programming University of Evansville

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Intel MIC Programming Workshop, Hardware Overview & Native Execution. IT4Innovations, Ostrava,

The Intel Xeon Phi Coprocessor. Dr-Ing. Michael Klemm Software and Services Group Intel Corporation

Parallel Programming on Ranger and Stampede

Trends and Challenges in Multicore Programming

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing

Bring your application to a new era:

The Stampede is Coming: A New Petascale Resource for the Open Science Community

Parallel Computing. Hwansoo Han (SKKU)

Teaching Think Parallel

Introduc)on to Xeon Phi

Programming for the Intel Many Integrated Core Architecture By James Reinders. The Architecture for Discovery. PowerPoint Title

Accelerator Programming Lecture 1

Introduction to Parallel Programming

Expressing and Analyzing Dependencies in your C++ Application

The Era of Heterogeneous Computing

Parallelism. CS6787 Lecture 8 Fall 2017

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.

Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant

Double Rewards of Porting Scientific Applications to the Intel MIC Architecture

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Munara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

GPUs and Emerging Architectures

45-year CPU Evolution: 1 Law -2 Equations

Does the Intel Xeon Phi processor fit HEP workloads?

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers

Martin Kruliš, v

Introduction to Intel Xeon Phi programming techniques. Fabio Affinito Vittorio Ruggiero

Using Intel VTune Amplifier XE for High Performance Computing

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Multicore Programming

Intel Xeon Phi Coprocessors

An Introduction to Parallel Programming

Manycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.

Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,

Intel Knights Landing Hardware

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

Structured Parallel Programming Patterns for Efficient Computation

Computer Architecture. R. Poss

E, F. Best-known methods (BKMs), 153 Boot strap processor (BSP),

Intel Architecture for Software Developers

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Multi-core Architectures. Dr. Yingwu Zhu

Intel Architecture for HPC

Efficiently Introduce Threading using Intel TBB

The Art of Parallel Processing

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

Intel Advisor XE. Vectorization Optimization. Optimization Notice

Parallel Programming on Larrabee. Tim Foley Intel Corp

Introduction to the Xeon Phi programming model. Fabio AFFINITO, CINECA

Using Intel VTune Amplifier XE and Inspector XE in.net environment

Scientific Computing with Intel Xeon Phi Coprocessors

Heterogeneous Computing and OpenCL

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

April 2 nd, Bob Burroughs Director, HPC Solution Sales

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

Повышение энергоэффективности мобильных приложений путем их распараллеливания. Примеры. Владимир Полин

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor

Kernel Synchronization I. Changwoo Min

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian

Introduction to the Intel Xeon Phi on Stampede

Parallel Algorithm Engineering

Parallelism in Hardware

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

CS671 Parallel Programming in the Many-Core Era

WHY PARALLEL PROCESSING? (CE-401)

Lecture 3: Intro to parallel machines and models

Bei Wang, Dmitry Prohorov and Carlos Rosales

Multicore Hardware and Parallelism

Many-Core Computing Era and New Challenges. Nikos Hardavellas, EECS

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Advanced Parallel Programming I

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Parallel Computing. November 20, W.Homberg

The Stampede is Coming Welcome to Stampede Introductory Training. Dan Stanzione Texas Advanced Computing Center

CS 475: Parallel Programming Introduction

Structured Parallel Programming

Preparing for Highly Parallel, Heterogeneous Coprocessing

Modern Processor Architectures. L25: Modern Compiler Design

EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA

Transcription:

Computer Architecture and Structured Parallel Programming James Reinders, Intel Parallel Computing CIS 410/510 Department of Computer and Information Science Lecture 17 Manycore Computing and GPUs

Computer Architecture & Structured Parallel Programming review aspects of computer architecture that are critical to high performance computing discuss how to think about best algorithm design using structured parallel programming techniques task vs. data parallelism and why data parallelism is key introduce TBB, OpenMP* introduce Intel Xeon Phi architecture. HARDWARE SOFTWARE SOFTWARE SOFTWARE HARDWARE 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

See the Forest 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 4

See the Forest A cliché about someone missing the big picture because they focus too much on details: They cannot see the forest for the trees. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 5

See the Forest I architecture. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 6

See the Forest I architecture. but 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 7

See the Forest Can you teach parallel programming without first teaching computer architecture? 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 8

See the Forest Can you teach parallel programming without first teaching computer architecture? (Or without just teaching a single API?) 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 9

See the Forest TREES Cores HW threads Vectors Offload Heterogeneous Cloud Caches NUMA 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 10

See the Forest TREES Cores HW threads Vectors Offload Heterogeneous Cloud Caches NUMA FOREST Parallelism, Locality Parallelism, Locality Parallelism, Locality Parallelism, Locality Parallelism, Locality Parallelism, Locality Parallelism, Locality Parallelism, Locality 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 11

See the Forest TREES Cores HW threads Vectors Offload Heterogeneous Cloud Caches NUMA Advice: FORESTproper abstractions Use Parallelism, tasks Locality Use Parallelism, tasks Locality Use Parallelism, SIMD (10:30 Locality talk) Avoid, Parallelism, Use TARGET Locality Avoid Parallelism, via neo-hetero Locality What s Parallelism, a cloud? Locality Use Parallelism, abstractions Locality Use Parallelism, abstractions Locality 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 12

See the Forest TREES Cores HW threads Vectors Offload Heterogeneous Cloud Caches NUMA FOREST Parallelism, Locality Parallelism, Locality Parallelism, Locality Parallelism, Locality Parallelism, Locality Parallelism, Locality Parallelism, Locality Parallelism, Locality 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 13

Teach the Forest Increase exposing parallelism. Increase locality of reference. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 14

Teach the Forest Increase exposing parallelism. Increase locality of reference. Why? Because it s programming that addresses the universal needs of computers today and in the future future. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 15

Teach the Forest Increase exposing parallelism. Increase locality of reference. THIS IS YOUR MISSION 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 16

Why so many cores? 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 17

Why Multicore? The Free Lunch is over, really. But Moore s Law continues! 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Processor Clock Rate over Time Growth halted around 2005 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Transistors per Processor over Time Continues to grow exponentially (Moore s Law) 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Moore s Law Number of components (transistors) doubles about every 18-24 months. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

SSE AVX MIC AVX-512 80386 MMX 8008, 8080 8086, 8088 4004 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 61

Is this the Architecture Track? 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 24

CPU CPU Memory These were simpler times. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 25

CPU + cache CPU Cache Memory Memories got further away (meaning: CPU speed increased faster than memory speeds) A closer cache for frequently used data helps performance when memory is no longer a single clock cycle away. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 26

CPU + caches CPU (L1) Cache Cache Memory Memories keep getting further away (this trend continues today). More caches help even more (with temporal reuse of data). 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 27

L1 CPU CPU with caches Memory As transistor density increased (Moore s Law), cache capabilities were integrated onto CPUs. Higher performance external (discrete) caches persisted for some time while integrated cache capabilities increase. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 28

CPU / Coprocessors L1 CPU FP Memory Coprocessors appearing first in 1970s were FP accelerators for CPUs without FP capabilities. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 29

CPU / Coprocessors L1 CPU FP Memory As transistor density increased (Moore s Law), FP capabilities were integrated onto CPUs. Higher performance discrete FP accelerators persisted a little bit while integrated FP capabilities increase. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 30

CPU / Coprocessors Early Design Display Interest to provide hardware support for displays increased as use of graphics grew (games being a key driver). This led to graphics processing units (GPUs) attached to CPUs to create video displays. GPU (card) CPU Memory 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 31

CPU / Coprocessors GPU speeds and CPU speeds increase faster than memory speeds. Direct connection to memory best done via caches (on the CPU). Display GPU (card) L1 CPU FP Memory 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 32

CPU / Coprocessors GPU speeds and CPU speeds increase faster than memory speeds. Direct connection to memory best done via caches (on the CPU). Display GPU (card) L1 CPU FP Memory 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 33

CPU / Coprocessors Display As transistor density increased (Moore s Law), GPU capabilities were integrated onto CPUs. Higher performance external (discrete) GPUs persist while integrated GPU capabilities increase. CPU L1 GPU FP Memory 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

CPU / Coprocessors A many core coprocessor (Intel Xeon Phi ) appears, purpose built for accelerating technical computing. many core coprocessor (card) L1 CPU FP Memory 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

CPU / Coprocessors As transistor density increased (Moore s Law), many core capabilities will be integrated to create a many core CPU. ( Knights Landing ) many core CPU L1 FP Memory 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Nodes Nodes are building blocks for clusters. L1 CPU FP Memory GPU (card) L1 CPU FP Memory With or without GPUs. Displays not needed. many core coprocessor (card) L1 CPU FP Memory L1 GPU CPU FP Memory 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Clusters Node Node Node NIC NIC NIC Clusters are made by connecting nodes - regardless of Nodes type. Node NIC Node NIC Node NIC Node Node Node NIC NIC NIC 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

NIC (Network Interface Controller) integration many core CPU L1 FP Memory NIC As transistor density increased (Moore s Law), NIC capabilities will be integrated onto CPUs. L1 CPU FP Memory CPU L1 GPU FP Memory NIC NIC 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

What matters when programming? Parallelism Locality 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Amdahl who? 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 41

How much parallelism is there? Amdahl s Law Gustafson s observations on Amdahl s Law 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Amdahl s law the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude. Amdahl, 1967 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Amdahl s law an observation speedup should be measured by scaling the problem to the number of processors, not by fixing the problem size. Gustafson, 1988 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

How much parallelism is there? Amdahl s Law Gustafson s observations on Amdahl s Law Plenty but the workloads need to continue to grow! 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Why Intel Xeon Phi? 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 54

Intel Xeon Phi Coprocessor It s just a different design point. Not a different programming paradigm. Little cores vs. big cores. All x86. vs. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Performance Work Work Instruction Cycle = Time Instructions x x Cycle Time Path Length IPC Frequency Better algorithm same work with fewer instructions The compiler can optimize for fewer instructions, choose instructions with better IPC Cache WORK efficient algorithms: higher WORK IPC TIME INSTRUCTIONS Vectorization: same work with fewer instructions Parallelization: more instructions per cycle INSTRUCTION CYCLE CYCLE TIME 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Remember Pollack s rule: Performance ~ 4x the die area gives 2x the performance in one core, but 4x the performance when dedicated to 4 cores Conclusions (with respect to Pollack s rule) A powerful handle to adjust Performance/Watt Weaker cores can be beneficial (but many of them) Parallel hardware Parallel algorithms Appropriate tools Performance GHz Era Multicore Time Manycore 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Speedup? Peak perf. by example (http://ark.intel.com/) Intel Xeon E5-2680 (not the top-bin) 2S x 8C x 2.7 GHz x 4F DP x 2 ops* ~345 GF/s Intel Xeon Phi 3120A (lowest bin) 57C x 1.1 GHz x 8F DP x 2 ops* ~1 TF/s Amdahl s Law determines the total speedup S* with S* = 1 / [(1-P) + P/S] of a mixture of serial and parallel code sections with the parallel speedup S and an amount of parallel code P (strong scaling). 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Picture worth many words 2013, James Reinders & Jim Jeffers, diagram used with permission 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Intel Xeon Phi Coprocessors Highly-parallel Processing for Unparalleled Discovery 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Knights Corner Micro-architecture PCIe Client Logic Core Core Core Core GDDR MC GDDR MC TD TD TD TD TD TD TD TD GDDR MC GDDR MC Core Core Core Core 61 Visual and Parallel Computing Group Copyright 2012 Intel Corporation. All rights reserved.

Knights Corner Core PPF PF D0 D1 D2 E WB T0 IP T1 IP T2 IP T3 IP L1 TLB and 32KB Code Cache Code Cache Miss TLB Miss 4 Threads In-Order Pipe 0 Decode 16B/Cycle (2 IPC) Pipe 1 ucode TLB Miss Handler TLB HWP Ctl 512KB Cache VPU RF X87 RF Scalar RF VPU 512b SIMD X87 ALU 0 ALU 1 L1 TLB and 32KB Data Cache TLB Miss DCache Miss Core To On-Die Interconnect x86 specific logic < 2% of core + area 62 Visual and Parallel Computing Group Copyright 2012 Intel Corporation. All rights reserved.

Vector Processing Unit PPF PF D0 D1 D2 E WB D2 E VC1 VC2 V1-V4 WB D2 E VC1 VC2 V1 V2 V3 V4 DEC VPU RF 3R, 1W LD EMU Vector ALUs ST 16 Wide x 32 bit 8 Wide x 64 bit Fused Multiply Add Mask RF Scatter Gather 63 Visual and Parallel Computing Group Copyright 2012 Intel Corporation. All rights reserved.

Interconnect Core Core Core Core BL - 64 Bytes Data AD Command and Address TD TD TD TD AK Coherence and Credits TD TD TD TD AK AD Core Core Core Core BL 64 Bytes 64 Visual and Parallel Computing Group Copyright 2012 Intel Corporation. All rights reserved.

Distributed Tag Directories Core Core Core Core TAG Core Valid Mask State TAG Core Valid Mask State TD TD TD TD Core Core Core Core TD TD TD TD Tag Directories track cache-lines in all s 65 Visual and Parallel Computing Group Copyright 2012 Intel Corporation. All rights reserved.

Interleaved Memory Access Core Core GDDR MC GDDR MC Core Core TD TD TD TD Core Core GDDR MC GDDR MC Core Core TD TD TD TD 66 Visual and Parallel Computing Group Copyright 2012 Intel Corporation. All rights reserved.

Interconnect: 2X AD/AK Core Core Core Core BL - 64 Bytes AD TD TD TD TD TD TD TD TD AK AK 2x AD Core Core Core Core BL 64 Bytes 67 Visual and Parallel Computing Group Copyright 2012 Intel Corporation. All rights reserved.

Caches For or Against? 50 45 40 35 30 25 20 15 10 5 0 Relative BW Caches: high data BW low energy per byte of data supplied programmer friendly (coherence just works) Relative BW/Watt Memory BW Cache BW L1 Cache BW Coherent Caches are a key MIC Architecture Advantage Results have been simulated and are provided for informational purposes only. Results were derived using simulations run on an architecture simulator or model. Any difference in system hardware or software design or configuration may affect actual performance. 68 Visual and Parallel Computing Group Copyright 2012 Intel Corporation. All rights reserved.

it is an SMP-on-a-chip running Linux 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 69

vision span from few cores to many cores with consistent models, languages, tools, and techniques 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 70

Source Multicore CPU Compilers Libraries, Parallel Models Multicore CPU Intel MIC architecture coprocessor 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Illustrative example Fortran code using MPI, single threaded originally. Run on Intel Xeon Phi coprocessor natively (no offload). Based on an actual customer example. Shown to illustrate a point about common techniques. Your results may vary! Untuned Performance on Intel Xeon processor Untuned Performance on Intel Xeon Phi coprocessor 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 72

Illustrative example Fortran code using MPI, single threaded originally. Run on Intel Xeon Phi coprocessor natively (no offload). Yeah! Untuned Performance on Intel Xeon processor Untuned Performance on Intel Xeon Phi coprocessor TUNED Performance on Intel Xeon Phi coprocessor 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 73

Illustrative example Fortran code using MPI, single threaded originally. Run on Intel Xeon Phi coprocessor natively (no offload). Common optimization techniques dual benefit Yeah! Untuned Performance on Intel Xeon processor Untuned Performance on Intel Xeon Phi coprocessor TUNED Performance on Intel Xeon Phi coprocessor 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 74

Illustrative example Fortran code using MPI, single threaded originally. Run on Intel Xeon Phi coprocessor natively (no offload). Common optimization techniques dual benefit Untuned Performance on Intel Xeon processor Untuned Performance on Intel Xeon Phi coprocessor TUNED Performance on Intel Xeon processor TUNED Performance on Intel Xeon Phi coprocessor 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 75

Xeon Phi Intro. Source: June 2014 Top 500 - www.top500.org Top 500 (June 2014): Again the #1 system (third time) is a Neo-heterogeneous system (Common Programming Model) (Intel Xeon Processors + Intel Xeon Phi Coprocessor) 2014, 2014, Intel Intel Corporation. Corporation. All rights All reserved. rights reserved. Intel, the Intel Intel, logo, the Intel Intel Inside, logo, Intel Xeon, Intel and Inside, Intel Xeon Cilk, Phi VTune are trademarks, Xeon, and of Intel Xeon Corporation Phi are trademarks in the U.S. and/or of Intel other Corporation countries. *Other in the names U.S. and and/or brands may other be countries. claimed as *Other property names of others. and brands may be claimed as th

Source: June 2014 Intel @ ISC 14 2014, 2014, Intel Intel Corporation. Corporation. All rights All reserved. rights reserved. Intel, the Intel Intel, logo, the Intel Intel Inside, logo, Intel Xeon, Intel and Inside, Intel Xeon Cilk, Phi VTune are trademarks, Xeon, and of Intel Xeon Corporation Phi are trademarks in the U.S. and/or of Intel other Corporation countries. *Other in the names U.S. and and/or brands may other be countries. claimed as *Other property names of others. and brands may be claimed as th

How do I think parallel? 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 78

Parallel Patterns: Overview 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Map Map invokes a function on every element of an index set. Examples: gamma correction and thresholding in images; color space conversions; Monte Carlo sampling; ray tracing. The index set may be abstract or associated with the elements of an array. Corresponds to parallel loop where iterations are independent. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Reduce Reduce combines every element in a collection into one using an associative operator: x+(y+z) = (x+y)+z For example: reduce can be used to find the sum or maximum of an array. Examples: averaging of Monte Carlo samples; convergence testing; image comparison metrics; matrix operations. Vectorization may require that the operator also be commutative: x+y = y+x 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Stencil Stencil applies a function to neighbourhoods of an array. Neighbourhoods are given by set of relative offsets. Boundary conditions need to be considered. Examples: image filtering including convolution, median, anisotropic diffusion 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Pipeline Pipeline uses a sequence of stages that transform a flow of data Some stages may retain state Data can be consumed and produced incrementally: online Examples: image filtering, data compression and decompression, signal processing 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Pipeline Parallelize pipeline by Running different stages in parallel Running multiple copies of stateless stages in parallel Running multiple copies of stateless stages in parallel requires reordering of outputs Need to manage buffering between stages 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

For More Information Structured Parallel Programming Michael McCool Arch Robison James Reinders Uses Cilk Plus and TBB as primary frameworks for examples. Appendices concisely summarize Cilk Plus and TBB. www.parallelbook.com 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Use abstractions!!! 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 86

Choosing a non-proprietary parallel abstraction Use abstractions!!! Avoid direct programming to the low level interfaces (like pthreads). PROGRAM IN TASKS, NOT THREADS Is OpenCL* low level? For HPC YES. 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Choosing a non-proprietary parallel abstraction Choose First (limited functions) 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Choosing a non-proprietary parallel abstraction Choose First (limited functions) Cluster (distributed memory) 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Choosing a non-proprietary parallel abstraction Choose First (limited functions) Cluster (distributed memory) Node (shared memory) 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Intel Threading Building Blocks We asked ourselves: How should C++ be extended? templates / generic programming What do we want to solve? Abstraction with good performance (scalability) Abstraction that steers toward easier (less) debugging Abstraction that is readable 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Generic Parallel Algorithms Efficient scalable way to exploit the power of multi-core without having to start from scratch Concurrent Containers Concurrent access, and a scalable alternative to containers that are externally locked for thread-safety Flow Graph A set of classes to express parallelism via a dependency graph or a data flow graph Task Scheduler Sophisticated engine with a variety of work scheduling techniques that empowers parallel algorithms & the flow graph Thread-safe timers Thread Local Storage Supports infinite number of thread local data Synchronization Primitives Atomic operations, several flavors of mutexes, condition variables Threads OS API wrappers Memory Allocation Per-thread scalable memory manager and false-sharing free allocators 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Choosing a non-proprietary parallel abstraction Choose First (limited functions) Cluster (distributed memory) Node (shared memory) 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Choosing a non-proprietary parallel abstraction Up and coming for C++ (keywords, compilers) Choose First (limited functions) Cluster (distributed memory) Node (shared memory) Because you just have to expect more Affect future C++ standards? (2021?) 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Choosing a non-proprietary parallel abstraction Compare... * * 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Choosing a non-proprietary parallel abstraction Compare... 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

Choosing a non-proprietary parallel abstraction 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th

It s your Forest Increase exposing parallelism. Increase locality of reference. YOUR MISSION 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Cilk, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as th 98

Questions? james.r.reinders@intel.com 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Intel Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.