EE 7722: Manycore Processors


phi 1  Manycore Processors

Definition

Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.

Manycore Accelerator: [Definition only for this class.] A manycore chip in which each core has a vector FP unit.

These notes describe Intel's manycore accelerator, Phi.

phi 2  Intel's Manycore GPU

Larrabee: Initially a project to develop a GPU. Plans for a GPU product were dropped, but the design was retargeted at accelerator use.

Knights Corner: Code name for Intel's planned manycore accelerator product. Like Larrabee, but not intended for graphics use.

Knights Ferry: Name of Intel's manycore accelerator development package, including a chip. Specs closely match Larrabee.

phi 3

Xeon Phi: Name of Intel's manycore product.

Knights Landing: Version of Phi meant to operate without a companion CPU.

These notes will sometimes use the term Larrabee to refer to all of these designs.

phi 4  Overview

Chip consists of many cores.
  Xeon Phi 7120A: 61 cores, 1.1 GHz. MSRP $4235.
  Xeon Phi 7210: 64 cores, 1.3 GHz. (Released 2017.) $5000.
  Xeon Phi 7250: 68 cores, 1.4 GHz. (Released 2017.)

A Phi core is comparable to a CPU core or a CUDA multiprocessor, not to a CUDA core.

Each core:
  A 2-way superscalar, statically scheduled (in-order) Intel 64 processor.
  Four-way simultaneously multithreaded (hyperthreaded).
  Can issue instructions to a 512-bit vector unit.
  Complete enough to host an operating system.
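The large thread count (cores times four hardware threads each) is normally exploited with a threading model such as OpenMP. Below is a minimal, hypothetical sketch (not from the notes) of a native-mode program; the names and the reported thread count are illustrative only.

    /* Hypothetical sketch (not from the notes): one OpenMP thread per Phi
       hardware thread works on a slice of a simple vectorizable loop. */
    #include <omp.h>
    #include <stdio.h>

    #define N (1 << 20)
    static float a[N], b[N], c[N];

    int main(void)
    {
        /* On a 61-core Phi this would typically report up to 4 * 61 threads. */
        printf("Hardware threads available: %d\n", omp_get_max_threads());

        #pragma omp parallel for        /* spread iterations over all threads */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];         /* inner work can use the 512-bit vector unit */

        printf("c[0] = %f\n", c[0]);
        return 0;
    }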

phi 5  Storage Per Core

Context for four threads.
32 KiB of L1 data cache.
32 KiB of L1 instruction cache.
Share of coherent L2 cache: 512 KiB (through model 72xx, 2017).

The L2 cache is coherent: stores to cached data are managed reasonably (short definition). Coherence is more precisely defined as meaning that the observed order of stores to a given location is the same for all processors.

A core can access any other core's L2 cache, but access to its own core's share is fastest.

phi 6  Interconnection

Cores are interconnected by a dual-ring network.
Two rings, one in each direction.
The ring network carries coherence messages, among other traffic.
Data width is 512 bits.

phi 7  Instruction Issue

Two instructions per cycle. At most one of these can be a vector instruction.

Instruction Features

Each vector instruction can access up to two registers and one memory operand, similar to NVIDIA GPUs. (CPU arithmetic instructions cannot directly access memory; this applies to both RISC instructions and IA-32 µops.)

One vector operand can be swizzled (lanes rearranged).
A vector memory operand can have a type conversion applied.
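As an illustration (a hypothetical sketch, not from the notes), a simple loop like the one below lets the compiler fold the load of x[i] into the vector multiply-add itself, as in the vfmadd231pd example with a memory source operand shown on phi 16.

    /* Hypothetical sketch: a loop the compiler can map onto Phi vector
       instructions that take one memory source operand (compare the
       vfmadd231pd 2368(%rsi,%rcx,8), ... example on phi 16), so no
       separate load instruction is needed for x[i]. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];
    }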

phi 8  Important Instruction Set Features

Memory Instructions

Prefetch: Phi cannot rely on massive multithreading to hide latency, so software prefetch matters.

Cache placement hints.

Vector loads whose addresses come from a vector register (gather). Such a load would still need one cycle per cache line touched.
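A hedged sketch of these features using AVX-512-style C intrinsics (Knights Corner's intrinsics differ in places, so treat the exact names as assumptions): a gather whose addresses come from a vector register, plus an explicit software prefetch.

    /* Hedged sketch, AVX-512-style intrinsics (KNC names differ in places).
       Gathers 16 floats per iteration using indices held in a vector
       register, and prefetches index data for a later iteration; the
       prefetch distance of 64 is arbitrary. */
    #include <immintrin.h>

    float gather_sum(const float *table, const int *idx, int n)
    {
        __m512 sum = _mm512_setzero_ps();
        for (int i = 0; i + 16 <= n; i += 16) {
            _mm_prefetch((const char *)&idx[i + 64], _MM_HINT_T0); /* hint only */
            __m512i vidx = _mm512_loadu_si512(&idx[i]);            /* 16 indices */
            __m512  vals = _mm512_i32gather_ps(vidx, table, 4);    /* gather 16 floats */
            sum = _mm512_add_ps(sum, vals);
        }
        return _mm512_reduce_add_ps(sum);
    }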

phi 9  Instruction Operand Features

Swizzling: lanes of one source operand can be rearranged.

Source operand conversion: a memory source of an arithmetic instruction can be converted from integer types to float. This supports efficient use of memory bandwidth (narrow data in memory, full-width values in registers).
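As an illustration (hypothetical, not from the notes), this is the pattern the load-conversion feature targets: data kept in memory as 8-bit integers and widened to float only at the point of use, cutting memory traffic to a quarter of what float storage would need.

    /* Hypothetical sketch: x[] is stored as 8-bit integers; the widening to
       float can be folded into the vector load on Phi, so memory bandwidth
       is spent on 1-byte rather than 4-byte elements. */
    float dot_u8(const unsigned char *x, const float *w, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += w[i] * (float)x[i];
        return s;
    }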

phi 10  Phi Overview

Phi Chip: Multiple cores.

Each Phi Core:
  Has four thread contexts.
  Is two-way superscalar.
  Has 32 KiB of L1 data cache and 32 KiB of L1 instruction cache.

phi 11  Phi Core ISA

Instruction Set
A subset of Intel 64 (x86-64).
Phi-specific vector instructions.

phi 12  Phi Register Sets

Most Intel 64 registers.
  There are 16 general-purpose (integer) registers.
  Register names: RAX-RDX, RDI, RSI, RBP, RSP, R8-R15.

Phi-specific vector registers.
  There are 32 vector registers, each 512 bits (64 bytes).
  Assembler names: zmm0 to zmm31.

Phi-specific mask registers.
  There are 8 mask registers, each 16 bits.
  Assembler names: k0 to k7.

phi 13  Phi Vector Arithmetic Instructions

Common Features
Two or three source operands.
One source operand can come from memory.
A mask controls which lanes of the destination are written.

phi 14  Two-Register Format

    vaddps zmm2, S_{f32}(zmm3/m_t), zmm1{k1}
           Src1  Src2               Dest

Warning: assembler conventions vary; sometimes the destination appears first.

Argument Types
Dest: Vector register zmm1 with mask register k1. Only the lanes of zmm1 corresponding to set bits of k1 are written.
Src1: Vector register zmm2.
Src2: Vector register, or a (possibly converted) vector from a register or memory.
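For comparison, the same masked-write behavior expressed with C intrinsics (a hedged sketch using AVX-512-style names; KNC's intrinsic names are similar but treat them as assumptions):

    /* Hedged sketch with AVX-512-style intrinsics: a masked vector add.
       Only lanes whose mask bits are set receive a + b; the remaining
       lanes keep the old value of dst, mirroring zmm1{k1} above. */
    #include <immintrin.h>

    __m512 masked_add(__m512 dst, __m512 a, __m512 b)
    {
        __mmask16 k = (__mmask16)0x00FF;         /* write only the low 8 lanes */
        return _mm512_mask_add_ps(dst, k, a, b); /* ~ vaddps with {k1} masking */
    }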

phi 15  Instruction Naming Convention

    vaddps zmm2, S_{f32}(zmm3/m_t), zmm1{k1}
           Src1  Src2               Dest

The mnemonic vaddps decomposes as:
  v   = Vector: uses the vector (z) registers.
  add = The operation.
  p   = Packed: operate on vector elements individually.
  s   = Single-precision.

phi 16  Examples

    vaddpd  %zmm7, %zmm8, %zmm1                   # zmm1 = zmm7 + zmm8
    vaddpd  %zmm2{badc}, %zmm2, %zmm3             # first source swizzled (badc lane order)
    vaddpd  112(%rsp){1to8}, %zmm0, %zmm1{%k1}    #434.4 c35   memory operand broadcast to 8 lanes, masked write
    vfmadd231pd  %zmm7, %zmm6, %zmm3              #44.4 c77    fused multiply-add
    vfmadd231pd  2368(%rsi,%rcx,8), %zmm2, %zmm8  #52.4 c33    FMA with a memory source operand

phi 17  Phi Pipeline

Pipeline Features

Two-way superscalar.
Most vector instructions have a four-cycle latency and a one-cycle initiation interval.
Vector instructions don't raise exceptions.
The four-cycle latency assumes the dependence does not feed the shuffle/conversion stage.

Phi pipelines:
  Scalar: six stages: PPF PF D0 D1 D2 E WB.
  Vector: 11 stages (see the next slide).
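Hypothetical sketch (not from the notes) of why the four-cycle latency and one-cycle initiation interval matter: a reduction with a single accumulator can issue a dependent add only every four cycles, so roughly four independent accumulators (each of which would be a full 512-bit vector on Phi) are needed to keep the vector unit busy.

    /* Hypothetical sketch: hiding a ~4-cycle latency in a reduction by
       keeping four independent dependence chains, so a new (independent)
       add can issue every cycle instead of waiting on the previous one. */
    double sum4(const double *a, int n)      /* assumes n is a multiple of 4 */
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }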

phi 18

    PPF PF D0 D1 D2 E VC1 VC2 V1 V2 V3 V4 WV

D0: Decode Intel 64 instruction prefixes.
D1: Decode the remainder of the instruction.
D2: More decode, register read. (Microcode control.)
E: Execute.
VC1, VC2: Shuffle, load conversion.
V1-V4: FP execute.

phi 19  Prefetch Buffer

Sort of an L0 instruction cache.
Fed in the PPF stage.
Each entry in the buffer is 32 B (256 b).
Each thread has four entries.

phi 20 phi 20 Phi Programming Execution Modes Native Mode: Program starts and runs on Phi. Offload Mode: Program starts on CPU and executes code on Phi as needed. phi 20 EE 7722 Lecture Transparency. Formatted 11:06, 26 April 2017 from lsli08.phi. phi 20