Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins


Outline
- History & Motivation
- Architecture
  - Core architecture
  - Network topology
  - Memory hierarchy
- Brief comparison to GPU & Tilera
- Programming
- Applications
- Results
- Conclusion

History: First Teraflop Computer
Headed towards exascale computing
- 1,000 times faster than a petaflop
- Intel's goal: an exaflop by 2018
First sustained TFLOP: ASCI Red
- Used by the US government, 1997-2005
- 9,298 Pentium II Xeon processors
- 104 cabinets, 1,600 ft² of floor space
- 850 kW of power, not including air conditioning

ASCI Red on a Chip: Knights Corner
- Officially introduced in 2011
- Roughly ASCI Red's performance on a single die, with far less power and space
- Sustained double-precision TFLOP
- Based on the Intel MIC architecture
- Physical implementation: 22 nm process
- Used as a coprocessor attached over PCIe
- The coprocessor is branded as Xeon Phi
- Runs an embedded Linux OS

High-Level Overview
Goal: an extremely high level of parallelism on one die
- Many cores... and those cores support SIMD
Idea:
- Take the number of transistors in a current high-end Xeon processor (~2.8 billion)
- Divide by the number of transistors in an old Pentium (~40 million max, plus overhead)
- Result: up to 61 cores on a chip

Main Challenges
- Selecting a core architecture
- Communication: deciding on a topology; synchronization and avoiding bottlenecks
- Memory: feeding the chip from external memory; cache coherence
- Programming: should be as easy as possible for the end user
- Heat

Core Architecture
- Uses the Intel Xeon Phi coprocessor ISA, which adds new vector instructions and 64-bit floating-point support
- Dedicated 512-bit-wide vector processing unit (VPU) in each core
- Core design based on the original Pentium (P54C)
- Each core connects to the ring interconnect via the Core Ring Interface (CRI): L2 cache, tag directory, ring stop

Core Architecture
- 2 instructions per clock cycle: one on the U-pipe, one on the V-pipe

Core Architecture: Instruction Decoder
- To simplify the design, the decoder was made a two-cycle unit, but it is fully pipelined
- A thread can issue up to 2 instructions in a cycle, but no thread can issue on back-to-back cycles
- At least two threads are therefore needed for maximum core utilization; a single thread reaches 50% utilization at best

Core Architecture: Overview

Network Topology Overview
- The basic topology is a bidirectional ring
- Intel has extensive experience with rings, which are widely used in recent Core and Xeon architectures
- Rings usually don't scale well: they are prone to congestion as the network grows
- Intel's response: wider rings, and more of them
- Xeon Phi is intended to handle "carefully structured workloads"

Network Topology Overview: Diagram (tag directories; actually many rings)

Many Rings
10 rings in total: each of the following exists in both ring directions (2x)
- Data block ring (BL): 64 bytes wide; the most expensive
- Address ring (AD): carries addresses plus read/write commands
- Acknowledgment ring (AK): flow control and coherence messages; the least expensive

Memory Hierarchy: Cache
Each core has a private L1:
- 32 KB instruction, 8-way set associative
- 32 KB data, 8-way set associative
Each core has a 512 KB slice of the shared L2 (instruction & data, 8-way); there is no L3
- A globally distributed tag directory keeps the L2 slices coherent, helps eliminate hotspots, and gives a uniform access pattern
- Each L2 has a Translation Lookaside Buffer (TLB) holding virtual-to-physical translations to further reduce latency

Core Architecture: Cache Organization
- Uses the simple MESI protocol, unlike newer Intel architectures, which use a more advanced approach

Cache Misses
- A core first accesses its own L2 cache
- On a miss, an address request is sent on the AD ring to the tag directories (TDs)
- If the requested block is found in another core's L2: a forwarding request is sent to the owner, which sends the block back
- If the block is not found in any L2: the tag directory forwards the request to a memory controller

Memory Hierarchy: External
- 8 GB of GDDR5 (graphics DDR) memory on the coprocessor
- 8 on-die, dual-channel GDDR5 memory controllers, connected to the ring; much faster than external controllers
- 5.5 GT/s per channel
- Theoretical aggregate memory bandwidth: 352 GB/s

MIC vs. GPU
- MIC is easier to program for: existing x86 code can simply be recompiled (assuming it is already written to be parallel), while GPU code must be rewritten
- GPUs are generally designed around highly data-parallel (SIMD) applications; MIC supports SIMD but can also be used in other situations

MIC vs. GPU
- MIC has higher double-precision throughput: contemporary GPUs typically deliver on the order of hundreds of GFLOPS, while Xeon Phi can sustain over 1 TFLOPS
- MIC implementations consume more power: the Xeon Phi 3100 uses 300 W and the Xeon Phi 5110P uses 225 W, while a full-height Nvidia Tesla card uses 170.9 W

MIC vs. Tilera
- Tilera has been developing many-core chips longer: Tilera began shipping its many-core Tile units in 2008, while Intel began manufacturing MIC prototypes in 2010
- MIC uses smaller transistors: Intel uses its 22 nm process (60 cores), while Tilera uses a 28 nm process (100 cores)

MIC vs. Tilera
- Tilera focuses on many-core CPUs; Intel MIC is generally positioned as a coprocessor
- Tile is designed to be more power efficient
- Tile uses a mesh topology (iMesh) while MIC uses a ring topology: different design strategies, and a mesh uses more connections
- Tile has up to eight 10 Gb Ethernet ports, 320 KB of local memory per core, and four DDR3 interfaces to reduce DRAM accesses

Programming
- Must be easily programmable; ideally little or no porting required
- Standard C, C++, and Fortran
- Solution: OpenMP & MPI (e.g., Open MPI)
- No proprietary language extensions; no special tool chain or design flow required
- Looks like a vanilla x86 cluster to the host
- Applications can run directly on the coprocessor: SSH into the embedded Linux OS

Programming Examples
- OpenMP
- Intel SIMD support
- OpenCL
- ... and many more ways & combinations to program using existing libraries

Programming Models
- Many different models are possible
- Processing can even be offloaded entirely to the MIC, e.g., by launching a native run over SSH

Applications
SGI's UV 2000 "Big Brain Machine"
- Uses 32 Xeon Phi coprocessors
- Answers questions about the cosmos
Teamed up with HP for the National Renewable Energy Laboratory (NREL) supercomputer
- Highly energy efficient
- Combination of 32 nm Xeon E5 and 22 nm Ivy Bridge processors
- 600 Xeon Phi coprocessors
- 1 PFLOP
- Meteorological simulations

Applications
Texas Advanced Computing Center (TACC) Stampede cluster
- 7th fastest supercomputer as of November 2012
- 90% used for XSEDE, 10% for open science projects
- Located at the University of Texas at Austin

Results: Xeon Phi vs. Xeon E5
- The performance gap grows with problem size: one of the main motivations for using Xeon Phi

Results: Performance per Watt
- Xeon Phi outperforms its competitors in performance per watt

Conclusion
- The world's first sustained double-precision TeraFLOP on a single general-purpose processor (GPP)
- Intel's first market release in many-core computing
- Not a GPU but a GPP: generally much easier to program for
- Efficient at large-scale computing

Questions

Sources
[1] R. Smith. (2012, Jun. 19). Intel Announces Xeon Phi Family of Co-Processors - MIC Goes Retail [Online]. Available: http://www.anandtech.com/show/6017/intel-announces-xeon-phi-family-of-coprocessors-mic-goes-retail
[2] Intel Xeon Phi Coprocessor System Software Developers Guide, May 2013.
[3] G. Chrysos, "Intel Xeon Phi Coprocessor - the Architecture," in Hot Chips Conference, Cupertino, CA, 2012.
[4] J. Makare. (2012, Dec. 20). Advanced Intel Xeon Phi Coprocessor Workshop Memory [Online]. Available: http://software.intel.com/en-us/videos/advanced-intel-xeon-phi-coprocessor-workshop-memory-part-1-basics
[5] V. Mudryk. (2012, Nov. 16). Performance Intel Xeon Phi Coprocessor [Online]. Available: http://scientificgpgpu.blogspot.com/2012/11/performance-intel-xeon-phi-coprocessor.html
[6] R. Johnson. (2012, Oct. 8). NCSA Scientist Backs MICs over GPUs [Online]. Available: http://goparallel.sourceforge.net/ncsa-scientist-backs-mics-gpus/
[7] P. Glaskowsky. (2009, Nov. 2). Tilera's balancing act: 100 cores vs. market realities [Online]. Available: http://news.cnet.com/8301-13512_3-10388025-23.html
[8] A. Shah. (2013, Feb. 19). Tilera developing chip with more than 100-plus cores [Online]. Available: http://www.computerworld.com/s/article/9236926/tilera_developing_chip_with_more_than_100_plus_cores
[9] Texas Advanced Computing Center. STAMPEDE [Online]. Available: http://www.tacc.utexas.edu/resources/hpc/stampede
[10] S. Curtis. Intel and HP to build world's most efficient supercomputer [Online]. Available: http://news.techworld.com/data-centre/3380329/intel-and-hp-to-build-worlds-most-efficient-supercomputer/
[11] R. Johnson. (2012, Nov. 29). Hawking's 'Big Brain' Powered by Intel MIC [Online]. Available: http://goparallel.sourceforge.net/hawkings-big-brain-powered-intel-mic/