Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications Results Conclusion
History: First Teraflop Computer Headed toward exascale computing: an exaflop is 1,000 times faster than a petaflop Intel's goal: an exaflop by 2018 First sustained TFLOP: ASCI Red Used by the US government 1997-2005 9,298 Pentium II Xeon processors 104 cabinets, 1,600 ft² 850 kW of power, not including air conditioning
ASCI Red on a Chip: Knights Corner Officially introduced in 2011 ASCI Red on a single die, with far less power and space Sustains a double precision TFLOP Based on the Intel MIC architecture Physical implementation: 22 nm process Used as a coprocessor attached through PCIe Coprocessor is branded as Xeon Phi Runs an embedded Linux OS
High Level Overview Goal: extremely high level of parallelism on one die Many-core... and those cores support SIMD Idea: take the number of transistors in a current high-end Xeon processor (~2.8 billion) and divide by the transistor count of an old Pentium-class core (~40 million max, plus overhead) Result: up to 61 cores on a chip
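The back-of-the-envelope estimate above can be sketched in C. The ~2.8 billion and ~40 million figures come from the slide; the per-core overhead value is an illustrative assumption chosen only to show the arithmetic:

```c
/* Rough core-count estimate: divide a modern Xeon's transistor budget
 * by the budget of a Pentium-class core plus per-core overhead
 * (caches, ring interface, etc.). Overhead figure is illustrative. */
long long estimate_cores(long long chip_transistors,
                         long long core_transistors,
                         long long per_core_overhead) {
    /* integer division: how many simple cores fit in the budget */
    return chip_transistors / (core_transistors + per_core_overhead);
}
```

With 2.8 billion chip transistors, 40 million per core, and an assumed 5 million of overhead, this lands in the same ballpark as the 61-core figure on the slide.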
Main Challenges Selecting core architecture Communication Decide on a topology Synchronization, avoiding bottlenecks Memory Feeding the chip from external memory Cache coherence Programming Should be as easy as possible for end user Heat
Core Architecture Uses the Intel Xeon Phi Coprocessor ISA Adds new vector and floating point instructions and 64-bit support Dedicated 512-bit wide vector floating point unit (VPU) in each core Scalar pipeline based on the Pentium 1 (P54C) Each core connects to the ring interconnect via the Core Ring Interface (CRI), which contains the L2 cache, tag directory, and ring stop
Core Architecture Two instructions per clock cycle: one on the U-pipe, one on the V-pipe
Core Architecture: Instruction Decoder To simplify the design, the decoder was made a two-cycle, fully pipelined unit A thread can issue 2 instructions in a cycle, but not in back-to-back cycles At least two hardware threads are needed for maximum core utilization; 1 thread can reach 50% utilization at best
Core Architecture: Overview
Network Topology Overview Basic topology is a bidirectional ring Intel has mastered the ring, widely used in recent Core and Xeon architectures Usually rings don't scale well: prone to congestion as the network grows Intel's response: wider rings and more rings Xeon Phi will handle "carefully structured workloads"
Network Topology Overview: Diagram (ring interconnect with tag directories; actually many rings)
Many Rings 10 rings total, 5 in each direction Data block ring: 64-byte payload, the most expensive Address ring: addresses plus read / write commands Acknowledgment ring: flow control & coherence messages, the least expensive The cheaper address and acknowledgment rings are doubled (2x) to avoid bottlenecks
Memory Hierarchy: Cache Each core has L1 32 KB instruction, 8-way set associative 32 KB data, 8-way set associative Each core has shared L2 No L3! 512 KB total for L2 instruction & data, 8-way Globally distributed tag directory Helps eliminate hotspots Uniform access pattern Each L2 has a Translation Lookaside Buffer (TLB) to further reduce latency Holds virtual-to-physical memory translations
Core Architecture: Cache Organization Simple MESI protocol, unlike newer Intel architectures, which use a more advanced approach
Cache Misses Core accesses its own L2 cache On miss: Address request sent on AD ring to all tag directories (TDs) If requested block found in other L2: Forwarding request sent to owner Owner sends back block If requested block not found: Tag directory sends forwarding request to memory controller
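The miss flow above can be sketched as a toy model in C. The hash function, table shape, and single-owner simplification are illustrative assumptions for exposition, not Intel's actual directory design:

```c
#define NUM_TDS  64     /* assumed number of distributed tag directories */
#define NO_OWNER (-1)   /* block not cached in any L2: go to memory controller */

/* Each block address hashes to exactly one tag directory (TD), which
 * records which core's L2 (if any) currently owns the block. */
static int td_owner[NUM_TDS];

void td_init(void) {
    for (int i = 0; i < NUM_TDS; i++)
        td_owner[i] = NO_OWNER;
}

/* 64-byte cache blocks: drop the low 6 offset bits, then hash. */
int td_index(unsigned long addr) {
    return (int)((addr >> 6) % NUM_TDS);
}

/* On an L2 miss: ask the TD. Returns the core to forward the block
 * from, or NO_OWNER, meaning the request goes to a memory controller.
 * The requester becomes the new recorded owner either way. */
int lookup_on_miss(unsigned long addr, int requester) {
    int idx   = td_index(addr);
    int owner = td_owner[idx];
    td_owner[idx] = requester;
    return owner;
}
```

The point of distributing the directories by address hash is exactly what the earlier slide claims: lookups spread uniformly over the ring instead of concentrating on one hotspot.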
Memory Hierarchy: External 8 GB of GDDR5 (Graphics DDR) memory on the coprocessor 8 on-die dual-channel GDDR5 memory controllers connected to the ring Much faster than external controllers 5.5 GTransfers / sec Theoretical aggregate memory bandwidth = 352 GB / sec
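The quoted 352 GB/s follows directly from the controller count and transfer rate. A minimal sketch, assuming 2 channels per controller and 4 bytes per transfer per channel (standard 32-bit GDDR5 channels; the decomposition is an assumption consistent with the slide's totals):

```c
/* Theoretical aggregate bandwidth in GB/s:
 * controllers x channels x bytes-per-transfer x GT/s.
 * 8 x 2 x 4 x 5.5 = 352 GB/s, matching the slide. */
double aggregate_bw_gbs(int controllers, int channels_per_controller,
                        int bytes_per_transfer, double gtransfers_per_sec) {
    return controllers * channels_per_controller
         * bytes_per_transfer * gtransfers_per_sec;
}
```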
MIC Vs. GPU MIC is easier to program for Recompile x86 code (assuming it's already written to be parallel) GPU must be rewritten Generally GPUs are designed around highly data parallelizable applications (SIMD) MIC supports SIMD but can be used for other situations
MIC Vs. GPU MIC has higher double precision throughput than GPUs Contemporary GPUs typically deliver double precision throughput on the order of hundreds of GFLOPs Xeon Phi is capable of over 1 TFLOP MIC implementations consume more power Xeon Phi 3100 uses 300 W Xeon Phi 5110P uses 225 W An Nvidia Tesla full height video card uses 170.9 W
MIC Vs. Tilera Tilera has been developing "many core" chips longer Tilera began shipping their many core Tile units in 2008 Intel began manufacturing MIC prototypes in 2010 MIC uses smaller transistor sizes Intel is using their 22 nm process (60 cores) Tilera is using a 28 nm process (100 cores)
MIC Vs. Tilera Tilera focused on "many core" CPUs; Intel MIC is generally positioned as a coprocessor Tile is designed to be more power efficient Tile uses a mesh topology (iMesh) while MIC uses a ring topology Different design strategies: the mesh uses more connections Tile has up to eight 10 Gb Ethernet ports, 320 KB of local memory per core, and four DDR3 interfaces to reduce off-chip DRAM accesses
Programming Must be easily programmable Ideally, little or no porting required Standard C, C++, Fortran Solution: OpenMP & MPI No proprietary language extensions No special tool chain / design flow required Looks like a vanilla x86 cluster to the host Can directly run applications on the coprocessor SSH into the embedded Linux OS
Programming Examples OpenMP Intel SIMD support OpenCL... and many more ways & combinations to program using existing libraries
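A minimal C example of the OpenMP and SIMD support mentioned above: the outer parallelism spreads loop iterations across the many cores, while the simd clause asks the compiler to fill each core's 512-bit VPU lanes. The combined parallel for simd construct requires OpenMP 4.0; compiled without OpenMP, the pragma is ignored and the loop simply runs serially on one core:

```c
/* SAXPY-style loop: y = a*x + y.
 * "parallel for" distributes iterations over cores (and hardware
 * threads); "simd" vectorizes each thread's chunk for the VPU. */
void saxpy(int n, float a, const float *x, float *y) {
#pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

This is the shape of code that recompiles unchanged for Xeon Phi, which is the porting story the slides emphasize.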
Programming Models Many different models are possible: host-only, offload from the host to the MIC, or running natively on the MIC (reached via SSH)
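The offload model can be sketched with Intel's offload pragma (a compiler extension from Intel's Language Extensions for Offload, not standard C). Compilers without offload support ignore the pragma and run the region on the host, which is also what happens when no coprocessor is installed:

```c
/* Sum of squares, offloaded to the coprocessor. The "in" clause ships
 * the array across PCIe; the scalar result is copied back by default.
 * On a non-Intel compiler the offload pragma is ignored (with a
 * warning) and the region runs on the host CPU. */
double sum_squares(const double *v, int n) {
    double s = 0.0;
#pragma offload target(mic) in(v : length(n))
    {
#pragma omp parallel for reduction(+ : s)
        for (int i = 0; i < n; i++)
            s += v[i] * v[i];
    }
    return s;
}
```

This graceful fallback is a key part of the "no special tool chain" claim: the same source builds for host-only, offload, or (recompiled with -mmic) native execution.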
Applications SGI's UV 2000 "Big Brain Machine" Uses 32 Xeon Phi coprocessors Answers questions about the cosmos Intel teamed up with HP for the National Renewable Energy Laboratory (NREL) supercomputer Highly energy efficient Combination of 32 nm Xeon E5 and 22 nm Ivy Bridge processors 600 Xeon Phi coprocessors 1 PFLOP Meteorological simulations
Applications Texas Advanced Computing Center (TACC) Stampede cluster 7th fastest supercomputer as of November 2012 90% used for XSEDE 10% used for open science projects Located at University of Texas at Austin
Results: Xeon Phi vs. Xeon E5 Performance gap increases with problem size One of the main motivations for using Xeon Phi
Results: Performance Per Watt Xeon Phi outperforms competitors in terms of performance per watt
Conclusion World's first sustained double precision teraflop on a single general purpose processor (GPP) Intel's first market release into many core computing Not a GPU, a GPP Generally much easier to program for Efficient at large scale computing
Questions