COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors



Edgar Gabriel
Fall 2018

References

Intel Larrabee:
[1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: Larrabee: a many-core x86 architecture for visual computing, ACM Trans. Graph., Vol. 27, No. 3 (August 2008), pp.

IBM Cell processor:
[2] C. R. Johns, D. A. Brokenshire: Introduction to the Cell Broadband Engine Architecture, IBM Journal of Research and Development, vol. 51, no. 5, pp.
[3] M. Kistler, M. Perrone, F. Petrini: Cell Multiprocessor Communication Network: Built for Speed, IEEE Micro, vol. 26, no. 3, pp. http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf

Larrabee Motivation
- Comparison of two architectures with the same number of transistors
- The simplified in-order core reaches half the single-stream performance
- 20x increase in vector throughput for multi-stream execution (160 vs. 8 per clock)

                     Out-of-order design    In-order design
CPU cores            2 out-of-order         10 in-order
Instruction issue    [...]                  [...]
VPU per core         4-wide SSE             16-wide
L2 cache size        4 MB                   4 MB
Single stream        4 per clock            2 per clock
Vector throughput    8 per clock            160 per clock

Larrabee Overview
- Many-core visual computing architecture
- Based on x86 CPU cores
  - Extended version of the regular x86 instruction set
  - Supports subroutines and page faulting
- The number of x86 cores can vary depending on the implementation and processor version
- Fixed-function units for texture filtering
  - Other graphics operations such as rasterization or post-shader blending are done in software
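The vector throughput row of the comparison can be reproduced as core count times VPU width. The following is a minimal sketch of that arithmetic; the function name is made up for illustration, and the core count of 10 for the in-order design is an assumption derived from the table itself (160 per clock divided by 16 lanes).

```python
def vector_throughput(cores: int, vpu_width: int) -> int:
    """Vector elements processed per clock, summed over all cores."""
    return cores * vpu_width


# Out-of-order design: 2 cores, each with a 4-wide SSE unit.
out_of_order = vector_throughput(cores=2, vpu_width=4)

# In-order design: 16-wide VPU per core.
# Assumption: 10 cores, inferred from 160 per clock / 16 lanes.
in_order = vector_throughput(cores=10, vpu_width=16)

ratio = in_order // out_of_order
print(out_of_order, in_order, ratio)  # prints: 8 160 20
```

Under these assumptions the in-order design gives up half of the single-stream issue rate in exchange for a 20-fold gain in peak vector throughput.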

Larrabee Overview (II)
[Figure; image source: [1]]

Overview of a Larrabee Core (I)
[Figure; image source: [1]]

Overview of a Larrabee Core (I)
- x86 core derived from the Pentium processor
  - No out-of-order execution
  - Standard Pentium instruction set with the addition of 64-bit instructions
  - Instructions for prefetching data into the L1 and L2 caches
  - Support for 4 simultaneous threads, with separate registers for each thread
- Each core is augmented with a wide vector processing unit (VPU)
- 32 KB L1 instruction cache, 32 KB L1 data cache
- 256 KB local subset of the L2 cache
- Coherent L2 cache across all cores

Vector Processing Unit in Larrabee
- 16-wide VPU executing integer, single-precision and double-precision floating-point operations
- The VPU supports gather-scatter operations
  - The 16 elements can be loaded from, or stored to, up to 16 different addresses
- Support for predicated instructions using a mask control register (if-then-else statements)
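The gather-scatter and mask-register bullets can be illustrated with a small scalar emulation. This is only a sketch of the semantics, not of the Larrabee instruction set: the function names `gather`, `scatter` and `predicated_select` are hypothetical, and plain Python lists stand in for memory and vector registers.

```python
from typing import List, Sequence

VPU_WIDTH = 16  # lanes per vector register, as stated on the slide


def gather(memory: Sequence[float], addresses: Sequence[int]) -> List[float]:
    """Load one element per lane, each lane from its own address."""
    assert len(addresses) == VPU_WIDTH
    return [memory[a] for a in addresses]


def scatter(memory: List[float], addresses: Sequence[int],
            values: Sequence[float], mask: Sequence[bool]) -> None:
    """Store each lane to its own address, but only where the mask is set."""
    for address, value, enabled in zip(addresses, values, mask):
        if enabled:
            memory[address] = value


def predicated_select(mask: Sequence[bool], then_vals: Sequence[float],
                      else_vals: Sequence[float]) -> List[float]:
    """Per-lane if-then-else: both branches exist, the mask picks one."""
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]


memory = [float(i) for i in range(64)]
addresses = [3 * lane for lane in range(VPU_WIDTH)]  # 16 non-contiguous addresses

x = gather(memory, addresses)
mask = [value % 2 == 0 for value in x]  # lanes where the "if" condition holds
# if x is even: x / 2, else: 3 * x + 1, evaluated for all lanes at once
result = predicated_select(mask, [v / 2 for v in x], [3 * v + 1 for v in x])
scatter(memory, addresses, result, [True] * VPU_WIDTH)
```

The point of the mask is that a data-dependent branch does not split the vector: every lane runs the same instruction stream and the mask decides which result each lane keeps.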

Inter-Processor Ring Network
- Bi-directional ring network
- 512 bits wide per direction
- Routing decisions are made before a message is injected into the network

Larrabee Programming Models
- Most applications can be executed without modification due to the full support of the x86 instruction set
- Support for POSIX threads to create multiple threads
  - API extended by thread affinity parameters
- Recompiling code with Larrabee's native compiler automatically generates code that uses the VPUs
- Alternative parallel approaches
  - Intel Threading Building Blocks
  - Larrabee-specific OpenMP directives
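The ring network bullet about routing before injection can be sketched as follows: on a bi-directional ring the sender can compute the hop count in both directions up front and commit to the shorter one. The function name `route`, the tie-breaking rule and the number of ring stops are assumptions made for illustration; the slides do not specify them.

```python
from typing import Tuple


def route(src: int, dst: int, num_stops: int) -> Tuple[str, int]:
    """Choose a ring direction before injecting a message.

    Returns the direction and the number of hops. Ties go clockwise,
    which is an arbitrary choice made for this sketch.
    """
    if src == dst:
        return ("none", 0)
    clockwise = (dst - src) % num_stops
    counter_clockwise = (src - dst) % num_stops
    if clockwise <= counter_clockwise:
        return ("clockwise", clockwise)
    return ("counter-clockwise", counter_clockwise)


# Hypothetical ring with 8 stops.
print(route(0, 3, 8))  # ('clockwise', 3)
print(route(0, 6, 8))  # ('counter-clockwise', 2)
```

Because the decision depends only on the source and destination positions, intermediate ring stops need no routing logic beyond forwarding or accepting a message.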


Larrabee Performance
[Figure; image source: [1]]

Intel Xeon Phi Processor
- First generation of the Intel MIC (Many Integrated Cores) architecture
- 60 cores / 1.0 GHz
- 512-bit wide vector engine
- 32 KB L1 instruction and data caches, 512 KB L2 cache (per core)
- Up to 1 TFLOPS double-precision performance
- 8 GB GDDR5 memory and 320 GB/s bandwidth
- Standard PCIe x16 form factor
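The "up to 1 TFLOPS" figure can be checked roughly from the other numbers on the slide. The sketch below assumes that each 64-bit lane of the 512-bit vector engine completes one fused multiply-add (two floating-point operations) per cycle; that factor of two is an assumption not stated on the slide.

```python
cores = 60
clock_ghz = 1.0
vector_bits = 512

dp_lanes = vector_bits // 64      # 8 double-precision lanes per vector
flops_per_lane_per_cycle = 2      # assumption: one fused multiply-add per lane per cycle

peak_gflops = cores * clock_ghz * dp_lanes * flops_per_lane_per_cycle
print(peak_gflops)  # 960.0
```

Under these assumptions the peak comes out at 960 GFLOPS, which is consistent with the rounded figure of 1 TFLOPS.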


IBM Cell Overview (I)
- Cell Broadband Engine Architecture (CBEA) defined by a consortium of IBM, Sony, and Toshiba
- Originally targeting the multimedia industry
  - E.g. PlayStation 3, Toshiba HDTV, etc.
- Also sold by IBM as regular compute blades
  - IBM QS20, QS21, QS22
- Main idea: heterogeneous microprocessor consisting of one (or more) general-purpose processor elements (PPE) and one (or more) synergistic processor elements (SPEs)

Cell Architecture
[Block diagram; image source: [2]]
- Two generations available so far:
  - Cell BE:
    - 204.8 GFLOPS single-precision peak performance
    - 14.6 GFLOPS double-precision peak performance
  - PowerXCell 8i (2008):
    - [...] GFLOPS single-precision peak performance
    - [...] GFLOPS double-precision peak performance
- Both have 1 PPE and 8 SPEs
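The 204.8 GFLOPS and 14.6 GFLOPS figures quoted for the Cell BE can be reproduced from the SPE count with a few assumptions that are not on the slides: a 3.2 GHz clock, one fused multiply-add per lane per instruction, 128-bit vectors (four single-precision or two double-precision lanes), and a double-precision unit on the first generation that issues only one vector instruction every 7 cycles. Treat this as a plausibility check rather than a definitive derivation.

```python
spes = 8                  # from the slide: 8 SPEs
clock_ghz = 3.2           # assumption: clock frequency is not given on the slide
flops_per_lane = 2        # assumption: fused multiply-add counts as 2 operations

sp_lanes = 128 // 32      # four single-precision lanes per 128-bit vector
dp_lanes = 128 // 64      # two double-precision lanes per 128-bit vector
dp_issue_interval = 7     # assumption: one DP vector instruction every 7 cycles

sp_peak = spes * clock_ghz * sp_lanes * flops_per_lane
dp_peak = spes * clock_ghz * dp_lanes * flops_per_lane / dp_issue_interval

print(round(sp_peak, 1), round(dp_peak, 1))  # 204.8 14.6
```

Both results count the SPEs only; the PPE's own floating-point capability is left out of this estimate.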

General Purpose Processor (PPE)
- Based on the IBM PowerPC processor
- Supports multiple simultaneous operating environments (virtualization)
  - E.g. can execute an instance of a real-time operating system and an instance of a non-real-time operating system
- Performs management and application control functions

Synergistic Processor Element (SPE)
- SIMD processor used for offloading compute-intensive, data-parallel operations from the PPE
- Each SPE has its own local storage and can access data only from the local storage
  - Current versions of the Cell processors: 256 KB local storage
- The local storage is connected to the main memory through a Memory Flow Controller (MFC)
  - The MFC moves data from main memory to local storage or between two SPEs
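Since an SPE can only touch its local storage, a computation on a large array has to be broken into chunks that are moved in, processed, and moved back out. The following sketch models that pattern in plain Python. The class `LocalStore`, its `dma_get` and `dma_put` methods, and the 64 KB chunk size are all hypothetical stand-ins; they are not the MFC command set or the Cell SDK API.

```python
LOCAL_STORE_BYTES = 256 * 1024  # local storage size from the slide


class LocalStore:
    """Toy model of an SPE's local storage filled by explicit transfers."""

    def __init__(self, capacity: int = LOCAL_STORE_BYTES) -> None:
        self.capacity = capacity
        self.data = bytearray()

    def dma_get(self, main_memory: bytearray, offset: int, size: int) -> None:
        """Copy a chunk from main memory into the local storage."""
        if size > self.capacity:
            raise ValueError("transfer does not fit into the local storage")
        self.data = bytearray(main_memory[offset:offset + size])

    def dma_put(self, main_memory: bytearray, offset: int) -> None:
        """Copy the local chunk back to main memory."""
        main_memory[offset:offset + len(self.data)] = self.data


def spe_increment_all(main_memory: bytearray, chunk: int = 64 * 1024) -> int:
    """Add 1 (mod 256) to every byte, one chunk at a time.

    Returns the number of get/compute/put round trips that were needed.
    """
    local = LocalStore()
    round_trips = 0
    for offset in range(0, len(main_memory), chunk):
        size = min(chunk, len(main_memory) - offset)
        local.dma_get(main_memory, offset, size)
        # The compute step only ever sees data in the local storage.
        local.data = bytearray((b + 1) % 256 for b in local.data)
        local.dma_put(main_memory, offset)
        round_trips += 1
    return round_trips


main_memory = bytearray(512 * 1024)  # twice the size of the local storage
round_trips = spe_increment_all(main_memory)
print(round_trips)  # 8
```

The chunk size is deliberately smaller than the full 256 KB, on the assumption that the local storage also has to hold the program code and additional buffers.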

MFC commands
[Figure; image source: [2]]

Synergistic Processor Element (SPE) (II)
- Each SPE has 128 registers
- Each register is 128 bits wide and can hold
  - Sixteen 8-bit integers, or
  - Eight 16-bit integers, or
  - Four 32-bit integers or single-precision floating-point numbers, or
  - Two 64-bit integers or double-precision floating-point numbers
- Most instructions supported by the synergistic processor unit operate on all elements in a register -> SIMD
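The different interpretations of one 128-bit register can be shown with Python's standard `struct` module, which reinterprets the same 16 bytes under different element layouts. This is only an illustration of the data layout; big-endian byte order is used here on the assumption that it matches the Cell, and the helper `simd_add_u32` is a made-up name.

```python
import struct

register = bytes(range(16))  # 16 bytes = one 128-bit register

views = {
    "16 x 8-bit int": struct.unpack(">16B", register),
    "8 x 16-bit int": struct.unpack(">8H", register),
    "4 x 32-bit int": struct.unpack(">4I", register),
    "4 x float32": struct.unpack(">4f", register),
    "2 x 64-bit int": struct.unpack(">2Q", register),
    "2 x float64": struct.unpack(">2d", register),
}


def simd_add_u32(a: bytes, b: bytes) -> bytes:
    """Add two registers lane by lane as four unsigned 32-bit integers."""
    lanes_a = struct.unpack(">4I", a)
    lanes_b = struct.unpack(">4I", b)
    sums = [(x + y) & 0xFFFFFFFF for x, y in zip(lanes_a, lanes_b)]  # wrap around per lane
    return struct.pack(">4I", *sums)


ones = struct.pack(">4I", 1, 1, 1, 1)
total = struct.unpack(">4I", simd_add_u32(register, ones))

for name, lanes in views.items():
    print(name, len(lanes))
```

A single instruction such as the lane-wise add above touches every element of the register, which is what makes the SPE a SIMD processor.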


Simplified representation of a current Cell processor
[Figure; image source: [3]]

Element Interconnect Bus
- PPE and SPEs communicate through the Element Interconnect Bus
- Contains a shared command bus
  - Sets up end-to-end transactions
  - Used for coherence protocols
- Point-to-point data interconnect
  - Four 16-byte-wide rings: two used for clockwise data transfers, two for counter-clockwise data transfers
  - Each ring transfers 128-byte packets (= cache block size of an SPE)
- Communication costs between two SPEs can vary between 1 hop and 6 hops
- Overall bandwidth: [...] GB/s
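Two of the numbers on the Element Interconnect Bus slide follow from simple arithmetic. A 128-byte packet on a 16-byte-wide ring needs eight transfers, and the 1 to 6 hop range is what a bi-directional ring yields if it connects 12 units. The count of 12 ring stops is an assumption (PPE, eight SPEs, a memory controller and two I/O interfaces); the slide itself does not state it.

```python
RING_WIDTH_BYTES = 16   # from the slide
PACKET_BYTES = 128      # from the slide
NUM_STOPS = 12          # assumption: PPE, 8 SPEs, memory controller, 2 I/O interfaces

# Number of 16-byte transfers needed to move one packet over a ring.
beats_per_packet = PACKET_BYTES // RING_WIDTH_BYTES


def hops(src: int, dst: int, num_stops: int = NUM_STOPS) -> int:
    """Shortest distance between two stops when both directions are available."""
    distance = (dst - src) % num_stops
    return min(distance, num_stops - distance)


distances = {hops(0, dst) for dst in range(1, NUM_STOPS)}
print(beats_per_packet, min(distances), max(distances))  # 8 1 6
```

With both ring directions available, no transfer has to travel more than halfway around the ring, which is where the upper bound of 6 hops comes from under the 12-stop assumption.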

Comparison IBM Cell and Intel Larrabee
- Both use a large number of small and simple cores
- Both use a high-bandwidth ring bus to communicate between the cores
- Intel Larrabee is homogeneous, while IBM Cell is a heterogeneous processor (difference between PPE and SPE)
- IBM Cell requires data to be moved explicitly to the local store, while Larrabee can address any memory area
- Programs for the Cell have to be written taking into account the limited amount of memory available to an SPE
