Sony/Toshiba/IBM (STI) CELL Processor. Scientific Computing for Engineers: Spring 2008
- Michael Malone
Nec Hercules Contra Plures
- Chip's performance is related to its cross section: same area, 2x performance (Pollack's Rule)
- Cray's two oxen
- Cray's 1024 chickens

Three Performance-Limiting Walls
- Power Wall: Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.
- Frequency Wall: Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.
- Memory Wall: On multi-gigahertz symmetric multiprocessors, even those with integrated memory controllers, latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.

Conventional Processors Don't Cut It
"... shallower pipelines with in-order execution have proven to be the most area and energy efficient. [...] we believe the efficient building blocks of future architectures are likely to be simple, modestly pipelined (5-9 stages) processors, floating point units, vector, and SIMD processing elements. Note that these constraints fly in the face of the conventional wisdom of simplifying parallel programming by using the largest processors available. [...]"
David Patterson, [...] Kathy Yelick, The Landscape of Parallel Computing Research: A View from Berkeley
Future architectures:
- shallow pipelines
- in-order execution
- SIMD units

Cache Memories Don't Cut It
"When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available."
H. Peter Hofstee, Cell Broadband Engine Architecture from 20,000 feet
Conventional processor as a bucket brigade:
- a hundred people
- a few buckets

Cache Memories Don't Cut It
"Their (multicore) low cost does not guarantee their effective use in HPC. This relates back to the data-intensive nature of most HPC applications and the sharing of already limited bandwidth to memory. The stream benchmark performance of Intel's new Woodcrest dual core processor illustrates this point. [...] Much effort was put into improving Woodcrest's memory subsystem, which offers a total of over 21 GBs/sec on nodes with two sockets and four cores. Yet, four-threaded runs of the memory intensive Stream benchmark on such nodes that I have seen extract no more than 35 percent of the available bandwidth from the Woodcrest's memory subsystem."
Richard B. Walsh, New Processor Options for HPC
Conventional processor:
- latency-limited bandwidth

Conventional Memories Don't Cut It
"More flexible or even reconfigurable data coherency schemes will be needed to leverage the improved bandwidth and reduced latency. An example might be large, on-chip, caches that can flexibly adapt between private or shared configurations. In addition, real-time embedded applications prefer more direct control over the memory hierarchy, and so could benefit from on-chip storage configured as software-managed scratchpad memory. [...]"
David Patterson, [...] Kathy Yelick, The Landscape of Parallel Computing Research: A View from Berkeley
Future architectures:
- reconfigurable coherency
- software-managed scratchpad memory

CELL
- multi-core
- in-order execution
- shallow pipeline
- SIMD
- scratchpad memory
- [...] GFLOPS single precision
- [...] GB/s internal bandwidth
- 25.6 GB/s memory bandwidth
- 3.2 GHz
- 90 nm SOI
- 234 million transistors
Transistor counts for comparison:
- 165 million: Xbox [...]
- [...] million: Itanium 2 (2002)
- 1,700 million: Dual-Core Itanium 2 (2006), 12.8 GFLOPS (single or double)

CELL
- PPE: Power Processing Element
- SPE: Synergistic Processing Element
- SPU: Synergistic Processing Unit
- LS: Local Store
- MFC: Memory Flow Controller
- EIB: Element Interconnect Bus
- MIC: Memory Interface Controller

Power Processing Element (PPE)
- Power 970 architecture compliant
- 2-way Simultaneous Multithreading (SMT)
- 32 KB level 1 instruction cache
- 32 KB level 1 data cache
- 512 KB level 2 cache
- VMX (AltiVec) with [...]-bit vector registers
Standard FPU:
- fully pipelined DP with FMA
- 6.4 Gflop/s DP at 3.2 GHz
AltiVec:
- no DP
- 4-way fully pipelined SP with FMA
- 25.6 Gflop/s SP at 3.2 GHz

Synergistic Processing Elements (SPEs)
- 128-bit SIMD
- 128 vector registers
- 256 KB instruction and data local memory
- Memory Flow Controller (MFC)
SIMD widths:
- 16-way SIMD (8-bit integer)
- 8-way SIMD (16-bit integer)
- 4-way SIMD (32-bit integer, single prec. FP)
- 2-way SIMD (64-bit double prec. FP)
Peak per SPE:
- 25.6 Gflop/s SP at 3.2 GHz (fully pipelined)
- 1.8 Gflop/s DP at 3.2 GHz (7 cycle latency)
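
The SIMD widths listed for the SPE all follow from dividing the 128-bit register width by the element width. A minimal sketch of that arithmetic in C; the function names `simd_lanes` and `print_spe_lanes` are illustrative and not part of any Cell SDK:

```c
#include <stdio.h>

/* Number of SIMD lanes in a register of `register_bits` bits holding
   elements of `element_bits` bits each. Returns 0 for invalid input. */
unsigned simd_lanes(unsigned register_bits, unsigned element_bits)
{
    if (element_bits == 0 || register_bits % element_bits != 0)
        return 0;
    return register_bits / element_bits;
}

/* Print the lane count for each element width named on the slide,
   assuming the 128-bit SPE register width given there. */
void print_spe_lanes(void)
{
    static const unsigned widths[] = { 8, 16, 32, 64 };
    for (size_t i = 0; i < sizeof widths / sizeof widths[0]; ++i)
        printf("%2u-bit elements: %2u-way SIMD\n",
               widths[i], simd_lanes(128, widths[i]));
}
```

Calling `print_spe_lanes()` lists the same four element widths as the slide, one per line.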

SPE
- SIMD architecture
- two in-order (dual issue) pipelines
- large register file (128 128-bit registers)
- 256 KB of scratchpad memory (Local Store)
- Memory Flow Controller to DMA code and data from system memory

Element Interconnect Bus (EIB)
- 4 16B-wide unidirectional channels
- half the system clock (1.6 GHz)
- [...] GB/s bandwidth (arbitration)

EIB
- 16 byte channels
- 4 unidirectional rings
- token based arbitration
- half system clock (1.6 GHz)
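
As a rough check of the EIB figures, the peak rate of a single 16-byte channel can be worked out from the channel width and the bus clock. The sketch below is only that per-channel arithmetic; the helper names are hypothetical. It does not model the aggregate EIB bandwidth, which depends on arbitration and on how many transfers are in flight at once.

```c
/* Bus clock derived from the system clock; the slides give a ratio of 1/2. */
double eib_clock_ghz(double system_clock_ghz)
{
    return system_clock_ghz / 2.0;
}

/* Peak bandwidth in GB/s of one channel that moves `bytes_per_cycle`
   bytes on every cycle of a clock running at `clock_ghz`. */
double channel_bandwidth_gbs(double bytes_per_cycle, double clock_ghz)
{
    return bytes_per_cycle * clock_ghz;
}
```

With the slide values, `channel_bandwidth_gbs(16.0, eib_clock_ghz(3.2))` gives 16 B x 1.6 GHz = 25.6 GB/s.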

Main Memory System
- Memory Interface Controller (MIC)
- external dual XDR
- 3.2 GHz max effective frequency (max 400 MHz, Octal Data Rate)
- each: 8 banks, max 256 MB
- total: 16 banks, max 512 MB
- 25.6 GB/s
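
The 3.2 GHz effective frequency is the 400 MHz base clock multiplied by the eight transfers per clock of octal data rate. A small sketch of that relation follows, with hypothetical helper names. The second function infers how many bytes must move per transfer to reach the quoted 25.6 GB/s; that width is derived here and is not stated on the slide.

```c
/* Effective transfer rate in GHz for a base clock in MHz that
   performs `transfers_per_clock` transfers each cycle. */
double effective_rate_ghz(double base_clock_mhz, int transfers_per_clock)
{
    return base_clock_mhz * transfers_per_clock / 1000.0;
}

/* Bytes moved per transfer (summed over all channels) that a given
   bandwidth implies at a given transfer rate. Returns 0 for a zero rate. */
double bytes_per_transfer(double bandwidth_gbs, double rate_ghz)
{
    return rate_ghz > 0.0 ? bandwidth_gbs / rate_ghz : 0.0;
}
```

With the slide values, `effective_rate_ghz(400.0, 8)` is 3.2, and `bytes_per_transfer(25.6, 3.2)` comes to 8 bytes across the two XDR channels combined.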

CELL Performance: Double Precision
In double precision, every seven cycles each SPE can:
- process a two-element vector,
- perform two operations on each element.
In one cycle the FPU on the PPE can:
- process one element,
- perform two operations on the element.
Peak:
- SPEs: 8 x 2 x 2 x 3.2 GHz / 7 = 14.6 Gflop/s
- PPE: 2 x 3.2 GHz = 6.4 Gflop/s
- Total: 21.0 Gflop/s

CELL Performance: Single Precision
In single precision, in one cycle each SPE can:
- process a four-element vector,
- perform two operations on each element.
In one cycle the VMX on the PPE can:
- process a four-element vector,
- perform two operations on each element.
Peak:
- SPEs: 8 x 4 x 2 x 3.2 GHz = 204.8 Gflop/s
- PPE: 4 x 2 x 3.2 GHz = 25.6 Gflop/s
- Total: 230.4 Gflop/s
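
The double- and single-precision peak figures follow one formula: units x vector elements x operations per element x clock, divided by the number of cycles between vector operations. A minimal sketch, assuming only the parameters stated on the two slides; `peak_gflops` is an illustrative name, not an SDK function:

```c
/* Peak Gflop/s of `units` identical processing elements, each of which
   starts one vector operation every `cycles_per_op` cycles, on `lanes`
   elements, with `flops_per_lane` floating-point operations per element
   (2 for a fused multiply-add). Returns 0 for a non-positive interval. */
double peak_gflops(int units, int lanes, int flops_per_lane,
                   double clock_ghz, int cycles_per_op)
{
    if (cycles_per_op <= 0)
        return 0.0;
    return units * lanes * flops_per_lane * clock_ghz / cycles_per_op;
}
```

For example, `peak_gflops(8, 2, 2, 3.2, 7)` reproduces the 8-SPE double-precision figure (about 14.6), `peak_gflops(1, 1, 2, 3.2, 1)` the PPE FPU (6.4), `peak_gflops(8, 4, 2, 3.2, 1)` the 8-SPE single-precision figure (204.8), and `peak_gflops(1, 4, 2, 3.2, 1)` the PPE VMX unit (25.6).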

CELL Performance: Bandwidth
Bandwidth at 3.2 GHz clock:
- each SPU: 25.6 GB/s
- main memory: 25.6 GB/s
- EIB: [...] GB/s
For comparison:
- 25.6 Gflop/s per SPU
- 204.8 Gflop/s for 8 SPUs
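
One way to read the comparison on this slide is as a bytes-per-flop ratio: how many bytes of bandwidth are available for each peak single-precision floating-point operation. The sketch below is an interpretation, not something the slide states; `bytes_per_flop` is a hypothetical helper, and the 8-SPU figure is taken as 8 x 25.6 Gflop/s.

```c
/* Bytes of bandwidth available per peak floating-point operation.
   Returns 0 when the peak rate is not positive. */
double bytes_per_flop(double bandwidth_gbs, double gflops)
{
    return gflops > 0.0 ? bandwidth_gbs / gflops : 0.0;
}
```

With the slide values, `bytes_per_flop(25.6, 25.6)` gives 1 byte per flop for one SPU against its own 25.6 GB/s, while `bytes_per_flop(25.6, 8 * 25.6)` gives 0.125 byte per flop when all eight SPUs share main memory.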

CELL Performance: Historical Perspective
- Connection Machine CM-5 (512 CPUs): 512 x 128 Mflop/s = 65 Gflop/s DP
- PlayStation 3 (4 units): 4 x 17 Gflop/s = 68 Gflop/s DP

Performance Comparison: Double Precision
- 1.6 GHz Dual-Core Itanium 2: 1.6 x 4 x 2 = 12.8 Gflop/s
- 3.2 GHz CELL BE (SPEs only): 3.2 x 8 x 2 x 2 / 7 = 14.6 Gflop/s

Performance Comparison: Single Precision
- 1.6 GHz Dual-Core Itanium 2: 1.6 x 4 x 2 = 12.8 Gflop/s
- 3.2 GHz SPE: 3.2 x 8 = 25.6 Gflop/s
- One SPE = 2 Dual-Core Itanium 2s
- 3.2 GHz CELL BE (SPEs only): 3.2 x 8 x 8 = 204.8 Gflop/s
- One CBE = 16 Dual-Core Itanium 2s

SDK - Platforms
- Linux on x86, x86-64, PPC64, CELL BE
- RPM-based distribution
- Fedora Core recommended
- CELL plugins for Eclipse available

SDK - Compilers
PPU and SPU are different ISAs, so they use different sets of compilers.
- PPU: GCC, G++, GFORTRAN, XLC, XLC++, XLF; 32-bit and 64-bit; OpenMP
- SPU: GCC, G++, XLC, XLC++; 32-bit
- Assembler and linker are common to GNU and XL compilers.
- XL compilers require the GNU tool chain for cross-assembling and cross-linking for both the PPE and the SPE.

SDK - Samples
/opt/ibm/cell-sdk/prototype
- /lib: FFT, matrix
- /samples: overlays
- /workloads: FFT using ALF, MatMul...
- also listed: game math, simple DMA, audio resample, tutorial samples, curves and surfaces..., software managed cache...

Compilation and Linking
- SPU objects are embedded in PPU objects
- SPU code and PPU code are linked into one executable
- SDK provides standard makefile structure
/opt/ibm/cell-sdk/prototype
- make.env: compilation options
- make.footer: build rules (do not modify)
- make.header: definitions (do not modify)
- README_build_env.txt: makefile howto
To understand the build process:
- run the default makefile
- see what it does

Hello World - libspe2
- spe_context_run() is a blocking call
- create one POSIX thread for each SPE thread
PPE (outline of the call sequence):
    #include <libspe2.h>
    #include <pthread.h>

    int main() {
        spe_context_create()
        pthread_create()
        pthread_join()
        spe_context_destroy()
    }

    void* spe_thread() {
        spe_image_open()
        spe_program_load()
        spe_context_run()
        pthread_exit()
    }
SPE:
    int main() {
        //...
    }
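
Because the run call blocks until the SPE program stops, the PPE side parks one POSIX thread on each SPE context and joins them all at the end. The sketch below shows only that threading pattern using plain pthreads, so it builds without the Cell SDK. `run_blocking_context` is a hypothetical stand-in for the blocking `spe_context_run()` call, and `launch_workers`, `struct worker`, and `MAX_WORKERS` are illustrative names, not libspe2 API.

```c
#include <pthread.h>

#define MAX_WORKERS 8

/* Hypothetical stand-in for a blocking call such as spe_context_run():
   it returns only when the offloaded "program" has finished. */
static int run_blocking_context(int id)
{
    return id * id;   /* pretend result of the offloaded work */
}

struct worker {
    int id;       /* which context this thread drives */
    int result;   /* value produced by the blocking call */
};

/* Thread body: each POSIX thread waits on one blocking run call. */
static void *worker_thread(void *arg)
{
    struct worker *w = arg;
    w->result = run_blocking_context(w->id);
    return NULL;
}

/* Start n workers, wait for all of them, and copy their results out.
   Returns the number of workers that ran, or -1 if n is out of range. */
int launch_workers(int n, int results[])
{
    pthread_t threads[MAX_WORKERS];
    struct worker workers[MAX_WORKERS];
    int started = 0;

    if (n < 0 || n > MAX_WORKERS)
        return -1;

    for (int i = 0; i < n; ++i) {
        workers[i].id = i;
        workers[i].result = -1;
        if (pthread_create(&threads[i], NULL, worker_thread, &workers[i]) != 0)
            break;                      /* stop launching on failure */
        ++started;
    }
    for (int i = 0; i < started; ++i) {
        pthread_join(threads[i], NULL); /* wait for the blocking call */
        results[i] = workers[i].result;
    }
    return started;
}
```

In a real libspe2 program the stand-in would be replaced by loading the SPE image into a context and running it, with the context created before the thread starts and destroyed after the join, as the outline on the slide indicates.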

SPU Context Switching
- quiesce the SPE
- harvest (reset) an SPE
- save privileged and problem state to CSA
- save low 16 KB of LS to CSA
- load and start SPE context-save sequence
- save GPRs and channel state to CSA
- save LSCSA to CSA
- save 240 KB of LS to CSA
- load and start SPE context-restore sequence
- copy LSCSA from CSA to LS
- restore 240 KB of LS from CSA
- restore GPRs and channel state from LSCSA
- restore privileged state from CSA
- restore remaining problem state from CSA
- restore 16 KB of LS from CSA

CELL Programming Models / Environments
- Gedae (commercial product)

CELL Resources
- developerWorks CELL Resource Center
- Barcelona Supercomputing Center, Computer Sciences, Linux on CELL
- Power.org CELL Developer Corner
- The CELL Project at IBM Research
- CELL BE at IBM alphaWorks
- GA Tech CELL Workshop
- ICL CELL Summit
- CellPerformance
- CELL Broadband Engine Programming Handbook
- CELL Broadband Engine Programming Tutorial
- SPE Runtime Management Library
- C/C++ Language Extensions for CELL Broadband Engine Architecture
- SPU Assembly Language Specification
- Synergistic Processor Unit Instruction Set Architecture
More information1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola
1. Microprocessor Architectures 1.1 Intel 1.2 Motorola 1.1 Intel The Early Intel Microprocessors The first microprocessor to appear in the market was the Intel 4004, a 4-bit data bus device. This device
More informationHakam Zaidan Stephen Moore
Hakam Zaidan Stephen Moore Outline Vector Architectures Properties Applications History Westinghouse Solomon ILLIAC IV CDC STAR 100 Cray 1 Other Cray Vector Machines Vector Machines Today Introduction
More information45-year CPU Evolution: 1 Law -2 Equations
4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there
More informationLeveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD
Leveraging OpenSPARC ESA Round Table 2006 on Next Generation Microprocessors for Space Applications G.Furano, L.Messina TEC- OpenSPARC T1 The T1 is a new-from-the-ground-up SPARC microprocessor implementation
More informationShared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP
Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor
More informationCS 152, Spring 2011 Section 10
CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel
More informationCell Broadband Engine Processor: Motivation, Architecture,Programming
Cell Broadband Engine Processor: Motivation, Architecture,Programming H. Peter Hofstee, Ph. D. Cell Chief Scientist and Chief Architect, Cell Synergistic Processor IBM Systems and Technology Group SCEI/Sony
More informationMIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer
MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware
More informationHow to build a Megacore microprocessor. by Andreas Olofsson (MULTIPROG WORKSHOP 2017)
How to build a Megacore microprocessor by Andreas Olofsson (MULTIPROG WORKSHOP 2017) 1 Disclaimers 2 This presentation summarizes work done by Adapteva from 2008-2016. Statements and opinions are my own
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationThe Pennsylvania State University. The Graduate School. College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE
The Pennsylvania State University The Graduate School College of Engineering A NEURAL NETWORK BASED CLASSIFIER ON THE CELL BROADBAND ENGINE A Thesis in Electrical Engineering by Srijith Rajamohan 2009
More informationHands-on - DMA Transfer Using get Buffer
IBM Systems & Technology Group Cell/Quasar Ecosystem & Solutions Enablement Hands-on - DMA Transfer Using get Buffer Cell Programming Workshop Cell/Quasar Ecosystem & Solutions Enablement 1 Class Objectives
More informationDepartment of Computer Science. Chair of Computer Architecture. Diploma Thesis. Execution of SPE code in an Opteron-Cell/B.E.
Department of Computer Science Chair of Computer Architecture Diploma Thesis Execution of SPE code in an Opteron-Cell/B.E. hybrid system Andreas Heinig Chemnitz, March 11, 2008 Supervisor: Advisor: Prof.
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationParallel and Distributed Programming Introduction. Kenjiro Taura
Parallel and Distributed Programming Introduction Kenjiro Taura 1 / 21 Contents 1 Why Parallel Programming? 2 What Parallel Machines Look Like, and Where Performance Come From? 3 How to Program Parallel
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More information