Optimizing Data Sharing and Address Translation for the Cell BE Heterogeneous CMP

1 Optimizing Data Sharing and Address Translation for the Cell BE Heterogeneous CMP
Michael Gschwind
IBM T.J. Watson Research Center

2 Cell Design Goals
- Provide the platform for the future of computing
  - 10× performance of desktop systems
- Computing density as main challenge
  - Dramatically increase performance per X
  - X = Area, Power, Volume, Cost
- Single-core designs offer diminishing returns on investment
  - In power, area, design complexity and verification cost
2 M. Gschwind, Optimizing Data Sharing and Address Translation for the Cell BE

3 Cell B.E. Architecture
- Heterogeneous multicore system architecture
  - Power Processor Element (PPE) for control tasks
  - Synergistic Processor Elements (SPEs) for data-intensive processing
- Synergistic Processor Element (SPE) consists of
  - Synergistic Processor Unit (SPU)
  - Synergistic Memory Flow Control (MFC)
[Figure: block diagram. Eight SPEs (each an SPU with SXU and LS, plus an MFC) and the PPE (PPU with PXU, L1 and L2) attach to the EIB (up to 96B/cycle), together with the MIC (dual XDR) and the BIC (FlexIO). Link labels: 16B/cycle, 16B/cycle (2x), 32B/cycle.]

4 Heterogeneous Multiprocessors
- HMPs are emerging as a new class of processors
  - Central processors + architected accelerators
- Beyond SoC
  - Tighter system integration between processors and accelerators
  - Multiprogramming and execution of user-defined code
  - Use of accelerators by user-level applications
- Desired attributes
  - Efficient data sharing
  - Efficient CPU/accelerator communication
  - Process isolation

5 Integrated Execution
- View heterogeneous architecture as the platform for execution
  - Execute on different processor elements
  - Share data across processor elements
- Single (common) process container under operating system
  - Heterogeneous threads (LWPs) for scheduling execution on different processor cores
  - Share common effective (virtual) memory space

6 Heterogeneous Multi-threading and OS Management for Integrated Executables
- Heterogeneous multi-threading model
  - PPE threads
  - SPE threads
- SPE context fully managed
- OS assignment of SPE threads
  - Programmer directed using affinity mask
[Figure: application source and libraries are built into an integrated executable containing PPE object files and SPE object files. Cell Broadband Engine Linux runs PPE threads (T1, T2) on the physical PPE and SPE threads on the physical SPEs.]

7 Synergistic Optimization of the Cell B.E.
- Integrate SPE into established SMP architecture
- Define a scalable system architecture
  - From consumer electronics to supercomputers
[Figure: diagram relating target applications, programming model, and architecture definition to the Cell B.E. design.]

8 Data access for accelerators
- No memory access
  - Input and output via device registers
- Use real (physical) addresses to access system memory
  - Similar to traditional device models
- Use effective (virtual) addresses to access system memory

9 Programming models with accelerator real addressing
- Applications run without address translation in RA space
  - Similar to deeply embedded applications
  - No paging possible
- EA-space multiprogramming with real address accelerators
  - Super-trusted applications
  - Provide safety fencing mechanism for real address accesses
- Accelerators managed by operating system
  - OS provides accelerated services
  - Parameters checked on OS entry
  - High entry cost reduces acceleration benefits
  - Trusted/secure code in accelerator
[Figure labels: Programming/Performance, Security]

10 Multiprogramming with real address accelerators
- Applications need effective (virtual) to real translations
  - How to obtain address translations?
  - Where to store real addresses?
  - How to revoke address translations?
- Once a real address has been given to a user program, the page is pinned and cannot be moved or paged out
  - Limits OS memory allocation and paging flexibility
- Data non-contiguous at page boundaries
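The last point on this slide can be illustrated with a minimal sketch. It assumes a 4 KiB page size and a hypothetical `page_table` dictionary mapping effective page numbers to real frame numbers; neither comes from the slides. It shows how a transfer that is linear in effective address space breaks into separate real-address segments wherever it crosses a page boundary.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def split_at_page_boundaries(ea, length, page_table, page_size=PAGE_SIZE):
    """Translate [ea, ea + length) into a list of (real_address, size) segments.

    page_table is a hypothetical dict: effective page number -> real frame number.
    """
    segments = []
    end = ea + length
    while ea < end:
        vpn, offset = divmod(ea, page_size)
        if vpn not in page_table:
            raise KeyError(f"no translation for effective page {vpn:#x}")
        # Never let one segment run past the end of the current page.
        chunk = min(page_size - offset, end - ea)
        segments.append((page_table[vpn] * page_size + offset, chunk))
        ea += chunk
    return segments


# Hypothetical mapping: two adjacent effective pages land in distant real frames.
page_table = {0x10: 0x7A, 0x11: 0x03}
segments = split_at_page_boundaries(0x10F00, 0x200, page_table)
for ra, size in segments:
    print(f"RA {ra:#x}, {size} bytes")
```

A program that holds real addresses has to carry out this splitting itself, and every frame it names must stay pinned for as long as the address is in use.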

11 Virtual-address accelerators
- Accelerators have access to page translation
  - Architect MMU for accelerator
  - System memory used by accelerator can be paged out
- Virtual addressing provides pointer sharing
  - Same pointer representation for same data
  - Linear addressing in effective memory space across page boundaries
- Virtual addressing provides process isolation
  - Security in multiprogramming/multiuser environment

12 Accelerator MMU options
- Minimized hardware complexity
  - Small accelerator-specific virtual/physical lookup table
  - If address is not in the lookup table, notify central processor
  - Essentially, a software-managed accelerator TLB
- Full system page tables
  - Accelerator has full-scale MMU which uses the system page table located in memory
  - Performs hardware page table walks and participates in translation coherence actions
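The first option can be sketched as a small lookup table with a miss callback. This is an illustrative model only: the class name, the LRU replacement policy, the 4 KiB page size, and the `miss_handler` callback (standing in for the notification to the central processor) are all assumptions, not details from the slides.

```python
from collections import OrderedDict


class SoftwareManagedTLB:
    """Toy model of a small accelerator-side translation table.

    miss_handler stands in for notifying the central processor; it takes an
    effective page number and returns the real frame number.
    """

    def __init__(self, capacity, miss_handler, page_size=4096):
        self.capacity = capacity
        self.miss_handler = miss_handler
        self.page_size = page_size
        self.entries = OrderedDict()  # effective page -> real frame, LRU order
        self.misses = 0

    def translate(self, ea):
        vpn, offset = divmod(ea, self.page_size)
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            frame = self.miss_handler(vpn)  # serviced by the central processor
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[vpn] = frame
        return self.entries[vpn] * self.page_size + offset


# Hypothetical system page table kept by the central processor.
system_page_table = {0: 5, 1: 9, 2: 1}
tlb = SoftwareManagedTLB(capacity=2, miss_handler=system_page_table.__getitem__)
results = [tlb.translate(ea) for ea in (0x0010, 0x1010, 0x0020, 0x2000, 0x1010)]
print([hex(ra) for ra in results], "misses:", tlb.misses)
```

Every miss in this model is a call into the central processor. With the second option the accelerator's own MMU walks the page table, so no such call is needed.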

13 TLB software-management becomes serial bottleneck
- Cost avoided by system-wide page tables in Cell B.E.
[Figure: chart of normalized execution time in seconds against the number of SPEs, split into compute and address management.]

14 Enabling efficient translation management
- Offer system protection and data security with memory translation across system
  - Central CPU and accelerators
- Enable distributed address translation
  - Avoid serial bottleneck in central CPU
  - Address translation parallelized across accelerator MMUs
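The serial-bottleneck argument can be made concrete with a deliberately simple timing model. The two functions and all numbers below are illustrative assumptions and are not the measurements behind the slides: compute work is taken to divide evenly across SPEs, while translation work either stays serialized on the central CPU or is divided across accelerator MMUs.

```python
def time_central_translation(compute, translation, n_spes):
    """Compute scales with SPE count; translation is serialized on the central CPU."""
    return compute / n_spes + translation


def time_distributed_translation(compute, translation, n_spes):
    """Both compute and translation work are spread across the SPEs."""
    return (compute + translation) / n_spes


# Hypothetical workload: 100 time units of compute, 10 units of translation work.
for n in (1, 2, 4, 8):
    central = time_central_translation(100, 10, n)
    distributed = time_distributed_translation(100, 10, n)
    print(f"{n} SPEs: central {central:.2f}, distributed {distributed:.2f}")
```

Under these assumptions the serialized translation term stays constant as SPEs are added, so it makes up a growing share of total execution time. In the distributed case it shrinks along with the compute time.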

15 Architecture optimization for data sharing
- DMA data transfer into and out of local store
- Managing data coherence between PPE data caches and DMA transfers
  - Software managed: explicitly evict data by software in program
  - Hardware managed: DMA requests perform coherence interrogates
- (See scenarios in the paper)

16 Coherence management in software
- Software coherence cost scales with area synchronized
  - Need to flush all cache lines from cache
- Insert cache flush instructions into code
  - Power ISA dcbf instruction sequence
- Performance modeling
  - Execute code with explicit flush instructions on Cell BE
  - Not needed for functional correctness, but reflects costs
  - Compare execution time with unmodified code
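To see why the cost scales with the area synchronized, the sketch below enumerates the cache-line addresses a software flush loop would have to touch, one flush instruction per line. The 128-byte line size and the function name are assumptions made for illustration; the sketch only counts lines and does not issue any flush instructions.

```python
CACHE_LINE = 128  # bytes; assumed cache line size for this illustration


def lines_to_flush(start, length, line_size=CACHE_LINE):
    """Return the line-aligned addresses covering [start, start + length)."""
    if length <= 0:
        return []
    first = start - start % line_size
    last_byte = start + length - 1
    last = last_byte - last_byte % line_size
    return list(range(first, last + line_size, line_size))


# A small, accurately known region needs only a few flushes.
small = lines_to_flush(0x1040, 0x100)
print([hex(a) for a in small])

# A worst-case assumption over a hypothetical 16 MiB buffer needs one per line.
worst_case = len(lines_to_flush(0, 16 * 1024 * 1024))
print(worst_case, "flush instructions")
```

The count grows linearly with the size of the region, so a programmer who cannot bound the shared area tightly pays for every line in the worst-case range.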

17 Coherence cost for software managed coherence
- Cost avoided by cache-coherent MFC block transfers
[Figure: chart of execution time against the number of SPEs, split into compute and flush time.]

18 Usage scenario impact on performance
- Area to be synchronized hard to determine
  - Programmers will use worst-case assumptions
- FFT16M is a best-case scenario
  - Accessed data ranges accurately determined
  - High data reuse

19 Green Computing: TOP500 & green500.org
- Cell BE offers superior power efficiency
  - 488 MFLOPS/W (DP)
- Power-efficient design enables Petaflop computing
- Cell BE powers
  - world's first and only Petaflop system
  - world's top 3 power-efficient high-performance systems

20 Cell BE: a Synergistic System Architecture
- Cell BE is not a collection of different processors, but a synergistic whole
- Accelerator integration important focus in design process
  - Data sharing between central processor and accelerators
- Memory address translation performed in parallel in accelerators
  - Avoid serial bottleneck in address translation
- Cell BE implements hardware DMA coherence
  - Avoid programmer burden and programming errors
- Software was a driver in the design of the architecture
  - Optimize for processing-intensive applications

21 Cell BE
- Cell BE is the result of a partnership between SCEI/Sony, Toshiba, and IBM
- Cell BE represents the work of more than 400 people starting in 2000 and a design investment of about $400M
Thank you!

22 Copyright International Business Machines Corporation 2005,6,7,8. All Rights Reserved. Printed in the United States October

The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both: IBM, IBM Logo, Power Architecture. Other company, product and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made. THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.