QDP++/Chroma on IBM PowerXCell 8i Processor

1 QDP++/Chroma on IBM PowerXCell 8i Processor
Frank Winter (QCDSF Collaboration)
University Regensburg, NIC, DESY-Zeuthen
STRONGnet 2010 Conference: Hadron Physics in Lattice QCD
Paphos, Cyprus, August 24 to 27, 2010
Frank Winter (DESY/University Regensburg) QDP++/Chroma on IBM PowerXCell 8i Processor STRONGnet 2010 Conference 1 / 18

2 Outline
  1 Motivation
  2 QDP++/Chroma
  3 IBM PowerXCell 8i Processor
  4 Implementation
  5 Benchmark Results
  6 Conclusion and Outlook

3 Motivation: QDP++/Chroma on QPACE
QPACE
  New type of massively parallel, scalable supercomputer
  200 TFlops aggregate performance (double precision)
  Based on the IBM PowerXCell 8i Processor
Chroma
  Very successful, versatile lattice QCD application suite

4 Motivation: Generic and Retargetable
Addresses QDP++
Generic, i.e. includes all functions
  Chroma as application
Retargetable
  Code generator is retargetable
  Right now just one target architecture: IBM PowerXCell 8i Processor
  Includes:
    Single-core processors
    Symmetric multiprocessing processors
    Heterogeneous multi-core processors

5 SciDAC Software Components Involved in Chroma
Main developers: B. Joó and R. Edwards
Chroma (main application)
QDP++ (QCD Data Parallel)
QMP (QCD Message Passing)
QMT (QCD Multi-Threading)
Several specialized kernels:
  BAGEL (P. Boyle)
  QUDA (M. Clark)

6 Chroma (Main Application)
First CVS stamp 2002, in production since 2003
151 citations (Aug 2010)
Spectroscopy, decay constant, nucleon form factor, structure function moment, ...
Actions: Wilson, domain wall, overlap fermion operators, ...
Numerous inverters: MR, CG, BiCGStab, ...
Builds on top of QDP++

7 QDP++ (QCD Data Parallel)
C++ class library for lattice field theory, basis for Chroma
Lattice-wide datatypes
  QCD tensor structure
  Nested template instantiation
Data-parallel operations
  PETE (Portable Expression Template Engine) eliminates lattice temporaries
  C++ operator overloading
Hides architectural details from the user
Makes code highly portable and generic
But drawback: moderate performance
High performance requires specialized/optimized code
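
The expression-template idea behind "eliminates lattice temporaries" can be illustrated with a minimal sketch. This is not PETE or the QDP++ class hierarchy: `LatticeReal` and `AddExpr` are hypothetical names, and only addition of real-valued fields is handled. The point is that `operator+` builds a lightweight expression node, and the assignment evaluates the whole expression in one loop over the sites without allocating an intermediate lattice object.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical expression node: represents lhs + rhs without evaluating it.
template <class L, class R>
struct AddExpr {
    const L& lhs;
    const R& rhs;
    // Element i of the sum is computed on demand.
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// Hypothetical lattice-wide type holding one real number per site.
class LatticeReal {
public:
    explicit LatticeReal(std::size_t sites, double v = 0.0) : data_(sites, v) {}

    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // Assigning an expression runs a single fused loop over all sites.
    template <class L, class R>
    LatticeReal& operator=(const AddExpr<L, R>& e) {
        for (std::size_t i = 0; i < data_.size(); ++i) data_[i] = e[i];
        return *this;
    }

private:
    std::vector<double> data_;
};

// Operator overloads only build expression nodes; no site data is touched here.
inline AddExpr<LatticeReal, LatticeReal>
operator+(const LatticeReal& a, const LatticeReal& b) {
    return {a, b};
}

template <class L, class R>
AddExpr<AddExpr<L, R>, LatticeReal>
operator+(const AddExpr<L, R>& a, const LatticeReal& b) {
    return {a, b};
}
```

With these definitions, `d = a + b + c;` for four `LatticeReal` objects of equal size performs one pass over the sites. The expression nodes hold references, so they must be consumed within the same full expression, as they are in this assignment.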

8 Optimized Kernels by Specialization
Chroma level: accesses raw data of QDP++ lattice datatypes directly
  BAGEL Wilson DSlash
  BAGEL Clover
  QUDA
QDP++ level
  SSE kernels
  BAGEL QDP
Optimizations for
  Clusters (Myrinet, InfiniBand, ...)
  QCDOC, BlueGene L/P
  Cray XT 3/4/5/6 systems
All by template specialization
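
As a rough illustration of "all by template specialization", the sketch below shows a generic kernel template and a full specialization that replaces it for one target. The tag types `GenericTarget` and `TunedTarget` and the `Axpy` kernel are hypothetical and do not correspond to actual QDP++ or BAGEL interfaces; the two-way unrolled loop merely stands in for hand-optimized code.

```cpp
#include <cstddef>

// Hypothetical target tags selecting which implementation is compiled in.
struct GenericTarget {};
struct TunedTarget {};

// Generic fallback: y[i] += a * x[i], written as a plain loop.
template <class Target>
struct Axpy {
    static void apply(double a, const double* x, double* y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
    }
};

// Full specialization for the tuned target. The compiler picks this version
// whenever Axpy<TunedTarget> is used; calling code does not change.
template <>
struct Axpy<TunedTarget> {
    static void apply(double a, const double* x, double* y, std::size_t n) {
        std::size_t i = 0;
        // Unrolled by two as a stand-in for a hand-optimized kernel.
        for (; i + 2 <= n; i += 2) {
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
        }
        // Remainder element when n is odd.
        for (; i < n; ++i) y[i] += a * x[i];
    }
};
```

Both `Axpy<GenericTarget>::apply(...)` and `Axpy<TunedTarget>::apply(...)` produce the same result; the selection happens at compile time, so the generic code path carries no run-time dispatch cost.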

9 Symmetric Multiprocessing Processors (SMP)
Computer industry goes multi-core
  IBM Power7
  Intel Nehalem
  AMD Opteron/Barcelona
  Cray XT4/5/6
JLab's answer to SMP: QMT (QCD Multi-Threading)
  Performance gain over pure QMP/MPI
  But: SMP (symmetric multiprocessing) required
  Homogeneous multi-core architecture required
Latest trend: heterogeneous multi-core acceleration
  IBM PowerXCell 8i Processor
  Larrabee
  CUDA
  Not supported

10 Hardware Overview: IBM PowerXCell 8i Processor
Cell Broadband Engine Architecture (CBEA)
  1 PowerPC Processing Element (PPE)
  8 Synergistic Processing Elements (SPE)
  Element Interconnect Bus (EIB), shared by all processing elements
  Memory Interface Controller (MIC), I/O Interface (IOIF)

11 Hardware Overview: Synergistic Processing Element (SPE)
Synergistic Processing Unit (SPU)
  RISC processor with 128-bit SIMD organization
  256 KB Local Storage (LS) for instructions and data
  128-entry 128-bit register file
  2 instruction pipelines feature dual issue
  Floating-point pipeline supports fused multiply-add/sub
Memory Flow Controller (MFC)
  Interfaces the LS to main memory
  DMA controller transfers data in parallel to SPU execution
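
Because the DMA controller can move data while the SPU computes, kernels on the SPE are typically organized around double buffering: one local buffer is processed while the next chunk is fetched into the other. The sketch below shows only the buffer rotation and runs on any host. It is an assumption-laden stand-in: `fetch` is a hypothetical helper that uses a blocking `std::memcpy` where a real SPE program would issue an asynchronous MFC transfer, so no actual overlap takes place here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a DMA "get" from main memory into local storage.
// On a real SPE this would be an asynchronous transfer; here it is a plain copy.
inline void fetch(double* local, const double* main_mem, std::size_t count) {
    std::memcpy(local, main_mem, count * sizeof(double));
}

// Sums all elements of main_mem while streaming them through two small buffers
// of `chunk` elements each, mimicking a kernel limited by local storage size.
inline double streamed_sum(const std::vector<double>& main_mem, std::size_t chunk) {
    if (main_mem.empty() || chunk == 0) return 0.0;

    std::vector<double> buf[2] = {std::vector<double>(chunk),
                                  std::vector<double>(chunk)};
    const std::size_t n = main_mem.size();

    // Prime the first buffer before the loop starts.
    std::size_t cur_len = std::min(chunk, n);
    fetch(buf[0].data(), main_mem.data(), cur_len);

    double sum = 0.0;
    int cur = 0;
    for (std::size_t off = 0; off < n;) {
        const std::size_t next_off = off + cur_len;
        const std::size_t next_len =
            next_off < n ? std::min(chunk, n - next_off) : 0;

        // Start the transfer of the next chunk into the other buffer ...
        if (next_len != 0)
            fetch(buf[1 - cur].data(), main_mem.data() + next_off, next_len);

        // ... and compute on the current buffer in the meantime.
        for (std::size_t i = 0; i < cur_len; ++i) sum += buf[cur][i];

        // Swap the roles of the two buffers.
        off = next_off;
        cur_len = next_len;
        cur = 1 - cur;
    }
    return sum;
}
```

On the actual hardware, the copy would be replaced by a tagged DMA request, and the kernel would wait on that tag just before switching buffers.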

12 How to get Chroma/QDP++ on the Cell?
Problem:
  Build of Chroma/QDP++ for the PPE is possible
  But it executes with poor performance
  Exploiting the SPU's floating-point performance and DMA controller is necessary!
  But a build of Chroma/QDP++ for the SPU is impossible (SIMD, LS size, no I/O)!
Solution:
  Build only the required functions for the SPU
  Build the remaining parts of Chroma for the PPE

13 New Components
Modified QDP++ for PPE generates SPU meta-code
Lightweight QDP++ for SPU
SPU code generator
Boost Meta-Programming Library (MPL) for compile-time calculations
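
To give a flavor of what a compile-time calculation can look like in this setting, the sketch below computes at compile time how many sites of a nested type fit into a fixed-size buffer. It uses plain C++11 templates rather than Boost MPL, and every name in it (`Complex`, `Matrix`, `SitesPerBuffer`) as well as the 16 KB buffer size used in the usage note is an assumption made for illustration, not taken from the implementation described in the talk.

```cpp
#include <cstddef>

// Hypothetical nested site types, loosely mirroring a QCD tensor structure.
template <class T>
struct Complex {
    T re, im;
};

template <class T, std::size_t N>
struct Matrix {
    T elem[N][N];
};

// 3x3 complex color matrix in double precision.
using ColorMatrix = Matrix<Complex<double>, 3>;
// 4x4 spin matrix whose entries are color matrices.
using SpinColorMatrix = Matrix<ColorMatrix, 4>;

// Compile-time calculation: number of whole sites of type T that fit into
// a buffer of Bytes bytes. The value is a constant known to the compiler.
template <class T, std::size_t Bytes>
struct SitesPerBuffer {
    static constexpr std::size_t value = Bytes / sizeof(T);
};

// The sizes follow from the nesting: 9 * 16 bytes and 16 * 144 bytes.
static_assert(sizeof(ColorMatrix) == 144, "unexpected padding in ColorMatrix");
static_assert(sizeof(SpinColorMatrix) == 2304, "unexpected padding in SpinColorMatrix");
```

For an assumed 16 KB buffer, `SitesPerBuffer<ColorMatrix, 16 * 1024>::value` evaluates to 113 and `SitesPerBuffer<SpinColorMatrix, 16 * 1024>::value` to 7. A code generator can use such constants to size loops and transfers without any run-time bookkeeping.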

14 Integration of New Components into Build Process

15 Benchmark Measurements
60 QDP++ functions were selected for benchmarking from:
  Propagator calculation
  Smearing routines
  Hadron spectrum calculation

n     QDP++ function                                                      Index range
1006  $M_{i,j} = M_{i,j}$                                                 i, j = 1..3
1007  $M_{i,j} = (M M)_{i,j}$                                             i, j = 1..3
1014  $M^{SC}_{i,j,k,l} \mathrel{+}= (M^{C} M^{SC} + M^{SC})_{i,j,k,l}$   i, j = 1..3; k, l = 1..4

Test hardware: Jülich Supercomputing Centre (JSC), QS22 Cell blades (dual IBM PowerXCell 8i Processor)
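
A function of the kind listed as 1007, a site-wise product of 3x3 color matrices, can be written out in plain C++ as follows. This is a reference sketch, not the generated SPU code and not the QDP++ API: `ColorMatrix`, `multiply`, and `lattice_multiply` are hypothetical names, and the lattice is modeled as a simple vector of sites.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Cplx = std::complex<double>;

// Hypothetical per-site object: a 3x3 complex color matrix.
struct ColorMatrix {
    Cplx m[3][3];
};

// c_{i,j} = sum_k a_{i,k} * b_{k,j} for i, j, k = 0..2.
inline ColorMatrix multiply(const ColorMatrix& a, const ColorMatrix& b) {
    ColorMatrix c{};
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            Cplx sum(0.0, 0.0);
            for (int k = 0; k < 3; ++k) sum += a.m[i][k] * b.m[k][j];
            c.m[i][j] = sum;
        }
    }
    return c;
}

// Data-parallel form: the same matrix product is applied independently on
// every lattice site. All three vectors are assumed to have equal length.
inline void lattice_multiply(std::vector<ColorMatrix>& dest,
                             const std::vector<ColorMatrix>& a,
                             const std::vector<ColorMatrix>& b) {
    for (std::size_t s = 0; s < dest.size(); ++s) dest[s] = multiply(a[s], b[s]);
}
```

Each site reads two 144-byte matrices and writes one, while performing 27 complex multiplications and 18 complex additions (198 real floating-point operations), which is why functions of this kind are sensitive to memory bandwidth.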

16 Benchmark Results: Pure DMA
Good overall memory bandwidth saturation
Some drops for very small functions, e.g. 1022: LatticeBool = LatticeInt > ScalarInt (execution time negligible)

17 Benchmark Results: Computation Switched On
50% of functions already at highest performance
Some functions are limited by floating-point performance

18 Conclusion and Outlook
Pro
  First step towards Chroma on QPACE is done
  But the bigger goal: we are generic and address all functions (not only a few found in the hot spot of a program)
  Good memory bandwidth saturation
Con
  Build process still a little crude
Outlook
  Parallelization step to QPACE nodes
  Build process enhancement; some parts go into the compiler
