Parallelism for the Masses: Opportunities and Challenges Andrew A. Chien


Parallelism for the Masses: Opportunities and Challenges. Andrew A. Chien, Vice President of Research, Intel Corporation. Carnegie Mellon University Parallel Thinking Seminar, October 29, 2008.

Outline: Is parallelism a crisis? Opportunities in parallelism. Expectations and challenges. Moving parallel programming forward. What's going on at Intel. Questions.

Parallelism is here, and growing! Core counts are climbing from Core 2 Duo (2) and Core 2 Quad (4) to Dunnington (6), Nehalem (8+), and Larrabee (12-32) over 2006-2010, with 100+ cores expected by 2015.

Q. Is parallelism a crisis? A. Parallelism is an opportunity.

Parallelism is a key driver for energy and performance. Relative performance and power for a fixed dual-core-capable design: over-clocked (+20%), 1.13x performance at 1.73x power; design frequency, 1.00x performance at 1.00x power; under-clocked (-20%), 0.87x performance at 0.51x power; dual-core, under-clocked (-20%), 1.73x performance at 1.02x power.
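
The slide's numbers follow from a standard rule of thumb: if supply voltage scales roughly linearly with frequency, dynamic power scales roughly as frequency cubed. A minimal sketch of that model (the f^3 assumption is mine, an idealization rather than a circuit model; the 0.87x per-core performance factor is taken from the slide):

```cpp
#include <cassert>
#include <cmath>

// Back-of-the-envelope model (assumption: voltage scales linearly with
// frequency, so dynamic power ~ f * V^2 ~ f^3), multiplied by core count.
double relative_power(double freq_scale, int cores) {
    return cores * std::pow(freq_scale, 3.0);
}

// Performance of two cores each under-clocked by 20%, using the slide's
// 0.87x single-core factor: 2 * 0.87 ~ 1.73x on parallelizable work.
double dual_core_underclocked_perf() {
    return 2 * 0.87;
}
```

Under this model, two under-clocked cores land at roughly 1.02x the original power budget but 1.73x the throughput, which is the slide's argument that parallelism, not frequency, is the energy-efficient path.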

Unified IA and parallelism: a vision and foundation across platforms. [Diagram: a power/performance spectrum spanning Canmore, Tolapai, and Atom at the low-power end through Intel Core 2 and i7 for notebook, desktop, and server, with integration options such as graphics for visual computing and gaming.]

Opportunity in low-power computing: 10x lower power.

Opportunity in highly parallel computing: 10x higher performance. [Diagram: a many-core die with vector IA cores, coherent caches, fixed-function logic, an interprocessor network, and memory and I/O interfaces.] Target workloads: visual computing, financial modeling, gaming and entertainment, biological modeling.

Opportunity #1: Highly portable, parallel software. All computing systems (servers, desktops, laptops, MIDs, smart phones, embedded) are converging to a single framework with parallelism and a selection of CPUs and specialized elements. Energy efficiency and performance are the core drivers; software must become forward scalable. Parallelism becomes widespread: all software is parallel. Create standard models of parallelism in architecture, expression, and implementation.

Forward scalable software. [Chart: core counts over time, with academic research platforms running well ahead of mainstream hardware.] Software with forward scalability can be moved unchanged from N to 2N to 4N cores with continued performance increases.
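
The definition can be made concrete: forward-scalable code hard-codes no core count. A minimal C++ sketch (my illustration, not code from the talk) that asks the runtime how many hardware threads exist and divides work accordingly, so the same source moves unchanged from N to 2N to 4N cores:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Forward-scalable reduction: the worker count is discovered at run time,
// never hard-coded, so the same binary speeds up on wider machines.
long parallel_sum(const std::vector<long>& data) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<long> partial(n, 0);  // one private slot per worker: no races
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + n - 1) / n;
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) partial[t] += data[i];
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Whether the speedup actually continues from N to 4N cores depends on memory bandwidth and problem size, which is why the slide pairs forward scalability with scaling of problems and data.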

Opportunity #2: Major architectural support for programmability. Single-core growth and aggressive frequency scaling, which long crowded out other kinds of architecture innovation, are weakening. Architecture innovations for functionality (programmability, observability, garbage collection) are now possible. Don't ask for small incremental changes; be bold and ask for LARGE changes that make a LARGE difference.

A new golden age of architectural support for programming? [Timeline: mid-1960s to mid-1980s, language support and integration; mid-1980s to the mid-2000s, GHz scaling and issue scaling; the terascale era, programming support, parallelism, and system integration.]

Opportunity #3: High-performance, high-level programming approaches. Single-chip integration enables closer coupling (cores, caches) and innovation in inter-core coordination. This eases performance concerns and supports irregular, unstructured parallelism. Forward-scalable performance with good efficiency may be possible without detailed control. Functional, declarative, transactional, object-oriented, dynamic, scripting, and many other high-level models will thrive.

Manycore != SMP on a die. On-die bandwidth: 12 GB/s (SMP) vs. ~1.2 TB/s (tera-scale), a ~100x improvement. On-die latency: 400 cycles (SMP) vs. 20 cycles (tera-scale), a ~20x improvement. The result is less locality sensitivity and efficient sharing; runtime techniques become more effective for dynamic, irregular data and programs. Can we do less tuning, and program at a higher level?

Opportunity #4: Parallelism can add new kinds of capability and value. Additional core and computational capability can be available on-chip; single-chip design enables enhancement at low cost, and integration enables close coupling. Security: taint checking, invariant monitoring, etc. Robustness: race detection, invariant checking, etc. Interface: sensor data processing, self-tuning, activity inference. Deploy parallelism to enable new applications and software quality, and to enhance the user experience.

Expectations and Challenges

Two students in 2015: "Mine's an Intel 1,000-core with 64 out-of-order cores! How many cores does your computer have?" End users don't care about core counts; they care about capability.

Chip real estate and performance/value. Tukwila, the next-generation Intel Itanium processor: 4 cores, 2B transistors, 30MB cache; 50% cache, 50% logic. 1% of chip area = 30M transistors = 1/3MB; 0.1% of chip area = 3M transistors = 1/30MB. How much performance benefit? What incremental performance benefit should you expect from the last core in a 100-core? A 1000-core? Incremental performance benefit, or forward scalability, not efficiency, should be the goal.

Key software development challenges: new functionality, productivity, portability, performance (performance, performance!), robustness, debugging/test, security, time to market. Software development is hard! Parallelism is critical for performance, but must be achieved in conjunction with all of these requirements.

HPC: what we have learned about parallelism. Large-scale parallelism is possible, and typically comes with scaling of problems and data. Portable expression of parallelism matters. High-level program analysis is a critical technology. Working with domain experts is a good idea. Multi-version programming (algorithms and implementations) is a good idea. Autotuning is a good idea. Locality is hard, modularity is hard, data structures are hard, efficiency is hard. Of course, this list is not exhaustive.

HPC lessons not to learn: that programmer effort doesn't matter; that hardware efficiency matters above all; that low-level programming tools are acceptable; that low-level control is an acceptable path to performance; that horizontal locality and explicit control of communication are critical. Move beyond the conventional wisdom, built on these lessons, that parallelism is hard. Parallelism can be easy.

Moving Parallel Programming Forward

Spending Moore's Dividend (Larus): a 30-year retrospective analyzing Microsoft's experience. The dividend was spent on new application features; on higher-level programming (structured programming, then object-oriented, then managed runtimes); and on programmer productivity (a decreased focus on performance).

Today, most programmers do not focus on performance. Productivity (quick functionality, adequate performance): Matlab, Mathematica, R, SAS, etc.; VisualBasic, Perl, JavaScript, Python, Ruby on Rails. Mixed (productivity and performance): Java, C# (managed languages plus rich libraries). Performance (efficiency is critical; the HPC focus): C++ and the STL, C and Fortran. How can we enable productivity programmers to write 100-fold parallel programs? Parallelism must be accessible to productivity programmers.

Challenge #1: Can we introduce parallelism in the productivity layer? Enable productivity programmers to create large-scale parallelism with modest effort. Simple models of parallelism mated to productivity languages: data and collection parallelism, determinism, functional/declarative styles, and parallel libraries. Generate scalable parallelism for many applications, and exploit it with dynamic and compiled approaches. This may even be suitable for introduction-to-programming classes.
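
One way to read "data and collection parallelism, determinism" is a primitive like a parallel map: every element is computed independently with no shared mutable state, so the result is deterministic regardless of scheduling. A hedged C++ sketch of such a collection-level primitive (my illustration of the idea, not an Intel library API; launching one task per element is for clarity, a real library would chunk):

```cpp
#include <future>
#include <vector>

// Deterministic, collection-level parallel map: no locks, no races, because
// each element's computation reads only its own input.
template <typename T, typename F>
std::vector<T> parallel_map(const std::vector<T>& in, F f) {
    std::vector<std::future<T>> futs;
    futs.reserve(in.size());
    for (const T& x : in)
        futs.push_back(std::async(std::launch::async, f, x));
    std::vector<T> out;
    out.reserve(in.size());
    for (auto& fu : futs) out.push_back(fu.get());  // results stay in order
    return out;
}
```

A productivity programmer can call this the way they would call any sequential library routine; the parallelism is entirely inside the abstraction.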

Challenge #2: Can we compose parallel programs safely (and easily)? We need parallel program composition frameworks which preserve the correctness of the composed parts: a language, execution model, and/or tools. Interactions, when necessary, should be controllable at the programmer's semantic level (i.e., objects or modules), with composition and interactions supported efficiently by hardware. This would fulfill the vision behind transactional memory, Concurrent Collections, Concurrent Aggregates, Linda, lock-free and race-free programming, etc.
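
The composition problem shows up in miniature with locks: two objects that are each thread-safe do not automatically compose into a correct whole. A small C++ sketch (illustrative; `Account` and `transfer` are hypothetical names, and `std::scoped_lock` is a modest stand-in for the transactional-style composition the slide calls for):

```cpp
#include <mutex>

// Each Account is individually thread-safe, but a transfer touches two of
// them; updating each under its own lock separately would expose an
// inconsistent intermediate state. scoped_lock acquires both mutexes with a
// deadlock-avoidance algorithm, making the composed operation atomic at the
// programmer's semantic level (the object, not the individual field).
struct Account {
    std::mutex m;
    long balance = 0;
};

void transfer(Account& from, Account& to, long amount) {
    std::scoped_lock lock(from.m, to.m);
    from.balance -= amount;
    to.balance += amount;
}
```

The slide's point is that this burden should fall on a framework, not on every programmer remembering to lock in the right order.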

Challenge #3: Can we invent an easy, modular way to express locality? Describe data reuse and spatial locality; make the descriptions compose; enable both software and hardware exploitation. Computation = algorithms + data structures. Efficient computation = algorithms + data structures + locality. But how should the terms compose: A + DS + L? (A + DS) + L? A + (DS + L)? (A + L) + DS?
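
The best-known locality idiom today is loop tiling, and it illustrates exactly the modularity gap the slide points at: the locality decision is woven through the algorithm rather than composed with it. A C++ sketch of the status quo (illustrative; the block size of 64 is an assumption, tuned per cache in practice):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr int BLOCK = 64;

// Tiled transpose of an n x n row-major matrix: untiled, writes to `out`
// stride through memory and miss constantly; walking BLOCK x BLOCK tiles
// keeps both source and destination lines cache-resident. Note the locality
// choice (the tiling) cannot be separated from the loop nest.
std::vector<long> tiled_transpose(const std::vector<long>& in, int n) {
    std::vector<long> out((std::size_t)n * n);
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int i = ii; i < std::min(n, ii + BLOCK); ++i)
                for (int j = jj; j < std::min(n, jj + BLOCK); ++j)
                    out[(std::size_t)j * n + i] = in[(std::size_t)i * n + j];
    return out;
}
```

A modular answer to the challenge would let the plain two-loop transpose (`A + DS`) be written once and the tiling (`L`) be stated, and composed, separately.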

Challenge #4: Can we raise the level of efficient parallel programming? The HPC model is the lowest level of abstraction: direct control of resource mapping, explicit control over communication and synchronization, and direct management of locality. Starting from HPC models, how can we elevate?

Parallelism: implications for computing education in 2015 (and NOW!). Parallelism is in all computing systems and should be front and center in education: an integral part of early introduction and experience with programming (simple models), of early theory and algorithms (foundation), and of software architecture and engineering. Observation: no biologist, physicist, chemist, or engineer thinks about the world with a fundamentally sequential foundation. The world is parallel! We can design large-scale concurrent systems of staggering complexity.

Are all applications parallel? [Chart: the share of parallel applications grows from ??% today toward 100% in the manycore era, while the number of applications grows rapidly.]

What's going on at Intel

Universal Parallel Computing Research Centers: catalyze breakthrough research enabling pervasive use of parallel computing. Parallel programming: languages, compilers, runtimes, tools. Parallel applications: for desktop, gaming, and mobile systems. Parallel architecture: support a new generation of programming models and languages. Parallel system software: performance scaling, memory utilization, and power consumption.

Making parallel computing pervasive. Intel Tera-scale Research: a joint HW/SW R&D program to enable Intel products 3-7+ years in the future. Academic research (UPCRCs): research seeking disruptive innovations 7-10+ years out. Software products: a wide array of leading multi-core SW development tools and information available today. Community and experimental tools: TBB open-sourced threading for multi-core SW; an STM-enabled compiler on whatif.intel.com; parallel benchmarks at Princeton's PARSEC site; multi-core books. Intel Academic Community multi-core education program: 400+ universities, 25,000+ students; 2008 goal: double this.

Ct: A Throughput Programming Language. The user writes serial-like, core-count-independent C++ code:

TVEC<F32> a(src1), b(src2);
TVEC<F32> c = a + b;
c.copyout(dest);

The primary data abstraction is the nested vector, which supports dense, sparse, and irregular data. The Ct parallel runtime auto-scales to increasing cores, splitting vector operations across threads and per-core SIMD units; the Ct JIT compiler performs auto-vectorization for SSE, AVX, and Larrabee. The programmer thinks serially; Ct exploits the parallelism. See whatif.intel.com.

Intel Concurrent Collections for C/C++: a spec that supports serious separation of concerns. The domain expert owns the application problem (semantic correctness, the constraints required by the application) and does not need to know about parallelism. The tuning expert owns the mapping to the target platform (actual parallelism, locality, overhead, load balancing, distribution among processors, scheduling within a processor) and does not need to know about the domain. [Example graph: image -> Classifier1 -> Classifier2 -> Classifier3.] http://whatif.intel.com

Intel Software Products, 2007-2008.

Polaris: Teraflops Research Processor. Technology: 65nm, 1 poly, 8 metal (Cu). Transistors: 100 million (full chip), 1.2 million (tile). Die area: 275mm^2 (full chip, 21.72mm x 12.64mm), 3mm^2 (tile, 1.5mm x 2.0mm). C4 bumps: 8390. Goals: deliver tera-scale performance (1 TFLOP at 62W, desktop power, 16 GF/W; frequency target 5GHz; 80 cores; bisection bandwidth of 256GB/s; link bandwidth in hundreds of GB/s); prototype two key technologies (on-die interconnect fabric, 3D stacked memory); develop a scalable design methodology (tiled design approach, mesochronous clocking, power-aware capability).

OpenCirrus testbed: an open, large-scale, global testbed for academic research. Catalyze the creation of an open-source cloud stack; focus on data-center management and services, with systems-level and application-level research. Announced July 29, 2008; deployment expected December 2008. Six testbeds of 1,000 to 4,000 cores each: UIUC; KIT (Germany); HP and Yahoo in the Bay Area; Intel Research at CMU; IDA (Singapore). Create an academic cloud-computing research community and open-source stack: www.opencirrus.org

Summary. Is parallelism a crisis? No; there are major opportunities to lay new foundations, with synergy across layers. Expectations and challenges: efficiency is not king. Moving parallel programming forward: productivity and ease, composition, locality, and only then efficiency. What's going on at Intel: a broad array of explorations.

Questions?

Copyright Intel Corporation. *Other names and brands may be claimed as the property of others. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, or used for any public purpose under any circumstance, without the express written consent and permission of Intel Corporation.