Scaling Data Warehousing Applications using GPUs
- Beverly Spencer
- 6 years ago
Transcription
Sudhakar Yalamanchili
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA
Sponsors: National Science Foundation, LogicBlox Inc., NVIDIA, Intel

Outline
- New Rules
  - Scaling and energy efficiency
  - Data movement costs
  - Thermal issues and processor physics
- Scaling Relational Database Performance with GPUs
  - Optimized primitives
  - Optimization of Data Movement
  - DRAM memory aggregation in clusters

Scaling Computing Performance
- Data Movement Costs
- Thermal Limits
- Energy Limits
- Cray Titan: Heterogeneous Computing

Moore's Law
- Goal: Sustain Performance Scaling
- Performance scaled with number of transistors
- Dennard scaling: power scaled with feature size
Figures: from wikipedia.org and from R. Dennard, et al., "Design of ion-implanted MOSFETs with very small physical dimensions," IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, Oct.

Post Dennard Architecture Performance Scaling
- Power Delivery, Cooling
- $\text{Perf}\left(\frac{\text{ops}}{\text{s}}\right) = \text{Power}(\text{W}) \times \text{Efficiency}\left(\frac{\text{ops}}{\text{joule}}\right)$ (W. J. Dally, Keynote, IITC 2012)
- You can hide latency but you cannot hide energy!
- Data_movement_cost: three operands x 64 bits/operand
- Moving 1 bit of data 1 mm at 22 nm = ~1 pJ [1]
- $\text{Energy} = \#\text{bits} \times \text{dist}_{\text{mm}} \times \text{energy}_{\text{bit-mm}}$
[1] HIPEAC Roadmap, hipeacvision.pdf

Scaling Performance: Cost of Data Movement
- Embedded Platforms. Goal: GOps/w
- Big Science: To Exascale. Goal: 20MW/Exaflop
- Sustain performance scaling through massive concurrency
- Data movement becomes more expensive than computation
Figure: Cost of Data Movement. Courtesy: Sandia National Labs, R. Murphy.
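
The two relations on these slides can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the round numbers quoted above (~1 pJ per bit-mm, three 64-bit operands, 20 MW per exaflop), while the function names and the 10 mm distance are assumptions introduced here, not values from the talk.

```python
# Illustrative arithmetic using the round numbers quoted on the slides.
PJ_PER_BIT_MM = 1.0      # ~1 pJ to move 1 bit 1 mm at 22 nm (slide figure)
BITS_PER_OPERAND = 64
OPERANDS = 3


def movement_energy_pj(bits, dist_mm, pj_per_bit_mm=PJ_PER_BIT_MM):
    """Energy = #bits x distance (mm) x energy per bit-mm, in picojoules."""
    return bits * dist_mm * pj_per_bit_mm


def energy_budget_pj_per_op(power_watts, ops_per_second):
    """Perf = Power x Efficiency, rearranged to joules per op (in pJ)."""
    return power_watts / ops_per_second * 1e12


bits = OPERANDS * BITS_PER_OPERAND
one_mm = movement_energy_pj(bits, dist_mm=1.0)
ten_mm = movement_energy_pj(bits, dist_mm=10.0)   # hypothetical distance
budget = energy_budget_pj_per_op(20e6, 1e18)      # 20 MW per exaflop

print(f"operand bits moved:       {bits}")
print(f"energy over 1 mm:         {one_mm:.1f} pJ")
print(f"energy over 10 mm:        {ten_mm:.1f} pJ")
print(f"exascale budget per op:   {budget:.1f} pJ")
```

Under these assumptions, moving the operands of a single operation even 1 mm costs more than the whole per-operation energy budget implied by the exascale goal, which is the sense in which data movement dominates.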

Post Dennard Architecture Performance Scaling
- $\text{Perf}\left(\frac{\text{ops}}{\text{s}}\right) = \text{Power}(\text{W}) \times \text{Efficiency}\left(\frac{\text{ops}}{\text{joule}}\right)$ (W. J. Dally, Keynote, IITC 2012)
- Operator_cost + Data_movement_cost
- Specialization → heterogeneity and asymmetry
- Three operands x 64 bits/operand
- $\text{Energy} = \#\text{bits} \times \text{dist}_{\text{mm}} \times \text{energy}_{\text{bit-mm}}$

Scaling Performance: Simplify, Diversify & Multiply
- Extracting single thread performance costs energy
  - Out-of-order execution
  - Branch prediction
  - Scheduling etc.
  - Still important!
- Multithread performance exploits parallelism
  - Simpler pipelines
  - Core scaling
Figures: AMD Bulldozer Core, ARM A7 Core (arm.com), NVIDIA Fermi

Asymmetry vs. Heterogeneity
Categories: Performance Asymmetry, Functional Asymmetry, Heterogeneous
- Multiple voltage and frequency islands
- Different memory technologies
  - STT-RAM, PCM, Flash
- Complex cores and simple cores
- Shared instruction set architecture (ISA)
- Subset ISA
- Distinct microarchitecture
- Fault and migrate model of operation [1]
- Multi-ISA
- Microarchitecture
- Memory & Interconnect hierarchy
Figure labels: Uniform ISA, Multi-ISA
[1] Li, T., et al., Operating system support for shared ISA asymmetric multi-core architectures, in WIOSCA.

The Challenge: The Memory System
- What should the memory hierarchy look like?
- Parallelism vs. locality tradeoffs
- Minimize data movement → Processor in Memory?
Figures: Xeon Phi, Hybrid Memory Cube

Thermal Capacity
- Exploit package physics
  - Temperature changes on the order of milliseconds
  - Workload behaviors change on the order of microseconds
  - Impact on device behavior?
- Power-Performance Management!
Figures (psdgraphics.com and wikipedia.org): Thermal Capacity; Time Varying Workload (Instructions/cycle vs. Time)

Summary: New Performance Scaling Rules
- Energy efficiency: Scale performance by scaling energy efficiency → diversify → programming models?
- Parallelism: Scale the number of cores rather than the performance of a single core → multiply → programming models
- Data Movement: The energy cost of data movement exceeds the energy cost of computation → communication-centric
- Physics Capacity: Scaling is limited by thermal/power capacity → power/thermal management

Outline
- New Rules
  - Scaling and energy efficiency
  - Data movement costs
  - Thermal issues and processor physics
- Scaling Relational Database Performance with GPUs
  - Optimized primitives
  - Optimization of Data Movement
  - DRAM memory aggregation in clusters

System Diversity
- Amazon EC2 GPU Instances
- Mobile Platforms (DSP, GPUs)
- Keeneland System (GPUs)
- Cray Titan (GPUs)
Hardware Diversity is Mainstream

System Model
- Large Graphs
- Programming Models
- Data Movement Optimizations
- System Abstractions, e.g. GAS, Virtual DIMMs, etc.
- Domain Specific Languages
- Compiler and Run-Time Support
- Cluster Wide Hardware Consolidation
- Hardware Customization

Databases: Not a Traditional Domain of GPUs
- LargeQty(p) <- Qty(q), q >
- Relational Computations Over Massive Data Sets

Data Warehousing Applications on GPUs
- The Opportunity
  - Significant potential data parallelism
  - If data fits in GPU memory, 2x-27x speedup has been shown [1]
- The Challenge
  - Need to process 1-50 TBs of data [2]
  - 15-90% of the total time is spent moving data between CPU and GPU
  - Fine grained computation
[1] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS.
[2] Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.

Red Fox: Goal and Status (Haicheng Wu)
- Goal
  - Build a compiler/runtime framework to accelerate Datalog LB queries using GPUs
  - Understand the Good, the Bad and the Ugly!
- Status
  - Capable of running all/full TPC-H queries on GPUs
  - Requires that data fits in the GPU memory → move to fusion parts
  - Focus to date: correctness and performance
  - Moving forward → performance and scale

Domain Specific Compilation: Red Fox
- Joint with LogicBlox Inc.
- Targeting Accelerator Clouds for meeting the demands of data warehousing applications
- In-core databases
Figure (compilation flow): Datalog LB Queries; LogicBlox Front-End; Language Front-End; src-src Optimization; Kernel Weaver; IR Optimization; RA-To-PTX (nvcc + RA-Lib); Red Fox RT. Labels: Query Plan, Kernel IR, RA Primitives, Translation Layer, Machine Neutral Back-End.

Datalog LB Query and Front-end

Example Datalog LB Query (with a recursive definition):

    number(n)->int32 (n).
    number(0).
    // other number facts elided for brevity
    next(n,m)->int32(n), int32(m).
    next(0,1).
    // other next facts elided for brevity

    even(n)-> int32(n).
    even(0).
    even(n)<-number(n),next(m,n),odd(m).

    odd (n)->int32(n).
    odd (n)<-next(m,n),even(m).

Example Harmony IR (CFG) produced by the front-end:

    BB1: COPY(pre_odd,odd){PTX}
         COPY(pre_even,even){PTX}
         JOIN_PARTITION(next,even){PTX}
         JOIN_COMPUTE(next,even){PTX}
         JOIN_GATHER(temp_odd){PTX}
         PROJECT(odd,temp_odd){PTX}
    BB2: PROJECT(m_1,next){PTX}
         JOIN_PARTITION(number,m_1){PTX}
         JOIN_COMPUTE(number,m_1){PTX}
         JOIN_GATHER(temp_j_1){PTX}
         PROJECT(j_1,temp_j_1){PTX}
         JOIN_PARTITION(j_1,odd){PTX}
         JOIN_COMPUTE(j_1,odd){PTX}
         JOIN_GATHER(temp_even){PTX}
         PROJECT(even,temp_even){PTX}
    BB3: if pre_odd == odd? (Y/N)
    BB4: pre_even == even? (Y/N)
    BB5: HALT
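
The recursive even/odd query above is evaluated by iterating until neither relation changes, which is what the BB3/BB4 comparisons against pre_odd and pre_even express. The following is a minimal Python sketch of that fixpoint loop using sets in place of GPU relations. The function name `evaluate` and the bounded number range are assumptions made for illustration; this is not Red Fox code.

```python
def evaluate(limit):
    """Naive fixpoint evaluation of the even/odd rules over 0..limit."""
    number = set(range(limit + 1))                    # number(n) facts
    next_rel = {(m, m + 1) for m in range(limit)}     # next(m, m+1) facts
    even, odd = {0}, set()                            # even(0) is a fact

    while True:
        # BB1: remember the previous state, then derive odd.
        pre_even, pre_odd = set(even), set(odd)
        # odd(n) <- next(m,n), even(m)
        odd = odd | {n for (m, n) in next_rel if m in even}
        # BB2: even(n) <- number(n), next(m,n), odd(m)
        even = even | {n for (m, n) in next_rel if n in number and m in odd}
        # BB3/BB4: stop once neither relation grew in this iteration.
        if pre_odd == odd and pre_even == even:
            return even, odd


even, odd = evaluate(6)
print(sorted(even), sorted(odd))
```

Each set comprehension plays the role of one join-then-project sequence in the IR; on the GPU these are the JOIN_PARTITION, JOIN_COMPUTE, JOIN_GATHER and PROJECT kernels.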

Research Thrusts
- I: Optimized implementations of primitives
  - Relational algebra
  - Data management within the GPU memory hierarchy
- II: Data movement optimizations
  - Between hosts and (local or remote) accelerators
  - Within an accelerator
- III: In-core processing
  - Cluster wide memory aggregation techniques
  - Change the ratio of host memory size to accelerator memory size

Primitives
- Map operators to GPU implementations
  - From RA Library: PROJECT, PRODUCT, SELECT, JOIN
  - From Thrust Library: SORT, UNIQUE, AGGREGATION
  - SET Family
- Data structure: weakly sorted arrays of densely packed tuples
- Tuple fields can be integer, float, datetime, string, etc.
Figure: example tuple layout with fields id, price, tax and zero padding; labels: 4 bytes, 8 bytes, 16 bytes; Key, Value.

RA Primitives Library: Multistage Algorithms
- Hybrid multi-stage algorithm (partition, compute, gather) to make trade-offs between computation complexity and memory access efficiency
- Strategy: Increase core utilization until the computation becomes memory bound, and then achieve near peak utilization of the memory interface
Figure: Example of SELECT
* G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP.

RA Primitives Library: Example of JOIN
- Most complicated, JOIN: 57%~72% peak performance
- Most efficient, PRODUCT, PROJECT and SELECT: 86%~92% peak performance
- Measured on Tesla C2050, with random integers as inputs
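
The partition, compute, gather structure named above can be illustrated on SELECT. The sketch below is a sequential Python stand-in, not the RA library's CUDA implementation: each chunk represents the work one thread block might do, and `select_multistage` and `num_partitions` are hypothetical names chosen for this example.

```python
def select_multistage(tuples, predicate, num_partitions=4):
    """SELECT as three stages: partition, compute, gather."""
    n = len(tuples)
    if n == 0:
        return []

    # Stage 1 (partition): split the input into contiguous chunks.
    chunk = (n + num_partitions - 1) // num_partitions
    parts = [tuples[i:i + chunk] for i in range(0, n, chunk)]

    # Stage 2 (compute): each chunk filters independently into a local buffer.
    local = [[t for t in part if predicate(t)] for part in parts]

    # Stage 3 (gather): an exclusive prefix sum over the match counts gives
    # each chunk its offset in the dense output array.
    offsets, total = [], 0
    for buf in local:
        offsets.append(total)
        total += len(buf)

    out = [None] * total
    for off, buf in zip(offsets, local):
        out[off:off + len(buf)] = buf
    return out


data = [(i, i * 10) for i in range(10)]
print(select_multistage(data, lambda t: t[0] % 3 == 0, num_partitions=3))
```

Because chunks are contiguous and gathered in order, a sorted input stays sorted in the output, which matches the sorted, densely packed tuple arrays the primitives operate on.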

Research Thrusts
- I: Optimized implementations of primitives
  - Relational algebra
  - Data management within the GPU memory hierarchy
- II: Data movement optimizations
  - Between hosts and (local or remote) accelerators
  - Within an accelerator
- III: In-core processing
  - Cluster wide memory aggregation techniques
  - Change the ratio of host memory size to accelerator memory size

Data Movement in Kernel Execution
Figure: (1) Input, (2) Execute, (3) Result; Thread Block or Cooperative Thread Array (CTA); ~250GB/s.

Kernel Fusion: A Data Movement Optimization
- Increase the granularity of kernel computation
- Reduce data movement throughout the hierarchy
- Inspired by loop fusion
- Compile-time automation
  - Input is an optimized query plan

Kernel Weaving and Fusion
- Interweaving and fusing individual stages (CUDA kernels)
- Use registers or shared memory to store temporary results

Kernel Weaver: Major Benefits
- Reduce Data Footprint
  - Reduction in accesses to global memory
  - Access to common data across kernels improves temporal locality
  - Reduction in PCIe transfers
- Expand optimization scope of the compiler
  - Data re-use
  - Increase textual scope of optimizers
Figure: unfused (A1, A2 → Kernel A → Temp; Temp, A3 → Kernel B → Result) versus fused (A1, A2, A3 → Fused Kernel A, B → Result).
* H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. In MICRO.

Kernel Weaver: Micro-benchmarks
- Fusing the operators below together on Tesla C2070
- Speedup, fused vs. not fused: average 2.89x
Figure: cases (a) to (e).
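
The benefit of fusion is easiest to see on a two-operator pipeline. The sketch below contrasts an unfused SELECT followed by PROJECT, which materializes an intermediate relation, with a fused loop that keeps each intermediate tuple in a local variable. It is a plain Python analogy for the idea, with hypothetical function names, and does not reproduce what Kernel Weaver generates.

```python
def select_kernel(rows, pred):
    """Stand-in for a SELECT kernel: writes a full intermediate relation."""
    return [r for r in rows if pred(r)]


def project_kernel(rows, cols):
    """Stand-in for a PROJECT kernel: reads the intermediate relation back."""
    return [tuple(r[c] for c in cols) for r in rows]


def unfused(rows, pred, cols):
    temp = select_kernel(rows, pred)          # Temp lives in "global memory"
    return project_kernel(temp, cols), len(temp)


def fused(rows, pred, cols):
    out = []
    for r in rows:                            # one pass over the input
        if pred(r):                           # SELECT stage
            out.append(tuple(r[c] for c in cols))  # PROJECT stage, no Temp
    return out, 0


rows = [(i, i * 2, i * 3) for i in range(8)]
is_even_id = lambda r: r[0] % 2 == 0
print(unfused(rows, is_even_id, (0, 2)))
print(fused(rows, is_even_id, (0, 2)))
```

Both versions return the same result; the second element of each return value counts the intermediate tuples written out, which is the traffic fusion removes.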

Resource Usage & Occupancy
Table (values not preserved): PTX Reg #, Shared MEM (Byte) and Occupancy (%) for individual primitives (PROJECT, SELECT, JOIN / Multiply) and after kernel fusion (cases (a) to (e)).
- Kernel fusion may increase resource usage and thus decrease occupancy
- Retains other benefits

TPC-H Queries
- A popular decision making benchmark suite
- Has 22 queries that analyze data from 6 big tables
- Scale Factor parameter controls the database size
  - Red Fox can run SF=1 for all 22 queries
- GPU benchmark suite being generated (Summer 2013)

Experimental Environment
- CPU: Xeon 2.80GHz
- GPU: 1 Tesla C2075 (6GB GDDR5 memory)
- OS: Ubuntu Server
- Software: GCC, NVCC 4.2, Thrust

TPC-H Performance (SF = 1)
- All 22 queries together take ... seconds
- Compared with a MySQL implementation on a 4-node CPU cluster*, Red Fox is 59x faster on average
- Example: Q22
  - Input Size: 192MB
  - Operator #: 92
  - CUDA Kernel #: 205
  - Query Plan: (figure)
* Ngamsuriyaroj, Pornpattana, Performance Evaluation of TPC-H Queries on MySQL Cluster. WAINA.

Where is the time spent?
Figure: time breakdown by operator (project, select, product, join, diff, sort, unique, merge, agg, arith, conv, others, copy, pcie); labeled values: 48.82%, 38.94%.
- Most of the time is spent in JOIN and SORT
- PCIe transfer time is less than 10%
- PROJECT is used most frequently, but takes less than 5%

Future Improvements
- Optimized query plan
  - Reduce tuple size
  - Common operator reduction
  - Reorder operators
- More RA implementations
  - Hash Join
  - Radix Sort
- Pipeline the execution
- Expect 10x-100x speedup from the above techniques
- Increase scale factor → Oncilla

Research Thrusts
- I: Optimized implementations of primitives
  - Relational algebra
  - Data management within the GPU memory hierarchy
- II: Data movement optimizations
  - Between hosts and (local or remote) accelerators
  - Within an accelerator
- III: In-core processing
  - Cluster wide memory aggregation techniques
  - Change the ratio of host memory size to accelerator memory size

III. In-Core Processing
Figure: four nodes, each with a GPU (~2K cores), GPU MEM (~6GB), MAIN MEM (~128GB) and a multi-core CPU (2-16 cores).
- Cluster-based memory aggregation
- Hardware support for a global, non-coherent, physical address space system
- Change the ratio of host-memory : GPU-memory
- Joint project with the University of Heidelberg

Oncilla: Fabrics for Accelerator Clouds (Jeff Young)
- Goal: Efficient memory aggregation for accelerators in data centers
- Solution: Use Global Address Spaces (GAS) and commodity fabrics (HT, QPI, PCIe, 10GE, IB)
- Support in-core databases using software from the Red Fox project

Oncilla TPC-H Microbenchmarks (Preliminary Results)
Figure: Using Disk vs. Using Aggregation.

EXTOLL Network Adapter and Fabric
Courtesy: Prof. H. Fröning, the University of Heidelberg
- Provides RDMA transfer (RMA), MMIO-based put/get operations for GAS (SMFU), and support for efficient, small messages (VELO)
- Current V6 prototype: 300 ns latency per hop, 24 Gbps bandwidth, very low overhead (64 B per packet) [1]
- ASIC projected to have bandwidth of 8-12 GB/s
[1] H. Fröning, On Achieving High Message Rates, CCGRID.

Oncilla Infrastructure
- Two node cluster prototypes
  - GB of DRAM
  - NVIDIA C2070 GPUs
- EXTOLL cluster
  - Network adapters and fabric developed by University of Heidelberg, Germany
  - AIC custom blades
  - Galibier Virtex 6 prototypes
- IB cluster based on KIDS
  - Mellanox QDR IB adapter
  - Dual-socket Intel Xeon X
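
To get a feel for the prototype figures quoted above (300 ns latency per hop, 24 Gbps bandwidth), a simple latency-plus-serialization estimate can be written down. This is a back-of-the-envelope model assumed here for illustration: it ignores the per-packet overhead and any protocol effects, and `transfer_time_us` is a hypothetical helper, not part of EXTOLL or Oncilla.

```python
def transfer_time_us(num_bytes, hops=1, latency_ns_per_hop=300.0,
                     bandwidth_gbps=24.0):
    """Estimated transfer time in microseconds.

    Assumed model: time = hops * per-hop latency + bits / bandwidth.
    Packet overhead and contention are deliberately ignored.
    """
    latency_us = hops * latency_ns_per_hop / 1000.0
    serialization_us = num_bytes * 8 / (bandwidth_gbps * 1e9) * 1e6
    return latency_us + serialization_us


for size in (64, 4096, 1 << 20):
    print(f"{size:>8} bytes: {transfer_time_us(size):10.3f} us")
```

In this model small messages are dominated by the per-hop latency and large transfers by bandwidth, which is why the fabric provides a separate path for small messages.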

Figure labels: System Software, Scaling Rules, Applications, Technology, Architecture

Thank You
Questions?
Red Fox: An Execution Environment for Relational Query Processing on GPUs
Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia
More informationRed Fox: An Execution Environment for Relational Query Processing on GPUs
Red Fox: An Execution Environment for Relational Query Processing on GPUs Georgia Institute of Technology: Haicheng Wu, Ifrah Saeed, Sudhakar Yalamanchili LogicBlox Inc.: Daniel Zinn, Martin Bravenboer,
More informationOncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries
Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big
More informationAccelerating Data Warehousing Applications Using General Purpose GPUs
Accelerating Data Warehousing Applications Using General Purpose s Sponsors: Na%onal Science Founda%on, LogicBlox Inc., IBM, and NVIDIA The General Purpose is a many core co-processor 10s to 100s of cores
More informationRela*onal Processing Accelerators: From Clouds to Memory Systems
Rela*onal Processing Accelerators: From Clouds to Memory Systems Sudhakar Yalamanchili School of Electrical and Computer Engineering Georgia Institute of Technology Collaborators: M. Gupta, C. Kersey,
More informationMultipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs
Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox
More informationCommodity Converged Fabrics for Global Address Spaces in Accelerator Clouds
Commodity Converged Fabrics for Global Address Spaces in Accelerator Clouds Jeffrey Young, Sudhakar Yalamanchili School of Electrical and Computer Engineering, Georgia Institute of Technology Motivation
More informationOncilla: A GAS Runtime for Efficient Resource Allocation and Data Movement in Accelerated Clusters
Oncilla: A GAS Runtime for Efficient Resource Allocation and Data Movement in Accelerated Clusters Jeff Young, Se Hoon Shon, Sudhakar Yalamanchili, Alex Merritt, Karsten Schwan School of Electrical and
More informationGPU-centric communication for improved efficiency
GPU-centric communication for improved efficiency Benjamin Klenk *, Lena Oden, Holger Fröning * * Heidelberg University, Germany Fraunhofer Institute for Industrial Mathematics, Germany GPCDP Workshop
More informationThe Era of Heterogeneous Compute: Challenges and Opportunities
The Era of Heterogeneous Compute: Challenges and Opportunities Sudhakar Yalamanchili Computer Architecture and Systems Laboratory Center for Experimental Research in Computer Systems School of Electrical
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationEfficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford
Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationEE282 Computer Architecture. Lecture 1: What is Computer Architecture?
EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer
More informationThe Case for Heterogeneous HTAP
The Case for Heterogeneous HTAP Raja Appuswamy, Manos Karpathiotakis, Danica Porobic, and Anastasia Ailamaki Data-Intensive Applications and Systems Lab EPFL 1 HTAP the contract with the hardware Hybrid
More informationArchitecture-Conscious Database Systems
Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! l l l l Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query
More informationMulti-threaded Queries. Intra-Query Parallelism in LLVM
Multi-threaded Queries Intra-Query Parallelism in LLVM Multithreaded Queries Intra-Query Parallelism in LLVM Yang Liu Tianqi Wu Hao Li Interpreted vs Compiled (LLVM) Interpreted vs Compiled (LLVM) Interpreted
More informationEnergy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS
Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationBig Data Systems on Future Hardware. Bingsheng He NUS Computing
Big Data Systems on Future Hardware Bingsheng He NUS Computing http://www.comp.nus.edu.sg/~hebs/ 1 Outline Challenges for Big Data Systems Why Hardware Matters? Open Challenges Summary 2 3 ANYs in Big
More informationGPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA
GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles
More informationRe-architecting Virtualization in Heterogeneous Multicore Systems
Re-architecting Virtualization in Heterogeneous Multicore Systems Himanshu Raj, Sanjay Kumar, Vishakha Gupta, Gregory Diamos, Nawaf Alamoosa, Ada Gavrilovska, Karsten Schwan, Sudhakar Yalamanchili College
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationData Processing on Emerging Hardware
Data Processing on Emerging Hardware Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland 3 rd International Summer School on Big Data, Munich, Germany, 2017 www.systems.ethz.ch
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationRealizing the Next Generation of Exabyte-scale Persistent Memory-Centric Architectures and Memory Fabrics
Realizing the Next Generation of Exabyte-scale Persistent Memory-Centric Architectures and Memory Fabrics Zvonimir Z. Bandic, Sr. Director, Next Generation Platform Technologies Western Digital Corporation
More informationACCELERATION AND EXECUTION OF RELATIONAL QUERIES USING GENERAL PURPOSE GRAPHICS PROCESSING UNIT (GPGPU)
ACCELERATION AND EXECUTION OF RELATIONAL QUERIES USING GENERAL PURPOSE GRAPHICS PROCESSING UNIT (GPGPU) A Dissertation Presented to The Academic Faculty By Haicheng Wu In Partial Fulfillment of the Requirements
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationSoftware and Tools for HPE s The Machine Project
Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationExperiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor
Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin
EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More informationRegister File Organization
Register File Organization Sudhakar Yalamanchili unless otherwise noted (1) To understand the organization of large register files used in GPUs Objective Identify the performance bottlenecks and opportunities
More informationEnergy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package
High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationLecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationLECTURE 1. Introduction
LECTURE 1 Introduction CLASSES OF COMPUTERS When we think of a computer, most of us might first think of our laptop or maybe one of the desktop machines frequently used in the Majors Lab. Computers, however,
More informationDeploy a High-Performance Database Solution: Cisco UCS B420 M4 Blade Server with Fusion iomemory PX600 Using Oracle Database 12c
White Paper Deploy a High-Performance Database Solution: Cisco UCS B420 M4 Blade Server with Fusion iomemory PX600 Using Oracle Database 12c What You Will Learn This document demonstrates the benefits
More informationIMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM
IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationMicroprocessor Trends and Implications for the Future
Microprocessor Trends and Implications for the Future John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 4 1 September 2016 Context Last two classes: from
More informationSDA: Software-Defined Accelerator for general-purpose big data analysis system
SDA: Software-Defined Accelerator for general-purpose big data analysis system Jian Ouyang(ouyangjian@baidu.com), Wei Qi, Yong Wang, Yichen Tu, Jing Wang, Bowen Jia Baidu is beyond a search engine Search
More informationn N c CIni.o ewsrg.au
@NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationExploiting Task-Parallelism on GPU Clusters via OmpSs and rcuda Virtualization
Exploiting Task-Parallelism on Clusters via Adrián Castelló, Rafael Mayo, Judit Planas, Enrique S. Quintana-Ortí RePara 2015, August Helsinki, Finland Exploiting Task-Parallelism on Clusters via Power/energy/utilization
More informationPerformance of computer systems
Performance of computer systems Many different factors among which: Technology Raw speed of the circuits (clock, switching time) Process technology (how many transistors on a chip) Organization What type
More informationArm Processor Technology Update and Roadmap
Arm Processor Technology Update and Roadmap ARM Processor Technology Update and Roadmap Cavium: Giri Chukkapalli is a Distinguished Engineer in the Data Center Group (DCG) Introduction to ARM Architecture
Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console
Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania Computer Architecture
Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation
Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation Jianting Zhang 1,2, Simin You 2, Le Gruenwald 3 1 Department of Computer Science, CUNY City College (CCNY) 2 Department of Computer
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5th Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
Memory-Based Cloud Architectures
Memory-Based Cloud Architectures (Or: Technical Challenges for OnDemand Business Software) Jan Schaffner Enterprise Platform and Integration Concepts Group Example: Enterprise Benchmarking
Intel SSD Data center evolution
Intel SSD Data center evolution March 2018 1 Intel Technology Innovations Fill the Memory and Storage Gap Performance and Capacity for Every Need Intel 3D NAND Technology Lower cost & higher density Intel
Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand
Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Miao Luo, Hao Wang, & D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State
Fundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?
Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network-Based Computing Laboratory Department of Computer
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
Addressing the Memory Wall
Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the
Tesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
Adapted from David Patterson's slides on graduate computer architecture
Mei Yang Adapted from David Patterson's slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture
EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer
Practical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
GPU Architecture. Alan Gray EPCC The University of Edinburgh
GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From
CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker
CUDA on ARM Update Developing Accelerated Applications on ARM Bas Aarts and Donald Becker CUDA on ARM: a forward-looking development platform for high performance, energy efficient hybrid computing It
CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
Ocelot: An Open Source Debugging and Compilation Framework for CUDA
Ocelot: An Open Source Debugging and Compilation Framework for CUDA Gregory Diamos*, Andrew Kerr*, Sudhakar Yalamanchili Computer Architecture and Systems Laboratory School of Electrical and Computer Engineering
7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT
7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT Draft Printed for SECO Murex S.A.S 2012 all rights reserved Murex Analytics Only global vendor of trading, risk management and processing systems focusing also
Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University
18-742 Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core Prof. Onur Mutlu Carnegie Mellon University Research Project Project proposal due: Jan 31 Project topics Does everyone have a topic?
Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
Computing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications
Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory
This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits
M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
Modern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
CPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering" Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
Scaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc
Scaling to Petaflop Ola Torudbakken Distinguished Engineer Sun Microsystems, Inc HPC Market growth is strong CAGR increased from 9.2% (2006) to 15.5% (2007) Market in 2007 doubled from 2003 (Source: IDC
Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies
Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gatech.edu
Energy-efficient acceleration of task dependency trees on CPU-GPU hybrids
Energy-efficient acceleration of task dependency trees on CPU-GPU hybrids Mark Silberstein - Technion Naoya Maruyama Tokyo Institute of Technology Mark Silberstein, Technion 1 The case for heterogeneous
CS560 Lecture Parallel Architecture 1
Parallel Architecture Announcements The RamCT merge is done! Please repost introductions. Manaf s office hours HW0 is due tomorrow night, please try RamCT submission HW1 has been posted Today Isoefficiency
The Fusion Distributed File System
Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique
Trends in the Infrastructure of Computing
Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much
Huge market -- essentially all high performance databases work this way
11/5/2017 Lecture 16 -- Parallel & Distributed Databases Parallel/distributed databases: goal provide exactly the same API (SQL) and abstractions (relational tables), but partition data across a bunch
This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital