Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing

1 Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing
Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, Steven Swanson
Department of Computer Science and Engineering, University of California, San Diego

2 Applications interact with files

3 How we process files today
(Figure: CPU, GPU, DRAM, and SSD; raw file data from the SSD becomes application objects in CPU DRAM before it reaches the GPU.)

4 The conventional model
CPU/APU: retrieve the file, parse the data and create objects, then run the compute kernel
GPU: run the compute kernel
Creating objects generates traffic on the CPU-memory bus and results in system overhead.
(Figure: the data path runs from the SSD through DRAM and the CPU before reaching the GPU.)
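A minimal sketch of this conventional flow, assuming a simple "%d %d" edge-list format and an Edge type like the one used later in the talk (the function name and parameters are illustrative, not code from the slides): the CPU parses the file into objects in host DRAM and only then copies the finished objects to the GPU.

/* Conventional model: parse on the CPU, then copy the objects to the GPU. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

typedef struct { int first, second; } Edge;

Edge *load_edges_conventional(const char *path, size_t max_edges, size_t *out_count)
{
    FILE *fin = fopen(path, "r");
    if (fin == NULL)
        return NULL;

    /* Retrieve the file and create application objects in host DRAM. */
    Edge *host_edges = (Edge *)malloc(max_edges * sizeof(Edge));
    size_t n = 0;
    while (n < max_edges &&
           fscanf(fin, "%d %d", &host_edges[n].first, &host_edges[n].second) == 2)
        n++;
    fclose(fin);

    /* The finished objects cross the CPU-memory bus again on their way to
       GPU device memory, where the compute kernel will consume them. */
    Edge *dev_edges = NULL;
    cudaMalloc((void **)&dev_edges, n * sizeof(Edge));
    cudaMemcpy(dev_edges, host_edges, n * sizeof(Edge), cudaMemcpyHostToDevice);

    free(host_edges);
    *out_count = n;
    return dev_edges;
}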

5 Overhead of creating objects
(Figure: percentage of execution time spent on object creation, GPU computation, other CPU computation, and moving data to the GPU for PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, and JASPA; object creation accounts for 64% of execution time on average.)
Creating objects is now the bottleneck in these applications.

6 High-speed storage doesn't help
(Figure: throughput of parsing input data in MB/sec with an HDD, an SSD, and a RAM drive for each GPU-accelerated application.)
There is very little difference among the different storage technologies.

7 Preventing P2P communication between peripherals
P2P is useless in the current model because we need the CPU to create the application objects.
(Figure: the desired data path runs directly from the SSD to the GPU, but the real data path in the current model detours through the CPU and DRAM.)

8 We need to rethink the processing model: Morpheus!

9 Outline
The Morpheus model
The system architecture
Experimental results
Conclusion

10 Morpheus: creating application objects in SSDs
(Figure: the processor inside the SSD creates the application objects, which then flow to the GPU and to the CPU/DRAM.)

11 The Morpheus model
CPU/APU: retrieve objects, then run the compute kernel
SSD: run the StorageApp that creates the objects
GPU: run the compute kernel
(Figure: the object-creation stage moves from the CPU into the SSD.)

12 Benefits of Morpheus
Bypass system overheads
Allow applications to take advantage of P2P data communication
Reduce traffic over system interconnects
Lower energy consumption
(Figure: recap of the Morpheus data path, with application objects created by the SSD's processor instead of the CPU.)

13 Outline
The Morpheus model
The system architecture
Experimental results
Conclusion

14 Implementing the Morpheus model
(Figure: system stack. Application layer: the host application, the Morpheus runtime, and the GPU runtime. Operating system: the Morpheus-NVMe driver and NVMe-P2P. Hardware: the PCIe interconnect, the GPU, and the Morpheus-SSD.)

15 Morpheus-NVMe
NVMe: an interface that defines how the host computer interacts with non-volatile memory devices
Morpheus-NVMe extensions:
MInit: installs a StorageApp and prepares it for execution
MRead: reads data and applies the StorageApp to the data being read
MWrite: applies the StorageApp to the data being written, then writes it
MDeinit: completes and releases the StorageApp
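The slide names the four commands but not their encoding. The sketch below is a hypothetical host-side view of how a driver might submit them; the opcode values, struct layout, and stub submit routine are made up for illustration and are not the real Morpheus-NVMe interface.

#include <stdint.h>
#include <stdio.h>

enum morpheus_opcode {          /* names follow the slide; values are made up */
    M_INIT   = 0x80,            /* install and prepare a StorageApp           */
    M_READ   = 0x81,            /* read data and apply the StorageApp to it   */
    M_WRITE  = 0x82,            /* apply the StorageApp, then write the data  */
    M_DEINIT = 0x83             /* complete and release the StorageApp        */
};

struct morpheus_cmd {           /* illustrative command descriptor */
    uint8_t  opcode;
    uint32_t storageapp_id;     /* which StorageApp the command targets        */
    uint64_t lba;               /* starting block address on the SSD           */
    uint32_t nblocks;           /* transfer length in blocks                   */
    uint64_t dma_addr;          /* host buffer or, with NVMe-P2P, a GPU address */
};

/* Stub: a real system would place the command on an NVMe submission queue
   through the Morpheus-NVMe driver. */
static int morpheus_submit(int fd, const struct morpheus_cmd *cmd)
{
    printf("fd %d: opcode 0x%x, app %u\n", fd, cmd->opcode, cmd->storageapp_id);
    return 0;
}

/* Typical lifetime of one StorageApp-backed read. */
static int morpheus_read_with_app(int fd, uint32_t app_id, uint64_t lba,
                                  uint32_t nblocks, uint64_t dma_addr)
{
    struct morpheus_cmd init   = { M_INIT,   app_id, 0,   0,       0        };
    struct morpheus_cmd read   = { M_READ,   app_id, lba, nblocks, dma_addr };
    struct morpheus_cmd deinit = { M_DEINIT, app_id, 0,   0,       0        };

    if (morpheus_submit(fd, &init))  return -1;  /* MInit: load the StorageApp */
    if (morpheus_submit(fd, &read))  return -1;  /* MRead: parse while reading */
    return morpheus_submit(fd, &deinit);         /* MDeinit: release resources */
}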

16 Morpheus-SSD
Manages Morpheus-NVMe commands and executes StorageApps on the SSD's embedded cores
(Figure: SSD internals. A PCIe/NVMe interface connects the drive to the host over PCI Express; an in-storage interconnect links the embedded cores, accelerators, DMA engine, the DDR3/DDR4 DRAM controller with the SSD DRAM, and the flash interface with the flash memory.)

17 NVMe-P2P
Maps GPU device memory to a PCIe BAR using AMD DirectGMA or NVIDIA GPUDirect
Generates Morpheus-NVMe commands that use GPU memory addresses as the DMA targets
Morpheus directly pulls/pushes data from/to GPU addresses without going through main memory
(Figure: application objects move directly between the SSD and GPU memory.)
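Continuing the hypothetical Morpheus-NVMe sketch above, and assuming a placeholder map_gpu_memory_to_bar() helper standing in for the vendor-specific GPUDirect/DirectGMA mapping step, an NVMe-P2P read would differ only in its DMA target:

/* Assumed helper: pins GPU device memory, exposes it through a PCIe BAR
   (via NVIDIA GPUDirect or AMD DirectGMA), and returns a DMA-able bus
   address. The name and signature are placeholders. */
uint64_t map_gpu_memory_to_bar(void *gpu_dev_ptr, size_t bytes);

static int morpheus_read_to_gpu(int fd, uint32_t app_id, uint64_t lba,
                                uint32_t nblocks, void *gpu_dev_ptr, size_t bytes)
{
    /* 1. Expose the GPU buffer on the PCIe bus. */
    uint64_t gpu_bar_addr = map_gpu_memory_to_bar(gpu_dev_ptr, bytes);

    /* 2. Issue MRead with the GPU address as the DMA target: the SSD runs the
          StorageApp and pushes the finished objects straight into GPU memory,
          never touching host DRAM. */
    struct morpheus_cmd cmd = { M_READ, app_id, lba, nblocks, gpu_bar_addr };
    return morpheus_submit(fd, &cmd);
}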

18 Creating a StorageApp
Use C to compose a StorageApp
Use the Morpheus-SSD library to access SSD resources
The compiler generates machine code that the embedded processors can execute

StorageApp:

/* Parses "%d %d" edge pairs from an SSD-side input stream and copies them
   into the destination buffer in 4096-edge batches. */
int inputapplet(ms_stream ssd_input_stream, void *edge_array)
{
    Edge ssd_edge_array[4096];   /* staging buffer inside the SSD */
    int i = 0;
    while (ms_scanf(ssd_input_stream, "%d %d",
                    &ssd_edge_array[i % 4096].first,
                    &ssd_edge_array[i % 4096].second) == 2) {
        i++;
        if (i % 4096 == 0) {     /* flush a full batch to the destination */
            ms_memcpy(edge_array, ssd_edge_array, sizeof(Edge) * 4096);
            edge_array = (char *)edge_array + sizeof(Edge) * 4096;
        }
    }
    /* flush the final, partially filled batch */
    ms_memcpy(edge_array, ssd_edge_array, sizeof(Edge) * (i % 4096));
    return i;                    /* number of edges parsed */
}

19 Invoking a StorageApp in host applications
Like calling a function
Prepare arguments using the Morpheus runtime library
The runtime library interacts with the driver to utilize the SSD facilities

void test_distributed_page_rank(char *graphfilename, int num_ofvertex,
                                int num_ofedges, int iterations)
{
    FILE *fin;
    ms_stream ssd_input_stream;
    void **arg_list;

    /* Wrap the open file in an SSD-side input stream. */
    fin = fopen(graphfilename, "r");
    ssd_input_stream = ms_stream_create(fin);

    /* Allocate the destination buffer and let the StorageApp fill it. */
    Edge *edge_array = (Edge *)malloc(sizeof(Edge) * num_ofedges);
    inputapplet(ssd_input_stream, edge_array);

    ms_stream_destroy(ssd_input_stream);
    /* The rest of the code... */
}

20 Outline
The Morpheus model
The system architecture
Experimental results
Conclusion

21 Experimental setup
Intel Xeon E v2 processor
NVIDIA K20 GPU
Morpheus-SSD: a 512 GB SSD with a PMCS (now Microsemi) NVMe controller
(Photo: the K20 GPU and the Morpheus-SSD in the test machine.)

22 Morpheus improves performance
(Figure: speedup for each GPU-accelerated application. Morpheus-SSD alone achieves a 1.32x average speedup; Morpheus-SSD with NVMe-P2P achieves 1.39x.)

23 Morpheus saves power/energy
(Figure: power and energy, normalized to the baseline, for each GPU-accelerated application under the 1.32x and 1.39x configurations, with highlighted reductions of 7% and 42%.)

24 Morpheus makes wimpy servers more competitive
(Figure: speedup over 2.5 GHz CPUs for a 1.2 GHz CPU alone, Morpheus-SSD on the 1.2 GHz CPU, and Morpheus-SSD on the 1.2 GHz CPU with NVMe-P2P; the Morpheus configurations average 1.08x and 1.12x.)
Morpheus-SSD plus wimpy CPUs can compete with high-end servers.

25 Conclusion
Object creation (deserialization/serialization) becomes a new bottleneck for high-performance heterogeneous computers
The Morpheus model leverages under-utilized computing resources in the storage device to bypass system overheads and to enable efficient data communication mechanisms
Morpheus-SSD improves application performance by 1.39x and allows wimpy servers to compete with high-end servers

26 Thank you!
Hung-Wei Tseng will be an assistant professor at North Carolina State University starting this August.
