Big Data Analytics Using Hardware-Accelerated Flash Storage

Size: px
Start display at page:

Download "Big Data Analytics Using Hardware-Accelerated Flash Storage"

Transcription

1 Big Data Analytics Using Hardware-Accelerated Flash Storage Sang-Woo Jun University of California, Irvine (Work done while at MIT) Flash Memory Summit, 2018

2 A Big Data Application: Personalized Genome Normal Genome Tumor Genome Cancer Patient Next-Generation Sequencing Identified Mutations Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads, Moncunill V. & Gonzalez S., et al., 2014

3 Cluster System for Personalized Genome Complex Algorithm 16 Machines (2 TB DRAM) Terabytes of Data 6 Hours

4 A Cheaper Alternative Using Hardware-Accelerated SSD + Collaborating with

5 Success with Important Applications + VS Graph analytics with billions of vertices 10x Performance 1/3 Power consumption Key-value cache with millions of users Comparable performance 1/5 Power consumption GraFBoost: Using accelerated flash storage for external graph analytics, ISCA 2018 BlueCache: A Distributed Flash-based Key Value Store, VLDB 2017

6 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

7 Storage for Analytics Fine-grained, Irregular access TB DRAM of DRAM Terabytes in size $$$ $8000/TB, 200W Our goal: $ $500/TB, 10W

8 Random Access Challenge in Flash Storage Flash DRAM Bandwidth: ~10 GB/s ~100 GB/s Access Granularity: 8192 Bytes 128 Bytes Wastes performance by not using most of fetched page Using 8 bytes in a 8192 Byte page uses 1/1024 of bandwidth!

9 Reconfigurable Hardware Acceleration Field Programmable Gate Array (FPGA) Program application-specific hardware High performance, Low power Reconfigurable to fit the application

10 Future of Reconfigurable Hardware Acceleration High Availability Amazon EC2 Intel Accelerated Xeon Accelerated SSD platforms Easy Programmability Bluespec Xilinx HLS Amazon F1 Shell Normal to do HW/SW codesign (Like with GPU computing)

11 Benefits of In-Storage Acceleration Acceleration Host Storage Lower latency, higher bandwidth to storage Reduce data movement cost Lower engineering cost Most data comes from storage anyways

12 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration

13 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

14 Large Graphs are Found Everywhere in Nature Human neural network Structure of the internet TB to 100s of TB in size! Social networks 1) Connectomics graph of the brain - Source: Connectomics and Graph Theory Jon Clayden, UCL 2) (Part of) the internet - Source: Criteo Engineering Blog 3) The Graph of a Social Network Source: Griff s Graphs

15 Various Models for Graph Analytics Vertex-Centric Pregel, GraphLab, TurboGraph, Mosaic, FlashGraph, GraphChi, Edge-Centric X-Stream, Linear Algebraic CombBLAS, GraphMat, Graphulo, Graph-Centric Giraph++, Frontier-Based Ligra, Gemini, Grunrock, Optimized Algorithms And more Galois,

16 Vertex-Centric Programming Model Popular model for efficient parallism and distribution Vertex program only sees neighbors Algorithm executed in terms of disjoint iterations Vertex program is executed on one or more Active Vertices Active Vertices 1 4 4

17 Algorithmic Representation of a Vertex Program Iteration for each v src in ActiveList do for each e(v src, v dst ) in G do ev = edge_program (v src. val, e. weight) v dst. next_val = vertex_update(v dst. next_val, ev) end for end for Random read-modify update into vertex data

18 Random Access Within an Active Vertex

19 Random Access Across Active Vertices!! Vertex data must be stored in DRAM? Data size and irregularity limit caching effectiveness

20 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Sort-Reduce Acceleration

21 General Problem of Irregular Array Updates For each Updating < idx, an arg array > xin xs: with a xstream idx = of f(x update idx requests, arg) xs and update function f X xs f Random Updates

22 Solution Part One - Sort Sort xs according to index X Sort Sorted xs xs Sequential Random Updates Much better than naïve random updates Terabyte graphs can generate terabyte logs Significant sorting overhead

23 Solution Part Two - Reduce Associative update function f can be interleaved with sort merge merge 1 7 e.g., (A + B) + C = A + (B + C) xs Reduced Overhead

24 Removing Random Access Using Sort-Reduce for each v src in ActiveList do for each e(v src, v dst ) in G do ev = edge_program (v src. val, e. weight) vlist. dst. append(v next_val = dst vertex_update(v, ev) dst. next_val, ev) end for end for v = SortReduce vertex_update (list) No more random access! Associativity requirement is not very restrictive (CombBLAS, PowerGraph, Mosaic, )

25 Normalized size of update stream Big Benefits from Interleaving Reduction Ratio of data copied at each sort phase Kron32 Kron32 WDC Kron30 WDC Twitter 90% Merge Iteration

26 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration

27 Accelerated Graph Analytics Architecture In-storage accelerator reduces data movement and cost Software Host (Server/PC/Laptop) FPGA Vertex Value Update Log (xs) Accelerator-Aware Flash Edge Management Program Multirate Aggregator Edge Weight Multirate 1GB DRAM 16-to-1 Merge-Sorter Sort-Reduce Accelerator Wire-Speed On-chip Sorter Flash Edge Data Vertex Data Partially Sort- Reduced Files Active Vertices

28 Evaluated Graph Analytic Systems In-memory Semi-External External GraphLab (IN) FlashGraph (SE1) X-Stream (SE2) GraphChi (EX) GraFSoft GraFBoost Distributed GraphLab: a framework for machine learning and data mining in the cloud, VLDB 2012 FlashGraph: Processing billion-node graphs on an array of commodity SSDs, FAST 2015 X-Stream: edge-centric graph processing using streaming partitions, SOSP 2013 GraphChi: Large-scale graph computation on just a PC, USENIX 2012

29 Evaluation Environment + 32-core Xeon 128 GB RAM 5x 0.5TB PCIe Flash $8000 All software experiments 4-core i5 4 GB RAM $400 + Virtex 7 FPGA 1TB custom flash 1GB on-board RAM $1000 $???

30 Evaluation Result Overview In-memory Semi-External External GraFBoost Large graphs: Fail Fail Slow Fast Medium graphs: Fail Fast Slow Fast Small graphs: Fast Fast Slow Fast GraFBoost has very low resource requirements Memory, CPU, Power

31 Normalized Performance Results with a Large Graph: Synthetic Scale 32 Kronecker Graph 0.5 TB in text, 4 Billion vertices GraphLab (IN) out of memory FlashGraph (SE1) out of memory GraphChi (EX) did not finish 1.7x PR BFS BC X-Stream (SE2) 2.8x 10x GraFBoost GraFSoft

32 Normalized Performance Results with a Large Graph: Web Data Commons Web Crawl 2 TB in text, 3 Billion vertices GraphLab (IN) out of memory GraphChi (EX) did not finish 4 Only GraFBoost succeeds in both graphs 3 2GraFBoost can run still larger graphs! 1 0 PR BFS BC SE1 SE2 GraFBoost GraFSoft

33 Normalized Performance Results with Smaller Graphs: Breadth-First Search 0.03 TB 0.04 Billion 0.09 TB 0.3 Billion 0.3 TB 1 Billion 5 4 Fastest! Slowest Slowest Slowest GraFSoft 0 twitter kron28 kron30 IN SE1 SE2 EX GraFBoost

34 Normalized Performance Results with a Medium Graph: Against an In-Memory Cluster Synthesized Kronecker Scale TB in text, 0.3 Billion vertices PR BFS 5xIN SE1 SE2 EX GraFBoost GraFSoft

35 GB Threads Watts GraFBoost Reduces Resource Requirements Conventional Conventional GraFBoost GraFBoost 16 2 External analytics Hardware Acceleration Conventional GraFBoost External Analytics + Hardware Acceleration

36 Evaluation Result Recap In-memory Semi-External External GraFBoost Large graphs: Fail Fail Slow Fast Medium graphs: Fail Fast Slow Fast Small graphs: Fast Fast Slow Fast GraFBoost has very low resource requirements Memory, CPU, Power

37 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

38 Key-Value Cache in the Data Center Application Server Application Server Key-Value GET/SET Requests Application Server User Requests KVS node KVS node MySQL Postgres SciDB Cache Layer Backend Storage Layer Transactional DB operations can become bottleneck Key-Value caches like memcached cache DB query results Facebook reported 300 TB of memcached capacity (2010)

39 1000 Requests\Second Miss Rate (%) Cached Application Performance Example Using the BG social network benchmark * MySQL backend and 16 GB memcached 20% % % 200 5% 0% 0 Application KVS Cache Throughput Miss Rate Number of Active Users (Millions) * Open-source project by Twitter, Inc. 39

40 Software KV Caches Are Fast, but Not Power-Efficient Achieving high throughput requires a lot of resources MICA Mega-KV 120 Million Requests Per Second 160 Million Requests Per Second 24 cores, 12 10GbE 2 GTX 780 GPUs 400 W 900 W Superscalar OoO pipeline is underutilized for KVS Only 3MB LLC is needed to sustain MICA s throughput

41 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration

42 BlueCache: Flash-based KVS Architecture App Server App Server App Server App Server KVS Node Dedicated Storage Network NIC Network stack Engines CPU Accelerators Cores Key-value data access DRAM KV-Index Cache NAND Flash KV-Data Store Store KV pairs in flash storage Pipelined KV cache accelerators Hardware accelerated dedicated storage network engines Log-structured KV data store

43 Evaluation Setup Frontend Backend BG social network benchmark server MySQL server 32-core Xeon server 64 GB DRAM 1.5 TB PCIe SSDs KV Cache * Open-source project by Twitter, Inc. memcached In-memory 48 Cores 16 GB DRAM FatCache* Flash-based 48 Cores 1 GB DRAM 0.5 TB PCIe Flash BlueCache Flash-based with acceleration 1 GB DRAM 0.5 TB PCIe Flash

44 1000 Requests\Second Miss Rate (%) Performance Evaluation with More Users 20.00% 500 KVS Cache Miss Rate Application Throughput 15.00% % (4.0M, 130KRPS) % 4.18X 0.00% 0 At much lower power! (40 W vs. 200 W) Number of of Active Users (Millions) memcached FatCache BlueCache

45 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

46 Our Custom Flash Card for Distributed Accelerated Flash Architectures Requirement 1: Modify flash management Requirement 2: Dedicated storage-area network Requirement 3: In-storage hardware accelerator Artix 7 FPGA (Flash Controller) 0.5 TB Flash FMC Connection to FPGA board (40 Gbps) "minflash: A Minimalistic Clustered Flash Array, DATE 2016 Inter-Controller Network (10Gbps/lane 4)

47 BlueDBM Cluster Architecture Host Server (24-Core) FPGA (VC707) minflash minflash 1GB RAM Ethernet Host Server (24-Core) FPGA (VC707) minflash minflash 1 TB PCIe Gen2 x8 FMC 10Gbps network 8 Host Server (24-Core) FPGA (VC707) minflash minflash Uniform latency of 100 µs!

48 The BlueDBM Cluster BlueDBM Storage Device

49 Our Research Enabled by BlueDBM 1. Scalable Multi-Access Flash Store for Big Data Analytics, FPGA BlueDBM: An Appliance for Big Data Analytics, ISCA A Transport-Layer Network for Distributed FPGA Platforms, FPL Large-scale high-dimensional nearest neighbor search using Flash memory with in-store processing, ReConFig minflash: A Minimalistic Clustered Flash Array, DATE Application-managed flash, FAST In-Storage Embedded Accelerator for Sparse Pattern Processing, HPEC Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM BlueCache: A Scalable Distributed Flash-based Key-value Store, VLDB GraFBoost: Using accelerated flash storage for external graph analytics, ISCA NoHost: Software-defined Network-attached KV Drives, (Under Review)

50 Future Work Next generation of BlueDBM Newer SSDs are much faster than 2.4 GB/s Prototype flash chips are aging More applications using sort-reduce Bioinformatics collaboration with Barcelona Supercomputering Center More applications using accelerated flash SQL acceleration collaboration with Samsung 50

51 Thank you Flash-Based Analytics Capacity Algorithm Acceleration

52 A Baseline Hardware Merge Sorter 2-to-1 Sorter Sorter emits one item at every cycle Easy to become bottleneck 52

53 High-Throughput Hardware Sorter using Sorting Networks Sorter emits sorted tuple at every cycle Bitonic sorter Bitonic half-cleaner Bitonic sorter Merge-Sorts at constant 4 GB/s "Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM 2016

54 Hardware 16-to-1 Sorter Reduces Sorting Passes Constructed as a pipelined tree of 2-to-1 Sorters 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter "Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM 2016

55 Sort-Reduce Process 1TB Update Stream Sort-Reduced 512MB Blocks Sort-Reduced 8GB Blocks Sort-Reduced 128GB Blocks Fully Sort-Reduced In-memory Sort-Reducers (1 GB Memory) 16-way Sort-Reducer

56 A Baseline Hardware Reducer delay Vertex Update Reducer consumes up to one item at every cycle Vertex update latency impacts performance 56

57 Wire-Speed Reducer Using Sorting Networks Single Reducer 2-to-1 Sorter Multi Reducer Single Reducer Single Reducers achieve single element wire-speed Multi Reducer uses sorting networks to achieve multi-rate Reduces at constant 4 GB/s "Wire-Speed Accelerator for Aggregation Operations, (Under Review) 57

GraFBoost: Using accelerated flash storage for external graph analytics

GraFBoost: Using accelerated flash storage for external graph analytics GraFBoost: Using accelerated flash storage for external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind MIT CSAIL Funded by: 1 Large Graphs are Found Everywhere in Nature

More information

BlueDBM: An Appliance for Big Data Analytics*

BlueDBM: An Appliance for Big Data Analytics* BlueDBM: An Appliance for Big Data Analytics* Arvind *[ISCA, 2015] Sang-Woo Jun, Ming Liu, Sungjin Lee, Shuotao Xu, Arvind (MIT) and Jamey Hicks, John Ankcorn, Myron King(Quanta) BigData@CSAIL Annual Meeting

More information

GraFBoost: Using accelerated flash storage for external graph analytics

GraFBoost: Using accelerated flash storage for external graph analytics : Using accelerated flash storage for external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind Department of Electrical Engineering and Computer Science Massachusetts Institute

More information

HADP Talk BlueDBM: An appliance for Big Data Analytics

HADP Talk BlueDBM: An appliance for Big Data Analytics HADP Talk BlueDBM: An appliance for Big Data Analytics Sang-Woo Jun* Ming Liu* Sungjin Lee* Jamey Hicks+ John Ankcorn+ Myron King+ Shuotao Xu* Arvind* *MIT Computer Science and Artificial Intelligence

More information

FPGP: Graph Processing Framework on FPGA

FPGP: Graph Processing Framework on FPGA FPGP: Graph Processing Framework on FPGA Guohao DAI, Yuze CHI, Yu WANG, Huazhong YANG E.E. Dept., TNLIST, Tsinghua University dgh14@mails.tsinghua.edu.cn 1 Big graph is widely used Big graph is widely

More information

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University GridGraph: Large-Scale Graph Processing on a Single Machine Using -Level Hierarchical Partitioning Xiaowei ZHU Tsinghua University Widely-Used Graph Processing Shared memory Single-node & in-memory Ligra,

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

NUMA-aware Graph-structured Analytics

NUMA-aware Graph-structured Analytics NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China Big Data Everywhere 00 Million Tweets/day 1.11

More information

arxiv: v1 [cs.db] 21 Oct 2017

arxiv: v1 [cs.db] 21 Oct 2017 : High-performance external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology

More information

Optimizing Cache Performance for Graph Analytics. Yunming Zhang Presentation

Optimizing Cache Performance for Graph Analytics. Yunming Zhang Presentation Optimizing Cache Performance for Graph Analytics Yunming Zhang 6.886 Presentation Goals How to optimize in-memory graph applications How to go about performance engineering just about anything In-memory

More information

FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs

FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs Da Zheng, Disa Mhembere, Randal Burns Department of Computer Science Johns Hopkins University Carey E. Priebe Department of Applied

More information

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Goal: fast and cost-efficient key-value store Store, retrieve, manage key-value objects Get(key)/Put(key,value)/Delete(key) Target: cluster-level

More information

Database Acceleration Solution Using FPGAs and Integrated Flash Storage

Database Acceleration Solution Using FPGAs and Integrated Flash Storage Database Acceleration Solution Using FPGAs and Integrated Flash Storage HK Verma, Xilinx Inc. August 2017 1 FPGA Analytics in Flash Storage System In-memory or Flash storage based DB reduce disk access

More information

LegUp: Accelerating Memcached on Cloud FPGAs

LegUp: Accelerating Memcached on Cloud FPGAs 0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are

More information

SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research

SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research 1 The world s most valuable resource Data is everywhere! May. 2017 Values from Data! Need infrastructures for

More information

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems

More information

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Kefei Wang and Feng Chen Louisiana State University SoCC '18 Carlsbad, CA Key-value Systems in Internet Services Key-value

More information

BlueDBM: An Appliance for Big Data Analytics

BlueDBM: An Appliance for Big Data Analytics BlueDBM: An Appliance for Big Data Analytics Sang-Woo Jun Ming Liu Sungjin Lee Jamey Hicks John Ankcorn Myron King Shuotao Xu Arvind Department of Electrical Engineering and Computer Science Massachusetts

More information

SociaLite: A Datalog-based Language for

SociaLite: A Datalog-based Language for SociaLite: A Datalog-based Language for Large-Scale Graph Analysis Jiwon Seo M OBIS OCIAL RESEARCH GROUP Overview Overview! SociaLite: language for large-scale graph analysis! Extensions to Datalog! Compiler

More information

A Framework for Processing Large Graphs in Shared Memory

A Framework for Processing Large Graphs in Shared Memory A Framework for Processing Large Graphs in Shared Memory Julian Shun Based on joint work with Guy Blelloch and Laxman Dhulipala (Work done at Carnegie Mellon University) 2 What are graphs? Vertex Edge

More information

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks

More information

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)

More information

Persistent Memory. High Speed and Low Latency. White Paper M-WP006

Persistent Memory. High Speed and Low Latency. White Paper M-WP006 Persistent Memory High Speed and Low Latency White Paper M-WP6 Corporate Headquarters: 3987 Eureka Dr., Newark, CA 9456, USA Tel: (51) 623-1231 Fax: (51) 623-1434 E-mail: info@smartm.com Customer Service:

More information

Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing

Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Laxman Dhulipala Joint work with Guy Blelloch and Julian Shun SPAA 07 Giant graph datasets Graph V E (symmetrized) com-orkut

More information

Big Data Systems on Future Hardware. Bingsheng He NUS Computing

Big Data Systems on Future Hardware. Bingsheng He NUS Computing Big Data Systems on Future Hardware Bingsheng He NUS Computing http://www.comp.nus.edu.sg/~hebs/ 1 Outline Challenges for Big Data Systems Why Hardware Matters? Open Challenges Summary 2 3 ANYs in Big

More information

Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li

Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Raghav Sethi Michael Kaminsky David G. Andersen Michael J. Freedman Goal: fast and cost-effective key-value store Target: cluster-level storage for

More information

Colin Cunningham, Intel Kumaran Siva, Intel Sandeep Mahajan, Oracle 03-Oct :45 p.m. - 5:30 p.m. Moscone West - Room 3020

Colin Cunningham, Intel Kumaran Siva, Intel Sandeep Mahajan, Oracle 03-Oct :45 p.m. - 5:30 p.m. Moscone West - Room 3020 Colin Cunningham, Intel Kumaran Siva, Intel Sandeep Mahajan, Oracle 03-Oct-2017 4:45 p.m. - 5:30 p.m. Moscone West - Room 3020 Big Data Talk Exploring New SSD Usage Models to Accelerate Cloud Performance

More information

Caribou: Intelligent Distributed Storage

Caribou: Intelligent Distributed Storage : Intelligent Distributed Storage Zsolt István, David Sidler, Gustavo Alonso Systems Group, Department of Computer Science, ETH Zurich 1 Rack-scale thinking In the Cloud ToR Switch Compute Compute + Provisioning

More information

DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture

DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture DCS-ctrl: A Fast and Flexible ice-control Mechanism for ice-centric Server Architecture Dongup Kwon 1, Jaehyung Ahn 2, Dongju Chae 2, Mohammadamin Ajdari 2, Jaewon Lee 1, Suheon Bae 1, Youngsok Kim 1,

More information

High Performance Packet Processing with FlexNIC

High Performance Packet Processing with FlexNIC High Performance Packet Processing with FlexNIC Antoine Kaufmann, Naveen Kr. Sharma Thomas Anderson, Arvind Krishnamurthy University of Washington Simon Peter The University of Texas at Austin Ethernet

More information

NOHOST: A New Storage Architecture for Distributed Storage Systems. Chanwoo Chung

NOHOST: A New Storage Architecture for Distributed Storage Systems. Chanwoo Chung NOHOST: A New Storage Architecture for Distributed Storage Systems by Chanwoo Chung B.S., Seoul National University (2014) Submitted to the Department of Electrical Engineering and Computer Science in

More information

Accelerating Real-Time Big Data. Breaking the limitations of captive NVMe storage

Accelerating Real-Time Big Data. Breaking the limitations of captive NVMe storage Accelerating Real-Time Big Data Breaking the limitations of captive NVMe storage 18M IOPs in 2u Agenda Everything related to storage is changing! The 3rd Platform NVM Express architected for solid state

More information

ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture

ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture Guohao Dai 1, Tianhao Huang 1, Yuze Chi 2, Ningyi Xu 3, Yu Wang 1, Huazhong Yang 1 1 Tsinghua University, 2 UCLA, 3 MSRA dgh14@mails.tsinghua.edu.cn

More information

Reconfigurable hardware for big data. Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland

Reconfigurable hardware for big data. Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland Reconfigurable hardware for big data Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland www.systems.ethz.ch Systems Group 7 faculty ~40 PhD ~8 postdocs Researching all

More information

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO FS Albis Streaming

More information

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Inventing Internet TV Available in more than 190 countries 104+ million subscribers Lots of Streaming == Lots of Traffic

More information

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Mosaic: Processing a Trillion-Edge Graph on a Single Machine Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys

More information

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010 Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming

More information

An In-Kernel NOSQL Cache for Range Queries Using FPGA NIC

An In-Kernel NOSQL Cache for Range Queries Using FPGA NIC An In-Kernel NOSQL Cache for Range Queries Using FPGA NIC Korechika Tamura Dept. of ICS, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Japan Email: tamura@arc.ics.keio.ac.jp Hiroki Matsutani Dept.

More information

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center 3/17/2015 2014 IBM Corporation Outline IBM OpenPower Platform Accelerating

More information

Network Requirements for Resource Disaggregation

Network Requirements for Resource Disaggregation Network Requirements for Resource Disaggregation Peter Gao (Berkeley), Akshay Narayan (MIT), Sagar Karandikar (Berkeley), Joao Carreira (Berkeley), Sangjin Han (Berkeley), Rachit Agarwal (Cornell), Sylvia

More information

An FPGA-based In-line Accelerator for Memcached

An FPGA-based In-line Accelerator for Memcached An FPGA-based In-line Accelerator for Memcached MAYSAM LAVASANI, HARI ANGEPAT, AND DEREK CHIOU THE UNIVERSITY OF TEXAS AT AUSTIN 1 Challenges for Server Processors Workload changes Social networking Cloud

More information

CS 5220: Parallel Graph Algorithms. David Bindel

CS 5220: Parallel Graph Algorithms. David Bindel CS 5220: Parallel Graph Algorithms David Bindel 2017-11-14 1 Graphs Mathematically: G = (V, E) where E V V Convention: V = n and E = m May be directed or undirected May have weights w V : V R or w E :

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO Albis Pocket

More information

Dell PowerEdge R730xd Servers with Samsung SM1715 NVMe Drives Powers the Aerospike Fraud Prevention Benchmark

Dell PowerEdge R730xd Servers with Samsung SM1715 NVMe Drives Powers the Aerospike Fraud Prevention Benchmark Dell PowerEdge R730xd Servers with Samsung SM1715 NVMe Drives Powers the Aerospike Fraud Prevention Benchmark Testing validation report prepared under contract with Dell Introduction As innovation drives

More information

Hewlett Packard Enterprise HPE GEN10 PERSISTENT MEMORY PERFORMANCE THROUGH PERSISTENCE

Hewlett Packard Enterprise HPE GEN10 PERSISTENT MEMORY PERFORMANCE THROUGH PERSISTENCE Hewlett Packard Enterprise HPE GEN10 PERSISTENT MEMORY PERFORMANCE THROUGH PERSISTENCE Digital transformation is taking place in businesses of all sizes Big Data and Analytics Mobility Internet of Things

More information

Service Oriented Performance Analysis

Service Oriented Performance Analysis Service Oriented Performance Analysis Da Qi Ren and Masood Mortazavi US R&D Center Santa Clara, CA, USA www.huawei.com Performance Model for Service in Data Center and Cloud 1. Service Oriented (end to

More information

Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center

Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center Naif Tarafdar, Thomas Lin, Eric Fukuda, Hadi Bannazadeh, Alberto Leon-Garcia, Paul Chow University of Toronto 1 Cloudy with

More information

Samsung Z-SSD SZ985. Ultra-low Latency SSD for Enterprise and Data Centers. Brochure

Samsung Z-SSD SZ985. Ultra-low Latency SSD for Enterprise and Data Centers. Brochure Samsung Z-SSD SZ985 Ultra-low Latency SSD for Enterprise and Data Centers Brochure 1 A high-speed storage device from the SSD technology leader Samsung Z-SSD SZ985 offers more capacity than PRAM-based

More information

ADVANCED IN-MEMORY COMPUTING USING SUPERMICRO MEMX SOLUTION

ADVANCED IN-MEMORY COMPUTING USING SUPERMICRO MEMX SOLUTION TABLE OF CONTENTS 2 WHAT IS IN-MEMORY COMPUTING (IMC) Benefits of IMC Concerns with In-Memory Processing Advanced In-Memory Computing using Supermicro MemX 1 3 MEMX ARCHITECTURE MemX Functionality and

More information

GraphQ: Graph Query Processing with Abstraction Refinement -- Scalable and Programmable Analytics over Very Large Graphs on a Single PC

GraphQ: Graph Query Processing with Abstraction Refinement -- Scalable and Programmable Analytics over Very Large Graphs on a Single PC GraphQ: Graph Query Processing with Abstraction Refinement -- Scalable and Programmable Analytics over Very Large Graphs on a Single PC Kai Wang, Guoqing Xu, University of California, Irvine Zhendong Su,

More information

Toward a Memory-centric Architecture

Toward a Memory-centric Architecture Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

Chapter 2 Parallel Hardware

Chapter 2 Parallel Hardware Chapter 2 Parallel Hardware Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Accelerating Data Science. Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland

Accelerating Data Science. Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland Accelerating Data Science Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland Data processing today: Appliances (large machines) Data Centers (many machines) Databases are

More information

SAP HANA. Jake Klein/ SVP SAP HANA June, 2013

SAP HANA. Jake Klein/ SVP SAP HANA June, 2013 SAP HANA Jake Klein/ SVP SAP HANA June, 2013 SAP 3 YEARS AGO Middleware BI / Analytics Core ERP + Suite 2013 WHERE ARE WE NOW? Cloud Mobile Applications SAP HANA Analytics D&T Changed Reality Disruptive

More information

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs

Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

Report. X-Stream Edge-centric Graph processing

Report. X-Stream Edge-centric Graph processing Report X-Stream Edge-centric Graph processing Yassin Hassan hassany@student.ethz.ch Abstract. X-Stream is an edge-centric graph processing system, which provides an API for scatter gather algorithms. The

More information

Memory-Based Cloud Architectures

Memory-Based Cloud Architectures Memory-Based Cloud Architectures ( Or: Technical Challenges for OnDemand Business Software) Jan Schaffner Enterprise Platform and Integration Concepts Group Example: Enterprise Benchmarking -) *%'+,#$)

More information

Rhythm: Harnessing Data Parallel Hardware for Server Workloads

Rhythm: Harnessing Data Parallel Hardware for Server Workloads Rhythm: Harnessing Data Parallel Hardware for Server Workloads Sandeep R. Agrawal $ Valentin Pistol $ Jun Pang $ John Tran # David Tarjan # Alvin R. Lebeck $ $ Duke CS # NVIDIA Explosive Internet Growth

More information

PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 4 th, 2014 NEC Corporation

PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 4 th, 2014 NEC Corporation PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 4 th, 2014 NEC Corporation 1. Overview of NEC PCIe SSD Appliance for Microsoft SQL Server Page 2 NEC Corporation

More information

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Maximizing heterogeneous system performance with ARM interconnect and CCIX Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable

More information

Building NVLink for Developers

Building NVLink for Developers Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized

More information

Graph Data Management

Graph Data Management Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of

More information

HPE SimpliVity. The new powerhouse in hyperconvergence. Boštjan Dolinar HPE. Maribor Lancom

HPE SimpliVity. The new powerhouse in hyperconvergence. Boštjan Dolinar HPE. Maribor Lancom HPE SimpliVity The new powerhouse in hyperconvergence Boštjan Dolinar HPE Maribor Lancom 2.2.2018 Changing requirements drive the need for Hybrid IT Application explosion Hybrid growth 2014 5,500 2015

More information

Exploring the Hidden Dimension in Graph Processing

Exploring the Hidden Dimension in Graph Processing Exploring the Hidden Dimension in Graph Processing Mingxing Zhang, Yongwei Wu, Kang Chen, *Xuehai Qian, Xue Li, and Weimin Zheng Tsinghua University *University of Shouthern California Graph is Ubiquitous

More information

Data Processing on Emerging Hardware

Data Processing on Emerging Hardware Data Processing on Emerging Hardware Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland 3 rd International Summer School on Big Data, Munich, Germany, 2017 www.systems.ethz.ch

More information

RACKSPACE ONMETAL I/O V2 OUTPERFORMS AMAZON EC2 BY UP TO 2X IN BENCHMARK TESTING

RACKSPACE ONMETAL I/O V2 OUTPERFORMS AMAZON EC2 BY UP TO 2X IN BENCHMARK TESTING RACKSPACE ONMETAL I/O V2 OUTPERFORMS AMAZON EC2 BY UP TO 2X IN BENCHMARK TESTING EXECUTIVE SUMMARY Today, businesses are increasingly turning to cloud services for rapid deployment of apps and services.

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

Maximizing Fraud Prevention Through Disruptive Architectures Delivering speed at scale.

Maximizing Fraud Prevention Through Disruptive Architectures Delivering speed at scale. Maximizing Fraud Prevention Through Disruptive Architectures Delivering speed at scale. January 2016 Credit Card Fraud prevention is among the most time-sensitive and high-value of IT tasks. The databases

More information

FAWN. A Fast Array of Wimpy Nodes. David Andersen, Jason Franklin, Michael Kaminsky*, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan

FAWN. A Fast Array of Wimpy Nodes. David Andersen, Jason Franklin, Michael Kaminsky*, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan FAWN A Fast Array of Wimpy Nodes David Andersen, Jason Franklin, Michael Kaminsky*, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan Carnegie Mellon University *Intel Labs Pittsburgh Energy in computing

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Parallel graph traversal for FPGA

Parallel graph traversal for FPGA LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

More information

Improving Packet Processing Performance of a Memory- Bounded Application

Improving Packet Processing Performance of a Memory- Bounded Application Improving Packet Processing Performance of a Memory- Bounded Application Jörn Schumacher CERN / University of Paderborn, Germany jorn.schumacher@cern.ch On behalf of the ATLAS FELIX Developer Team LHCb

More information

Evaluation of the Chelsio T580-CR iscsi Offload adapter

Evaluation of the Chelsio T580-CR iscsi Offload adapter October 2016 Evaluation of the Chelsio T580-CR iscsi iscsi Offload makes a difference Executive Summary As application processing demands increase and the amount of data continues to grow, getting this

More information

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees Pandian Raju 1, Rohan Kadekodi 1, Vijay Chidambaram 1,2, Ittai Abraham 2 1 The University of Texas at Austin 2 VMware Research

More information

Terabyte Sort on FPGA-Accelerated Flash Storage

Terabyte Sort on FPGA-Accelerated Flash Storage 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines Terabyte on FPGA-Accelerated Flash Storage Sang-Woo Jun, Shuotao Xu, Arvind Department of Electrical Engineering

More information

Scalable Distributed Training with Parameter Hub: a whirlwind tour

Scalable Distributed Training with Parameter Hub: a whirlwind tour Scalable Distributed Training with Parameter Hub: a whirlwind tour TVM Stack Optimization High-Level Differentiable IR Tensor Expression IR AutoTVM LLVM, CUDA, Metal VTA AutoVTA Edge FPGA Cloud FPGA ASIC

More information

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration

More information

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid

More information

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud Doug Burger Director, Hardware, Devices, & Experiences MSR NExT November 15, 2015 The Cloud is a Growing Disruptor for HPC Moore s

More information

Flashmatrix Technology

Flashmatrix Technology matrix Technology All-flash Super-Converged Platform By Ram Johri Memory Summit 2017 Santa Clara, CA 1 Traditional Von Neumann vs. Data Centric Architecture Memory Shared Memory Pool Memory matrix: Data

More information

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa

Camdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa Camdoop Exploiting In-network Aggregation for Big Data Applications costa@imperial.ac.uk joint work with Austin Donnelly, Antony Rowstron, and Greg O Shea (MSR Cambridge) MapReduce Overview Input file

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

New Approach to Unstructured Data

New Approach to Unstructured Data Innovations in All-Flash Storage Deliver a New Approach to Unstructured Data Table of Contents Developing a new approach to unstructured data...2 Designing a new storage architecture...2 Understanding

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant

More information

Computational Storage: Acceleration Through Intelligence & Agility

Computational Storage: Acceleration Through Intelligence & Agility Flash Memory Summit Computational Storage: Acceleration Through Intelligence & Agility Dr. Hao Zhong CEO & Co-Founder, ScaleFlux Flash Memory Summit 2018 Santa Clara, CA What s the Big Deal? High Cost

More information

4 Myths about in-memory databases busted

4 Myths about in-memory databases busted 4 Myths about in-memory databases busted Yiftach Shoolman Co-Founder & CTO @ Redis Labs @yiftachsh, @redislabsinc Background - Redis Created by Salvatore Sanfilippo (@antirez) OSS, in-memory NoSQL k/v

More information

Workload Optimized Systems: The Wheel of Reincarnation. Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013

Workload Optimized Systems: The Wheel of Reincarnation. Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013 Workload Optimized Systems: The Wheel of Reincarnation Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013 Outline Definition Technology Minicomputers Prime Workstations Apollo Graphics

More information

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem

More information

BDCC: Exploiting Fine-Grained Persistent Memories for OLAP. Peter Boncz

BDCC: Exploiting Fine-Grained Persistent Memories for OLAP. Peter Boncz BDCC: Exploiting Fine-Grained Persistent Memories for OLAP Peter Boncz NVRAM System integration: NVMe: block devices on the PCIe bus NVDIMM: persistent RAM, byte-level access Low latency Lower than Flash,

More information

Latency-Tolerant Software Distributed Shared Memory

Latency-Tolerant Software Distributed Shared Memory Latency-Tolerant Software Distributed Shared Memory Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin University of Washington USENIX ATC 2015 July 9, 2015 25

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Automatic Scaling Iterative Computations. Aug. 7 th, 2012 Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

Hardware NVMe implementation on cache and storage systems

Hardware NVMe implementation on cache and storage systems Hardware NVMe implementation on cache and storage systems Jerome Gaysse, IP-Maker Santa Clara, CA 1 Agenda Hardware architecture NVMe for storage NVMe for cache/application accelerator NVMe for new NVM

More information