Big Data Analytics Using Hardware-Accelerated Flash Storage

Big Data Analytics Using Hardware-Accelerated Flash Storage Sang-Woo Jun University of California, Irvine (Work done while at MIT) Flash Memory Summit, 2018

A Big Data Application: Personalized Genome
[Diagram: Cancer Patient -> Next-Generation Sequencing -> Normal Genome + Tumor Genome -> Identified Mutations]
Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads, Moncunill V. & Gonzalez S., et al., 2014

Cluster System for Personalized Genome Complex Algorithm 16 Machines (2 TB DRAM) Terabytes of Data 6 Hours

A Cheaper Alternative Using Hardware-Accelerated SSD [+ collaborator logos]

Success with Important Applications
Graph analytics with billions of vertices: 10x performance, 1/3 power consumption (GraFBoost: Using accelerated flash storage for external graph analytics, ISCA 2018)
Key-value cache with millions of users: comparable performance, 1/5 power consumption (BlueCache: A Scalable Distributed Flash-based Key-value Store, VLDB 2017)

Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

Storage for Analytics
Fine-grained, irregular access; terabytes in size
TBs of DRAM: $$$ ($8000/TB, 200 W)
Our goal: $ ($500/TB, 10 W)

Random Access Challenge in Flash Storage
                    Flash        DRAM
Bandwidth:          ~10 GB/s     ~100 GB/s
Access granularity: 8192 bytes   128 bytes
Flash wastes performance by not using most of each fetched page: using 8 bytes of an 8192-byte page uses only 1/1024 of the bandwidth!
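
To make the penalty concrete, a back-of-the-envelope sketch in Python, using only the numbers above (the 8-byte request size is just an illustration):

def effective_bandwidth(raw_gbps, page_bytes, useful_bytes):
    # Only `useful_bytes` of every `page_bytes` fetched are actually consumed.
    return raw_gbps * (useful_bytes / page_bytes)

print(effective_bandwidth(10, 8192, 8))    # flash: ~0.01 GB/s, i.e. 1/1024 of its raw bandwidth
print(effective_bandwidth(100, 128, 8))    # DRAM:  ~6.25 GB/s for the same 8-byte accesses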

Reconfigurable Hardware Acceleration Field Programmable Gate Array (FPGA) Program application-specific hardware High performance, Low power Reconfigurable to fit the application

Future of Reconfigurable Hardware Acceleration
High availability: Amazon EC2, Intel accelerated Xeon, accelerated SSD platforms
Easy programmability: Bluespec, Xilinx HLS, Amazon F1 Shell
HW/SW codesign becomes normal practice (like with GPU computing)

Benefits of In-Storage Acceleration
[Diagram: accelerator placed with the storage, next to the host]
Lower latency, higher bandwidth to storage
Reduced data movement cost
Lower engineering cost
Most data comes from storage anyway

The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration

Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

Large Graphs are Found Everywhere in Nature
Human neural network, structure of the internet, social networks: TB to 100s of TB in size!
1) Connectomics graph of the brain - Source: Connectomics and Graph Theory, Jon Clayden, UCL
2) (Part of) the internet - Source: Criteo Engineering Blog
3) The graph of a social network - Source: Griff's Graphs

Various Models for Graph Analytics
Vertex-Centric: Pregel, GraphLab, TurboGraph, Mosaic, FlashGraph, GraphChi, ...
Edge-Centric: X-Stream, ...
Linear Algebraic: CombBLAS, GraphMat, Graphulo, ...
Graph-Centric: Giraph++, ...
Frontier-Based: Ligra, Gemini, Gunrock, ...
Optimized Algorithms: Galois, ...
And more

Vertex-Centric Programming Model
Popular model for efficient parallelism and distribution
Vertex program only sees neighbors
Algorithm executed in terms of disjoint iterations
Vertex program is executed on one or more Active Vertices
[Figure: example graph with vertex values, highlighting the active vertices]

Algorithmic Representation of a Vertex Program Iteration

for each v_src in ActiveList do
    for each e(v_src, v_dst) in G do
        ev = edge_program(v_src.val, e.weight)
        v_dst.next_val = vertex_update(v_dst.next_val, ev)
    end for
end for

Random read-modify updates into vertex data
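
As an illustration only (not the GraFBoost implementation), the same iteration could be written in Python roughly as below, with edge_program and vertex_update left as algorithm-specific callbacks; for BFS, for instance, edge_program(d, w) = d + 1 and vertex_update = min.

def run_iteration(active_list, graph, val, next_val, edge_program, vertex_update):
    # graph[v] yields (v_dst, weight) pairs for the out-edges of v
    for v_src in active_list:
        for v_dst, weight in graph[v_src]:
            ev = edge_program(val[v_src], weight)
            # random read-modify-write into the vertex value array:
            next_val[v_dst] = vertex_update(next_val[v_dst], ev)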

Random Access Within an Active Vertex

Random Access Across Active Vertices!! Vertex data must be stored in DRAM? Data size and irregularity limit caching effectiveness

The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Sort-Reduce Acceleration

General Problem of Irregular Array Updates
Updating an array x with a stream xs of update requests <idx, arg> and an update function f:
for each <idx, arg> in xs: x[idx] = f(x[idx], arg)
This generates random updates into x.
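
In code, this is just a scatter-update loop (a minimal sketch; x, xs, and f are whatever the application defines):

def apply_updates(x, xs, f):
    # xs is a stream of (idx, arg) requests; each one touches an effectively random location of x
    for idx, arg in xs:
        x[idx] = f(x[idx], arg)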

Solution Part One - Sort
Sort the update stream xs according to index, turning random updates into sequential updates.
Much better than naïve random updates, but terabyte graphs can generate terabyte logs: significant sorting overhead.

Solution Part Two - Reduce
An associative update function f can be interleaved with the sort merges, e.g., (A + B) + C = A + (B + C).
[Figure: at each merge step, entries of xs with equal indices are combined, shrinking the stream.]
Reduced overhead
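
A minimal software sketch of the idea, assuming f is associative: while merging two index-sorted runs, entries with equal indices are combined on the spot instead of being copied through, which is what shrinks the stream at every merge step.

def merge_reduce(a, b, f):
    # Merge two runs of (idx, val) pairs, each sorted by idx, combining equal indices with f.
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i][0] <= b[j][0]):
            idx, val = a[i]; i += 1
        else:
            idx, val = b[j]; j += 1
        if out and out[-1][0] == idx:
            out[-1] = (idx, f(out[-1][1], val))   # reduce instead of copying the duplicate
        else:
            out.append((idx, val))
    return out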

Removing Random Access Using Sort-Reduce

for each v_src in ActiveList do
    for each e(v_src, v_dst) in G do
        ev = edge_program(v_src.val, e.weight)
        list.append(<v_dst, ev>)
    end for
end for
v = SortReduce_vertex_update(list)

No more random access! The associativity requirement is not very restrictive (CombBLAS, PowerGraph, Mosaic, ...)
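
Putting the pieces together, a hedged end-to-end sketch of one iteration with sort-reduce (plain in-memory sort plus adjacent reduction here, whereas GraFBoost streams the same steps through flash and the hardware sorter), assuming vertex_update is associative:

def sort_reduce(log, f):
    log.sort(key=lambda kv: kv[0])                # sort the update log by destination vertex
    reduced = []
    for idx, val in log:
        if reduced and reduced[-1][0] == idx:
            reduced[-1] = (idx, f(reduced[-1][1], val))
        else:
            reduced.append((idx, val))
    return reduced

def run_iteration_sort_reduce(active_list, graph, val, next_val, edge_program, vertex_update):
    log = []
    for v_src in active_list:
        for v_dst, weight in graph[v_src]:
            log.append((v_dst, edge_program(val[v_src], weight)))   # append-only, no random access
    for v_dst, ev in sort_reduce(log, vertex_update):
        next_val[v_dst] = vertex_update(next_val[v_dst], ev)        # sequential, index-ordered updates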

Big Benefits from Interleaving Reduction
[Chart: normalized size of the update stream (ratio of data copied at each sort phase) over merge iterations 0 to 3 for the Kron32, Kron30, WDC, and Twitter graphs; a 90% reduction is annotated.]

The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration

Accelerated Graph Analytics Architecture
In-storage accelerator reduces data movement and cost
[Diagram: Software on the host (server/PC/laptop): edge program, accelerator-aware flash management. FPGA: sort-reduce accelerator with a wire-speed on-chip sorter, multi-rate aggregator, 16-to-1 merge-sorter, and 1 GB DRAM. Flash: edge data, vertex data, vertex value update log (xs), partially sort-reduced files, active vertices.]

Evaluated Graph Analytic Systems
In-memory: GraphLab (IN)
Semi-external: FlashGraph (SE1), X-Stream (SE2)
External: GraphChi (EX), GraFSoft, GraFBoost
Distributed GraphLab: a framework for machine learning and data mining in the cloud, VLDB 2012
FlashGraph: Processing billion-node graphs on an array of commodity SSDs, FAST 2015
X-Stream: edge-centric graph processing using streaming partitions, SOSP 2013
GraphChi: Large-scale graph computation on just a PC, USENIX 2012

Evaluation Environment
All software experiments: 32-core Xeon, 128 GB RAM, 5x 0.5 TB PCIe flash ($8000)
Accelerated platform: 4-core i5, 4 GB RAM ($400) + Virtex 7 FPGA with 1 TB custom flash and 1 GB on-board RAM ($1000 + $???)

Evaluation Result Overview
                In-memory   Semi-External   External   GraFBoost
Large graphs:   Fail        Fail            Slow       Fast
Medium graphs:  Fail        Fast            Slow       Fast
Small graphs:   Fast        Fast            Slow       Fast
GraFBoost has very low resource requirements: memory, CPU, power

Results with a Large Graph: Synthetic Scale-32 Kronecker Graph (0.5 TB in text, 4 billion vertices)
GraphLab (IN): out of memory. FlashGraph (SE1): out of memory. GraphChi (EX): did not finish.
[Chart: normalized performance for PR, BFS, and BC comparing X-Stream (SE2), GraFBoost, and GraFSoft, with 1.7x, 2.8x, and 10x speedup annotations.]

Results with a Large Graph: Web Data Commons Web Crawl (2 TB in text, 3 billion vertices)
GraphLab (IN): out of memory. GraphChi (EX): did not finish.
Only GraFBoost succeeds on both large graphs, and GraFBoost can run still larger graphs!
[Chart: normalized performance for PR, BFS, and BC comparing SE1, SE2, GraFBoost, and GraFSoft.]

Results with Smaller Graphs: Breadth-First Search
twitter (0.03 TB, 0.04 billion vertices), kron28 (0.09 TB, 0.3 billion vertices), kron30 (0.3 TB, 1 billion vertices)
[Chart: normalized performance of IN, SE1, SE2, EX, GraFBoost, and GraFSoft on each graph, annotated "Fastest!" and "Slowest".]

Results with a Medium Graph: Against an In-Memory Cluster
Synthesized Kronecker scale-28 graph (0.09 TB in text, 0.3 billion vertices)
[Chart: normalized performance for PR and BFS comparing IN, SE1, SE2, EX, GraFBoost, and GraFSoft, with a 5x annotation.]

GraFBoost Reduces Resource Requirements
[Charts: memory (GB), threads, and power (watts) for a conventional system vs. GraFBoost, annotated "External analytics" and "Hardware Acceleration"; GraFBoost needs only a small fraction of the memory (2 GB vs. 80 GB), threads, and power of the conventional system.]

Evaluation Result Recap
                In-memory   Semi-External   External   GraFBoost
Large graphs:   Fail        Fail            Slow       Fast
Medium graphs:  Fail        Fast            Slow       Fast
Small graphs:   Fast        Fast            Slow       Fast
GraFBoost has very low resource requirements: memory, CPU, power

Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

Key-Value Cache in the Data Center
[Diagram: user requests -> application servers -> key-value GET/SET requests -> KVS nodes (cache layer) -> MySQL / Postgres / SciDB (backend storage layer)]
Transactional DB operations can become the bottleneck
Key-value caches like memcached cache DB query results
Facebook reported 300 TB of memcached capacity (2010)
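
For context, the cache layer typically follows the standard cache-aside pattern; a minimal sketch (the kvs and db objects and their get/set/query methods are placeholders, not BlueCache's API):

def cached_query(kvs, db, key, sql, ttl=300):
    # Try the key-value cache first; on a miss, run the backend query and cache the result.
    result = kvs.get(key)
    if result is None:
        result = db.query(sql)       # expensive transactional DB operation
        kvs.set(key, result, ttl)    # populate the cache for subsequent GETs
    return result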

Cached Application Performance Example
Using the BG social network benchmark* with a MySQL backend and 16 GB memcached
[Chart: application throughput (1000 requests/second) and KVS cache miss rate (%) vs. number of active users (2 to 5.5 million).]
* Open-source project by Twitter, Inc.

Software KV Caches Are Fast, but Not Power-Efficient
Achieving high throughput requires a lot of resources:
MICA: 120 million requests per second, 24 cores + 12x 10GbE, 400 W
Mega-KV: 160 million requests per second, 2 GTX 780 GPUs, 900 W
The superscalar OoO pipeline is underutilized for KVS; only 3 MB of LLC is needed to sustain MICA's throughput.

The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration

BlueCache: Flash-based KVS Architecture
[Diagram: app servers connect to KVS nodes over a dedicated storage network; each node has a NIC, network stack engines, CPU cores, key-value data access accelerators, a DRAM KV-index cache, and a NAND flash KV-data store.]
Store KV pairs in flash storage
Pipelined KV cache accelerators
Hardware-accelerated dedicated storage network engines
Log-structured KV data store
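
A rough software analogue of the data layout (a sketch under stated assumptions, not BlueCache's actual implementation): DRAM holds only a small index mapping keys to flash offsets, while values are appended to a log kept on flash.

class FlashBackedKVCache:
    # Toy model: in-DRAM index over a log-structured value store kept on "flash" (a file).
    def __init__(self, path):
        self.index = {}                       # DRAM: key -> (offset, length) in the flash log
        self.log = open(path, "ab+")          # flash: append-only log of values

    def set(self, key, value):
        self.log.seek(0, 2)                   # log-structured: always append at the end
        offset = self.log.tell()
        self.log.write(value)
        self.index[key] = (offset, len(value))

    def get(self, key):
        loc = self.index.get(key)
        if loc is None:
            return None                       # miss
        offset, length = loc
        self.log.seek(offset)
        return self.log.read(length)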

Evaluation Setup
Frontend: BG social network benchmark server
Backend: MySQL server on a 32-core Xeon server, 64 GB DRAM, 1.5 TB PCIe SSDs
KV Cache options:
  memcached (in-memory): 48 cores, 16 GB DRAM
  FatCache* (flash-based): 48 cores, 1 GB DRAM, 0.5 TB PCIe flash
  BlueCache (flash-based with acceleration): 1 GB DRAM, 0.5 TB PCIe flash
* Open-source project by Twitter, Inc.

Performance Evaluation with More Users
[Chart: application throughput (1000 requests/second) and KVS cache miss rate (%) vs. number of active users (2 to 5.5 million) for memcached, FatCache, and BlueCache, annotated (4.0M, 130 KRPS) and 4.18x.]
At much lower power! (40 W vs. 200 W)

Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

Our Custom Flash Card for Distributed Accelerated Flash Architectures
Requirement 1: Modify flash management
Requirement 2: Dedicated storage-area network
Requirement 3: In-storage hardware accelerator
[Photo: Artix 7 FPGA (flash controller), 0.5 TB flash, FMC connection to FPGA board (40 Gbps), inter-controller network (10 Gbps/lane x 4)]
minflash: A Minimalistic Clustered Flash Array, DATE 2016

BlueDBM Cluster Architecture
[Diagram: 8 nodes connected by Ethernet and a 10 Gbps storage network; each node is a 24-core host server with a VC707 FPGA (1 GB RAM) attached over PCIe Gen2 x8 and two minflash cards (1 TB) over FMC.]
Uniform latency of 100 µs!

[Photos: the BlueDBM cluster and a BlueDBM storage device]

Our Research Enabled by BlueDBM 1. Scalable Multi-Access Flash Store for Big Data Analytics, FPGA 2012 2. BlueDBM: An Appliance for Big Data Analytics, ISCA 2015 3. A Transport-Layer Network for Distributed FPGA Platforms, FPL 2015 4. Large-scale high-dimensional nearest neighbor search using Flash memory with in-store processing, ReConFig 2015 5. minflash: A Minimalistic Clustered Flash Array, DATE 2016 6. Application-managed flash, FAST 2016 7. In-Storage Embedded Accelerator for Sparse Pattern Processing, HPEC 2016 8. Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM 2017 9. BlueCache: A Scalable Distributed Flash-based Key-value Store, VLDB 2017 10. GraFBoost: Using accelerated flash storage for external graph analytics, ISCA 2018 11. NoHost: Software-defined Network-attached KV Drives, (Under Review)

Future Work
Next generation of BlueDBM: newer SSDs are much faster than 2.4 GB/s, and the prototype flash chips are aging
More applications using sort-reduce: bioinformatics collaboration with the Barcelona Supercomputing Center
More applications using accelerated flash: SQL acceleration collaboration with Samsung

Thank you Flash-Based Analytics Capacity Algorithm Acceleration

A Baseline Hardware Merge Sorter
[Diagram: 2-to-1 sorter]
The sorter emits one item every cycle, so it easily becomes the bottleneck.

High-Throughput Hardware Sorter Using Sorting Networks
The sorter emits a sorted tuple at every cycle.
[Diagram: bitonic sorter -> bitonic half-cleaner -> bitonic sorter]
Merge-sorts at a constant 4 GB/s
Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM 2017
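
For intuition, a small Python sketch of a bitonic merge built from half-cleaners, the structure the hardware implements with fixed compare-exchange stages to emit one sorted tuple per cycle (an illustrative model, not the FPGA code):

def bitonic_merge(seq):
    # Sort a bitonic sequence (ascending then descending) of power-of-two length.
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    lo, hi = list(seq[:half]), list(seq[half:])
    for i in range(half):                 # half-cleaner: one column of compare-exchange units
        if lo[i] > hi[i]:
            lo[i], hi[i] = hi[i], lo[i]
    return bitonic_merge(lo) + bitonic_merge(hi)

# Merging two sorted 8-tuples: reverse one to form a bitonic sequence, then merge.
a = [1, 4, 6, 7, 9, 12, 15, 20]
b = [2, 3, 5, 8, 10, 11, 13, 14]
print(bitonic_merge(a + b[::-1]))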

Hardware 16-to-1 Sorter Reduces Sorting Passes
Constructed as a pipelined tree of 2-to-1 sorters
[Diagram: tree of 2-to-1 sorters]
Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM 2017
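
In software terms, the 16-to-1 sorter is just a balanced tree of two-way merges; a sketch using Python generators in place of the pipelined hardware stages. A 16-way merger cuts the number of external merge passes from roughly log2 to log16 of the number of first-level runs.

import heapq

def merge_tree(runs):
    # Pair up runs and merge them 2-to-1, level by level, like the pipelined hardware tree.
    runs = list(runs)
    while len(runs) > 1:
        nxt = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                nxt.append(heapq.merge(runs[i], runs[i + 1]))   # one "2-to-1 sorter"
            else:
                nxt.append(runs[i])
        runs = nxt
    return runs[0]

print(list(merge_tree([[1, 5, 9], [2, 6], [3, 7], [4, 8, 10]])))   # [1, 2, ..., 10]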

Sort-Reduce Process
[Diagram: a 1 TB update stream is first turned into sort-reduced 512 MB blocks by in-memory sort-reducers (1 GB memory); a 16-way sort-reducer then merges them into sort-reduced 8 GB blocks, then 128 GB blocks, and finally a fully sort-reduced stream.]
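
A quick pass-count check for this hierarchy (a sketch assuming, as in the diagram, 512 MB first-level runs and a 16-way merger, and ignoring the shrinkage that reduction provides at each level):

import math

total  = 1 << 40       # 1 TB update stream
run    = 512 << 20     # 512 MB runs from the in-memory sort-reducers
fan_in = 16            # 16-way sort-reducer

runs   = math.ceil(total / run)               # 2048 first-level runs
passes = math.ceil(math.log(runs, fan_in))    # 3 merge passes: 512 MB -> 8 GB -> 128 GB -> done
print(runs, passes)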

A Baseline Hardware Reducer
[Diagram: reducer with a delay through the vertex-update unit]
The reducer consumes up to one item per cycle, so the vertex update latency impacts performance.

Wire-Speed Reducer Using Sorting Networks
[Diagram: single reducers and a 2-to-1 sorter composed into a multi-reducer]
Single reducers achieve single-element wire speed; the multi-reducer uses sorting networks to achieve a multi-element rate.
Reduces at a constant 4 GB/s
Wire-Speed Accelerator for Aggregation Operations, (Under Review)