Big Data Analytics Using Hardware-Accelerated Flash Storage

Size: px

Start display at page:

Download "Big Data Analytics Using Hardware-Accelerated Flash Storage"

Patience Lambert
5 years ago
Views:

1 Big Data Analytics Using Hardware-Accelerated Flash Storage Sang-Woo Jun University of California, Irvine (Work done while at MIT) Flash Memory Summit, 2018

A Big Data Application: Personalized Genome

structural variations in cancer by directly

2 A Big Data Application: Personalized Genome Normal Genome Tumor Genome Cancer Patient Next-Generation Sequencing Identified Mutations Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads, Moncunill V. & Gonzalez S., et al., 2014

3 Cluster System for Personalized Genome Complex Algorithm 16 Machines (2 TB DRAM) Terabytes of Data 6 Hours

4 A Cheaper Alternative Using Hardware-Accelerated SSD + Collaborating with

performance 1/5 Power consumption GraFBoost: Using accelerated flash storage for

5 Success with Important Applications + VS Graph analytics with billions of vertices 10x Performance 1/3 Power consumption Key-value cache with millions of users Comparable performance 1/5 Power consumption GraFBoost: Using accelerated flash storage for external graph analytics, ISCA 2018 BlueCache: A Distributed Flash-based Key Value Store, VLDB 2017

6 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

7 Storage for Analytics Fine-grained, Irregular access TB DRAM of DRAM Terabytes in size $$$ $8000/TB, 200W Our goal: $ $500/TB, 10W

8 Random Access Challenge in Flash Storage Flash DRAM Bandwidth: ~10 GB/s ~100 GB/s Access Granularity: 8192 Bytes 128 Bytes Wastes performance by not using most of fetched page Using 8 bytes in a 8192 Byte page uses 1/1024 of bandwidth!

9 Reconfigurable Hardware Acceleration Field Programmable Gate Array (FPGA) Program application-specific hardware High performance, Low power Reconfigurable to fit the application

10 Future of Reconfigurable Hardware Acceleration High Availability Amazon EC2 Intel Accelerated Xeon Accelerated SSD platforms Easy Programmability Bluespec Xilinx HLS Amazon F1 Shell Normal to do HW/SW codesign (Like with GPU computing)

11 Benefits of In-Storage Acceleration Acceleration Host Storage Lower latency, higher bandwidth to storage Reduce data movement cost Lower engineering cost Most data comes from storage anyways

12 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration

13 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

14 Large Graphs are Found Everywhere in Nature Human neural network Structure of the internet TB to 100s of TB in size! Social networks 1) Connectomics graph of the brain - Source: Connectomics and Graph Theory Jon Clayden, UCL 2) (Part of) the internet - Source: Criteo Engineering Blog 3) The Graph of a Social Network Source: Griff s Graphs

15 Various Models for Graph Analytics Vertex-Centric Pregel, GraphLab, TurboGraph, Mosaic, FlashGraph, GraphChi, Edge-Centric X-Stream, Linear Algebraic CombBLAS, GraphMat, Graphulo, Graph-Centric Giraph++, Frontier-Based Ligra, Gemini, Grunrock, Optimized Algorithms And more Galois,

16 Vertex-Centric Programming Model Popular model for efficient parallism and distribution Vertex program only sees neighbors Algorithm executed in terms of disjoint iterations Vertex program is executed on one or more Active Vertices Active Vertices 1 4 4

17 Algorithmic Representation of a Vertex Program Iteration for each v src in ActiveList do for each e(v src, v dst ) in G do ev = edge_program (v src. val, e. weight) v dst. next_val = vertex_update(v dst. next_val, ev) end for end for Random read-modify update into vertex data

18 Random Access Within an Active Vertex

19 Random Access Across Active Vertices!! Vertex data must be stored in DRAM? Data size and irregularity limit caching effectiveness

20 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Sort-Reduce Acceleration

21 General Problem of Irregular Array Updates For each Updating < idx, an arg array > xin xs: with a xstream idx = of f(x update idx requests, arg) xs and update function f X xs f Random Updates

22 Solution Part One - Sort Sort xs according to index X Sort Sorted xs xs Sequential Random Updates Much better than naïve random updates Terabyte graphs can generate terabyte logs Significant sorting overhead

23 Solution Part Two - Reduce Associative update function f can be interleaved with sort merge merge 1 7 e.g., (A + B) + C = A + (B + C) xs Reduced Overhead

24 Removing Random Access Using Sort-Reduce for each v src in ActiveList do for each e(v src, v dst ) in G do ev = edge_program (v src. val, e. weight) vlist. dst. append(v next_val = dst vertex_update(v, ev) dst. next_val, ev) end for end for v = SortReduce vertex_update (list) No more random access! Associativity requirement is not very restrictive (CombBLAS, PowerGraph, Mosaic, )

25 Normalized size of update stream Big Benefits from Interleaving Reduction Ratio of data copied at each sort phase Kron32 Kron32 WDC Kron30 WDC Twitter 90% Merge Iteration

26 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration

27 Accelerated Graph Analytics Architecture In-storage accelerator reduces data movement and cost Software Host (Server/PC/Laptop) FPGA Vertex Value Update Log (xs) Accelerator-Aware Flash Edge Management Program Multirate Aggregator Edge Weight Multirate 1GB DRAM 16-to-1 Merge-Sorter Sort-Reduce Accelerator Wire-Speed On-chip Sorter Flash Edge Data Vertex Data Partially Sort- Reduced Files Active Vertices

28 Evaluated Graph Analytic Systems In-memory Semi-External External GraphLab (IN) FlashGraph (SE1) X-Stream (SE2) GraphChi (EX) GraFSoft GraFBoost Distributed GraphLab: a framework for machine learning and data mining in the cloud, VLDB 2012 FlashGraph: Processing billion-node graphs on an array of commodity SSDs, FAST 2015 X-Stream: edge-centric graph processing using streaming partitions, SOSP 2013 GraphChi: Large-scale graph computation on just a PC, USENIX 2012

29 Evaluation Environment + 32-core Xeon 128 GB RAM 5x 0.5TB PCIe Flash $8000 All software experiments 4-core i5 4 GB RAM $400 + Virtex 7 FPGA 1TB custom flash 1GB on-board RAM $1000 $???

30 Evaluation Result Overview In-memory Semi-External External GraFBoost Large graphs: Fail Fail Slow Fast Medium graphs: Fail Fast Slow Fast Small graphs: Fast Fast Slow Fast GraFBoost has very low resource requirements Memory, CPU, Power

31 Normalized Performance Results with a Large Graph: Synthetic Scale 32 Kronecker Graph 0.5 TB in text, 4 Billion vertices GraphLab (IN) out of memory FlashGraph (SE1) out of memory GraphChi (EX) did not finish 1.7x PR BFS BC X-Stream (SE2) 2.8x 10x GraFBoost GraFSoft

32 Normalized Performance Results with a Large Graph: Web Data Commons Web Crawl 2 TB in text, 3 Billion vertices GraphLab (IN) out of memory GraphChi (EX) did not finish 4 Only GraFBoost succeeds in both graphs 3 2GraFBoost can run still larger graphs! 1 0 PR BFS BC SE1 SE2 GraFBoost GraFSoft

33 Normalized Performance Results with Smaller Graphs: Breadth-First Search 0.03 TB 0.04 Billion 0.09 TB 0.3 Billion 0.3 TB 1 Billion 5 4 Fastest! Slowest Slowest Slowest GraFSoft 0 twitter kron28 kron30 IN SE1 SE2 EX GraFBoost

34 Normalized Performance Results with a Medium Graph: Against an In-Memory Cluster Synthesized Kronecker Scale TB in text, 0.3 Billion vertices PR BFS 5xIN SE1 SE2 EX GraFBoost GraFSoft

35 GB Threads Watts GraFBoost Reduces Resource Requirements Conventional Conventional GraFBoost GraFBoost 16 2 External analytics Hardware Acceleration Conventional GraFBoost External Analytics + Hardware Acceleration

36 Evaluation Result Recap In-memory Semi-External External GraFBoost Large graphs: Fail Fail Slow Fast Medium graphs: Fail Fast Slow Fast Small graphs: Fast Fast Slow Fast GraFBoost has very low resource requirements Memory, CPU, Power

37 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

38 Key-Value Cache in the Data Center Application Server Application Server Key-Value GET/SET Requests Application Server User Requests KVS node KVS node MySQL Postgres SciDB Cache Layer Backend Storage Layer Transactional DB operations can become bottleneck Key-Value caches like memcached cache DB query results Facebook reported 300 TB of memcached capacity (2010)

39 1000 Requests\Second Miss Rate (%) Cached Application Performance Example Using the BG social network benchmark * MySQL backend and 16 GB memcached 20% % % 200 5% 0% 0 Application KVS Cache Throughput Miss Rate Number of Active Users (Millions) * Open-source project by Twitter, Inc. 39

40 Software KV Caches Are Fast, but Not Power-Efficient Achieving high throughput requires a lot of resources MICA Mega-KV 120 Million Requests Per Second 160 Million Requests Per Second 24 cores, 12 10GbE 2 GTX 780 GPUs 400 W 900 W Superscalar OoO pipeline is underutilized for KVS Only 3MB LLC is needed to sustain MICA s throughput

41 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration

42 BlueCache: Flash-based KVS Architecture App Server App Server App Server App Server KVS Node Dedicated Storage Network NIC Network stack Engines CPU Accelerators Cores Key-value data access DRAM KV-Index Cache NAND Flash KV-Data Store Store KV pairs in flash storage Pipelined KV cache accelerators Hardware accelerated dedicated storage network engines Log-structured KV data store

43 Evaluation Setup Frontend Backend BG social network benchmark server MySQL server 32-core Xeon server 64 GB DRAM 1.5 TB PCIe SSDs KV Cache * Open-source project by Twitter, Inc. memcached In-memory 48 Cores 16 GB DRAM FatCache* Flash-based 48 Cores 1 GB DRAM 0.5 TB PCIe Flash BlueCache Flash-based with acceleration 1 GB DRAM 0.5 TB PCIe Flash

44 1000 Requests\Second Miss Rate (%) Performance Evaluation with More Users 20.00% 500 KVS Cache Miss Rate Application Throughput 15.00% % (4.0M, 130KRPS) % 4.18X 0.00% 0 At much lower power! (40 W vs. 200 W) Number of of Active Users (Millions) memcached FatCache BlueCache

45 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform

46 Our Custom Flash Card for Distributed Accelerated Flash Architectures Requirement 1: Modify flash management Requirement 2: Dedicated storage-area network Requirement 3: In-storage hardware accelerator Artix 7 FPGA (Flash Controller) 0.5 TB Flash FMC Connection to FPGA board (40 Gbps) "minflash: A Minimalistic Clustered Flash Array, DATE 2016 Inter-Controller Network (10Gbps/lane 4)

47 BlueDBM Cluster Architecture Host Server (24-Core) FPGA (VC707) minflash minflash 1GB RAM Ethernet Host Server (24-Core) FPGA (VC707) minflash minflash 1 TB PCIe Gen2 x8 FMC 10Gbps network 8 Host Server (24-Core) FPGA (VC707) minflash minflash Uniform latency of 100 µs!

48 The BlueDBM Cluster BlueDBM Storage Device

49 Our Research Enabled by BlueDBM 1. Scalable Multi-Access Flash Store for Big Data Analytics, FPGA BlueDBM: An Appliance for Big Data Analytics, ISCA A Transport-Layer Network for Distributed FPGA Platforms, FPL Large-scale high-dimensional nearest neighbor search using Flash memory with in-store processing, ReConFig minflash: A Minimalistic Clustered Flash Array, DATE Application-managed flash, FAST In-Storage Embedded Accelerator for Sparse Pattern Processing, HPEC Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM BlueCache: A Scalable Distributed Flash-based Key-value Store, VLDB GraFBoost: Using accelerated flash storage for external graph analytics, ISCA NoHost: Software-defined Network-attached KV Drives, (Under Review)

50 Future Work Next generation of BlueDBM Newer SSDs are much faster than 2.4 GB/s Prototype flash chips are aging More applications using sort-reduce Bioinformatics collaboration with Barcelona Supercomputering Center More applications using accelerated flash SQL acceleration collaboration with Samsung 50

51 Thank you Flash-Based Analytics Capacity Algorithm Acceleration

52 A Baseline Hardware Merge Sorter 2-to-1 Sorter Sorter emits one item at every cycle Easy to become bottleneck 52

53 High-Throughput Hardware Sorter using Sorting Networks Sorter emits sorted tuple at every cycle Bitonic sorter Bitonic half-cleaner Bitonic sorter Merge-Sorts at constant 4 GB/s "Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM 2016

54 Hardware 16-to-1 Sorter Reduces Sorting Passes Constructed as a pipelined tree of 2-to-1 Sorters 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter "Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM 2016

55 Sort-Reduce Process 1TB Update Stream Sort-Reduced 512MB Blocks Sort-Reduced 8GB Blocks Sort-Reduced 128GB Blocks Fully Sort-Reduced In-memory Sort-Reducers (1 GB Memory) 16-way Sort-Reducer

56 A Baseline Hardware Reducer delay Vertex Update Reducer consumes up to one item at every cycle Vertex update latency impacts performance 56

57 Wire-Speed Reducer Using Sorting Networks Single Reducer 2-to-1 Sorter Multi Reducer Single Reducer Single Reducers achieve single element wire-speed Multi Reducer uses sorting networks to achieve multi-rate Reduces at constant 4 GB/s "Wire-Speed Accelerator for Aggregation Operations, (Under Review) 57

GraFBoost: Using accelerated flash storage for external graph analytics

GraFBoost: Using accelerated flash storage for external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind MIT CSAIL Funded by: 1 Large Graphs are Found Everywhere in Nature