Big Data Analytics Using Hardware-Accelerated Flash Storage
|
|
- Patience Lambert
- 5 years ago
- Views:
Transcription
1 Big Data Analytics Using Hardware-Accelerated Flash Storage Sang-Woo Jun University of California, Irvine (Work done while at MIT) Flash Memory Summit, 2018
2 A Big Data Application: Personalized Genome Normal Genome Tumor Genome Cancer Patient Next-Generation Sequencing Identified Mutations Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads, Moncunill V. & Gonzalez S., et al., 2014
3 Cluster System for Personalized Genome Complex Algorithm 16 Machines (2 TB DRAM) Terabytes of Data 6 Hours
4 A Cheaper Alternative Using Hardware-Accelerated SSD + Collaborating with
5 Success with Important Applications + VS Graph analytics with billions of vertices 10x Performance 1/3 Power consumption Key-value cache with millions of users Comparable performance 1/5 Power consumption GraFBoost: Using accelerated flash storage for external graph analytics, ISCA 2018 BlueCache: A Distributed Flash-based Key Value Store, VLDB 2017
6 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform
7 Storage for Analytics Fine-grained, Irregular access TB DRAM of DRAM Terabytes in size $$$ $8000/TB, 200W Our goal: $ $500/TB, 10W
8 Random Access Challenge in Flash Storage Flash DRAM Bandwidth: ~10 GB/s ~100 GB/s Access Granularity: 8192 Bytes 128 Bytes Wastes performance by not using most of fetched page Using 8 bytes in a 8192 Byte page uses 1/1024 of bandwidth!
9 Reconfigurable Hardware Acceleration Field Programmable Gate Array (FPGA) Program application-specific hardware High performance, Low power Reconfigurable to fit the application
10 Future of Reconfigurable Hardware Acceleration High Availability Amazon EC2 Intel Accelerated Xeon Accelerated SSD platforms Easy Programmability Bluespec Xilinx HLS Amazon F1 Shell Normal to do HW/SW codesign (Like with GPU computing)
11 Benefits of In-Storage Acceleration Acceleration Host Storage Lower latency, higher bandwidth to storage Reduce data movement cost Lower engineering cost Most data comes from storage anyways
12 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration
13 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform
14 Large Graphs are Found Everywhere in Nature Human neural network Structure of the internet TB to 100s of TB in size! Social networks 1) Connectomics graph of the brain - Source: Connectomics and Graph Theory Jon Clayden, UCL 2) (Part of) the internet - Source: Criteo Engineering Blog 3) The Graph of a Social Network Source: Griff s Graphs
15 Various Models for Graph Analytics Vertex-Centric Pregel, GraphLab, TurboGraph, Mosaic, FlashGraph, GraphChi, Edge-Centric X-Stream, Linear Algebraic CombBLAS, GraphMat, Graphulo, Graph-Centric Giraph++, Frontier-Based Ligra, Gemini, Grunrock, Optimized Algorithms And more Galois,
16 Vertex-Centric Programming Model Popular model for efficient parallism and distribution Vertex program only sees neighbors Algorithm executed in terms of disjoint iterations Vertex program is executed on one or more Active Vertices Active Vertices 1 4 4
17 Algorithmic Representation of a Vertex Program Iteration for each v src in ActiveList do for each e(v src, v dst ) in G do ev = edge_program (v src. val, e. weight) v dst. next_val = vertex_update(v dst. next_val, ev) end for end for Random read-modify update into vertex data
18 Random Access Within an Active Vertex
19 Random Access Across Active Vertices!! Vertex data must be stored in DRAM? Data size and irregularity limit caching effectiveness
20 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Sort-Reduce Acceleration
21 General Problem of Irregular Array Updates For each Updating < idx, an arg array > xin xs: with a xstream idx = of f(x update idx requests, arg) xs and update function f X xs f Random Updates
22 Solution Part One - Sort Sort xs according to index X Sort Sorted xs xs Sequential Random Updates Much better than naïve random updates Terabyte graphs can generate terabyte logs Significant sorting overhead
23 Solution Part Two - Reduce Associative update function f can be interleaved with sort merge merge 1 7 e.g., (A + B) + C = A + (B + C) xs Reduced Overhead
24 Removing Random Access Using Sort-Reduce for each v src in ActiveList do for each e(v src, v dst ) in G do ev = edge_program (v src. val, e. weight) vlist. dst. append(v next_val = dst vertex_update(v, ev) dst. next_val, ev) end for end for v = SortReduce vertex_update (list) No more random access! Associativity requirement is not very restrictive (CombBLAS, PowerGraph, Mosaic, )
25 Normalized size of update stream Big Benefits from Interleaving Reduction Ratio of data copied at each sort phase Kron32 Kron32 WDC Kron30 WDC Twitter 90% Merge Iteration
26 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration
27 Accelerated Graph Analytics Architecture In-storage accelerator reduces data movement and cost Software Host (Server/PC/Laptop) FPGA Vertex Value Update Log (xs) Accelerator-Aware Flash Edge Management Program Multirate Aggregator Edge Weight Multirate 1GB DRAM 16-to-1 Merge-Sorter Sort-Reduce Accelerator Wire-Speed On-chip Sorter Flash Edge Data Vertex Data Partially Sort- Reduced Files Active Vertices
28 Evaluated Graph Analytic Systems In-memory Semi-External External GraphLab (IN) FlashGraph (SE1) X-Stream (SE2) GraphChi (EX) GraFSoft GraFBoost Distributed GraphLab: a framework for machine learning and data mining in the cloud, VLDB 2012 FlashGraph: Processing billion-node graphs on an array of commodity SSDs, FAST 2015 X-Stream: edge-centric graph processing using streaming partitions, SOSP 2013 GraphChi: Large-scale graph computation on just a PC, USENIX 2012
29 Evaluation Environment + 32-core Xeon 128 GB RAM 5x 0.5TB PCIe Flash $8000 All software experiments 4-core i5 4 GB RAM $400 + Virtex 7 FPGA 1TB custom flash 1GB on-board RAM $1000 $???
30 Evaluation Result Overview In-memory Semi-External External GraFBoost Large graphs: Fail Fail Slow Fast Medium graphs: Fail Fast Slow Fast Small graphs: Fast Fast Slow Fast GraFBoost has very low resource requirements Memory, CPU, Power
31 Normalized Performance Results with a Large Graph: Synthetic Scale 32 Kronecker Graph 0.5 TB in text, 4 Billion vertices GraphLab (IN) out of memory FlashGraph (SE1) out of memory GraphChi (EX) did not finish 1.7x PR BFS BC X-Stream (SE2) 2.8x 10x GraFBoost GraFSoft
32 Normalized Performance Results with a Large Graph: Web Data Commons Web Crawl 2 TB in text, 3 Billion vertices GraphLab (IN) out of memory GraphChi (EX) did not finish 4 Only GraFBoost succeeds in both graphs 3 2GraFBoost can run still larger graphs! 1 0 PR BFS BC SE1 SE2 GraFBoost GraFSoft
33 Normalized Performance Results with Smaller Graphs: Breadth-First Search 0.03 TB 0.04 Billion 0.09 TB 0.3 Billion 0.3 TB 1 Billion 5 4 Fastest! Slowest Slowest Slowest GraFSoft 0 twitter kron28 kron30 IN SE1 SE2 EX GraFBoost
34 Normalized Performance Results with a Medium Graph: Against an In-Memory Cluster Synthesized Kronecker Scale TB in text, 0.3 Billion vertices PR BFS 5xIN SE1 SE2 EX GraFBoost GraFSoft
35 GB Threads Watts GraFBoost Reduces Resource Requirements Conventional Conventional GraFBoost GraFBoost 16 2 External analytics Hardware Acceleration Conventional GraFBoost External Analytics + Hardware Acceleration
36 Evaluation Result Recap In-memory Semi-External External GraFBoost Large graphs: Fail Fail Slow Fast Medium graphs: Fail Fast Slow Fast Small graphs: Fast Fast Slow Fast GraFBoost has very low resource requirements Memory, CPU, Power
37 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform
38 Key-Value Cache in the Data Center Application Server Application Server Key-Value GET/SET Requests Application Server User Requests KVS node KVS node MySQL Postgres SciDB Cache Layer Backend Storage Layer Transactional DB operations can become bottleneck Key-Value caches like memcached cache DB query results Facebook reported 300 TB of memcached capacity (2010)
39 1000 Requests\Second Miss Rate (%) Cached Application Performance Example Using the BG social network benchmark * MySQL backend and 16 GB memcached 20% % % 200 5% 0% 0 Application KVS Cache Throughput Miss Rate Number of Active Users (Millions) * Open-source project by Twitter, Inc. 39
40 Software KV Caches Are Fast, but Not Power-Efficient Achieving high throughput requires a lot of resources MICA Mega-KV 120 Million Requests Per Second 160 Million Requests Per Second 24 cores, 12 10GbE 2 GTX 780 GPUs 400 W 900 W Superscalar OoO pipeline is underutilized for KVS Only 3MB LLC is needed to sustain MICA s throughput
41 The Three Pillars of Competitive Flash-Based Analytics High Performance, Low Cost Capacity Algorithm Acceleration
42 BlueCache: Flash-based KVS Architecture App Server App Server App Server App Server KVS Node Dedicated Storage Network NIC Network stack Engines CPU Accelerators Cores Key-value data access DRAM KV-Index Cache NAND Flash KV-Data Store Store KV pairs in flash storage Pipelined KV cache accelerators Hardware accelerated dedicated storage network engines Log-structured KV data store
43 Evaluation Setup Frontend Backend BG social network benchmark server MySQL server 32-core Xeon server 64 GB DRAM 1.5 TB PCIe SSDs KV Cache * Open-source project by Twitter, Inc. memcached In-memory 48 Cores 16 GB DRAM FatCache* Flash-based 48 Cores 1 GB DRAM 0.5 TB PCIe Flash BlueCache Flash-based with acceleration 1 GB DRAM 0.5 TB PCIe Flash
44 1000 Requests\Second Miss Rate (%) Performance Evaluation with More Users 20.00% 500 KVS Cache Miss Rate Application Throughput 15.00% % (4.0M, 130KRPS) % 4.18X 0.00% 0 At much lower power! (40 W vs. 200 W) Number of of Active Users (Millions) memcached FatCache BlueCache
45 Contents Introduction Flash Storage and Hardware Acceleration Example Applications Graph Analytics Platform Key-Value Cache BlueDBM Architecture Exploration Platform
46 Our Custom Flash Card for Distributed Accelerated Flash Architectures Requirement 1: Modify flash management Requirement 2: Dedicated storage-area network Requirement 3: In-storage hardware accelerator Artix 7 FPGA (Flash Controller) 0.5 TB Flash FMC Connection to FPGA board (40 Gbps) "minflash: A Minimalistic Clustered Flash Array, DATE 2016 Inter-Controller Network (10Gbps/lane 4)
47 BlueDBM Cluster Architecture Host Server (24-Core) FPGA (VC707) minflash minflash 1GB RAM Ethernet Host Server (24-Core) FPGA (VC707) minflash minflash 1 TB PCIe Gen2 x8 FMC 10Gbps network 8 Host Server (24-Core) FPGA (VC707) minflash minflash Uniform latency of 100 µs!
48 The BlueDBM Cluster BlueDBM Storage Device
49 Our Research Enabled by BlueDBM 1. Scalable Multi-Access Flash Store for Big Data Analytics, FPGA BlueDBM: An Appliance for Big Data Analytics, ISCA A Transport-Layer Network for Distributed FPGA Platforms, FPL Large-scale high-dimensional nearest neighbor search using Flash memory with in-store processing, ReConFig minflash: A Minimalistic Clustered Flash Array, DATE Application-managed flash, FAST In-Storage Embedded Accelerator for Sparse Pattern Processing, HPEC Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM BlueCache: A Scalable Distributed Flash-based Key-value Store, VLDB GraFBoost: Using accelerated flash storage for external graph analytics, ISCA NoHost: Software-defined Network-attached KV Drives, (Under Review)
50 Future Work Next generation of BlueDBM Newer SSDs are much faster than 2.4 GB/s Prototype flash chips are aging More applications using sort-reduce Bioinformatics collaboration with Barcelona Supercomputering Center More applications using accelerated flash SQL acceleration collaboration with Samsung 50
51 Thank you Flash-Based Analytics Capacity Algorithm Acceleration
52 A Baseline Hardware Merge Sorter 2-to-1 Sorter Sorter emits one item at every cycle Easy to become bottleneck 52
53 High-Throughput Hardware Sorter using Sorting Networks Sorter emits sorted tuple at every cycle Bitonic sorter Bitonic half-cleaner Bitonic sorter Merge-Sorts at constant 4 GB/s "Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM 2016
54 Hardware 16-to-1 Sorter Reduces Sorting Passes Constructed as a pipelined tree of 2-to-1 Sorters 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter 2-to-1 Sorter "Terabyte Sort on FPGA-Accelerated Flash Storage, FCCM 2016
55 Sort-Reduce Process 1TB Update Stream Sort-Reduced 512MB Blocks Sort-Reduced 8GB Blocks Sort-Reduced 128GB Blocks Fully Sort-Reduced In-memory Sort-Reducers (1 GB Memory) 16-way Sort-Reducer
56 A Baseline Hardware Reducer delay Vertex Update Reducer consumes up to one item at every cycle Vertex update latency impacts performance 56
57 Wire-Speed Reducer Using Sorting Networks Single Reducer 2-to-1 Sorter Multi Reducer Single Reducer Single Reducers achieve single element wire-speed Multi Reducer uses sorting networks to achieve multi-rate Reduces at constant 4 GB/s "Wire-Speed Accelerator for Aggregation Operations, (Under Review) 57
GraFBoost: Using accelerated flash storage for external graph analytics
GraFBoost: Using accelerated flash storage for external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind MIT CSAIL Funded by: 1 Large Graphs are Found Everywhere in Nature
More informationBlueDBM: An Appliance for Big Data Analytics*
BlueDBM: An Appliance for Big Data Analytics* Arvind *[ISCA, 2015] Sang-Woo Jun, Ming Liu, Sungjin Lee, Shuotao Xu, Arvind (MIT) and Jamey Hicks, John Ankcorn, Myron King(Quanta) BigData@CSAIL Annual Meeting
More informationGraFBoost: Using accelerated flash storage for external graph analytics
: Using accelerated flash storage for external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind Department of Electrical Engineering and Computer Science Massachusetts Institute
More informationHADP Talk BlueDBM: An appliance for Big Data Analytics
HADP Talk BlueDBM: An appliance for Big Data Analytics Sang-Woo Jun* Ming Liu* Sungjin Lee* Jamey Hicks+ John Ankcorn+ Myron King+ Shuotao Xu* Arvind* *MIT Computer Science and Artificial Intelligence
More informationFPGP: Graph Processing Framework on FPGA
FPGP: Graph Processing Framework on FPGA Guohao DAI, Yuze CHI, Yu WANG, Huazhong YANG E.E. Dept., TNLIST, Tsinghua University dgh14@mails.tsinghua.edu.cn 1 Big graph is widely used Big graph is widely
More informationGridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University
GridGraph: Large-Scale Graph Processing on a Single Machine Using -Level Hierarchical Partitioning Xiaowei ZHU Tsinghua University Widely-Used Graph Processing Shared memory Single-node & in-memory Ligra,
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationNUMA-aware Graph-structured Analytics
NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China Big Data Everywhere 00 Million Tweets/day 1.11
More informationarxiv: v1 [cs.db] 21 Oct 2017
: High-performance external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology
More informationOptimizing Cache Performance for Graph Analytics. Yunming Zhang Presentation
Optimizing Cache Performance for Graph Analytics Yunming Zhang 6.886 Presentation Goals How to optimize in-memory graph applications How to go about performance engineering just about anything In-memory
More informationFlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs
FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs Da Zheng, Disa Mhembere, Randal Burns Department of Computer Science Johns Hopkins University Carey E. Priebe Department of Applied
More informationBe Fast, Cheap and in Control with SwitchKV. Xiaozhou Li
Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Goal: fast and cost-efficient key-value store Store, retrieve, manage key-value objects Get(key)/Put(key,value)/Delete(key) Target: cluster-level
More informationDatabase Acceleration Solution Using FPGAs and Integrated Flash Storage
Database Acceleration Solution Using FPGAs and Integrated Flash Storage HK Verma, Xilinx Inc. August 2017 1 FPGA Analytics in Flash Storage System In-memory or Flash storage based DB reduce disk access
More informationLegUp: Accelerating Memcached on Cloud FPGAs
0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are
More informationSoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research
SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research 1 The world s most valuable resource Data is everywhere! May. 2017 Values from Data! Need infrastructures for
More informationTanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia
How well do CPU, GPU and Hybrid Graph Processing Frameworks Perform? Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Networked Systems
More informationCascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching
Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching Kefei Wang and Feng Chen Louisiana State University SoCC '18 Carlsbad, CA Key-value Systems in Internet Services Key-value
More informationBlueDBM: An Appliance for Big Data Analytics
BlueDBM: An Appliance for Big Data Analytics Sang-Woo Jun Ming Liu Sungjin Lee Jamey Hicks John Ankcorn Myron King Shuotao Xu Arvind Department of Electrical Engineering and Computer Science Massachusetts
More informationSociaLite: A Datalog-based Language for
SociaLite: A Datalog-based Language for Large-Scale Graph Analysis Jiwon Seo M OBIS OCIAL RESEARCH GROUP Overview Overview! SociaLite: language for large-scale graph analysis! Extensions to Datalog! Compiler
More informationA Framework for Processing Large Graphs in Shared Memory
A Framework for Processing Large Graphs in Shared Memory Julian Shun Based on joint work with Guy Blelloch and Laxman Dhulipala (Work done at Carnegie Mellon University) 2 What are graphs? Vertex Edge
More informationJure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah
Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks
More informationHigh-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock
High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock Yangzihao Wang University of California, Davis yzhwang@ucdavis.edu March 24, 2014 Yangzihao Wang (yzhwang@ucdavis.edu)
More informationPersistent Memory. High Speed and Low Latency. White Paper M-WP006
Persistent Memory High Speed and Low Latency White Paper M-WP6 Corporate Headquarters: 3987 Eureka Dr., Newark, CA 9456, USA Tel: (51) 623-1231 Fax: (51) 623-1434 E-mail: info@smartm.com Customer Service:
More informationJulienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing
Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Laxman Dhulipala Joint work with Guy Blelloch and Julian Shun SPAA 07 Giant graph datasets Graph V E (symmetrized) com-orkut
More informationBig Data Systems on Future Hardware. Bingsheng He NUS Computing
Big Data Systems on Future Hardware Bingsheng He NUS Computing http://www.comp.nus.edu.sg/~hebs/ 1 Outline Challenges for Big Data Systems Why Hardware Matters? Open Challenges Summary 2 3 ANYs in Big
More informationBe Fast, Cheap and in Control with SwitchKV Xiaozhou Li
Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Raghav Sethi Michael Kaminsky David G. Andersen Michael J. Freedman Goal: fast and cost-effective key-value store Target: cluster-level storage for
More informationColin Cunningham, Intel Kumaran Siva, Intel Sandeep Mahajan, Oracle 03-Oct :45 p.m. - 5:30 p.m. Moscone West - Room 3020
Colin Cunningham, Intel Kumaran Siva, Intel Sandeep Mahajan, Oracle 03-Oct-2017 4:45 p.m. - 5:30 p.m. Moscone West - Room 3020 Big Data Talk Exploring New SSD Usage Models to Accelerate Cloud Performance
More informationCaribou: Intelligent Distributed Storage
: Intelligent Distributed Storage Zsolt István, David Sidler, Gustavo Alonso Systems Group, Department of Computer Science, ETH Zurich 1 Rack-scale thinking In the Cloud ToR Switch Compute Compute + Provisioning
More informationDCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture
DCS-ctrl: A Fast and Flexible ice-control Mechanism for ice-centric Server Architecture Dongup Kwon 1, Jaehyung Ahn 2, Dongju Chae 2, Mohammadamin Ajdari 2, Jaewon Lee 1, Suheon Bae 1, Youngsok Kim 1,
More informationHigh Performance Packet Processing with FlexNIC
High Performance Packet Processing with FlexNIC Antoine Kaufmann, Naveen Kr. Sharma Thomas Anderson, Arvind Krishnamurthy University of Washington Simon Peter The University of Texas at Austin Ethernet
More informationNOHOST: A New Storage Architecture for Distributed Storage Systems. Chanwoo Chung
NOHOST: A New Storage Architecture for Distributed Storage Systems by Chanwoo Chung B.S., Seoul National University (2014) Submitted to the Department of Electrical Engineering and Computer Science in
More informationAccelerating Real-Time Big Data. Breaking the limitations of captive NVMe storage
Accelerating Real-Time Big Data Breaking the limitations of captive NVMe storage 18M IOPs in 2u Agenda Everything related to storage is changing! The 3rd Platform NVM Express architected for solid state
More informationForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture
ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture Guohao Dai 1, Tianhao Huang 1, Yuze Chi 2, Ningyi Xu 3, Yu Wang 1, Huazhong Yang 1 1 Tsinghua University, 2 UCLA, 3 MSRA dgh14@mails.tsinghua.edu.cn
More informationReconfigurable hardware for big data. Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland
Reconfigurable hardware for big data Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland www.systems.ethz.ch Systems Group 7 faculty ~40 PhD ~8 postdocs Researching all
More informationData Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research
Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO FS Albis Streaming
More informationScaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX
Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Inventing Internet TV Available in more than 190 countries 104+ million subscribers Lots of Streaming == Lots of Traffic
More informationMosaic: Processing a Trillion-Edge Graph on a Single Machine
Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys
More informationMoneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010
Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed
More informationFlexible Architecture Research Machine (FARM)
Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense
More informationNavigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets
Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming
More informationAn In-Kernel NOSQL Cache for Range Queries Using FPGA NIC
An In-Kernel NOSQL Cache for Range Queries Using FPGA NIC Korechika Tamura Dept. of ICS, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Japan Email: tamura@arc.ics.keio.ac.jp Hiroki Matsutani Dept.
More informationExploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center
Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center 3/17/2015 2014 IBM Corporation Outline IBM OpenPower Platform Accelerating
More informationNetwork Requirements for Resource Disaggregation
Network Requirements for Resource Disaggregation Peter Gao (Berkeley), Akshay Narayan (MIT), Sagar Karandikar (Berkeley), Joao Carreira (Berkeley), Sangjin Han (Berkeley), Rachit Agarwal (Cornell), Sylvia
More informationAn FPGA-based In-line Accelerator for Memcached
An FPGA-based In-line Accelerator for Memcached MAYSAM LAVASANI, HARI ANGEPAT, AND DEREK CHIOU THE UNIVERSITY OF TEXAS AT AUSTIN 1 Challenges for Server Processors Workload changes Social networking Cloud
More informationCS 5220: Parallel Graph Algorithms. David Bindel
CS 5220: Parallel Graph Algorithms David Bindel 2017-11-14 1 Graphs Mathematically: G = (V, E) where E V V Convention: V = n and E = m May be directed or undirected May have weights w V : V R or w E :
More informationCloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018
Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster
More informationData Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research
Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO Albis Pocket
More informationDell PowerEdge R730xd Servers with Samsung SM1715 NVMe Drives Powers the Aerospike Fraud Prevention Benchmark
Dell PowerEdge R730xd Servers with Samsung SM1715 NVMe Drives Powers the Aerospike Fraud Prevention Benchmark Testing validation report prepared under contract with Dell Introduction As innovation drives
More informationHewlett Packard Enterprise HPE GEN10 PERSISTENT MEMORY PERFORMANCE THROUGH PERSISTENCE
Hewlett Packard Enterprise HPE GEN10 PERSISTENT MEMORY PERFORMANCE THROUGH PERSISTENCE Digital transformation is taking place in businesses of all sizes Big Data and Analytics Mobility Internet of Things
More informationService Oriented Performance Analysis
Service Oriented Performance Analysis Da Qi Ren and Masood Mortazavi US R&D Center Santa Clara, CA, USA www.huawei.com Performance Model for Service in Data Center and Cloud 1. Service Oriented (end to
More informationEnabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center
Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center Naif Tarafdar, Thomas Lin, Eric Fukuda, Hadi Bannazadeh, Alberto Leon-Garcia, Paul Chow University of Toronto 1 Cloudy with
More informationSamsung Z-SSD SZ985. Ultra-low Latency SSD for Enterprise and Data Centers. Brochure
Samsung Z-SSD SZ985 Ultra-low Latency SSD for Enterprise and Data Centers Brochure 1 A high-speed storage device from the SSD technology leader Samsung Z-SSD SZ985 offers more capacity than PRAM-based
More informationADVANCED IN-MEMORY COMPUTING USING SUPERMICRO MEMX SOLUTION
TABLE OF CONTENTS 2 WHAT IS IN-MEMORY COMPUTING (IMC) Benefits of IMC Concerns with In-Memory Processing Advanced In-Memory Computing using Supermicro MemX 1 3 MEMX ARCHITECTURE MemX Functionality and
More informationGraphQ: Graph Query Processing with Abstraction Refinement -- Scalable and Programmable Analytics over Very Large Graphs on a Single PC
GraphQ: Graph Query Processing with Abstraction Refinement -- Scalable and Programmable Analytics over Very Large Graphs on a Single PC Kai Wang, Guoqing Xu, University of California, Irvine Zhendong Su,
More informationToward a Memory-centric Architecture
Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationChapter 2 Parallel Hardware
Chapter 2 Parallel Hardware Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationAccelerating Data Science. Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland
Accelerating Data Science Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland Data processing today: Appliances (large machines) Data Centers (many machines) Databases are
More informationSAP HANA. Jake Klein/ SVP SAP HANA June, 2013
SAP HANA Jake Klein/ SVP SAP HANA June, 2013 SAP 3 YEARS AGO Middleware BI / Analytics Core ERP + Suite 2013 WHERE ARE WE NOW? Cloud Mobile Applications SAP HANA Analytics D&T Changed Reality Disruptive
More informationMultipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs
Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs Haicheng Wu 1, Daniel Zinn 2, Molham Aref 2, Sudhakar Yalamanchili 1 1. Georgia Institute of Technology 2. LogicBlox
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationReport. X-Stream Edge-centric Graph processing
Report X-Stream Edge-centric Graph processing Yassin Hassan hassany@student.ethz.ch Abstract. X-Stream is an edge-centric graph processing system, which provides an API for scatter gather algorithms. The
More informationMemory-Based Cloud Architectures
Memory-Based Cloud Architectures ( Or: Technical Challenges for OnDemand Business Software) Jan Schaffner Enterprise Platform and Integration Concepts Group Example: Enterprise Benchmarking -) *%'+,#$)
More informationRhythm: Harnessing Data Parallel Hardware for Server Workloads
Rhythm: Harnessing Data Parallel Hardware for Server Workloads Sandeep R. Agrawal $ Valentin Pistol $ Jun Pang $ John Tran # David Tarjan # Alvin R. Lebeck $ $ Duke CS # NVIDIA Explosive Internet Growth
More informationPSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 4 th, 2014 NEC Corporation
PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 4 th, 2014 NEC Corporation 1. Overview of NEC PCIe SSD Appliance for Microsoft SQL Server Page 2 NEC Corporation
More informationMaximizing heterogeneous system performance with ARM interconnect and CCIX
Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable
More informationBuilding NVLink for Developers
Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized
More informationGraph Data Management
Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of
More informationHPE SimpliVity. The new powerhouse in hyperconvergence. Boštjan Dolinar HPE. Maribor Lancom
HPE SimpliVity The new powerhouse in hyperconvergence Boštjan Dolinar HPE Maribor Lancom 2.2.2018 Changing requirements drive the need for Hybrid IT Application explosion Hybrid growth 2014 5,500 2015
More informationExploring the Hidden Dimension in Graph Processing
Exploring the Hidden Dimension in Graph Processing Mingxing Zhang, Yongwei Wu, Kang Chen, *Xuehai Qian, Xue Li, and Weimin Zheng Tsinghua University *University of Shouthern California Graph is Ubiquitous
More informationData Processing on Emerging Hardware
Data Processing on Emerging Hardware Gustavo Alonso Systems Group Department of Computer Science ETH Zurich, Switzerland 3 rd International Summer School on Big Data, Munich, Germany, 2017 www.systems.ethz.ch
More informationRACKSPACE ONMETAL I/O V2 OUTPERFORMS AMAZON EC2 BY UP TO 2X IN BENCHMARK TESTING
RACKSPACE ONMETAL I/O V2 OUTPERFORMS AMAZON EC2 BY UP TO 2X IN BENCHMARK TESTING EXECUTIVE SUMMARY Today, businesses are increasingly turning to cloud services for rapid deployment of apps and services.
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationMaximizing Fraud Prevention Through Disruptive Architectures Delivering speed at scale.
Maximizing Fraud Prevention Through Disruptive Architectures Delivering speed at scale. January 2016 Credit Card Fraud prevention is among the most time-sensitive and high-value of IT tasks. The databases
More informationFAWN. A Fast Array of Wimpy Nodes. David Andersen, Jason Franklin, Michael Kaminsky*, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan
FAWN A Fast Array of Wimpy Nodes David Andersen, Jason Franklin, Michael Kaminsky*, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan Carnegie Mellon University *Intel Labs Pittsburgh Energy in computing
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationParallel graph traversal for FPGA
LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,
More informationImproving Packet Processing Performance of a Memory- Bounded Application
Improving Packet Processing Performance of a Memory- Bounded Application Jörn Schumacher CERN / University of Paderborn, Germany jorn.schumacher@cern.ch On behalf of the ATLAS FELIX Developer Team LHCb
More informationEvaluation of the Chelsio T580-CR iscsi Offload adapter
October 2016 Evaluation of the Chelsio T580-CR iscsi iscsi Offload makes a difference Executive Summary As application processing demands increase and the amount of data continues to grow, getting this
More informationPebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees
PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees Pandian Raju 1, Rohan Kadekodi 1, Vijay Chidambaram 1,2, Ittai Abraham 2 1 The University of Texas at Austin 2 VMware Research
More informationTerabyte Sort on FPGA-Accelerated Flash Storage
2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines Terabyte on FPGA-Accelerated Flash Storage Sang-Woo Jun, Shuotao Xu, Arvind Department of Electrical Engineering
More informationScalable Distributed Training with Parameter Hub: a whirlwind tour
Scalable Distributed Training with Parameter Hub: a whirlwind tour TVM Stack Optimization High-Level Differentiable IR Tensor Expression IR AutoTVM LLVM, CUDA, Metal VTA AutoVTA Edge FPGA Cloud FPGA ASIC
More informationFast packet processing in the cloud. Dániel Géhberger Ericsson Research
Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration
More informationBoosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search
Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid
More informationCatapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud
Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud Doug Burger Director, Hardware, Devices, & Experiences MSR NExT November 15, 2015 The Cloud is a Growing Disruptor for HPC Moore s
More informationFlashmatrix Technology
matrix Technology All-flash Super-Converged Platform By Ram Johri Memory Summit 2017 Santa Clara, CA 1 Traditional Von Neumann vs. Data Centric Architecture Memory Shared Memory Pool Memory matrix: Data
More informationCamdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa
Camdoop Exploiting In-network Aggregation for Big Data Applications costa@imperial.ac.uk joint work with Austin Donnelly, Antony Rowstron, and Greg O Shea (MSR Cambridge) MapReduce Overview Input file
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationNew Approach to Unstructured Data
Innovations in All-Flash Storage Deliver a New Approach to Unstructured Data Table of Contents Developing a new approach to unstructured data...2 Designing a new storage architecture...2 Understanding
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
More informationComputational Storage: Acceleration Through Intelligence & Agility
Flash Memory Summit Computational Storage: Acceleration Through Intelligence & Agility Dr. Hao Zhong CEO & Co-Founder, ScaleFlux Flash Memory Summit 2018 Santa Clara, CA What s the Big Deal? High Cost
More information4 Myths about in-memory databases busted
4 Myths about in-memory databases busted Yiftach Shoolman Co-Founder & CTO @ Redis Labs @yiftachsh, @redislabsinc Background - Redis Created by Salvatore Sanfilippo (@antirez) OSS, in-memory NoSQL k/v
More informationWorkload Optimized Systems: The Wheel of Reincarnation. Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013
Workload Optimized Systems: The Wheel of Reincarnation Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013 Outline Definition Technology Minicomputers Prime Workstations Apollo Graphics
More informationGPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP
GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem
More informationBDCC: Exploiting Fine-Grained Persistent Memories for OLAP. Peter Boncz
BDCC: Exploiting Fine-Grained Persistent Memories for OLAP Peter Boncz NVRAM System integration: NVMe: block devices on the PCIe bus NVDIMM: persistent RAM, byte-level access Low latency Lower than Flash,
More informationLatency-Tolerant Software Distributed Shared Memory
Latency-Tolerant Software Distributed Shared Memory Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin University of Washington USENIX ATC 2015 July 9, 2015 25
More informationBuilding the Most Efficient Machine Learning System
Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide
More informationAutomatic Scaling Iterative Computations. Aug. 7 th, 2012
Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics
More informationOncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries
Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big
More informationHardware NVMe implementation on cache and storage systems
Hardware NVMe implementation on cache and storage systems Jerome Gaysse, IP-Maker Santa Clara, CA 1 Agenda Hardware architecture NVMe for storage NVMe for cache/application accelerator NVMe for new NVM
More information