StreamBox: Modern Stream Processing on a Multicore Machine
1 StreamBox: Modern Stream Processing on a Multicore Machine Hongyu Miao and Heejin Park, Purdue ECE; Myeongjae Jeon and Gennady Pekhimenko, Microsoft Research; Kathryn S. McKinley, Google; Felix Xiaozhu Lin, Purdue ECE
2 Data sources: IoT, data centers, and humans. The high velocity of streaming data requires real-time processing.
3-7 Streaming pipeline: an infinite data stream flows from the input through a chain of transforms (Transform 0, Transform 1, Transform 2) to the output.
8-9 Why is it hard? Records arrive out of order, and high performance on multicore requires data parallelism, pipeline parallelism, and memory locality. (The target hardware is a multi-socket Intel Xeon E v4 machine: many cores with per-core caches, a shared 35MB L3, and four NUMA nodes.)
10-12 Prior work: out-of-order processing within epochs. Each transform processes only one epoch at a time, so epochs (delimited by watermarks such as 5:00, 10:00, 15:00, 20:00) advance through the pipeline in lockstep, one transform per step.
13 StreamBox insight: out-of-order processing across epochs. Process all epochs in all transforms in parallel, so different epochs are in flight in different transforms at once.
14 Prior work vs. StreamBox: prior work processes only one epoch in each transform at a time; StreamBox processes all epochs in all transforms in parallel. StreamBox is a highly pipeline- and data-parallel processing system.
15 Result: StreamBox vs. existing systems on multicore. StreamBox achieves high throughput and high utilization of multicore hardware (throughput in KRec/s vs. number of cores, compared against Spark Streaming and Beam).
16 Roadmap Background Stream pipeline, streaming data, window, watermark, and epoch StreamBox Design Invariants to guarantee correctness Out-of-order epoch processing Evaluation 16
17 Streaming pipeline for data analytics. Transform: a computation that consumes and produces streams. Pipeline: a dataflow graph of transforms. Example, a simple WordCount pipeline: Ingress, Transform 1 (group by word), Transform 2 (count word occurrences), Egress.
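The WordCount pipeline above can be sketched as a chain of transforms. This is a minimal single-threaded Python sketch for illustration only; the function names (`ingress`, `count_words`) are hypothetical and not StreamBox's API.

```python
from collections import Counter

def ingress(lines):
    """Ingress: split each incoming line into word records."""
    for line in lines:
        for word in line.split():
            yield word

def count_words(words):
    """GroupBy word + count occurrences (Transforms 1 and 2 fused)."""
    counts = Counter()
    for w in words:
        counts[w] += 1
    return counts

# Usage: feed one small batch of the (conceptually infinite) stream.
batch = ["to be or not to be"]
counts = count_words(ingress(batch))
```

In a real streaming engine each transform runs continuously and in parallel; here the batch boundary stands in for a window of the infinite stream.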
18 Stream records = data + event time. Records arrive out of order because they travel diverse network paths and computations execute at different rates; for example, records stamped 0:03, 0:05, 0:02 reach the processing system out of order.
19-26 Window: a temporal processing scope of records, chopping the infinite stream into finite pieces along temporal boundaries; transforms do their computation per window. Records are assigned to windows by event time: records stamped 1:02, 1:03, 1:08, 1:09, 1:11, 1:13 fall into the 5-minute windows starting at 1:00, 1:05, and 1:10.
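Assigning a record to its tumbling window is a simple event-time computation. A minimal sketch, assuming times are expressed in minutes and the 5-minute windows from the slides; `window_start` is an illustrative name, not StreamBox's API:

```python
def window_start(event_minute, width=5):
    """Tumbling windows: a record with event time t belongs to the
    window [k*width, (k+1)*width) that contains t."""
    return (event_minute // width) * width

# 1:08 (minute 68) falls in window [1:05, 1:10);
# 1:03 (minute 63) falls in window [1:00, 1:05).
assert window_start(68) == 65
assert window_start(63) == 60
```

Note that the assignment depends only on the record's event time, never on its arrival order, which is what makes out-of-order arrival tolerable.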
27 Out-of-order records: a record such as 1:08 may arrive after records stamped later (1:09, 1:11, 1:13), yet still belongs to an earlier window.
28 When is a window complete? With out-of-order arrival, the system cannot tell from the records alone whether a window (e.g., [1:05, 1:10)) will still receive late records.
29 Watermark: input completeness indicated by the data source. A watermark X means that all input data with event times less than X have arrived.
30-34 Handling out-of-order records with watermarks: records are buffered into their event-time windows as they arrive; when watermark 1:05 and then watermark 1:10 arrive, the windows ending at or before each watermark are known to be complete and can be emitted.
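The buffering-plus-watermark mechanism above can be sketched in a few lines. This is an illustrative Python sketch, not StreamBox's implementation; the class name `Windower` and its methods are hypothetical:

```python
from collections import defaultdict

WIDTH = 5  # 5-minute windows, as in the slides

class Windower:
    def __init__(self):
        self.windows = defaultdict(list)  # window start -> buffered records

    def add(self, event_minute):
        """Buffer a (possibly out-of-order) record into its window."""
        self.windows[(event_minute // WIDTH) * WIDTH].append(event_minute)

    def on_watermark(self, wm_minute):
        """Watermark wm means all records with event time < wm have
        arrived, so every window ending at or before wm is complete."""
        done = sorted(s for s in self.windows if s + WIDTH <= wm_minute)
        return [(s, self.windows.pop(s)) for s in done]

w = Windower()
for t in [68, 63, 71]:        # 1:08, 1:03, 1:11 arrive out of order
    w.add(t)
fired = w.on_watermark(70)    # watermark 1:10 completes two windows
```

After the watermark for 1:10, the windows starting at 1:00 and 1:05 are emitted, while the still-open window starting at 1:10 keeps buffering.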
35 Epoch: the set of records arriving between two watermarks (e.g., the records between watermarks 1:05 and 1:10). A window may span multiple epochs.
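Slicing an arriving stream into epochs follows directly from the definition: each watermark closes the current epoch. A hedged sketch (the function `slice_epochs` and the `("rec", t)` / `("wm", t)` encoding are assumptions for illustration):

```python
def slice_epochs(stream):
    """Split a stream of records and watermarks into epochs: each epoch
    is the list of records arriving before the next watermark."""
    epochs, current = [], []
    for kind, value in stream:
        if kind == "rec":
            current.append(value)
        else:  # a watermark closes the current epoch
            epochs.append((current, value))
            current = []
    return epochs

# Arrival order: 1:02, 1:03, WM 1:05, 1:08, 1:11, WM 1:10 (times in minutes)
s = [("rec", 62), ("rec", 63), ("wm", 65),
     ("rec", 68), ("rec", 71), ("wm", 70)]
epochs = slice_epochs(s)
```

Note the epoch boundary is arrival order, not event time, which is why a single event-time window can span multiple epochs.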
36 Roadmap Background StreamBox Design Invariants to guarantee correctness Out-of-order epoch processing Evaluation 36
37 Stream processing engines: most stream engines optimize for distributed execution, neglect efficient multicore implementation, and assume a single machine is incapable of handling stream data.
38 Goal: a stream engine for multicore. Modern multicore hardware offers high-throughput I/O, terabytes of DRAM, and a large number of cores. The engine must provide correctness (respect dependences with minimal synchronization), dynamic parallelism (process any records in any epochs), and meet target throughput and latency.
39 Challenges. Correctness: guarantee watermark semantics by meeting two invariants. Throughput: never stall the pipeline. Latency: do not relax the watermark; dynamically adjust parallelism to relieve bottlenecks.
40 Invariant 1, watermark ordering: transforms consume watermarks in order, and a transform consumes all records in an epoch before consuming that epoch's watermark.
41-42 Invariant 2, respect epoch boundaries: once a transform assigns a record to an epoch, the record never changes epochs; output records carry the same epoch boundaries as the input.
43 What if a record changes to a later epoch? It would be consumed after its epoch's watermark, violating the watermark guarantee.
44 What if records change to an earlier epoch? The watermark would have to be relaxed, delaying window completion.
45 Our solution: cascading containers. Each cascading container corresponds to an epoch, tracks the epoch's state and the relationship between its records and its end watermark, and orchestrates worker threads to consume the records and then the watermark.
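The container idea can be sketched as a small data structure that enforces both invariants. This is a simplified, single-threaded Python sketch under stated assumptions (real containers are concurrent and lock-minimal); the class `Container` and its methods are illustrative, not StreamBox's code:

```python
class Container:
    """One container per epoch: holds the epoch's record bundles and its
    end watermark, and links to the matching downstream container."""
    def __init__(self, end_wm):
        self.bundles = []
        self.pending = 0          # bundles dispatched but not yet finished
        self.end_wm = end_wm
        self.downstream = None    # where outputs of this epoch go (invariant 2)

    def put(self, bundle):
        self.bundles.append(bundle)

    def take(self):
        """Workers may grab bundles from any container, in any order."""
        self.pending += 1
        return self.bundles.pop()

    def done(self, outputs):
        """A worker finished a bundle; its outputs stay in the same
        epoch by flowing into the linked downstream container."""
        self.pending -= 1
        if self.downstream is not None:
            self.downstream.put(outputs)

    def watermark_ready(self):
        """Invariant 1: the watermark may be consumed only after all
        records of the epoch have been consumed."""
        return not self.bundles and self.pending == 0

up, down = Container(end_wm=65), Container(end_wm=65)
up.downstream = down
up.put([62, 63])
b = up.take()
up.done(b)  # the processed bundle lands in the downstream container
```

Because `watermark_ready` gates watermark consumption on the container being drained, watermarks are consumed in epoch order even though the records themselves are processed in any order.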
46 Each transform has multiple containers A transform has multiple epochs Each epoch corresponds to a container Containers Transform 0 20:00 15:00 Newest Oldest 46
47 Link each container to a downstream container defined by the transform Transform 0 20:00 15:00 Transform 1 47
48 Records and watermarks flow through the pipeline by following the links. This meets invariant 2 (records respect epoch boundaries) and avoids relaxing the watermark.
49 A watermark is processed only after all records within its container have been processed, guaranteeing invariant 1 (watermark ordering).
50 Watermarks are processed in order, guaranteeing invariant 1 (watermark ordering).
51 All records in all containers can be processed in parallel, avoiding pipeline stalls.
52 Big picture: a pipeline is multiple transforms, the containers form a network, and records and watermarks flow through the links. The result is a highly parallel pipeline that guarantees watermark semantics, avoids stalling the pipeline (for throughput), and avoids relaxing the watermark (for latency).
53 Other key optimizations. Organizing records into bundles: minimize synchronization. Multi-input transforms: defer container ordering to downstream. Pipeline scheduling: prioritize externalization to minimize latency. Pipeline state management: target NUMA awareness and coarse-grained allocation.
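The bundling optimization trades per-record synchronization for per-bundle synchronization: workers pay one synchronized queue operation per bundle of many records. A minimal sketch, assuming an illustrative `bundle` helper and bundle size (real bundle sizes would be tuned and much larger):

```python
from queue import Queue

BUNDLE_SIZE = 4  # illustrative only

def bundle(records, size=BUNDLE_SIZE):
    """Group records into fixed-size bundles so that downstream
    queues see one operation per bundle, not per record."""
    out, cur = [], []
    for r in records:
        cur.append(r)
        if len(cur) == size:
            out.append(cur)
            cur = []
    if cur:
        out.append(cur)
    return out

q = Queue()
for b in bundle(range(10)):
    q.put(b)  # 3 synchronized puts instead of 10
```

With 10 records and a bundle size of 4, the queue sees 3 puts instead of 10, cutting synchronization overhead roughly by the bundle size.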
54 StreamBox implementation: built from scratch in 22K SLoC of C++11. Supported transforms: Windowing, GroupBy, Aggregation, Mapper, Reducer, Temporal Join, Grep. C++ libraries used: Intel TBB, Facebook folly, jemalloc, Boost. Concurrent hash tables: wrapped TBB's concurrent hash map.
55 Benchmarks: windowed grep, word count, counting distinct URLs, network latency monitoring, and tweets sentiment analysis. Machine configurations: CM12 (2 sockets of 6 cores, 256GB DRAM) and CM56 (4 sockets of 14 cores, 256GB DRAM).
56 Roadmap Background StreamBox Design Evaluation 56
57 Evaluation Throughput and scalability Comparison with existing stream engines Handling out-of-order input streaming data Epoch parallelism effectiveness 57
58-60 Good throughput and scalability: Tweets Sentiment Analysis throughput (KRec/s) scales with the number of cores on both CM56 and CM12, at both 1-second and 500-ms target latencies.
61 Good throughput and scalability across all benchmarks: Windowed Grep, Word Count, Temporal Join, Counting Distinct URLs, Network Latency Monitoring, and Tweets Sentiment Analysis all scale with core count on CM56 and CM12, at 1-second and shorter (50-ms or 500-ms) target latencies.
62 StreamBox vs. existing stream engines (Spark v2.1.0, Beam v0.5.0): StreamBox achieves significantly better throughput and scalability as cores are added.
63 Handling out-of-order records: with 20% and 40% of records arriving out of order, throughput drops only slightly (e.g., a 7% drop on WordCount) across WordCount, Netmon, and Tweets. StreamBox achieves good throughput even with many out-of-order records.
64 Epoch parallelism is effective: compared with StreamBox, an in-order parallel baseline (as in prior work, one epoch per transform at a time) suffers a large throughput drop (87% on Grep) on the Grep and WordCount benchmarks.
65 Summary: StreamBox on multicores processes any records in any epochs in parallel, using all CPU cores. It achieves high throughput with low latency: millions of records per second, on par with distributed engines running on clusters with a few hundred CPU cores, and latencies of tens of milliseconds, 20x shorter than other large-scale engines.
More informationShark: Hive (SQL) on Spark
Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce
More informationApache Spark Graph Performance with Memory1. February Page 1 of 13
Apache Spark Graph Performance with Memory1 February 2017 Page 1 of 13 Abstract Apache Spark is a powerful open source distributed computing platform focused on high speed, large scale data processing
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationHANA Performance. Efficient Speed and Scale-out for Real-time BI
HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business
More informationLightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University
Lightweight Streaming-based Runtime for Cloud Computing granules Shrideep Pallickara Community Grids Lab, Indiana University A unique confluence of factors have driven the need for cloud computing DEMAND
More informationNowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?
Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationHPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh
HPMMAP: Lightweight Memory Management for Commodity Operating Systems Brian Kocoloski Jack Lange University of Pittsburgh Lightweight Experience in a Consolidated Environment HPC applications need lightweight
More informationCrossing the Chasm: Sneaking a parallel file system into Hadoop
Crossing the Chasm: Sneaking a parallel file system into Hadoop Wittawat Tantisiriroj Swapnil Patil, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University In this work Compare and contrast large
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationEVCache: Lowering Costs for a Low Latency Cache with RocksDB. Scott Mansfield Vu Nguyen EVCache
EVCache: Lowering Costs for a Low Latency Cache with RocksDB Scott Mansfield Vu Nguyen EVCache 90 seconds What do caches touch? Signing up* Logging in Choosing a profile Picking liked videos
More informationChapter 8: Memory-Management Strategies
Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and
More informationHardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.
Hardware and Software solutions for scaling highly threaded processors Denis Sheahan Distinguished Engineer Sun Microsystems Inc. Agenda Chip Multi-threaded concepts Lessons learned from 6 years of CMT
More informationcustinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader
custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader What we will see today The first dynamic graph data structure for the GPU. Scalable in size Supports the same functionality
More informationSoftware and Tools for HPE s The Machine Project
Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric
More informationA closer look at network structure:
T1: Introduction 1.1 What is computer network? Examples of computer network The Internet Network structure: edge and core 1.2 Why computer networks 1.3 The way networks work 1.4 Performance metrics: Delay,
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationIntroduction to MapReduce (cont.)
Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationEffect of memory latency
CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable
More informationScalable Streaming Analytics
Scalable Streaming Analytics KARTHIK RAMASAMY @karthikz TALK OUTLINE BEGIN I! II ( III b Overview Storm Overview Storm Internals IV Z V K Heron Operational Experiences END WHAT IS ANALYTICS? according
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Caches Part 3 Instructors: Bernhard Boser & Randy H. Katz http://inst.eecs.berkeley.edu/~cs61c/ 10/24/16 Fall 2016 - Lecture #16 1 Software
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationBig Data Platforms. Alessandro Margara
Big Data Platforms Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Data Science Data science is an interdisciplinary field that uses scientific methods, processes, algorithms
More informationShark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker
Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha
More information