Building Interactive Distributed Processing Applications at a Global Scale


1 Building Interactive Distributed Processing Applications at a Global Scale. Shadi A. Noghabi. Prelim Committee: Prof. Roy Campbell (advisor), Prof. Indy Gupta (advisor), Prof. Klara Nahrstedt, Dr. Victor Bahl

2 Many Interactive Applications
< few seconds: 100s of PBs of data
< few minutes: trillions of events
< few milliseconds: millions to billions of devices

3 Low Latency at Large Scale
Latency-sensitive: reaching low latency is important for interactivity. 100 ms of latency (Amazon) -> 1% loss in revenue; $4M per millisecond! *
Large-scale: processing at large and global scales.
My focus: building interactive systems at production scale.

4 At Scale, Low Latency is Challenging
Heterogeneity: data, machines, operations
Dynamism: workload, environment
Distribution: state & replication, imbalance

5 Thesis Statement: Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale while addressing the myriad challenges of heterogeneity, dynamism, state, and imbalance.

6 Latency-driven Design
Achieving low latency has the highest priority (PACELC [1, 2]): latency vs. consistency, availability, throughput, and isolation.
1. Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design", Computer
2. Brewer, Gilbert, and Lynch, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services", SIGACT News

7 Related Work
Latency-driven in other frameworks: the trend of NoSQL systems (Cassandra, Amazon's DynamoDB, Pileus).
Latency-driven does not fit all:
Consistency-driven: consistent storage (BigTable, Spanner, Yahoo's PNUTS, Espresso); exactly-once stream processing (Flink, MillWheel, Spark Streaming, Trident).
Throughput-driven: batch processing (Hadoop, Spark, Tez, Pig, and Hive); micro-batch streaming (Spark Streaming, media processing); high-throughput storage (HDFS and GFS).

8 Latency-driven Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing

9 Latency-driven at All Layers (storage, processing, networking, and edge, across edge sites and datacenters DC 1 ... DC n)
Ambry [SIGMOD 16]: main production system for >4 years, with 500M users at LinkedIn
Samza [VLDB 17]: used in production at ~15 companies, with 100s of applications at LinkedIn
Steel [HotCloud 18]: integrated with production Azure & Windows IoT
FreeFlow [HotNets 17]: general container networking system, applicable to Docker, Kubernetes, etc.

10-11 [Overview diagram] Latency-driven techniques (background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing) applied across storage (Ambry [SIGMOD 16]), processing (Samza [VLDB 17], Steel [HotCloud 18]), and networking (FreeFlow [HotNets 17])

12 Samza: Stateful Scalable Stream Processing at LinkedIn (VLDB 17). Shadi A. Noghabi*, Kartik Paramasivam^, Yi Pan^, Navina Ramesh^, Jon Bringhurst^, Indranil Gupta*, Roy Campbell*. * University of Illinois at Urbana-Champaign, ^ LinkedIn Corp.

13 Stream Processing*: interactive feeds or ads; security & monitoring; Internet of Things.
*Stream processing is the processing of data in motion, i.e., computing on data immediately as it is produced.

14 Stream Applications Need State
Example: Access stream (IP, access time) -> DoS Identifier (state: count per IP) -> DoS stream (attacker IP).
State: durable data read/written along with the processing.
Many cases need large state:
Aggregation: aggregating counts over a window
Join: joining two (or more) streams over a window
Other: machine learning model

15 Scale of Processing
Large scale of input, in Kafka alone: 2.1 trillion msg/day, 16 million msg/sec at peak; 0.5 PB in, 2 PB out per day.
Many applications: 100s of production applications, over 10,000 containers.
Large scale of state: several 100s of TBs for a single application.

16 Stateful Stream Processing
Need to handle state with low latency, at large scale, and with correct (consistent) results.
Open-source Apache project. Powers hundreds of apps in LinkedIn's production. In use at LinkedIn, Uber, Metamarkets, Netflix, Intuit, TripAdvisor, VMware, Optimizely, Redfin, etc.

17-18 Stateful Stream Processing
Processing model: input partitioning; parallel and independent tasks; locality-aware data passing.
Stateful: local state; incremental checkpointing; 3-tier caching; parallel recovery; host stickiness; compaction.

19-22 Processing from a Partitioned Source
The stream of events (e.g., page accesses, ad views) is split into partitions 1 to k. Partition 1 is processed by Task 1, Partition 2 by Task 2, and so on up to Partition k and Task k (partitioning & parallelism).
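
As an illustration of the partition-to-task mapping on this slide, here is a minimal sketch. The function names and the use of CRC32 as the partitioning hash are assumptions for the example, not Samza's or Kafka's actual partitioner.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 is an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Group (key, value) events by partition; task i would consume only partition i."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical page-access events keyed by client IP.
events = [("10.0.0.1", "/home"), ("10.0.0.2", "/ads"), ("10.0.0.1", "/feed")]
by_partition = route(events, num_partitions=4)
```

Because the hash is stable, every event with the same key lands in the same partition, so a single task sees all events for that key and tasks can run independently.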

23 Multi-Stage Dataflow (locality)
Application logic: count the number of page accesses for each IP domain in a 5-minute window.
Application DAG (in tasks): PageAccess input stream -> Window count -> Filter -> SendTo -> DoS output stream.
Data parallelism: each task performs the entire pipeline on its own chunk of input data.
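
The pipeline on this slide (window count, then filter, then send) can be sketched in plain Python as follows. This is not the Samza API: the tumbling-window semantics, the threshold parameter, and all names are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # the 5-minute window from the slide


def window_count(events):
    """Count accesses per (window start, IP domain) using tumbling windows."""
    counts = defaultdict(int)
    for ip_domain, timestamp in events:
        window_start = timestamp - timestamp % WINDOW_SECONDS
        counts[(window_start, ip_domain)] += 1
    return counts


def detect(events, threshold):
    """Filter step: keep windows whose count exceeds the (hypothetical) threshold."""
    suspects = []
    for (window_start, ip_domain), n in sorted(window_count(events).items()):
        if n > threshold:
            suspects.append((window_start, ip_domain, n))
    return suspects  # a real job would send these to the output stream


# (ip_domain, timestamp in seconds) pairs for one task's chunk of input.
accesses = [("a.com", 0), ("a.com", 10), ("a.com", 299), ("b.com", 5), ("a.com", 300)]
flagged = detect(accesses, threshold=2)
```

Each task would run this whole pipeline on its own partition, which is what the slide calls data parallelism.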

26-27 Stateful Processing
A Samza application (Task 1, Task 2, ..., Task k) stores the count of accesses per IP domain as state (state: IP access count). But using a remote store is slow.

28 Stateful Processing (locality)
Local state: use the local storage of tasks. Each task holds the state corresponding to its own partition. Ex: task 1, processing input [1-100], holds state [1-100].

29 Stateful Processing: local storage options (tiered data access)
In-memory: fast, but low capacity (GBs)
On-disk: high capacity (TBs) and persistent, but slow
Disk-mem: disk (persistent store) + memory (cache); fast, high capacity, and persistent
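
The disk-mem option can be sketched as a write-through LRU cache in front of a persistent store. The class below is a hypothetical illustration: a plain dict stands in for the on-disk store, and the eviction policy is an assumption rather than Samza's actual caching layer.

```python
from collections import OrderedDict


class TieredStore:
    """Memory cache (LRU) in front of a persistent backing store."""

    def __init__(self, backing, cache_size):
        self.backing = backing        # stand-in for the on-disk store
        self.cache = OrderedDict()    # in-memory tier, least recently used first
        self.cache_size = cache_size
        self.backing_reads = 0        # counts reads that had to go to "disk"

    def _remember(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry

    def get(self, key, default=None):
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        self.backing_reads += 1
        if key not in self.backing:
            return default
        value = self.backing[key]
        self._remember(key, value)
        return value

    def put(self, key, value):
        self.backing[key] = value     # write-through keeps the slow tier durable
        self._remember(key, value)
```

Hot keys are served from memory while the full state stays on the larger, persistent tier.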

30 Stateful Processing Samza Application Task 1 Task 2 Task k [1-100] [ ] [ ] What about failures? How to not lose state? 30

31-34 Fault-Tolerance (background processing)
Changes to each task's state (e.g., X = 10) are saved to a durable changelog, e.g., a Kafka log-compacted topic with partitions 1 to k.
Changes are periodically batched and flushed to the changelog.
When a task fails, its state is recovered by replaying the changelog.
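
A minimal sketch of the changelog idea follows, assuming a Python list as a stand-in for one durable changelog partition. The class, its flush policy, and the `compact` helper are hypothetical and only mimic what a log-compacted topic provides.

```python
class ChangelogTask:
    """Task with local state whose changes are batched into a durable changelog."""

    def __init__(self, changelog, flush_every=3):
        self.state = {}
        self.changelog = changelog    # list standing in for a durable log partition
        self.pending = []             # changes not yet flushed
        self.flush_every = flush_every

    def put(self, key, value):
        self.state[key] = value
        self.pending.append((key, value))
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        """Append the batched changes to the changelog in one step."""
        self.changelog.extend(self.pending)
        self.pending.clear()

    @classmethod
    def recover(cls, changelog, flush_every=3):
        """Rebuild local state by replaying the changelog from the beginning."""
        task = cls(changelog, flush_every)
        for key, value in changelog:
            task.state[key] = value
        return task


def compact(changelog):
    """Keep only the latest value per key, as log compaction would."""
    latest = {}
    for key, value in changelog:
        latest[key] = value
    return list(latest.items())
```

Changes that were still pending at the time of a failure are not in the changelog, so in this sketch they would have to be regenerated by reprocessing input, which is why the design relies on a replayable input source.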

35-36 But, what is the cost?
Latency
Consistency: replayable input + consistent snapshot
Availability: longer failure recovery

37 Consistency Guarantees
Each task (Task 1, Task 2, ..., Task k) records its input offset (e.g., offset 1005) in an offsets topic (e.g., a Kafka topic), alongside its state changes (e.g., X = 10) in the changelog (e.g., a Kafka topic with partitions 1 to k).

38 Availability Optimizations
Availability is compromised. To improve it we use:
Parallel recovery (partitioning & parallelism)
Background compaction (background processing)
Host stickiness (locality)

39 Fast Restarts with Host Stickiness
Setup: an input stream feeds a job whose tasks run on Host-A (Task-1, Task-4), Host-B (Task-2), and Host-C (Task-3), backed by the changelog.
Durable Task-Container-Host mapping: Task 1, Task 4 -> Host-A; Task 2 -> Host-B; Task 3 -> Host-C.
Host stickiness in YARN: try to place each task on the same host after a restart; minimize state rebuilding overhead.
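
The placement preference described here can be sketched as below. This is a hypothetical helper, not YARN's or Samza's scheduler: it assumes a saved task-to-host mapping and falls back to the least loaded live host when the previous host is gone.

```python
def place_tasks(tasks, last_host, live_hosts):
    """Return {task: host}, preferring each task's previous host when it is alive."""
    load = {host: 0 for host in live_hosts}
    placement = {}
    # First pass: sticky placement, so local state can be reused as-is.
    for task in tasks:
        host = last_host.get(task)
        if host in load:
            placement[task] = host
            load[host] += 1
    # Second pass: remaining tasks go to the least loaded live host
    # (ties broken by host name) and must rebuild state from the changelog.
    for task in tasks:
        if task not in placement:
            host = min(sorted(load), key=lambda h: load[h])
            placement[task] = host
            load[host] += 1
    return placement


# Mapping from the slide; Host-C is assumed to have failed.
previous = {"Task-1": "Host-A", "Task-4": "Host-A", "Task-2": "Host-B", "Task-3": "Host-C"}
placement = place_tasks(list(previous), previous, live_hosts=["Host-A", "Host-B"])
```

Only the task that lost its host pays the cost of rebuilding state; the others restart next to their existing local stores.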

40 Related Work
Stateless (do not support state): Apache Storm [SIGMOD 14], Heron [SIGMOD 15], S4 [ICDM 10].
Remote store (relying on external storage): Trident, MillWheel [VLDB 13], Dataflow [VLDB 15]. High latency overhead per input message, but fast recovery (better availability).
Local state, but full state checkpointing: Flink [TCDE 15], Spark Streaming [HotCloud 12], IBM Streams [VLDB 16], StreamScope [NSDI 16], System S [DEBS 11]. Does not scale for large application state, but makes repeatable results at failure easier.

41 Evaluation

42 Evaluation Setup
Production cluster: 500-node YARN cluster; real-world applications.
Small cluster: 8-node cluster (64 GB RAM, 24-core CPUs, a 1.6 TB SSD); micro-benchmarks with a read-only workload (~ join with a table) and a read-write workload (~ aggregation over time).

43 Local State -- Latency
[Figure: latency by store (in-mem, on-disk, disk-mem, remote) and fault-tolerance option (none, changelog (Clog), remote)]
Remote state is > 2 orders of magnitude slower than local state.
The changelog adds minimal overhead.
On-disk with caching is comparable with in-memory.

44 Failure Recovery
[Figure: recovery overhead with and without host stickiness]
Recovery overhead grows linearly with the size of state.
Almost constant overhead with host stickiness.
Overhead is independent of the % of failures (parallel recovery).

45 Summary of State in Samza
Pros: 100x better latency; better resource utilization (no need for remote DB resources); parallel and independent processing.
Cons: dependent on input partitioning; does NOT work when state is large and not co-partitionable with the input stream; auto-scaling becomes harder; recovery is slower (vs. remote).

46 Conclusion
Need to handle state with low latency, at large scale, and with correct (consistent) results.
Processing model: input partitioning; parallel and independent tasks; locality-aware data passing.
Stateful: local state; incremental checkpointing; 3-tier caching; parallel recovery; host stickiness; compaction.
Techniques used: locality, background processing, partitioning & parallelism.

49 Steel: Unified and Optimized Edge-Cloud Environment (HotCloud 18). Shadi A. Noghabi*, John Kolb^, Peter Bodik**, Eduardo Cuervo**, Indranil Gupta*, Roy Campbell*. * University of Illinois at Urbana-Champaign, ^ University of California, Berkeley, ** Microsoft Research

50-52 Cloud is Not a One-Size-Fits-All
Latency: applications need < 80 ms, < 20 ms, or < 10 ms, but the Cloud is simply too far (> 70 ms round-trip time).
Resources: bandwidth (10s to 100s of GB/s), battery, energy, heating; not enough resources to use the Cloud.
Privacy & Security.

53 Edge Computing
[Figure: Edge 1, Edge 2, and Edge 3 connected to the Cloud]
Satyanarayanan, M., Bahl, P., et al., "The Case for VM-based Cloudlets in Mobile Computing", Pervasive Computing

54 However
Industry is in its infancy of building Edge-Cloud applications: no proper unification of Edge & Cloud; hard to configure, deploy, and monitor; manual and redundant optimizations.
Steel: a unified edge-cloud framework with modular and automated optimizations. Integrated with production Azure services.

55 [Architecture diagram] Applications are written against the Steel abstraction (a logical spec, e.g., modules A, B, C, D). The Fabric compiles, deploys, monitors & analyzes them on the Edge-Cloud ecosystem. Optimization modules: placement, communication, load balancing.

56 Optimizer Modules
Placement: where (edge/cloud) to place? Adapt to long-term changes.
Communication: configure edge-cloud links. Adapt to short-term spikes.
Techniques used: location and vicinity aware (locality); background profiling & what-if analysis, background batching & compression (background processing); partitioned and parallel optimizations (partitioning & parallelism); change strategy opportunistically based on available resources (opportunistic processing).

59 Ambry: LinkedIn's Scalable Geo-Distributed Object Store (SIGMOD 16). Shadi A. Noghabi*, Sriram Subramanian+, Priyesh Narayanan+, Sivabalan Narayanan+, Gopalakrishna Holla+, Mammad Zadeh+, Tianwei Li+, Indranil Gupta*, Roy H. Campbell*. * University of Illinois at Urbana-Champaign (SRG and DPRG), + LinkedIn Corporation, Data Infrastructure team

60 Massive Media Objects are Everywhere 100s of Millions of users 100s of TBs to PBs From all around the world 60

61 How to handle all these objects?
We need a geographically distributed system that stores and retrieves objects in a low-latency and scalable manner: Ambry.
In production for >4 years and >500 M users! Open source.

62-63 Geo-Distribution (locality, background processing)
Asynchronous writes: write synchronously only to the local datacenter and count success; asynchronously replicate to the others (background replication).
This reduces user-perceived latency and minimizes cross-DC traffic.
Example: Alice writes item_i; it is first stored in one datacenter and later appears in Datacenters 1, 2, and 3.
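
The write path on this slide can be sketched as follows. This is an illustrative toy, not Ambry's implementation: each datacenter is modeled as a single dict, the backlog queue and method names are assumptions, and replica counting within a datacenter is omitted.

```python
from collections import deque


class GeoStore:
    """Synchronous local write plus a backlog for asynchronous replication."""

    def __init__(self, datacenters):
        self.replicas = {dc: {} for dc in datacenters}
        self.backlog = deque()   # pending (datacenter, blob_id, data) copies

    def put(self, local_dc, blob_id, data):
        """Write only to the caller's datacenter, then report success."""
        self.replicas[local_dc][blob_id] = data
        for dc in self.replicas:
            if dc != local_dc:
                self.backlog.append((dc, blob_id, data))
        return "success"

    def replicate_once(self):
        """One pass of the background replicator: drain the backlog."""
        while self.backlog:
            dc, blob_id, data = self.backlog.popleft()
            self.replicas[dc][blob_id] = data


store = GeoStore(["dc1", "dc2", "dc3"])
status = store.put("dc1", "item_i", b"photo-bytes")
store.replicate_once()
```

The caller waits only for the local write, so the cross-datacenter copy stays off the user-facing latency path.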

64 Other Techniques
Ambry is a large production system with many other techniques:
Logical data partitioning (partitioning & parallelism)
Caching, indexing, bloom filters (tiered data access)
Background load balancing (load balancing)

67 FreeFlow: High performance container networking HotNets 16 Tianlong Yu ^, Shadi A. Noghabi *, Shachar Raindel, Hongqiang Liu, Jitu Padhye, and Vyas Sekar ^ * University of Illinois at Urbana-Champaign Microsoft Research ^ Carnegie Mellon University 67

68 Containers are popular Containers provide good portability and isolation 68

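69 FreeFlow (opportunistic processing, locality)
[Diagram: Containers 1, 2, and 3 on Node 1 and Node 2, connected through the FreeFlow network]
Containers on the same node communicate through shared memory; containers on different nodes communicate through RDMA.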
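
The transport choice suggested by this slide can be sketched as a small decision function. The function name, the placement map, and the TCP fallback are assumptions for illustration; the slide itself only shows shared memory and RDMA.

```python
def choose_transport(src, dst, placement, rdma_nodes):
    """Pick the fastest available channel between two containers.

    placement maps container name -> node name; rdma_nodes is the set of
    nodes assumed to have RDMA-capable NICs.
    """
    src_node, dst_node = placement[src], placement[dst]
    if src_node == dst_node:
        return "shared-memory"    # co-located containers skip the network
    if src_node in rdma_nodes and dst_node in rdma_nodes:
        return "rdma"             # cross-node traffic bypasses the kernel stack
    return "tcp"                  # assumed fallback, not shown on the slide


# Hypothetical placement of three containers on two nodes.
placement = {"container-1": "node-1", "container-2": "node-1", "container-3": "node-2"}
rdma_nodes = {"node-1", "node-2"}
local = choose_transport("container-1", "container-2", placement, rdma_nodes)
remote = choose_transport("container-1", "container-3", placement, rdma_nodes)
```

The decision is opportunistic: the faster channel is used whenever the containers' locations allow it.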

71 Future Directions
One global system, running seamlessly across the globe: a holistic and cross-layer approach.
Latency sensitivity-aware approaches (not treating all objects/apps equally). E.g., dynamic use of compression, erasure coding, and compaction in storage for old objects. E.g., multi-tenant stream processing environments.
Auto-pilot systems: coping with changes automatically and dynamically.

72 Thanks to Many Collaborators
Advisors: Indranil Gupta and Roy Campbell
LinkedIn: Sriram Subramanian, Priyesh Narayanan, Kartik Paramasivam, Yi Pan, Navina Ramesh, et al.
Microsoft Research: Victor Bahl, Peter Bodik, Eduardo Cuervo, Hongqiang Liu, Jitu Padhye, et al.

73 Publications
1. Steel: Unified Management and Optimization of Edge-Cloud Applications, SA Noghabi, J Kolb, P Bodik, E Cuervo, I Gupta, RH Campbell, (in progress)
2. Steel: Simplified Development and Deployment of Edge-Cloud Applications, SA Noghabi, J Kolb, P Bodik, E Cuervo, HotCloud
3. Samza: Stateful Scalable Stream Processing at LinkedIn, SA Noghabi, K Paramasivam, Y Pan, N Ramesh, J Bringhurst, I Gupta, RH Campbell, VLDB
4. Ambry: LinkedIn's Scalable Geo-Distributed Object Store, SA Noghabi, S Subramanian, P Narayanan, S Narayanan, G Holla, M Zadeh, T Li, I Gupta, RH Campbell, SIGMOD
5. Freeflow: High performance container networking, T Yu, SA Noghabi, S Raindel, H Liu, J Padhye, V Sekar, HotNets
6. Steel: Unified Management and Optimization of Edge-Cloud IoT Applications, NSDI poster
7. To edge or not to edge?, F Kalim, SA Noghabi, S Verma, SoCC poster
8. Building a Scalable Distributed Online Media Processing Environment, SA Noghabi, PhD workshop at VLDB
9. Toward Fabric: A Middleware Implementing High-level Description Languages on a Fabric-like Network, SH Hashemi, SA Noghabi, J Bellessa, RH Campbell, ANCS
10. Towards enabling cooperation between scheduler and storage layer to improve job performance, M Pundir, J Bellessa, SA Noghabi, CL Abad, RH Campbell, PDSW

74 Summary of Thesis
Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale.
Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing.
Processing: Samza [VLDB 17], Steel [HotCloud 18]
Storage: Ambry [SIGMOD 16]
Networking: FreeFlow [HotNets 17]

75 Backup Slides

76 Theoretical Background
Latency-driven design can be modeled theoretically with queueing theory*.
System: a hierarchical queueing system
Latency: response time (response_processed - req_enter_time)
Techniques: queueing theory models/techniques
Parallelism & partitioning: multiple queues, parallel servers
Adaptive scaling: adding more servers
* Gross, Donald. Fundamentals of queueing theory. John Wiley & Sons
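
As a worked example of this modeling, the sketch below uses the standard M/M/1 result for mean response time, W = 1 / (mu - lambda). Choosing M/M/1 (Poisson arrivals, exponential service, one server per queue) is an assumption made here for illustration; the slide does not name a specific model.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


def partitioned_response_time(arrival_rate, service_rate, partitions):
    """Split the load evenly over independent queues, one server each."""
    return mm1_response_time(arrival_rate / partitions, service_rate)


# Hypothetical numbers: 80 requests/s arriving, each server handles 100 requests/s.
single = mm1_response_time(80, 100)              # 1 / (100 - 80) seconds
split = partitioned_response_time(80, 100, 4)    # 1 / (100 - 20) seconds
```

Under these assumptions, partitioning the input over four queues cuts the mean response time from 50 ms to 12.5 ms, which is the effect the slide attributes to multiple queues and parallel servers.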

77 Other Techniques
Batch + stream in one system.
Reprocess part or all of a stream or database (bugs, upgrades, logic changes).
Batch as a finite stream + conflict resolution, scaling & throttling (adaptive scaling).

78 Local State -- Throughput remote state x worse than local state on disk w/ caching comparable with in memory changelog adds minimal overhead 78

79 Snapshot vs Changelog 79


More information

Modern hyperconverged infrastructure. Karel Rudišar Systems Engineer, Vmware Inc.

Modern hyperconverged infrastructure. Karel Rudišar Systems Engineer, Vmware Inc. Modern hyperconverged infrastructure Karel Rudišar Systems Engineer, Vmware Inc. 2 What Is Hyper-Converged Infrastructure? - The Ideal Architecture for SDDC Management SDDC Compute Networking Storage Simplicity

More information

Hortonworks and The Internet of Things

Hortonworks and The Internet of Things Hortonworks and The Internet of Things Dr. Bernhard Walter Solutions Engineer About Hortonworks Customer Momentum ~700 customers (as of November 4, 2015) 152 customers added in Q3 2015 Publicly traded

More information

MapR Enterprise Hadoop

MapR Enterprise Hadoop 2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS

More information

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

Memory-Based Cloud Architectures

Memory-Based Cloud Architectures Memory-Based Cloud Architectures ( Or: Technical Challenges for OnDemand Business Software) Jan Schaffner Enterprise Platform and Integration Concepts Group Example: Enterprise Benchmarking -) *%'+,#$)

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

NPTEL Course Jan K. Gopinath Indian Institute of Science

NPTEL Course Jan K. Gopinath Indian Institute of Science Storage Systems NPTEL Course Jan 2012 (Lecture 39) K. Gopinath Indian Institute of Science Google File System Non-Posix scalable distr file system for large distr dataintensive applications performance,

More information

Architectural challenges for building a low latency, scalable multi-tenant data warehouse

Architectural challenges for building a low latency, scalable multi-tenant data warehouse Architectural challenges for building a low latency, scalable multi-tenant data warehouse Mataprasad Agrawal Solutions Architect, Services CTO 2017 Persistent Systems Ltd. All rights reserved. Our analytics

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Cloud Computing at Yahoo! Thomas Kwan Director, Research Operations Yahoo! Labs

Cloud Computing at Yahoo! Thomas Kwan Director, Research Operations Yahoo! Labs Cloud Computing at Yahoo! Thomas Kwan Director, Research Operations Yahoo! Labs Overview Cloud Strategy Cloud Services Cloud Research Partnerships - 2 - Yahoo! Cloud Strategy 1. Optimizing for Yahoo-scale

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

WHITEPAPER. MemSQL Enterprise Feature List

WHITEPAPER. MemSQL Enterprise Feature List WHITEPAPER MemSQL Enterprise Feature List 2017 MemSQL Enterprise Feature List DEPLOYMENT Provision and deploy MemSQL anywhere according to your desired cluster configuration. On-Premises: Maximize infrastructure

More information

Big data streaming: Choices for high availability and disaster recovery on Microsoft Azure. By Arnab Ganguly DataCAT

Big data streaming: Choices for high availability and disaster recovery on Microsoft Azure. By Arnab Ganguly DataCAT : Choices for high availability and disaster recovery on Microsoft Azure By Arnab Ganguly DataCAT March 2019 Contents Overview... 3 The challenge of a single-region architecture... 3 Configuration considerations...

More information

Tools for Social Networking Infrastructures

Tools for Social Networking Infrastructures Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes

More information

2013 AWS Worldwide Public Sector Summit Washington, D.C.

2013 AWS Worldwide Public Sector Summit Washington, D.C. 2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic

More information

Spotify. Scaling storage to million of users world wide. Jimmy Mårdell October 14, 2014

Spotify. Scaling storage to million of users world wide. Jimmy Mårdell October 14, 2014 Cassandra @ Spotify Scaling storage to million of users world wide! Jimmy Mårdell October 14, 2014 2 About me Jimmy Mårdell Tech Product Owner in the Cassandra team 4 years at Spotify

More information

Over the last few years, we have seen a disruption in the data management

Over the last few years, we have seen a disruption in the data management JAYANT SHEKHAR AND AMANDEEP KHURANA Jayant is Principal Solutions Architect at Cloudera working with various large and small companies in various Verticals on their big data and data science use cases,

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017. Dublin Apache Kafka Meetup, 30 August 2017 The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Joseph @pleia2 * ASF projects 1 Elizabeth K. Joseph, Developer Advocate Developer Advocate

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google* 정학수, 최주영 1 Outline Introduction Design Overview System Interactions Master Operation Fault Tolerance and Diagnosis Conclusions

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud?

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud? DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing Slide 1 Slide 3 ➀ What is Cloud Computing? ➁ X as a Service ➂ Key Challenges ➃ Developing for the Cloud Why is it called Cloud? services provided

More information

Modern Stream Processing with Apache Flink

Modern Stream Processing with Apache Flink 1 Modern Stream Processing with Apache Flink Till Rohrmann GOTO Berlin 2017 2 Original creators of Apache Flink da Platform 2 Open Source Apache Flink + da Application Manager 3 What changes faster? Data

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in

More information

CSE6331: Cloud Computing

CSE6331: Cloud Computing CSE6331: Cloud Computing Leonidas Fegaras University of Texas at Arlington c 2019 by Leonidas Fegaras Cloud Computing Fundamentals Based on: J. Freire s class notes on Big Data http://vgc.poly.edu/~juliana/courses/bigdata2016/

More information

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Nov 01 09:53:32 2012 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2012 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Inventing Internet TV Available in more than 190 countries 104+ million subscribers Lots of Streaming == Lots of Traffic

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu } Introduction } Architecture } File

More information

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache Databases on AWS 2017 Amazon Web Services, Inc. and its affiliates. All rights served. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services,

More information

Warehouse- Scale Computing and the BDAS Stack

Warehouse- Scale Computing and the BDAS Stack Warehouse- Scale Computing and the BDAS Stack Ion Stoica UC Berkeley UC BERKELEY Overview Workloads Hardware trends and implications in modern datacenters BDAS stack What is Big Data used For? Reports,

More information

CIT 668: System Architecture. Amazon Web Services

CIT 668: System Architecture. Amazon Web Services CIT 668: System Architecture Amazon Web Services Topics 1. AWS Global Infrastructure 2. Foundation Services 1. Compute 2. Storage 3. Database 4. Network 3. AWS Economics Amazon Services Architecture Regions

More information

Data Analytics with HPC. Data Streaming

Data Analytics with HPC. Data Streaming Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017 Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google

More information

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu Database Architecture 2 & Storage Instructor: Matei Zaharia cs245.stanford.edu Summary from Last Time System R mostly matched the architecture of a modern RDBMS» SQL» Many storage & access methods» Cost-based

More information

Data Centers and Cloud Computing. Slides courtesy of Tim Wood

Data Centers and Cloud Computing. Slides courtesy of Tim Wood Data Centers and Cloud Computing Slides courtesy of Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

Research Faculty Summit Systems Fueling future disruptions

Research Faculty Summit Systems Fueling future disruptions Research Faculty Summit 2018 Systems Fueling future disruptions Elevating the Edge to be a Peer of the Cloud Kishore Ramachandran Embedded Pervasive Lab, Georgia Tech August 2, 2018 Acknowledgements Enrique

More information

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( ) Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL

More information

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017 UNIFY DATA AT MEMORY SPEED Haoyuan (HY) Li, CEO @ Alluxio Inc. VAULT Conference 2017 March 2017 HISTORY Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in

More information

Big Data Architect.

Big Data Architect. Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional

More information

Aurora, RDS, or On-Prem, Which is right for you

Aurora, RDS, or On-Prem, Which is right for you Aurora, RDS, or On-Prem, Which is right for you Kathy Gibbs Database Specialist TAM Katgibbs@amazon.com Santa Clara, California April 23th 25th, 2018 Agenda RDS Aurora EC2 On-Premise Wrap-up/Recommendation

More information

Data Centers and Cloud Computing. Data Centers

Data Centers and Cloud Computing. Data Centers Data Centers and Cloud Computing Slides courtesy of Tim Wood 1 Data Centers Large server and storage farms 1000s of servers Many TBs or PBs of data Used by Enterprises for server applications Internet

More information

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication Building Consistent Transactions with Inconsistent Replication Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports University of Washington Distributed storage systems

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma Apache Hadoop Goes Realtime at Facebook Guide - Dr. Sunny S. Chung Presented By- Anand K Singh Himanshu Sharma Index Problem with Current Stack Apache Hadoop and Hbase Zookeeper Applications of HBase at

More information

Monitoring system for geographically distributed datacenters based on Openstack. Gioacchino Vino

Monitoring system for geographically distributed datacenters based on Openstack. Gioacchino Vino Monitoring system for geographically distributed datacenters based on Openstack Gioacchino Vino Tutor: Dott. Domenico Elia Tutor: Dott. Giacinto Donvito Borsa di studio GARR Orio Carlini 2016-2017 INFN

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC https://ignite.apache.org @apacheignite @dsetrakyan Agenda About In- Memory Computing Apache Ignite

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Building a Data-Friendly Platform for a Data- Driven Future

Building a Data-Friendly Platform for a Data- Driven Future Building a Data-Friendly Platform for a Data- Driven Future Benjamin Hindman - @benh 2016 Mesosphere, Inc. All Rights Reserved. INTRO $ whoami BENJAMIN HINDMAN Co-founder and Chief Architect of Mesosphere,

More information

Apache Flink Big Data Stream Processing

Apache Flink Big Data Stream Processing Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017

More information

High Throughput WAN Data Transfer with Hadoop-based Storage

High Throughput WAN Data Transfer with Hadoop-based Storage High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wuerthwein 1 1 University of California, San

More information

The Stream Processor as a Database. Ufuk

The Stream Processor as a Database. Ufuk The Stream Processor as a Database Ufuk Celebi @iamuce Realtime Counts and Aggregates The (Classic) Use Case 2 (Real-)Time Series Statistics Stream of Events Real-time Statistics 3 The Architecture collect

More information

Apache Kafka Your Event Stream Processing Solution

Apache Kafka Your Event Stream Processing Solution Apache Kafka Your Event Stream Processing Solution Introduction Data is one among the newer ingredients in the Internet-based systems and includes user-activity events related to logins, page visits, clicks,

More information

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 HDFS Architecture Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 Based Upon: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoopproject-dist/hadoop-hdfs/hdfsdesign.html Assumptions At scale, hardware

More information

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved. Apache Hadoop 3 Balazs Gaspar Sales Engineer CEE & CIS balazs@cloudera.com 1 We believe data can make what is impossible today, possible tomorrow 2 We empower people to transform complex data into clear

More information