Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM
|
|
- Chastity Green
- 6 years ago
- Views:
Transcription
1 Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM
2 With IT since 2007 With Java since 2009 With Hadoop since 2012 With Spark since 2014 With EPAM since 2015 About
3 Contacts Alexey_Zinovyev@epam.com Facebook: vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs 3
4 Spark Family 4
5 Spark Family 5
6 Pre-summary Before RealTime Spark + Cassandra Sending messages with Kafka DStream Kafka Consumer Structured Streaming in Spark 2.1 Kafka Writer in Spark 2.2 6
7 < REAL-TIME 7
8 Modern Java in 2016 Big Data in
9 Batch jobs produce reports. More and more.. 9
10 But customer can wait forever (ok, 1h) 10
11 Big Data in
12 Machine Learning EVERYWHERE 12
13 Data Lake in promotional brochure 13
14 Data Lake in production 14
15 Simple Flow in Reporting/BI systems 15
16 Let s use Spark. It s fast! 16
17 MapReduce vs Spark 17
18 MapReduce vs Spark 18
19 Simple Flow in Reporting/BI systems with Spark 19
20 Spark handles last year logs with ease 20
21 Where can we store events? 21
22 Let s use Cassandra to store events! 22
23 Let s use Cassandra to read events! 23
24 CREATE KEYSPACE myspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; USE test; Cassandra CREATE TABLE logs ( application TEXT, time TIMESTAMP, message TEXT, PRIMARY KEY (application, time)); 24
25 val dataset = sqlcontext.read.format("org.apache.spark.sql.cassandra") Cassandra to Spark.options(Map( "table" -> "logs", "keyspace" -> "myspace" )).load() dataset.filter("message = 'Log message'").show() 25
26 Simple Flow in Pre-Real-Time systems 26
27 Spark cluster over Cassandra Cluster 27
28 More events every second! 28
29 SENDING MESSAGES 29
30 Your Father s Messaging System 30
31 Your Father s Messaging System 31
32 Your Father s Messaging System InitialContext ctx = new InitialContext(); QueueConnectionFactory f = (QueueConnectionFactory)ctx.lookup( qcfactory"); QueueConnection con = f.createqueueconnection(); con.start(); 32
33 KAFKA 33
34 Kafka messaging system 34
35 Kafka messaging system distributed 35
36 Kafka messaging system distributed supports Publish-Subscribe model 36
37 Kafka messaging system distributed supports Publish-Subscribe model persists messages on disk 37
38 Kafka messaging system distributed supports Publish-Subscribe model persists messages on disk replicates within the cluster (integrated with Zookeeper) 38
39 The main benefits of Kafka Scalability with zero down time Zero data loss due to replication 39
40 Kafka Cluster consists of brokers (leader or follower) topics ( >= 1 partition) partitions partition offsets replicas of partition producers/consumers 40
41 Kafka Components with topic messages #1 Producer Thread #1 Producer Thread #2 Producer Thread #3 Data Data Data Topic: Messages Zookeeper 41
42 Kafka Components with topic messages #2 Producer Thread #1 Producer Thread #2 Producer Thread #3 Data Data Data Leader Part #1 Topic: Messages Leader Part #2 Part #1 Part #2 Broker #1 Zookeeper 42
43 Kafka Components with topic messages #3 Part #1 Producer Thread #1 Producer Thread #2 Data Data Part #1 Topic: Messages Part #2 Broker #1 Producer Thread #3 Data Part #2 Part #1 Broker #2 Part #2 Zookeeper 43
44 Why do we need Zookeeper? 44
45 Kafka Demo 45
46 REAL TIME WITH DSTREAMS 46
47 RDD Factory 47
48 From socket to console with DStreams 48
49 val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.sockettextstream("localhost", 9999) DStream val words = lines.flatmap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordcounts = pairs.reducebykey(_ + _) wordcounts.print() ssc.start() ssc.awaittermination() 49
50 val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.sockettextstream("localhost", 9999) DStream val words = lines.flatmap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordcounts = pairs.reducebykey(_ + _) wordcounts.print() ssc.start() ssc.awaittermination() 50
51 val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.sockettextstream("localhost", 9999) DStream val words = lines.flatmap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordcounts = pairs.reducebykey(_ + _) wordcounts.print() ssc.start() ssc.awaittermination() 51
52 val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.sockettextstream("localhost", 9999) DStream val words = lines.flatmap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordcounts = pairs.reducebykey(_ + _) wordcounts.print() ssc.start() ssc.awaittermination() 52
53 val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.sockettextstream("localhost", 9999) DStream val words = lines.flatmap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordcounts = pairs.reducebykey(_ + _) wordcounts.print() ssc.start() ssc.awaittermination() 53
54 Kafka as a main entry point for Spark 54
55 DStreams Demo 55
56 How to avoid DStreams with RDD-like API? 56
57 SPARK 2.2 DISCUSSION 57
58 Continuous Applications 58
59 Continuous Applications cases Updating data that will be served in real time Extract, transform and load (ETL) Creating a real-time version of an existing batch job Online machine learning 59
60 The main concept of Structured Streaming You can express your streaming computation the same way you would express a batch computation on static data. 60
61 // Read JSON once from S3 Batch Spark 2.2 logsdf = spark.read.json("s3://logs") // Transform with DataFrame API and save logsdf.select("user", "url", "date").write.parquet("s3://out") 61
62 // Read JSON continuously from S3 Real Time Spark 2.2 logsdf = spark.readstream.json("s3://logs") // Transform with DataFrame API and save logsdf.select("user", "url", "date").writestream.parquet("s3://out").start() 62
63 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 63
64 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 64
65 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 65
66 val lines = spark.readstream.format("socket").option("host", "localhost") WordCount from Socket.option("port", 9999).load() Don t forget to start Streaming val words = lines.as[string].flatmap(_.split(" ")) val wordcounts = words.groupby("value").count() 66
67 Unlimited Table 67
68 WordCount with Structured Streaming [Complete Mode] 68
69 Kafka -> Structured Streaming -> Console 69
70 Kafka To Console Demo 70
71 OPERATIONS 71
72 You can filter sort aggregate join foreach explain 72
73 Operators Demo 73
74 How it works? 74
75 Deep Diving in Spark Internals Dataset 75
76 Deep Diving in Spark Internals Dataset Logical Plan 76
77 Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan 77
78 Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Logical Plan Logical Plan Logical Physical Plan Plan 78
79 Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Planner Logical Plan Logical Plan Logical Physical Plan Plan Selected Physical Plan 79
80 Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Logical Plan Logical Plan Logical Physical Plan Plan Selected Physical Plan 80
81 Deep Diving in Spark Internals Dataset Logical Plan Optimized Logical Plan Incremental #1 Logical Plan Logical Plan Logical Physical Plan Plan Incremental #2 Selected Physical Plan Incremental #3 81
82 Incremental Execution: Planner polls Planner 82
83 Incremental Execution: Planner runs Planner R U N Offset: 1-5 Incremental #1 Count: 5 83
84 Incremental Execution: Planner runs #2 Planner R U N Offset: 1-5 Incremental #1 Count: 5 Offset: 6-9 Incremental #2 Count: 4 84
85 Aggregation with State Planner R U N Offset: 1-5 Incremental #1 Count: 5 5 Offset: 6-9 Incremental #2 Count: 5+4=9 85
86 DataSet.explain() == Physical Plan == Project [avg(price)#43,carat#45] +- SortMergeJoin [color#21], [color#47] :- Sort [color#21 ASC], false, 0 : +- TungstenExchange hashpartitioning(color#21,200), None : +- Project [avg(price)#43,color#21] : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=final,isdistinct=false)], output=[color#21,avg(price)#43]) : +- TungstenExchange hashpartitioning(cut#20,color#21,200), None : +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=partial,isdistinct=false)], output=[cut#20,color#21,sum#58,count#59l]) : +- Scan CsvRelation(-----) +- Sort [color#47 ASC], false, 0 +- TungstenExchange hashpartitioning(color#47,200), None +- ConvertToUnsafe +- Scan CsvRelation(----) 86
87 What s the difference between Complete and Append output modes? 87
88 COMPLETE, APPEND & UPDATE 88
89 There are two main modes and one in future append (default) 89
90 There are two main modes and one in future append (default) complete 90
91 There are two main modes and one in future append (default) complete update [in dreams] 91
92 Aggregation with watermarks 92
93 SOURCES & SINKS 93
94 Spark Streaming is a brick in the Big Data Wall 94
95 Let's save to Parquet files 95
96 Let's save to Parquet files 96
97 Let's save to Parquet files 97
98 File to Memory 98
99 Can we write to Kafka? 99
100 Nightly Build 100
101 Kafka-to-Kafka 101
102 J-K-S-K-S-C 102
103 Pipeline Demo 103
104 I didn t find sink/source for XXX 104
105 import org.apache.spark.sql.foreachwriter Console Foreach Sink val customwriter = new ForeachWriter[String] { override def open(partitionid: Long, version: Long) = true override def process(value: String) = println(value) override def close(errorornull: Throwable) = {} } stream.writestream.queryname( ForeachOnConsole").foreach(customWriter).start 105
106 Pinch of wisdom check checkpointlocation don t use MemoryStream think about GC pauses be careful about nighty builds use.groupby.count() instead count() use console sink instead.show() function 106
107 We have no ability join two streams work with update mode make full outer join take first N rows sort without pre-aggregation 107
108 Roadmap 2.2 Support other data sources (not only S3 + HDFS) Transactional updates Dataset is one DSL for all operations GraphFrames + Structured MLLib KafkaWriter TensorFrames 108
109 IN CONCLUSION 109
110 Final point Scalable Fault-Tolerant Real-Time Pipeline with Spark & Kafka is ready for usage 110
111 A few papers about Spark Streaming and Kafka Introduction in Spark + Kafka 111
112 Contacts Alexey_Zinovyev@epam.com Facebook: vk.com/big_data_russia Big Data Russia vk.com/java_jvm Java & JVM langs 112
113 Any questions? 113
Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM
Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts
More informationSpark Streaming. Big Data Analysis with Scala and Spark Heather Miller
Spark Streaming Big Data Analysis with Scala and Spark Heather Miller Where Spark Streaming fits in (1) Spark is focused on batching Processing large, already-collected batches of data. For example: Where
More informationSpark Streaming. Professor Sasu Tarkoma.
Spark Streaming 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Spark Streaming Spark extension of accepting and processing of streaming high-throughput live data streams Data is accepted from various sources
More informationStructured Streaming. Big Data Analysis with Scala and Spark Heather Miller
Structured Streaming Big Data Analysis with Scala and Spark Heather Miller Why Structured Streaming? DStreams were nice, but in the last session, aggregation operations like a simple word count quickly
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More informationWe are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info
We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423
More informationSpark Streaming and GraphX
Spark Streaming and GraphX Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Spark Streaming and GraphX June 30, 2016 1 / 1 Spark Streaming Amir H. Payberah (SICS) Spark Streaming
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations
More informationStreaming vs. batch processing
COSC 6339 Big Data Analytics Introduction to Spark (III) 2 nd homework assignment Edgar Gabriel Fall 2018 Streaming vs. batch processing Batch processing: Execution of a compute job without manual intervention
More informationBig Streaming Data Processing. How to Process Big Streaming Data 2016/10/11. Fraud detection in bank transactions. Anomalies in sensor data
Big Data Big Streaming Data Big Streaming Data Processing Fraud detection in bank transactions Anomalies in sensor data Cat videos in tweets How to Process Big Streaming Data Raw Data Streams Distributed
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationAn Overview of Apache Spark
An Overview of Apache Spark CIS 612 Sunnie Chung 2014 MapR Technologies 1 MapReduce Processing Model MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using
More informationBig Data Analytics using Apache Hadoop and Spark with Scala
Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important
More informationCS 398 ACC Streaming. Prof. Robert J. Brunner. Ben Congdon Tyler Kim
CS 398 ACC Streaming Prof. Robert J. Brunner Ben Congdon Tyler Kim MP3 How s it going? Final Autograder run: - Tonight ~9pm - Tomorrow ~3pm Due tomorrow at 11:59 pm. Latest Commit to the repo at the time
More informationData at the Speed of your Users
Data at the Speed of your Users Apache Cassandra and Spark for simple, distributed, near real-time stream processing. GOTO Copenhagen 2014 Rustam Aliyev Solution Architect at. @rstml Big Data? Photo: Flickr
More informationStarting with Apache Spark, Best Practices and Learning from the Field
Starting with Apache Spark, Best Practices and Learning from the Field Felix Cheung, Principal Engineer + Spark Committer Spark@Microsoft Best Practices Enterprise Solutions Resilient - Fault tolerant
More informationIndex. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI /
Index A ACID, 251 Actor model Akka installation, 44 Akka logos, 41 OOP vs. actors, 42 43 thread-based concurrency, 42 Agents server, 140, 251 Aggregation techniques materialized views, 216 probabilistic
More informationAgenda. Spark Platform Spark Core Spark Extensions Using Apache Spark
Agenda Spark Platform Spark Core Spark Extensions Using Apache Spark About me Vitalii Bondarenko Data Platform Competency Manager Eleks www.eleks.com 20 years in software development 9+ years of developing
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationApache Spark 2.0. Matei
Apache Spark 2.0 Matei Zaharia @matei_zaharia What is Apache Spark? Open source data processing engine for clusters Generalizes MapReduce model Rich set of APIs and libraries In Scala, Java, Python and
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationChapter 5: Stream Processing. Big Data Management and Analytics 193
Chapter 5: Big Data Management and Analytics 193 Today s Lesson Data Streams & Data Stream Management System Data Stream Models Insert-Only Insert-Delete Additive Streaming Methods Sliding Windows & Ageing
More informationData Acquisition. The reference Big Data stack
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The reference
More informationData Acquisition. The reference Big Data stack
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference
More informationA Tutorial on Apache Spark
A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:
More information1 Big Data Hadoop. 1. Introduction About this Course About Big Data Course Logistics Introductions
Big Data Hadoop Architect Online Training (Big Data Hadoop + Apache Spark & Scala+ MongoDB Developer And Administrator + Apache Cassandra + Impala Training + Apache Kafka + Apache Storm) 1 Big Data Hadoop
More informationApache Bahir Writing Applications using Apache Bahir
Apache Big Data Seville 2016 Apache Bahir Writing Applications using Apache Bahir Luciano Resende About Me Luciano Resende (lresende@apache.org) Architect and community liaison at Have been contributing
More informationApache Spark and Scala Certification Training
About Intellipaat Intellipaat is a fast-growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 600,000 in over
More informationWHAT S NEW IN SPARK 2.0: STRUCTURED STREAMING AND DATASETS
WHAT S NEW IN SPARK 2.0: STRUCTURED STREAMING AND DATASETS Andrew Ray StampedeCon 2016 Silicon Valley Data Science is a boutique consulting firm focused on transforming your business through data science
More informationHigher level data processing in Apache Spark
Higher level data processing in Apache Spark Pelle Jakovits 12 October, 2016, Tartu Outline Recall Apache Spark Spark DataFrames Introduction Creating and storing DataFrames DataFrame API functions SQL
More informationBig Data Architect.
Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationIntra-cluster Replication for Apache Kafka. Jun Rao
Intra-cluster Replication for Apache Kafka Jun Rao About myself Engineer at LinkedIn since 2010 Worked on Apache Kafka and Cassandra Database researcher at IBM Outline Overview of Kafka Kafka architecture
More informationLecture Notes to Big Data Management and Analytics Winter Term 2018/2019 Stream Processing
Lecture Notes to Big Data Management and Analytics Winter Term 2018/2019 Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur Schmid, Daniyal Kazempour, Julian
More informationApache Spark Tutorial
Apache Spark Tutorial Reynold Xin @rxin BOSS workshop at VLDB 2017 Apache Spark The most popular and de-facto framework for big data (science) APIs in SQL, R, Python, Scala, Java Support for SQL, ETL,
More informationSpark Streaming. Guido Salvaneschi
Spark Streaming Guido Salvaneschi 1 Spark Streaming Framework for large scale stream processing Scales to 100s of nodes Can achieve second scale latencies Integrates with Spark s batch and interactive
More informationAsanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks
Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data
More informationCloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationSparkStreaming. Large scale near- realtime stream processing. Tathagata Das (TD) UC Berkeley UC BERKELEY
SparkStreaming Large scale near- realtime stream processing Tathagata Das (TD) UC Berkeley UC BERKELEY Motivation Many important applications must process large data streams at second- scale latencies
More informationSpark Streaming: Hands-on Session A.A. 2017/18
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Spark Streaming: Hands-on Session A.A. 2017/18 Matteo Nardelli Laurea Magistrale in Ingegneria Informatica
More informationMODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale
More informationDistributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 22. Spark Paul Krzyzanowski Rutgers University Fall 2016 November 26, 2016 2015-2016 Paul Krzyzanowski 1 Apache Spark Goal: generalize MapReduce Similar shard-and-gather approach to
More informationBig Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture
Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem
More informationDelving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture
Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases
More informationAnalytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation
Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable
More informationThe Stream Processor as a Database. Ufuk
The Stream Processor as a Database Ufuk Celebi @iamuce Realtime Counts and Aggregates The (Classic) Use Case 2 (Real-)Time Series Statistics Stream of Events Real-time Statistics 3 The Architecture collect
More informationIMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES
IMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES by PRAMOD KUMAR GUDIPATI B.E., OSMANIA UNIVERSITY (OU), INDIA, 2012 A REPORT submitted in partial fulfillment of the requirements of the
More informationLet the data flow! Data Streaming & Messaging with Apache Kafka Frank Pientka. Materna GmbH
Let the data flow! Data Streaming & Messaging with Apache Kafka Frank Pientka Wer ist Frank Pientka? Dipl.-Informatiker (TH Karlsruhe) Verheiratet, 2 Töchter Principal Software Architect in Dortmund Fast
More informationUn'introduzione a Kafka Streams e KSQL and why they matter! ITOUG Tech Day Roma 1 Febbraio 2018
Un'introduzione a Kafka Streams e KSQL and why they matter! ITOUG Tech Day Roma 1 Febbraio 2018 R E T H I N K I N G Stream Processing with Apache Kafka Kafka the Streaming Data Platform 1.0 Enterprise
More informationSpark Overview. Professor Sasu Tarkoma.
Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationIn-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More informationOver the last few years, we have seen a disruption in the data management
JAYANT SHEKHAR AND AMANDEEP KHURANA Jayant is Principal Solutions Architect at Cloudera working with various large and small companies in various Verticals on their big data and data science use cases,
More information빅데이터기술개요 2016/8/20 ~ 9/3. 윤형기
빅데이터기술개요 2016/8/20 ~ 9/3 윤형기 (hky@openwith.net) D4 http://www.openwith.net 2 Hive http://www.openwith.net 3 What is Hive? 개념 a data warehouse infrastructure tool to process structured data in Hadoop. Hadoop
More informationSpark, Shark and Spark Streaming Introduction
Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References
More informationiway iway Big Data Integrator New Features Bulletin and Release Notes Version DN
iway iway Big Data Integrator New Features Bulletin and Release Notes Version 1.5.0 DN3502232.1216 Active Technologies, EDA, EDA/SQL, FIDEL, FOCUS, Information Builders, the Information Builders logo,
More informationKafka Streams: Hands-on Session A.A. 2017/18
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Kafka Streams: Hands-on Session A.A. 2017/18 Matteo Nardelli Laurea Magistrale in Ingegneria Informatica
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum
Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills
More informationKhadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016
Khadija Souissi Auf z Systems 07. 08. November 2016 @ IBM z Systems Mainframe Event 2016 Acknowledgements Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.
More informationTurning Relational Database Tables into Spark Data Sources
Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following
More informationEsper EQC. Horizontal Scale-Out for Complex Event Processing
Esper EQC Horizontal Scale-Out for Complex Event Processing Esper EQC - Introduction Esper query container (EQC) is the horizontal scale-out architecture for Complex Event Processing with Esper and EsperHA
More information2/26/2017. Originally developed at the University of California - Berkeley's AMPLab
Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Apache Spark Feb. 2, 2016 1 / 67 Big Data small data big data Amir H. Payberah (SICS) Apache Spark
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark Anastasios Skarlatidis @anskarl Software Engineer/Researcher IIT, NCSR "Demokritos" Outline Part I: Getting to know Spark Part II: Basic programming Part III: Spark under
More informationWHITE PAPER. Reference Guide for Deploying and Configuring Apache Kafka
WHITE PAPER Reference Guide for Deploying and Configuring Apache Kafka Revised: 02/2015 Table of Content 1. Introduction 3 2. Apache Kafka Technology Overview 3 3. Common Use Cases for Kafka 4 4. Deploying
More informationHadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)
Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:
More informationScalable Tools - Part I Introduction to Scalable Tools
Scalable Tools - Part I Introduction to Scalable Tools Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/mbds2018/ Scalable Tools session
More informationStreaming analytics better than batch - when and why? _Adam Kawa - Dawid Wysakowicz_
Streaming analytics better than batch - when and why? _Adam Kawa - Dawid Wysakowicz_ About Us At GetInData, we build custom Big Data solutions Hadoop, Flink, Spark, Kafka and more Our team is today represented
More informationAbout Codefrux While the current trends around the world are based on the internet, mobile and its applications, we try to make the most out of it. As for us, we are a well established IT professionals
More informationRESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING
RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,
More informationLightning Fast Cluster Computing. Michael Armbrust Reflections Projections 2015 Michast
Lightning Fast Cluster Computing Michael Armbrust - @michaelarmbrust Reflections Projections 2015 Michast What is Apache? 2 What is Apache? Fast and general computing engine for clusters created by students
More informationData Analytics Job Guarantee Program
Data Analytics Job Guarantee Program 1. INSTALLATION OF VMWARE 2. MYSQL DATABASE 3. CORE JAVA 1.1 Types of Variable 1.2 Types of Datatype 1.3 Types of Modifiers 1.4 Types of constructors 1.5 Introduction
More informationEPL660: Information Retrieval and Search Engines Lab 11
EPL660: Information Retrieval and Search Engines Lab 11 Παύλος Αντωνίου Γραφείο: B109, ΘΕΕ01 University of Cyprus Department of Computer Science Introduction to Apache Spark Fast and general engine for
More informationReactive App using Actor model & Apache Spark. Rahul Kumar Software
Reactive App using Actor model & Apache Spark Rahul Kumar Software Developer @rahul_kumar_aws About Sigmoid We build realtime & big data systems. OUR CUSTOMERS Agenda Big Data - Intro Distributed Application
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 9: Real-Time Data Analytics (1/2) March 27, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationApache Flink. Alessandro Margara
Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate
More informationLecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka
Lecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka What problem does Kafka solve? Provides a way to deliver updates about changes in state from one service to another
More informationStorm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter
Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter Storm at Twitter Twitter Web Analytics Before Storm Queues Workers Example (simplified) Example Workers schemify tweets and
More informationStorm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter
Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter Basic info Open sourced September 19th Implementation is 15,000 lines of code Used by over 25 companies >2700 watchers on Github
More informationThe Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou
The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationBig data systems 12/8/17
Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores
More informationA BIG DATA STREAMING RECIPE WHAT TO CONSIDER WHEN BUILDING A REAL TIME BIG DATA APPLICATION
A BIG DATA STREAMING RECIPE WHAT TO CONSIDER WHEN BUILDING A REAL TIME BIG DATA APPLICATION Konstantin Gregor / konstantin.gregor@tngtech.com ABOUT ME So ware developer for TNG in Munich Client in telecommunication
More informationBig Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours
Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals
More informationChapter 4: Apache Spark
Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,
More informationIntroduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data
Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction
More informationConquering Big Data with Apache Spark
Conquering Big Data with Apache Spark Ion Stoica November 1 st, 2015 UC BERKELEY The Berkeley AMPLab January 2011 2017 8 faculty > 50 students 3 software engineer team Organized for collaboration achines
More informationBigData and Map Reduce VITMAC03
BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to
More informationIntroduction to Kafka (and why you care)
Introduction to Kafka (and why you care) Richard Nikula VP, Product Development and Support Nastel Technologies, Inc. 2 Introduction Richard Nikula VP of Product Development and Support Involved in MQ
More informationDatabricks, an Introduction
Databricks, an Introduction Chuck Connell, Insight Digital Innovation Insight Presentation Speaker Bio Senior Data Architect at Insight Digital Innovation Focus on Azure big data services HDInsight/Hadoop,
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationSpark and distributed data processing
Stanford CS347 Guest Lecture Spark and distributed data processing Reynold Xin @rxin 2016-05-23 Who am I? Reynold Xin PMC member, Apache Spark Cofounder & Chief Architect, Databricks PhD on leave (ABD),
More informationDiscretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica UC BERKELEY Motivation Many important
More informationDistributed Data Analytics Stream Processing
G-3.1.09, Campus III Hasso Plattner Institut Types of Systems Services (online systems) Accept requests and send responses Performance measure: response time and availability Expected runtime: milliseconds
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More information