Streaming vs. batch processing
COSC 6339 Big Data Analytics
Introduction to Spark (III), 2nd homework assignment
Edgar Gabriel, Fall 2018

Streaming vs. batch processing
Batch processing:
- Execution of a compute job without manual intervention (non-interactive)
- Best suited for solving problems that are based on large but static data
Streaming:
- Continuous execution of an application to analyze an incoming stream of data
- Best suited for instant analysis of data, often with real-time constraints (low-latency processing)

Streaming reliability models
Every data item can be analyzed:
- At most once: messages may be lost and never delivered
- At least once: messages will never be lost but could be redelivered
- Exactly once: messages are never lost and never redelivered
[Figure: Stream Processing System]
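
The difference between at-least-once and exactly-once results can be made concrete with a small sketch. It assumes that every message carries a unique id (the slides do not say so) and that the processor remembers which ids it has already applied. The function names are made up for this illustration. This is one common way to get exactly-once results from an at-least-once source, not a description of how any particular system implements it.

```python
def process_at_least_once(deliveries):
    """Apply every delivery, so a redelivered message is counted twice."""
    total = 0
    for _msg_id, value in deliveries:
        total += value
    return total


def process_effectively_once(deliveries):
    """Skip message ids that were already applied (idempotent processing)."""
    seen = set()
    total = 0
    for msg_id, value in deliveries:
        if msg_id in seen:
            continue  # duplicate delivery, already accounted for
        seen.add(msg_id)
        total += value
    return total


# Message 2 is delivered twice, as an at-least-once source may do.
deliveries = [(1, 10), (2, 5), (2, 5), (3, 7)]
naive_total = process_at_least_once(deliveries)
deduplicated_total = process_effectively_once(deliveries)
```

With the sample input, the naive total includes the redelivered value a second time, while the deduplicated total counts each message id once.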

Spark streaming
- Extension of the core Spark API to enable streaming applications
- Runs a streaming computation as a series of very small batches (micro-batches)
- Supports exactly-once semantics

Spark Streaming
Discretized Streams (DStream)
- Core Spark streaming abstraction
- Based on micro-batches of RDDs
Input DStreams
- Represent the stream of raw data received from the streaming source
- Data can come from many sources (e.g. TCP sockets, Twitter, Flume, ...)
Operations
- Same as on regular Spark RDDs
- Some additional transformations (e.g. window-based transformations)
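
The micro-batch idea can be illustrated without Spark. The sketch below is a plain-Python analogy, not Spark's implementation: it groups timestamped events into consecutive batches of a fixed interval, which is roughly what a DStream presents to the application as a sequence of RDDs. The function name `discretize` and the event format are assumptions made for this example.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    last = max(batches) if batches else -1
    # Intervals without events still produce an (empty) batch.
    return [batches.get(i, []) for i in range(last + 1)]


events = [(0.1, "a"), (1.9, "b"), (2.0, "c"), (6.5, "d")]
micro_batches = discretize(events, batch_interval=2)
```

Every operation in a streaming job is then applied batch by batch, which is why the regular RDD operations carry over to DStreams.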

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="NetworkWordCount")
    # Create a local StreamingContext with a batch interval of 2 seconds
    ssc = StreamingContext(sc, 2)
    # Create a DStream that will connect to hostname:port
    lines = ssc.socketTextStream("localhost", 9999)
    # Split each line into words
    words = lines.flatMap(lambda line: line.split(" "))
    # Count each word in each batch
    pairs = words.map(lambda word: (word, 1))
    wordCounts = pairs.reduceByKey(lambda x, y: x + y)
    # Print the first elements of each batch
    wordCounts.pprint()
    ssc.start()             # Start the computation
    ssc.awaitTermination()  # Wait for the computation to terminate

Spark Streaming Windows
Windowed computations: apply transformations over a sliding window of data.
Any window operation needs to specify two parameters:
- window length: duration of the window
- sliding interval: interval at which the window operation is performed
reduceByWindow(func, windowLength, slideInterval)
- Returns a new single-element stream, created by aggregating elements in the stream over a sliding interval using func.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
- When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window.
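
To see how window length and sliding interval interact, here is a minimal plain-Python sketch that mimics the semantics of a keyed window aggregation over a list of micro-batches. It is an illustration only: the function name and the choice to measure both parameters in numbers of batches are assumptions of this example, whereas Spark measures them as durations that must be multiples of the batch interval.

```python
from collections import Counter


def count_by_key_and_window(batches, window_length, slide_interval):
    """Return one word-count dict per window.

    A window is emitted every `slide_interval` batches and covers the
    most recent `window_length` batches (fewer at the start of the stream).
    """
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        counts = Counter()
        for batch in batches[start:end]:
            counts.update(batch)
        results.append(dict(counts))
    return results


batches = [["a", "b"], ["a"], ["c", "a"], ["b"]]
windows = count_by_key_and_window(batches, window_length=3, slide_interval=2)
```

With a window of 3 batches and a slide of 2, consecutive windows overlap by one batch, so the second batch contributes to both results. In PySpark's DStream API the corresponding call, as far as I know, takes an inverse reduce function as its second argument (which may be None), so check the API documentation of the Spark version in use before copying the signature from the slide.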

Streaming and Checkpointing
A streaming application must be resilient to failures unrelated to the application logic (e.g. system failures, JVM crashes). It needs to checkpoint enough information to a reliable storage system to recover from failures.
Metadata checkpointing: saving the information that defines the streaming computation. Metadata includes:
- Configuration: the configuration that was used to create the streaming application.
- DStream operations: the set of DStream operations that define the streaming application.
- Incomplete batches: batches whose jobs are queued but have not completed yet.
Data checkpointing: saving the generated RDDs to reliable storage. This is necessary for stateful transformations that combine data across multiple batches.

Spark Streaming
Other considerations:
- Executors must be configured with sufficient memory to hold the received data; the amount depends on the time interval!
- Configure automatic restart of the application driver to recover from a driver failure.
- Write-ahead logs: all data received from a receiver can be written into a write-ahead log in the configured checkpoint directory. This prevents data loss on driver recovery.
- Setting the max receiving rate: receivers can be rate limited by setting a maximum rate in records/sec.
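
The role of data checkpointing for stateful computations can be sketched in plain Python. The example keeps a running word count across batches, saves the state after every batch, and resumes from the saved state after a simulated crash. It assumes that the input batches can be replayed from the source and uses a local JSON file as a stand-in for reliable storage. All names are hypothetical, and Spark's own checkpointing mechanism works differently in detail.

```python
import json
import os
import tempfile
from collections import Counter


def load_checkpoint(path):
    """Return (next_batch_index, counts) from the checkpoint, or a fresh state."""
    if not os.path.exists(path):
        return 0, Counter()
    with open(path) as f:
        saved = json.load(f)
    return saved["next_batch"], Counter(saved["counts"])


def save_checkpoint(path, next_batch, counts):
    """Write to a temporary file and rename it, so the checkpoint is never half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_batch": next_batch, "counts": counts}, f)
    os.replace(tmp, path)


def run(batches, path, crash_before=None):
    """Process batches starting at the last checkpoint; optionally stop early to mimic a crash."""
    next_batch, counts = load_checkpoint(path)
    for i in range(next_batch, len(batches)):
        if i == crash_before:
            return None  # simulated failure: state in memory is lost
        counts.update(batches[i])
        save_checkpoint(path, i + 1, counts)
    return dict(counts)


batches = [["a", "b"], ["a"], ["b", "b"]]
checkpoint = os.path.join(tempfile.mkdtemp(), "state.json")
first_run = run(batches, checkpoint, crash_before=2)  # stops after two batches
recovered = run(batches, checkpoint)                  # resumes at batch index 2
```

Because the batch index is stored together with the counts, the restarted run neither skips nor re-applies a batch.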

Data Source Reliability
- A data source is considered unreliable if there is no means to replay a previously received message.
- A data source is considered reliable if it can somehow replay a message if processing fails at any point in time.
- A data source is considered durable if it can replay any message or set of messages given a selection criterion.

2nd Homework Rules
Each student should deliver:
- Source code (.py files), compressed to a zip or tar.gz file. The source code has to use Python and Spark.
- Documentation (.pdf, .doc, or .txt file) with explanations of the code and answers to the questions.
Deliver electronically on Blackboard.
Expected by Sunday, October 21, 11:59 pm.
In case of questions: ask, ask, ask!

Duplicate detection
Discovery of multiple representations of the same real-world object.
Problems:
- Representations are not identical (fuzzy duplicates)
- Data sets are large: quadratic complexity if comparing every pair of records
Approaches:
- Similarity measures: domain-dependent vs. domain-independent solutions
- Avoid comparisons by partitioning
- Parallel computing
(Slide based on a lecture by Felix Naumann, University of Potsdam)

Origins of duplicates
(Slide based on a lecture by Felix Naumann, University of Potsdam)

Ironically, Duplicate Detection has many Duplicates
(Slide based on a lecture by Felix Naumann, University of Potsdam)

Part 1:
a. Write PySpark code to determine the 1,000 most popular words in the document collection provided. Please ensure that your code removes any special symbols and converts everything to lower case (and, if possible, removes stop words, applies stemming, etc.). Stop word removal can be done either by using NLTK or by creating your own list of words to be removed (e.g. and, or, the, it, ...).
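
The cleaning steps described in Part 1 can be sketched in plain Python before porting them to Spark. The sketch keeps only runs of the letters a-z, which is one possible reading of "removes any special symbols", and uses a tiny hand-made stop word list taken from the examples above. Both choices and all function names are assumptions of this illustration, not requirements of the assignment.

```python
import re
from collections import Counter

# Hypothetical stop word list built from the examples on the slide.
STOP_WORDS = {"and", "or", "the", "it"}


def tokenize(text):
    """Lower-case the text, keep only runs of letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def top_words(documents, k):
    """Return the k most frequent words across all documents as (word, count) pairs."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts.most_common(k)


docs = {"doc1": "The cat and the hat!", "doc2": "A cat, a CAT... or it?"}
top = top_words(docs, 2)
```

In a Spark version, `tokenize` would typically be applied inside a `flatMap`, the counting would become `map` plus `reduceByKey`, and the final selection could use an action such as `takeOrdered`.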

Part 2:
a. Write PySpark code to create an inverted index for the 1,000 words determined in Part 1. The inverted index is supposed to be of the form
   term1: doc1:weight_{1,1}, doc2:weight_{1,2}, doc3:weight_{1,3}, ...
   term2: doc1:weight_{2,1}, doc2:weight_{2,2}, doc3:weight_{2,3}, ...
   where weight_{x,y} = (number of occurrences of term x in document y) / (total number of words in document y)
b. Measure the execution time of the code for the large data set for 5, 10, and 15 executors.
Notes: revisit the Advanced MapReduce lecture for the inverted index.

Part 3:
a. Write PySpark code to calculate the similarity matrix S, with each entry of S being
   S(doc_x, doc_y) = \sum_{t \in V} weight_{t,doc_x} \cdot weight_{t,doc_y}
   with V being the vocabulary (determined in Part 1) and the weights having been determined in Part 2.
b. Measure the execution time for the large data set for 5, 10, and 15 executors.
Notes: see the following paper for the full algorithm: Tamer Elsayed, Jimmy Lin, and Douglas W. Oard, "Pairwise document similarity in large collections with MapReduce".
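
The data flow of Parts 2 and 3 can be tried out on a toy collection in plain Python. The sketch below builds the inverted index with weights defined as term count divided by document length, then accumulates the similarity of each document pair one postings list at a time, so that only documents sharing a term are ever compared. Function names and the sample documents are made up for this illustration, and the code is a single-machine sketch rather than the PySpark solution the assignment asks for.

```python
from collections import Counter, defaultdict
from itertools import combinations


def inverted_index(docs, vocabulary):
    """Map each vocabulary term to {doc_id: weight}, weight = term count / document length."""
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        counts = Counter(words)
        for term in vocabulary:
            if counts[term]:
                index[term][doc_id] = counts[term] / len(words)
    return dict(index)


def similarity(index):
    """Sum the weight products per document pair, one postings list at a time."""
    sim = defaultdict(float)
    for postings in index.values():
        # Sorting makes the pair key (dx, dy) canonical, with dx < dy.
        for (dx, wx), (dy, wy) in combinations(sorted(postings.items()), 2):
            sim[(dx, dy)] += wx * wy
    return dict(sim)


docs = {
    "doc1": ["spark", "stream", "spark", "batch"],
    "doc2": ["spark", "batch"],
    "doc3": ["stream", "stream"],
}
index = inverted_index(docs, vocabulary={"spark", "stream", "batch"})
sim = similarity(index)
```

Pairs that share no vocabulary term, such as doc2 and doc3 here, never appear in the result, which corresponds to a similarity of zero.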

Part 4:
Provide a list of the 10 most similar (or identical) pairs of documents from the large data set.

Input files
- Small input (22 short documents) available in HDFS in /cosc6339_hw2/gutenberg-22/
- The small data set contains 2 pairs of documents that are identical and 2 pairs of documents that have only very minor differences. Your code should be able to identify them!
- Large data set available in HDFS in /cosc6339_hw2/large-dataset/
- Only use the large input after you have confirmed that your code runs correctly with the small input.
- The large data set is not there yet, but will be within the next 48 hours.
Output: remember to write your results to the /bigdxy directory, not directly to /
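
Selecting the most similar pairs in Part 4 does not require sorting the full similarity matrix. A minimal sketch, assuming the similarities are available as a dictionary keyed by document pair (the dictionary layout and the function name are assumptions of this example):

```python
import heapq


def most_similar_pairs(similarities, k):
    """Return the k (pair, score) entries with the highest score."""
    return heapq.nlargest(k, similarities.items(), key=lambda item: item[1])


similarities = {
    ("doc1", "doc2"): 0.375,
    ("doc1", "doc3"): 0.25,
    ("doc2", "doc4"): 1.0,
}
top_pairs = most_similar_pairs(similarities, 2)
```

On an RDD of (pair, score) records, an action such as `takeOrdered` with a negated score as key plays the same role.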

Documentation
The documentation should contain:
- (Brief) problem description
- Solution strategy
- Description of how to run your code
- Results section:
  - Description of resources used
  - Description of measurements performed
  - Results (graphs/tables + findings)
The document should not contain:
- Replication of the entire source code (that's why you have to deliver the sources)
- Screenshots of every single measurement you made. Actually, no screenshots at all.
- The output files

Additional resources
- Python:
- Spark:
- Whale cluster:
More informationLecture 4, 04/08/2015. Scribed by Eric Lax, Andreas Santucci, Charles Zheng.
CME 323: Distributed Algorithms and Optimization, Spring 2015 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Databricks and Stanford. Lecture 4, 04/08/2015. Scribed by Eric Lax, Andreas Santucci,
More informationStreaming analytics better than batch - when and why? _Adam Kawa - Dawid Wysakowicz_
Streaming analytics better than batch - when and why? _Adam Kawa - Dawid Wysakowicz_ About Us At GetInData, we build custom Big Data solutions Hadoop, Flink, Spark, Kafka and more Our team is today represented
More informationOracle Big Data Fundamentals Ed 2
Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies
More informationReal Time Recommendations using Spark Streaming. Elliot Chow
Real Time Recommendations using Spark Streaming Elliot Chow Why? - React more quickly to changes in interest - Time-of-day effects - Real-world events Feedback Loop UI Recommendation Systems Data Systems
More informationQunar Performs Real-Time Data Analytics up to 300x Faster with Alluxio
CASE STUDY Qunar Performs Real-Time Data Analytics up to 300x Faster with Alluxio Xueyan Li, Lei Xu, and Xiaoxu Lv Software Engineers at Qunar At Qunar, we have been running Alluxio in production for over
More informationDiscretized Streams: Fault-Tolerant Streaming Computation at Scale
Discretized Streams: Fault-Tolerant Streaming Computation at Scale Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica University of California, Berkeley Abstract Many big
More informationAbout the Tutorial. Audience. Prerequisites. Copyright and Disclaimer. PySpark
About the Tutorial Apache Spark is written in Scala programming language. To support Python with Spark, Apache Spark community released a tool, PySpark. Using PySpark, you can work with RDDs in Python
More informationApache Spark 2 X Cookbook Cloud Ready Recipes For Analytics And Data Science
Apache Spark 2 X Cookbook Cloud Ready Recipes For Analytics And Data Science We have made it easy for you to find a PDF Ebooks without any digging. And by having access to our ebooks online or by storing
More informationIBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics
IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that
More information