Applied Spark: From Concepts to Bitcoin Analytics. Andrew F. Hart.
1 Applied Spark: From Concepts to Bitcoin Analytics. Andrew F. Hart. (QCON-SP, 3/28/16)
2 My Day Job: CTO, Pogoseat. We upgrade technology for live events.
3 Additionally: Member, Apache Software Foundation.
4 Additionally: Founder, Data Fluency, a software consultancy specializing in appropriate data solutions for startups and SMBs. We help clients make good decisions and leverage power traditionally accessible only to big business.
5 Previously: NASA Jet Propulsion Laboratory, building data management pipelines for research missions in many domains (climate, cancer, Mars, radio astronomy, etc.).
6 Apache Spark
7 Spark is general-purpose cluster computing software that maximizes use of cluster memory to process data. It is used by hundreds of organizations to realize performance gains over previous-generation cluster compute platforms, particularly for MapReduce-style problems.
8 Spark was developed at the Algorithms, Machines, and People Lab (AMPLab) at the University of California, Berkeley in 2009. It is open source software presently under the governance of the Apache Software Foundation.
9 Why Spark Exists: a confluence of three trends: increased volume of digital data, decreasing cost of computer memory (RAM), and data processing technology liberation.
10 Digital data volume, early days: low-resolution sensors and comparatively few people on the internet; proprietary data, custom solutions, and expensive custom hardware.
11 Digital data volume, modern era: ubiquitous high-resolution cameras, mobile devices packed with sensors, the Internet of Things; open source software and cheap commodity hardware.
12 We've gone from this [image: the early Internet]
13 To this: humanity's global communication platform.
14 1.5 billion connected PCs. 3.2 billion connected people. 6 billion connected mobile devices.
15 What are we doing with all of this? Every minute: 18,000 votes cast on Reddit; 51,000 apps downloaded by Apple users; 350,000 tweets posted on Twitter; 4,100,000 likes recorded on Facebook.
16 We are awash in data. Monetizing this data is a core competency for many businesses, and we need tools to do this effectively at today's scale.
17 2. Tool support and technology liberation
18 How far do you want to go back? We have always used tools to help us cope with data.
19 VisiCalc: an early "big data" tool. It allowed business to move from the chalkboard to the digital spreadsheet, a phenomenal increase in productivity for running the numbers of a business.
20 Modern-era spreadsheet tech: Microsoft Excel supports 1,048,576 rows x 16,384 columns.
21 Open source alternatives exist: Apache OpenOffice supports 1,048,576 rows x 1,024 columns (vs. Excel's 1,048,576 rows x 16,384 columns).
22 Relational database systems support thousands of tables and millions of rows.
23 Viable open source alternatives to relational databases exist for many use cases.
24 Modern big data era: the MapReduce algorithm (2004) parallelizes large-scale computation across clusters of servers.
25 Modern big data era: Hadoop, an open source processing framework that lets MapReduce applications run on large clusters of commodity (unreliable) hardware.
26 3. Commoditization of computer memory
27 Early days: main memory was hand-made. You could see each bit.
28 Modern era: an AWS EC2 r3.8xlarge offers 244 GiB for US$1.41/hr*
29 Why talk about memory? The ability to use memory efficiently distinguishes Spark from Hadoop and contributes to its speed advantages in many scenarios.
30 How Spark Works
31 Primary abstraction in Spark: the Resilient Distributed Dataset (RDD), an immutable (read-only), partitioned dataset that is processed in parallel on each cluster node and is fault-tolerant: resilient to node failure.
32 The RDD uses the distributed memory of the cluster to store the state of a computation as a sharable object across jobs (instead of serializing it to disk).
33 Traditional MapReduce: data on disk -> HDFS read -> Map-1...Map-n -> HDFS write (tuples on disk) -> HDFS read -> Reduce-1...Reduce-n -> HDFS write (tuples on disk). Every stage round-trips through disk.
34 Spark RDD architecture: data on disk -> HDFS read -> Map-1...Map-n -> RDD in cluster memory -> Reduce-1...Reduce-n -> HDFS write. Intermediate results stay in cluster memory instead of going back to disk.
35 Unified computational model: Spark unifies the batch and streaming models, which traditionally require different architectures. It is somewhat like the notion of a limit in calculus: if you imagine a time series of RDDs with ever-smaller windows of time, you can approximate streaming workflows.
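The micro-batch idea can be sketched without Spark at all. A minimal plain-Python illustration (the event format and window size are illustrative assumptions, not from the talk): slice a timestamped stream into fixed windows and aggregate each window independently, as if each window were its own small RDD.

```python
from collections import defaultdict

def micro_batches(events, window_secs):
    """Group (timestamp, value) events into fixed-size time windows,
    mimicking how a stream can be treated as a series of small batches."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[ts // window_secs].append(value)  # window index for this event
    # Aggregate each window independently, like one small batch job per window
    return {w: sum(vals) for w, vals in sorted(batches.items())}

events = [(0, 1.0), (3, 2.0), (5, 4.0), (9, 8.0), (11, 16.0)]
print(micro_batches(events, 5))  # {0: 3.0, 1: 12.0, 2: 16.0}
```

Shrinking `window_secs` toward zero is the "limit" intuition from the slide: ever-smaller batches approximate a continuous stream.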
36 Two ways to create an RDD in Spark programs: "parallelize" an existing collection (e.g., a Python list) in the driver program, or reference a dataset on external storage: text files on disk, or anything with a supported Hadoop InputFormat.
37 RDDs from RDDs: RDDs are immutable (read-only), so the result of applying a transformation to an RDD is a new RDD. RDDs can be persisted to memory, reducing the need to re-compute an RDD every time it is used.
38 Writing Spark programs: think of a Spark program as a way to describe a sequence of transformations and actions that should be applied to an RDD.
39 Writing Spark programs: transformations create a new dataset from an existing one (e.g., map); actions return a value after running a computation (e.g., reduce).
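The distinction mirrors Python's own built-ins, which makes for a convenient local illustration (plain Python, not Spark): map and filter yield a new collection while leaving the input untouched, and reduce collapses a collection to a single value.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# "Transformations": each step produces a new dataset; `numbers` is unchanged
squared = list(map(lambda x: x * x, numbers))        # [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, squared))  # [4, 16]

# "Action": collapses the dataset to a single value
total = reduce(lambda a, b: a + b, evens)
print(total)  # 20
```

Spark's RDD API keeps the same shape, except that each step is distributed across the partitions of the cluster.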
40 Writing Spark programs: Spark provides a rich set of transformations: map, flatMap, filter, sample, union, intersection, distinct, groupByKey, sortByKey, cogroup, pipe, join, cartesian, coalesce.
41 Writing Spark programs: Spark provides a rich set of actions: reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach.
42 Writing Spark programs: transformations are lazily evaluated; they are only computed when a subsequent action (which must return a result) is run. Transformations are recomputed each time an action runs, unless you explicitly persist the resulting RDD to memory.
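Python generators give a local analogy for this laziness (a sketch, not Spark itself): building the pipeline does no work, and nothing runs until a terminal operation consumes it.

```python
calls = []

def expensive(x):
    calls.append(x)  # record when the work actually happens
    return x * 10

data = [1, 2, 3]
pipeline = (expensive(x) for x in data)  # "transformation": nothing runs yet
assert calls == []                       # lazy: no element processed so far

result = sum(pipeline)                   # "action": forces evaluation
print(result)  # 60
```

As with an unpersisted RDD, re-consuming the pipeline would mean re-running `expensive` for every element; persisting is the Spark-side escape hatch the slide mentions.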
43 Structure of Spark programs: Spark programs have two principal components: a driver program and a worker function.
44 Driver program: executes on the master node and establishes the context.
45 Worker (processing) function: executes on each worker node and computes transformations and actions on RDD partitions.
46 SparkContext: holds all of the information about the cluster, manages what gets shipped to the nodes, and makes life extremely easy for developers.
47 Shared variables: broadcast variables efficiently share static data with the cluster nodes; accumulators are write-only variables that serve as counters.
49 Interacting with Spark
50 Spark APIs: Spark provides APIs for several languages: Scala, Java, Python, R, and SQL.
51 Using Spark: there are two main ways to leverage Spark: interactively, through the included command-line interface, or programmatically, via standalone programs submitted as jobs to the master node.
52 Using Spark: Spark is GREAT for experimenting. Run experimental programs on small sample datasets on a single machine; to scale up, simply re-target the SparkContext to point at the master node of a Spark cluster.
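As a hedged sketch of what those two modes look like on the command line (the host name and file name are placeholders, not from the talk):

```shell
# Interactive: launch the Python Spark shell with a local master
# (a SparkContext named `sc` is provided automatically)
pyspark --master "local[*]"

# Programmatic: submit a standalone program to a cluster's master node;
# re-targeting is just a change of the --master URL
spark-submit --master spark://master-host:7077 my_analysis.py
```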
53 Bitcoin
54 Bitcoin is a decentralized digital currency: value is exchanged directly between individuals, with no need for a traditional central authority (e.g., a bank).
55 Bitcoin is also a global payment network: transactions are broadcast and verified by network peers, each using a complete copy of the network transaction history (the "blockchain").
56 Bitcoin protocol: open source software for processing peer-to-peer financial transactions with zero trust; the backbone of an experimental financial system secured by math (instead of by a trusted authority).
57 Bitcoin: open data. Unprecedented transparency into the workings of a financial network: peer-to-peer payments made using Bitcoin, and trading markets speculating on Bitcoin's value. Global, 24x7, free. Interesting research datasets abound.
58 Bitcoin exchange companies
59 [Image: Bitcoin exchange logos]
62 Spark Demos
63 The hardware: a standalone cluster of three machines (one of them the master node), each with 6 CPU cores and 8 GB RAM.
64 The goal: demonstrate experimenting with the Spark CLI, and demonstrate program execution on a standalone Spark cluster.
65 The data: one month of log files containing transactions from 10 global exchanges (in 3 currencies); one trade per line, JSON-encoded; organized into files by exchange, currency, and date.
66 The question: what is the cumulative value of all "buy"-type transactions in the dataset? Bonus: bucketed by fiat currency.
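A local sketch of the target computation in plain Python (the field names `type`, `currency`, `price`, and `amount` are assumptions about the log format, which the transcript does not show): parse each JSON line, keep the buys, and accumulate value per fiat currency. On a cluster, the same shape becomes filter, map, and reduceByKey over an RDD of text lines.

```python
import json
from collections import defaultdict

def buy_value_by_currency(lines):
    """Cumulative value of "buy"-type trades, bucketed by fiat currency.
    Field names are assumed for illustration."""
    totals = defaultdict(float)
    for line in lines:
        trade = json.loads(line)
        if trade["type"] == "buy":  # filter: keep buy-type transactions
            # map to (currency, value), then reduce by key via accumulation
            totals[trade["currency"]] += trade["price"] * trade["amount"]
    return dict(totals)

sample = [
    '{"type": "buy",  "currency": "USD", "price": 415.0, "amount": 2.0}',
    '{"type": "sell", "currency": "USD", "price": 416.0, "amount": 1.0}',
    '{"type": "buy",  "currency": "EUR", "price": 370.0, "amount": 0.5}',
]
print(buy_value_by_currency(sample))  # {'USD': 830.0, 'EUR': 185.0}
```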
68 Wrap-up
69 Things to keep in mind when using Spark: speed comes from avoiding writes to disk. Allocate enough memory in your cluster to hold your data completely in memory; once the data fits, adding more memory is not going to boost performance.
70 Once the data fits in memory, most apps are CPU- or network-bound. Allocate more cores in your cluster to increase parallelism (and tune Spark to use them!), and have disks around to handle spillover, configured to reduce unnecessary writes.
71 [Image credit: Spark ecosystem]
72 [Image credit: Spark ecosystem (BDAS Stack)]
73 Spark at the ASF: entered the Incubator in 2013 and graduated to a "Top Level" project; committers and 800+ contributors; outstanding documentation; community mailing lists; updated project status.
74 Thank you!
Apache Spark 2 X Cookbook Cloud Ready Recipes For Analytics And Data Science We have made it easy for you to find a PDF Ebooks without any digging. And by having access to our ebooks online or by storing
More informationTwitter data Analytics using Distributed Computing
Twitter data Analytics using Distributed Computing Uma Narayanan Athrira Unnikrishnan Dr. Varghese Paul Dr. Shelbi Joseph Research Scholar M.tech Student Professor Assistant Professor Dept. of IT, SOE
More informationShark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker
Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha
More informationBringing Data to Life
Bringing Data to Life Data management and Visualization Techniques Benika Hall Rob Harrison Corporate Model Risk March 16, 2018 Introduction Benika Hall Analytic Consultant Wells Fargo - Corporate Model
More informationSpark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM
Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts
More informationStream Processing on IoT Devices using Calvin Framework
Stream Processing on IoT Devices using Calvin Framework by Ameya Nayak A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science Supervised
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationIntroduction to Apache Spark. Patrick Wendell - Databricks
Introduction to Apache Spark Patrick Wendell - Databricks What is Spark? Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop Efficient General execution graphs In-memory storage
More informationTurning Relational Database Tables into Spark Data Sources
Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following
More informationThe Datacenter Needs an Operating System
UC BERKELEY The Datacenter Needs an Operating System Anthony D. Joseph LASER Summer School September 2013 My Talks at LASER 2013 1. AMP Lab introduction 2. The Datacenter Needs an Operating System 3. Mesos,
More informationFluentd + MongoDB + Spark = Awesome Sauce
Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision
More informationHadoop. Introduction / Overview
Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures
More informationSpark Streaming. Guido Salvaneschi
Spark Streaming Guido Salvaneschi 1 Spark Streaming Framework for large scale stream processing Scales to 100s of nodes Can achieve second scale latencies Integrates with Spark s batch and interactive
More informationNew Developments in Spark
New Developments in Spark And Rethinking APIs for Big Data Matei Zaharia and many others What is Spark? Unified computing engine for big data apps > Batch, streaming and interactive Collection of high-level
More informationUNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017
UNIFY DATA AT MEMORY SPEED Haoyuan (HY) Li, CEO @ Alluxio Inc. VAULT Conference 2017 March 2017 HISTORY Started at UC Berkeley AMPLab In Summer 2012 Originally named as Tachyon Rebranded to Alluxio in
More informationTHE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Apache Spark Lorenzo Di Gaetano THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION What is Apache Spark? A general purpose framework for big data processing It interfaces
More informationBig Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018
Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/
More informationDistributed systems for stream processing
Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall Alena Hall Large-scale data processing Distributed Systems Functional Programming Data Science & Machine
More informationMATLAB. Senior Application Engineer The MathWorks Korea The MathWorks, Inc. 2
1 Senior Application Engineer The MathWorks Korea 2017 The MathWorks, Inc. 2 Data Analytics Workflow Business Systems Smart Connected Systems Data Acquisition Engineering, Scientific, and Field Business
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More information2/26/2017. RDDs. RDDs are the primary abstraction in Spark RDDs are distributed collections of objects spread across the nodes of a clusters
are the primary abstraction in Spark are distributed collections of objects spread across the nodes of a clusters They are split in partitions Each node of the cluster that is used to run an application
More informationBig Data Analytics at OSC
Big Data Analytics at OSC 04/05/2018 SUG Shameema Oottikkal Data Application Engineer Ohio SuperComputer Center email:soottikkal@osc.edu 1 Data Analytics at OSC Introduction: Data Analytical nodes OSC
More information