Cloudera Improvements in Apache Spark
|
|
- Rosamond Gilbert
- 5 years ago
- Views:
Transcription
1 Cloudera Improvements in Apache Spark Brian Baillod Sales Engineer 1
2 Agenda Spark One PlaCorm Spark Overview and Improvements Spark Proof of Concept Kudu and Record Service 2
3 Cloudera company snapshot Founded 2008, by former employees of Employees Today 900+ worldwide World Class Support More than 75 24x7 global staff Cloudera University Over 40,000 trained We help code Hadoop Cloudera employees are leading developers & contributors to the complete Apache Hadoop ecosystem of projects We help fix Hadoop Cloudera fixed 60% of all Hadoop JIRA bugs 3
4 Hadoop Categories of Hadoop adop/on Free/Developer Over 2.5 million downloads Subscrip/on Training 60% of Fortune 100 a`ended Cloudera training, over 40,000 trained since 2009 Business Need Training Free/ Developer Services & Support Big Data Maturity Service & Support 9/10 for support ability to solve technical issues #1 recommenda/on Over 2x revenue of nearest 90% renewal rate 4
5 What is Spark Fast general purpose processing engine for large data Provides API s in Java, Scala and Python Includes an advanced DAG execu@on engine that supports in- memory compu@ng Includes high level tools like SparkSQL, Mllib, GraphX, and Spark Streaming Can run in a cluster, standalone, or local Latest version is Spark.apache.org LOTS of momentum 5
6 Cloudera One PlaCorm Cloudera is doubling down on Spark Outlining a vision for the future Kudu, Record Service, Auto- tuning, Security, Kaia integra@on Challenging other vendors to par@cipate in Spark Development 6
7 Cloudera s Engineering Commitment to Spark Spark CommiPers by Hadoop Distribu/on* Spark Patches by Hadoop Distribu/on Hortonworks 17% Intel 17% Cloudera 67% Cloudera, 370 Hortonworks, 4 IBM, 12 MapR, 1 Intel, 400 * IBM and MapR have 0 commipers 7
8 Spark will replace MapReduce To become the standard engine for Hadoop 8
9 The Future of Data Processing on Hadoop Spark complemented by specialized fit- for- purpose engines General Data Processing w/ Spark Fast Batch Processing, Machine Learning, and Stream Processing Full- Text Search w/solr Querying textual data Analy/c Database w/ Impala Low- Latency Massively Concurrent Queries On- Disk Processing w/mapreduce Jobs at extreme scale and extremely disk IO intensive Shared: Data Storage Metadata Resource Management Security Governance 9
10 Why is Cloudera leading this Cloudera was the first Hadoop vendor to ship and support Spark Spark is a fully integrated part of Cloudera s placorm Shared data, metadata, resource management, administra@on, security, and governance Cloudera is the first Hadoop vendor to offer Spark training Trained more customers than any other vendor Cloudera has more Spark customers in produc@on than all other companies combined 10
11 Spark Overview and Improvements 11
12 Apache Spark Flexible, in- memory data processing for Hadoop Easy Development Rich APIs for Scala, Java, and Python shell Flexible Extensible API APIs for different types of workloads: Batch Streaming Machine Learning Graph Fast Batch & Stream Processing In- Memory processing and caching 12
13 Easy Development High Language Support Python lines = sc.textfile(...) lines.filter(lambda s: ERROR in s).count() Scala val lines = sc.textfile(...) lines.filter(s => s.contains( ERROR )).count() Java JavaRDD<String> lines = sc.textfile(...); lines.filter(new Function<String, Boolean>() { Boolean call(string s) { return s.contains( error ); } }).count(); Na@ve support for mul@ple languages with iden@cal APIs Scala, Java, Python Use of closures, itera@ons, and other common language constructs to minimize code 2-5x less code 13
14 Python Or Scala? Use Python for prototyping Spark Python API is slower than Scala Use Scala for development Steep learning curve for programming 14
15 Easy Development Use percolateur:spark srowen$./bin/spark-shell --master local[*]... Welcome to / / / / \ \/ _ \/ _ `/ / '_/ / /. /\_,_/_/ /_/\_\ version SNAPSHOT /_/ Using Scala version (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51) Type in expressions to have them evaluated. Type :help for more information.... scala> val words = sc.textfile("file:/usr/share/dict/words")... words: org.apache.spark.rdd.rdd[string] = MapPartitionsRDD[1] at textfile at <console>:21 Interac@ve explora@on of data for data scien@sts No need to develop applica@ons Developers can prototype applica@on on live system scala> words.count... res0: Long = scala> 15
16 Easy Development Expressive API map filter groupby sort union join leftouterjoin rightouterjoin reduce count fold reducebykey groupbykey cogroup cross zip sample take first partitionby mapwith pipe save 16
17 Memory Management for Greater Performance Memory can be enabler for high performance big data applica/ons GB RAM 50 GB per second Trends: ½ price every 18 months 2x bandwidth every 3 years 16 cores 17
18 Spark Concepts RDD Resilient Distributed Dataset Caching DataFrames Spark Streaming SparkSQL Pluggable Spark 18
19 Resilient Distributed Dataset (RDD) Read- only of records Created through: of data in storage of RDDs Contains lineage to compute from storage Lazy Users control persistence and 19
20 RDD Transforma/ons create new RDD from an one Ac/ons run on RDD and return a value Transforma@ons are lazy Ac@ons materialize RDDs by compu@ng transforma@ons RDDs can be cached to avoid re- compu@ng 20
21 Example Transforma/ons Map Filter Sample Join Ac/ons Reduce Count First, Take SaveAs 21
22 Fault- Tolerance RDDs contain lineage Lineage: Source and list of Lost can be re- computed from source data msgs = textfile.filter(lambda s: s.startswith( ERROR )).map(lambda s: s.split( \t )[2]) HDFS File Filtered RDD Mapped RDD filter (func = startswith( )) map (func = split(...)) 22
23 Caching Storage Levels Different provide tradeoffs between memory usage and CPU efficiency. Cache when using algorithms. MEMORY_ONLY most CPU efficient, data has to fit in memory MEMORY_ONLY_SER More space efficient but reasonably fast MEMORY_AND_DISK MEMORY_AND_DISK_SER DISK_ONLY MEMORY_ONLY_2, MEMORY_AND_DISK_2 23
24 Data Frames Distributed of rows organized into named columns Spark SQL s Data Source API can read and write Data Frames using a variety of formats Hive, JSON, Parquet, HDFS Calling the DataFrame API can let you Select the columns you want Join data sources Aggregate and Filter Spark 1.5 lets you access the Hive Metastore to read/write schemas directly. 24
25 Spark Streaming What is it? Run con,nuous processing of data using Spark s core API Extends Spark concepts to fault- tolerant, transformable streams Adds rolling window opera@ons Example: Compute rolling averages or counts for data over last five minutes Benefits: Same programming paradigm for streaming and batch Excellent throughput Scale easily to support large volumes of data ingest Common Use Cases: On- the- fly ETL as data is ingested into Hadoop/HDFS Detect anomalous behavior and trigger alerts Con@nuous repor@ng of summary metrics for incoming data 25
26 Spark Streaming Architectures Spark Stream Processing Data Sources Integra/on Layer Data Prep / Scoring Ingest Flume Kaia Transformed Results HBase Real- Time Result Serving HDFS Spark Long- Term Analy/cs/ Model Building 26
27 SparkSQL Machine Learning Goal: Spark/Java Developers and Data can inline SQL into Spark apps Designed for: Ease of development for Spark developers Handful of concurrent Spark jobs Strengths: Ease of embedding SQL into Java or Scala SQL for common in developer flow (eg. filters, samples) 27
28 Impala Remains Tool of Choice for SQL Time (in seconds) Single User vs 10 User Response Time/Impala Times Faster (Lower bars = be`er) Single User, 5 10 Users, 11 Single User, x 10 Users, x Impala Spark SQL Presto Hive- on- Tez Single User, x 10 Users, x Single User, x 10 Users, x 28
29 Pluggable Spark replace MapReduce Cloudera is leading community development to port components to Spark: Stage 1 Stage 2 Stage 3 Crunch on Spark Search on Spark Hive on Spark (beta) Spark on HBase (beta) Pig on Spark (alpha) Sqoop on Spark Spark on Kudu 29
30 Spark Customer Use Cases Core Spark Financial Services Health PorColio Risk Analysis ETL Pipeline Speed- Up 20+ years of stock data disease- causing genes in the full human genome Calculate Jaccard scores on health care data sets Spark Streaming Financial Services Health Online Fraud Incident for Sepsis ERP Character and Bill Retail Online Systems Real- Time Inventory Management 1010 Data Services Trend analysis Document (LDA) Fraud Ad Tech Real- Time Ad Performance Analysis 30
31 Doing the Math Executors and Cores Determine the resource for the Spark job 4 Core 4 Core C 1 Core for Applica@on Master Cores for Executors 1 Executor 4 Cores x 15 Cores 3 Executors with 4 Cores Each (Leaves 3 Cores un- u@lized) Other Ra@os may lead to be`er resource u@liza@on 4 Core 1 Executor 2 Cores x 15 Cores 7 Executors with 2 Cores Each (Leaves 1 Core un- u@lized) 4 Core Don t exceed= 5 Cores per Executor h`p://blog.cloudera.com/blog/2015/03/how- to- tune- your- apache- spark- jobs- part- 2/ 16 Total Cores in Cluster Core Alloca@on Allocate Executors 31
32 Kudu 32
33 Kudu Design Goals High throughput for big scans (columnar storage and Goal: Within 2x of Parquet Low- latency for short accesses (primary key indexes and quorum design) Goal: 1ms read/write on SSD Database- like single- row ACID) Rela/onal data model SQL query NoSQL style scan/insert/update (Java client) 33
34 Kudu Storage for Fast on Fast Data DATA ENGINEERING DATA DISCOVERY & ANALYTICS DATA APPS BATCH STREAM SQL SEARCH MODEL ONLINE SPARK, HIVE, PIG SPARK IMPALA SOLR SPARK HBASE UNIFIED DATA SERVICES RESOURCE MANAGEMENT YARN SECURITY SENTRY New column store for Hadoop Simplifies the architecture for building on changing data Designed for fast performance integrated with Hadoop FILESYSTEM HDFS DATA INTEGRATION & STORAGE RELATIONAL KUDU NoSQL HBASE Apache- licensed open source (intent to donate to ASF) INGEST SQOOP, FLUME, KAFKA Beta now available 34
35 Kudu Trade- Offs Random updates will be slower HBase model allows random updates without incurring a disk seek Kudu requires a key lookup before update, Bloom lookup before insert Single- row reads may be slower Columnar design is op@mized for scans Future: may introduce column groups for applica@ons where single- row access is more important 35
36 Resources Join the community h`p://getkudu.io Download the Beta cloudera.com/downloads Read the Whitepaper getkudu.io/kudu.pdf 36
37 RecordService 37
38 Hadoop started out with zero security Didn t need it for the Silicon Valley applica@ons Does need it for Corporate applica@ons Cloudera is working on providing full featured Spark Security 38
39 Comprehensive, Compliance- Ready Security Audit, and Compliance Perimeter Access Visibility Data Guarding access to the cluster itself Defining what users and can do with data on where data came from and how it s being used Protec@ng data in the cluster from unauthorized visibility Technical Concepts: Authen@ca@on Network isola@on Technical Concepts: Permissions Authoriza@on Technical Concepts: Audi@ng Lineage Technical Concepts: Encryp@on, Tokeniza@on, Data masking Cloudera Manager Apache Sentry & RecordService Cloudera Navigator Navigator Encrypt & Key Trustee Partners 39
40 Directory and Kerberos Directory Manages Users, Groups, and Services Provides username / password authen@ca@on Group membership determines Service access Kerberos Trusted and standard third- party Authen@cated users receive Tickets Tickets gain access to Services User [ssmith] Password[***** ] User authen@cates to AD Authen@cated user gets Kerberos Ticket Ticket grants access to Services e.g. Impala 40
41 Fine- Grained Access Control in HDFS Across All Hadoop Paths Columns: column visibility varies by role (Ex. credit card numbers) Managers: Call Center: XXXX XXXX XXXX 5678 Analysts: XXXX XXXX XXXX XXXX Others: No access to credit card column Rows: Different user groups need access to different records European privacy laws Government security clearance Financial 41
42 RecordService Unified Access Control Enforcement DATA ENGINEERING DATA DISCOVERY & ANALYTICS DATA APPS BATCH STREAM SQL SEARCH MODEL ONLINE SPARK, HIVE, PIG SPARK IMPALA SOLR SPARK HBASE UNIFIED DATA SERVICES RESOURCE MANAGEMENT YARN SECURITY SENTRY, RECORDSERVICE New high performance security layer that centrally enforces access control policies across Hadoop Complements Apache Sentry s unified policy defini@on Row- and column- based security Dynamic data masking DATA INTEGRATION & STORAGE FILESYSTEM HDFS INGEST SQOOP, FLUME, KAFKA NoSQL HBASE Apache- licensed open source Beta now available 42
43 Fine- Grained HDFS Access without RecordService Split the original file Use HDFS permissions to limit access Date//me Accnt # SSN Asset Trade Country 09:33: :33: :12: :22: :55: :22: :45: :03: :55: AAPL Sell US TBT Buy EU IBM Sell UK INTC Buy US F Buy US UA Buy UK AMZN Sell EU TMV Buy US MA Buy UK Date//me Accnt # SSN Asset Trade Country 09:33: :22: :55: :03: Date//me Accnt # SSN Asset Trade Country 11:33: :45: Date//me Accnt # SSN Asset Trade Country 14:12: :22: :55: AAPL Sell US INTC Buy US F Buy US TMV Buy US TBT Buy EU AMZN Sell EU IBM Sell UK UA Buy UK MA Buy UK 43
44 Fine- Grained HDFS Access Control with RecordService Apply controls to the master data file Row, column, and sub- column (masking) controls Enforce these across all access paths Column- Level Controls What U.S. Brokers See Column- Level Controls Date//me Accnt # SSN Asset Trade Country Date//me Accnt # SSN Asset Trade Country 09:33: AAPL Sell US 09:33: XXX- XX AAPL Sell US 11:33: :12: :22: :55: TBT Buy EU IBM Sell EU INTC Buy US F Buy US Row- Level Controls 11:33: :12: :22: :55: TBT Buy group IBM Sell group XXX- XX INTC Buy US XXX- XX F Buy US Row- Level Controls 10:22: UA Buy EU 10:22: UA Buy group3 13:45: AMZN Sell EU 13:45: AMZN Sell group2 44
45 Spark Resources Learn Spark Spark Cookbook by Rishi Yadav O Reilly Advanced Analy@cs with Spark ebook (wri`en by Clouderans) Cloudera Developer Blog cloudera.com/spark Get Trained Cloudera Spark Training 45
Enabling Secure Hadoop Environments
Enabling Secure Hadoop Environments Fred Koopmans Sr. Director of Product Management 1 The future of government is data management What s your strategy? 2 Cloudera s Enterprise Data Hub makes it possible
More informationBig Data Infrastructures & Technologies
Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory
More informationIntroduction to Apache Spark. Patrick Wendell - Databricks
Introduction to Apache Spark Patrick Wendell - Databricks What is Spark? Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop Efficient General execution graphs In-memory storage
More informationSpark Overview. Professor Sasu Tarkoma.
Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based
More informationIntroducing Apache Kudu and RecordService (incubating)
Introducing Apache Kudu and RecordService (incubating) Guido Oswald Sales Engineer, Switzerland April 2016, Swiss Big Data User Group Meetup 18 @GuidoOswald 1 Current storage landscape in Hadoop HDFS excels
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark Anastasios Skarlatidis @anskarl Software Engineer/Researcher IIT, NCSR "Demokritos" Outline Part I: Getting to know Spark Part II: Basic programming Part III: Spark under
More informationLambda Architecture with Apache Spark
Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03 2015 MapR Technologies 2015 MapR Technologies 1 Polyglot Processing 2015 2014 MapR
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationSubmitted to: Dr. Sunnie Chung. Presented by: Sonal Deshmukh Jay Upadhyay
Submitted to: Dr. Sunnie Chung Presented by: Sonal Deshmukh Jay Upadhyay Submitted to: Dr. Sunny Chung Presented by: Sonal Deshmukh Jay Upadhyay What is Apache Survey shows huge popularity spike for Apache
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More informationBig Data security, tools and tips to protect information assets
Big Data security, tools and tips to protect information assets Eddie Garcia Chief Security Architect 1 A Big Data Revolution is Happening as We Speak Industrial Revolution Data Revolution 2 The Benefits
More informationMODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationBig Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours
Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals
More informationData processing in Apache Spark
Data processing in Apache Spark Pelle Jakovits 5 October, 2015, Tartu Outline Introduction to Spark Resilient Distributed Datasets (RDD) Data operations RDD transformations Examples Fault tolerance Frameworks
More informationData processing in Apache Spark
Data processing in Apache Spark Pelle Jakovits 8 October, 2014, Tartu Outline Introduction to Spark Resilient Distributed Data (RDD) Available data operations Examples Advantages and Disadvantages Frameworks
More informationIn-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years
More informationNew Developments in Spark
New Developments in Spark And Rethinking APIs for Big Data Matei Zaharia and many others What is Spark? Unified computing engine for big data apps > Batch, streaming and interactive Collection of high-level
More informationStay Informed During and AEer OpenWorld
Stay Informed During and AEer OpenWorld TwiIer: @OracleBigData, @OracleExadata, @Infrastructure Follow #CloudReady LinkedIn: Oracle IT Infrastructure Oracle Showcase Page Oracle Big Data Oracle Showcase
More informationSpark and distributed data processing
Stanford CS347 Guest Lecture Spark and distributed data processing Reynold Xin @rxin 2016-05-23 Who am I? Reynold Xin PMC member, Apache Spark Cofounder & Chief Architect, Databricks PhD on leave (ABD),
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More informationData processing in Apache Spark
Data processing in Apache Spark Pelle Jakovits 21 October, 2015, Tartu Outline Introduction to Spark Resilient Distributed Datasets (RDD) Data operations RDD transformations Examples Fault tolerance Streaming
More informationMapR Enterprise Hadoop
2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS
More informationApache Spark. CS240A T Yang. Some of them are based on P. Wendell s Spark slides
Apache Spark CS240A T Yang Some of them are based on P. Wendell s Spark slides Parallel Processing using Spark+Hadoop Hadoop: Distributed file system that connects machines. Mapreduce: parallel programming
More informationBig Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara
Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case
More informationBig Data Architect.
Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional
More informationAn Overview of Apache Spark
An Overview of Apache Spark CIS 612 Sunnie Chung 2014 MapR Technologies 1 MapReduce Processing Model MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using
More informationApache Spark. Easy and Fast Big Data Analytics Pat McDonough
Apache Spark Easy and Fast Big Data Analytics Pat McDonough Founded by the creators of Apache Spark out of UC Berkeley s AMPLab Fully committed to 100% open source Apache Spark Support and Grow the Spark
More informationA Tutorial on Apache Spark
A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationOLTP on Hadoop: Reviewing the first Hadoop- based TPC- C benchmarks
OLTP on Hadoop: Reviewing the first Hadoop- based TPC- C benchmarks Monte Zweben Co- Founder and Chief Execu6ve Officer John Leach Co- Founder and Chief Technology Officer September 30, 2015 The Tradi6onal
More informationMapReduce review. Spark and distributed data processing. Who am I? Today s Talk. Reynold Xin
Who am I? Reynold Xin Stanford CS347 Guest Lecture Spark and distributed data processing PMC member, Apache Spark Cofounder & Chief Architect, Databricks PhD on leave (ABD), UC Berkeley AMPLab Reynold
More informationBig Data Hadoop Stack
Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware
More informationDelving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture
Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases
More informationBig Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture
Big Data Syllabus Hadoop YARN Setup Programming in YARN framework j Understanding big data and Hadoop Big Data Limitations and Solutions of existing Data Analytics Architecture Hadoop Features Hadoop Ecosystem
More informationCERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)
CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program
More informationDistributed Computing with Spark
Distributed Computing with Spark Reza Zadeh Thanks to Matei Zaharia Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Numerical computing on Spark Ongoing
More informationApache Spark 2.0. Matei
Apache Spark 2.0 Matei Zaharia @matei_zaharia What is Apache Spark? Open source data processing engine for clusters Generalizes MapReduce model Rich set of APIs and libraries In Scala, Java, Python and
More informationBacktesting with Spark
Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel 1 Traditional Grid Shared storage Storage and compute scale independently Bottleneck on I/O
More informationThe Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou
The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations
More informationBig Data Analytics using Apache Hadoop and Spark with Scala
Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More informationHadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)
Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:
More informationIBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics
IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that
More informationJialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat
H5Spark H5Spark: Bridging the I/O Gap between Spark and Scien9fic Data Formats on HPC Systems Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike
More informationInnovatus Technologies
HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String
More informationAgenda. Spark Platform Spark Core Spark Extensions Using Apache Spark
Agenda Spark Platform Spark Core Spark Extensions Using Apache Spark About me Vitalii Bondarenko Data Platform Competency Manager Eleks www.eleks.com 20 years in software development 9+ years of developing
More informationDell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Spark Technology Overview and Streaming Workload Use Cases Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationMapReduce, Hadoop and Spark. Bompotas Agorakis
MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)
More informationBig Data Analytics. Map-Reduce and Spark. Slides by: Joseph E. Gonzalez Guest Lecturer: Vikram Sreekanti
Big Data Analytics Map-Reduce and Spark Slides by: Joseph E. Gonzalez jegonzal@cs.berkeley.edu? Guest Lecturer: Vikram Sreekanti From SQL to Big Data (with SQL) Ø A few weeks ago Ø Databases Ø (Relational)
More informationBig Data Infrastructures & Technologies Hadoop Streaming Revisit.
Big Data Infrastructures & Technologies Hadoop Streaming Revisit ENRON Mapper ENRON Mapper Output (Excerpt) acomnes@enron.com blake.walker@enron.com edward.snowden@cia.gov alex.berenson@nyt.com ENRON Reducer
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationIntroduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent
Introduc)on to Apache Ka1a Jun Rao Co- founder of Confluent Agenda Why people use Ka1a Technical overview of Ka1a What s coming What s Apache Ka1a Distributed, high throughput pub/sub system Ka1a Usage
More informationApplied Spark. From Concepts to Bitcoin Analytics. Andrew F.
Applied Spark From Concepts to Bitcoin Analytics Andrew F. Hart ahart@apache.org @andrewfhart My Day Job CTO, Pogoseat Upgrade technology for live events 3/28/16 QCON-SP Andrew Hart 2 Additionally Member,
More informationHadoop. Introduction / Overview
Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum
Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills
More informationWe are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info
We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423
More informationSpark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM
Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts
More informationApril Copyright 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on
More informationOracle Big Data Fundamentals Ed 1
Oracle University Contact Us: +0097143909050 Oracle Big Data Fundamentals Ed 1 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big Data
More informationExam Questions
Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure
More information2/26/2017. Originally developed at the University of California - Berkeley's AMPLab
Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second
More informationAnalytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig
Analytics in Spark Yanlei Diao Tim Hunter Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Outline 1. A brief history of Big Data and Spark 2. Technical summary of Spark 3. Unified analytics
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationCloudera Kudu Introduction
Cloudera Kudu Introduction Zbigniew Baranowski Based on: http://slideshare.net/cloudera/kudu-new-hadoop-storage-for-fast-analytics-onfast-data What is KUDU? New storage engine for structured data (tables)
More informationEvolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0
Evolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0 Chris Roderick Marcin Sobieszek Piotr Sowinski Nikolay Tsvetkov Jakub Wozniak Courtesy IT-DB Agenda Intro to CALS System Hadoop
More informationDistributed Computing with Spark and MapReduce
Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How
More informationHBase... And Lewis Carroll! Twi:er,
HBase... And Lewis Carroll! jw4ean@cloudera.com Twi:er, LinkedIn: @jw4ean 1 Introduc@on 2010: Cloudera Solu@ons Architect 2011: Cloudera TAM/DSE 2012-2013: Cloudera Training focusing on Partners and Newbies
More informationKhadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016
Khadija Souissi Auf z Systems 07. 08. November 2016 @ IBM z Systems Mainframe Event 2016 Acknowledgements Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.
More informationTime Series Storage with Apache Kudu (incubating)
Time Series Storage with Apache Kudu (incubating) Dan Burkert (Committer) dan@cloudera.com @danburkert Tweet about this talk: @getkudu or #kudu 1 Time Series machine metrics event logs sensor telemetry
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationConfiguring and Deploying Hadoop Cluster Deployment Templates
Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page
More informationThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,
More informationBig Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018
Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/
More informationTurning Relational Database Tables into Spark Data Sources
Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following
More informationBring Context To Your Machine Data With Hadoop, RDBMS & Splunk
Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Raanan Dagan and Rohit Pujari September 25, 2017 Washington, DC Forward-Looking Statements During the course of this presentation, we may
More informationHadoop course content
course content COURSE DETAILS 1. In-detail explanation on the concepts of HDFS & MapReduce frameworks 2. What is 2.X Architecture & How to set up Cluster 3. How to write complex MapReduce Programs 4. In-detail
More informationBig data systems 12/8/17
Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores
More informationSecurity and Performance advances with Oracle Big Data SQL
Security and Performance advances with Oracle Big Data SQL Jean-Pierre Dijcks Oracle Redwood Shores, CA, USA Key Words SQL, Oracle, Database, Analytics, Object Store, Files, Big Data, Big Data SQL, Hadoop,
More informationOracle Big Data Fundamentals Ed 2
Oracle University Contact Us: 1.800.529.0165 Oracle Big Data Fundamentals Ed 2 Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, you learn about big data, the technologies
More informationStages of Data Processing
Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,
More informationCloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationHortonworks and The Internet of Things
Hortonworks and The Internet of Things Dr. Bernhard Walter Solutions Engineer About Hortonworks Customer Momentum ~700 customers (as of November 4, 2015) 152 customers added in Q3 2015 Publicly traded
More informationMapReduce, Apache Hadoop
NDBI040: Big Data Management and NoSQL Databases hp://www.ksi.mff.cuni.cz/ svoboda/courses/2016-1-ndbi040/ Lecture 2 MapReduce, Apache Hadoop Marn Svoboda svoboda@ksi.mff.cuni.cz 11. 10. 2016 Charles University
More informationIntroduction to Big-Data
Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,
More informationResilient Distributed Datasets
Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,
More informationBlended Learning Outline: Cloudera Data Analyst Training (171219a)
Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills
More informationSpark, Shark and Spark Streaming Introduction
Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References
More informationAnalyzing Flight Data
IBM Analytics Analyzing Flight Data Jeff Carlson Rich Tarro July 21, 2016 2016 IBM Corporation Agenda Spark Overview a quick review Introduction to Graph Processing and Spark GraphX GraphX Overview Demo
More informationOverview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::
Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional
More informationBringing Data to Life
Bringing Data to Life Data management and Visualization Techniques Benika Hall Rob Harrison Corporate Model Risk March 16, 2018 Introduction Benika Hall Analytic Consultant Wells Fargo - Corporate Model
More informationImpala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam
Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?
More informationBigDataBench- S: An Open- source Scien6fic Big Data Benchmark Suite
BigDataBench- S: An Open- source Scien6fic Big Data Benchmark Suite Xinhui Tian, Shaopeng Dai, Zhihui Du, Wanling Gao, Rui Ren, Yaodong Cheng, Zhifei Zhang, Zhen Jia, Peijian Wang and Jianfeng Zhan INSTITUTE
More informationSpecialist ICT Learning
Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.
More informationIntroduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data
Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction
More information