SciSpark Tutorial 101

SciSpark Tutorial 101: Introduction to Spark. Super Cool Parallel Computing! In-Memory Map-Reduce! It Slices, It Dices, It Minces... So Fast, You Won't Believe It!!! ORDER NOW!!!

Agenda for 101:
- Intro to the SciSpark project
- Access your SciSpark & Notebook VM (personal sandbox)
- What is Spark? Lightning-fast cluster (parallel) computing; simple programming in Python or Scala
- SciSpark extensions: sciTensor (N-dimensional arrays in Spark) and sRDD (scientific RDDs)
- Notebook technology
- Tutorial notebooks:
  - 101-0: Spark Warmup
  - 101-1: Intro to Spark (in Scala & Python)
  - 101-2: Spark SQL and DataFrames
  - 101-3: Simple sRDD Notebook (optional)
  - 101-4: Parallel Statistical Rollups for a Time-Series of Grids (PySpark)

Claim your SciSpark VM
- Go to: http://is.gd/scispark (you'll have to click through!)
- Enter your name next to an IP address to claim it
- Enter the URL in your browser to access the SciSpark Notebook (Chrome or Firefox preferred)

Agenda for 201:
- Access your SciSpark & Notebook VM (personal sandbox)
- Quick recap of the SciSpark project: What is Spark? SciSpark extensions: sciTensor (N-dimensional arrays in Spark) and sRDD (scientific RDDs)
- Implementation of Mesoscale Convective Complexes (MCC) analytics: find cloud elements, analyze evolution, connect the graph
- Tutorial notebooks:
  - 201-1: Using sciTensor and sRDD
  - 201-2: More SciSpark API & Lazy Evaluation
  - 201-3: End-to-end analysis of Mesoscale Convective Systems
  - 201-4: MCS Cloud Element Analytics

Agenda for 301:
- Access your SciSpark & Notebook VM (personal sandbox)
- Quick recap of the SciSpark project
- PDF clustering use case: K-means clustering of atmospheric behavior using a full probability density function (PDF) or just moments: histogram versus (std dev, skewness)
- SciSpark challenges & lessons learned
- Tutorial notebooks:
  - 301-1, 2, 3: PDF Clustering in PySpark, in Scala, Side by Side
  - 301-4: Regional Climate Model Evaluation (RCMES) (optional)
  - 301-5: CMDA Workflow Example
  - 301-6: Lessons Learned Notebook
- Wrap-up discussion: How do you do X in SciSpark? Bring your use cases and questions

SciSpark AIST Project (AIST-14-0034)
- Funded by Mike Little
- PI: Christian Mattmann
- Project Mgr: Brian Wilson
- JPL Team: Kim Whitehall, Sujen Shah, Maziyar Boustani, Alex Goodman, Wayne Burke, Gerald Manipon, Paul Ramirez, Rishi Verma, Lewis McGibbney
- Interns: Rahul Palamuttam, Valerie Roth, Harsha Manjunatha, Renato Marroquin

How did we get involved with Spark?

Motivation for SciSpark
- Experiment with in-memory computation and frequent data-reuse operations
- Regridding, interactive analytics such as MCC search, and variable clustering (min/max) over decadal datasets could benefit from in-memory computation rather than frequent disk I/O
- Data ingestion (preparation, formatting)

SciSpark: Envisioned Architecture

SciSpark: Motivation
- Climate and weather directly impact all aspects of society
- To understand average climate conditions and extreme weather events, we need data from climate models and observations
- These data are voluminous & heterogeneous, with varying spatial and temporal resolutions, in various scientific data formats
- They provide important information for policy- and decision-makers, as well as for Earth science and climate change researchers
Image: "AtmosphericModelSchematic" by NOAA (http://celebrating200years.noaa.gov/breakthroughs/climate_model/atmosphericmodelschematic.png), public domain, via Wikimedia Commons (https://commons.wikimedia.org/wiki/File:AtmosphericModelSchematic.png)

SciSpark: an analytical engine for science data
- I/O is the major performance bottleneck: reading voluminous hierarchical datasets
- Batch processing limits interaction with, and exploration of, scientific datasets
- Regridding, interactive analytics, and variable clustering (min/max) over decadal datasets could benefit from in-memory computation
- Spark Resilient Distributed Dataset (RDD):
  - Enables re-use of intermediate results
  - Replicated, distributed, and reconstructed
  - In memory for fast interactivity

SciSpark: A Two-Pronged Approach to Spark
- Extend native Spark on the JVM (Scala, Java):
  - Handle Earth science geolocated arrays (variables as a function of time, lat, lon)
  - Provide netCDF & OPeNDAP data ingest
  - Provide array operators like numpy
  - Implement two complex use cases: Mesoscale Convective Systems (MCS), discovering cloud elements and connecting the graph; and PDF clustering of atmospheric state over N. America
- Exploit PySpark (Spark jobs using the Python gateway):
  - Exploit the numpy & scipy ecosystem
  - Port and parallelize existing Python analysis codes in PySpark: statistical rollups over long time-series, the Regional Climate Model Evaluation System (RCMES), and many others

PySpark Gateway
- The full Python ecosystem is available: numpy N-dimensional arrays, scipy algorithms
- However, there are now two runtimes: Python & JVM
  - The Py4J gateway moves simple data structures back and forth, at the expense of communication overhead and format conversions
- Apache Spark actually supports three languages: Scala, Python, Java
- Code can be translated from Python to Scala almost line by line

Three Challenges using Spark for Earth Science
- Adapting the Spark RDD to geospatial 2D & 3D (dense) arrays: sRDD, with loaders from netCDF & HDF files; sciTensor using ND4J or Breeze
- Reworking complex science algorithms (like GTG) into Spark's map, filter, and reduce paradigm: generate parallel work at the expense of duplicating some data; port or redevelop key algorithms from Python/C to Scala
- Performance for bundles of dense arrays in the Spark JVM: large arrays in memory require a large JVM heap; garbage collection overheads

Parallel Computing Styles
- Task-driven: recursively divide the work into small tasks, then execute them in parallel
- Data-driven: first partition the data across the cluster, then compute using map-filter-reduce, repartitioning (shuffling) data as necessary using groupByKey (see the sketch below)
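To make the data-driven style concrete, here is a minimal PySpark sketch (with synthetic keys and values) that partitions a dataset, computes with map and filter, and then shuffles by key to aggregate:

    from pyspark import SparkContext

    sc = SparkContext(appName="DataDrivenStyle")

    # First, partition the data across the cluster: (key, value) records in 8 shards.
    records = sc.parallelize([(i % 4, float(i)) for i in range(1000)], 8)

    # Then compute with map and filter (no shuffle needed for these).
    scaled = records.mapValues(lambda v: v * 2.0).filter(lambda kv: kv[1] > 10.0)

    # Repartition (shuffle) by key to aggregate per-key results.
    sums = scaled.reduceByKey(lambda a, b: a + b)
    print(sums.collect())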

Sources of Parallel Work (Independent Data)
- Parallelize over time: variables (grids) over a long 10-30 year time-series (keep each grid together)
- Parallelize over space (and/or altitude):
  - By pixel: independent time-series, massive parallelism
  - By spatial tiles: subsetting, area averaging, stencils (NEXUS)
  - By blocks: blocked array operations (harder algorithms)
- Parallelize over variable, model, metric, parameter, etc.: ensemble model evaluation with multiple metrics; searching a parameter space with many analyses and runs
(The sketch below illustrates keying by time versus by pixel.)
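As a toy illustration of the first two sources of parallelism, this sketch uses a small synthetic numpy (time, lat, lon) array and builds one RDD keyed by time step and another keyed by pixel:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="SourcesOfParallelWork")

    # Synthetic stand-in for a (time, lat, lon) variable: 12 monthly 4x5 grids.
    data = np.arange(12 * 4 * 5, dtype=float).reshape(12, 4, 5)

    # Parallelize over time: one record per time step; each grid stays together.
    by_time = sc.parallelize([(t, data[t]) for t in range(12)])

    # Parallelize over space (by pixel): one record per (lat, lon) time series.
    by_pixel = sc.parallelize([((i, j), data[:, i, j])
                               for i in range(4) for j in range(5)])

    print(by_time.count(), by_pixel.count())  # 12 and 20 records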

What is Apache Spark? In-Memory Map-Reduce
- Datasets are partitioned across a compute cluster by key: shard by time, space, and/or variable
- RDD: Resilient Distributed Dataset
  - Fault-tolerant, parallel data structures
  - Intermediate results persisted in memory
  - The user controls the partitioning to optimize data placement
- New RDDs are computed using a pipeline of transformations
  - Resilience: lost shards can be recomputed from the saved pipeline
- Rich set of operators on RDDs
  - Parallel: map, filter, sample, partitionBy, sort
  - Reduce: groupByKey, reduceByKey, count, collect, union, join
- Computation is implicit (lazy) until answers are needed (see the sketch below)
  - A pipeline of transformations implicitly defines a new RDD
  - The RDD is computed only when needed by an action: count, collect, reduce
- Persistence hierarchy (saveTo): implicit pipelined RDD, in memory, on fast SSD, on hard disk
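A minimal PySpark sketch of lazy evaluation: the transformations below only define the pipeline, nothing executes until an action such as count or reduce is called, and cache() persists the intermediate RDD so both actions re-use it:

    from pyspark import SparkContext

    sc = SparkContext(appName="LazyEvaluation")

    nums = sc.parallelize(range(1000000), 16)

    # Transformations are lazy: these lines only build the pipeline.
    evens = nums.filter(lambda n: n % 2 == 0)
    squares = evens.map(lambda n: n * n)

    # Persist the intermediate RDD in memory for re-use across actions.
    squares.cache()

    # Actions trigger the actual computation.
    print(squares.count())                    # 500000
    print(squares.reduce(lambda a, b: a + b))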

Apache Spark Transformations & Actions

SciSpark Project Contributions
- Parallel ingest of science data from netCDF and HDF, using OPeNDAP and Webification URLs to slice arrays
- Scientific RDDs for large arrays (sRDDs): bundles of 2-, 3-, and 4-dimensional arrays keyed by name, partitioned by time and/or space
- More operators: ArraySplit by time and space, custom statistics, etc.
- Sophisticated statistics and machine learning: higher-order statistics (skewness, kurtosis), multivariate PDFs and histograms, clustering, graph algorithms
- Partitioned variable cache (in development): store named arrays in a distributed Cassandra DB or HDFS
- Interactive statistics and plots: a live code window submits jobs to the SciSpark cluster; incremental statistics and plots stream into the browser UI

Spark Split Methods
- Sequence files (text files, log files, sequences of records): if in HDFS, already split across the cluster (the original use case for Hadoop & Spark)
- sc.parallelize(list, 32): split a list or array into 32 chunks across the cluster
- sc.textFile(path, 32) and sc.binaryRecords(): split a text (or binary) file into chunks of records; each chunk is a set of lines or records
- sc.wholeTextFiles() and sc.binaryFiles(): split a list of files across the cluster; whole files are kept intact
- Custom partition methods: write your own (see the sketch below)
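A short PySpark sketch of these split methods; the HDFS paths are hypothetical placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="SplitMethods")

    # Split an in-memory list into 32 partitions across the cluster.
    rdd = sc.parallelize(range(100000), 32)
    print(rdd.getNumPartitions())

    # Split a text file into partitions of lines (hypothetical path).
    lines = sc.textFile("hdfs:///data/logs/events.log", 32)

    # Keep whole files intact: one (filename, contents) pair per file.
    files = sc.wholeTextFiles("hdfs:///data/reports/")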

SciSpark Extensions for netCDF
- sciSparkContext.netcdfFile([path or DAP URL], numPartitions=32): read multiple variables (arrays) & attributes into a sciTensor
- sciSparkContext.netcdfDFSFile([list of netCDF files in a local directory or HDFS]): split the list of files into chunks across the cluster
- sciSparkContext.openPath([all netCDF files nested under a top directory])
- sciSparkContext.readMERGFile(): read a custom binary file
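The loaders above run inside SciSpark on the JVM; the camel-cased names follow the slide and may differ from the exact SciSpark API. As a hedged illustration of the same parallel-ingest pattern in plain PySpark, one can pair sc.binaryFiles() with the netCDF4 Python library (the HDFS path and the variable name "tas" below are hypothetical):

    import netCDF4  # assumes the netCDF4 Python library is installed
    from pyspark import SparkContext

    sc = SparkContext(appName="NetcdfIngest")

    def read_variable(pair):
        path, payload = pair
        # Open the netCDF file directly from the bytes Spark shipped to the worker.
        ds = netCDF4.Dataset("inmemory.nc", memory=payload)
        return (path, ds.variables["tas"][:])  # hypothetical variable name

    # One (path, bytes) record per file; whole files are kept intact.
    grids = sc.binaryFiles("hdfs:///data/netcdf/").map(read_variable)
    print(grids.count())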

SciSpark Front-end: Notebooks

SciSpark Front-end: Notebooks
- Multi-interpreter support (by cell): Scala (default), Python (%pyspark), Spark SQL (%sql), etc.
- Documentation cells in Markdown (%md) or HTML, with images
- The notebook automatically connects to the spark-shell (cluster); the SparkContext is available as sc
- Progress bars while executing Spark tasks
- Other features: auto-saved as you type; share, copy & modify, group edit & view; run your own notebook server using Spark standalone
(Example cells are sketched below.)
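As a sketch of what two notebook cells might look like (the exact rendering depends on the notebook server), here is a Markdown documentation cell followed by a PySpark cell that uses the pre-bound sc:

    %md
    ## Load the daily grids
    A quick sanity check before running the full analysis.

    %pyspark
    # sc is pre-bound by the notebook; no SparkContext setup is needed.
    paths = sc.parallelize(["a.nc", "b.nc", "c.nc"])  # hypothetical file names
    print(paths.count())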

SciSpark Infrastructure Stack: leverages open-source projects

SciSpark Extensions to the Core Ecosystem: SciSpark, sRDD, sciTensor

Notebook 101-0: Spark Warmup

Notebook 101-1: Intro. to Spark

Spark also does SQL and DataFrames
This is useful for Earth science datasets because of the heterogeneity of the data:
- In delimited file formats such as .csv files (e.g. NOAA's NCEI StormData, some station data)
- In text formats (e.g. NHC reports)
- In SQL databases (e.g. NWS and WMO databases for climate variables)
Analysis with these datasets is usually limited to single-core processing in one environment and involves:
- Subsetting (required) data from the DB or the files and storing it in tabular format
- Performing analysis available for tables
- Transferring the results to final locations
This process is usually not very reproducible.

Spark also does SQL and DataFrames
The notebook interface backed by Spark allows for:
- Loading (all) the data into the parallelized environment
- Discovery, through the ability to access all the data within an SQL type of environment or a DataFrame
- Analysis, through querying the data and producing visualizations (see the sketch below)
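A minimal PySpark sketch of this workflow, using the Spark 2.x SparkSession entry point; the CSV path and the column name "state" are hypothetical stand-ins for a StormData-style table:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("StormDataSQL").getOrCreate()

    # Load all the delimited data into the parallelized environment.
    storms = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("hdfs:///data/stormdata.csv"))  # hypothetical path

    # Discovery and analysis through SQL queries instead of ad-hoc scripts.
    storms.createOrReplaceTempView("storms")
    spark.sql("""
        SELECT state, COUNT(*) AS n_events
        FROM storms
        GROUP BY state
        ORDER BY n_events DESC
    """).show(10)  # 'state' is a hypothetical column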

Notebook 101-2: Spark SQL and DataFrames

Challenge of handling Earth Science data
Heterogeneity of the data:
- Some instruments produce point-source data, stored in ASCII and tabular formats
- Some instruments produce multi-dimensional data for the FOV as a function of time, level, lat, lon. These are stored in hierarchical data formats that are self-describing, multidimensional, and usually multivariable; common formats are netCDF and HDF
- Models usually produce multi-dimensional data as well
- The Spark engine does not inherently handle N-dimensional arrays
Size of the data:
- As instruments and models improve in resolution, the data grow significantly
Getting N-dimensional data into SciSpark

SciSpark Accomplishments
- sRDD architecture: built sRDD and sciTensor to handle scientific data formats; demonstrated functionality to read hierarchical data
- Extended N-dimensional array operations: provide a common interface for both Breeze and ND4J
- Support loading scientific data from multiple sources & formats
- SciSpark use case, distributed GTG: generate a graph (nodes are cloud areas & edges are time) using SciSpark (run time 10-60 seconds)

Garbage Collection (GC) Performance Lessons

Notebook 101-3: Simple sRDD Notebook

Parallel Statistical Rollups for a Time-Series of Grids
General problem: compute per-pixel statistics for a physical variable over time for a global or regional lat/lon grid.
Specific steps:
- Given daily files or OPeNDAP URLs over 10-30 years, load the grids and mask missing values
- Compute statistics of choice: count, mean, standard deviation, minimum, maximum, skewness, kurtosis, etc.
- Optionally "roll up" the statistics over time by sub-periods of choice: month, season, year, 5-year, and total period (see the sketch below)
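A minimal PySpark sketch of the rollup pattern (not the tutorial notebook's code): synthetic random grids stand in for daily files, missing values are masked, per-pixel count/sum/sum-of-squares are merged with reduceByKey per month, and mean and standard deviation are derived at the end:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="StatRollup")

    # Synthetic stand-in for daily files: (date, masked 2D grid) pairs,
    # with values below 0.1 masked to mimic missing data.
    days = [("2000-%02d-%02d" % (m, d),
             np.ma.masked_less(np.random.rand(4, 5), 0.1))
            for m in range(1, 13) for d in range(1, 29)]
    grids = sc.parallelize(days, 12)

    # Key each grid by rollup period (here: month), carrying per-pixel
    # count, sum, and sum of squares over the unmasked values.
    def to_moments(pair):
        date, g = pair
        cnt = (~np.ma.getmaskarray(g)).astype(float)
        return (date[5:7], (cnt, g.filled(0.0), (g ** 2).filled(0.0)))

    def merge(a, b):
        return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

    monthly = grids.map(to_moments).reduceByKey(merge)

    # Derive per-pixel mean and standard deviation for each month.
    def finalize(pair):
        month, (cnt, s, ss) = pair
        n = np.maximum(cnt, 1.0)  # guard against all-missing pixels
        mean = s / n
        std = np.sqrt(np.maximum(ss / n - mean ** 2, 0.0))
        return (month, mean, std)

    for month, mean, std in sorted(monthly.map(finalize).collect()):
        print(month, mean.mean(), std.mean())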

Parallel Statistical Rollups for Earth Time-Series

Notebook 101-4: Parallel Statistical Rollups for Time-Series of Earth Science Grids

Discussion in 301: How do you do X in SciSpark?
- What are your use cases? Bring them to SciSpark 301
- Who is using Spark or Hadoop?
- Questions?

Notebook Exercises
Try some of the exercises. Just play. Ask questions.
Notebook exercises:
- 101-0: Spark Warmup
- 101-1: Intro to Spark (in Scala & Python)
- 101-2: Spark SQL and DataFrames
- 101-3: Simple sRDD Notebook
- 101-4: Parallel Statistical Rollups for a Time-Series of Grids (PySpark)