IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics


Executive summary

This white paper introduces SparkR, a package for the R statistical programming language that enables programmers and data scientists to access Spark's large-scale, in-memory data processing. The R runtime is single-threaded and can therefore normally only run on a single computer, processing data sets that fit within that machine's memory. By providing a bridge to Spark's distributed computation engine, SparkR enables large R jobs to run across multiple cores within a single machine or across nodes in massively parallel clusters, with access to all the memory in the cluster.

Using SparkR, programmers and data scientists can transform R into a tool for big data analytics, taking advantage of parallel processing and near-linear scaling to tackle much larger challenges than would normally be possible. For example, when training a machine-learning model on a very large initial set of data, using SparkR would enable highly parallelized parameter tuning to accelerate the process of finding the best-fit model for the given dataset. With the R processing split across potentially thousands of computational nodes in a cluster, and with access to terabytes of data, this can result in orders-of-magnitude improvements in performance.

Requiring little or no change to existing code, SparkR is an easy way to augment the power of R and scale it up to the challenge of large datasets. Users can keep their existing programs and workflows, seamlessly flexing up to use Spark when distributed processing or large amounts of data are involved, then flexing back down to local resources. In this way, data scientists can easily balance performance and cost, and keep a single language for both local and distributed analysis. The paper concludes with an invitation for readers to try IBM Data Science Experience, which provides an all-in-one web-based platform for data scientists, enabling the easy use of SparkR and other open-source tools running on robust cloud resources.

Spark: what it is, and why it's great news for data scientists

Apache Spark is an open-source processing engine designed for robust, in-memory machine learning, graph and streaming analytics. Developed to utilize distributed, in-memory data structures to improve data processing speeds for most workloads, Spark can perform up to 100 times faster than Hadoop MapReduce for iterative algorithms or interactive data mining. And while MapReduce requires the relatively low-level Java programming language, Spark also supports Scala, Python and R, which are easier to learn and more commonly used by data scientists. What's more, Spark jobs will run in any environment, from a laptop right up to the very largest clusters, whereas MapReduce jobs are often specifically written for a particular cluster configuration, and are significantly less portable.

Spark provides a single architecture capable of seamlessly handling a wide range of data processing scenarios, from complex analytics through algorithmic processing to stream computing. This removes the cost and complexity of maintaining different technology stacks, ensures that metrics are consistent, and makes it faster and easier to share data. PySpark can work with data in a distributed storage system, for example HDFS, and it can also take local data and parallelize it across the cluster to accelerate computations. With its ability to compute in real time, Spark can enable faster decisions, for example, identifying why a transactional website is running slowly so that it can be fixed before financial losses are incurred.
Similarly, for queries on streaming data, Spark's real-time processing can help organizations detect financial fraud or distributed denial of service (DDoS) attacks.

Figure 1: Spark architecture. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). From https://spark.apache.org/docs/latest/cluster-overview.html.

What is SparkR, and what benefits does it offer?

SparkR is an R package that provides a lightweight frontend for using Apache Spark directly from R. Interactive data analysis in R is usually limited by the fact that its runtime is single-threaded. In practical terms, this means that it cannot take advantage of multiple cores, let alone multiple machines in a cluster, and it can only process datasets that fit in a single computer's memory. By providing direct access to Spark's distributed computation engine from the R shell, SparkR enables much faster analysis of much larger datasets by splitting the work across multiple cores or cluster nodes, and caching data across all the memory available in the cluster.

As figure 2 shows, SparkR's architecture consists of two main components: an R-to-JVM binding on the driver that allows R programs to submit jobs to a Spark cluster, and support for running R on the Spark executors.

The central component of SparkR is a distributed implementation of the standard R data frame. Spark DataFrames (note the different capitalization) are distributed collections of data organized into named columns; using them from R allows multiple nodes to run R operations such as selection, filtering and aggregation in parallel on large datasets. DataFrames can be constructed from a wide array of sources, including structured data files, Hive tables, JSON files, external databases, or existing local R data frames.

Figure 2: Architectural overview of SparkR.
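To make this concrete, the listing below is a minimal sketch of working with Spark DataFrames from R. It assumes SparkR 2.x is installed and on the library path, and uses local mode and R's built-in faithful dataset purely for illustration; in a real deployment the master URL would point at a cluster and read.df() would load data from HDFS, Hive or another external source.

library(SparkR)

# Start a session in local mode (illustrative; use a cluster master URL in practice)
sparkR.session(master = "local[*]", appName = "SparkR-DataFrame-demo")

# Create a Spark DataFrame from a local R data frame; read.df() could equally
# load JSON, Parquet, CSV or Hive data from distributed storage
df <- createDataFrame(faithful)

# Selection, filtering and aggregation are executed in parallel on the cluster
long_eruptions <- filter(df, df$eruptions > 3)
by_waiting <- agg(groupBy(long_eruptions, "waiting"),
                  avg_eruption = avg(long_eruptions$eruptions))

# The same DataFrame can also be queried through the Spark SQL module
createOrReplaceTempView(df, "faithful_tbl")
top_waits <- sql("SELECT waiting, COUNT(*) AS n
                  FROM faithful_tbl
                  WHERE eruptions > 3
                  GROUP BY waiting
                  ORDER BY n DESC")

# Nothing is computed until an action such as head() or collect() is called
head(by_waiting)
head(top_waits)

sparkR.session.stop()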

SparkR also supports distributed machine learning using MLlib, making it easy for R programmers to take advantage of pre-built methods, in particular for solving complex mathematical optimization problems. Using SparkR, programmers and data scientists can access Spark's scale-out parallel runtime, leverage its input and output formats, and make calls directly into Spark SQL. This means that, rather than being restricted to working only on a single computer, R jobs can now run across multiple cores in single machines or across massively parallel clusters. DataFrames automatically inherit all optimizations made to the Spark computation engine in terms of code generation and memory management.

SparkR exposes the API for the resilient distributed dataset (RDD) as distributed lists in R, enabling input files to be read from HDFS and processed using lapply on an RDD. In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyPartition. Other supported RDD functions include operations such as reduce, reduceByKey, groupByKey and collect. SparkR automatically serializes the variables necessary to execute a function on the cluster. For example, if global variables are used in a function passed to lapply, SparkR automatically captures these variables and copies them to the cluster. SparkR also enables easy use of existing R packages inside closures: the includePackage command can be used to indicate packages that should be loaded before every closure is executed on the cluster.

To improve performance during the analysis of large datasets, SparkR performs lazy evaluation on DataFrame operations, whereby transformations (for example, map, filter and reduceByKey) are not executed until required by an action (for example, count, reduce or saveAs). This avoids the materialization of large intermediate results, and minimizes the number of passes over the data during complex multi-stage analyses.

Typical use cases for SparkR

SparkR will appeal to any R users who are constrained by single-threading or lack of memory on local resources. For some analytical tasks, the analysis itself may be relatively simple but the initial set of data may be too large. In such scenarios, users can use SparkR to perform data cleaning, aggregation and sampling using cluster resources rented in the cloud, then bring the downsampled dataset back to the local resources for building models or performing statistical analysis.

For ensemble learning, parameter tuning or bootstrap aggregation, users will likely need to execute a particular function in parallel across different partitions and then combine the results from each partition using an aggregation function. Using SparkR to run such partition-aggregate workflows could dramatically increase the speed in these scenarios. Even where the amount of input data is relatively small, using SparkR to parallelize the workflow will help in cases where the data is evaluated with a large number of parameter values.
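As an illustration of this partition-aggregate pattern, the sketch below uses spark.lapply() (available in SparkR from Spark 2.0 onwards) to evaluate a grid of model parameters in parallel; note that the lapply and lapplyPartition functions mentioned above come from the original AMPLab SparkR package and are not part of the public API in recent releases. The polynomial-degree grid and the built-in mtcars dataset are stand-ins for a real tuning problem.

library(SparkR)
sparkR.session(master = "local[*]", appName = "SparkR-parameter-tuning")

# Hypothetical grid of model complexities to evaluate in parallel
degrees <- 1:6

# Each list element is shipped to an executor; variables referenced inside
# the closure are serialized and copied to the workers automatically
scores <- spark.lapply(degrees, function(d) {
  fit <- lm(mpg ~ poly(wt, d), data = mtcars)   # stand-in model fit
  c(degree = d, aic = AIC(fit))
})

# Results come back as an ordinary local R list; pick the best setting locally
best <- scores[[which.min(sapply(scores, function(s) s["aic"]))]]
print(best)

sparkR.session.stop()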

Finally, when using machine learning algorithms it may be vital to use as much raw data as possible, rather than a downsampled subset, when training a machine learning system, or when building and training a recommendation engine, for example. In such scenarios, SparkR can handle the preprocessing of the data to generate the training features and labels, giving these as input to a machine learning algorithm (for example, one of those accessible to Spark via the MLlib library). The algorithm will then fit the data to a model, typically creating a much smaller entity suitable for running back on local resources to serve predictions. Performing the required cleaning, filtering and aggregation on such a large volume of data would be very difficult in standard R. Equally, model training is slow in R, given its single-threaded nature. Model training involves applying a number of parameters to learn which parameter values work best. SparkR enables this parameter tuning to run in parallel across multiple nodes in a cluster, potentially improving performance by several orders of magnitude when the task is to determine the best model for a given dataset.

In practical terms, IBM Data Science Experience, which provides an all-in-one web-based platform for data scientists, makes it easy to use SparkR on cloud resources without needing to set up clusters manually. This avoids all the potential headaches involved in getting corporate IT functions to set up the hardware with the correct software and tools: IBM Data Science Experience will automatically provide fully functional remote Spark clusters so that data scientists can simply fire up the resources they want when they need them. The seamless binding between R and Spark makes it possible to take a hybrid approach to analytics: using local resources when the datasets and the computational tasks are appropriately sized, and flexing up to use distributed processing and large memory resources in a rented cluster whenever required. This approach also keeps control of the costs, which are naturally higher for a large computational cluster than for a single local machine.

Using SparkR

The SparkR shell provides an interactive way to quickly prototype operations without having to develop a full program, package it and then deploy it. Users can choose to connect R programs to Spark clusters from RStudio, the R shell, Rscript and other R IDEs. The entry point into SparkR is the SparkSession, which connects R programs to Spark clusters and enables users to set options such as the application name, any dependent Spark packages and so on.

The SparkContext is what actually allows the application to use computational resources in a cluster (either via a resource manager such as YARN, or using Spark's own cluster manager). To create a SparkContext, users must first create a SparkConf, which stores the configuration parameters that the driver application will pass to the SparkContext. Some of these parameters define properties of the driver application, and some are used by Spark to allocate resources on the cluster. Spark also allows users to create an empty SparkConf so that the configuration can be defined dynamically.
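The listing below is a minimal sketch of starting a session from R. It assumes Spark 2.x, where sparkR.session() creates the underlying SparkContext and SparkConf on the user's behalf, so the configuration is supplied as arguments rather than built manually; the master URL, memory settings and package coordinate are illustrative placeholders rather than recommended values.

library(SparkR)

# Connect this R program to a cluster; all values below are example settings
sparkR.session(
  master        = "yarn",                                   # or "local[*]", or a spark:// URL
  appName       = "sparkr-whitepaper-demo",
  sparkPackages = "com.databricks:spark-avro_2.11:3.2.0",   # example dependent Spark package
  sparkConfig   = list(spark.executor.memory    = "4g",
                       spark.executor.instances = "10")
)

# ... interactive or scripted analysis goes here ...

sparkR.session.stop()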

For structured data processing, users can take advantage of the Spark SQL module, which uses Spark to distribute SQL queries across nodes in a cluster. Unlike the standard RDD API, the interfaces in Spark SQL include information about the structure of the data and the computation, and this information is used to perform additional optimizations. Spark uses the same execution engine regardless of the API or language used to express the computation. This provides enormous flexibility, enabling developers to choose their preferred way to express each transformation, and potentially mix and match different approaches within the same project.

Get up and running with SparkR

To get started with SparkR, please try IBM Data Science Experience, which provides a single point of access and control across multiple open-source data-transformation and analytics tools running on robust cloud resources. For more information, please visit datascience.ibm.com

Copyright IBM Corporation 2016. IBM Corporation, Route 100, Somers, NY 10589. Produced in the United States of America, December 2016. IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates. THE INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS" WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided. Please Recycle. CDW12367-USEN-00