Integrated Machine Learning in the Kepler Scientific Workflow System

Procedia Computer Science, Volume 80, 2016, Pages 2443-2448
ICCS 2016: The International Conference on Computational Science
doi:10.1016/j.procs.2016.05.545

Integrated Machine Learning in the Kepler Scientific Workflow System

Mai H. Nguyen*, Daniel Crawl, Tahereh Masoumi, and Ilkay Altintas
San Diego Supercomputer Center, University of California, San Diego
{mhnguyen, crawl, altintas}@sdsc.edu, shaida.masoumi@gmail.com

Abstract

We present a method to integrate multiple implementations of a machine learning algorithm in Kepler actors. This feature enables the user to compare accuracy and scalability of various implementations of a machine learning technique without having to change the workflow. These actors are based on the Execution Choice actor. They can be incorporated into any workflow to provide machine learning functionality. We describe a use case where actors that provide several implementations of k-means clustering can be used in a workflow to process sensor data from weather stations for predicting wildfire risks.

Keywords: scientific workflow, Kepler, machine learning, execution choice, k-means clustering

1 Introduction

Machine learning techniques present an approach for analyzing data to discover relationships and uncover hidden patterns. Machine learning can be applied to a wide range of scientific problems to gain insight into the processes captured in the data. We are developing a module in Kepler that allows users to incorporate machine learning functionality into a workflow. In this demonstration paper, we present actors from that module which provide a machine learning method based on different implementations. This feature offers an integrated view of a machine learning technique and enables the user to compare accuracy and scalability for various implementations of an algorithm without having to change the workflow.

* Correspondence should be sent to mhnguyen@sdsc.edu. The work presented in this paper was supported in part by NSF 1331615 under the CI, Information Technology Research, and SEES Hazards programs, and NSF DBI-1062565 under the CI Reuse and Advances in Bioinformatics programs.

Selection and peer-review under responsibility of the Scientific Programme Committee of ICCS 2016. © The Authors. Published by Elsevier B.V.

2 Approach

To present our approach, we start with a description of the machine learning algorithm we are modeling in Kepler, followed by descriptions of the actors that implement this algorithm.

2.1 k-means Clustering

The algorithm that we are interested in modeling is k-means clustering [4]. This is a technique used to segment samples into k clusters such that the sum of squared distances from the samples to their assigned cluster centers is minimized. In other words, the objective function of k-means is

    \arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2

where x is an n-dimensional sample, S_i is one of the k sets of grouped samples in S = {S_1, ..., S_k}, and \mu_i is the mean of the samples in S_i. k-means uses an iterative method that alternates between assigning samples to clusters and updating the cluster centers. The standard k-means algorithm assigns each sample to the cluster whose center is nearest in Euclidean distance. The initial cluster centers are randomly selected. Since clustering results are sensitive to initial conditions, several random starts are commonly used to obtain the best result for a given value of k. The algorithm stops when the assignment of samples to clusters no longer changes or the specified maximum number of iterations has been reached. (A minimal code sketch of this procedure is given below.)
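The following is a minimal, self-contained Python sketch of the iterative assign/update procedure described in Section 2.1. It is illustrative only; it is not the code used by any of the kmeans-all sub-workflows, and the function and variable names are ours.

```python
import numpy as np

def kmeans(samples, k, max_iter=100, seed=0):
    """Plain k-means: assign each sample to its nearest center, then
    recompute each center as the mean of its assigned samples."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    # Randomly selected initial cluster centers (Section 2.1).
    centers = samples[rng.choice(len(samples), size=k, replace=False)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: smallest Euclidean distance to a cluster center.
        dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Stop when the assignment of samples to clusters no longer changes.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Update step: each center becomes the mean of its assigned samples
        # (empty clusters keep their previous center).
        for j in range(k):
            members = samples[assignments == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Sum-of-squares error (SSE), the quantity reported by kmeans-all.
    sse = float(((samples - centers[assignments]) ** 2).sum())
    return centers, assignments, sse
```

In practice this would be run several times with different seeds and the run with the lowest SSE kept, matching the multiple random starts mentioned above.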

2.2 Execution Choice Actor

The actor that we have implemented in Kepler to perform k-means clustering is called kmeans-all ("kmeans" for the algorithm, and "all" for all implementations). This actor is an instance of the Execution Choice actor. The Execution Choice actor allows different sub-workflows to be used to implement a particular piece of functionality. Each sub-workflow is independent of the others, and can use different tools to implement the same functionality represented by the Execution Choice actor. The Execution Choice actor is available in Kepler as part of the biokepler 1 suite, which contains actors for bioinformatics applications.

1 biokepler: http://www.biokepler.org/

2.3 kmeans-all Actor

Being an instance of the Execution Choice actor, kmeans-all allows for different implementations of k-means clustering. As currently implemented, kmeans-all offers k-means clustering based on four different machine learning platforms: R, MLlib, Mahout, and KNIME.

R 2 is a language and environment for statistical computing and graphics. It is widely used for statistical analysis and machine learning. MLlib 3 is a scalable machine learning library built on Spark, which is a general framework for cluster computing. Spark uses a distributed in-memory architecture that provides fast and scalable processing of iterative operations, which is well suited to machine learning algorithms. Spark and MLlib have recently received much attention in the machine learning community because they offer machine learning algorithms that can scale up to very large datasets.

2 R: https://cran.r-project.org/
3 MLlib: http://spark.apache.org/mllib/

Mahout 4 is another scalable machine learning library. Most of Mahout's machine learning algorithms are implemented on top of Apache Hadoop using the map-reduce programming paradigm. Although Mahout has now incorporated Spark into its newer algorithms, the implementation of k-means that we use in our kmeans-all actor is based on map-reduce. KNIME 5 is a data analytics platform that provides a graphical user interface-based approach to modeling. KNIME offers modules for data input/output, analysis, and visualization. Using a drag-and-drop approach, KNIME users can put these modules together in a workflow.

4 Mahout: http://mahout.apache.org/
5 KNIME: https://www.knime.org/

At the top level, kmeans-all presents a very simple workflow that takes as input a text file of samples, performs k-means clustering, and outputs the sum-of-squares error (SSE), cluster sizes, and cluster centers. Figure 1 shows kmeans-all at the top level. Parameters that are common across all implementations of k-means, such as the number of clusters and the number of iterations, are available in the Shared Options tab in kmeans-all. Input and output file parameters can also be specified in the Shared Options tab since they are used by all implementation choices. In addition, the implementation to use for clustering can be specified using the drop-down menu for the Choice parameter. Parameters specific to each implementation can be specified in implementation-specific tabs. Right-clicking on kmeans-all and selecting Open Actor allows navigation to the sub-workflows of the different implementations.

Figure 1: Top level of kmeans-all actor

The R implementation of k-means is shown in Figure 2. This sub-workflow uses the following components:

ReadTable: Uses R's read.table() to read in the input data.
kmeans: Uses R's kmeans() to perform k-means clustering.
clusplot: Uses R's clusplot() to plot cluster samples and centers.
radialplot: Uses various R commands to create a plot of the cluster center values.

The sum-of-squares error, cluster sizes, and cluster centers are output from the kmeans actor and passed back to the top level of kmeans-all for display.

Figure 3 shows the MLlib implementation of k-means. In our machine learning module, we have actors that use MLlib's APIs to implement machine learning algorithms. The CreateVectorRDD actor creates an RDD of vectors from the input data. An RDD (resilient distributed dataset) is a programming abstraction in Spark used to represent a logical collection of items that can be distributed across multiple nodes and processed in parallel. This RDD of vectors is then passed to the KMeansModel actor, which uses MLlib's API to perform k-means clustering. (A minimal sketch of the underlying MLlib calls is shown below.)
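The sketch below shows, in PySpark, roughly what the CreateVectorRDD and KMeansModel actors do with MLlib's RDD-based API (the Spark 1.5-era API distributed with biokepler 1.2). The file name and parameter values are illustrative; this is a sketch of the underlying library calls, not the actors' actual implementation.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-mllib-sketch")

# Parse a whitespace-delimited text file of samples into an RDD of vectors,
# roughly what the CreateVectorRDD actor produces.
data = sc.textFile("weather_samples.txt") \
         .map(lambda line: [float(v) for v in line.split()])

# Train the clustering model; MLlib distributes the iterations over the RDD.
model = KMeans.train(data, k=4, maxIterations=20)

# The outputs reported back to the top level of kmeans-all:
sse = model.computeCost(data)                             # sum-of-squares error
centers = model.clusterCenters                            # cluster centers
sizes = data.map(lambda p: model.predict(p)).countByValue()  # cluster sizes

print(sse)
print(dict(sizes))
for c in centers:
    print(c)

sc.stop()
```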

As with all sub-workflows, the resulting sum-of-squares error, cluster centers, and cluster sizes are passed back to the top level of kmeans-all for display.

Figure 2: R sub-workflow in kmeans-all actor
Figure 3: MLlib sub-workflow in kmeans-all actor

For brevity, we only briefly describe the remaining sub-workflows here. The Mahout sub-workflow uses a series of command-line calls to Mahout to perform k-means clustering. These actors convert the input data in a text file to the format that Mahout requires, perform k-means clustering, and convert the results back to text. The results are then parsed and post-processed to obtain the sum-of-squares error, cluster sizes, and cluster centers, which are passed back to the top level of kmeans-all. (A sketch of this kind of post-processing is given below.)
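As an illustration of the post-processing step just described, the following hypothetical sketch recomputes the SSE and cluster sizes once the cluster centers have been parsed from Mahout's text output. The parsing itself depends on Mahout's output format and is omitted; the function name and arguments are ours, not part of the Mahout sub-workflow.

```python
import numpy as np
from collections import Counter

def summarize_clusters(samples, centers):
    """Given the original samples and the parsed cluster centers, recover the
    sum-of-squares error and cluster sizes reported by kmeans-all."""
    samples = np.asarray(samples, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Assign each sample to its nearest center (Euclidean distance).
    dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    sse = float(((samples - centers[assignments]) ** 2).sum())
    sizes = dict(Counter(assignments.tolist()))
    return sse, sizes
```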

For the KNIME sub-workflow, a separate KNIME workflow for k-means clustering is built outside of Kepler. This KNIME workflow is then invoked within Kepler through a command-line call to KNIME using the External Execution actor. The results from the KNIME workflow are then parsed and post-processed to obtain the desired output for the top level of the kmeans-all actor.

The demo workflow shown in Figure 1 above is available in Kepler as part of biokepler 1.2. biokepler 1.2 is distributed with Spark 1.5.0, so the MLlib sub-workflow of kmeans-all requires no additional software. To execute the R, KNIME, and Mahout sub-workflows, the corresponding software needs to be installed. The kmeans-all demo workflow was implemented using R 3.1.3, KNIME 2.11.2, and Mahout 0.9.

2.4 kmeans-all-apply Actor

In most machine learning tasks, there are two phases: the training phase, in which the model is built from the data, and the test phase, in which the model is applied to new data, i.e., data that was not used for training the model. The kmeans-all actor described above implements the training phase. Another Kepler actor, kmeans-all-apply, implements the test phase, where the model is applied to new data. The kmeans-all-apply actor has a similar design to the kmeans-all actor, using the Execution Choice actor as a base to provide different implementations for applying a clustering model. (A minimal sketch of this apply phase is given at the end of Section 3.)

3 Application

kmeans-all and kmeans-all-apply can be used in any workflow in which clustering is needed. The availability of several implementations of k-means allows the user to compare the accuracy as well as the scalability of these different implementations. Accuracy is, of course, a must-have requirement of any machine learning technique, but scalability is also important as dataset size increases. With kmeans-all and kmeans-all-apply, the user can compare both accuracy and speedup between R and KNIME, which are familiar tools with limited scalability, and the scalable implementations provided by MLlib and Mahout, which are intended for distributed processing but are newer tools.

A use case that we are working on is WIFIRE, a project to build an end-to-end cyberinfrastructure for wildfire modeling and prediction [2]. One of the data sources for WIFIRE is sensor data from weather stations. Clustering can be applied to this sensor data to categorize and detect weather patterns associated with Santa Ana conditions [3], which greatly increase the wildfire risk of a region. With the kmeans-all actor, the user can compare clustering results using R and MLlib on a dataset that fits into memory. For larger datasets, the MLlib implementation can be selected. The only change needed to switch between implementations is to specify the implementation choice for the Choice parameter in kmeans-all. Thus, the same workflow can be used for datasets of any size, using the appropriate algorithm implementation, without having to change the workflow.
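To make the training/test distinction concrete, here is a minimal PySpark sketch of the apply phase that kmeans-all-apply models, using MLlib's RDD-based API. Persisting the model with save/load and the file paths shown are our own assumptions for illustration, not the actor's actual mechanism.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeansModel

sc = SparkContext(appName="kmeans-apply-sketch")

# Reload a model produced during the training phase (assumes it was saved
# earlier with model.save(sc, "kmeans_model")).
model = KMeansModel.load(sc, "kmeans_model")

# New samples that were not used for training, e.g. fresh weather-station
# readings in the WIFIRE use case.
new_data = sc.textFile("new_weather_samples.txt") \
             .map(lambda line: [float(v) for v in line.split()])

# Assign each new sample to one of the existing clusters.
assignments = new_data.map(lambda p: model.predict(p))
print(assignments.countByValue())

sc.stop()
```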

4 Related Work

Several products exist that provide a GUI-driven workflow approach to machine learning. Two of the more popular are KNIME and RapidMiner 6. Both have open-source versions of their products. However, for execution on big data platforms such as Hadoop 7 and Spark, both products require a commercial version or extension. In addition, RapidMiner and KNIME are both dedicated to predictive analytics. These products provide a workflow approach for activities centered on data analysis and machine learning, not for general scientific activities, which is the goal of Kepler.

6 RapidMiner: http://rapidminer.com/
7 Hadoop: https://hadoop.apache.org/

PILOT (Ptolemy Inference, Learning, and Optimization Toolkit) [1] is a machine learning toolkit implemented on top of Ptolemy II 8, the actor-oriented design framework that also forms the foundation of Kepler. PILOT provides state-space modeling and Bayesian inference techniques that can be applied to online robotic and control applications. As currently implemented, Ptolemy II, and thus PILOT, do not support parallel processing of very large datasets such as with Hadoop and Spark.

8 Ptolemy II: http://ptolemy.eecs.berkeley.edu/ptolemyii/

5 Conclusions and Future Work

We have presented an approach to integrating alternative implementations of a machine learning technique in a workflow. This allows the user to compare accuracy and scalability results for various implementations of a machine learning algorithm without having to change the workflow.

As described above, kmeans-all and kmeans-all-apply can be used to build and apply a single clustering model. In the WIFIRE use case, they can be used to perform cluster analysis on a single weather station at a time. Kepler provides a way to apply data parallelism to a problem using the Distributed Data Parallel (DDP) Director [5]. We plan to use the DDP Director to add parallel processing to kmeans-all and kmeans-all-apply in order to construct and apply multiple models at once. Referring back to the WIFIRE use case, we can then apply data parallelism to cluster data from multiple weather stations at the same time with kmeans-all, and to apply these models to new data from several weather stations simultaneously with kmeans-all-apply.

The objective of our machine learning module in Kepler is to provide a way to incorporate machine learning functionality in general into a workflow. To this end, we have created actors that implement other machine learning algorithms in addition to k-means. In the WIFIRE use case, scaling the sample values is beneficial so that no single feature dominates the clustering, and handling samples with missing values is important since some clustering algorithms do not work with incomplete cases. We have therefore created actors to perform these operations and included them in the WIFIRE workflow. Our goal is to offer a variety of machine learning techniques for prediction and descriptive analysis, as well as for data preparation tasks such as scaling. Any of these actors can be enhanced to integrate several implementations, similar to kmeans-all and kmeans-all-apply, as the need arises.

References

[1] I. Akkaya, S. Emoto, and E. Lee. PILOT: An actor-oriented learning and optimization toolkit for robotic swarm applications. 2nd International Workshop on Robotic Sensor Networks, 2015.
[2] I. Altintas, J. Block, R. de Callafon, D. Crawl, C. Cowart, A. Gupta, M. Nguyen, H.-W. Braun, J. Schulze, M. Gollner, A. Trouve, and L. Smarr. Towards an integrated cyberinfrastructure for scalable data-driven monitoring, dynamic prediction and resilience of wildfires. International Conference on Computational Science, 2015. New York: Elsevier.
[3] R. Fovell. The Santa Ana Winds. 2002. Retrieved from http://people.atmos.ucla.edu/fovell/asother/mm5/santaana/winds.html.
[4] J. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., 1975.
[5] J. Wang, D. Crawl, and I. Altintas. A framework for distributed data-parallel execution in the Kepler scientific workflow system. Workshop on Advances in the Kepler Scientific Workflow System and its Applications at ICCS, 2012. New York: Elsevier.