Integration with Popular Big Data Frameworks in Statistica and Statistica Enterprise Server Solutions

Statistica White Paper


Siva Ramalingam
Thomas Hill
TIBCO Statistica

Table of Contents

Introduction
Spark Support in Statistica
  Requirements
  Statistica Spark Workspace Nodes
  Out-of-the-Box Point-and-Click Nodes (No Coding)
  Spark Scala Script Node
  Example Spark Workspaces
H2O Support in Statistica
  Requirements
  Statistica H2O Workspace Nodes
  H2O Example Nodes
  Example H2O Workspaces
In-Database Analytics Support in Statistica
  Requirements
  Statistica In-database Workspace, Overview
  Statistica In-database Workspace, Nodes
  List of Available In-database Nodes in Statistica
Final Comments, and Some Caveats
Putting it Together: Architectures for Big Data Analytics
  Spotfire
  Statistica Enterprise, Version Control, GMP/GxP, and Validation

Introduction

For decades, Statistica has provided comprehensive, powerful, and scalable analytics to customers across various industries. However, as data volumes and velocity increase exponentially, there is also the need for a flexible architecture that can move analytics to the data (into the database or data repository) and orchestrate complex analytic pipelines that combine multiple data sources with in-database, in-memory parallel, and in-server analytics when that is most efficient and useful. Statistica provides such an architecture, with capabilities to push computations to cluster computing environments from within a simple and powerful workspace interface. This interface provides flexibility with respect to:

- How and where to source data
- Where to perform computations
- Whether to use point-and-click low-code interfaces or any of the popular big-data scripting/programming environments (Python, R, Scala), and
- How and where to push results for subsequent processing or visualization

Users can design, code, and visualize their big data pipelines on the Statistica client while offloading the expensive computations to the server (Spark, H2O, or a database), utilizing the best of both worlds.

Spark. Spark is an open source distributed computing framework that offers massively parallel in-memory processing for SQL, streaming, machine learning, and graph analytics, exposed through Java, Scala, Python, and R APIs. The parallel operations are facilitated by the RDD (resilient distributed dataset) abstraction, which is immutable and provides lazy, fault-tolerant operations. Spark requires a distributed file system such as the Hadoop Distributed File System (HDFS) as well as a cluster manager like Hadoop YARN. Spark supports various advanced analytics and machine learning through spark.ml, with algorithms that are optimized for distributed in-memory execution and scalability.

H2O. H2O is an open source deep learning platform focused on artificial intelligence applications for enterprises. It offers a variety of machine learning models that can be trained using APIs in R, Python, Java, Scala, and JSON, as well as H2O's Flow GUI. It also provides access to Spark's machine learning library through the Sparkling Water API. Like Spark, it is a distributed computing platform that can run in the cloud, on on-premises clusters, or on single machines, using the Hadoop Distributed File System.

In-database analytics. In-database analytics refers to the process of integrating data analysis into data storage systems. For many enterprises today, data security is of the highest priority, and in many cases there is great reluctance to move data out of the storage system into a separate processing application. In-database analytics can provide both better data security and significant performance improvements, particularly with the introduction of column-oriented databases designed specifically for this use case.
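As a concrete illustration of the lazy, fault-tolerant Spark execution model described above, consider the following minimal Scala sketch. This is a sketch only: it assumes an existing SparkSession named spark and uses a hypothetical HDFS path.

    // Transformations are lazy: no cluster job runs when they are declared.
    val events = spark.read.parquet("hdfs:///data/events")   // hypothetical path
    val errors = events.filter(events("status") === "ERROR") // lazy transformation
    println(errors.count())                                  // action: triggers execution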

Spark Support in Statistica

Requirements

1. The user has access to a Spark instance
   - Running on Debian/Ubuntu, RedHat/CentOS, or MacOS
   - Version or greater
2. The Spark instance has a running Livy server
   - Version 2 or greater
   - That was built against Spark or greater
3. An active network connection

Statistica Spark Workspace Nodes

Statistica provides access to the Spark engine via point-and-click data workspace nodes (icons in the workspace that provide access to the GUI) used to configure how data are sourced and analyzed; the user can also execute Scala programs directly. The implementation of the workspace nodes relies on Livy, the open source REST interface to Apache Spark. On the Statistica client, the user will need to enter the Livy server URL under Home -> Tools -> Options -> Server/Web tab, as shown in Figure 1.

Figure 1: Statistica Server Configuration Menu

The Livy server URL is specified in three parts. Statistica supports basic authentication, where you can specify a username and password, followed by the server URL and the port on which Livy is running; for example, assuming Livy's default port, a URL of the form https://user:password@livyserver:8998. If the server has no authentication set up, you can omit the username and password from the URL.

Out-of-the-Box Point-and-Click Nodes (No Coding)

Statistica allows users to develop and manage custom-developed Spark Scala scripts; this is described later in this paper. Statistica also ships with the following predefined (script) nodes that can be used to create Spark-based analytics workflows (pipelines) without coding. The list below briefly describes each of these predefined nodes and identifies the specific Spark (spark.ml) APIs used in the analyses.

Spark Data Node
This node reads a file in csv, json, or parquet format from HDFS or the file system and produces a Spark DataFrame that can be connected to downstream spark.ml analyses.

Spark Decision Tree Classifier
This node runs a decision tree classification analysis on a user-specified input DataFrame. A categorical dependent variable along with continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Decision Tree Regressor
This node runs a decision tree regression analysis on a user-specified input DataFrame. A double label column along with continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Feature Selection
This node runs feature selection on a user-specified input DataFrame. The algorithm estimates Chi-square (predictor importance) statistics for each dependent/predictor pair. Missing data are casewise deleted for each pair. Continuous predictors are discretized using QuantileDiscretizer. Categorical predictors are indexed using StringIndexer. Chi-square values are computed using chisqtest.

Spark Feature Selection Pipeline
This node runs feature selection on a user-specified input DataFrame. The algorithm sets up a Spark ML pipeline to select the user-specified number of best predictors for a given dependent (y) variable. The input data should not contain missing values.

Spark Generalized Linear Model
This node runs a generalized linear regression analysis on an input DataFrame. A binary (y) label column, continuous predictors, and an optional weight variable are the expected inputs. The input data should not contain missing values.

Spark Index Categoricals
This node uses the StringIndexer API from Spark to convert the chosen string variables into columns with numeric indices. A column x becomes x_idx. The input data should not contain missing values. This node produces a DataFrame that can be fed to downstream spark.ml analyses.

Spark Linear Regression
This node runs a linear regression analysis on a user-specified input DataFrame. A double label column, continuous predictors, and an optional weight variable are the expected inputs. The input data should not contain missing values.

Spark Logistic Regression
This node runs a binomial logistic regression analysis on a user-specified input DataFrame. A binary label column, continuous predictors, and an optional weight variable are the expected inputs. The input data should not contain missing values.

Spark Make Design Matrix
This node uses the RFormula API to produce a main-effects design matrix. A single dependent variable and any number of continuous and/or categorical variables are the expected inputs. The input data should not contain missing values. This node produces a design matrix as a downstream document that can be fed to further spark.ml analyses.

Spark Random Forest Classifier
This node runs a random forest classification analysis on a user-specified input DataFrame. A categorical dependent variable and continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Random Forest Pipeline
This node runs a random forest classification analysis on a user-specified input DataFrame and performs feature selection: the algorithm sets up a Spark ML pipeline to select the top predictors, using a Spark Random Forest Classifier. A categorical dependent variable and continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Random Forest Regressor
This node runs a random forest regression analysis on a user-specified input DataFrame. A continuous dependent (y) variable and continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Reference Code Categoricals
This node uses the StringIndexer and OneHotEncoder APIs to convert the chosen string variables into columns of binary vectors with an _idx_vec extension. A column x becomes x_idx_vec on output. The input data should not contain missing values. This node produces a DataFrame that can be fed to further spark.ml analyses.

Spark SVM Classification
This node runs a classification SVM analysis on a user-specified input DataFrame. A binary label column and continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Scala Script Node

To develop and submit an actual Scala script to the Spark engine, the user can use the Statistica Data Miner workspace node called Spark Scala Script. The node is accessible from the menu under Big Data Analytics -> Spark, as shown in Figure 2.

Figure 2: Big Data Analytics Spark Menu

This node is the primary access point to the Spark engine from the Statistica client. It is designed using the same architecture as other Statistica code nodes (Python, C#), so it comes with support for using Statistica spreadsheets as input, creating and using user-defined parameters, specifying the code to be executed when the node is run (directly in the node or from a file in the file system), and bringing results back into Statistica spreadsheets to make them available for downstream analyses.

The following Statistica-specific variables are used in the Spark Scala Script node:

- stain, of type List[DataFrame], is used to access upstream datasets.
- staout, of type DataFrame, is used to make a Spark DataFrame accessible to downstream Spark nodes.
- staresults, of type List[DataFrame], is used to bring Spark DataFrames back into the node's reporting documents collection as Statistica spreadsheets.

User-defined node parameters are available as predefined constants and can be referenced in the Scala script by their user-specified name.
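A minimal sketch of how these variables might be used together, assuming a single upstream node supplies a DataFrame and that staout and staresults accept direct assignment (the transformation shown is purely illustrative):

    // Access the first upstream dataset supplied by Statistica.
    val df = stain(0)
    // Compute a simple numeric summary as an illustrative result.
    val summary = df.describe()
    // Pass the original DataFrame to downstream Spark nodes.
    staout = df
    // Return the summary to Statistica as a spreadsheet in the
    // node's reporting documents collection.
    staresults = List(summary)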

Figure 3 shows an example Spark Scala node in detail. The Parameters tab contains the user-created parameters that specify the options for importing the file crabs.csv from the cluster file system into a Spark DataFrame:

- Header specifies that the first row of the file contains variable names/headers
- Infer schema specifies that Spark should automatically infer the variable types
- Path to file specifies where the file resides on disk (can also be an HDFS file path)
- File type specifies whether the file is a csv, json, or parquet file
- Casewise delete MD specifies that the imported DataFrame should remove all cases that contain missing data

Figure 3: Spark Scala Node Parameters Tab

Figure 4: Spark Scala Code Tab

The Scala code utilizes the user-defined parameters to import the file into Spark as a DataFrame and makes it available to downstream analyses by assigning it to the staout variable. The parameter mappings are as follows:

- Header: getheader
- Infer schema: inferschema
- Path to file: filepath
- File type: filetype
- Casewise delete MD: nadrop
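Putting the mappings above together, here is a minimal sketch of the kind of import code the Code tab might contain. It is a sketch under stated assumptions, not the node's actual script: it assumes a SparkSession named spark and that the parameter constants arrive as Boolean and String values.

    // Read the file using the user-defined node parameters.
    var df = spark.read
      .format(filetype)                   // "csv", "json" or "parquet"
      .option("header", getheader)        // first row contains variable names
      .option("inferSchema", inferschema) // let Spark infer variable types
      .load(filepath)
    // Casewise-delete missing data if requested.
    if (nadrop) df = df.na.drop()
    // Make the DataFrame available to downstream Spark nodes.
    staout = df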

Example Spark Workspaces

Statistica ships with several example workspaces to illustrate different workflows that can be designed with Spark Scala Script nodes. These workspaces are available in the Examples/Workspaces directory under the installation directory, as shown in Figure 5.

Figure 5: Spark Example Workspaces

Figure 6: Spark Example Regression Workspace

H2O Support in Statistica

Requirements

1. The user has access to a running H2O instance
   - Running on a cluster or a single host
   - Has Sparkling Water 2.1 or greater
2. A network connection, if the H2O instance is on a remote server

Statistica H2O Workspace Nodes

Statistica nodes utilize the H2O REST API, so the H2O server needs to be accessible over the network. Typically, the H2O server IP address and port are presented to the user once H2O is started. This information is needed for the H2O nodes (H2O Data) to connect to the H2O environment.

Statistica ships with predefined H2O nodes, which can be accessed from the Big Data Analytics -> H2O tab in the menu, as shown in Figure 7.

Figure 7: H2O Analysis Nodes

These nodes are completely parameterized for a point-and-click UI and implemented in IronPython, so the user just needs to specify the connection information and model parameters to run the specified analysis on the remote H2O instance. Here are some additional details regarding the H2O Data node.

Figure 8: H2O Data Node, Parameters Tab

The H2O URL parameter specifies where the remote H2O instance is hosted. The Data source path parameter specifies where the data file resides (Amazon S3, HDFS, and local files on the server are supported). The Code tab of the node shows the two parameters available to the IronPython script (and H2O), referenced as NodeParameters["H2O_url"] and NodeParameters["dataSourcePath"]. The script uses these parameters to import the data into the H2O server and returns a Statistica spreadsheet showing a sample of the dataset, along with a report detailing the dataset's metadata.

Figure 9: H2O Data Node, Code Tab

H2O Example Nodes

H2O Data Mapping
This node allows the user to specify a custom schema for the H2O DataFrame. This can be helpful in cases where the user deems the data types inferred by H2O to be unsuitable.

H2O Gradient Boosting Machine (GBM)
A GBM (Gradient Boosting Machine) is an ensemble of either regression or classification tree models. Both are forward-learning ensemble methods that obtain predictive results using gradually improved estimations.

H2O Generalized Linear Modeling (GLM)
Generalized linear models (GLMs) are an extension of traditional linear models. GLMs estimate regression models for outcomes following exponential distributions. The GLM suite includes Gaussian, Poisson, Binomial, Multinomial, and Gamma regression.

H2O Prediction Node
In the H2O Prediction Node, you can make predictions on new data sets using the trained models that are created when running the H2O GBM or GLM nodes.

Additional Data Science H2O Algorithms
The following additional data science algorithms are also provided:

- H2O Deep Learning
- H2O DRF (Distributed Random Forest)
- H2O PCA (Principal Component Analysis)
- H2O K-Means

Example H2O Workspaces

Statistica also ships with several example workspaces to illustrate the possible workflows that can be constructed with H2O nodes. These workspaces are available in the Examples/Workspaces directory under the installation directory, as shown in Figure 10.

Figure 10: H2O Example Workspaces

In-Database Analytics Support in Statistica

Requirements

The user has access to:
1. A Statistica Enterprise Server data configuration
2. Data residing in SQL Server, with appropriate permissions (create, update, and drop tables on tempdb for some algorithms)

Statistica In-database Workspace, Overview

The In-database analytic nodes are code-free workspace nodes that are completely parameterized and ready to use out of the box. These nodes are accessible from the Big Data Analytics tab of the main menu, under the In-Database Analytics section, as shown in Figure 11.

Figure 11: In-SQL-Database Processing Node

The In-Database Analytics nodes provide options to bring the analysis to the data stored in a database. Internally, the nodes implement the respective computations and analyses by automatically creating suitable queries that move time-consuming data processing operations into the database. For example, to compute covariances and analyses that depend on covariance matrices, it is possible to compute the sums-of-squares-and-cross-products (SSCP) matrices via queries in-database. Depending on the specific databases that are targeted (and their configuration), these queries can execute in parallel and in-memory, for very efficient in-memory parallelized analytics. Once the SSCP matrix is computed, Statistica (server) can then use that matrix to perform correlation/covariance-based analytics such as stepwise regression. In that manner, very large datasets can be analyzed using almost entirely in-database computational resources.
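As a hypothetical illustration of this pattern (the table and column names are placeholders, not Statistica's actual generated SQL), a single aggregate query can accumulate all the sums needed for a covariance, so that only a handful of numbers ever leave the database:

    // Placeholder SQL of the kind that could compute SSCP terms in-database.
    val sscpQuery = """
      SELECT COUNT(*)   AS n,
             SUM(x)     AS sx,  SUM(y)     AS sy,
             SUM(x * x) AS sxx, SUM(y * y) AS syy,
             SUM(x * y) AS sxy
      FROM analysis_table"""
    // From these six values: cov(x, y) = (sxy - sx * sy / n) / (n - 1)

Correlation and regression coefficients can then be derived from the same sums on the Statistica server side, which is what allows analyses such as stepwise regression to proceed without moving the raw data.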

Statistica In-database Workspace, Nodes

These nodes are broken down into logical groups: Data Access, Data Management, and Analysis nodes. The starting point of any In-database analytic workflow is the In-Database Enterprise Data Configuration node, which uses a database connection object (see Statistica Enterprise Server database connection and data configuration). As illustrated in the image below, the In-Database Enterprise Data Configuration node takes an Enterprise Data Configuration URL as input and provides options for the user to bring back a sample of the data (for review by the user) from the database in a Statistica spreadsheet and to specify the type of database the data resides in (if not specified, Statistica will try to automatically infer the database type). The database connection object created by this node is kept open ("alive") for further analyses downstream.

Figure 12: In-Database-Processing Node, Specifications Tab

Figure 12: In-Database-Processing Example Workspace

The image above illustrates a complete In-database analytic workflow involving the Data Access, Data Management, and Analysis nodes.

List of Available In-database Nodes in Statistica

In-database Write to Database
This node provides the user the ability to move all or a subset of an existing table in the database into another table (persistent or temporary) in the same database, by creating, overwriting, or appending to the destination table.

In-database Filter Duplicate Cases
This node provides the user the ability to filter out records with identical values for user-specified columns. It returns a filtered data stream as output, containing the user-selected columns.

In-database Random Sample Filtering
This node follows the functionality of the Random Sampling module in Statistica, with a few important differences: the In-Database Random Sample Filtering node does not support sampling-with-replacement, oversampling (sampling fractions > 1.0), or split-node (stratified) random sampling. The transformations to the data are not applied until the process reaches the In-Database analytics node, which operates on the modified query. This means that if multiple downstream nodes are connected, they may produce different results, since the random sampling happens only when the main analytic procedure is executed.

In-database Sort
This node allows the user to sort the columns of the data table. The user can specify the sort direction for each column and the position of each column in the output stream.

In-database Correlation Matrix
This node allows the user to generate correlation and partial correlation matrices for specified data table columns as Statistica spreadsheets.

In-database Descriptive Statistics
This node enables the user to obtain descriptive statistics for specified data table columns. The node provides a subset of the functionality of the regular Descriptive Statistics node that operates on Statistica spreadsheets.

In-database Logistic Regression
This node enables the user to run a logistic regression analysis on a specified binary dependent (y) variable with continuous and categorical predictor variables. The node returns a coefficients spreadsheet as output.

In-database Multiple Regression
This node enables the user to run a multiple regression analysis on specified continuous dependent and continuous predictor columns of the data table. The node produces a coefficients spreadsheet as output.

Final Comments, and Some Caveats

The features described in the sections above, combined with the capabilities of Statistica, provide a powerful and flexible platform for big data exploration and analysis. The architecture provides options for analysts to access multiple data processing platforms in the same flow, to move selected analyses to big-data repositories when that is useful or necessary, or to move data to the Statistica server to perform the analysis on dedicated hardware when the respective databases should not be burdened with analytics, so that they can continue to support other business-critical functions. Users planning to explore these features are advised to consider the following possible issues.

Spark nodes

Data export/import between the Spark engine and Statistica is managed through REST calls, and users should be mindful that the data size should be on the order of a few megabytes at most in these cases. An alternative is to use the HDFS import/export nodes available in Statistica and design your script to read from and write to HDFS. Another item to be mindful of is resource allocation for each Spark analysis: at the moment, Statistica requires that resource allocation be configured using Livy server-side configuration options.

H2O nodes

Sparkling Water is not shipped with Statistica and needs to be deployed, configured, and started before the Statistica H2O workspace nodes can be used. Data export/import between the H2O server and Statistica is managed through REST calls, and users should be mindful that the data size should be on the order of a few megabytes at most in these cases. Of course, big data are typically accessed directly in HDFS or in other suitable data repositories.

In-database nodes

Performance of the in-database analyses relies on the database engine and the amount of data stored. It should be noted, however, that some tasks can take a considerable amount of time and impact the performance of the database.

Database systems impose limitations that the in-database analytics nodes have to satisfy, and these limitations vary by database type. For example, the limitations for SQL Server are shown below (see Maximum Capacity Specifications for SQL Server, extracted 8/14/2015):

- Columns per nonwide table: 1,024
- Columns per SELECT statement: 4,096
- Bytes per row: 8,060

Also note that direct interaction with databases requires certain access privileges, in particular for analyses that may create intermediate temporary tables to support iterative computations (e.g., logistic regression).

Bringing analytics to the database helps improve performance and makes the solution scalable. However, it also moves calculations to a different engine. As such, the approach leverages the database's built-in functions and relies on the number representation within the database. Databases differ in data types and functions. Even though the in-database analytics nodes perform the necessary type conversions and utilize patterns common to the majority of SQL dialects, it is hardly possible to maintain the same level of accuracy across all databases. To obtain the highest level of accuracy and consistency, it is recommended to import the data into Statistica and run the analysis with the common Statistica modules.

Moving analytics to the data, moving data to Statistica analytics

In practical applications it is sometimes not obvious whether big-data analytics can be performed more efficiently on the database side (in-database), or whether it is easier and faster to extract the data and move them to the Statistica analytics server for analysis. In extreme cases, the choice is obvious: if the data are truly huge and exceed the capacity for storage on the Statistica server side, then in-database analytics (any of the approaches discussed) is the way to create repeatable, scalable analytic workflows and pipelines. Likewise, when the data sizes are in the megabyte range, it is often slower to move the analytics to the database than to import or move the data to the Statistica server for analysis. Obviously, local analytics on the Statistica side against small to moderately large datasets can run faster and be more interactive than analytics run inside a complex server farm (given the overhead, etc.). Given sufficient bandwidth and speed, even sizeable data sets can be analyzed faster on the Statistica server side (or even the desktop), in particular when the main storage system is also heavily used to support other business-critical functions, as is often the case.

Moving Information Extraction into Big Data Repositories

Regardless of the specific database platform and in-database analytics that are used, a common pattern for big-data analytics can best be characterized as a "funnel." Data is growing exponentially in most organizations, but the information that is critical for the successful operation of an organization or process is not growing at that rate. Put another way, big-data analytics can be thought of as information extraction against continuously growing data volumes. A common approach is to push initial data preparation and information extraction into the database (to the data), but perform the final analyses on a dedicated Statistica analysis server. This "funnel" model is summarized in Figure 13.

Figure 13: "Funnel" Modeling Pipeline Architecture

In this architecture, data such as error log files, calibration logs, raw customer data and narratives, etc. are stored in HDFS. An initial in-database process imposes a schema onto those data, extracts the most relevant information for subsequent modeling, and performs other data preprocessing. For example, one might extract, from log files documenting calibration runs for large numbers of tools, the largest deviation from specification. Further analyses could perform initial feature selection to identify the tool data with the greatest diagnostic value for predicting quality. The resulting subset of "important predictors" and the respective values for the maximum deviation from specification per run date can then be brought into the Statistica Analysis Server (e.g., the Monitoring and Alerting Server, MAS, as shown in the diagram) for subsequent monitoring and analyses. Results can ultimately be displayed on dedicated visualization platforms for further drill-down. This architecture is commonly applied in manufacturing contexts.

Putting it Together: Architectures for Big Data Analytics

The TIBCO software solutions portfolio contains a number of best-in-class tools to facilitate connecting to data, integrating and virtualizing connections across diverse data sources, and processing data and events in real time against high-volume and high-velocity data. TIBCO also provides one of the most popular tools for visual analytics: Spotfire.

Spotfire

Spotfire provides the ideal end point for big-data analytics orchestrated through Statistica. Essentially, all big-data analytics is about the extraction of information from (big) data, information that can be represented as a set of interrelated results tables. Spotfire provides consumers of that information the ideal interface to visually explore results, compare segments, slice-and-dice results, or render results in highly specialized displays like wafer maps, geo maps, KPI charts, etc.

The following overview architecture illustrates the tiers of a complete big-data solution, where Statistica provides the analytic backend supported by other big-data platforms including Spark, H2O, etc. Statistica then provides tools for scheduling, monitoring, and publishing results to Spotfire to support end users who take actions based on their visual review of results. This general pattern encompasses perhaps the majority of all enterprise-wide big-data analytics use cases, in manufacturing, marketing, etc.

Figure 14: TIBCO Big Data Architecture for Analytics

Statistica Enterprise, Version Control, GMP/GxP, and Validation

In this architecture, Statistica Enterprise not only provides orchestration and scheduling of analytics, but can also provide the critical support needed to enable enterprise-wide deployment and management of models, perhaps hundreds or thousands of models.

Statistica Enterprise provides version control, audit logs, and role-based access (and abstractions) to analytic templates and data, as well as features like approval processes and electronic signature support for validated GMP/GxP (Good Manufacturing Practices, Good Anything Practices) deployments in regulated industries (e.g., the pharmaceutical and medical device industries, financial services, etc.). In addition, Statistica provides tools for monitoring large numbers of analytic flows, with efficient web-based monitoring of those flows and processes through the Monitoring and Alerting Server (MAS). These components are in use across a wide range of industries to enable validated analytics with standardized, curated workflows throughout the enterprise, and efficient monitoring of large numbers of processes and parameters.


More information

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Creating a Recommender System. An Elasticsearch & Apache Spark approach Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused

More information

Distributed Computing with Spark

Distributed Computing with Spark Distributed Computing with Spark Reza Zadeh Thanks to Matei Zaharia Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Numerical computing on Spark Ongoing

More information

Modern Data Warehouse The New Approach to Azure BI

Modern Data Warehouse The New Approach to Azure BI Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics

More information

Migrate from Netezza Workload Migration

Migrate from Netezza Workload Migration Migrate from Netezza Automated Big Data Open Netezza Source Workload Migration CASE SOLUTION STUDY BRIEF Automated Netezza Workload Migration To achieve greater scalability and tighter integration with

More information

Optimize Your Databases Using Foglight for Oracle s Performance Investigator

Optimize Your Databases Using Foglight for Oracle s Performance Investigator Optimize Your Databases Using Foglight for Oracle s Performance Investigator Solve performance issues faster with deep SQL workload visibility and lock analytics Abstract Get all the information you need

More information

MLeap: Release Spark ML Pipelines

MLeap: Release Spark ML Pipelines MLeap: Release Spark ML Pipelines Hollin Wilkins & Mikhail Semeniuk SATURDAY Web Dev @ Cornell Studied some General Biology Rails Consulting for TrueCar and other companies Implement ML model for ClearBook

More information

IBM Best Practices Working With Multiple CCM Applications Draft

IBM Best Practices Working With Multiple CCM Applications Draft Best Practices Working With Multiple CCM Applications. This document collects best practices to work with Multiple CCM applications in large size enterprise deployment topologies. Please see Best Practices

More information

Massive Scalability With InterSystems IRIS Data Platform

Massive Scalability With InterSystems IRIS Data Platform Massive Scalability With InterSystems IRIS Data Platform Introduction Faced with the enormous and ever-growing amounts of data being generated in the world today, software architects need to pay special

More information

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Yojna Arora, Dinesh Goyal Abstract: Big Data refers to that huge amount of data which cannot be analyzed by using traditional analytics

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

EnterpriseLink Benefits

EnterpriseLink Benefits EnterpriseLink Benefits GGY a Moody s Analytics Company 5001 Yonge Street Suite 1300 Toronto, ON M2N 6P6 Phone: 416-250-6777 Toll free: 1-877-GGY-AXIS Fax: 416-250-6776 Email: axis@ggy.com Web: www.ggy.com

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information