Integration with Popular Big Data Frameworks in Statistica and Statistica Enterprise Server Solutions

Statistica White Paper


Siva Ramalingam
Thomas Hill
TIBCO Statistica

Table of Contents

Introduction
Spark Support in Statistica
  Requirements
  Statistica Spark Workspace Nodes
  Out-of-the-Box Point-and-Click Nodes (No Coding)
  Spark Scala Script Node
  Example Spark Workspaces
H2O Support in Statistica
  Requirements
  Statistica H2O Workspace Nodes
  H2O Example Nodes
  Example H2O Workspaces
In-Database Analytics Support in Statistica
  Requirements
  Statistica In-database Workspace, Overview
  Statistica In-database Workspace, Nodes
  List of Available In-database Nodes in Statistica
Final Comments, and Some Caveats
Putting it Together: Architectures for Big Data Analytics
  Spotfire
  Statistica Enterprise, Version Control, GMP/GxP, and Validation

Introduction

For decades, Statistica has provided comprehensive, powerful, and scalable analytics to customers across various industries. However, as data volumes and velocity increase exponentially, there is also the need for a flexible architecture that can move analytics to the data (into the database or data repository) and orchestrate complex analytic pipelines that combine multiple data sources with in-database, in-memory parallel, and in-server analytics when that is most efficient and useful. Statistica provides such an architecture, with capabilities to push computations to cluster computing environments from within a simple and powerful workspace interface. This interface provides flexibility with respect to:

- How and where to source data
- Where to perform computations
- Whether to use point-and-click low-code interfaces or any of the popular big-data scripting/programming environments (Python, R, Scala), and
- How and where to push results for subsequent processing or visualization

Users can design, code, and visualize their big data pipelines on the Statistica client while offloading the expensive computations to the server (Spark, H2O, or a database), utilizing the best of both worlds.

Spark. Spark is an open source distributed computing framework that offers massively parallel in-memory processing for SQL, streaming, machine learning, and graph analytics, exposed through Java, Scala, Python, and R APIs. The parallel operations are facilitated by the RDD (resilient distributed dataset) abstraction, which is immutable and provides lazy, fault-tolerant operations. Spark requires a distributed file system such as the Hadoop Distributed File System (HDFS) as well as a cluster manager like Hadoop YARN. Spark supports various advanced analytics and machine learning through spark.ml, with algorithms that are optimized for distributed in-memory execution and scalability.

H2O. H2O is an open source deep learning platform focused on artificial intelligence applications for enterprises. It offers a variety of machine learning models that can be trained using APIs in R, Python, Java, Scala, and JSON, as well as H2O's Flow GUI. It also provides access to Spark's machine learning library through the Sparkling Water API. Like Spark, it is a distributed computing platform that can run in the cloud, on on-premises clusters, or on single machines, using the Hadoop Distributed File System.

In-database analytics. In-database analytics refers to the process of integrating data analysis into data storage systems. For many enterprises today, data security is of the highest priority, and in many cases there is great reluctance to move data out of the storage system into a separate processing application. In-database analytics can provide both better data security and significant performance improvements, particularly with the introduction of column-oriented databases designed specifically for this use case.
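As a concrete illustration of the lazy, fault-tolerant Spark execution model described above, consider the following minimal Scala sketch. This is a sketch only: it assumes an existing SparkSession named spark and uses a hypothetical HDFS path.

    // Transformations are lazy: no cluster job runs when they are declared.
    val events = spark.read.parquet("hdfs:///data/events")   // hypothetical path
    val errors = events.filter(events("status") === "ERROR") // lazy transformation
    println(errors.count())                                  // action: triggers execution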

Spark Support in Statistica

Requirements

1. The user has access to a Spark instance
   - Running on Debian/Ubuntu, RedHat/CentOS, or MacOS
   - Version or greater
2. The Spark instance has a running Livy server
   - Version 2 or greater
   - That was built against Spark or greater
3. An active network connection

Statistica Spark Workspace Nodes

Statistica provides access to the Spark engine via point-and-click data workspace nodes (icons in the workspace that provide access to the GUI) used to configure how data are sourced and analyzed; the user can also execute Scala programs directly. The implementation of the workspace nodes relies on Livy, the open source REST interface to Apache Spark. On the Statistica client, the user will need to enter the Livy server URL under Home -> Tools -> Options -> Server/Web tab, as shown in Figure 1.

Figure 1: Statistica Server Configuration Menu

The Livy server URL is specified in three parts. Statistica supports basic authentication, where you can specify a username and password, followed by the server URL and the port on which Livy is running; for example, assuming Livy's default port, a URL of the form https://user:password@livyserver:8998. If the server has no authentication set up, you can omit the username and password from the URL.

Out-of-the-Box Point-and-Click Nodes (No Coding)

Statistica allows users to develop and manage custom-developed Spark Scala scripts; this is described later in this paper. Statistica also ships with the following predefined (script) nodes that can be used to create Spark-based analytics workflows (pipelines) without coding. The list below briefly describes each of these predefined nodes and identifies the specific Spark (spark.ml) APIs used in the analyses.

Spark Data Node
This node reads a file in csv, json, or parquet format from HDFS or the file system and produces a Spark DataFrame that can be connected to downstream spark.ml analyses.

Spark Decision Tree Classifier
This node runs a decision tree classification analysis on a user-specified input DataFrame. A categorical dependent variable along with continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Decision Tree Regressor
This node runs a decision tree regression analysis on a user-specified input DataFrame. A double label column along with continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Feature Selection
This node runs feature selection on a user-specified input DataFrame. The algorithm estimates Chi-square (predictor importance) statistics for each dependent/predictor pair. Missing data are casewise deleted for each pair. Continuous predictors are discretized using QuantileDiscretizer. Categorical predictors are indexed using StringIndexer. Chi-square values are computed using chisqtest.

Spark Feature Selection Pipeline
This node runs feature selection on a user-specified input DataFrame. The algorithm sets up a Spark ML pipeline to select the user-specified number of best predictors for a given dependent (y) variable. The input data should not contain missing values.

Spark Generalized Linear Model
This node runs a generalized linear regression analysis on an input DataFrame. A binary (y) label column, continuous predictors, and an optional weight variable are the expected inputs. The input data should not contain missing values.

Spark Index Categoricals
This node uses the StringIndexer API from Spark to convert the chosen string variables into columns with numeric indices. A column x becomes x_idx. The input data should not contain missing values. This node produces a DataFrame that can be fed to downstream spark.ml analyses.

Spark Linear Regression
This node runs a linear regression analysis on a user-specified input DataFrame. A double label column, continuous predictors, and an optional weight variable are the expected inputs. The input data should not contain missing values.

Spark Logistic Regression
This node runs a binomial logistic regression analysis on a user-specified input DataFrame. A binary label column, continuous predictors, and an optional weight variable are the expected inputs. The input data should not contain missing values.

Spark Make Design Matrix
This node uses the RFormula API to produce a main-effects design matrix. A single dependent variable and any number of continuous and/or categorical variables are the expected inputs. The input data should not contain missing values. This node produces a design matrix as a downstream document that can be fed to further spark.ml analyses.

Spark Random Forest Classifier
This node runs a random forest classification analysis on a user-specified input DataFrame. A categorical dependent variable and continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Random Forest Pipeline
This node runs a random forest classification analysis on a user-specified input DataFrame and performs feature selection: the algorithm sets up a Spark ML pipeline to select the top predictors, using a Spark Random Forest Classifier. A categorical dependent variable and continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Random Forest Regressor
This node runs a random forest regression analysis on a user-specified input DataFrame. A continuous dependent (y) variable and continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Reference Code Categoricals
This node uses the StringIndexer and OneHotEncoder APIs to convert the chosen string variables into columns of binary vectors with an _idx_vec extension. A column x becomes x_idx_vec on output. The input data should not contain missing values. This node produces a DataFrame that can be fed to further spark.ml analyses.

Spark SVM Classification
This node runs a classification SVM analysis on a user-specified input DataFrame. A binary label column and continuous predictors are the expected inputs. The input data should not contain missing values.

Spark Scala Script Node

To develop and submit an actual Scala script to the Spark engine, the user can use the Statistica Data Miner workspace node called Spark Scala Script. The node is accessible from the menu under Big Data Analytics -> Spark, as shown in Figure 2.

Figure 2: Big Data Analytics Spark Menu

This node is the primary access point to the Spark engine from the Statistica client. It is designed using the same architecture as other Statistica code nodes (Python, C#), so it comes with support for using Statistica spreadsheets as input, creating and using user-defined parameters, specifying the code to be executed when the node is run (directly in the node or from a file in the file system), and bringing results back into Statistica spreadsheets to make them available for downstream analyses.

The following Statistica-specific variables are used in the Spark Scala Script node:

- stain, of type List[DataFrame], is used to access upstream datasets.
- staout, of type DataFrame, is used to make a Spark DataFrame accessible to downstream Spark nodes.
- staresults, of type List[DataFrame], is used to bring Spark DataFrames back into the node's reporting documents collection as Statistica spreadsheets.

User-defined node parameters are available as predefined constants and can be referenced in the Scala script by their user-specified name.
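A minimal sketch of how these variables might be used together, assuming a single upstream node supplies a DataFrame and that staout and staresults accept direct assignment (the transformation shown is purely illustrative):

    // Access the first upstream dataset supplied by Statistica.
    val df = stain(0)
    // Compute a simple numeric summary as an illustrative result.
    val summary = df.describe()
    // Pass the original DataFrame to downstream Spark nodes.
    staout = df
    // Return the summary to Statistica as a spreadsheet in the
    // node's reporting documents collection.
    staresults = List(summary)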

Figure 3 shows an example Spark Scala node in detail. The Parameters tab contains the user-created parameters that specify the options for importing the file crabs.csv from the cluster file system into a Spark DataFrame:

- Header specifies that the first row of the file contains variable names/headers
- Infer schema specifies that Spark should automatically infer the variable types
- Path to file specifies where the file resides on disk (can also be an HDFS file path)
- File type specifies whether the file is a csv, json, or parquet file
- Casewise delete MD specifies that the imported DataFrame should remove all cases that contain missing data

Figure 3: Spark Scala Node Parameters Tab

Figure 4: Spark Scala Code Tab

The Scala code utilizes the user-defined parameters to import the file into Spark as a DataFrame and makes it available to downstream analyses by assigning it to the staout variable. The parameter mappings are as follows:

- Header: getheader
- Infer schema: inferschema
- Path to file: filepath
- File type: filetype
- Casewise delete MD: nadrop
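Putting the mappings above together, here is a minimal sketch of the kind of import code the Code tab might contain. It is a sketch under stated assumptions, not the node's actual script: it assumes a SparkSession named spark and that the parameter constants arrive as Boolean and String values.

    // Read the file using the user-defined node parameters.
    var df = spark.read
      .format(filetype)                   // "csv", "json" or "parquet"
      .option("header", getheader)        // first row contains variable names
      .option("inferSchema", inferschema) // let Spark infer variable types
      .load(filepath)
    // Casewise-delete missing data if requested.
    if (nadrop) df = df.na.drop()
    // Make the DataFrame available to downstream Spark nodes.
    staout = df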

Example Spark Workspaces

Statistica ships with several example workspaces to illustrate different workflows that can be designed with Spark Scala Script nodes. These workspaces are available in the Examples/Workspaces directory under the installation directory, as shown in Figure 5.

Figure 5: Spark Example Workspaces

Figure 6: Spark Example Regression Workspace

H2O Support in Statistica

Requirements

1. The user has access to a running H2O instance
   - Running on a cluster or a single host
   - Has Sparkling Water 2.1 or greater
2. A network connection, if the H2O instance is on a remote server

Statistica H2O Workspace Nodes

Statistica nodes utilize the H2O REST API, so the H2O server needs to be accessible over the network. Typically, the H2O server IP address and port are presented to the user once H2O is started. This information is needed for the H2O nodes (H2O Data) to connect to the H2O environment.

Statistica ships with predefined H2O nodes, which can be accessed from the Big Data Analytics -> H2O tab in the menu, as shown in Figure 7.

Figure 7: H2O Analysis Nodes

These nodes are completely parameterized for a point-and-click UI and implemented in IronPython, so the user just needs to specify the connection information and model parameters to run the specified analysis on the remote H2O instance. Here are some additional details regarding the H2O Data node.

Figure 8: H2O Data Node, Parameters Tab

The H2O URL parameter specifies where the remote H2O instance is hosted. The Data source path parameter specifies where the data file resides (Amazon S3, HDFS, and local files on the server are supported). The Code tab of the node shows the two parameters available to the IronPython script (and H2O), referenced as NodeParameters["H2O_url"] and NodeParameters["dataSourcePath"]. The script uses these parameters to import the data into the H2O server and returns a Statistica spreadsheet showing a sample of the dataset, along with a report detailing the dataset's metadata.

Figure 9: H2O Data Node, Code Tab

H2O Example Nodes

H2O Data Mapping
This node allows the user to specify a custom schema for the H2O DataFrame. This can be helpful in cases where the user deems the data types inferred by H2O to be unsuitable.

H2O Gradient Boosting Machine (GBM)
A GBM (Gradient Boosting Machine) is an ensemble of either regression or classification tree models. Both are forward-learning ensemble methods that obtain predictive results using gradually improved estimations.

H2O Generalized Linear Modeling (GLM)
Generalized linear models (GLMs) are an extension of traditional linear models. GLMs estimate regression models for outcomes following exponential distributions. The GLM suite includes Gaussian, Poisson, Binomial, Multinomial, and Gamma regression.

H2O Prediction Node
In the H2O Prediction Node, you can make predictions on new data sets using the trained models that are created when running the H2O GBM or GLM nodes.

Additional Data Science H2O Algorithms
The following additional data science algorithms are also provided:

- H2O Deep Learning
- H2O DRF (Distributed Random Forest)
- H2O PCA (Principal Component Analysis)
- H2O K-Means

Example H2O Workspaces

Statistica also ships with several example workspaces to illustrate the possible workflows that can be constructed with H2O nodes. These workspaces are available in the Examples/Workspaces directory under the installation directory, as shown in Figure 10.

Figure 10: H2O Example Workspaces

In-Database Analytics Support in Statistica

Requirements

The user has access to:
1. A Statistica Enterprise Server data configuration
2. Data residing in SQL Server, with appropriate permissions (create, update, and drop tables on tempdb for some algorithms)

Statistica In-database Workspace, Overview

The In-database analytic nodes are code-free workspace nodes that are completely parameterized and ready to use out of the box. These nodes are accessible from the Big Data Analytics tab of the main menu, under the In-Database Analytics section, as shown in Figure 11.

Figure 11: In-SQL-Database Processing Node

The In-Database Analytics nodes provide options to bring the analysis to the data stored in a database. Internally, the nodes implement the respective computations and analyses by automatically creating suitable queries that move time-consuming data processing operations into the database. For example, to compute covariances and analyses that depend on covariance matrices, it is possible to compute the sums-of-squares-and-cross-products (SSCP) matrices via queries in-database. Depending on the specific databases that are targeted (and their configuration), these queries can execute in parallel and in-memory, for very efficient in-memory parallelized analytics. Once the SSCP matrix is computed, Statistica (server) can then use that matrix to perform correlation/covariance-based analytics such as stepwise regression. In that manner, very large datasets can be analyzed using almost entirely in-database computational resources.
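As a hypothetical illustration of this pattern (the table and column names are placeholders, not Statistica's actual generated SQL), a single aggregate query can accumulate all the sums needed for a covariance, so that only a handful of numbers ever leave the database:

    // Placeholder SQL of the kind that could compute SSCP terms in-database.
    val sscpQuery = """
      SELECT COUNT(*)   AS n,
             SUM(x)     AS sx,  SUM(y)     AS sy,
             SUM(x * x) AS sxx, SUM(y * y) AS syy,
             SUM(x * y) AS sxy
      FROM analysis_table"""
    // From these six values: cov(x, y) = (sxy - sx * sy / n) / (n - 1)

Correlation and regression coefficients can then be derived from the same sums on the Statistica server side, which is what allows analyses such as stepwise regression to proceed without moving the raw data.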

Statistica In-database Workspace, Nodes

These nodes are broken down into logical groups: Data Access, Data Management, and Analysis nodes. The starting point of any In-database analytic workflow is the In-Database Enterprise Data Configuration node, which uses a database connection object (see Statistica Enterprise Server database connection and data configuration). As illustrated in the image below, the In-Database Enterprise Data Configuration node takes an Enterprise Data Configuration URL as input and provides options for the user to bring back a sample of the data (for review by the user) from the database in a Statistica spreadsheet and to specify the type of database the data resides in (if not specified, Statistica will try to automatically infer the database type). The database connection object created by this node is kept open ("alive") for further analyses downstream.

Figure 12: In-Database-Processing Node, Specifications Tab

Figure 12: In-Database-Processing Example Workspace

The image above illustrates a complete In-database analytic workflow involving the Data Access, Data Management, and Analysis nodes.

List of Available In-database Nodes in Statistica

In-database Write to Database
This node provides the user the ability to move all or a subset of an existing table in the database into another table (persistent or temporary) in the same database, by creating, overwriting, or appending to the destination table.

In-database Filter Duplicate Cases
This node provides the user the ability to filter out records with identical values for user-specified columns. It returns a filtered data stream as output, containing the user-selected columns.

In-database Random Sample Filtering
This node follows the functionality of the Random Sampling module in Statistica, with a few important differences: the In-Database Random Sample Filtering node does not support sampling-with-replacement, oversampling (sampling fractions > 1.0), or split-node (stratified) random sampling. The transformations to the data are not applied until the process reaches the In-Database analytics node, which operates on the modified query. This means that if multiple downstream nodes are connected, they may produce different results, since the random sampling happens only when the main analytic procedure is executed.

In-database Sort
This node allows the user to sort the columns of the data table. The user can specify the sort direction for each column and the position of each column in the output stream.

In-database Correlation Matrix
This node allows the user to generate correlation and partial correlation matrices for specified data table columns as Statistica spreadsheets.

In-database Descriptive Statistics
This node enables the user to obtain descriptive statistics for specified data table columns. The node provides a subset of the functionality of the regular Descriptive Statistics node that operates on Statistica spreadsheets.

In-database Logistic Regression
This node enables the user to run a logistic regression analysis on a specified binary dependent (y) variable with continuous and categorical predictor variables. The node returns a coefficients spreadsheet as output.

In-database Multiple Regression
This node enables the user to run a multiple regression analysis on specified continuous dependent and continuous predictor columns of the data table. The node produces a coefficients spreadsheet as output.

Final Comments, and Some Caveats

The features described in the sections above, combined with the capabilities of Statistica, provide a powerful and flexible platform for big data exploration and analysis. The architecture provides options for analysts to access multiple data processing platforms in the same flow, to move selected analyses to big-data repositories when that is useful or necessary, or to move data to the Statistica server to perform the analysis on dedicated hardware when the respective databases should not be burdened with analytics, so that they can continue to support other business-critical functions. Users planning to explore these features are advised to consider the following possible issues.

Spark nodes

Data export/import between the Spark engine and Statistica is managed through REST calls, and users should be mindful that the data size should be on the order of a few megabytes at most in these cases. An alternative is to use the HDFS import/export nodes available in Statistica and design your script to read from and write to HDFS. Another item to be mindful of is resource allocation for each Spark analysis: at the moment, Statistica requires that resource allocation be configured using Livy server-side configuration options.

H2O nodes

Sparkling Water is not shipped with Statistica and needs to be deployed, configured, and started before the Statistica H2O workspace nodes can be used. Data export/import between the H2O server and Statistica is managed through REST calls, and users should be mindful that the data size should be on the order of a few megabytes at most in these cases. Of course, big data are typically accessed directly in HDFS or in other suitable data repositories.

In-database nodes

Performance of the in-database analyses relies on the database engine and the amount of data stored. It should be noted, however, that some tasks can take a considerable amount of time and impact the performance of the database.

Database systems impose limitations that the in-database analytics nodes have to satisfy, and these limitations vary by database type. For example, the limitations for SQL Server are shown below (see Maximum Capacity Specifications for SQL Server, extracted 8/14/2015):

- Columns per nonwide table: 1,024
- Columns per SELECT statement: 4,096
- Bytes per row: 8,060

Also note that direct interaction with databases requires certain access privileges, in particular for analyses that may create intermediate temporary tables to support iterative computations (e.g., logistic regression).

Bringing analytics to the database helps improve performance and makes the solution scalable. However, it also moves calculations to a different engine. As such, the approach leverages the database's built-in functions and relies on the number representation within the database. Databases differ in data types and functions. Even though the in-database analytics nodes perform the necessary type conversions and utilize patterns common to the majority of SQL dialects, it is hardly possible to maintain the same level of accuracy across all databases. To obtain the highest level of accuracy and consistency, it is recommended to import the data into Statistica and run the analysis with the common Statistica modules.

Moving analytics to the data, moving data to Statistica analytics

In practical applications it is sometimes not obvious whether big-data analytics can be performed more efficiently on the database side (in-database), or whether it is easier and faster to extract the data and move them to the Statistica analytics server for analysis. In extreme cases, the choice is obvious: if the data are truly huge and exceed the capacity for storage on the Statistica server side, then in-database analytics (any of the approaches discussed) is the way to create repeatable, scalable analytic workflows and pipelines. Likewise, when the data sizes are in the megabyte range, it is often slower to move the analytics to the database than to import or move the data to the Statistica server for analysis. Obviously, local analytics on the Statistica side against small to moderately large datasets can run faster and be more interactive than analytics run inside a complex server farm (given the overhead, etc.). Given sufficient bandwidth and speed, even sizeable data sets can be analyzed faster on the Statistica server side (or even the desktop), in particular when the main storage system is also heavily used to support other business-critical functions, as is often the case.

Moving Information Extraction into Big Data Repositories

Regardless of the specific database platform and in-database analytics that are used, a common pattern for big-data analytics can best be characterized as a "funnel." Data is growing exponentially in most organizations, but the information that is critical for the successful operation of an organization or process is not growing at that rate. Put another way, big-data analytics can be thought of as information extraction against continuously growing data volumes. A common approach is to push initial data preparation and information extraction into the database (to the data), but perform the final analyses on a dedicated Statistica analysis server. This "funnel" model is summarized in Figure 13.

Figure 13: "Funnel" Modeling Pipeline Architecture

In this architecture, data such as error log files, calibration logs, raw customer data and narratives, etc. are stored in HDFS. An initial in-database process imposes a schema onto those data, extracts the most relevant information for subsequent modeling, and performs other data preprocessing. For example, one might extract, from log files documenting calibration runs for large numbers of tools, the largest deviation from specification. Further analyses could perform initial feature selection to identify the tool data with the greatest diagnostic value for predicting quality. The resulting subset of "important predictors" and the respective values for the maximum deviation from specification per run date can then be brought into the Statistica Analysis Server (e.g., the Monitoring and Alerting Server, MAS, as shown in the diagram) for subsequent monitoring and analyses. Results can ultimately be displayed on dedicated visualization platforms for further drill-down. This architecture is commonly applied in manufacturing contexts.

Putting it Together: Architectures for Big Data Analytics

The TIBCO software solutions portfolio contains a number of best-in-class tools to facilitate connecting to data, integrating and virtualizing connections across diverse data sources, and processing data and events in real time against high-volume and high-velocity data. TIBCO also provides one of the most popular tools for visual analytics: Spotfire.

Spotfire

Spotfire provides the ideal end point for big-data analytics orchestrated through Statistica. Essentially, all big-data analytics is about the extraction of information from (big) data, information that can be represented as a set of interrelated results tables. Spotfire provides consumers of that information the ideal interface to visually explore results, compare segments, slice-and-dice results, or render results in highly specialized displays like wafer maps, geo maps, KPI charts, etc.

The following overview architecture illustrates the tiers of a complete big-data solution, where Statistica provides the analytic backend supported by other big-data platforms including Spark, H2O, etc. Statistica then provides tools for scheduling, monitoring, and publishing results to Spotfire to support end users who take actions based on their visual review of results. This general pattern encompasses perhaps the majority of all enterprise-wide big-data analytics use cases, in manufacturing, marketing, etc.

Figure 14: TIBCO Big Data Architecture for Analytics

Statistica Enterprise, Version Control, GMP/GxP, and Validation

In this architecture, Statistica Enterprise not only provides orchestration and scheduling of analytics, but can also provide the critical support needed to enable enterprise-wide deployment and management of models, perhaps hundreds or thousands of models.

Statistica Enterprise provides version control, audit logs, and role-based access (and abstractions) to analytic templates and data, as well as features like approval processes and electronic signature support for validated GMP/GxP (Good Manufacturing Practices, Good Anything Practices) deployments in regulated industries (e.g., the pharmaceutical and medical device industries, financial services, etc.). In addition, Statistica provides tools for monitoring large numbers of analytic flows, with efficient web-based monitoring of those flows and processes through the Monitoring and Alerting Server (MAS). These components are in use across a wide range of industries to enable validated analytics with standardized, curated workflows throughout the enterprise, and efficient monitoring of large numbers of processes and parameters.


More information

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Creating a Recommender System. An Elasticsearch & Apache Spark approach Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused

More information

Distributed Computing with Spark

Distributed Computing with Spark Distributed Computing with Spark Reza Zadeh Thanks to Matei Zaharia Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Numerical computing on Spark Ongoing

More information

Modern Data Warehouse The New Approach to Azure BI

Modern Data Warehouse The New Approach to Azure BI Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics

More information

Migrate from Netezza Workload Migration

Migrate from Netezza Workload Migration Migrate from Netezza Automated Big Data Open Netezza Source Workload Migration CASE SOLUTION STUDY BRIEF Automated Netezza Workload Migration To achieve greater scalability and tighter integration with

More information

Optimize Your Databases Using Foglight for Oracle s Performance Investigator

Optimize Your Databases Using Foglight for Oracle s Performance Investigator Optimize Your Databases Using Foglight for Oracle s Performance Investigator Solve performance issues faster with deep SQL workload visibility and lock analytics Abstract Get all the information you need

More information

MLeap: Release Spark ML Pipelines

MLeap: Release Spark ML Pipelines MLeap: Release Spark ML Pipelines Hollin Wilkins & Mikhail Semeniuk SATURDAY Web Dev @ Cornell Studied some General Biology Rails Consulting for TrueCar and other companies Implement ML model for ClearBook

More information

IBM Best Practices Working With Multiple CCM Applications Draft

IBM Best Practices Working With Multiple CCM Applications Draft Best Practices Working With Multiple CCM Applications. This document collects best practices to work with Multiple CCM applications in large size enterprise deployment topologies. Please see Best Practices

More information

Massive Scalability With InterSystems IRIS Data Platform

Massive Scalability With InterSystems IRIS Data Platform Massive Scalability With InterSystems IRIS Data Platform Introduction Faced with the enormous and ever-growing amounts of data being generated in the world today, software architects need to pay special

More information

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Yojna Arora, Dinesh Goyal Abstract: Big Data refers to that huge amount of data which cannot be analyzed by using traditional analytics

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

EnterpriseLink Benefits

EnterpriseLink Benefits EnterpriseLink Benefits GGY a Moody s Analytics Company 5001 Yonge Street Suite 1300 Toronto, ON M2N 6P6 Phone: 416-250-6777 Toll free: 1-877-GGY-AXIS Fax: 416-250-6776 Email: axis@ggy.com Web: www.ggy.com

More information

Accelerate Big Data Insights

Accelerate Big Data Insights Accelerate Big Data Insights Executive Summary An abundance of information isn t always helpful when time is of the essence. In the world of big data, the ability to accelerate time-to-insight can not

More information