Accelerator for Apache Spark. Functional Specification. 23 August 2016. Version 1.0.0


Accelerator for Apache Spark Functional Specification

TIBCO Software Inc. Global Headquarters, 3303 Hillview Avenue, Palo Alto, CA

23 August 2016, Version 1.0.0

© 2016 TIBCO Software Inc. All rights reserved. TIBCO, the TIBCO logo, The Power of Now, and TIBCO Software are trademarks or registered trademarks of TIBCO Software Inc. in the United States and/or other countries. All other product and company names and marks mentioned in this document are the property of their respective owners and are mentioned for identification purposes only.

This document outlines the functional specification of the components of the Accelerator for Apache Spark.

Revision History

Date         Author             Comments
/04/2016     Piotr Smolinski    Initial version
/04/2016     Piotr Smolinski
/06/2016     Piotr Smolinski
/06/2016     Ana Costa e Silva
/08/2016     Piotr Smolinski    Version for release

Accelerator for Apache Spark Functional Specification

Copyright Notice

COPYRIGHT 2016 TIBCO Software Inc. This document is unpublished and the foregoing notice is affixed to protect TIBCO Software Inc. in the event of inadvertent publication. All rights reserved. No part of this document may be reproduced in any form, including photocopying or transmission electronically to any computer, without prior written consent of TIBCO Software Inc. The information contained in this document is confidential and proprietary to TIBCO Software Inc. and may not be used or disclosed except as expressly authorized in writing by TIBCO Software Inc. Copyright protection includes material generated from our software programs displayed on the screen, such as icons, screen displays, and the like.

Trademarks

Technologies described herein are either covered by existing patents or are the subject of patent applications in progress. All brand and product names are trademarks or registered trademarks of their respective holders and are hereby acknowledged.

Confidentiality

The information in this document is subject to change without notice. This document contains information that is confidential and proprietary to TIBCO Software Inc. and may not be copied, published, or disclosed to others, or used for any purposes other than review, without written authorization of an officer of TIBCO Software Inc. Submission of this document does not represent a commitment to implement any portion of this specification in the products of the submitters.

Content Warranty

The information in this document is subject to change without notice. THIS DOCUMENT IS PROVIDED "AS IS" AND TIBCO MAKES NO WARRANTY, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO ALL WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. TIBCO Software Inc. shall not be liable for errors contained herein or for incidental or consequential damages in connection with the furnishing, performance or use of this material.
For more information, please contact: TIBCO Software Inc., 3303 Hillview Avenue, Palo Alto, CA, USA

Table of Contents

1 Preface
1.1 Purpose of Document
1.2 Scope
1.3 Referenced Documents
2 Architecture
2.1 Components
2.2 Event Processor Flows (Fast Data Story)
2.3 Spotfire Components (Big Data Story)
2.4 LiveView Components (Operations Story)
3 Event Sequencing
3.1 Regular Event Flow
3.2 Data Processing Flow
3.3 Simulation
4 Event Processor - StreamBase
4.1 Core Logic
4.1.1 ProcessTransactionsAndScore
4.1.2 ProcessTransaction
4.1.3 CategorizeTransactions
4.1.4 CategorizeTransaction (DefaultCategorizeTransaction)
4.1.5 FeaturizeTransactions
4.1.6 EvaluateModel (H2OEvaluateModel)
4.2 Transport Binding
4.2.1 KafkaWiredProcessTransaction
4.2.2 KafkaConsumeTransactions
4.2.3 KafkaProduceNotifications
4.2.4 KafkaAcknowledgeTransaction
4.3 Persistent Runtime State
4.3.1 HBaseCustomerHistory
4.3.2 HBaseAddCustomerTransaction
4.4 Configuration Loading and Change Monitoring
4.4.1 MaintainCategories
4.4.2 MaintainFeatures
4.4.3 H2OMaintainModel
4.4.4 CoordinateStartup
5 Data Analytics - Spotfire
5.1 Discover Big Data
5.1.1 Totals
5.1.2 Discover Big Data: Drill-down
5.1.3 Categories
5.1.4 Basket Analysis
5.1.5 Client Cross-Sell
5.1.6 Geos
5.1.7 Play-page
5.2 Model Big Data
5.2.1 Preparation
5.2.2 Training
5.2.3 Model quality check
5.2.4 Variable importance
5.2.5 Discrimination threshold selection
5.3 Design and Execute Marketing Campaigns
5.3.1 Campaign bundling
5.3.2 Campaign deployment
6 Data Access - Spark and H2O
6.1 Data Access and Processing in Spark
6.2 Model Training in Sparkling Water / H2O
7 Events to Data - Flume
7.1 Information to Be Stored
7.2 From Events to Data
7.3 When My Data Is Available
7.3.1 Events
7.3.2 Runtime context
7.3.3 Intermediary storage
7.3.4 Target storage
7.4 Data for Analytics
7.4.1 Data format
7.4.2 Data organization
7.4.3 Enrichment
7.4.4 Tools
8 Insight to Action - ZooKeeper and H2O
9 Event Flow Simulator

Table of Figures

Figure 1: Solution Component Diagram
Figure 2: Regular event flow
Figure 3: Data processing activities
Figure 4: ProcessTransactionAndScore
Figure 5: ProcessTransactionAndScore transactions
Figure 6: ProcessTransactionAndScore notifications
Figure 7: ProcessTransactionAndScore acks
Figure 8: ProcessTransactionAndScore acks
Figure 9: ProcessTransactionAndScore transactionsout
Figure 10: ProcessTransactionAndScore categories
Figure 11: ProcessTransaction
Figure 12: ProcessTransaction Transactions
Figure 13: ProcessTransaction Predictions
Figure 14: CategorizeTransactions
Figure 15: DefaultCategorizeTransaction
Figure 16: FeaturizeTransactions
Figure 17: H2OEvaluateModel
Figure 18: KafkaWiredProcessTransaction
Figure 19: KafkaWiredProcessTransaction Transactions
Figure 20: KafkaWiredProcessTransaction Categories
Figure 21: KafkaConsumeTransactions
Figure 22: KafkaProduceNotifications
Figure 23: KafkaAcknowledgeTransaction
Figure 24: HBaseCustomerHistory
Figure 25: HBaseAddCustomerTransaction
Figure 26: MaintainCategories
Figure 27: MaintainFeatures
Figure 28: H2OMaintainModel
Figure 29: CoordinateStartup
Figure 30: Spotfire: Discover: Totals
Figure 31: Drill-down
Figure 32: Spotfire: Discover: Categories
Figure 33: Spotfire: Discover: Basket Analysis
Figure 34: Spotfire: Discover: Client CrossSell
Figure 35: Spotfire: Discover: Geos
Figure 36: Spotfire: Discover: Play Page
Figure 37: Spotfire: Model: Prep
Figure 38: Spotfire: Model: Training in Spark
Figure 39: Spotfire: Model: Evaluate Quality
Figure 40: Spotfire: Model: Variable Importance
Figure 41: Spotfire: Model: Custom Threshold
Figure 42: Spotfire: Deploy: Bundle Models into Campaigns
Figure 43: Spotfire: Model: Launch Your Campaigns

Table of Tables

Table 1: Accelerator for Apache Spark Components
Table 2: Event Processor Modules
Table 3: Spotfire and Spark components
Table 4: LVW and LDM components

1 Preface

1.1 Purpose of Document

This document addresses the dynamic aspects of the Accelerator for Apache Spark. It describes the applied solutions as designed, how they can be repeated in concrete customer projects, and how they are realized in the accelerator demo.

The Accelerator for Apache Spark addresses the growing market of Big Data analytics solutions with a strong focus on event processing (Fast Data). The accelerator's goal is to highlight the value TIBCO adds to the Big Data world. Big Data solutions already exist; what is missing is a way of getting value from Big Data analytics. It is possible to explore data, process it and build models. The challenge arises when the data is no longer static. Events flow through the system, and the goal of event processing is to capture them in a form optimal for analytics. Once the results from analytics are available, they should be converted into value. The accelerator covers the full cycle from event capture through analytics to predictive and prescriptive model execution against observations.

1.2 Scope

The document covers the following aspects:

- Scalable event capture and processing (Kafka and StreamBase)
- Event persistence in Big Data storage (Kafka, Flume, Spark)
  - minimal impact on the event processing layer
  - data processing efficiency
- Numerical model training in Big Data processing clusters (Spotfire, Spark, H2O)
- Model deployment to scaled-out event processors (Spotfire, ZooKeeper and StreamBase)
- Operational monitoring (LiveView DataMart and LiveView Web)
- Artificial data generation and injection

1.3 Referenced Documents

- Accelerator for Apache Spark Quick Start Guide
- Accelerator for Apache Spark Interface Specification

2 Architecture

2.1 Components

The accelerator architecture focuses on commonly applied open source Big Data products. The key solutions are:

- Kafka - extremely scalable message bus for Fast Data
- HDFS - de facto standard for Big Data storage

These two products have been combined with TIBCO products:

- StreamBase - event processing solution
- Spotfire - analytics platform

The gaps in the architecture have been filled with:

- HBase - for scalable event context storage
- Flume - for Fast Data to Big Data transition
- Spark - for data access and transformation
- H2O - for clustered model training and lightning-fast scoring
- ZooKeeper - for cluster coordination

Figure 1: Solution Component Diagram

Importantly, the accelerator is not limited to Big Data. The problem of getting value from analytics also exists in traditional applications.

Table 1: Accelerator for Apache Spark Components

Messaging Firehose (Apache Kafka): Highly scalable messaging bus. The core of a Fast Data system is a messaging bus capable of passing thousands of messages per second while remaining expandable. With Kafka it is possible to add nodes on demand to support more traffic.

Data Storage (Apache Hadoop HDFS): Big Data systems rely on efficient and reliable storage for enormous amounts of data. The Hadoop framework provides two components, one for data (HDFS) and one for programs (YARN).

Event Processor (TIBCO StreamBase): StreamBase is a CEP and ESP platform for event processing. It combines visual programming with high efficiency for reactive event handling. The component provides integration and event processing capabilities.

Data Analytics (TIBCO Spotfire): Spotfire is a data visualization and analytics platform. In the accelerator, the access patterns to the data stored in the cluster were evaluated. The accelerator also shows a sample flow for model building in the Big Data cluster and runtime model deployment.

Runtime Context Store (Apache HBase): NoSQL columnar database used with HDFS.

Data Writer (Apache Flume): Event persistence framework.

Data Access Service (Apache Spark): Big Data processing framework. Apache Spark is the current state-of-the-art solution for processing data in Big Data clusters. It offers much better throughput and latency than the original Hadoop Map-Reduce.

Model Training Engine (H2O): Cluster-oriented numerical modelling software. Traditional numerical modelling algorithms in R or NumPy/SciPy are implemented with a single-node architecture in mind; when the dataset significantly exceeds a single node's capacity, such algorithms need to be reimplemented. H2O is a successful attempt to train models at cluster scale, and it generates efficient real-time scoring models.

Simulation Publisher (StreamBase, Jython, Kafka): Simulation framework. The component injects messages into the system for demo purposes. The component comprises customer modelling and data injection parts.

Real-Time Dashboard (Live DataMart, LiveView Web, StreamBase): Visualization component presenting recent changes in the system in real time.

2.2 Event Processor Flows (Fast Data Story)

The Fast Data story focuses on the data flowing through the system. The operating data unit is the customer. The event processing layer captures new transactions, builds the customer history and prepares offers.

Table 2: Event Processor Modules

Kafka transaction binding (Event Processor): The integration binding to the messaging firehose. It contains an example of Kafka adapter usage and complex XML handling.

Context binding (Event Processor): Each transaction is processed in the scope of previous transactions executed by the same customer. The state is externalized to HBase.

Enrichment (Event Processor): The context contains only raw facts; in this case a list of transactions with just product ids. For model processing this information has to be enriched with category cross-referencing.

Transaction featurization (Event Processor): Before transactions can be processed by a model, the transaction and history must be converted into model input. The typical model input is a list of categorical or numerical values.

Model execution (Event Processor): The models are external deployable artifacts produced by the data analytics layer. The result of event processing in this case is a score for each deployed model.

Live DataMart binding (Event Processor): LiveView Web is provided as a real-time monitoring dashboard. The underlying Live DataMart is fed by the event processing component.

Flume binding (Event Processor): Binding for secure sending of the data to HDFS.

Offer acceptance tracking (Event Processor): Process of tracking the customer response.

Configuration notifications (Event Processor): Binding for the configuration changes provided by ZooKeeper.

2.3 Spotfire Components (Big Data Story)

The Big Data story uses a holistic view on the data. It aggregates customers and builds statistical models. The operating unit is the dataset.

Table 3: Spotfire and Spark components

ETL (Data Access Service, Data Analytics): Transformation from Avro to Parquet.

Data discovery (Data Analytics, Data Access Service): Access to the underlying tables for data discovery.

Model training (Data Analytics, Data Access Service): Model preparation and assessment.

Model deployment (Data Analytics): Model submission to the event processing layer.

2.4 LiveView Components (Operations Story)

LiveView shows the current state of the system. It presents the currently relevant information about running processes, which means it contains only a small fraction of the data, or heavily reduced information.

Table 4: LVW and LDM components

Transaction, TransactionItems (Real-Time Dashboard): Recent transactions with their content. The tables form a master-child structure.

ModelStatus (Real-Time Dashboard): Status of the deployed models.

StoreSummary (Real-Time Dashboard): Current store information. Includes static store information, like geographic position, and aggregate information derived from transactions.

3 Event Sequencing

3.1 Regular Event Flow

Figure 2: Regular event flow (sequence diagram spanning the Originator, Kafka, StreamBase, HBase, H2O, Flume, HDFS and Live DataMart)

The Fast Data story is an automated process. It focuses on transaction event processing. The sequence of events:

1. The originator publishes a transaction event (XML) to the Kafka Transactions topic
2. The StreamBase event flow retrieves the event
3. The customer's past transactions are retrieved from HBase
4. The past transactions are filtered by date (to limit to the recent transactions) and deduplicated
5. The built customer context is converted into features
6. The data is scored by the deployed H2O models
7. The results are filtered according to the model metadata (cut-off, validity dates and so on)
8. From the remaining results the winning one is selected
9. The transaction data with the scoring result is published to Kafka as JSON
   a. A Flume Source consumes batches of messages
   b. Once all messages are accepted by the agent's Channel, the batch is acknowledged
   c. The batches are delivered to the HDFS Sink
   d. Once the Sink flushes its buffers, it removes the data from the Channel
10. The result is published to the Kafka Notifications topic as XML
11. The message is delivered to the originator (it may or may not contain an offer)
12. The transaction is published to the LDM for runtime tracking
13. The past transactions of the current customer are scanned for pending offers
14. The pending offers with categories matching the incoming transaction are marked succeeded
15. The past transactions are scanned for outdated offers (based on the message timestamp)
16. The pending offers with a missed deadline are marked unsuccessful
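Steps 3 to 8 of the sequence above can be pictured as a short, technology-free Python sketch. The record layout, the 90-day window, and the model metadata fields used here are illustrative assumptions, not the accelerator's actual schemas:

```python
from datetime import date, timedelta

def select_offer(new_txn, history, models, today=date(2016, 8, 23)):
    """Score one transaction in the context of the customer's recent history."""
    # Step 4: keep only recent transactions, deduplicated by transaction id
    recent = {t["txn_id"]: t for t in history
              if today - t["date"] <= timedelta(days=90)}
    context = list(recent.values()) + [new_txn]
    # Step 5: featurize - here simply total spend per category (illustrative)
    features = {}
    for t in context:
        for cat in t["categories"]:
            features[cat] = features.get(cat, 0.0) + t["amount"]
    # Steps 6-7: score with every deployed model, filter by cutoff and validity dates
    candidates = []
    for m in models:
        score = m["score"](features)
        if m["valid_from"] <= today <= m["valid_to"] and score >= m["cutoff"]:
            candidates.append((score, m["offer"]))
    # Step 8: the winning offer is the highest-scoring candidate, or none at all
    return max(candidates)[1] if candidates else None
```

In the real flow the history comes from HBase, the features from the maintained feature list, and the scores from deployed H2O models; the sketch only shows how the steps compose.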

3.2 Data Processing Flow

Figure 3: Data processing activities (Collect data, ETL data, Discover data, Build models, Bundle models, Deploy models)

The Big Data story is a human-driven process. The focus here is exploration of the data stored in the Big Data cluster (HDFS + Spark). The process eventually produces models executed in the event processing layer. The high-level procedure:

1. The data is collected in HDFS as Avro
2. The ETL process turns many relatively small Avro files into Parquet
   a. transaction deduplication
   b. category info enrichment
   c. data flattening
3. The data scientist explores the data and proposes candidate algorithms (partially covered by the accelerator)
4. The data analyst builds the candidate algorithms and assesses their performance (for example using ROC curves). The accepted models are described and passed to operations
5. The system operator combines the models into bundles
6. The bundles are deployed to event processing

The side activities happening in the event processing layer are:

1. The events sent to Flume are accompanied with model evaluation data
2. The customer purchases are tracked for offer acceptance and sent in real time to the LDM
3. The offer acceptance and model efficiency can be transformed in the ETL process

3.3 Simulation

The traffic simulator is a StreamBase component generating a test transaction flow. The component publishes transaction messages using a configured transaction rate, reference data and a time compression factor. The module is also capable of simulating the customer response to the presented offers.

There are two implementations of the component. One implementation uses a real-time transaction generation model. This variation uses a stateful process that tracks a large number of customers and generates random transactions using reference data probability modifiers. The process tries to keep a uniform time distribution between subsequent transactions of a given customer.
The advantage is that real-time data generation may adapt customer behaviour to the system responses (offers).

The alternative implementation reads pregenerated data and sends messages. The data is stored in a flat tab-separated file. The file is ordered by timestamp and transaction id. The ordering guarantees that the lines of the same transaction are stored as a single block. The generator process builds a random customer history. A single iteration creates a customer with a demographic profile. For this customer a series of transactions is built. The transactions are written out as a flattened transaction item list.
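A minimal sketch of producing such a pregenerated file follows. The column layout used here (timestamp, transaction id, customer id, SKU) is an assumed example, not the accelerator's actual interface; the point is the ordering guarantee, which keeps each transaction's item lines contiguous:

```python
import csv
import io
import random

def write_pregenerated(out, n_customers=3, seed=42):
    """Emit flattened transaction items: one line per item, grouped per transaction."""
    rnd = random.Random(seed)
    rows = []
    ts = 1_000_000
    for cust in range(n_customers):
        for txn in range(rnd.randint(1, 3)):
            ts += rnd.randint(60, 3600)          # roughly uniform spacing between transactions
            txn_id = f"c{cust}-t{txn}"
            for _ in range(rnd.randint(1, 4)):   # 1..4 item lines per transaction
                rows.append((ts, txn_id, f"cust{cust}", f"sku{rnd.randint(1, 9)}"))
    # ordering by (timestamp, transaction id) keeps each transaction's lines as one block
    rows.sort(key=lambda r: (r[0], r[1]))
    w = csv.writer(out, delimiter="\t", lineterminator="\n")
    w.writerows(rows)

buf = io.StringIO()
write_pregenerated(buf)
```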

4 Event Processor - StreamBase

4.1 Core Logic

4.1.1 ProcessTransactionsAndScore

The event flow handles the main context-related logic.

Figure 4: ProcessTransactionAndScore

In the flow the transactions are processed for customer offering and for hot item (category) tracking. The ProcessTransaction module executes the logic related to customer offering. It loads the customer context, transforms it into a model-consumable feature set and performs the final interpretation of the model output. In this particular case the winning offer is selected. TrackCategories expands the incoming transaction into transaction lines with category info. Importantly, a single line may have 0 or many categories. The resulting categories are then processed as individual signals. The module also provides the external configuration wiring. The control submodules are responsible for maintenance of the reference tables and the deployed models.

The transactionsin input stream carries the raw transaction information passed from the originator. The capture group supports arbitrary external content to be passed transparently to the output streams. This feature is used to retain the Kafka consumption context information.

Figure 5: ProcessTransactionAndScore transactions

The notifications output stream emits the ultimate result of the processing logic. This result is used to send the offer to the customer. The events contain the input event's transport-related fields.

Figure 6: ProcessTransactionAndScore notifications

After all logic is executed, the messages are acknowledged to the input topic. With Kafka this means the last consumed offsets are saved in ZooKeeper. Because the acknowledgement protocol is transport-related and logic-independent, the acks events carry only transport information.

Figure 7: ProcessTransactionAndScore acks

Figure 8: ProcessTransactionAndScore acks

From the same structure as the notifications, the audit information is derived and published as transactionsout. The events are used to update the LDM tables and to store the transactions and evaluation results in HDFS for auditing and effectiveness tracking purposes.

Figure 9: ProcessTransactionAndScore transactionsout

The categories output stream emits category tracking tuples. These are later consumed for category performance checks and, per customer, to detect the offer responses.

Figure 10: ProcessTransactionAndScore categories

4.1.2 ProcessTransaction

This is the main workhorse of the CEP-style processing. The flow implements a stateful context for the customer's transactions. The past transactions are retrieved from a dedicated (pluggable) storage solution and the new transaction is appended to the ledger. All the transactions in the retrieved history are classified according to the product-to-category mapping. Subsequently the enriched customer context is converted into a feature vector, i.e. a data structure corresponding to the customer description in the applied modelling. The result is then processed by all currently deployed models.

Figure 11: ProcessTransaction

The Transactions input stream carries the essential information about the transaction. The flow in this module is responsible for information enrichment and adaptation.

Figure 12: ProcessTransaction Transactions

The Predictions output stream strips the locally collected state. It emits the original input information with the accepted model results.

Figure 13: ProcessTransaction Predictions

4.1.3 CategorizeTransactions

The flow iterates over the transactions and applies category resolution to each of them.

Figure 14: CategorizeTransactions

4.1.4 CategorizeTransaction (DefaultCategorizeTransaction)

The transaction categorization uses pluggable logic. In the applied case it uses a query table to load all the categories assigned to a product SKU.

Figure 15: DefaultCategorizeTransaction
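The category resolution above, together with the featurization described in the next section, amounts to a dictionary join followed by projection onto a fixed feature list. A hedged sketch, where the SKU-to-category table and the feature names are invented examples rather than the accelerator's reference data:

```python
# illustrative product-to-category mapping; a SKU may map to 0..many categories
CATEGORIES = {"sku1": ["dairy"], "sku2": ["dairy", "organic"], "sku9": []}

FEATURES = ["dairy", "organic", "bakery"]   # ordering must match model training

def categorize(lines):
    """Attach the category list to every transaction line (empty list if unmapped)."""
    return [dict(line, categories=CATEGORIES.get(line["sku"], [])) for line in lines]

def featurize(history):
    """Project the enriched history onto the fixed feature list used at training time."""
    counts = {}
    for line in history:
        for cat in line["categories"]:
            counts[cat] = counts.get(cat, 0) + 1
    return [counts.get(f, 0) for f in FEATURES]

vector = featurize(categorize([{"sku": "sku1"}, {"sku": "sku2"}, {"sku": "sku9"}]))
# vector now lines up positionally with the model's expected input
```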

4.1.5 FeaturizeTransactions

Context featurization is typically a complex task. The CEP context information (enriched by the known state and reference data) has to be converted into a structure that matches the one used to train the models. In many cases there is no perfect mapping between the static data used by a data scientist and the runtime state available during event processing. The featurization tries to build an input sample description as close as possible to the one used in the model training process.

Figure 16: FeaturizeTransactions

4.1.6 EvaluateModel (H2OEvaluateModel)

Once the incoming transaction is transformed into features, it can be processed by the models. In the accelerator the featurized transactions are processed by ultra-fast models generated with H2O. In the generic case there could even be several alternative models deployed at the same time for routed or broadcast execution.

Figure 17: H2OEvaluateModel

The logic in the flow is simple. The incoming features are adapted to the model operator interface.

4.2 Transport Binding

4.2.1 KafkaWiredProcessTransaction

The event processor core logic is related to transaction processing. The top-level event flow orchestrates the Kafka message exchange and exposes notification flows for other features.

Figure 18: KafkaWiredProcessTransaction

The transaction processor consumes messages from the Kafka bus. The transactions are evaluated using the core logic to obtain offers for the customer and to categorize the transaction items. The processing results are sent back to the caller as an offering.

KafkaWiredProcessTransaction is the top-level event flow orchestrating the transport binding and the actual logic execution. The event flow calls the main processing logic and passes the transport-related information as a capture group. This data is transparent to the underlying implementation, but it is required to properly send responses to the incoming messages and to commit the transactions. The event flow offers two output streams intended for synchronous event consumption:

- Transactions
- Categories

Figure 19: KafkaWiredProcessTransaction Transactions

The Transactions output stream emits the transaction information with the model evaluation results. The output stream is used to:

- update LDM tables
- report events to Flume
- track prepared offers

The Categories output stream captures the categorized transaction information. It emits a tuple for each transaction line.

Figure 20: KafkaWiredProcessTransaction Categories

The stream is consumed by:

- offer acceptance tracking
- hot categories tracking (currently not implemented)

4.2.2 KafkaConsumeTransactions

Kafka consumption has been simplified in this version of the accelerator. A single consumer handles all the partitions of the Transactions topic. The consumer is statically configured to connect to a known broker list.

At startup the flow is inactive. The subscription is opened once the coordination module decides that all the models and configuration settings have been read. This is done in order to avoid processing data with a partial configuration.

The process reads the topic metadata from ZooKeeper. Then for each partition it retrieves the recent consumption offset and activates the subscriber. The flow reads messages from the broker and, before emitting events for processing, interprets the opaque content:

- the XML payload is adapted to a StreamBase-compliant format and then to a tuple
- the header is parsed and exposed for transport-related handling

Figure 21: KafkaConsumeTransactions

4.2.3 KafkaProduceNotifications

Sending messages is much simpler than consuming them. The flow renders the payload XML StreamBase-style and transforms it to the schema defined in the interface specification. Then the message is sent out according to the data passed in the transport header provided by the consuming module.

Figure 22: KafkaProduceNotifications

4.2.4 KafkaAcknowledgeTransaction

Transaction acknowledgement in Kafka is simple. One has to save the last processed offset in a shared location, in this case in a ZooKeeper node.

Figure 23: KafkaAcknowledgeTransaction

4.3 Persistent Runtime State

In the accelerator the runtime state for the main transaction processing logic is maintained in HBase. This is a pluggable component and, as long as the contract is respected, HBase may be replaced with any technology. A similar feature can be implemented for example with TIBCO ActiveSpaces. The main advantage of HBase over ActiveSpaces is its durability focus. Also, the HBase API allows for a much lighter communication protocol and lower coupling between components.

4.3.1 HBaseCustomerHistory

In order to retrieve the customer's past transactions, a Get operation is executed. The operation is done with the MaxVersions attribute set to a high value; therefore all transactions stored by HBase are retrieved. It has been assumed that the solution should be duplicate-tolerant. There can be multiple entries for the same transaction, but the initial design states that the content for a given transaction id is always the same. This way it is enough to retrieve only the unique records. The lookup by primary key uses region server routing, therefore the operation scales linearly with the HBase cluster size.
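The duplicate-tolerant retrieval can be sketched with a plain Python list standing in for the HBase version log (the record layout is an illustrative assumption). Because entries for the same transaction id are assumed identical, keeping one record per id is sufficient:

```python
def unique_history(version_log):
    """Collapse an HBase-style version log to one record per transaction id."""
    seen = {}
    for record in version_log:
        # duplicates carry identical content, so the first occurrence wins
        seen.setdefault(record["txn_id"], record)
    return list(seen.values())

log = [
    {"txn_id": "t1", "amount": 10.0},
    {"txn_id": "t2", "amount": 4.0},
    {"txn_id": "t1", "amount": 10.0},   # duplicate write of the same transaction
]
history = unique_history(log)
```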

Figure 24: HBaseCustomerHistory

4.3.2 HBaseAddCustomerTransaction

The counterpart of past transaction retrieval is appending a transaction to the customer's history. In HBase this is simple: updating a field with version tracking is equivalent to appending an entry to the version log. The update by primary key uses region server routing. As with the lookup, the operation scales linearly with the HBase cluster size.

Figure 25: HBaseAddCustomerTransaction

4.4 Configuration Loading and Change Monitoring

The solution uses ZooKeeper to store the global configuration. ZooKeeper is a cluster-wide source of truth. It prevents the uncontrolled configuration corruption that may happen during a split-brain. All node changes are atomic, i.e. the consumers can see only full updates. An important characteristic of ZooKeeper is that a consumer always sees the last value, but may miss the intermediate ones. For global setting management this is perfectly acceptable.

In the solution the asynchronous API is used to retrieve the data. That means the process registers for change notifications and reads the value. If the node does not exist, it is treated as if it were empty. In this release the configuration is monitored using a separate connection for each monitored node.

4.4.1 MaintainCategories

The categories are kept in a file in HDFS. The file is pointed to by the content of a z-node. During startup, and whenever the z-node changes (even to the same content), the associated query table is cleared and filled with the content of the product catalogue.

Figure 26: MaintainCategories

4.4.2 MaintainFeatures

Features follow the same structure as the product-category mapping. The z-node points to the location in HDFS where the current feature list is defined. On startup, and whenever the observed node changes, the shared query table is cleared and refilled with the content.

Figure 27: MaintainFeatures

4.4.3 H2OMaintainModel

The model maintenance is realized in a slightly different way than the category mapping and the feature list. The z-node keeps a list of model sets, one file pointer per line. The observer process reads all the files and builds the metadata list. This list is then passed to H2OEvaluateModel, which updates the operator.

Figure 28: H2OMaintainModel

4.4.4 CoordinateStartup

The ZooKeeper observers are asynchronous. That means there is no guarantee that the system is fully configured during the init phase. In order to avoid processing messages with a partially configured solution, the subscription should be started only once the configuration has been applied. To achieve this we need coordination of the ready messages coming from independent parts. The process is connected to the maintenance flows via container connections. Once all three inputs report success, the ready state is released.

Figure 29: CoordinateStartup
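CoordinateStartup is essentially a barrier over three asynchronous readiness signals: the Kafka subscription opens only after every configuration source has loaded. A minimal sketch (the signal names are illustrative):

```python
class StartupGate:
    """Release the Kafka subscription only after every config source has loaded."""
    REQUIRED = {"categories", "features", "models"}

    def __init__(self):
        self.ready_parts = set()
        self.released = False

    def report_ready(self, part):
        if part not in self.REQUIRED:
            raise ValueError(f"unknown config source: {part}")
        self.ready_parts.add(part)
        if self.ready_parts == self.REQUIRED:
            self.released = True    # here the real flow would activate the subscriber
        return self.released

gate = StartupGate()
gate.report_ready("categories")
gate.report_ready("models")
ok = gate.report_ready("features")   # last source in: subscription may start
```

Because the ZooKeeper observers can fire in any order, the gate tracks a set rather than a sequence; a repeated signal from the same source is harmless.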

5 Data Analytics - Spotfire

TIBCO's Accelerator for Apache Spark meets a customer-service use case, where we want to understand our sales and to create models that we can later deploy in real time to send promotions for specific products to our customers. For this we run a classification model to identify customers who are likely to say "Yes" to an offer of a particular product. This type of model adapts to many other use cases, for example financial crime detection or prediction of machine failure, or in general any time you want to distinguish two types of records from each other. You can use this accelerator in those use cases as well.

The example file aims at aiding three different tasks, each made simple by easy-to-use controls. In the demonstration scenario all parts are handled by a single visualization; in real projects there will most likely be separate sites dedicated to the various tasks.

5.1 Discover Big Data

The first section is called Discover Big Data. It serves as an environment for answering business questions in a visual way, covering the needs of Big Data reporting and discovery. This section is composed of 6 pages. All aggregations are delegated to the Big Data cluster running Apache Spark.

5.1.1 Totals

The top of this page shows a preview of the data, a sample of 10K lines and their content. Such a preview can be useful for inspiring strategies for analysing the data. Below, we show some KPIs and how they evolve over time. By clicking on the X and Y axis selectors, the user can choose different KPIs.

Figure 30: Spotfire: Discover: Totals

5.1.2 Discover Big Data: Drill-down

Figure 31: Drill-down

This page proposes a drill into the data. There are four market segments in the data. When the user chooses some or all of them in the pie chart, a tree map appears that subdivides the respective total revenue by product. When one or many products are selected, a time series of the respective revenues appears below. The user can navigate the different dimensions of the data this way, or choose different dimensions in any of the visualisations.

5.1.3 Categories

To achieve a better understanding of the data, some more detail is required. Again as a way of discovering the shape of the data, a drill-down by product categories is offered here. At the top, a tree map shows the importance in sales and price of each product. The visualisations at the bottom show some KPIs, both current and over time. By default they encompass the whole business, but they respond to the choice of one or many products in the tree map.

Figure 32: Spotfire: Discover: Categories

5.1.4 Basket Analysis

Here, upon making a choice in the left-hand list, we get a tree map that shows the importance of all other products sold in the same baskets as the chosen product. This is a nice way of perceiving which products customers buy together, and it can help in understanding which variables should be included in models. The controls on the right allow choosing different metrics and dimensions.

Figure 33: Spotfire: Discover: Basket Analysis
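The aggregation behind this page can be sketched as follows (a simplified Python illustration with made-up products and revenues; in the accelerator the aggregation is delegated to Spark):

```python
from collections import Counter

def basket_co_occurrence(baskets, chosen):
    """For the chosen product, total the revenue of every other
    product sold in the baskets that contain it - the numbers the
    basket-analysis tree map visualizes."""
    totals = Counter()
    for basket in baskets:                    # basket: {product: revenue}
        if chosen in basket:
            for product, revenue in basket.items():
                if product != chosen:
                    totals[product] += revenue
    return totals

baskets = [
    {"wine": 12.0, "cheese": 4.0, "bread": 1.5},
    {"wine": 9.0, "cheese": 3.0},
    {"beer": 6.0, "bread": 1.5},
]
print(basket_co_occurrence(baskets, "wine"))
# cheese dominates the tree map for 'wine': it appears in both wine baskets
```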

5.1.5 Client Cross-Sell

This page helps us understand customer taste: what types of products do clients buy, regardless of whether in the same basket or not? Similar to the previous page, it shows the products that clients who bought the chosen product have also bought, whether in the same basket or at any moment in time. This is useful when designing cross-sell and up-sell campaigns.

Figure 34: Spotfire: Discover: Client CrossSell

5.1.6 Geos

Geospatial analysis is an important aspect of data processing. Spotfire allows users to display aggregated data in order to understand spatial relationships and geographical coverage. It is possible to locate the shops that sell given products best, analyse customer trends by region, and understand performance. This page shows how revenue and average price are distributed by shop and by region. It leverages Spotfire's ability to draw great maps.

Figure 35: Spotfire: Discover: Geos

5.1.7 Play-page

This page provides a custom playground for users. Load one month of data into memory; you can choose which month by using the prompt. Use the recommendation engine to pick the right visualisation to answer new business questions, then replicate the desired visualisation on in-database data. This page can be duplicated as many times as required.

Figure 36: Spotfire: Discover: Play Page

5.2 Model Big Data

The second section of the Spotfire component is called Model Big Data. It is made of 5 pages that support the business in the task of modelling Big Data. The Accelerator supports the Random Forest classification model, a type of model that is valid on any type of data and can therefore be run by a business person. The goal is to build models that support promotions of a particular product or group of products.

The accelerator applies the H2O DRF algorithm. H2O is particularly effective for the presented case because it can train models on Big Data scale datasets, integrates nicely with Spark, and produces extremely fast runtime models.

5.2.1 Preparation

Before the models can be trained, the user has to define the input data for the model. The model training algorithms expect the data in reduced form, as so-called features: every sample (in our case, a customer) is described by a uniform set of variables. The calculation of these variables is parameterized by user-selected settings. In the provided example, the customer is described by past purchases in each category, plus a response label that tells whether the customer made any purchase in the categories of interest in the following months.

The user decides which months to use for training the model and which months contain the response purchases. For training, we recommend taking enough months to encompass at least one full seasonal cycle, for example one full year. Very old data may be less relevant to current customer behaviour; if that is the case, one may not want to include much of it. For testing, at least 1 to 3 months of data should be used, preferably the most recent.

Figure 37: Spotfire: Model: Prep
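The feature layout described above can be sketched like this (a simplified Python illustration with invented categories and data; the real feature build runs as a Spark job):

```python
from collections import defaultdict

def build_training_set(transactions, train_months, response_months, target_categories):
    """Reduce raw transactions to one feature row per customer:
    past spend per category over the training months, plus a 0/1
    response label for purchases of the target categories in the
    response months."""
    spend = defaultdict(lambda: defaultdict(float))   # customer -> category -> spend
    label = defaultdict(int)                          # customer -> bought target later?
    for cust, month, category, amount in transactions:
        if month in train_months:
            spend[cust][category] += amount
        if month in response_months and category in target_categories:
            label[cust] = 1
    categories = sorted({c for _, _, c, _ in transactions})
    rows = []
    for cust in sorted(spend):
        features = [spend[cust][c] for c in categories]
        rows.append((cust, features, label[cust]))
    return categories, rows

txns = [
    ("c1", "2016-01", "beer", 10.0),
    ("c1", "2016-02", "wine", 5.0),
    ("c2", "2016-01", "wine", 7.0),
    ("c1", "2016-03", "wine", 12.0),   # purchase in the response period
]
cats, rows = build_training_set(txns, {"2016-01", "2016-02"}, {"2016-03"}, {"wine"})
print(cats)   # ['beer', 'wine']
print(rows)   # [('c1', [10.0, 5.0], 1), ('c2', [0.0, 7.0], 0)]
```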

5.2.2 Training

Once the model training reference data has been selected, the actual model training is executed. The user names the group of products to be modelled, uses the central list to select the products to make a promotion for, and launches the training job in the Spark cluster. The data defined by the user are collected and passed to the cluster for execution.

The actual process can be long. The user may check the job's progress on the Spark web UI and track the checkpoints in the dashboard. In the presented demonstration, the model training job produces POJOs (H2O real-time execution models) and collects the information provided by the H2O engine. When the process is finished and the job is done, the models are available for inspection and deployment. The user should then press the Refresh results button; when this button is pressed, Spotfire reaches out to Spark via TERR to obtain the latest results of the model-building exercise.

As the outcome of the training process the following datasets are created:

- results - model training job results; for each model training job there is a tab-separated text file containing an information line for each generated model
- pojo - generated H2O POJO classes; the directory contains a subdirectory for each model training job
- roc - directory storing the ROC points generated by H2O for each training job
- varimp - variable importance scores obtained from the model training jobs
- sets - directory containing model metadata as tab-separated files describing each deployable model and its parameters

These results are analysed in the following pages.

Figure 38: Spotfire: Model: Training in Spark

5.2.3 Model quality check

On the left-hand pane, the user chooses which model to evaluate: the current model or any past model. The choice populates the chart with the respective ROC curve.

Evaluating model quality involves seeing whether its results allow better decisions than a random choice, e.g. than tossing a coin. The model in the accelerator aims at separating one type of client from the remainder, namely the ones who may be interested in the chosen product. If, for any given client, we chose their type at random, the model's ROC curve (Receiver Operating Characteristic) would likely be close to the red line in the chart. If the model were perfect and got every decision right, its ROC curve would be given by the green line. The blue line gives the actual ROC curve of the chosen model. The total area below the blue line is called the AUC (Area Under the Curve) and measures how much better the current model is than a choice made at random (represented by the red line).

The left-hand table shows the AUC of all models, which gives the user an idea of how good models are expected to be. Models with a large enough AUC can be approved by the user. Previously approved models can also have their approval revoked on this page. All following pages show only approved models. It is important to bear in mind that approval of a model should not be final before the variable importance is analysed, which happens on the next page.

Figure 39: Spotfire: Model: Evaluate Quality

5.2.4 Variable importance

On the left-hand pane, the user chooses which model to continue analysing. Only previously approved models appear here. By default, the models use all available data to understand what drives the purchases of the modelled product. Some products are better drivers of a specific promotion than others.
The chart is used to understand the relationship between your products and customer preferences by identifying the most important predictors. Go back to your Discover Clients' Taste page to validate the discovery.
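The ranking behind the chart can be reproduced from the varimp scores (a Python sketch; the feature names and scaled importances below are hypothetical):

```python
def rank_features(varimp, top=3):
    """Sort features by importance, the way the variable-importance
    chart ranks predictors, and keep the strongest ones."""
    ranked = sorted(varimp.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]

# Hypothetical scaled importances from one model-training job.
varimp = {"spend_wine": 1.00, "spend_cheese": 0.62,
          "spend_beer": 0.07, "spend_soap": 0.01}
print(rank_features(varimp, top=2))
# [('spend_wine', 1.0), ('spend_cheese', 0.62)]
```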

This type of consideration is more important in some use cases than in others. In more sophisticated cases, the variable importance discovered using one model may be used to provide better training parameters for another model. In fact, a combination of visualizing the ranking of the features and the detail of the individual features is important for a number of reasons:

- Validation of the model's quality. Maybe your best feature is so good because it is part of the answer and should therefore be excluded.
- Validation of the data's quality. Were you expecting a different feature to have more power than it shows? Perhaps there are data quality issues causing a lack of relevance, or maybe outliers introduced a bias. These quality issues can be quickly spotted in a visualization, for example a histogram of the numeric features.
- Correlation is not causation. It is necessary to ask questions that lead to a better understanding of the reality being predicted.
- Surprising top features. Sometimes predictors expected to be irrelevant turn out to have huge predictive ability. This knowledge, when shared with the business, will inevitably lead to better decisions.
- Inspiration for new features. Sometimes the most informative features are a reason to delve into new related information as a source of other rich features.
- Computational efficiency. Features with very low predictive power can be removed from the model as long as the prediction accuracy on the test dataset stays high. This ensures a more lightweight model with a higher degree of freedom, better interpretability, and potentially faster calculations when applying it to current data, in batch or real time.

It is important to bear in mind that approval of a model should not be final before the variable importance page is analysed. If any issues are spotted, the user can revoke previously approved models.

Figure 40: Spotfire: Model: Variable Importance

5.2.5 Discrimination threshold selection

This page is entirely optional. When a model runs in real time, it calculates a measure of how likely a given customer is to say yes to a promotion of your specific product. To decide whether to send him or her a promotion, this metric is compared against a Threshold. This Threshold is defined by default to maximise the F1-score. The F1-score balances two desirable characteristics this type of model can have:

- Precision: of all the people the model would send a promotion to, what proportion accepts it;
- Recall: of all the people that would have said yes to a promotion, how many the model recognised.

F1 weighs these two dimensions equally. If you are happy with this choice, you can ignore this page. However, the user may have their own way of defining a desired Threshold and can use this page to set it. For example, they may want to maximise just precision or just recall, or to weigh them differently. Table 2a can be used to select other known model performance metrics. In 2b, one may select a Threshold manually. This is useful if it is important to control the proportion of customers identified as targets, for example when this must be weighed against the size of a team who will manually treat each case (e.g. telemarketing calls). The Proportion Selection (% of customer base) figure populates against this choice. In 2c, you may create your own model performance metric, for example by attributing a monetary cost to sending a promotion that is not converted and/or a monetary gain to a promotion that is converted. You can do this by typing your own formula in "here" on the Y-axis of the chart and then selecting the Threshold that maximises it. All the data needed for a custom calculation is available in the data that feeds the chart. In area 3, the user chooses the Threshold of choice and saves it by pressing Use.
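The default Threshold choice can be sketched as follows (a Python illustration of the F1 maximisation over candidate thresholds; the scores and labels are made up):

```python
def precision_recall_f1(scores, labels, threshold):
    """Confusion counts for one candidate threshold: a customer is
    targeted when the model score reaches the threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def best_f1_threshold(scores, labels):
    """The default behaviour described above: pick the threshold
    that maximises the F1-score over the observed scores."""
    return max(set(scores),
               key=lambda t: precision_recall_f1(scores, labels, t)[2])

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # model output per customer
labels = [1,   1,   0,   1,   0,   0  ]   # 1 = said yes to the promotion
t = best_f1_threshold(scores, labels)
print(t, precision_recall_f1(scores, labels, t))
```

Maximising just precision or just recall, as the page allows, amounts to swapping the key function in the same search.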

Figure 41: Spotfire: Model: Custom Threshold

5.3 Design and execute marketing campaigns

This final part is made of 2 pages that support the business in running marketing campaigns that deploy the models learnt in the previous sections. Each model targets one product. What is deployed to the event processing layer are model sets that we call campaigns, or marketing campaigns. Campaigns launch promotions for groups of products at once by bundling models and their respective thresholds together.

5.3.1 Campaign bundling

The produced models can be composed together into a co-deployed bundle. Here you can bundle existing models into a new campaign and name your campaign. Alternatively, you can load a past campaign and revise it, by adding new models or thresholds to it or by removing past models.

Sections 1 and 2 of this page require user action, whilst the remainder just provides information. In Section 1, the user either creates a new campaign, which takes the name he/she gives it just below, or loads an existing campaign for analysis by choosing one from table a) to the immediate right. The models that are part of the new or existing campaign appear in table b) on the right-hand middle section of the page. The user can then use Section 2 to change the models that are part of the campaign, either by adding new models collected from table c) or by deleting existing models from the current campaign. When done, the user can save the new settings of the campaign. The Refresh available model list button at the bottom ensures that all recently run models appear in list c).

Figure 42: Spotfire: Deploy: Bundle Models into Campaigns
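A campaign bundle can be sketched as a named list of (model, threshold) pairs serialized to a tab-separated metadata file, in the spirit of the files kept in the sets directory (a Python illustration; the column names and model identifiers are assumptions, not the accelerator's exact format):

```python
import csv
import io

def bundle_campaign(name, models):
    """Serialize a campaign - a named bundle of (model id, threshold)
    pairs - as tab-separated metadata."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["campaign", "model", "threshold"])
    for model_id, threshold in models:
        writer.writerow([name, model_id, threshold])
    return out.getvalue()

campaign = bundle_campaign("spring-promo",
                           [("wine-drf-20160801", 0.4),
                            ("beer-drf-20160802", 0.55)])
print(campaign)
```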

5.3.2 Campaign deployment

This page connects you to the real-time event processing engine. Here you can see the names of the campaigns that are currently running in real time and inspect their underlying models and thresholds. You can also launch a new campaign.

The left-hand side of this page allows user action, whilst the right presents the resulting information. The button Which campaigns are currently running in real time? shows the names of the campaigns that are running now; these names appear as StreamBase sees them. The button Refresh list of available campaigns updates table a) so that it includes all past campaigns, including the ones that have just been created. When the user chooses a campaign from this table, table b) shows the models that are part of it. Finally, the button Deploy the new selected campaign can be pressed to launch a new campaign in real time.

Figure 43: Spotfire: Model: Launch Your Campaigns

More information

Data Management Glossary

Data Management Glossary Data Management Glossary A Access path: The route through a system by which data is found, accessed and retrieved Agile methodology: An approach to software development which takes incremental, iterative

More information

Wired Network Summary Data Overview

Wired Network Summary Data Overview Wired Network Summary Data Overview Cisco Prime Infrastructure 3.1 Job Aid Copyright Page THE SPECIFICATIONS AND INFORMATION REGARDING THE PRODUCTS IN THIS MANUAL ARE SUBJECT TO CHANGE WITHOUT NOTICE.

More information

Practical Machine Learning Agenda

Practical Machine Learning Agenda Practical Machine Learning Agenda Starting From Log Management Moving To Machine Learning PunchPlatform team Thales Challenges Thanks 1 Starting From Log Management 2 Starting From Log Management Data

More information

Accounts Payable Workflow Guide. Version 14.6

Accounts Payable Workflow Guide. Version 14.6 Accounts Payable Workflow Guide Version 14.6 Copyright Information Copyright 2017 Informa Software. All Rights Reserved. No part of this publication may be reproduced, transmitted, transcribed, stored

More information

Security and Performance advances with Oracle Big Data SQL

Security and Performance advances with Oracle Big Data SQL Security and Performance advances with Oracle Big Data SQL Jean-Pierre Dijcks Oracle Redwood Shores, CA, USA Key Words SQL, Oracle, Database, Analytics, Object Store, Files, Big Data, Big Data SQL, Hadoop,

More information

vcenter Operations Manager for Horizon View Administration

vcenter Operations Manager for Horizon View Administration vcenter Operations Manager for Horizon View Administration vcenter Operations Manager for Horizon View 1.5 vcenter Operations Manager for Horizon View 1.5.1 This document supports the version of each product

More information

One Identity Active Roles 7.2. Replication: Best Practices and Troubleshooting Guide

One Identity Active Roles 7.2. Replication: Best Practices and Troubleshooting Guide One Identity Active Roles 7.2 Replication: Best Practices and Troubleshooting Copyright 2017 One Identity LLC. ALL RIGHTS RESERVED. This guide contains proprietary information protected by copyright. The

More information

Architectural challenges for building a low latency, scalable multi-tenant data warehouse

Architectural challenges for building a low latency, scalable multi-tenant data warehouse Architectural challenges for building a low latency, scalable multi-tenant data warehouse Mataprasad Agrawal Solutions Architect, Services CTO 2017 Persistent Systems Ltd. All rights reserved. Our analytics

More information

Oracle Endeca Information Discovery

Oracle Endeca Information Discovery Oracle Endeca Information Discovery Glossary Version 2.4.0 November 2012 Copyright and disclaimer Copyright 2003, 2013, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered

More information

Toad Data Point - Professional Edition. The Toad Data Point Professional edition includes the following new features and enhancements.

Toad Data Point - Professional Edition. The Toad Data Point Professional edition includes the following new features and enhancements. Toad Data Point 4.2 New in This Release Thursday, April 13, 2017 Contents Toad Data Point Professional Edition Toad Data Point - Base and Professional Editions Toad Data Point - Professional Edition The

More information

Talend Big Data Sandbox. Big Data Insights Cookbook

Talend Big Data Sandbox. Big Data Insights Cookbook Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is

More information

Apache Flink. Alessandro Margara

Apache Flink. Alessandro Margara Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate

More information

SAP Edge Services, cloud edition Edge Services Predictive Analytics Service Guide Version 1803

SAP Edge Services, cloud edition Edge Services Predictive Analytics Service Guide Version 1803 SAP Edge Services, cloud edition Edge Services Predictive Analytics Service Guide Version 1803 Table of Contents MACHINE LEARNING AND PREDICTIVE ANALYTICS... 3 Model Trained with R and Exported as PMML...

More information

PTC Windchill Quality Solutions Extension for ThingWorx Guide

PTC Windchill Quality Solutions Extension for ThingWorx Guide PTC Windchill Quality Solutions Extension for ThingWorx Guide Copyright 2016 PTC Inc. and/or Its Subsidiary Companies. All Rights Reserved. User and training guides and related documentation from PTC Inc.

More information

Data Acquisition. The reference Big Data stack

Data Acquisition. The reference Big Data stack Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The reference

More information

HYPERION SYSTEM 9 PERFORMANCE SCORECARD

HYPERION SYSTEM 9 PERFORMANCE SCORECARD HYPERION SYSTEM 9 PERFORMANCE SCORECARD RELEASE 9.2 NEW FEATURES Welcome to Hyperion System 9 Performance Scorecard, Release 9.2. This document describes the new or modified features in this release. C

More information

WHITE PAPER. Reference Guide for Deploying and Configuring Apache Kafka

WHITE PAPER. Reference Guide for Deploying and Configuring Apache Kafka WHITE PAPER Reference Guide for Deploying and Configuring Apache Kafka Revised: 02/2015 Table of Content 1. Introduction 3 2. Apache Kafka Technology Overview 3 3. Common Use Cases for Kafka 4 4. Deploying

More information

Hortonworks DataFlow Sam Lachterman Solutions Engineer

Hortonworks DataFlow Sam Lachterman Solutions Engineer Hortonworks DataFlow Sam Lachterman Solutions Engineer 1 Hortonworks Inc. 2011 2017. All Rights Reserved Disclaimer This document may contain product features and technology directions that are under development,

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

Quest Migration Manager for Exchange Resource Kit User Guide

Quest Migration Manager for Exchange Resource Kit User Guide Quest Migration Manager for Exchange 8.14 Resource Kit User Guide 2017 Quest Software Inc. ALL RIGHTS RESERVED. This guide contains proprietary information protected by copyright. The software described

More information

TIBCO BusinessEvents Extreme. System Sizing Guide. Software Release Published May 27, 2012

TIBCO BusinessEvents Extreme. System Sizing Guide. Software Release Published May 27, 2012 TIBCO BusinessEvents Extreme System Sizing Guide Software Release 1.0.0 Published May 27, 2012 Important Information SOME TIBCO SOFTWARE EMBEDS OR BUNDLES OTHER TIBCO SOFTWARE. USE OF SUCH EMBEDDED OR

More information

Provisioning an Ethernet Private Line (EPL) Virtual Connection

Provisioning an Ethernet Private Line (EPL) Virtual Connection Provisioning an Ethernet Private Line (EPL) Virtual Connection Cisco EPN Manager 2.0 Job Aid Copyright Page THE SPECIFICATIONS AND INFORMATION REGARDING THE PRODUCTS IN THIS MANUAL ARE SUBJECT TO CHANGE

More information

Logi Info v12.5 WHAT S NEW

Logi Info v12.5 WHAT S NEW Logi Info v12.5 WHAT S NEW Introduction Logi empowers companies to embed analytics into the fabric of their organizations and products enabling anyone to analyze data, share insights, and make informed

More information

Qlik Sense Enterprise architecture and scalability

Qlik Sense Enterprise architecture and scalability White Paper Qlik Sense Enterprise architecture and scalability June, 2017 qlik.com Platform Qlik Sense is an analytics platform powered by an associative, in-memory analytics engine. Based on users selections,

More information

Spotfire: Brisbane Breakfast & Learn. Thursday, 9 November 2017

Spotfire: Brisbane Breakfast & Learn. Thursday, 9 November 2017 Spotfire: Brisbane Breakfast & Learn Thursday, 9 November 2017 CONFIDENTIALITY The following information is confidential information of TIBCO Software Inc. Use, duplication, transmission, or republication

More information

PeopleSoft 9.1 PeopleBook: Events and Notifications Framework

PeopleSoft 9.1 PeopleBook: Events and Notifications Framework PeopleSoft 9.1 PeopleBook: Events and Notifications Framework March 2012 PeopleSoft 9.1 PeopleBook: Events and Notifications Framework SKU hcm91fp2eewh-b0312 Copyright 1988, 2012, Oracle and/or its affiliates.

More information

Comprehensive Guide to Evaluating Event Stream Processing Engines

Comprehensive Guide to Evaluating Event Stream Processing Engines Comprehensive Guide to Evaluating Event Stream Processing Engines i Copyright 2006 Coral8, Inc. All rights reserved worldwide. Worldwide Headquarters: Coral8, Inc. 82 Pioneer Way, Suite 106 Mountain View,

More information

Panopticon Designer, Server & Streams Release Notes. Version 17.0

Panopticon Designer, Server & Streams Release Notes. Version 17.0 Panopticon Designer, Server & Streams Release Notes Version 17.0 Datawatch Corporation makes no representation or warranties with respect to the contents of this manual or the associated software and especially

More information

DataCollect Administrative Tools Supporting DataCollect (CMDT 3900) Version 3.0.0

DataCollect Administrative Tools Supporting DataCollect (CMDT 3900) Version 3.0.0 Administrator Manual DataCollect Administrative Tools Supporting DataCollect (CMDT 3900) Version 3.0.0 P/N 15V-090-00054-100 Revision A SKF is a registered trademark of the SKF Group. All other trademarks

More information

UMP Alert Engine. Status. Requirements

UMP Alert Engine. Status. Requirements UMP Alert Engine Status Requirements Goal Terms Proposed Design High Level Diagram Alert Engine Topology Stream Receiver Stream Router Policy Evaluator Alert Publisher Alert Topology Detail Diagram Alert

More information

Validating Service Provisioning

Validating Service Provisioning Validating Service Provisioning Cisco EPN Manager 2.1 Job Aid Copyright Page THE SPECIFICATIONS AND INFORMATION REGARDING THE PRODUCTS IN THIS MANUAL ARE SUBJECT TO CHANGE WITHOUT NOTICE. ALL STATEMENTS,

More information

Qualys Cloud Suite 2.28

Qualys Cloud Suite 2.28 Qualys Cloud Suite 2.28 We re excited to tell you about improvements and enhancements in Qualys Cloud Suite 2.28. AssetView ThreatPROTECT View Policy Compliance Summary in Asset Details Export Dashboards

More information

TIBCO Spotfire Analytics Investments

TIBCO Spotfire Analytics Investments TIBCO Spotfire Analytics Investments Smart Visual Analytics Be first to insight and first to action Analytics Apps at Scale Build and broadcast smart analytics Inline Data Wrangling Immerse yourself in

More information

Analytics Driven, Simple, Accurate and Actionable Cyber Security Solution CYBER ANALYTICS

Analytics Driven, Simple, Accurate and Actionable Cyber Security Solution CYBER ANALYTICS Analytics Driven, Simple, Accurate and Actionable Cyber Security Solution CYBER ANALYTICS Overview Cyberattacks are increasingly getting more frequent, more sophisticated and more widespread than ever

More information

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous

More information

Toad Intelligence Central 3.3 New in This Release

Toad Intelligence Central 3.3 New in This Release Toad Intelligence Central 3.3 New in This Release Tuesday, March 28, 2017 This release of Toad Intelligence Central includes the following new features and enhancements. Toad Data Point Enter Variable

More information

FAQs. Business (CIP 2.2) AWS Market Place Troubleshooting and FAQ Guide

FAQs. Business (CIP 2.2) AWS Market Place Troubleshooting and FAQ Guide FAQs 1. What is the browser compatibility for logging into the TCS Connected Intelligence Data Lake for Business Portal? Please check whether you are using Mozilla Firefox 18 or above and Google Chrome

More information

What s New in Spotfire DXP 1.1. Spotfire Product Management January 2007

What s New in Spotfire DXP 1.1. Spotfire Product Management January 2007 What s New in Spotfire DXP 1.1 Spotfire Product Management January 2007 Spotfire DXP Version 1.1 This document highlights the new capabilities planned for release in version 1.1 of Spotfire DXP. In this

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Paper SAS Taming the Rule. Charlotte Crain, Chris Upton, SAS Institute Inc.

Paper SAS Taming the Rule. Charlotte Crain, Chris Upton, SAS Institute Inc. ABSTRACT Paper SAS2620-2016 Taming the Rule Charlotte Crain, Chris Upton, SAS Institute Inc. When business rules are deployed and executed--whether a rule is fired or not if the rule-fire outcomes are

More information

Big-Data Pipeline on ONTAP and Orchestration with Robin Cloud Platform

Big-Data Pipeline on ONTAP and Orchestration with Robin Cloud Platform Technical Report Big-Data Pipeline on ONTAP and Orchestration with Robin Cloud Platform Ranga Sankar, Jayakumar Chendamarai, Aaron Carter, David Bellizzi, NetApp July 2018 TR-4706 Abstract This document

More information

One Identity Starling Identity Analytics & Risk Intelligence. User Guide

One Identity Starling Identity Analytics & Risk Intelligence. User Guide One Identity Starling Identity Analytics & Risk Intelligence User Guide Copyright 2019 One Identity LLC. ALL RIGHTS RESERVED. This guide contains proprietary information protected by copyright. The software

More information

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Table of contents Faster Visualizations from Data Warehouses 3 The Plan 4 The Criteria 4 Learning

More information

CA GovernanceMinder. CA IdentityMinder Integration Guide

CA GovernanceMinder. CA IdentityMinder Integration Guide CA GovernanceMinder CA IdentityMinder Integration Guide 12.6.00 This Documentation, which includes embedded help systems and electronically distributed materials, (hereinafter referred to as the Documentation

More information

Enabling Data Governance Leveraging Critical Data Elements

Enabling Data Governance Leveraging Critical Data Elements Adaptive Presentation at DAMA-NYC October 19 th, 2017 Enabling Data Governance Leveraging Critical Data Elements Jeff Goins, President, Jeff.goins@adaptive.com James Cerrato, Chief, Product Evangelist,

More information

Decision Manager Help. Version 7.1.7

Decision Manager Help. Version 7.1.7 Version 7.1.7 This document describes products and services of Pegasystems Inc. It may contain trade secrets and proprietary information. The document and product are protected by copyright and distributed

More information

TIBCO Spotfire Automation Services

TIBCO Spotfire Automation Services Software Release 7.11 LTS November 2017 Two-Second Advantage 2 Important Information SOME TIBCO SOFTWARE EMBEDS OR BUNDLES OTHER TIBCO SOFTWARE. USE OF SUCH EMBEDDED OR BUNDLED TIBCO SOFTWARE IS SOLELY

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Service Manager. Database Configuration Guide

Service Manager. Database Configuration Guide Service Manager powered by HEAT Database Configuration Guide 2017.2.1 Copyright Notice This document contains the confidential information and/or proprietary property of Ivanti, Inc. and its affiliates

More information

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's Building Agile and Resilient Schema Transformations using Apache Kafka and ESB's Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's Ricardo Ferreira

More information

ActiveSpaces Transactions. Quick Start Guide. Software Release Published May 25, 2015

ActiveSpaces Transactions. Quick Start Guide. Software Release Published May 25, 2015 ActiveSpaces Transactions Quick Start Guide Software Release 2.5.0 Published May 25, 2015 Important Information SOME TIBCO SOFTWARE EMBEDS OR BUNDLES OTHER TIBCO SOFTWARE. USE OF SUCH EMBEDDED OR BUNDLED

More information