Accelerator for Apache Spark. Functional Specification. 23 August 2016. Version 1.0.0


Accelerator for Apache Spark Functional Specification

TIBCO Software Inc. Global Headquarters, 3303 Hillview Avenue, Palo Alto, CA

23 August 2016, Version 1.0.0

© 2016 TIBCO Software Inc. All rights reserved. TIBCO, the TIBCO logo, The Power of Now, and TIBCO Software are trademarks or registered trademarks of TIBCO Software Inc. in the United States and/or other countries. All other product and company names and marks mentioned in this document are the property of their respective owners and are mentioned for identification purposes only.

This document outlines the functional specification of the components of the Accelerator for Apache Spark.

Revision History

Date         Author             Comments
/04/2016     Piotr Smolinski    Initial version
/04/2016     Piotr Smolinski
/06/2016     Piotr Smolinski
/06/2016     Ana Costa e Silva
/08/2016     Piotr Smolinski    Version for release

Accelerator for Apache Spark Functional Specification

Copyright Notice

COPYRIGHT 2016 TIBCO Software Inc. This document is unpublished and the foregoing notice is affixed to protect TIBCO Software Inc. in the event of inadvertent publication. All rights reserved. No part of this document may be reproduced in any form, including photocopying or transmission electronically to any computer, without prior written consent of TIBCO Software Inc. The information contained in this document is confidential and proprietary to TIBCO Software Inc. and may not be used or disclosed except as expressly authorized in writing by TIBCO Software Inc. Copyright protection includes material generated from our software programs displayed on the screen, such as icons, screen displays, and the like.

Trademarks

Technologies described herein are either covered by existing patents or are the subject of patent applications in progress. All brand and product names are trademarks or registered trademarks of their respective holders and are hereby acknowledged.

Confidentiality

The information in this document is subject to change without notice. This document contains information that is confidential and proprietary to TIBCO Software Inc. and may not be copied, published, or disclosed to others, or used for any purposes other than review, without written authorization of an officer of TIBCO Software Inc. Submission of this document does not represent a commitment to implement any portion of this specification in the products of the submitters.

Content Warranty

The information in this document is subject to change without notice. THIS DOCUMENT IS PROVIDED "AS IS" AND TIBCO MAKES NO WARRANTY, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO ALL WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. TIBCO Software Inc. shall not be liable for errors contained herein or for incidental or consequential damages in connection with the furnishing, performance or use of this material.
For more information, please contact: TIBCO Software Inc., 3303 Hillview Avenue, Palo Alto, CA, USA

Table of Contents

1 Preface
1.1 Purpose of Document
1.2 Scope
1.3 Referenced Documents
2 Architecture
2.1 Components
2.2 Event Processor Flows (Fast Data Story)
2.3 Spotfire Components (Big Data Story)
2.4 LiveView Components (Operations Story)
3 Event Sequencing
3.1 Regular Event Flow
3.2 Data Processing Flow
3.3 Simulation
4 Event Processor - StreamBase
4.1 Core Logic
4.1.1 ProcessTransactionsAndScore
4.1.2 ProcessTransaction
4.1.3 CategorizeTransactions
4.1.4 CategorizeTransaction (DefaultCategorizeTransaction)
4.1.5 FeaturizeTransactions
4.1.6 EvaluateModel (H2OEvaluateModel)
4.2 Transport Binding
4.2.1 KafkaWiredProcessTransaction
4.2.2 KafkaConsumeTransactions
4.2.3 KafkaProduceNotifications
4.2.4 KafkaAcknowledgeTransaction
4.3 Persistent Runtime State
4.3.1 HBaseCustomerHistory
4.3.2 HBaseAddCustomerTransaction
4.4 Configuration Loading and Change Monitoring
4.4.1 MaintainCategories
4.4.2 MaintainFeatures
4.4.3 H2OMaintainModel
4.4.4 CoordinateStartup
5 Data Analytics - Spotfire
5.1 Discover Big Data
5.1.1 Totals
5.1.2 Discover Big Data: Drill-down
5.1.3 Categories
5.1.4 Basket Analysis
5.1.5 Client Cross-Sell
5.1.6 Geos
5.1.7 Play-page
5.2 Model Big Data
5.2.1 Preparation
5.2.2 Training
5.2.3 Model quality check
5.2.4 Variable importance
5.2.5 Discrimination threshold selection
5.3 Design and Execute Marketing Campaigns
5.3.1 Campaign bundling
5.3.2 Campaign deployment
6 Data Access - Spark and H2O
6.1 Data Access and Processing in Spark
6.2 Model Training in Sparkling Water / H2O
7 Events to Data - Flume
7.1 Information to Be Stored
7.2 From Events to Data
7.3 When My Data Is Available
7.3.1 Events
7.3.2 Runtime context
7.3.3 Intermediary storage
7.3.4 Target storage
7.4 Data for Analytics
7.4.1 Data format
7.4.2 Data organization
7.4.3 Enrichment
7.4.4 Tools
8 Insight to Action - ZooKeeper and H2O
9 Event Flow Simulator

Table of Figures

Figure 1: Solution Component Diagram
Figure 2: Regular event flow
Figure 3: Data processing activities
Figure 4: ProcessTransactionAndScore
Figure 5: ProcessTransactionAndScore transactions
Figure 6: ProcessTransactionAndScore notifications
Figure 7: ProcessTransactionAndScore acks
Figure 8: ProcessTransactionAndScore acks
Figure 9: ProcessTransactionAndScore transactionsout
Figure 10: ProcessTransactionAndScore categories
Figure 11: ProcessTransaction
Figure 12: ProcessTransaction Transactions
Figure 13: ProcessTransaction Predictions
Figure 14: CategorizeTransactions
Figure 15: DefaultCategorizeTransaction
Figure 16: FeaturizeTransactions
Figure 17: H2OEvaluateModel
Figure 18: KafkaWiredProcessTransaction
Figure 19: KafkaWiredProcessTransaction Transactions
Figure 20: KafkaWiredProcessTransaction Categories
Figure 21: KafkaConsumeTransactions
Figure 22: KafkaProduceNotifications
Figure 23: KafkaAcknowledgeTransaction
Figure 24: HBaseCustomerHistory
Figure 25: HBaseAddCustomerTransaction
Figure 26: MaintainCategories
Figure 27: MaintainFeatures
Figure 28: H2OMaintainModel
Figure 29: CoordinateStartup
Figure 30: Spotfire: Discover: Totals
Figure 31: Drill-down
Figure 32: Spotfire: Discover: Categories
Figure 33: Spotfire: Discover: Basket Analysis
Figure 34: Spotfire: Discover: Client CrossSell
Figure 35: Spotfire: Discover: Geos
Figure 36: Spotfire: Discover: Play Page
Figure 37: Spotfire: Model: Prep
Figure 38: Spotfire: Model: Training in Spark
Figure 39: Spotfire: Model: Evaluate Quality
Figure 40: Spotfire: Model: Variable Importance
Figure 41: Spotfire: Model: Custom Threshold
Figure 42: Spotfire: Deploy: Bundle Models into Campaigns
Figure 43: Spotfire: Model: Launch Your Campaigns

Table of Tables

Table 1: Accelerator for Apache Spark Components
Table 2: Event Processor Modules
Table 3: Spotfire and Spark components
Table 4: LVW and LDM components

1 Preface

1.1 Purpose of Document

This document addresses the dynamic aspects of the Accelerator for Apache Spark. It describes the applied solutions as designed, how they can be repeated in concrete customer projects, and how they are realized in the accelerator demo.

The Accelerator for Apache Spark addresses the growing market of Big Data analytics solutions with a strong focus on event processing (Fast Data). The accelerator's goal is to highlight the value TIBCO adds to the Big Data world. Big Data solutions already exist; what is missing is a way of getting value from Big Data analytics. It is possible to explore data, process it and build models. The challenge arises when the data is no longer static. Events flow through the system, and the goal of event processing is to capture them in a form optimal for analytics. Once the results from analytics are available, they should be converted into value. The accelerator covers the full cycle from event capture through analytics to predictive and prescriptive model execution against observations.

1.2 Scope

The document covers the following aspects:

- Scalable event capture and processing (Kafka and StreamBase)
- Event persistence in Big Data storage (Kafka, Flume, Spark)
  - minimal impact on the event processing layer
  - data processing efficiency
- Numerical model training in Big Data processing clusters (Spotfire, Spark, H2O)
- Model deployment to scaled-out event processors (Spotfire, ZooKeeper and StreamBase)
- Operational monitoring (LiveView DataMart and LiveView Web)
- Artificial data generation and injection

1.3 Referenced Documents

- Accelerator for Apache Spark Quick Start Guide
- Accelerator for Apache Spark Interface Specification

2 Architecture

2.1 Components

The accelerator architecture focuses on commonly applied open source Big Data products. The key solutions are:

- Kafka - extremely scalable message bus for Fast Data
- HDFS - de facto standard for Big Data storage

These two products have been combined with TIBCO products:

- StreamBase - event processing solution
- Spotfire - analytics platform

The gaps in the architecture have been filled with:

- HBase - for scalable event context storage
- Flume - for Fast Data to Big Data transition
- Spark - for data access and transformation
- H2O - for clustered model training and lightning-fast scoring
- ZooKeeper - for cluster coordination

Figure 1: Solution Component Diagram

Importantly, the accelerator is not limited to Big Data. The problem of getting value from analytics also exists in traditional applications.

Table 1: Accelerator for Apache Spark Components

Messaging Firehose (Apache Kafka): Highly scalable messaging bus. The core of a Fast Data system is a messaging bus capable of passing thousands of messages per second while remaining expandable. With Kafka it is possible to add nodes on demand to support more traffic.

Data Storage (Apache Hadoop HDFS): Big Data systems rely on efficient and reliable storage for enormous amounts of data. The Hadoop framework provides two components, one for data (HDFS) and one for programs (YARN).

Event Processor (TIBCO StreamBase): StreamBase is a CEP and ESP platform for event processing. It combines visual programming with high efficiency for reactive event handling. The component provides integration and event processing capabilities.

Data Analytics (TIBCO Spotfire): Spotfire is a data visualization and analytics platform. In the accelerator, the access patterns to the data stored in the cluster were evaluated. The accelerator also shows a sample flow for model building in the Big Data cluster and runtime model deployment.

Runtime Context Store (Apache HBase): NoSQL columnar database used with HDFS.

Data Writer (Apache Flume): Event persistence framework.

Data Access Service (Apache Spark): Big Data processing framework. Apache Spark is the current state-of-the-art solution for processing data in Big Data clusters. It offers much better throughput and latency than the original Hadoop Map-Reduce.

Model Training Engine (H2O): Cluster-oriented numerical modelling software. Traditional numerical modelling algorithms in R or NumPy/SciPy are implemented with a single-node architecture in mind; when the dataset significantly exceeds a single node's capacity, such algorithms need to be reimplemented. H2O is a successful attempt to train models at cluster scale, and it generates efficient real-time scoring models.

Simulation Publisher (StreamBase, Jython, Kafka): Simulation framework. The component injects messages into the system for demo purposes. The component comprises customer modelling and data injection parts.

Real-Time Dashboard (Live DataMart, LiveView Web, StreamBase): Visualization component presenting recent changes in the system in real time.

2.2 Event Processor Flows (Fast Data Story)

The Fast Data story focuses on the data flowing through the system. The operating data unit is the customer. The event processing layer captures new transactions, builds the customer history and prepares offers.

Table 2: Event Processor Modules

Kafka transaction binding (Event Processor): The integration binding to the messaging firehose. It contains an example of Kafka adapter usage and complex XML handling.

Context binding (Event Processor): Each transaction is processed in the scope of previous transactions executed by the same customer. The state is externalized to HBase.

Enrichment (Event Processor): The context contains only raw facts; in this case a list of transactions with just product ids. For model processing this information has to be enriched with category cross-referencing.

Transaction featurization (Event Processor): Before transactions can be processed by a model, the transaction and history must be converted into model input. The typical model input is a list of categorical or numerical values.

Model execution (Event Processor): The models are external deployable artifacts produced by the data analytics layer. The result of event processing in this case is a score for each deployed model.

Live DataMart binding (Event Processor): LiveView Web is provided as a real-time monitoring dashboard. The underlying Live DataMart is fed by the event processing component.

Flume binding (Event Processor): Binding for secure sending of the data to HDFS.

Offer acceptance tracking (Event Processor): Process of tracking the customer response.

Configuration notifications (Event Processor): Binding for the configuration changes provided by ZooKeeper.

2.3 Spotfire Components (Big Data Story)

The Big Data story uses a holistic view on the data. It aggregates customers and builds statistical models. The operating unit is the dataset.

Table 3: Spotfire and Spark components

ETL (Data Access Service, Data Analytics): Transformation from Avro to Parquet.

Data discovery (Data Analytics, Data Access Service): Access to the underlying tables for data discovery.

Model training (Data Analytics, Data Access Service): Model preparation and assessment.

Model deployment (Data Analytics): Model submission to the event processing layer.

2.4 LiveView Components (Operations Story)

LiveView shows the current state of the system. It presents the currently relevant information about running processes, which means it contains only a small fraction of the data, or heavily reduced information.

Table 4: LVW and LDM components

Transaction, TransactionItems (Real-Time Dashboard): Recent transactions with their content. The tables form a master-child structure.

ModelStatus (Real-Time Dashboard): Status of the deployed models.

StoreSummary (Real-Time Dashboard): Current store information. Includes static store information, like geographic position, and aggregate information derived from transactions.

3 Event Sequencing

3.1 Regular Event Flow

Figure 2: Regular event flow (sequence diagram spanning the Originator, Kafka, StreamBase, HBase, H2O, Flume, HDFS and Live DataMart)

The Fast Data story is an automated process. It focuses on transaction event processing. The sequence of events:

1. The originator publishes a transaction event (XML) to the Kafka Transactions topic
2. The StreamBase event flow retrieves the event
3. The customer's past transactions are retrieved from HBase
4. The past transactions are filtered by date (to limit to the recent transactions) and deduplicated
5. The built customer context is converted into features
6. The data is scored by the deployed H2O models
7. The results are filtered according to the model metadata (cut-off, validity dates and so on)
8. From the remaining results the winning one is selected
9. The transaction data with the scoring result is published to Kafka as JSON
   a. A Flume Source consumes batches of messages
   b. Once all messages are accepted by the agent's Channel, the batch is acknowledged
   c. The batches are delivered to the HDFS Sink
   d. Once the Sink flushes its buffers, it removes the data from the Channel
10. The result is published to the Kafka Notifications topic as XML
11. The message is delivered to the originator (it may or may not contain an offer)
12. The transaction is published to the LDM for runtime tracking
13. The past transactions of the current customer are scanned for pending offers
14. The pending offers with categories matching the incoming transaction are marked succeeded
15. The past transactions are scanned for outdated offers (based on the message timestamp)
16. The pending offers with a missed deadline are marked unsuccessful
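Steps 3 to 8 of the sequence above can be pictured as a short, technology-free Python sketch. The record layout, the 90-day window, and the model metadata fields used here are illustrative assumptions, not the accelerator's actual schemas:

```python
from datetime import date, timedelta

def select_offer(new_txn, history, models, today=date(2016, 8, 23)):
    """Score one transaction in the context of the customer's recent history."""
    # Step 4: keep only recent transactions, deduplicated by transaction id
    recent = {t["txn_id"]: t for t in history
              if today - t["date"] <= timedelta(days=90)}
    context = list(recent.values()) + [new_txn]
    # Step 5: featurize - here simply total spend per category (illustrative)
    features = {}
    for t in context:
        for cat in t["categories"]:
            features[cat] = features.get(cat, 0.0) + t["amount"]
    # Steps 6-7: score with every deployed model, filter by cutoff and validity dates
    candidates = []
    for m in models:
        score = m["score"](features)
        if m["valid_from"] <= today <= m["valid_to"] and score >= m["cutoff"]:
            candidates.append((score, m["offer"]))
    # Step 8: the winning offer is the highest-scoring candidate, or none at all
    return max(candidates)[1] if candidates else None
```

In the real flow the history comes from HBase, the features from the maintained feature list, and the scores from deployed H2O models; the sketch only shows how the steps compose.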

3.2 Data Processing Flow

Figure 3: Data processing activities (Collect data, ETL data, Discover data, Build models, Bundle models, Deploy models)

The Big Data story is a human-driven process. The focus here is exploration of the data stored in the Big Data cluster (HDFS + Spark). The process eventually produces models executed in the event processing layer. The high-level procedure:

1. The data is collected in HDFS as Avro
2. The ETL process turns many relatively small Avro files into Parquet
   a. transaction deduplication
   b. category info enrichment
   c. data flattening
3. The data scientist explores the data and proposes candidate algorithms (partially covered by the accelerator)
4. The data analyst builds the candidate algorithms and assesses their performance (for example using ROC curves). The accepted models are described and passed to operations
5. The system operator combines the models into bundles
6. The bundles are deployed to event processing

The side activities happening in the event processing layer are:

1. The events sent to Flume are accompanied with model evaluation data
2. The customer purchases are tracked for offer acceptance and sent in real time to the LDM
3. The offer acceptance and model efficiency can be transformed in the ETL process

3.3 Simulation

The traffic simulator is a StreamBase component generating a test transaction flow. The component publishes transaction messages using a configured transaction rate, reference data and a time compression factor. The module is also capable of simulating the customer response to the presented offers.

There are two implementations of the component. One implementation uses a real-time transaction generation model. This variation uses a stateful process that tracks a large number of customers and generates random transactions using reference data probability modifiers. The process tries to keep a uniform time distribution between subsequent transactions of a given customer.
The advantage is that real-time data generation may adapt customer behaviour to the system responses (offers).

The alternative implementation reads pregenerated data and sends messages. The data is stored in a flat tab-separated file. The file is ordered by timestamp and transaction id. The ordering guarantees that the lines of the same transaction are stored as a single block. The generator process builds a random customer history. A single iteration creates a customer with a demographic profile. For this customer a series of transactions is built. The transactions are written out as a flattened transaction item list.
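A minimal sketch of producing such a pregenerated file follows. The column layout used here (timestamp, transaction id, customer id, SKU) is an assumed example, not the accelerator's actual interface; the point is the ordering guarantee, which keeps each transaction's item lines contiguous:

```python
import csv
import io
import random

def write_pregenerated(out, n_customers=3, seed=42):
    """Emit flattened transaction items: one line per item, grouped per transaction."""
    rnd = random.Random(seed)
    rows = []
    ts = 1_000_000
    for cust in range(n_customers):
        for txn in range(rnd.randint(1, 3)):
            ts += rnd.randint(60, 3600)          # roughly uniform spacing between transactions
            txn_id = f"c{cust}-t{txn}"
            for _ in range(rnd.randint(1, 4)):   # 1..4 item lines per transaction
                rows.append((ts, txn_id, f"cust{cust}", f"sku{rnd.randint(1, 9)}"))
    # ordering by (timestamp, transaction id) keeps each transaction's lines as one block
    rows.sort(key=lambda r: (r[0], r[1]))
    w = csv.writer(out, delimiter="\t", lineterminator="\n")
    w.writerows(rows)

buf = io.StringIO()
write_pregenerated(buf)
```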

4 Event Processor - StreamBase

4.1 Core Logic

4.1.1 ProcessTransactionsAndScore

The event flow handles the main context-related logic.

Figure 4: ProcessTransactionAndScore

In the flow the transactions are processed for customer offering and for hot item (category) tracking. The ProcessTransaction module executes the logic related to customer offering. It loads the customer context, transforms it into a model-consumable feature set and performs the final interpretation of the model output. In this particular case the winning offer is selected. TrackCategories expands the incoming transaction into transaction lines with category info. Importantly, a single line may have 0 or many categories. The resulting categories are then processed as individual signals. The module also provides the external configuration wiring. The control submodules are responsible for maintenance of the reference tables and the deployed models.

The transactionsin input stream carries the raw transaction information passed from the originator. The capture group supports arbitrary external content to be passed transparently to the output streams. This feature is used to retain the Kafka consumption context information.

Figure 5: ProcessTransactionAndScore transactions

The notifications output stream emits the ultimate result of the processing logic. This result is used to send the offer to the customer. The events contain the input event's transport-related fields.

Figure 6: ProcessTransactionAndScore notifications

After all logic is executed, the messages are acknowledged to the input topic. With Kafka this means the last consumed offsets are saved in ZooKeeper. Because the acknowledgement protocol is transport-related and logic-independent, the acks events carry only transport information.

Figure 7: ProcessTransactionAndScore acks

Figure 8: ProcessTransactionAndScore acks

From the same structure as the notifications, the audit information is derived and published as transactionsout. The events are used to update the LDM tables and to store the transactions and evaluation results in HDFS for auditing and effectiveness tracking purposes.

Figure 9: ProcessTransactionAndScore transactionsout

The categories output stream emits category tracking tuples. These are later consumed for category performance checks and, per customer, to detect the offer responses.

Figure 10: ProcessTransactionAndScore categories

4.1.2 ProcessTransaction

This is the main workhorse of the CEP-style processing. The flow implements a stateful context for the customer's transactions. The past transactions are retrieved from a dedicated (pluggable) storage solution and the new transaction is appended to the ledger. All the transactions in the retrieved history are classified according to the product-to-category mapping. Subsequently the enriched customer context is converted into a feature vector, i.e. a data structure corresponding to the customer description in the applied modelling. The result is then processed by all currently deployed models.

Figure 11: ProcessTransaction

The Transactions input stream carries the essential information about the transaction. The flow in this module is responsible for information enrichment and adaptation.

Figure 12: ProcessTransaction Transactions

The Predictions output stream strips the locally collected state. It emits the original input information with the accepted model results.

Figure 13: ProcessTransaction Predictions

4.1.3 CategorizeTransactions

The flow iterates over the transactions and applies category resolution to each of them.

Figure 14: CategorizeTransactions

4.1.4 CategorizeTransaction (DefaultCategorizeTransaction)

The transaction categorization uses pluggable logic. In the applied case it uses a query table to load all the categories assigned to a product SKU.

Figure 15: DefaultCategorizeTransaction
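The category resolution above, together with the featurization described in the next section, amounts to a dictionary join followed by projection onto a fixed feature list. A hedged sketch, where the SKU-to-category table and the feature names are invented examples rather than the accelerator's reference data:

```python
# illustrative product-to-category mapping; a SKU may map to 0..many categories
CATEGORIES = {"sku1": ["dairy"], "sku2": ["dairy", "organic"], "sku9": []}

FEATURES = ["dairy", "organic", "bakery"]   # ordering must match model training

def categorize(lines):
    """Attach the category list to every transaction line (empty list if unmapped)."""
    return [dict(line, categories=CATEGORIES.get(line["sku"], [])) for line in lines]

def featurize(history):
    """Project the enriched history onto the fixed feature list used at training time."""
    counts = {}
    for line in history:
        for cat in line["categories"]:
            counts[cat] = counts.get(cat, 0) + 1
    return [counts.get(f, 0) for f in FEATURES]

vector = featurize(categorize([{"sku": "sku1"}, {"sku": "sku2"}, {"sku": "sku9"}]))
# vector now lines up positionally with the model's expected input
```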

4.1.5 FeaturizeTransactions

Context featurization is typically a complex task. The CEP context information (enriched by the known state and reference data) has to be converted into a structure that matches the one used to train the models. In many cases there is no perfect mapping between the static data used by a data scientist and the runtime state available during event processing. The featurization tries to build an input sample description as close as possible to the one used in the model training process.

Figure 16: FeaturizeTransactions

4.1.6 EvaluateModel (H2OEvaluateModel)

Once the incoming transaction is transformed into features, it can be processed by the models. In the accelerator the featurized transactions are processed by ultra-fast models generated with H2O. In the generic case there could even be several alternative models deployed at the same time for routed or broadcast execution.

Figure 17: H2OEvaluateModel

The logic in the flow is simple. The incoming features are adapted to the model operator interface.

4.2 Transport Binding

4.2.1 KafkaWiredProcessTransaction

The event processor core logic is related to transaction processing. The top-level event flow orchestrates the Kafka message exchange and exposes notification flows for other features.

Figure 18: KafkaWiredProcessTransaction

The transaction processor consumes messages from the Kafka bus. The transactions are evaluated using the core logic to obtain offers for the customer and to categorize the transaction items. The processing results are sent back to the caller as an offering.

KafkaWiredProcessTransaction is the top-level event flow orchestrating the transport binding and the actual logic execution. The event flow calls the main processing logic and passes the transport-related information as a capture group. This data is transparent to the underlying implementation, but it is required to properly send responses to the incoming messages and to commit the transactions. The event flow offers two output streams intended for synchronous event consumption:

- Transactions
- Categories

Figure 19: KafkaWiredProcessTransaction Transactions

The Transactions output stream emits the transaction information with the model evaluation results. The output stream is used to:

- update LDM tables
- report events to Flume
- track prepared offers

The Categories output stream captures the categorized transaction information. It emits a tuple for each transaction line.

Figure 20: KafkaWiredProcessTransaction Categories

The stream is consumed by:

- offer acceptance tracking
- hot categories tracking (currently not implemented)

4.2.2 KafkaConsumeTransactions

Kafka consumption has been simplified in this version of the accelerator. A single consumer handles all the partitions of the Transactions topic. The consumer is statically configured to connect to a known broker list.

At startup the flow is inactive. The subscription is opened once the coordination module decides that all the models and configuration settings have been read. This is done in order to avoid processing data with a partial configuration.

The process reads the topic metadata from ZooKeeper. Then for each partition it retrieves the recent consumption offset and activates the subscriber. The flow reads messages from the broker and, before emitting events for processing, interprets the opaque content:

- the XML payload is adapted to a StreamBase-compliant format and then to a tuple
- the header is parsed and exposed for transport-related handling

Figure 21: KafkaConsumeTransactions

4.2.3 KafkaProduceNotifications

Sending messages is much simpler than consuming them. The flow renders the payload XML StreamBase-style and transforms it to the schema defined in the interface specification. Then the message is sent out according to the data passed in the transport header provided by the consuming module.

Figure 22: KafkaProduceNotifications

4.2.4 KafkaAcknowledgeTransaction

Transaction acknowledgement in Kafka is simple. One has to save the last processed offset in a shared location, in this case in a ZooKeeper node.

Figure 23: KafkaAcknowledgeTransaction

4.3 Persistent Runtime State

In the accelerator the runtime state for the main transaction processing logic is maintained in HBase. This is a pluggable component and, as long as the contract is respected, HBase may be replaced with any technology. A similar feature can be implemented for example with TIBCO ActiveSpaces. The main advantage of HBase over ActiveSpaces is its durability focus. Also, the HBase API allows for a much lighter communication protocol and lower coupling between components.

4.3.1 HBaseCustomerHistory

In order to retrieve the customer's past transactions, a Get operation is executed. The operation is done with the MaxVersions attribute set to a high value; therefore all transactions stored by HBase are retrieved. It has been assumed that the solution should be duplicate-tolerant. There can be multiple entries for the same transaction, but the initial design states that the content for a given transaction id is always the same. This way it is enough to retrieve only the unique records. The lookup by primary key uses region server routing, therefore the operation scales linearly with the HBase cluster size.
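The duplicate-tolerant retrieval can be sketched with a plain Python list standing in for the HBase version log (the record layout is an illustrative assumption). Because entries for the same transaction id are assumed identical, keeping one record per id is sufficient:

```python
def unique_history(version_log):
    """Collapse an HBase-style version log to one record per transaction id."""
    seen = {}
    for record in version_log:
        # duplicates carry identical content, so the first occurrence wins
        seen.setdefault(record["txn_id"], record)
    return list(seen.values())

log = [
    {"txn_id": "t1", "amount": 10.0},
    {"txn_id": "t2", "amount": 4.0},
    {"txn_id": "t1", "amount": 10.0},   # duplicate write of the same transaction
]
history = unique_history(log)
```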

Figure 24: HBaseCustomerHistory

4.3.2 HBaseAddCustomerTransaction

The counterpart of past transaction retrieval is appending a transaction to the customer's history. In HBase this is simple: updating a field with version tracking is equivalent to appending an entry to the version log. The update by primary key uses region server routing. As with the lookup, the operation scales linearly with the HBase cluster size.

Figure 25: HBaseAddCustomerTransaction

4.4 Configuration Loading and Change Monitoring

The solution uses ZooKeeper to store the global configuration. ZooKeeper is a cluster-wide source of truth. It prevents the uncontrolled configuration corruption that may happen during a split-brain. All node changes are atomic, i.e. the consumers can see only full updates. An important characteristic of ZooKeeper is that a consumer always sees the last value, but may miss the intermediate ones. For global setting management this is perfectly acceptable.

In the solution the asynchronous API is used to retrieve the data. That means the process registers for change notifications and reads the value. If the node does not exist, it is treated as if it were empty. In this release the configuration is monitored using a separate connection for each monitored node.

4.4.1 MaintainCategories

The categories are kept in a file in HDFS. The file is pointed to by the content of a z-node. During startup, and whenever the z-node changes (even to the same content), the associated query table is cleared and filled with the content of the product catalogue.

Figure 26: MaintainCategories

4.4.2 MaintainFeatures

Features follow the same structure as the product-category mapping. The z-node points to the location in HDFS where the current feature list is defined. On startup, and whenever the observed node changes, the shared query table is cleared and refilled with the content.

Figure 27: MaintainFeatures

4.4.3 H2OMaintainModel

The model maintenance is realized in a slightly different way than the category mapping and the feature list. The z-node keeps a list of model sets, one file pointer per line. The observer process reads all the files and builds the metadata list. This list is then passed to H2OEvaluateModel, which updates the operator.

Figure 28: H2OMaintainModel

4.4.4 CoordinateStartup

The ZooKeeper observers are asynchronous. That means there is no guarantee that the system is fully configured during the init phase. In order to avoid processing messages with a partially configured solution, the subscription should be started only once the configuration has been applied. To achieve this we need coordination of the ready messages coming from independent parts. The process is connected to the maintenance flows via container connections. Once all three inputs report success, the ready state is released.

Figure 29: CoordinateStartup
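CoordinateStartup is essentially a barrier over three asynchronous readiness signals: the Kafka subscription opens only after every configuration source has loaded. A minimal sketch (the signal names are illustrative):

```python
class StartupGate:
    """Release the Kafka subscription only after every config source has loaded."""
    REQUIRED = {"categories", "features", "models"}

    def __init__(self):
        self.ready_parts = set()
        self.released = False

    def report_ready(self, part):
        if part not in self.REQUIRED:
            raise ValueError(f"unknown config source: {part}")
        self.ready_parts.add(part)
        if self.ready_parts == self.REQUIRED:
            self.released = True    # here the real flow would activate the subscriber
        return self.released

gate = StartupGate()
gate.report_ready("categories")
gate.report_ready("models")
ok = gate.report_ready("features")   # last source in: subscription may start
```

Because the ZooKeeper observers can fire in any order, the gate tracks a set rather than a sequence; a repeated signal from the same source is harmless.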

5 Data Analytics - Spotfire

TIBCO's Accelerator for Apache Spark meets a customer-service use case, where we want to understand our sales and to create models that we can later deploy in real time to send promotions for specific products to our customers. For this we run a classification model to identify customers who are likely to say "Yes" to an offer of a particular product. This type of model adapts to many other use cases, for example financial crime detection or prediction of machine failure, or in general any time you want to distinguish two types of records from each other. You can use this accelerator in those use cases as well.

The example file aims at aiding three different tasks, each made simple by easy-to-use controls. In the demonstration scenario all parts are handled by a single visualization; in real projects there will most likely be separate sites dedicated to the various tasks.

5.1 Discover Big Data

The first section is called Discover Big Data. It serves as an environment for answering business questions in a visual way, covering the needs of Big Data reporting and discovery. This section is composed of 6 pages. All aggregations are delegated to the Big Data cluster running Apache Spark.

5.1.1 Totals

The top of this page shows a preview of the data, a sample of 10K lines and their content. Such a preview can be useful for inspiring strategies for analysing the data. Below, we show some KPIs and how they evolve over time. By clicking on the X and Y axis selectors, the user can choose different KPIs.

Figure 30: Spotfire: Discover: Totals

5.1.2 Discover Big Data: Drill-down

Figure 31: Drill-down

This page proposes a drill into the data. There are four market segments in the data. When the user chooses some or all of them in the pie chart, a tree map appears that subdivides the respective total revenue by product. When one or many products are selected, a time series of the respective revenues appears below. The user can navigate the different dimensions of the data this way, or choose different dimensions in any of the visualisations.

5.1.3 Categories

To achieve a better understanding of the data, some more detail is required. Again as a way of discovering the shape of the data, a drill-down by product categories is offered here. At the top, a tree map shows the importance in sales and price of each product. The visualisations at the bottom show some KPIs, both current and over time. By default they encompass the whole business, but they respond to the choice of one or many products in the tree map.

Figure 32: Spotfire: Discover: Categories

5.1.4 Basket Analysis

Here, upon making a choice in the left-hand list, we get a tree map that shows the importance of all other products sold in the same baskets as the chosen product. This is a nice way of perceiving which products customers buy together, and it can help in understanding which variables should be included in models. The controls on the right allow choosing different metrics and dimensions.

Figure 33: Spotfire: Discover: Basket Analysis
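The aggregation behind this page can be sketched as follows (a simplified Python illustration with made-up products and revenues; in the accelerator the aggregation is delegated to Spark):

```python
from collections import Counter

def basket_co_occurrence(baskets, chosen):
    """For the chosen product, total the revenue of every other
    product sold in the baskets that contain it - the numbers the
    basket-analysis tree map visualizes."""
    totals = Counter()
    for basket in baskets:                    # basket: {product: revenue}
        if chosen in basket:
            for product, revenue in basket.items():
                if product != chosen:
                    totals[product] += revenue
    return totals

baskets = [
    {"wine": 12.0, "cheese": 4.0, "bread": 1.5},
    {"wine": 9.0, "cheese": 3.0},
    {"beer": 6.0, "bread": 1.5},
]
print(basket_co_occurrence(baskets, "wine"))
# cheese dominates the tree map for 'wine': it appears in both wine baskets
```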

5.1.5 Client Cross-Sell

This page helps us understand customer taste: what types of products do clients buy, regardless of whether in the same basket or not? Similar to the previous page, it shows the products that clients who bought the chosen product have also bought, whether in the same basket or at any moment in time. This is useful when designing cross-sell and up-sell campaigns.

Figure 34: Spotfire: Discover: Client CrossSell

5.1.6 Geos

Geospatial analysis is an important aspect of data processing. Spotfire allows users to display aggregated data in order to understand spatial relationships and geographical coverage. It is possible to locate the shops that sell given products best, analyse customer trends by region, and understand performance. This page shows how revenue and average price are distributed by shop and by region. It leverages Spotfire's ability to draw great maps.

Figure 35: Spotfire: Discover: Geos

5.1.7 Play-page

This page provides a custom playground for users. Load one month of data into memory; you can choose which month by using the prompt. Use the recommendation engine to pick the right visualisation to answer new business questions, then replicate the desired visualisation on in-database data. This page can be duplicated as many times as required.

Figure 36: Spotfire: Discover: Play Page

5.2 Model Big Data

The second section of the Spotfire component is called Model Big Data. It is made of 5 pages that support the business in the task of modelling Big Data. The Accelerator supports the Random Forest classification model, a type of model that is valid on any type of data and can therefore be run by a business person. The goal is to build models that support promotions of a particular product or group of products.

The accelerator applies the H2O DRF algorithm. H2O is particularly effective for the presented case because it can train models on Big Data scale datasets, integrates nicely with Spark, and produces extremely fast runtime models.

5.2.1 Preparation

Before the models can be trained, the user has to define the input data for the model. The model training algorithms expect the data in reduced form, as so-called features: every sample (in our case, a customer) is described by a uniform set of variables. The calculation of these variables is parameterized by user-selected settings. In the provided example, the customer is described by past purchases in each category, plus a response label that tells whether the customer made any purchase in the categories of interest in the following months.

The user decides which months to use for training the model and which months contain the response purchases. For training, we recommend taking enough months to encompass at least one full seasonal cycle, for example one full year. Very old data may be less relevant to current customer behaviour; if that is the case, one may not want to include much of it. For testing, at least 1 to 3 months of data should be used, preferably the most recent.

Figure 37: Spotfire: Model: Prep
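The feature layout described above can be sketched like this (a simplified Python illustration with invented categories and data; the real feature build runs as a Spark job):

```python
from collections import defaultdict

def build_training_set(transactions, train_months, response_months, target_categories):
    """Reduce raw transactions to one feature row per customer:
    past spend per category over the training months, plus a 0/1
    response label for purchases of the target categories in the
    response months."""
    spend = defaultdict(lambda: defaultdict(float))   # customer -> category -> spend
    label = defaultdict(int)                          # customer -> bought target later?
    for cust, month, category, amount in transactions:
        if month in train_months:
            spend[cust][category] += amount
        if month in response_months and category in target_categories:
            label[cust] = 1
    categories = sorted({c for _, _, c, _ in transactions})
    rows = []
    for cust in sorted(spend):
        features = [spend[cust][c] for c in categories]
        rows.append((cust, features, label[cust]))
    return categories, rows

txns = [
    ("c1", "2016-01", "beer", 10.0),
    ("c1", "2016-02", "wine", 5.0),
    ("c2", "2016-01", "wine", 7.0),
    ("c1", "2016-03", "wine", 12.0),   # purchase in the response period
]
cats, rows = build_training_set(txns, {"2016-01", "2016-02"}, {"2016-03"}, {"wine"})
print(cats)   # ['beer', 'wine']
print(rows)   # [('c1', [10.0, 5.0], 1), ('c2', [0.0, 7.0], 0)]
```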

5.2.2 Training

Once the model training reference data has been selected, the actual model training is executed. The user names the group of products to be modelled, uses the central list to select the products to make a promotion for, and launches the training job in the Spark cluster. The data defined by the user are collected and passed to the cluster for execution.

The actual process can be long. The user may check the job's progress on the Spark web UI and track the checkpoints in the dashboard. In the presented demonstration, the model training job produces POJOs (H2O real-time execution models) and collects the information provided by the H2O engine. When the process is finished and the job is done, the models are available for inspection and deployment. The user should then press the Refresh results button; when this button is pressed, Spotfire reaches out to Spark via TERR to obtain the latest results of the model-building exercise.

As the outcome of the training process the following datasets are created:

- results - model training job results; for each model training job there is a tab-separated text file containing an information line for each generated model
- pojo - generated H2O POJO classes; the directory contains a subdirectory for each model training job
- roc - directory storing the ROC points generated by H2O for each training job
- varimp - variable importance scores obtained from the model training jobs
- sets - directory containing model metadata as tab-separated files describing each deployable model and its parameters

These results are analysed in the following pages.

Figure 38: Spotfire: Model: Training in Spark

5.2.3 Model quality check

On the left-hand pane, the user chooses which model to evaluate: the current model or any past model. The choice populates the chart with the respective ROC curve.

Evaluating model quality involves seeing whether its results allow better decisions than a random choice, e.g. than tossing a coin. The model in the accelerator aims at separating one type of client from the remainder, namely the ones who may be interested in the chosen product. If, for any given client, we chose their type at random, the model's ROC curve (Receiver Operating Characteristic) would likely be close to the red line in the chart. If the model were perfect and got every decision right, its ROC curve would be given by the green line. The blue line gives the actual ROC curve of the chosen model. The total area below the blue line is called the AUC (Area Under the Curve) and measures how much better the current model is than a choice made at random (represented by the red line).

The left-hand table shows the AUC of all models, which gives the user an idea of how good models are expected to be. Models with a large enough AUC can be approved by the user. Previously approved models can also have their approval revoked on this page. All following pages show only approved models. It is important to bear in mind that approval of a model should not be final before the variable importance is analysed, which happens on the next page.

Figure 39: Spotfire: Model: Evaluate Quality

5.2.4 Variable importance

On the left-hand pane, the user chooses which model to continue analysing. Only previously approved models appear here. By default, the models use all available data to understand what drives the purchases of the modelled product. Some products are better drivers of a specific promotion than others.
The chart is used to understand the relationship between your products and customer preferences by identifying the most important predictors. Go back to your Discover Clients' Taste page to validate the discovery.
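The ranking behind the chart can be reproduced from the varimp scores (a Python sketch; the feature names and scaled importances below are hypothetical):

```python
def rank_features(varimp, top=3):
    """Sort features by importance, the way the variable-importance
    chart ranks predictors, and keep the strongest ones."""
    ranked = sorted(varimp.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]

# Hypothetical scaled importances from one model-training job.
varimp = {"spend_wine": 1.00, "spend_cheese": 0.62,
          "spend_beer": 0.07, "spend_soap": 0.01}
print(rank_features(varimp, top=2))
# [('spend_wine', 1.0), ('spend_cheese', 0.62)]
```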

This type of consideration is more important in some use cases than in others. In more sophisticated cases, the variable importance discovered using one model may be used to provide better training parameters for another model. In fact, a combination of visualizing the ranking of the features and the detail of the individual features is important for a number of reasons:

- Validation of the model's quality. Maybe your best feature is so good because it is part of the answer and should therefore be excluded.
- Validation of the data's quality. Were you expecting a different feature to have more power than it shows? Perhaps there are data quality issues causing a lack of relevance, or maybe outliers introduced a bias. These quality issues can be quickly spotted in a visualization, for example a histogram of the numeric features.
- Correlation is not causation. It is necessary to ask questions that lead to a better understanding of the reality being predicted.
- Surprising top features. Sometimes predictors expected to be irrelevant turn out to have huge predictive ability. This knowledge, when shared with the business, will inevitably lead to better decisions.
- Inspiration for new features. Sometimes the most informative features are a reason to delve into new related information as a source of other rich features.
- Computational efficiency. Features with very low predictive power can be removed from the model as long as the prediction accuracy on the test dataset stays high. This ensures a more lightweight model with a higher degree of freedom, better interpretability, and potentially faster calculations when applying it to current data, in batch or real time.

It is important to bear in mind that approval of a model should not be final before the variable importance page is analysed. If any issues are spotted, the user can revoke previously approved models.

Figure 40: Spotfire: Model: Variable Importance

5.2.5 Discrimination threshold selection

This page is entirely optional. When a model runs in real time, it calculates a measure of how likely a given customer is to say yes to a promotion of your specific product. To decide whether to send him or her a promotion, this metric is compared against a Threshold. This Threshold is defined by default to maximise the F1-score. The F1-score balances two desirable characteristics this type of model can have:

- Precision: of all the people the model would send a promotion to, what proportion accepts it;
- Recall: of all the people that would have said yes to a promotion, how many the model recognised.

F1 weighs these two dimensions equally. If you are happy with this choice, you can ignore this page. However, the user may have their own way of defining a desired Threshold and can use this page to set it. For example, they may want to maximise just precision or just recall, or to weigh them differently. Table 2a can be used to select other known model performance metrics. In 2b, one may select a Threshold manually. This is useful if it is important to control the proportion of customers identified as targets, for example when this must be weighed against the size of a team who will manually treat each case (e.g. telemarketing calls). The Proportion Selection (% of customer base) figure populates against this choice. In 2c, you may create your own model performance metric, for example by attributing a monetary cost to sending a promotion that is not converted and/or a monetary gain to a promotion that is converted. You can do this by typing your own formula in "here" on the Y-axis of the chart and then selecting the Threshold that maximises it. All the data needed for a custom calculation is available in the data that feeds the chart. In area 3, the user chooses the Threshold of choice and saves it by pressing Use.
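The default Threshold choice can be sketched as follows (a Python illustration of the F1 maximisation over candidate thresholds; the scores and labels are made up):

```python
def precision_recall_f1(scores, labels, threshold):
    """Confusion counts for one candidate threshold: a customer is
    targeted when the model score reaches the threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def best_f1_threshold(scores, labels):
    """The default behaviour described above: pick the threshold
    that maximises the F1-score over the observed scores."""
    return max(set(scores),
               key=lambda t: precision_recall_f1(scores, labels, t)[2])

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # model output per customer
labels = [1,   1,   0,   1,   0,   0  ]   # 1 = said yes to the promotion
t = best_f1_threshold(scores, labels)
print(t, precision_recall_f1(scores, labels, t))
```

Maximising just precision or just recall, as the page allows, amounts to swapping the key function in the same search.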

Figure 41: Spotfire: Model: Custom Threshold

5.3 Design and execute marketing campaigns

This final part is made of 2 pages that support the business in running marketing campaigns that deploy the models learnt in the previous sections. Each model targets one product. What is deployed to the event processing layer are model sets that we call campaigns, or marketing campaigns. Campaigns launch promotions for groups of products at once by bundling models and their respective thresholds together.

5.3.1 Campaign bundling

The produced models can be composed together into a co-deployed bundle. Here you can bundle existing models into a new campaign and name your campaign. Alternatively, you can load a past campaign and revise it, by adding new models or thresholds to it or by removing past models.

Sections 1 and 2 of this page require user action, whilst the remainder just provides information. In Section 1, the user either creates a new campaign, which takes the name he/she gives it just below, or loads an existing campaign for analysis by choosing one from table a) to the immediate right. The models that are part of the new or existing campaign appear in table b) on the right-hand middle section of the page. The user can then use Section 2 to change the models that are part of the campaign, either by adding new models collected from table c) or by deleting existing models from the current campaign. When done, the user can save the new settings of the campaign. The Refresh available model list button at the bottom ensures that all recently run models appear in list c).

Figure 42: Spotfire: Deploy: Bundle Models into Campaigns
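A campaign bundle can be sketched as a named list of (model, threshold) pairs serialized to a tab-separated metadata file, in the spirit of the files kept in the sets directory (a Python illustration; the column names and model identifiers are assumptions, not the accelerator's exact format):

```python
import csv
import io

def bundle_campaign(name, models):
    """Serialize a campaign - a named bundle of (model id, threshold)
    pairs - as tab-separated metadata."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["campaign", "model", "threshold"])
    for model_id, threshold in models:
        writer.writerow([name, model_id, threshold])
    return out.getvalue()

campaign = bundle_campaign("spring-promo",
                           [("wine-drf-20160801", 0.4),
                            ("beer-drf-20160802", 0.55)])
print(campaign)
```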

5.3.2 Campaign deployment

This page connects you to the real-time event processing engine. Here you can see the names of the campaigns that are currently running in real time and inspect their underlying models and thresholds. You can also launch a new campaign.

The left-hand side of this page allows user action, whilst the right presents the resulting information. The button Which campaigns are currently running in real time? shows the names of the campaigns that are running now; these names appear as StreamBase sees them. The button Refresh list of available campaigns updates table a) so that it includes all past campaigns, including the ones that have just been created. When the user chooses a campaign from this table, table b) shows the models that are part of it. Finally, the button Deploy the new selected campaign can be pressed to launch a new campaign in real time.

Figure 43: Spotfire: Model: Launch Your Campaigns

More information

Data Management Glossary

Data Management Glossary Data Management Glossary A Access path: The route through a system by which data is found, accessed and retrieved Agile methodology: An approach to software development which takes incremental, iterative

More information

Wired Network Summary Data Overview

Wired Network Summary Data Overview Wired Network Summary Data Overview Cisco Prime Infrastructure 3.1 Job Aid Copyright Page THE SPECIFICATIONS AND INFORMATION REGARDING THE PRODUCTS IN THIS MANUAL ARE SUBJECT TO CHANGE WITHOUT NOTICE.

More information

Practical Machine Learning Agenda

Practical Machine Learning Agenda Practical Machine Learning Agenda Starting From Log Management Moving To Machine Learning PunchPlatform team Thales Challenges Thanks 1 Starting From Log Management 2 Starting From Log Management Data

More information

Accounts Payable Workflow Guide. Version 14.6

Accounts Payable Workflow Guide. Version 14.6 Accounts Payable Workflow Guide Version 14.6 Copyright Information Copyright 2017 Informa Software. All Rights Reserved. No part of this publication may be reproduced, transmitted, transcribed, stored

More information

Security and Performance advances with Oracle Big Data SQL

Security and Performance advances with Oracle Big Data SQL Security and Performance advances with Oracle Big Data SQL Jean-Pierre Dijcks Oracle Redwood Shores, CA, USA Key Words SQL, Oracle, Database, Analytics, Object Store, Files, Big Data, Big Data SQL, Hadoop,

More information

vcenter Operations Manager for Horizon View Administration

vcenter Operations Manager for Horizon View Administration vcenter Operations Manager for Horizon View Administration vcenter Operations Manager for Horizon View 1.5 vcenter Operations Manager for Horizon View 1.5.1 This document supports the version of each product

More information

One Identity Active Roles 7.2. Replication: Best Practices and Troubleshooting Guide

One Identity Active Roles 7.2. Replication: Best Practices and Troubleshooting Guide One Identity Active Roles 7.2 Replication: Best Practices and Troubleshooting Copyright 2017 One Identity LLC. ALL RIGHTS RESERVED. This guide contains proprietary information protected by copyright. The

More information

Architectural challenges for building a low latency, scalable multi-tenant data warehouse

Architectural challenges for building a low latency, scalable multi-tenant data warehouse Architectural challenges for building a low latency, scalable multi-tenant data warehouse Mataprasad Agrawal Solutions Architect, Services CTO 2017 Persistent Systems Ltd. All rights reserved. Our analytics

More information

Oracle Endeca Information Discovery

Oracle Endeca Information Discovery Oracle Endeca Information Discovery Glossary Version 2.4.0 November 2012 Copyright and disclaimer Copyright 2003, 2013, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered

More information

Toad Data Point - Professional Edition. The Toad Data Point Professional edition includes the following new features and enhancements.

Toad Data Point - Professional Edition. The Toad Data Point Professional edition includes the following new features and enhancements. Toad Data Point 4.2 New in This Release Thursday, April 13, 2017 Contents Toad Data Point Professional Edition Toad Data Point - Base and Professional Editions Toad Data Point - Professional Edition The

More information

Talend Big Data Sandbox. Big Data Insights Cookbook

Talend Big Data Sandbox. Big Data Insights Cookbook Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is

More information

Apache Flink. Alessandro Margara

Apache Flink. Alessandro Margara Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate

More information

SAP Edge Services, cloud edition Edge Services Predictive Analytics Service Guide Version 1803

SAP Edge Services, cloud edition Edge Services Predictive Analytics Service Guide Version 1803 SAP Edge Services, cloud edition Edge Services Predictive Analytics Service Guide Version 1803 Table of Contents MACHINE LEARNING AND PREDICTIVE ANALYTICS... 3 Model Trained with R and Exported as PMML...

More information

PTC Windchill Quality Solutions Extension for ThingWorx Guide

PTC Windchill Quality Solutions Extension for ThingWorx Guide PTC Windchill Quality Solutions Extension for ThingWorx Guide Copyright 2016 PTC Inc. and/or Its Subsidiary Companies. All Rights Reserved. User and training guides and related documentation from PTC Inc.

More information

Data Acquisition. The reference Big Data stack

Data Acquisition. The reference Big Data stack Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The reference

More information

HYPERION SYSTEM 9 PERFORMANCE SCORECARD

HYPERION SYSTEM 9 PERFORMANCE SCORECARD HYPERION SYSTEM 9 PERFORMANCE SCORECARD RELEASE 9.2 NEW FEATURES Welcome to Hyperion System 9 Performance Scorecard, Release 9.2. This document describes the new or modified features in this release. C

More information

WHITE PAPER. Reference Guide for Deploying and Configuring Apache Kafka

WHITE PAPER. Reference Guide for Deploying and Configuring Apache Kafka WHITE PAPER Reference Guide for Deploying and Configuring Apache Kafka Revised: 02/2015 Table of Content 1. Introduction 3 2. Apache Kafka Technology Overview 3 3. Common Use Cases for Kafka 4 4. Deploying

More information

Hortonworks DataFlow Sam Lachterman Solutions Engineer

Hortonworks DataFlow Sam Lachterman Solutions Engineer Hortonworks DataFlow Sam Lachterman Solutions Engineer 1 Hortonworks Inc. 2011 2017. All Rights Reserved Disclaimer This document may contain product features and technology directions that are under development,

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

Quest Migration Manager for Exchange Resource Kit User Guide

Quest Migration Manager for Exchange Resource Kit User Guide Quest Migration Manager for Exchange 8.14 Resource Kit User Guide 2017 Quest Software Inc. ALL RIGHTS RESERVED. This guide contains proprietary information protected by copyright. The software described

More information

TIBCO BusinessEvents Extreme. System Sizing Guide. Software Release Published May 27, 2012

TIBCO BusinessEvents Extreme. System Sizing Guide. Software Release Published May 27, 2012 TIBCO BusinessEvents Extreme System Sizing Guide Software Release 1.0.0 Published May 27, 2012 Important Information SOME TIBCO SOFTWARE EMBEDS OR BUNDLES OTHER TIBCO SOFTWARE. USE OF SUCH EMBEDDED OR

More information

Provisioning an Ethernet Private Line (EPL) Virtual Connection

Provisioning an Ethernet Private Line (EPL) Virtual Connection Provisioning an Ethernet Private Line (EPL) Virtual Connection Cisco EPN Manager 2.0 Job Aid Copyright Page THE SPECIFICATIONS AND INFORMATION REGARDING THE PRODUCTS IN THIS MANUAL ARE SUBJECT TO CHANGE

More information

Logi Info v12.5 WHAT S NEW

Logi Info v12.5 WHAT S NEW Logi Info v12.5 WHAT S NEW Introduction Logi empowers companies to embed analytics into the fabric of their organizations and products enabling anyone to analyze data, share insights, and make informed

More information

Qlik Sense Enterprise architecture and scalability

Qlik Sense Enterprise architecture and scalability White Paper Qlik Sense Enterprise architecture and scalability June, 2017 qlik.com Platform Qlik Sense is an analytics platform powered by an associative, in-memory analytics engine. Based on users selections,

More information

Spotfire: Brisbane Breakfast & Learn. Thursday, 9 November 2017

Spotfire: Brisbane Breakfast & Learn. Thursday, 9 November 2017 Spotfire: Brisbane Breakfast & Learn Thursday, 9 November 2017 CONFIDENTIALITY The following information is confidential information of TIBCO Software Inc. Use, duplication, transmission, or republication

More information

PeopleSoft 9.1 PeopleBook: Events and Notifications Framework

PeopleSoft 9.1 PeopleBook: Events and Notifications Framework PeopleSoft 9.1 PeopleBook: Events and Notifications Framework March 2012 PeopleSoft 9.1 PeopleBook: Events and Notifications Framework SKU hcm91fp2eewh-b0312 Copyright 1988, 2012, Oracle and/or its affiliates.

More information

Comprehensive Guide to Evaluating Event Stream Processing Engines

Comprehensive Guide to Evaluating Event Stream Processing Engines Comprehensive Guide to Evaluating Event Stream Processing Engines i Copyright 2006 Coral8, Inc. All rights reserved worldwide. Worldwide Headquarters: Coral8, Inc. 82 Pioneer Way, Suite 106 Mountain View,

More information

Panopticon Designer, Server & Streams Release Notes. Version 17.0

Panopticon Designer, Server & Streams Release Notes. Version 17.0 Panopticon Designer, Server & Streams Release Notes Version 17.0 Datawatch Corporation makes no representation or warranties with respect to the contents of this manual or the associated software and especially

More information

DataCollect Administrative Tools Supporting DataCollect (CMDT 3900) Version 3.0.0

DataCollect Administrative Tools Supporting DataCollect (CMDT 3900) Version 3.0.0 Administrator Manual DataCollect Administrative Tools Supporting DataCollect (CMDT 3900) Version 3.0.0 P/N 15V-090-00054-100 Revision A SKF is a registered trademark of the SKF Group. All other trademarks

More information

UMP Alert Engine. Status. Requirements

UMP Alert Engine. Status. Requirements UMP Alert Engine Status Requirements Goal Terms Proposed Design High Level Diagram Alert Engine Topology Stream Receiver Stream Router Policy Evaluator Alert Publisher Alert Topology Detail Diagram Alert

More information

Validating Service Provisioning

Validating Service Provisioning Validating Service Provisioning Cisco EPN Manager 2.1 Job Aid Copyright Page THE SPECIFICATIONS AND INFORMATION REGARDING THE PRODUCTS IN THIS MANUAL ARE SUBJECT TO CHANGE WITHOUT NOTICE. ALL STATEMENTS,

More information

Qualys Cloud Suite 2.28

Qualys Cloud Suite 2.28 Qualys Cloud Suite 2.28 We re excited to tell you about improvements and enhancements in Qualys Cloud Suite 2.28. AssetView ThreatPROTECT View Policy Compliance Summary in Asset Details Export Dashboards

More information

TIBCO Spotfire Analytics Investments

TIBCO Spotfire Analytics Investments TIBCO Spotfire Analytics Investments Smart Visual Analytics Be first to insight and first to action Analytics Apps at Scale Build and broadcast smart analytics Inline Data Wrangling Immerse yourself in

More information

Analytics Driven, Simple, Accurate and Actionable Cyber Security Solution CYBER ANALYTICS

Analytics Driven, Simple, Accurate and Actionable Cyber Security Solution CYBER ANALYTICS Analytics Driven, Simple, Accurate and Actionable Cyber Security Solution CYBER ANALYTICS Overview Cyberattacks are increasingly getting more frequent, more sophisticated and more widespread than ever

More information

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous

More information

Toad Intelligence Central 3.3 New in This Release

Toad Intelligence Central 3.3 New in This Release Toad Intelligence Central 3.3 New in This Release Tuesday, March 28, 2017 This release of Toad Intelligence Central includes the following new features and enhancements. Toad Data Point Enter Variable

More information

FAQs. Business (CIP 2.2) AWS Market Place Troubleshooting and FAQ Guide

FAQs. Business (CIP 2.2) AWS Market Place Troubleshooting and FAQ Guide FAQs 1. What is the browser compatibility for logging into the TCS Connected Intelligence Data Lake for Business Portal? Please check whether you are using Mozilla Firefox 18 or above and Google Chrome

More information

What s New in Spotfire DXP 1.1. Spotfire Product Management January 2007

What s New in Spotfire DXP 1.1. Spotfire Product Management January 2007 What s New in Spotfire DXP 1.1 Spotfire Product Management January 2007 Spotfire DXP Version 1.1 This document highlights the new capabilities planned for release in version 1.1 of Spotfire DXP. In this

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Paper SAS Taming the Rule. Charlotte Crain, Chris Upton, SAS Institute Inc.

Paper SAS Taming the Rule. Charlotte Crain, Chris Upton, SAS Institute Inc. ABSTRACT Paper SAS2620-2016 Taming the Rule Charlotte Crain, Chris Upton, SAS Institute Inc. When business rules are deployed and executed--whether a rule is fired or not if the rule-fire outcomes are

More information

Big-Data Pipeline on ONTAP and Orchestration with Robin Cloud Platform

Big-Data Pipeline on ONTAP and Orchestration with Robin Cloud Platform Technical Report Big-Data Pipeline on ONTAP and Orchestration with Robin Cloud Platform Ranga Sankar, Jayakumar Chendamarai, Aaron Carter, David Bellizzi, NetApp July 2018 TR-4706 Abstract This document

More information

One Identity Starling Identity Analytics & Risk Intelligence. User Guide

One Identity Starling Identity Analytics & Risk Intelligence. User Guide One Identity Starling Identity Analytics & Risk Intelligence User Guide Copyright 2019 One Identity LLC. ALL RIGHTS RESERVED. This guide contains proprietary information protected by copyright. The software

More information

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Table of contents Faster Visualizations from Data Warehouses 3 The Plan 4 The Criteria 4 Learning

More information

CA GovernanceMinder. CA IdentityMinder Integration Guide

CA GovernanceMinder. CA IdentityMinder Integration Guide CA GovernanceMinder CA IdentityMinder Integration Guide 12.6.00 This Documentation, which includes embedded help systems and electronically distributed materials, (hereinafter referred to as the Documentation

More information

Enabling Data Governance Leveraging Critical Data Elements

Enabling Data Governance Leveraging Critical Data Elements Adaptive Presentation at DAMA-NYC October 19 th, 2017 Enabling Data Governance Leveraging Critical Data Elements Jeff Goins, President, Jeff.goins@adaptive.com James Cerrato, Chief, Product Evangelist,

More information

Decision Manager Help. Version 7.1.7

Decision Manager Help. Version 7.1.7 Version 7.1.7 This document describes products and services of Pegasystems Inc. It may contain trade secrets and proprietary information. The document and product are protected by copyright and distributed

More information

TIBCO Spotfire Automation Services

TIBCO Spotfire Automation Services Software Release 7.11 LTS November 2017 Two-Second Advantage 2 Important Information SOME TIBCO SOFTWARE EMBEDS OR BUNDLES OTHER TIBCO SOFTWARE. USE OF SUCH EMBEDDED OR BUNDLED TIBCO SOFTWARE IS SOLELY

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Service Manager. Database Configuration Guide

Service Manager. Database Configuration Guide Service Manager powered by HEAT Database Configuration Guide 2017.2.1 Copyright Notice This document contains the confidential information and/or proprietary property of Ivanti, Inc. and its affiliates

More information

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's Building Agile and Resilient Schema Transformations using Apache Kafka and ESB's Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's Ricardo Ferreira

More information

ActiveSpaces Transactions. Quick Start Guide. Software Release Published May 25, 2015

ActiveSpaces Transactions. Quick Start Guide. Software Release Published May 25, 2015 ActiveSpaces Transactions Quick Start Guide Software Release 2.5.0 Published May 25, 2015 Important Information SOME TIBCO SOFTWARE EMBEDS OR BUNDLES OTHER TIBCO SOFTWARE. USE OF SUCH EMBEDDED OR BUNDLED

More information