XES Tensorflow Process Prediction using the Tensorflow Deep-Learning Framework
Demo Paper

Joerg Evermann (1), Jana-Rebecca Rehse (2,3), and Peter Fettke (2,3)
(1) Memorial University of Newfoundland
(2) German Research Center for Artificial Intelligence
(3) Saarland University

Abstract. Predicting the next activity of a running process is an important aspect of process management. Recently, artificial neural networks, so-called deep-learning approaches, have been proposed to address this challenge. This demo paper describes a software application that applies the Tensorflow deep-learning framework to process prediction. The application reads industry-standard XES files for training and presents the user with an easy-to-use graphical user interface for both training and prediction. The system provides several improvements over earlier work. This demo paper focuses on the software implementation and describes its architecture and user interface.

Keywords: Process management, Process intelligence, Process prediction, Deep learning, Neural networks

1 Introduction

The prediction of the future development of a running case, given information about past cases and the current state of the case, i.e. predicting trace suffixes from trace prefixes, is a core problem in business process intelligence (BPI). Recently, deep learning [4, 5] with artificial neural networks has become an important predictive method, due to innovations in neural network architectures, the availability of high-performance GPU and cluster-based systems, and the open-sourcing of multiple software frameworks at a high level of abstraction. Because of the sequential nature of process traces, recurrent neural networks (RNNs) are a natural fit to the problem of process prediction. An initial application of RNNs to BPI [1] used the executing activity of each trace event as both predictor and predicted variable. Evaluation on the BPI 2012 and 2013 challenge datasets showed training performance of up to 90%, but that study did not perform cross-validation. A more systematic study [2], including cross-validation, showed significant overfitting of the earlier results, i.e. the neural network capitalizes on idiosyncrasies of the training sample that do not generalize. The later work also showed that the size of the neural network has a significant effect on predictive performance. Further, predictive performance can be improved when

organizational data is included as a predictor. Both [2] and [1] use approaches in which each word (e.g. the name of the executing activity of each event) is embedded in a k-dimensional vector space (details in Sec. 4.2 below), which forms the input to the neural network. In contrast, an independent, parallel effort [6] eschews the use of embeddings, encoding event information as numbered categories in one-hot form. That approach also adds time-of-day and day-of-week as additional predictors. Both approaches compare the predicted suffix to the target suffix by means of a string edit distance, showing similar performance. Additionally, [2] demonstrates that predicted suffixes are similar to targets by showing similar replay fitness on a model mined from the target traces.

The software application described here extends the previous approaches in a number of ways. Primarily, it is more general and flexible with respect to the case and event attributes that can be used as predictor or predicted variables. Whereas [2, 6] use only the activity name and lifecycle transition of an event, and [2] also includes the resource information of an event as a predictor, our application can use any case- or event-level attribute as a predictor. We can also predict any, and multiple, event attributes of subsequent events. These advantages are due to differences in encoding. Whereas [2] concatenates activity name, lifecycle transition and resource information into a single string, which is then assigned a category number and embedded in a vector space for input to the neural network, our approach assigns category numbers to each attribute separately, then embeds them into their own vector spaces and concatenates the embedding vectors to form the input vector (the sketch at the end of this section contrasts the two encodings). Details are presented in Sec. 4.2. The advantage is much smaller vocabularies (sets of unique values), which can be adequately represented by much smaller embedding vectors. This encoding also allows easy mixing of categorical predictors, with their embedded representations, and numerical predictors, which are passed directly to the neural net, simply by concatenating these inputs. Additionally, our approach can predict multiple variables, for example the activity as well as the resource information of the next event, using either shared or separate RNN layers for each predicted attribute. Finally, the software tool presented in this demo paper includes a graphical user interface that guides novice users and avoids the need for specialized coding; it provides integration with a graphical dashboard, and a stand-alone prediction application with an easy-to-use prediction API.

As a demo paper, this paper focuses on the software implementation, architecture and user interface, with only a short exposition of the neural network background. In particular, we describe the input and output handling of the neural network (Sections 4.2, 4.4), as this is where our approach differs from [2, 6]. More details on neural networks in general can be found in [4, 5] and, applied to process prediction, in [2, 6]. Our software is open-source and available from the authors (http://joerg.evermann.ca/software.html).
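To illustrate the difference between the two encodings, the short sketch below numbers categories for a few hypothetical attribute values (not taken from any real log): the joint encoding of [2] yields one large vocabulary, whereas per-attribute encoding yields several small ones.

```python
# Hypothetical event attributes: 4 activities x 3 lifecycle transitions x 5 resources.
activities = ["A", "B", "C", "D"]
lifecycles = ["start", "complete", "abort"]
resources = ["r1", "r2", "r3", "r4", "r5"]

# Joint encoding (as in [2]): one category per unique concatenated string.
joint_vocab = {"+".join((a, l, r)): idx
               for idx, (a, l, r) in enumerate(
                   (a, l, r) for a in activities for l in lifecycles for r in resources)}
print(len(joint_vocab))                   # 60 categories -> one large embedding table

# Per-attribute encoding (this tool): one small vocabulary per attribute.
vocabs = {"activity": {v: i for i, v in enumerate(activities)},   # 4 categories
          "lifecycle": {v: i for i, v in enumerate(lifecycles)},  # 3 categories
          "resource": {v: i for i, v in enumerate(resources)}}    # 5 categories
print([len(v) for v in vocabs.values()])  # [4, 3, 5] -> three small embedding tables
```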

2 Recurrent Neural Networks Overview

A recurrent neural network (RNN) is one in which network cells can maintain information by feeding their output back to themselves, often using a form of long short-term memory (LSTM) cells [3]. To make this feedback tractable within the context of backpropagation, network cells are copied, or unrolled, for a number of steps. Fig. 1 shows an example RNN with LSTM cells; detailed descriptions can be found in [2, 6].

Fig. 1. An RNN with 5 unrolled steps, 2 layers, batch size b and state size k

3 Tensorflow

Our software implementation is based on the open-source Tensorflow deep-learning framework (https://www.tensorflow.org). Tensors are generalizations of matrices to more than two dimensions, i.e. n-dimensional arrays. A Tensorflow application builds a computational graph using tensor operations. A loss function is a graph node that is to be minimized. It compares computed outputs to target outputs in some way, for example as cross-entropy for categorical variables, or root mean squared error for numeric variables. Tensorflow computes gradients of all tensors in the graph with respect to the loss function and provides various gradient descent optimizers. Training proceeds by iteratively feeding the Tensorflow computational graph with inputs and targets and optimizing the loss function using back-propagated errors.
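The following minimal sketch shows the graph/loss/optimizer pattern described above. It uses the Tensorflow 1.x-style graph API (via tf.compat.v1) rather than the Tensorflow 0.12 release that the tool itself targets, and the single dense layer and all shapes are purely illustrative; it is not the tool's actual code.

```python
import tensorflow.compat.v1 as tf  # 1.x-style graph API; the tool itself targets TF 0.12
tf.disable_eager_execution()

# Placeholders for inputs and one-hot targets (shapes are illustrative).
inputs = tf.placeholder(tf.float32, shape=[None, 32], name="inputs")
targets = tf.placeholder(tf.float32, shape=[None, 10], name="targets")

# A single dense layer producing logits over 10 categories.
weights = tf.Variable(tf.truncated_normal([32, 10], stddev=0.1))
bias = tf.Variable(tf.zeros([10]))
logits = tf.matmul(inputs, weights) + bias

# Loss node: cross-entropy between predicted and target distributions.
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=targets, logits=logits))

# Tensorflow differentiates the graph and applies a gradient descent optimizer.
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Each training step would feed a batch of encoded inputs and targets, e.g.:
    # sess.run(train_op, feed_dict={inputs: batch_x, targets: batch_y})
```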

4 Training

Our software system consists of two separate applications, one for training a deep-learning model and one for predicting from a trained model. This section describes the training application.

4.1 XES Parser

Training data is read from XES log files [7], beginning with the global attribute and event classifier definitions. Using these, the traces and their events are read. While the XES standard allows traces and events to omit globally declared attributes, it does not allow specification of default values for missing attributes. Hence, the XES parser omits any incomplete or empty traces. String-typed attributes are treated as categorical variables. Their unique values (categories) are numbered consecutively, encoding each as an integer in $0 \ldots l_i - 1$, where $l_i$ is the number of unique values for attribute $i$. Datetime-typed attributes are converted to relative time differences from the previous event. The user can choose whether to standardize them or to scale them to days, hours, minutes or seconds for meaningful loss functions. Numerical attributes are standardized. For multi-attribute event classifiers, the parser constructs joint attributes by concatenating the string-typed attributes or by multiplying the numerically-typed attributes specified in the classifier definition. End-of-case markers are inserted after each trace.

4.2 Inputs

RNN training proceeds in epochs. In each epoch, the entire training dataset is used. Before each epoch, the state of the RNN is reset to zero, while trained network parameter values are retained. Within each epoch, training proceeds in batches of size $b$ with averaged gradients, to avoid overly large fluctuations of the gradients. The batch size $b$ can be freely chosen by the user. For each unrolled step $s$, the RNN accepts a floating-point tensor $I_s \in \mathbb{R}^{b \times p}$, where $b$ is the batch size and $p$ can be freely chosen. Numerical and datetime predictors are encoded directly, each yielding a floating-point tensor $I_{s,i} \in \mathbb{R}^{b \times 1}$ for predictor attribute $i$. Categorical attributes, encoded as integer category numbers, are transformed using an embedding matrix $E_i \in \mathbb{R}^{l_i \times k_i}$. This is a lookup matrix of size $l_i \times k_i$, where $l_i$ is the number of categories for predictor attribute $i$ and $k_i$ can be freely chosen. Embedding lookup transforms an integer category number $j_{s,i}$ into a floating-point vector of size $k_i$. The larger the value of $k_i$, the better the attribute values are separated in the $k_i$-dimensional space; at the same time, larger $k_i$ lead to increased computational demands. The output of the embedding lookup $E(\cdot)$ is a tensor $I_{s,i} = E(j_{s,i}) \in \mathbb{R}^{b \times k_i}$. The tensors for all predictor attributes are concatenated, yielding a tensor $C_s \in \mathbb{R}^{b \times m}$, where $m$ is the sum of the second dimensions of the concatenated tensors. An input projection $P^I_s \in \mathbb{R}^{m \times p}$ can be applied, so that the input to each unrolled step of the RNN is $I_s = C_s P^I_s$.
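The sketch below illustrates this input encoding for one unrolled step: an embedding lookup per categorical attribute, concatenation with directly encoded numeric predictors, and an optional input projection. It again assumes the Tensorflow 1.x-style API; the attribute names and dimensions (two categorical attributes plus one relative-time predictor) are illustrative assumptions, not the tool's actual code.

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

b, p = 32, 64            # batch size and RNN input size (both user-chosen)
l_act, k_act = 20, 5     # e.g. 20 activity categories embedded in 5 dimensions
l_res, k_res = 50, 8     # e.g. 50 resource categories embedded in 8 dimensions

# Integer category numbers for one unrolled step s (one row per trace in the batch).
act_ids = tf.placeholder(tf.int32, shape=[b], name="activity_ids")
res_ids = tf.placeholder(tf.int32, shape=[b], name="resource_ids")
# A directly encoded numeric/datetime predictor, e.g. relative time since the previous event.
rel_time = tf.placeholder(tf.float32, shape=[b, 1], name="relative_time")

# One embedding matrix E_i of size l_i x k_i per categorical attribute.
E_act = tf.Variable(tf.random_uniform([l_act, k_act], -1.0, 1.0))
E_res = tf.Variable(tf.random_uniform([l_res, k_res], -1.0, 1.0))

# Embedding lookup turns integer categories into k_i-dimensional vectors, which are
# concatenated with the numeric predictors: C_s in R^{b x m}, m = k_act + k_res + 1.
C_s = tf.concat([tf.nn.embedding_lookup(E_act, act_ids),
                 tf.nn.embedding_lookup(E_res, res_ids),
                 rel_time], axis=1)

# Optional input projection P^I: C_s (b x m) -> I_s (b x p), fed to the unrolled RNN step.
P_I = tf.Variable(tf.random_uniform([k_act + k_res + 1, p], -1.0, 1.0))
I_s = tf.matmul(C_s, P_I)
```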

4.3 Model

In our approach, the user can train a model on multiple predicted event attributes ("target variables") concurrently, for example predicting the activity as well as the resource of the next event. For this, the user can choose to share the RNN layers across the different target variables, or to construct a separate RNN for each target variable. The embeddings are shared in all cases.

4.4 Outputs

The output of the RNN for each unrolled step is a tensor $O_s \in \mathbb{R}^{b \times p}$. For a categorical predicted variable $i$, this is multiplied by an output projection $P^O_{s,i} \in \mathbb{R}^{p \times l_i}$ to yield $O_{s,i} = O_s P^O_{s,i} \in \mathbb{R}^{b \times l_i}$. A softmax function is applied to generate probabilities over the $l_i$ different categories, $S_{s,i} = \mathrm{softmax}_{l_i}(O_{s,i}) \in \mathbb{R}^{b \times l_i}$. A cross-entropy loss function $L_i = H_{S_{s,i}}(T_{s,i})$ is then applied, comparing the probabilities against one-hot encoded targets $T_{s,i}$. A one-hot encoding is a vector in which the element indicating the target category number is set to one and the remaining elements are set to zero. For numerical attributes, the output $O_s$ is multiplied by a projection $P^O_{s,i} \in \mathbb{R}^{p \times 1}$, yielding $O_{s,i} = O_s P^O_{s,i} \in \mathbb{R}^{b \times 1}$. This is compared to the target values $T_{s,i}$ using the mean square error (MSE) $L_i = (O_{s,i} - T_{s,i})^2$, the root mean square error (RMSE) $L_i = \sqrt{(O_{s,i} - T_{s,i})^2}$, or the mean absolute error (MAE) $L_i = |O_{s,i} - T_{s,i}|$ loss function.
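A corresponding sketch of the output side, under the same Tensorflow 1.x-style API assumption: an output projection with softmax cross-entropy loss for one categorical target, and a projection with MSE, RMSE or MAE for one numeric target. The dimensions are illustrative and this is not the tool's actual code.

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

b, p, l_i = 32, 64, 20    # batch size, RNN output size, categories of target attribute i

O_s = tf.placeholder(tf.float32, shape=[b, p], name="rnn_output")         # O_s
T_cat = tf.placeholder(tf.float32, shape=[b, l_i], name="onehot_target")  # one-hot T_{s,i}
T_num = tf.placeholder(tf.float32, shape=[b, 1], name="numeric_target")

# Categorical target: output projection P^O (p x l_i), softmax, cross-entropy loss.
P_O_cat = tf.Variable(tf.random_uniform([p, l_i], -1.0, 1.0))
logits = tf.matmul(O_s, P_O_cat)                           # O_{s,i} in R^{b x l_i}
loss_cat = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=T_cat, logits=logits))

# Numeric target: projection P^O (p x 1) and a choice of MSE, RMSE or MAE.
P_O_num = tf.Variable(tf.random_uniform([p, 1], -1.0, 1.0))
pred_num = tf.matmul(O_s, P_O_num)                         # O_{s,i} in R^{b x 1}
mse = tf.reduce_mean(tf.square(pred_num - T_num))
rmse = tf.sqrt(mse)
mae = tf.reduce_mean(tf.abs(pred_num - T_num))
```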

4.5 Logging and TensorBoard Integration

Our software logs summary information about the proportion of correct predictions and the loss function value for each training step. The embedding matrices for all categorical predictor variables are saved at the end of training, together with the computational graph. This information can be read by the Tensorboard tool (Fig. 2) to visualize the training performance and the graph structure, and to analyze the embedding matrices using t-SNE or principal components projection into two or three dimensions. Finally, the entire trained network is saved in a frozen, compacted form to be loaded into the prediction application.

Fig. 2. Tensorboard dashboard showing training performance for 10 folds

5 Prediction

The prediction application loads a trained model, saved by the training application, as well as the corresponding training configuration, and predicts trace suffixes from trace prefixes. Trace prefixes are read from XES files. Because the trained model expects batches of size $b$, at most $b$ trace prefixes are loaded at a time. The network state is initialized to zero and the trace prefixes are fed to the trained network, encoded as described in Sec. 4.2. The network outputs for the last element of a trace prefix are the predictions for the attributes of the next event. For categorical attributes, the integer indicating the category number is translated back to the character string value. For datetime-typed attributes, the attribute value is computed by adding the predicted value to the attribute value of the prior event. The predicted event is then added to the trace prefix and the prediction process is repeated. In this way, suffixes of arbitrary length can be predicted. The user can stop prediction when an end-of-case marker has been predicted, or after a specified number of events have been predicted. The predicted suffixes are written back to an XES file.
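The iterative suffix prediction described above can be summarized by the following sketch. The function predict_next_event is a hypothetical stand-in for encoding a prefix and running the frozen Tensorflow graph; it is not the tool's actual prediction API, and the XES attribute keys and the end-of-case marker value are used for illustration only.

```python
def predict_suffix(predict_next_event, prefix, end_marker="__END__", max_events=100):
    """Extend a trace prefix event by event until an end-of-case marker
    is predicted or max_events events have been generated."""
    suffix = []
    trace = list(prefix)                      # copy; each event is a dict of attributes
    for _ in range(max_events):
        event = predict_next_event(trace)     # predicted attributes of the next event
        if event.get("concept:name") == end_marker:
            break
        # Datetime attributes are predicted as deltas relative to the previous event.
        if "time:delta" in event:
            event["time:timestamp"] = trace[-1]["time:timestamp"] + event.pop("time:delta")
        trace.append(event)                   # feed the predicted event back in
        suffix.append(event)
    return suffix
```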

6 Software Implementation

Our software is implemented in Python 2.7 and uses Tensorflow 0.12 and Tkinter for the user interface. Figure 3 shows the main screen, which guides the user through the selection of an XES file and the configuration of the RNN and training parameters, to the training of the model.

Fig. 3. Main screen of the training application

Figure 4 shows the main configuration screen, with sections for multi-attribute classifiers, global event and case attributes, RNN and training parameters, and a choice of optimizer. Any classifier, global event attribute or case attribute may be included as a predictor, and classifiers and global event attributes may be chosen as predicted attributes (targets). The user can specify embedding dimensions for categorical attributes; the default values are the square root of the number of categories. The RNN configuration allows the user to specify batch size, number of unrolled steps, number of RNN layers, number of training epochs, etc. Users can specify an optional input projection and the desired size of the RNN state vector. Finally, users have a choice of the different gradient descent optimizers offered by Tensorflow and can adjust their hyperparameters. Configurations are automatically saved and the user can load saved configurations (a hypothetical example is sketched after Fig. 4).

Fig. 4. Configuration screen. The user can choose multiple predictors and targets, set general training parameters, and choose an optimizer.
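A hypothetical example of such a saved configuration is sketched below; the parameter names and the dictionary format are illustrative assumptions and not the tool's actual configuration file, but the values follow the defaults described above (embedding dimension equal to the square root of the number of categories).

```python
import math

# Hypothetical vocabulary sizes for two categorical predictors.
num_categories = {"concept:name": 25, "org:resource": 144}

config = {
    "predictors": ["concept:name", "org:resource", "time:timestamp"],
    "targets": ["concept:name"],
    # Default embedding dimension: square root of the number of categories.
    "embedding_dims": {a: int(round(math.sqrt(n))) for a, n in num_categories.items()},
    "batch_size": 32,
    "unrolled_steps": 20,
    "rnn_layers": 2,
    "rnn_state_size": 64,
    "epochs": 50,
    "optimizer": {"name": "Adam", "learning_rate": 0.001},
}
print(config["embedding_dims"])   # {'concept:name': 5, 'org:resource': 12}
```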

Figure 5 shows the dialog indicating training progress. The current operation is shown, as well as the training rate (events/second) and the global learning rate of the gradient descent optimizer. For each predicted attribute, the latest training performance (proportion of correct predictions and loss function value) is shown.

Fig. 5. Progress screen of the training process

Our focus is on providing a research tool for experimentation, rather than a production tool. Therefore, we have not made use of distributed Tensorflow or Tensorflow Serving. Tensorflow automatically allocates the compute operations to the available CPUs and GPUs on a single machine. This provides adequate performance for the small size of typical event logs (megabytes rather than terabytes). We use our own prediction application with a simple API.

7 Conclusion

We presented a flexible deep-learning software application to predict business processes from industry-standard XES event logs. The software provides an easy-to-use graphical user interface for configuring predictors, targets, and parameters of the deep-learning prediction method. We have performed an initial validation to verify the correct operation of the tool: using the BPIC 2012 and 2013 datasets with the model and training parameters reported in [2], the tool replicates the training results of [2]. Current work with this software tool is ongoing and focuses on using different combinations of predictors and targets, made possible by our flexible approach to handling predictors and constructing RNN inputs, to improve upon the state-of-the-art prediction performance.

References

1. Evermann, J., Rehse, J.R., Fettke, P.: A deep learning approach for predicting process behaviour at runtime. In: PRAISE Workshop at the 14th International Conference on BPM (2016)
2. Evermann, J., Rehse, J.R., Fettke, P.: Predicting process behaviour using deep learning. Decision Support Systems (2017), in press
3. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735-1780 (1997)
4. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436-444 (2015)
5. Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks 61, 85-117 (2015)
6. Tax, N., Verenich, I., Rosa, M.L., Dumas, M.: Predictive business process monitoring with LSTM neural networks. In: Proceedings of the Conference on Advanced Information Systems Engineering (CAiSE), Essen, Germany. Springer (2017)
7. XES Working Group: IEEE standard for eXtensible Event Stream (XES) for achieving interoperability in event logs and event streams. IEEE Std 1849-2016 (2016)