Mining User Steps: An Innovative Approach to Faster Crash Resolution

Similar documents
Data Mining Part 3. Associations Rules

Association Rule Mining. Introduction 46. Study core 46

Mining Frequent Patterns without Candidate Generation

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

Automated Test Execution and Reporting(ATER) Pluggable Solution using JIRA

This paper proposes: Mining Frequent Patterns without Candidate Generation

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

Chapter 4: Association analysis:

Frequent Itemsets Melange

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Improved Frequent Pattern Mining Algorithm with Indexing

Data Mining for Knowledge Management. Association Rules

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Association Rule Mining

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm?

DATA MINING II - 1DL460

Association Rule Mining: FP-Growth

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

CS570 Introduction to Data Mining

Mining of Web Server Logs using Extended Apriori Algorithm

Adaption of Fast Modified Frequent Pattern Growth approach for frequent item sets mining in Telecommunication Industry

Comparison of FP tree and Apriori Algorithm

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

STUDY ON FREQUENT PATTEREN GROWTH ALGORITHM WITHOUT CANDIDATE KEY GENERATION IN DATABASES

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

CLOSET+:Searching for the Best Strategies for Mining Frequent Closed Itemsets

Analyzing Working of FP-Growth Algorithm for Frequent Pattern Mining

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Tutorial on Association Rule Mining

Appropriate Item Partition for Improving the Mining Performance

Implementation of Data Mining for Vehicle Theft Detection using Android Application

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database

Association mining rules

An Improved Apriori Algorithm for Association Rules

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

A Comparative Study of Association Rules Mining Algorithms

CSCI6405 Project - Association rules mining

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Performance Based Study of Association Rule Algorithms On Voter DB

A Modern Search Technique for Frequent Itemset using FP Tree

ALGORITHM FOR MINING TIME VARYING FREQUENT ITEMSETS

Association Rules. A. Bellaachia Page: 1

Induction of Association Rules: Apriori Implementation

Association Rule Mining

Association Rule Mining from XML Data

Improving the Efficiency of Web Usage Mining Using K-Apriori and FP-Growth Algorithm

An Efficient Algorithm for finding high utility itemsets from online sell

Data Mining Techniques

Searching frequent itemsets by clustering data: towards a parallel approach using MapReduce

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Discovery of Frequent Itemset and Promising Frequent Itemset Using Incremental Association Rule Mining Over Stream Data Mining

Enhanced SWASP Algorithm for Mining Associated Patterns from Wireless Sensor Networks Dataset

Memory issues in frequent itemset mining

Survey: Efficent tree based structure for mining frequent pattern from transactional databases

Mining Association Rules in Large Databases

Gurpreet Kaur 1, Naveen Aggarwal 2 1,2

Market baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged.

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

FREQUENT ITEMSET MINING USING PFP-GROWTH VIA SMART SPLITTING

FP-Growth algorithm in Data Compression frequent patterns

Ascending Frequency Ordered Prefix-tree: Efficient Mining of Frequent Patterns

Comparing the Performance of Frequent Itemsets Mining Algorithms

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Chapter 7: Frequent Itemsets and Association Rules

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Finding the boundaries of attributes domains of quantitative association rules using abstraction- A Dynamic Approach

An Algorithm for Mining Large Sequences in Databases

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

Research and Improvement of Apriori Algorithm Based on Hadoop

Improved Algorithm for Frequent Item sets Mining Based on Apriori and FP-Tree

Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Mining High Average-Utility Itemsets

Nesnelerin İnternetinde Veri Analizi

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Integration of Candidate Hash Trees in Concurrent Processing of Frequent Itemset Queries Using Apriori

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Mining Quantitative Association Rules on Overlapped Intervals

Mining Frequent Patterns with Counting Inference at Multiple Levels

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Available online at ScienceDirect. Procedia Computer Science 45 (2015 )

An Automated Support Threshold Based on Apriori Algorithm for Frequent Itemsets

An Approximate Scheme to Mine Frequent Patterns over Data Streams

Parallelizing Frequent Itemset Mining with FP-Trees

Performance Analysis of Data Mining Algorithms

A Taxonomy of Classical Frequent Item set Mining Algorithms

Chapter 6: Association Rules

Mining Distributed Frequent Itemset with Hadoop

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC

Frequent Pattern Mining S L I D E S B Y : S H R E E J A S W A L

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

An Improved Algorithm for Mining Association Rules Using Multiple Support Values

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

Associating Terms with Text Categories

Temporal Weighted Association Rule Mining for Classification


Mining User Steps: An Innovative Approach to Faster Crash Resolution

Tanvi Dharmarha, Quality Engineering Manager
Banani Ghosh, Software Engineer
Rupak Chakraborty, Member of Technical Staff
Adobe Systems

Abstract

Software crashes are the most severe manifestation of software bugs. Despite many best practices and quality assurance techniques, crashes do happen in the field. As complexity grows, the number of crashes increases and it becomes difficult for testers and developers to track and fix them. Crashes are often intermittent, and the chances of reproducing them are low because testers do not know the exact sequence of steps that led to the crash. As a result, most crashes go unfixed.

At Adobe, our crash logs typically carry a stack trace and the user steps leading up to the crash. While stack traces help in bucketizing crashes by module and offset, the sequence of steps provided by the testing team speeds up crash resolution significantly. Some crashes occur purely because of configuration issues, so analyzing the user steps may spare testers a code-debugging exercise altogether. However, with thousands of crash logs carrying similar stack traces and user steps, developers and testers can work through them only over a few weeks. That is not fast enough when you have thousands of frustrated users. We propose a solution that distills the user steps from thousands of crash logs of similar crashes down to the exact few steps leading to the crash. By applying a Reverse Analysis System across the available user steps, the most likely set of user steps for reproducing a set of similar crashes, despite different workflows, can easily be predicted.

Keywords: mining, user steps, FP-tree, software crash

Goal of Presentation:
1. User steps: introduction, importance, and usage in crash logs
2. Mining for recurrent sequences of user steps across all crashes of a unique stack trace, to identify meaningful flows leading to the crash

Introduction

Today the world is deeply digital. Software companies across the world compete to deliver new technologies to their customers ahead of schedule. To achieve this, development and testing teams go beyond the call of duty to deliver quality products. Developers write high-level managed code, while testers try out every possible scenario to deliver bug-free software to customers. But in a world of billions of users, it is not possible to predict, acquire, or create the usage environments of so many different and diverse minds. A few scenarios therefore go untested, leading to crashes in the delivered software.

To analyze those crashes, the Crash Reporting System at Adobe, as at other companies, collects data such as stack traces, crash types, trends, version information, platforms, and even the user steps performed in the entire session from the launch of the software until the crash occurred. The testing team collects all of this data and passes it on to the developers, who then start with code analysis to understand the problem causing the crash. Sometimes the code works perfectly and all possible cases are handled for a feature or workflow, yet the crash occurred because of some configuration, or as a side effect of a different workflow or feature. In such a situation, with several frustrated users awaiting the fix, jumping straight to the code only adds to the debugging time. Even so, among all the information collected from client machines, user steps are one of the least used reports during the crash analysis process [1][2].

We expect user steps to help speed up crash debugging for the following reasons:
1. A user may hit the crash within a few steps, or the crash may happen only after using the application for a long time and executing tens of steps.
2. A crash is usually intermittent, so a user may face it only after performing the same steps multiple times.
3. The availability of a large inventory of user steps from different user environments suffices to back the correctness of the predicted set of user steps.

This inventory of user steps is therefore useful for crash debugging, but crawling across hundreds of crash logs and manually digging out the reason for a crash is not just time consuming but nearly infeasible. Instead, we can use this inventory to lower the effort of both the testing and development teams by providing a system that narrows down the user steps most likely to reproduce the crash scenario, even before the crashes are resolved with symbol files, which is a time-consuming process. For example, resolving a crash with symbol files in software like Microsoft's WinDbg takes approximately 2 to 3 hours and can go up to 12-14 hours when the volume of crashes is high.

Solution

Since all crashes are bucketized by the module and offset obtained from the crash dumps and by crash type (as shown in fig 1), the user steps attached to all crashes grouped on a stack trace should contain similar steps executed just before, or very near, the crash (as shown in fig 2).

(fig 1. Screenshot of bucket and screenshot of unique stack trace)

(fig 2. Screenshot of User Steps logs)

I. Reverse Analysis Algorithm (RAA)

By applying a Reverse Analysis Algorithm on the user steps available for a set of similar crashes, we can narrow down to the few steps that are most likely to reproduce the crash. The algorithm proceeds as follows (as shown in fig 3); a minimal sketch of the traversal follows this list:

1. Select the crash log file (File 2) with the minimum number of user steps for the given unique stack trace.
2. Initialize the variable match_counter to a predefined number of user steps that should reproduce the crash.
3. If the number of user steps is greater than match_counter, assign that number to match_counter.
4. Reverse traversal: iterate over each user step of File 2 in reverse order, since the last steps are the most likely ones at which the crash occurred. For every step in File 2, the system runs a match against the user steps in the other crash logs (File 1, File 3, ..., File n).
   a. If a match is found, the user step is saved in the final user step list US_Final and match_counter is decremented by 1.
   b. Otherwise the step is discarded and the previous step (the one performed before it) is considered.
5. If a step exists in 80-90% of the user step logs of each crash, save the steps in their order of occurrence.
   a. As observed in fig 2, users hit the crash while using the Brush Tool at the end; but, looking carefully, in File 1 the Brush Tool is used several times without a crash.
   b. Traversing File 2 and File 3 in reverse order, the steps in common are operations with the Layer tool.
   c. The other steps do not match, so we discard them. We can therefore conclude that using the Layer tool followed by the Brush Tool should reproduce this crash.
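The following is a minimal sketch of the reverse traversal described above, under stated assumptions: each crash log is an ordered list of user-step strings, and a hypothetical helper steps_match (approximated here with difflib, as in the helper code later in this section) decides whether two logged steps refer to the same action. Steps 4a and 5 are merged for brevity; this is an illustration, not the production implementation.

from difflib import SequenceMatcher

def steps_match(step_a, step_b, threshold=0.7):
    # Hypothetical helper: treat two logged steps as the same action when
    # their textual similarity clears a threshold.
    return SequenceMatcher(None, step_a, step_b).ratio() >= threshold

def reverse_analysis(crash_logs, match_counter=5, presence_ratio=0.8):
    # crash_logs: the crash logs for one unique stack trace,
    # each an ordered list of user-step strings.
    base = min(crash_logs, key=len)                   # the log with the fewest steps
    others = [log for log in crash_logs if log is not base]
    if len(base) > match_counter:                     # step 3 as described above
        match_counter = len(base)
    us_final = []
    for step in reversed(base):                       # step 4: reverse traversal
        if match_counter == 0:
            break
        hits = sum(any(steps_match(step, other_step) for other_step in other)
                   for other in others)
        # Steps 4a/5 merged: keep the step only if ~80-90% of the other logs contain it.
        if others and hits / len(others) >= presence_ratio:
            us_final.append(step)
            match_counter -= 1
    return list(reversed(us_final))                   # US_Final in execution order

Called as reverse_analysis([file1_steps, file2_steps, file3_steps]), it returns the candidate reproduction steps in execution order.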

(fig 3. Reverse Analysis Algorithm)

from difflib import SequenceMatcher

def get_similarity_ratio(s1, s2):
    # Textual similarity between two user steps, in [0, 1].
    seq = SequenceMatcher(None, a=s1, b=s2)
    return seq.ratio()

def get_user_steps_file_specified_length(user_step_file_dict, min_length):
    # Keep only the crash logs whose user-step sequence has the given length.
    filtered_user_step_file_dict = {}
    for user_steps in user_step_file_dict.keys():
        if len(user_steps) == min_length:
            filtered_user_step_file_dict[user_steps] = user_step_file_dict[user_steps]
    return filtered_user_step_file_dict

def filter_overlapped_steps(user_step_iterable, similarity_threshold=0.7):
    # Drop steps that are near-duplicates of one another.
    # (Completion assumed: keep a step only if it is not too similar to one already kept.)
    filtered_pattern_list = set()
    for first_string in user_step_iterable:
        for second_string in filtered_pattern_list:
            if get_similarity_ratio(first_string, second_string) >= similarity_threshold:
                break
        else:
            filtered_pattern_list.add(first_string)
    return filtered_pattern_list

The Reverse Analysis Algorithm provides faster crash resolution than forward traversal. But with a plethora of crash logs having large differences in the number of steps, applying RAA can become costly. To take the approach further, we deep dived into mining the extracted user steps: a new set of user steps, based on the frequently occurring ones, is extracted for each unique stack trace, which in turn lets us predict user steps at the bucket level. In data mining, the task of finding frequent patterns in large databases is computationally expensive, especially in our case, where a large number of patterns exist.
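A brief, hypothetical usage of the helpers above; the dictionary contents are invented for illustration (keys are the ordered user-step tuples, values are crash-log identifiers, matching how get_user_steps_file_specified_length reads the dictionary).

# Hypothetical crash-log dictionary for one unique stack trace.
user_step_file_dict = {
    ("Open File", "Select Layer", "Use Brush Tool"): "crash_101.log",
    ("Launch App", "Select Layer", "Use Brush Tool"): "crash_102.log",
    ("Select Layer", "Use Brush Tool"): "crash_103.log",
}

min_length = min(len(steps) for steps in user_step_file_dict)
shortest_logs = get_user_steps_file_specified_length(user_step_file_dict, min_length)

print(shortest_logs)                                              # the log(s) with the fewest steps
print(get_similarity_ratio("Use Brush Tool", "Used Brush Tool"))  # similarity in [0, 1]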

The RAA is further extended with efficient data mining algorithms for mining the complete set of frequent patterns:

Apriori algorithm
Frequent Pattern Mining algorithm

Both algorithms are based on association rule mining: given a set of transactions (in our case, user steps), the goal is to find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

II. Dataset

Dataset: a collection of one or more items, e.g. {Step 1, Step 2, ..., Step N}
k-dataset: a dataset that contains k items
Support count (σ): frequency of occurrence of a dataset, e.g. σ({Step 1, Step 2, ..., Step N}) = 2
Support: fraction of transactions that contain a dataset, e.g. sup({Step 1, Step 2, ..., Step N}) = 2/5
Frequent dataset: a dataset whose support is greater than or equal to a minsupport threshold

III. Association Rule Mining Principles

Association rule: an implication expression of the form X → Y, where X and Y are datasets.
Rule evaluation metrics (a small worked computation of these metrics appears after section IV):
Support (s): fraction of transactions that contain both X and Y
Confidence (c): measures how often items in Y appear in transactions that contain X

IV. Bottlenecks with the Apriori Algorithm

As the dimensionality of the database increases with different pattern sets, driven by the growing number of crashes and unique stack traces in a bucket:
More search space is needed, increasing I/O operations.
The number of database scans increases, similar to RAA, so candidate generation raises the computational cost.
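To make the support and confidence metrics of section III concrete, here is a small worked computation on a toy set of user-step transactions. The transactions and the rule are invented purely for illustration; they are not taken from real crash logs.

# Toy transactions: each is one user's session, treated as a set of steps.
transactions = [
    {"Open File", "Select Layer", "Use Brush Tool"},
    {"Open File", "Use Brush Tool"},
    {"Open File", "Select Layer", "Save As"},
    {"Select Layer", "Use Brush Tool"},
    {"Open File", "Select Layer", "Use Brush Tool", "Save As"},
]

X = {"Select Layer"}
Y = {"Use Brush Tool"}

both = sum(1 for t in transactions if X | Y <= t)   # transactions containing X and Y
only_x = sum(1 for t in transactions if X <= t)     # transactions containing X

support = both / len(transactions)    # s(X -> Y) = 3/5 = 0.6
confidence = both / only_x            # c(X -> Y) = 3/4 = 0.75
print(support, confidence)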

V. Frequent Pattern Mining Algorithm

The Frequent Pattern Mining algorithm (FP-growth) proposed by Han [5], on the other hand, proved to be an efficient and scalable mining method. It allows frequent dataset discovery without candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy [6]. The core of this method is a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information. It is a two-step approach.

Step 1: Build a compact data structure called the FP-tree.

The FP-tree is constructed using two passes over the dataset.

Pass 1: Scan the data and find the support of each item. Discard infrequent items, then sort the frequent items in decreasing order of support (a minimal sketch of this pass follows the code below). With a minimum support count of 2, scanning the database for frequent 1-datasets gives:
s(A) = 8, s(B) = 7, s(C) = 5, s(D) = 5, s(E) = 3
Item order (decreasing support): A, B, C, D, E
This order is used when building the FP-tree, so that common prefixes can be shared.

(fig 4. Create User Steps Transaction List)

import pandas as pd

def get_transaction_list(dataframe):
    # Each column holds the ordered user steps logged for one crash id;
    # df below is the user-step dataframe for a unique stack trace, loaded elsewhere.
    crash_id_columns = dataframe.columns
    dataframe.fillna(value="", inplace=True)
    transaction_list = []
    for column in crash_id_columns:
        step_list = list(dataframe[column].values)
        step_list = [step for step in step_list if step != ""]
        # remove_step_numbers_using_regex (defined elsewhere in our tooling)
        # strips the leading step numbers from each logged step.
        step_list = [remove_step_numbers_using_regex(step) for step in step_list]
        transaction_list.append(step_list)
    return transaction_list

user_step_list = get_transaction_list(df)
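A minimal sketch of Pass 1, assuming user_step_list is the transaction list produced above; the function name is ours, chosen for illustration, and the minimum support count of 2 matches the example.

from collections import Counter

def frequent_items_in_order(transactions, min_support_count=2):
    # Count the support of every individual user step (1-dataset).
    counts = Counter(step for transaction in transactions for step in transaction)
    # Discard infrequent items and sort the rest by decreasing support,
    # which gives the item order used to build the FP-tree.
    frequent = {step: c for step, c in counts.items() if c >= min_support_count}
    return sorted(frequent, key=frequent.get, reverse=True)

# item_order = frequent_items_in_order(user_step_list, min_support_count=2)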

Pass 2: Nodes correspond to items and have a counter.
1. FP-growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (i.e. have the same prefix); in that case the counters are incremented (a sketch of this insertion logic appears after fig 5).
Pointers are maintained between nodes containing the same item, creating singly linked lists (the dotted lines). The more paths overlap, the higher the compression, so the FP-tree may fit in memory. Frequent datasets are then extracted from the FP-tree.

import pyfpgrowth as fp

support = 2
confidence = 0.7
patterns = fp.find_frequent_patterns(user_step_list, support)

(fig 5. FP-Tree construction)
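A minimal sketch of the node structure and insertion logic described above, showing how shared prefixes reuse nodes and increment counters; the class and field names are ours, not taken from a particular library.

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # user step held by this node (None for the root)
        self.count = 0            # number of transactions passing through this node
        self.parent = parent
        self.children = {}        # item -> child FPNode
        self.node_link = None     # next node holding the same item (the dotted lines)

def insert_transaction(root, transaction, header_table):
    # transaction must already be sorted in the fixed decreasing-support order
    # (A, B, C, D, E in the example above).
    node = root
    for item in transaction:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            # Prepend the new node to the header table's linked list for this item.
            child.node_link = header_table.get(item)
            header_table[item] = child
        child.count += 1          # overlapping paths just increment the counter
        node = child

# root, header_table = FPNode(None), {}
# for transaction in sorted_transactions:
#     insert_transaction(root, transaction, header_table)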

Step 2: Frequent Dataset Generation

(fig 6. Complete FP-Tree for sample transactions)

a. Frequent datasets are extracted directly from the FP-tree.
b. It is a bottom-up algorithm, working from the leaves towards the root.
c. Divide and conquer: first look for frequent datasets ending in E, then DE, etc., then D, then CD, etc.
d. First, extract the prefix path sub-trees ending in an item(set), using the linked lists.

filtered_patterns = filter_overlapped_steps(patterns.keys(), similarity_threshold=0.7)
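The earlier pyfpgrowth snippet defines a confidence threshold that is not yet used. As a complementary, hedged usage sketch (assuming patterns is the dictionary returned by find_frequent_patterns above), the same library can turn the mined step patterns into association rules:

import pyfpgrowth as fp

# patterns: {frequent user-step tuple: support count}, from find_frequent_patterns above.
rules = fp.generate_association_rules(patterns, confidence)

# pyfpgrowth maps each antecedent step tuple to a (consequent tuple, confidence) pair,
# i.e. which steps tend to accompany a given prefix of steps.
for antecedent, (consequent, rule_confidence) in rules.items():
    print(antecedent, "->", consequent, "confidence:", rule_confidence)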

(fig 7. Prefix Path subtrees)

VI. Conditional FP-Tree to Predict the New User Step Set

Let minsupport = 2 and let us extract all frequent datasets containing the user step E:
Obtain the prefix path sub-tree for E.
Check whether E is a frequent item by adding the counts along its linked list (dotted line); if so, extract it (a minimal sketch of this counting appears after fig 9).
o Yes, the count is 3, so {E} is extracted as a frequent dataset.
o Since E is frequent, find the frequent datasets ending in E, i.e. DE, CE, BE and AE. The E nodes can now be removed.

(fig 8. Prefix Path sub-tree for User Step E)

A conditional FP-tree is the FP-tree that would be built if we considered only the transactions containing a particular dataset (and then removed that dataset from all transactions). Here, the FP-tree is conditional on E.

Sub-trees for both CDE and BDE are empty: there are no prefix paths ending with C or B.

(fig 9. FP-Tree conditional on User Step E)
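Before working through the remaining suffixes, here is a minimal sketch of the counting just performed for E: summing the counts along E's node-link to test whether E is frequent, then building its conditional pattern base. The prefix paths and counts below are hypothetical stand-ins for the toy example, since the full transaction table appears only in the figures; they are chosen to be consistent with the stated results (count of E = 3; DE, CE, AE frequent; BE not).

from collections import Counter

# Hypothetical prefix paths ending in E, as (path, count) pairs read off
# the E node-links of the FP-tree in the figures.
prefix_paths_e = [
    (("A", "C", "D"), 1),
    (("A", "D"), 1),
    (("B", "C"), 1),
]
min_support = 2

# Support of E is the sum of the counts along its node-link.
support_e = sum(count for _, count in prefix_paths_e)   # 3 >= 2, so {E} is frequent

# Conditional pattern base for E: count each item's occurrences within the prefix paths.
conditional_counts = Counter()
for path, count in prefix_paths_e:
    for item in path:
        conditional_counts[item] += count

# Items that stay frequent conditional on E seed the subproblems DE, CE, AE, ...
frequent_given_e = [item for item, c in conditional_counts.items() if c >= min_support]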

Working on ADE: ADE is frequent (support count = 2).

(fig 10. Suffix tree for User Step pattern ADE)

Solving the next sub-problem, CE:

(fig 11. FP-Tree conditional on suffix User Step pattern CE)

CE is frequent (support count = 2).

(fig 12. FP-Tree conditional on suffix User Step pattern CE)

Work on the next sub-problems: BE (no support) and AE. AE is frequent (support count = 2).

(fig 13. FP-Tree conditional on suffix User Step pattern AE)

Done with AE; work on the next sub-problem, suffix D. So far, E, DE, ADE, CE and AE have been discovered, in this order.

(fig 14. FP-Tree conditional on suffix User Step D)

Thus, the frequent itemsets found (ordered by suffix and by the order in which they are found) are:

(fig 15. Result for Frequent User Steps)

Advantages of User Steps: Example Use Case

Let us consider an example that appears to be a bug but where the diagnosis is incorrect. The user steps available for a program crash state that opening an existing file, modifying it, and saving it in a new format causes a crash every time on the user's machine. Clearly, a member of the testing team can conclude that the issue must be in the Save routine. But a few questions remain unanswered: did the new file format cause the crash, or the Save routine as a whole? Being a simple and straightforward case, it should have a recorded test case with the same steps, marked as Passed or Failed. If it was marked Passed, the fault may lie with the test case executor; if Failed, searching the known-issues thread should help the analysis team, or it might be a platform issue. Another possibility is that the Save routine failed because of critically low disk space on the user's machine; in that case it would not be treated as a Save routine bug, but would instead be filed as an improvement of minor/normal priority demanding a graceful exit. In none of these scenarios was digging directly into the code required, and a more precise suggestion to the developer would help provide a quick fix in no time.

Conclusion

Using this approach in day-to-day testing activities brings crash isolation and resolution down by 2-3 hours, as testers can narrow down the reason for a crash even before debugger tools can perform crash resolution.

References & Appendix

[1] https://books.google.co.in/books?id=YmKmWVYqNx4C&lpg=PA207&ots=w4Op8kTaCU&dq=importance%20of%20User%20steps%20in%20software%20program%20crashes&pg=PA207#v=onepage&q&f=true
[2] https://developer.apple.com/library/content/technotes/tn2151/_index.html#//apple_ref/doc/uid/DTS40008184-CH1-ANALYZING_CRASH_REPORTS
[3] http://www.makinggoodsoftware.com/2009/06/14/7-steps-to-fix-an-error/
[4] http://www.dumpanalysis.org/
[5] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD '00), Dallas, TX. ACM Press, New York, NY, USA, 2000.
[6] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. 2nd edition, Morgan Kaufmann, 2006.

Author Biography

Tanvi Dharmarha works at Adobe Systems as a Quality Engineering Manager and has over 10 years of experience in manual, automated, and API testing. She owns quality engineering for the Adobe Crash Reporting System. Tanvi has several paper publications to her credit. She holds an engineering degree in Information Technology and is also a certified Stanford Advanced Project Manager.

Banani Ghosh works at Adobe Systems as a Senior Software Engineer with two years of experience in manual, automated, and API testing. She has been working as a quality engineer for the Adobe Crash Reporting System. She holds an engineering degree in Electronics and Electrical Engineering. Prior to Adobe she worked at Aricent Technologies in the telecom domain, responsible for developing and maintaining several Security Gateway APIs and tools.

Rupak Chakraborty works at Adobe Systems as a Member of Technical Staff with two years of experience in the computer software industry building scalable and intelligent systems. He has led several artificial intelligence projects in the Adobe Cloudtech Tools team. Rupak has several paper publications to his credit. He holds an engineering degree in Information Technology and is also a certified research intern of the German Research Center for Artificial Intelligence.

THANK YOU!