A SURVEY ON AUTOMOBILE INDUSTRIES USING DATA MINING TECHNIQUES


S. Gunasekaran 1, C. Chandrasekaran 2

1 Head, Dept. of Computer Science, King College of Arts and Science for Women, Nallur, N.Pudupatti (Po), Namakkal (Dt.), Tamilnadu, India. Cell: Mail id: guna_as@yahoo.com
2 Associate Professor, Department of Computer Science, Periyar University, Salem, Tamil Nadu, India. Cell:

Abstract: Even though data mining has been successful in becoming a major component of various business processes, as well as in transferring innovations from academic research into the business world, the gap between the problems that the research community works on and the real world is still significant. We believe that it is essential for the business and the academic research communities to interact frequently. The goal of this paper is to investigate automobile industry data and to review the algorithms that are suited to this investigation.

Keywords: Clustering, Automobile Industries, K-Means, Outlier analysis, Supervised Learning, Machine Learning.

I. DATA MINING: AN OVERVIEW

a. Introduction

Data mining in various forms is becoming a major component of business operations. Almost every business process today involves some form of data mining. Customer Relationship Management, Supply Chain Optimization, Demand Forecasting, Assortment Optimization, Business Intelligence, and Knowledge Management are just some examples of business functions that have been impacted by data mining techniques. [1]

b. Data Mining Terminology

Data mining: The process of efficient discovery of non-obvious valuable patterns from a large collection of data.

Knowledge discovery: A term often used interchangeably with data mining.

Association rule: A rule in the form of "if this then that" that associates events in a database. For example, the association between purchased items at a supermarket.

Clustering: The technique of grouping records together based on their locality and connectivity within the n-dimensional space. This is an unsupervised learning technique.

Fuzzy logic: A system of logic based on fuzzy set theory.

Fuzzy set: A set of items whose degree of membership in the set may range from 0 to 1.

Fuzzy system: A set of rules using fuzzy linguistic variables described by fuzzy sets and processed using fuzzy logic operations.

Machine learning: A field of science and technology concerned with building machines that learn. In general it differs from Artificial Intelligence in that learning is considered to be just one of a number of ways of creating an artificial intelligence.

Neural network: A computing model based on the architecture of the brain. A neural network consists of multiple simple processing units connected by adaptive weights.

Outlier analysis: A type of data analysis that seeks to determine and report on records in the database that are significantly different from expectations. The technique is used for data cleansing, spotting emerging trends and recognizing unusually good or bad performers.

Supervised learning: A class of data mining and machine learning applications and techniques where the system builds a model based on the prediction of a well-defined prediction field. This is in contrast to unsupervised learning, where there is no particular goal aside from pattern detection.

Unsupervised learning: A data analysis technique whereby a model is built without a well-defined goal or prediction field. The systems are used for exploration and general data organization. Clustering is an example of an unsupervised learning system.

Visualization: Graphical display of data and models which helps the user in understanding the structure and meaning of the information contained in them.

II. LITERATURE SURVEY

To understand the hazards of the automobile industries, discussions were held with industry professionals and labourers. We also gathered details from the internet. From this analysis we concluded that applying data mining techniques would help the automobile industries to increase their business. As the data resource, we decided to carry out a survey to find out how the body building units use spare parts and which types of body building models move among industry people from here to the rest of India. To do so, we approached industry people in Namakkal, which is famous for its lorry body building units, prepared the queries and collected the data in real time. The different factors of the automobile body building industries were covered in interviews based on the questionnaires prepared. [8]

The collected data are compiled and grouped based on various factors. There will be low extremes and high extremes among them. To distribute the data evenly, as per statistical methods, the data are scaled by finding the mean and standard deviation and are then converted into binary values. The collected data are to be discussed with the following data mining application techniques. The applications are proposed to be implemented with the Weka data miner tool. [10] Figure 1 illustrates the data collected. [6]

III. PROPOSED IMPLEMENTATION OF DATA MINING APPLICATIONS TO THE PROBLEM

a. Classification

The most commonly used system for the induction of decision trees for classification is C4.5. ID3 and C4.5 (J48 in the Weka data miner tool) are algorithms introduced by Quinlan for inducing classification models, also called decision trees, from data. We are given a set of records. Each record has the same structure, consisting of a number of attribute/value pairs. One of these attributes represents the category of the record.
The problem is to determine a decision tree that, on the basis of answers to questions about the non-category attributes, correctly predicts the value of the category attribute. Usually the category attribute takes only the values {true, false}, or {success, failure}, or something equivalent. In any case, one of its values will mean failure.

For example, we may have the results of measurements taken by experts on some widgets. For each widget we know the value of each measurement and what was decided: whether to pass, scrap, or repair it. That is, we have a record with the measurements as non-categorical attributes [5], and the disposition of the widget as the categorical attribute.

Here is a more detailed example. We are dealing with records reporting on weather conditions for playing golf. The categorical attribute specifies whether or not to play. The non-categorical attributes are:

Figure 2:
ATTRIBUTE      POSSIBLE VALUES
=========================================
outlook        sunny, overcast, rain
temperature    continuous
humidity       continuous
windy          true, false
=========================================

The training data is in Table 2.

The basic ideas behind ID3 are that:

1) In the decision tree, each node corresponds to a non-categorical attribute and each arc to a possible value of that attribute. A leaf of the tree specifies the expected value of the categorical attribute for the records described by the path from the root to that leaf. [This defines what a decision tree is.]

2) In the decision tree, each node should be associated with the non-categorical attribute that is most informative among the attributes not yet considered in the path from the root. [This establishes what a "good" decision tree is.]

3) Entropy is used to measure how informative a node is. [This defines what we mean by "good". This notion was introduced by Claude Shannon in Information Theory.]
C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on. [5]
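As a rough illustration of the three ideas above (a tree of attribute tests, the most informative attribute at each node, entropy as the measure), the following Python sketch builds an ID3-style tree. It is a simplified sketch and not the C4.5/J48 implementation in Weka: it handles only categorical attributes, so it uses just the Outlook, Windy and Play columns of the golf training set (Table 2). The names `entropy`, `gain`, `id3` and the nested-dictionary tree format are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def gain(records, attr, target):
    """Information gain obtained by splitting `records` on `attr`."""
    before = entropy([r[target] for r in records])
    after = 0.0
    for value in {r[attr] for r in records}:
        subset = [r[target] for r in records if r[attr] == value]
        after += len(subset) / len(records) * entropy(subset)
    return before - after


def id3(records, attrs, target):
    """Return a leaf label, or a nested dict {attribute: {value: subtree}}."""
    labels = [r[target] for r in records]
    if not labels:                      # empty training set
        return "Failure"
    if len(set(labels)) == 1:           # all records agree on the category
        return labels[0]
    if not attrs:                       # no attribute left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(records, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {value: id3([r for r in records if r[best] == value],
                              rest, target)
                   for value in sorted({r[best] for r in records})}}


# Outlook, Windy and Play columns of the golf training set (Table 2).
ROWS = [
    ("sunny", False, "Don't Play"), ("sunny", True, "Don't Play"),
    ("overcast", False, "Play"), ("rain", False, "Play"),
    ("rain", False, "Play"), ("rain", True, "Don't Play"),
    ("overcast", True, "Play"), ("sunny", False, "Don't Play"),
    ("sunny", False, "Play"), ("rain", False, "Play"),
    ("sunny", True, "Play"), ("overcast", True, "Play"),
    ("overcast", False, "Play"), ("rain", True, "Don't Play"),
]
records = [{"outlook": o, "windy": w, "play": p} for o, w, p in ROWS]

tree = id3(records, ["outlook", "windy"], "play")
print(tree)
```

Because the continuous Humidity attribute is left out, the sunny branch cannot be separated cleanly and falls back to the majority class, so the printed tree differs from the full tree derived in the paper for the same example.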

FIGURE 1: TEMPORARY DATA COLLECTED FROM AUTOMOBILE INDUSTRY IN WEKA 3.6.

TABLE 2: EXAMPLE TRAINING DATA SET FOR ID3 ALGORITHM (ATTRIBUTES AS IN FIGURE 2)

OUTLOOK    TEMPERATURE  HUMIDITY  WINDY  PLAY
=====================================================
sunny                             false  Don't Play
sunny                             true   Don't Play
overcast                          false  Play
rain                              false  Play
rain                              false  Play
rain                              true   Don't Play
overcast                          true   Play
sunny                             false  Don't Play
sunny                             false  Play
rain                              false  Play
sunny                             true   Play
overcast                          true   Play
overcast                          false  Play
rain                              true   Don't Play

1) Definitions

If there are n equally probable possible messages, then the probability p of each is 1/n and the information conveyed by a message is -log(p) = log(n). [In what follows all logarithms are in base 2.] That is, if there are 16 messages, then log(16) = 4 and we need 4 bits to identify each message.

In general, if we are given a probability distribution P = (p1, p2, ..., pn), then the information conveyed by this distribution, also called the entropy of P, is:

I(P) = -(p1*log(p1) + p2*log(p2) + ... + pn*log(pn))

For example, if P is (0.5, 0.5) then I(P) is 1, if P is (0.67, 0.33) then I(P) is 0.92, and if P is (1, 0) then I(P) is 0. [Note that the more uniform the probability distribution is, the greater its information.]

If a set T of records is partitioned into disjoint exhaustive classes C1, C2, ..., Ck on the basis of the value of the categorical attribute, then the information needed to identify the class of an element of T is Info(T) = I(P), where P is the probability distribution of the partition (C1, C2, ..., Ck):

P = (|C1|/|T|, |C2|/|T|, ..., |Ck|/|T|)

In our golfing example, we have Info(T) = I(9/14, 5/14) = 0.94, and in our stock market example we have Info(T) = I(5/10, 5/10) = 1.0.

If we first partition T on the basis of the value of a non-categorical attribute X into sets T1, T2, ..., Tn, then the information needed to identify the class of an element of T becomes the weighted average of the information needed to identify the class of an element of Ti.
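The definitions of I(P) and Info(T), and the weighted average over a partition, can be checked numerically with a short Python sketch. The function names `information`, `info` and `info_after_split` are hypothetical names chosen for this example; the three lists are the Outlook, Windy and Play columns of the golf training set (Table 2).

```python
import math
from collections import Counter


def information(probabilities):
    """I(P) = -sum(p * log2(p)); zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def info(labels):
    """Info(T): entropy of the class distribution of a list of labels."""
    total = len(labels)
    return information([n / total for n in Counter(labels).values()])


def info_after_split(values, labels):
    """Weighted average of Info(Ti) after partitioning on attribute values."""
    total = len(labels)
    result = 0.0
    for value in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == value]
        result += len(subset) / total * info(subset)
    return result


# Columns of the golf training set (Table 2), in row order.
OUTLOOK = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
WINDY = [False, True, False, False, False, True, True,
         False, False, False, True, True, False, True]
PLAY = ["Don't Play", "Don't Play", "Play", "Play", "Play", "Don't Play",
        "Play", "Don't Play", "Play", "Play", "Play", "Play", "Play",
        "Don't Play"]

print(round(information([0.5, 0.5]), 2))          # 1.0
print(round(info(PLAY), 2))                       # 0.94
print(round(info_after_split(OUTLOOK, PLAY), 3))  # 0.694
print(round(info_after_split(WINDY, PLAY), 3))    # 0.892
```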
Info(X,T), the weighted average of Info(Ti), is:

Info(X,T) = Sum for i from 1 to n of (|Ti|/|T|) * Info(Ti)

In the case of our golfing example, for the attribute Outlook we have

Info(Outlook,T) = 5/14*I(2/5,3/5) + 4/14*I(4/4,0) + 5/14*I(3/5,2/5) = 0.694

Consider the quantity Gain(X,T) defined as

Gain(X,T) = Info(T) - Info(X,T)

This represents the difference between the information needed to identify an element of T and the information needed to identify an element of T after the value of attribute X has been obtained; that is, it is the gain in information due to attribute X. In our golfing example, for the Outlook attribute the gain is:

Gain(Outlook,T) = Info(T) - Info(Outlook,T) = 0.94 - 0.694 = 0.246

If we instead consider the attribute Windy, we find that Info(Windy,T) is 0.892 and Gain(Windy,T) is 0.048. Thus Outlook offers a greater informational gain than Windy.

We can use this notion of gain to rank attributes and to build decision trees where each node holds the attribute with the greatest gain among the attributes not yet considered in the path from the root. The intent of this ordering is twofold:

i) To create small decision trees so that records can be identified after only a few questions.
ii) To match a hoped-for minimality of the process represented by the records being considered (Occam's Razor).

2) The ID3 Algorithm (its successor C4.5 is available as J48 in the Weka tool)

The ID3 algorithm is used to build a decision tree, given a set of non-categorical attributes C1, C2, ..., Cn, the categorical attribute C, and a training set T of records.

function ID3 (R: a set of non-categorical attributes,
              C: the categorical attribute,
              S: a training set) returns a decision tree;
begin
   If S is empty, return a single node with value Failure;
   If S consists of records all with the same value for the categorical attribute, return a single node with that value;

   If R is empty, then return a single node with as value the most frequent of the values of the categorical attribute that are found in records of S; [note that then there will be errors, that is, records that will be improperly classified];
   Let D be the attribute with largest Gain(D,S) among attributes in R;
   Let {dj | j=1,2,...,m} be the values of attribute D;
   Let {Sj | j=1,2,...,m} be the subsets of S consisting respectively of records with value dj for attribute D;
   Return a tree with root labeled D and arcs labeled d1, d2, ..., dm going respectively to the trees ID3(R-{D}, C, S1), ID3(R-{D}, C, S2), ..., ID3(R-{D}, C, Sm);
end ID3;

In the golfing example we obtain the following decision tree:

Outlook
   overcast: Play
   sunny: Humidity
      <=75: Play
      >75: Don't Play
   rain: Windy
      true: Don't Play
      false: Play

b. Clustering

Clustering [2] is a useful technique for the discovery of data distribution and patterns in the underlying data. The goal of clustering is to discover both the dense and the sparse regions in a data set. It is also suitable for socioeconomic health hazards. There are two main approaches to clustering: hierarchical clustering and partitioning clustering. Clustering algorithms [9] differ among themselves in their ability to handle different types of attributes, numeric and categorical.

1) The K-means method:

K-means is the simplest and most popular classical clustering method, and it is easy to implement. The classical method can only be used if the data about all the objects is located in main memory. The method is called K-means [2] since each of the K clusters is represented by the mean of the objects (called the centroid) within it. It is also called the centroid method, since at each step the centroid point of each cluster is assumed to be known and each of the remaining points is allocated to the cluster whose centroid is closest to it. Once this allocation is completed, the centroids of the clusters are recomputed using simple means, and the process of allocating points to each cluster is repeated until there is no change in the clusters. The method may also be looked at as a search problem where the aim is essentially to find the optimum clusters given the number of clusters and seeds specified by the user. The K-means method uses the Euclidean distance measure.

2) K-means algorithm:

Select the number of clusters. Let this number be k [2].

1. Pick k seeds as centroids of the k clusters. The seeds may be picked randomly unless the user has some insight into the data.
2. Compute the Euclidean distance of each object in the data set from each of the centroids.
   Euclidean distance: D(x,y) = ( Sum over i of (xi - yi)^2 )^(1/2)
3. Allocate each object to the cluster it is nearest to, based on the distances computed in the previous step.
4. Compute the centroids of the clusters by computing the means of the attribute values of the objects in each cluster.

5. Check if the stopping criterion has been met. If yes, go to step 6. If not, go back to step 2.
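The loop in steps 1 to 5 can be sketched in plain Python as follows. This is a minimal illustration and not the Weka implementation that the paper proposes to use: the function names `euclidean` and `k_means`, the `max_iter` safety cap and the sample points are assumptions made for this example. The stopping criterion used here is "no change in the allocation of objects to clusters", as in the description of the method.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_means(points, seeds, max_iter=100):
    """Cluster `points` around len(seeds) centroids.

    Returns (centroids, assignment), where assignment[i] is the index of
    the cluster that points[i] was allocated to.
    """
    centroids = [tuple(s) for s in seeds]          # step 1: seeds as centroids
    assignment = None
    for _ in range(max_iter):
        # steps 2 and 3: allocate each object to its nearest centroid
        new_assignment = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # step 5: stop when the allocation no longer changes
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # step 4: recompute each centroid as the mean of its members
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                            # keep old centroid if empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assignment


# Illustrative data: two visually separate groups of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = k_means(points, seeds=[(1, 1), (8, 8)])
print(assignment)   # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Passing the seeds explicitly mirrors the text, where the seeds are either picked at random or chosen by a user with some insight into the data; different seeds can lead to a different local minimum.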

6. [Optional] One may decide to stop at this stage, or to split a cluster or combine two clusters heuristically until a stopping criterion is met.

The method is scalable and efficient and is guaranteed to find a local minimum.

Conclusion

The survey done in the present study on data mining application techniques for the automobile industries in Namakkal District of Tamil Nadu, India, can help to extract various hidden patterns in the raw data. These patterns can alert automobile retailers to take precautions and can help in their business decision making.

Reference

[1]. Han J, Kamber M. Data mining concepts and techniques. 2nd Edition, Morgan Kaufmann Publishers.
[2]. Pujari AK. Data mining techniques. University Press.
[3]. Gupta GK. Introduction to Data mining with case studies. PHI Learning Private Ltd, New Delhi.
[4]. Industry application of Data mining - White paper.
[5]. Integrating Demand and Supply Chains in the Global Automotive Industry - Deloitte.
[6]. Data collected using the questionnaires prepared.
[7]. Data mining for Business application - KDD Workshop - Rayid Ghani, Carlos Soares.
[8]. Data mining as an Automated Service - P.S. Bradley.
[9]. Data mining applications in the Automotive Industry - Rudolf Kruse, Christian Moewes.
[10]. Weka data miner tool - University of Waikato, New Zealand.

Author profile:

Corresponding Author 1: Mr. S. Gunasekaran completed his M.Sc. (CS) in Thanthai Hans Roever College, Perambalur, Salem, M.Phil. (CS) under M.C.A., under Periyar University, Manonmaniam Sundaranar University, Tirunelveli, and pursued his Ph.D. in Computer Science under Dravidian University, Kuppam. He has been working as Head, Dept. of Computer Science, in King College of Arts and Science, Namakkal, Tamil Nadu, India, with 12 years of teaching experience. He has published in various journals, and his area of research is data mining application techniques and networks.

Corresponding Author 2: Dr. C. Chandrasekaran, M.C.A., Ph.D. (Comp. Sci.), has been working as an Associate Professor, Department of Computer Science, Periyar University, Salem, Tamil Nadu, India, with 15 years of teaching experience and 8 years of research experience. He has guided more than 22 research scholars, published various research articles in reputed journals, and chaired many seminars and conferences. His areas of research are data mining, networks and image processing.

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X
