EMPOWERING SCIENTIFIC DISCOVERY BY DISTRIBUTED DATA MINING ON A GRID INFRASTRUCTURE


EMPOWERING SCIENTIFIC DISCOVERY BY DISTRIBUTED DATA MINING ON A GRID INFRASTRUCTURE

A PROPOSAL FOR DOCTORAL RESEARCH by Haimonti Dutta

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY AT UNIVERSITY OF MARYLAND BALTIMORE COUNTY, 1000 HILLTOP CIRCLE, BALTIMORE, MD, JULY 2006

Table of Contents

Abstract

1 Introduction
   1.1 Motivation
   1.2 Proposed Research
   1.3 Objectives

2 Background
   2.1 The Grid
      2.1.1 Introduction
      2.1.2 The Grid Architecture
      2.1.3 Classification of Grids
   2.2 The Data Grid
      2.2.1 Introduction
      2.2.2 Data Distribution Scenarios
      2.2.3 Middleware, Protocols and Services
      2.2.4 Data Mining on the Grid
   2.3 Distributed Data Mining
      2.3.1 Introduction
      2.3.2 Classification
      2.3.3 Clustering
      2.3.4 Distributed Data Stream Mining
   2.4 The Challenges

3 Preliminary Work
   3.1 Introduction
   3.2 Orthogonal Decision Trees
      Decision Trees and the Fourier Representation
      Computing the Fourier Transform of a Decision Tree
      Construction of a Decision Tree from Fourier Spectrum
      Removing Redundancies from Ensembles
      Experimental Results
   3.3 DDM on Data Streams
      Introduction
      Experimental Results
      Monitoring in Resource Constrained Environments
      Grid Based Physiological Data Stream Monitoring - A Dream or Reality?
   3.4 DDM on Federated Databases
      The National Virtual Observatory
      Data Analysis Problem: Analyzing Distributed Virtual Catalogs
      The DEMAC System
      WS-DDM: DDM for Heterogeneously Distributed Sky-Surveys
      WS-CM: Cross-Matching for Heterogeneously Distributed Sky-Surveys
      DDM Algorithms: Definitions and Notation
      Virtual Catalog Principal Component Analysis
      Case Study: Finding Galactic Fundamental Planes
   3.5 Summary

4 Future Work
   The DEMAC System - Further Explorations
      Grid-enabling DEMAC
      PCA-based Outlier Detection on DEMAC
   Proposed Plan of Research

Bibliography

Abstract

The grid-based computing paradigm has attracted much attention in recent years. The sharing of distributed computing resources (such as software, hardware, data, sensors, etc.) is an important aspect of grid computing. Computational Grids focus on methods for handling compute-intensive tasks, while Data Grids are geared towards data-intensive computing. Grid-based computing has been put to use in several application areas including astronomy, chemistry, engineering, climate studies, geology, oceanography, ecology, physics, biology, health sciences and computer science. For example, in the field of biomedical informatics, researchers are building an infrastructure of networked high-performance computers, data integration standards, and other emerging technologies, to pave the way for medical researchers to transform the way diseases are treated. In oceanography, efforts are being made to federate ocean observatories into an integrated knowledge grid. Breakthroughs in telescope, detector, and computer technology allow astronomical surveys to produce terabytes of images and catalogs, thereby producing a data avalanche. However, extracting meaningful knowledge from these gigantic, geographically distributed, heterogeneous data repositories requires the development of architectures, sophisticated data mining algorithms and efficient schemes for communication.

This proposal considers research in grid-based distributed data mining. It aims to bring together the relatively new research areas of distributed data mining and grid mining. While architectures for data mining on the grid have already been proposed, we argue that the inherently distributed, heterogeneous nature of the grid calls for distributed data mining. Consequently, research should be geared towards the development of distributed schema integration, query processing, algorithm development and workflow management. As a proof of concept, we first explore the feasibility of executing distributed data mining algorithms on astronomy catalogs obtained from two different sky surveys, the Sloan Digital Sky Survey (SDSS) and the Two Micron All Sky Survey (2MASS). In particular, we examine a technique for cross-matching indices of different catalogs, thereby aligning them, use a randomized distributed algorithm for principal component analysis, and propose to develop an outlier detection algorithm based on a similar technique. While this serves as a proof of concept, efforts are under way to grid-enable the application. This requires research on service-oriented architectural paradigms to support distributed data mining on the grid.

The data repositories ported on the grid are not all static. Streaming data from web click streams, network intrusion detection applications, sensor networks, wearable devices and multimedia applications are also finding their way into the grid. This is particularly useful since, in this way, researchers do not need to set up or own mobile devices or expensive equipment such as telescopes and satellites, but can access interesting data streams published on the grid. However, in order to discover meaningful knowledge from these distributed, heterogeneous streams, efforts have to be made to build new architectures and algorithms to support distributed data streams. We propose to address these issues to enable data stream mining on the grid.

Chapter 1
Introduction

1.1 Motivation

Advances in science have been guided by the analysis of data. For example, huge genome sequences [132] available online motivate collaborative research in biology; catalogs of sky surveys [242, 3] enable astronomers to answer queries that might otherwise have taken years of observation; high-resolution, long-duration simulation data from experiments and models enables research in climatology, physics, geosciences and chemistry [197, 115, 225]; and advanced imaging capabilities such as Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scans produce large volumes of data for medical professionals [29]. However, as pointed out by Ian Foster, converting data into scientific discoveries requires "connecting data with people and computers". It involves:

1. Finding the data of interest.
2. Moving the data to desired locations.
3. Managing large scale computations.
4. Scheduling resources on data, and
5. Managing who can access the data when.

For example, the main goal of CERN, the European Organization for Nuclear Research in Geneva, Switzerland, is to study the fundamental structure of matter and the interaction of forces. In particular, subatomic particles are accelerated to nearly the speed of light and then collided. Such collisions are called events and are measured at time intervals of only 25 nanoseconds in four different particle detectors of the Large Hadron Collider (LHC), CERN's next-generation accelerator, which has started data collection. According to the MONARC Project¹, each of the 4 main experiments will produce around 1 Petabyte of data a year over a life span of about two decades.

¹ MONARC: Models of Networked Analysis at Regional Centers for LHC experiments.

This data needs to be analysed by about 5,000 physicists around the world. Since CERN experiments are collaborations of over a thousand physicists from many different universities and institutes, the experiments' data is not only stored locally at CERN but is distributed world wide in so-called Regional Centres (RCs), in national institutes and universities. Thus, complex distributed computing infrastructures motivate the need for Grid environments.

To extract meaningful information from distributed, heterogeneous data repositories on the grid, sophisticated knowledge discovery architectures have to be designed. The process of data mining on the grid is still a relatively new area of research ([40, 43, 46, 37, 220, 264, 233]). While several architectures have been developed for this purpose, the framework of distributed data mining on the grid infrastructure still has a long way to go. The aim of this proposal is to motivate research in this direction.

1.2 Proposed Research

There has been a growing interest in grid computing in recent years (see section 2.1). Grids can be classified into two main categories: (1) Computational Grids, designed to meet the increasing demand of compute-intensive science, and (2) Data Grids, designed to meet the needs of data-intensive applications. In this proposal, the primary focus is on Data Grids. The objective of setting up a Data Grid is to encapsulate the underlying mechanisms of storage, querying and transfer of data. Thus a user of the grid does not need to bother about the underlying mechanisms of data storage, authentication, authorization, resource management and security, but can still have the benefits of large scale distributed computing.

Several protocols, services and middleware architectures have been proposed for storage, integration and querying of data on the grid. Of particular interest is the Open Grid Services Architecture - Data Access and Integration (OGSA-DAI) [205], which was conceived by the UK Database Task Force and works closely with the Database Access and Integration Service - Working Group (DAIS-WG) of the Global Grid Forum (GGF) and the Globus team. Their aim is to develop a service based architecture for data access and integration. Several other projects such as Knowledge Grid [40], Grid Miner [220], Discovery Net [264], TeraGrid [257], ADaM (Algorithm Development and Mining) [233] on NASA's Information Power Grid, and the DataCutter project [191] have focused on the creation of middleware / systems for data mining and knowledge discovery on top of the Data Grid. Motivated by this research, we propose to develop service based architectures for distributed data mining on the grid infrastructure.

We have developed a system for distributed data mining on astronomy catalogs (see section 3.4) using the resources from the National Virtual Observatory. The system demonstrates how distributed data mining algorithms can be designed on top of heterogeneous astronomy catalogs without downloading them onto a centralized server. In particular, we examine a randomized algorithm for distributed principal component analysis and provide experimental results to show that this algorithm replicates results obtained in the centralized setting at a lower communication cost. Encouraged by these results, we also plan to develop a distributed outlier detection algorithm for astronomy catalogs.
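To make the potential communication savings concrete, the following sketch shows a simpler, non-randomized aggregation scheme for PCA over horizontally partitioned data: each site ships only its d x d second-moment statistics rather than its full n x d data, and a coordinator recovers the exact global principal components. This is a minimal Python illustration of the general idea, not the randomized algorithm studied in this proposal; all function and variable names are ours.

import numpy as np

def local_stats(X):
    """Per-site summary: row count, feature sums, and X^T X (d x d)."""
    return X.shape[0], X.sum(axis=0), X.T @ X

def aggregated_pca(stats, k=2):
    """Combine per-site moments into the global covariance, then eigendecompose."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    second = sum(s[2] for s in stats)
    mean = total / n
    cov = second / n - np.outer(mean, mean)  # E[xx^T] - mu mu^T (biased covariance)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]           # largest eigenvalues first
    return vecs[:, order[:k]]                # top-k principal directions

# Two "sites" holding horizontal partitions of the same (synthetic) catalog.
rng = np.random.default_rng(0)
site_a, site_b = rng.normal(size=(100, 5)), rng.normal(size=(80, 5))
components = aggregated_pca([local_stats(site_a), local_stats(site_b)])

For n rows and d attributes per site, the per-site communication drops from O(nd) to O(d^2), which illustrates why distributed PCA can be far cheaper than centralizing the catalogs.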

We also propose to develop a service based architecture for distributed stream mining on the grid. Distributed data streams obtained from network intrusion detection applications, sensor networks, vehicle monitoring systems and web click streams are being ported onto the grid. Mining these inherently distributed, heterogeneous streams on the grid requires the development of new architectures and algorithms, since existing architectures such as those described in section 2.2.4 may not be suited for streaming data. Thus the overall focus of our attention is on developing a synergy between distributed data mining and grid-based data mining. We propose to develop service based architectures for grid-based distributed data mining relying on application scenarios from astronomy.

1.3 Objectives

The main objectives of the proposed research are as follows:

1. Develop a service-oriented architecture for enabling distributed data mining on the Grid. This includes:
   (a) Development of services for distributed schema integration, integration of indices and query processing.
   (b) Development of distributed workflows for service composition.

2. Develop a prototype system implementing the above architecture. The application area is astrophysics and the objectives of building the system are as follows:
   (a) Access and integrate federated astronomy databases using the Open Grid Services Architecture - Data Access and Integration (OGSA-DAI) [205] middleware as a starting point. Extension of OGSA-DAI to incorporate schema integration and workflow composition is a future objective.
   (b) Perform distributed data mining on these repositories, including dimension reduction by Principal Component Analysis (PCA), classification and outlier detection.
   (c) Provide a client-side browsing interface that enables astrophysicists to perform distributed data mining on the federated databases without having to intricately manage resource allocation, authorization, authentication and communication of grid resources.

3. Develop a service-oriented architecture for distributed data stream mining on the Grid.

The remainder of this proposal is organized as follows. Chapter 2 offers an overview of the grid infrastructure, with emphasis on the Data Grid, existing architectures for data mining and the integration of streams on the grid, and identifies the challenges for distributed data mining on the grid. Chapter 3 presents our preliminary work on distributed classification using Orthogonal Decision Trees (ODTs) and shows the applicability of ODTs in streaming, resource-constrained devices. It also presents a feasibility study for distributed scientific data mining on astronomy catalogs. Chapter 4 outlines the directions for future research.

Chapter 2
Background

2.1 The Grid

2.1.1 Introduction

The science of the 21st century requires large amounts of computation power, storage capacity and high speed communication [124, 99]. These requirements are increasing at an exponential rate and scientists are demanding much more than is available today. Several astronomy and physical science projects such as CERN's¹ Large Hadron Collider (LHC) [170], the Sloan Digital Sky Survey (SDSS) [242] and the Two Micron All Sky Survey (2MASS) [3], bioinformatics projects including the Human Genome Project [132] and gene and protein archives [216, 251], and meteorological and environmental surveys [197, 239] are already producing peta- and terabytes of data which need to be stored, analyzed, queried and transferred to other sites. To work with collaborators at different geographical locations on peta-scale data sets, researchers require communication of the order of Gigabits/sec. Thus computing resources are failing to keep up with the challenges they face. The concept of the "Grid" has been envisioned to provide a solution to these increasing demands and offer a shared, distributed computing infrastructure. In an early article [99] that motivates the need for Grid computing, Ian Foster describes the Grid "vision" "...to put in place a new international scientific infrastructure with tools that, together, can meet the challenging demands of 21st-century science." Today, much of this dream has become reality, with numerous research projects working on different aspects of grid computing, including development of the core technologies and the deployment and application of grid technology to different scientific domains².

In the following sections, we briefly review the grid architecture and provide a classification of different types of grids. It must be noted that the objective of this proposal is not to provide a detailed overview of grid computing and related issues, but to introduce the concept of mining and knowledge discovery on a data grid (introduced later in section 2.2.1). Consequently, a reader interested in grid computing should refer to [101] for a detailed overview.

¹ Conseil Européen pour la Recherche Nucléaire - European Organization for Nuclear Research
² A list of applications in different scientific domains using the grid technology can be found at

2.1.2 The Grid Architecture

Figure 2.1: The hourglass model

The sharing of distributed computing resources, including software, hardware, data, sensors, etc., is an important aspect of grid computing. Sharing can be dynamic depending on the current need, may not be limited to client-server architectures, and the same resources can be used in different ways depending on the objective of sharing. These characteristics and requirements for resource sharing necessitate the formation of Virtual Organizations (VOs) [135]. Thus "VOs enable disparate groups of organizations and / or individuals to share resources in a controlled fashion, so that members may collaborate to achieve a shared goal." [135] An example of a virtual organization is the International Virtual Data Grid Laboratory (iVDGL) [140], an NSF-funded project that aims to share computing resources for experiments in high energy physics [170], gravitational wave searches (LIGO) [173] and astronomy [242].

The architecture for grid computing, henceforth referred to as grid architecture, is a protocol architecture that outlines how the users of a virtual organization interact with one another for resource sharing.

Figure 2.2: The Grid Protocol Architecture

Proposed by Ian Foster and Carl Kesselman [135], the grid architecture follows the principles of an "hourglass model". Figure 2.1, obtained from [99], illustrates this architecture. The "narrow neck" of the hourglass defines the core set of protocols and abstractions, the top contains the high level behaviors and the base contains the underlying infrastructure. The Grid protocol architecture comprises several layers (illustrated in Figure 2.2): the Fabric layer, responsible for local, resource specific operations; the Connectivity layer, which handles the network connections; the Resource layer, containing protocols for sharing single resources; the Collective layer, for co-ordination among underlying resources; and the Applications layer, containing user applications. These layers provide the basic protocols and services that are necessary for sharing of resources among different groups in a virtual organization. Thus, if the user application is a data mining scenario, the Fabric layer would contain the participating computers with data repositories; the Connectivity layer would comprise the service discovery, authorization and authentication services and communication; the Resource layer would contain the access to computation and data; and the Collective layer would contain the resource discovery, system monitoring and other application specific requirements.

A further enhancement to the protocol based grid architecture was the Open Grid Services Architecture (OGSA), proposed in [134]. OGSA introduces the concept of Grid Services, which can be regarded as specialized web services that contain interfaces for discovery, dynamic service creation, lifetime management, etc., and conform to the Web Services Description Language (WSDL) specifications. Various VO structures can be configured using the grid services interfaces for creation, registration and discovery. Thus, the use of grid services provides a way to virtualize components in a grid environment and ensures abstraction across different layers of the architecture. The implementation of the OGSA architecture can be found in the current release of the Globus Toolkit [133].

In this section we briefly reviewed the grid protocol architecture and the web services based Open Grid Services Architecture (OGSA). The following subsection discusses methods to classify grids.

2.1.3 Classification of Grids

Classification of grids may be done based on different criteria such as the kind of services provided, the class of problems they address or the community of users [127]. However, a common method of discrimination depends on whether they offer computational power (Computational Grids) or data storage (Data Grids).

1. Computational Grid: The computational grid has been defined as ".. a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities." [100] Computational grids are designed to meet the increasing need for computational power by large scale pooling of resources such as compute cycles, idle CPU time between machines, software services, etc. An example where a computational grid could be put to use is a health maintenance organization of a metropolitan area, requiring collaboration between medical personnel, patients, health insurance representatives, financial experts and administrative personnel. The resources to be shared include high end compute servers, hundreds of workstations, patient databases, medical imaging archives and medical instrumentation (such as Computed Tomography (CT) scan, Magnetic Resonance Imaging (MRI), ElectroCardioGram (ECG) and ultrasonography equipment). The formation of a computational grid enables computer aided diagnosis by utilizing information from different medical disciplines, and facilitates cross domain medical research, searching of imaging archives, enhanced recommendation schemes for health insurance facilities, and detection of fraud on financial data (such as hospital bills, insurance claims, etc.).

2. Data Grid: The data grid is primarily geared towards the management of data intensive applications [64, 17] and focuses on the synthesis of knowledge discovered from geographically distributed data repositories, digital libraries and archives. An example of a data grid would be a collaboration of astronomy sky surveys such as the Sloan Digital Sky Survey [242] and the Two Micron All Sky Survey [3], which are producing large volumes of astronomical data. The purpose of forming such a data grid would be to enhance astronomy and astrophysical research by making use of distributed data mining and knowledge discovery techniques.

In this proposal we are mainly concerned with data grids and focus on how efficient distributed algorithms can be designed on top of them. The next section offers an overview of the architecture of the data grid, some already existing data grids, and efforts made towards the implementation of data mining and knowledge discovery services on the data grid.

2.2 The Data Grid

2.2.1 Introduction

In many scientific domains such as astronomy, high energy physics, climatology, computational genomics, medicine and engineering, large repositories of geographically distributed data are being generated ([265], [13], [14], [267], [192], [136], [228]). Researchers needing to access the data may come from different communities and are often geographically distributed. The use of these repositories as community resources has motivated the need for developing an infrastructure capable of providing storage and replica management facilities, efficient query execution techniques, data transfer schemes, caching and networking. The Data Grid [64] has emerged to provide an architecture for distributed management and storage of large scientific data sets. The objectives of such an architecture are:

1. To provide a framework such that the low level mechanisms of storage, transfer of data, etc. are well encapsulated;
2. To allow design issues that can lead to significant performance implications to be manipulated by a user;
3. To be compatible with a Grid infrastructure and benefit from the grid's facilities of authentication, resource management, security and uniform information infrastructure.

Figure 2.3: The Data Grid Architecture

Figure 2.3⁴ illustrates the basic components of the Data Grid as envisioned by Chervenak et al. The core grid services are utilized to provide basic mechanisms of security, resource management, etc. The high level components (such as replica management and replica selection) are allowed to be built on top of these basic grid components. Their work assumes data access and metadata access as the fundamental services necessary for a data grid architecture. The data access service handles issues related to accessing, managing and transferring data to third parties. The metadata access services are explicitly concerned with the handling of information regarding the data, such as information related to how the data was created, how to use it and how file instances can be mapped to storage locations.

⁴ This figure has been adapted from [64] with slight modifications.
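As a toy illustration of this logical-to-physical mapping, consider the following Python sketch of a replica catalog; the catalog layout, file names and URLs are hypothetical and are not drawn from [64] or from any actual middleware.

# Hypothetical metadata catalog: maps a logical file name to the physical
# replicas that hold it, so clients never hard-code storage locations.
replica_catalog = {
    "lfn://sdss/run756/frame-r-000756.fits": [
        "gsiftp://storage1.example.org/data/frame-r-000756.fits",
        "gsiftp://storage2.example.org/mirror/frame-r-000756.fits",
    ],
}

def resolve(lfn, prefer=None):
    """Return one physical location for a logical file name."""
    replicas = replica_catalog.get(lfn)
    if not replicas:
        raise KeyError(f"no replica registered for {lfn}")
    if prefer:  # e.g. pick the replica closest to the compute site
        for url in replicas:
            if prefer in url:
                return url
    return replicas[0]

print(resolve("lfn://sdss/run756/frame-r-000756.fits", prefer="storage2"))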

Other basic grid services that can be incorporated into a data grid framework include an authorization and authentication infrastructure (such as the Grid Security Infrastructure), resource allocation schemes and performance management facilities. It must be noted that any number of high level components can be designed by the user, using the aforementioned basic grid services.

Several projects such as GriPhyN (Grid Physics Network [115]) and the European Data Grid Project ([93]) have already implemented the data grid architecture. In order to harness the petascale data resources obtained from four data intensive physics experiments (ATLAS and CMS [170], LIGO [173], SDSS [242]), the GriPhyN project conceptualized the idea of Petascale Virtual Data Grids (PVDGs) and Virtual Data. Petascale Virtual Data Grids [18, 17] are aimed at serving a diverse community of scientists and researchers, enabling them to retrieve, modify and perform experiments and analyses on the data. The idea of virtual data revolves around the creation of a virtual space of data products derived from experimental data. The European Data Grid Project is also motivated by the needs of the High Energy Physics, Earth Observation and BioInformatics research communities, who need to store, access and process large volumes of data. The architecture [275, 130, 75, 4] of the data grid proposed is modular in nature and has similar characteristics to the GriPhyN project. The subsystems of the architecture include (in order from bottom to top) Fabric Services, Grid Services, Collective Services, Grid Applications and Local Computing modules. Management of workload distribution, resource sharing and management, monitoring of fault tolerance, and providing an interface between grid services and underlying storage are some of the functionalities handled by the Fabric Services. The Grid Services typically comprise the SQL database services, authentication and authorization, replica management and service indices⁵. The grid service schedulers and replica managers form the bulk of the Collective Services, while application specific services are handled in the Grid Applications layer. The Local Computing layer resides outside the Grid infrastructure and typically consists of the desktop machines from which end users access the data grid.

In this section we discussed the motivation, features and components of the data grid architecture and two projects, GriPhyN and the European Data Grid Project, that have implemented this architecture. At this point it will be interesting to discuss some of the data distribution scenarios that are commonly seen in the data grid. The next subsection introduces this topic.

⁵ The services that allow a large number of decentralized grid components to collaboratively work in virtual data environments.

2.2.2 Data Distribution Scenarios

In a data grid, the repositories may contain data in different formats. We discuss several different data distribution schemes here⁶, and illustrate the two partitioning schemes with a short sketch after the list.

1. Centralized Data Source: This is one of the simplest scenarios, since the data can be thought of as residing in a single relational database, a flat file, or as unstructured data (XML). Grid / Web services needing to access this data source can do so by using metadata to obtain the physical data locations and then making use of relevant query languages.

2. Distributed Data Source: When the data is assumed to be distributed among different sites, two different scenarios can arise.
   (a) Horizontally Partitioned Data: The horizontal partitioning ensures that each site contains exactly the same set of attributes. Note that we refer to the data as horizontally partitioned with respect to a virtual global table. An example of horizontally partitioned data could be a department store chain such as Walmart, which has shops at different geographical locations. Each shop maintains information about its customers such as name, address, telephone number and products purchased. Although the shops are geographically distributed, each database keeps track of exactly the same information about its customers.
   (b) Vertically Partitioned Data: The vertical partitioning requires that different attributes are observed at different sites. The matching between the tuples can be determined using a unique identification or key that is shared among all the sites. An example of vertically partitioned data can be different sky surveys such as SDSS [242], 2MASS [3], Deep [77] and CfA [51], all observing different attributes of the same objects seen in the sky.

In either case, horizontal or vertical partitioning of data, grid services can provide a level of abstraction and encapsulation so that the user is not burdened with writing custom code to access the data. Due to the different distribution schemes available, porting databases to the Grid requires the development of new protocols, services and middleware architectures. We discuss some of the relevant work in this area in the following section.

⁶ It is assumed that the reader is familiar with the basic steps involved in the development of a web service, and we refrain from a detailed discussion in this context.
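The following Python sketch (using pandas, with column names invented for the example) makes the two partitioning schemes concrete: a horizontally partitioned virtual global table is reassembled by concatenating rows, while a vertically partitioned one is reassembled by joining on the shared key - exactly the cross-matching problem faced with sky survey catalogs.

import pandas as pd

# Horizontal partitioning: every site observes the SAME attributes
# for DIFFERENT tuples (e.g. two store locations of one retailer).
store_a = pd.DataFrame({"cust_id": [1, 2], "purchase": [10.0, 25.0]})
store_b = pd.DataFrame({"cust_id": [3, 4], "purchase": [7.5, 12.0]})
global_horizontal = pd.concat([store_a, store_b], ignore_index=True)

# Vertical partitioning: sites observe DIFFERENT attributes of the
# SAME objects, matched on a shared key (e.g. two sky surveys that
# measure different bands for the same celestial objects).
survey_optical = pd.DataFrame({"obj_id": [101, 102], "r_mag": [18.2, 19.7]})
survey_infrared = pd.DataFrame({"obj_id": [101, 102], "j_mag": [16.9, 17.8]})
global_vertical = survey_optical.merge(survey_infrared, on="obj_id")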

2.2.3 Middleware, Protocols and Services

As noted in section 2.2.1, the objective of developing a data grid is to encapsulate the low level mechanisms of storage, integration and querying of data stored at different geographical locations in tables, archives, libraries and other repositories. The data grid community has to develop techniques to handle heterogeneous data repositories in an integrated way. Thus, the infrastructure should support standard protocols and services and build re-usable middleware components for storage, access and integration of data. These are areas of active research and we briefly discuss them in this section.

GridFTP: The data transfer protocol GridFTP [9, 269] was designed to provide secure and efficient data movement in the Grid environment. In many applications on the data grid [257, 205, 274, 86], GridFTP is used as the protocol for data transfer between different components of the system. It is an extension of the standard FTP protocol. In addition to the standard features of FTP, GridFTP also provides Grid Security Infrastructure (GSI) and Kerberos support, third party control of data, and parallel, striped and partial data transfer. While a detailed discussion of the protocol specifications is outside the scope of this proposal, an interested reader should refer to [268] for further details.

Several other projects have resorted to the development of middleware for data access and integration. A relational database middleware approach ([270]) and service-oriented approaches ([205, 69, 68]) have already been proposed. In the Spitfire project [246], within the European Data Grid project [93], grid enabled middleware services have been used to access relational data tables. The client and the middleware service communicate using XML over GSI enabled secure HTTP. The middleware service and the relational databases communicate using JDBC / ODBC calls. While this approach is interesting, it is limited to the model of queries and transactions [271] and requires a lot of application dependent code to be written by the programmers themselves. This creates a lack of portability among different databases and does not provide a metadata driven approach as advocated by the data grid architecture.

In contrast to the relational database middleware approach, the service oriented approach focuses on providing services for the generic database functionalities such as querying, transactions, etc. This introduces a level of abstraction or encapsulation, since the service descriptions contain definitions of what functionality is available to a user, without specifying how it is implemented in the underlying system. Thus a virtual database service may provide the illusion that a single database is being accessed, whereas in fact the underlying service can access several different types of data repositories.

The Open Grid Services Architecture - Data Access and Integration (OGSA-DAI) project [205, 185, 161], conceived by the UK Database Task Force⁷, is developing a common middleware solution to be used by different organizations, allowing uniform access to data resources using a service based architecture. The project aims to expose different types of data resources (including relational and unstructured data) to grids, allow data integration, provide a way of querying, updating, transforming and delivering data via web services, and provide metadata about the data resources to be accessed. The architecture of the OGSA-DAI infrastructure, illustrated in Figure 2.4⁸, depends on three main types of services:

1. Data Access and Integration Service Group Registry (DAISGR): The purpose of this service is to publish and locate metadata regarding the data resources and other services available.

⁷ OGSA-DAI is working closely with the Database Access and Integration Services - Working Group (DAIS-WG) of the Global Grid Forum (GGF), the Open Middleware Infrastructure Institute (OMII) and the Globus team.
⁸ This figure has been adapted from [161].

Figure 2.4: The Architecture of the OGSA-DAI Services

Thus, clients can use the DAISGR to query metadata of registered services and select the service that best suits their requirements.

2. Grid Data Service Factory (GDSF): It acts as an access point to data resources and allows the creation of Grid Data Services (GDS).

3. Grid Data Service (GDS): It acts as a transient access point for the data source. Clients can access data resources using the GDS.

When the service container starts up, the DAISGR is invoked and gets instantiated. On creation, a GDSF may register as a service with the DAISGR, which enables discovery of other services and data resources using relevant metadata. The Grid Data Services are invoked at the request of clients wanting to access a particular resource. It is interesting to note that several different types of data resources are supported by OGSA-DAI, including Oracle, MySQL, DB2, SQLServer, PostgreSQL, Cloudscape, IBM Content Manager and even Data Streams. The infrastructure developed by the OGSA-DAI project is a popular⁹ data access and integration service for developing data grids and has been used in several astronomy, bioinformatics, medical research, meteorology and geo-science applications¹⁰. More recently, a service based Distributed Query Processor (DQP) has been developed to work with OGSA-DAI. OGSA-DQP extends OGSA-DAI by incorporating two new services: (1) the Grid Distributed Query Service (GDQS), which compiles, optimizes, partitions and schedules distributed query execution plans over multiple execution nodes in the Grid; and (2) the Grid Query Evaluation Service (GQES), which is responsible for evaluating the partitions of the query execution plans assigned by the GDQS.

⁹ Grid toolkits such as Globus GT3.0 and Unicore do not have the facility of uniform data access using web services.
¹⁰ A complete listing of projects using the OGSA-DAI software is available at
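To summarize the interaction pattern among the three services, the following Python sketch mimics the registry / factory / transient-service flow. The classes are hypothetical stand-ins for what are, in reality, web services; this is emphatically not the OGSA-DAI API, only an illustration of the roles described above.

# Hypothetical stand-ins for the three OGSA-DAI service roles.
class Registry:                      # DAISGR: publishes service metadata
    def __init__(self): self.factories = {}
    def register(self, resource, factory): self.factories[resource] = factory
    def lookup(self, resource): return self.factories[resource]

class DataServiceFactory:            # GDSF: access point that creates GDSs
    def __init__(self, tables): self.tables = tables
    def create_service(self): return DataService(self.tables)

class DataService:                   # GDS: transient, per-client access point
    def __init__(self, tables): self.tables = tables
    def query(self, table): return self.tables[table]

registry = Registry()
registry.register("sdss", DataServiceFactory({"photoobj": [(101, 18.2)]}))

# Client flow: query the registry, obtain a factory, create a service, query.
gds = registry.lookup("sdss").create_service()
print(gds.query("photoobj"))

The design choice worth noting is that the GDS is transient: it is created per client request, so the factory and registry can virtualize many heterogeneous back-ends behind a single discovery mechanism.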

It is interesting to note that none of the above mentioned projects yet meets the need for schema integration. Traditionally, in the database community, data integration [168] has been defined as "..the problem of combining data residing at different sources, and providing the user with a unified view of this data." This means that the process of data integration involves modelling the relation between the individual database schemas and the global schema obtained by integrating them. Given that the grid could contain horizontally or vertically partitioned data as mentioned in section 2.2.2, unstructured data and data streams, the problem of data integration becomes a non-trivial one¹¹.

A decentralized service based data integration architecture for Grid databases has been proposed in the Grid Data Integration System (GDIS) [70]. It uses the middleware architecture provided by OGSA-DQP, OGSA-DAI and the Globus Toolkit. It is based on the Peer Database Management System (PDMS) [10], a P2P based decentralized data management architecture for supporting data integration issues in relational databases. The basic idea is that any peer in the PDMS can contribute data, schema information, or mappings between schemas, forming an arbitrary graph of interconnected schemas. The GDIS system offers a wrapper / mediator based approach to integrate the data sources.

The process of data storage, access and integration on the grid is still an area of active research. Existing protocols, services and middleware architectures described in this section aim to solve related problems, but the area is still open for research. As the architectures are evolving, researchers have also focused on data mining and knowledge discovery on the data grid infrastructure. The next section reviews this topic in some detail.

¹¹ An example would be cross-matching heterogeneously distributed astronomy catalogs from sky surveys, described in section 3.4.
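As a toy illustration of the wrapper / mediator idea used by GDIS, the sketch below rewrites a request posed against a global schema into per-source requests via per-source attribute mappings; all schema, source and attribute names are invented for the example.

# Hypothetical wrapper/mediator: a global attribute name is mapped onto
# each source's local schema, and results are merged into a unified view.
schema_mappings = {
    "sourceA": {"brightness": "r_mag"},        # local column names differ
    "sourceB": {"brightness": "Rmagnitude"},
}
sources = {
    "sourceA": [{"obj": 1, "r_mag": 18.2}],
    "sourceB": [{"obj": 2, "Rmagnitude": 19.1}],
}

def mediate(attribute):
    """Rewrite a global-schema request into per-source requests and merge."""
    unified = []
    for name, rows in sources.items():
        local = schema_mappings[name][attribute]   # wrapper: translate name
        unified += [{"obj": r["obj"], attribute: r[local]} for r in rows]
    return unified

print(mediate("brightness"))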

2.2.4 Data Mining on the Grid

Several research projects including the Knowledge Grid [40, 43, 46, 37], Grid Miner [220], Discovery Net [264, 1], TeraGrid [257], ADaM (Algorithm Development and Mining) [233] on NASA's Information Power Grid, and the DataCutter project [191] have focused on the creation of middleware / systems for data mining and knowledge discovery on top of the data grid. We briefly review related work in this area.

Figure 2.5: The Knowledge Grid Architecture

1. The Knowledge Grid: Built on top of a grid environment, it uses basic grid services such as authentication, resource management, communication and information sharing to extract useful patterns, models and trends from large data repositories. It is organized into two layers, the Core K-grid layer and the High level K-grid layer, the latter implemented on top of the core layer. The Core K-grid layer is responsible for the management of metadata describing data sources, third party data mining tools, algorithms and visualization. It comprises two main services - the Knowledge Discovery Services (KDS) and the Resource Allocation and Execution Management (RAEM). The knowledge discovery services extend the basic Globus monitoring services and are responsible for managing metadata regarding which data repositories are to be mined, data manipulation and pre-processing, and certain specific execution plans. The information thus managed is stored in three repositories - the Knowledge Metadata Repository (KMR), which contains metadata regarding the data, software and tools; the Knowledge Base Repository (KBR), which stores learned knowledge; and the Knowledge Execution Plan Repository (KEPR), which keeps track of the execution plans for a knowledge discovery process. The interested reader is referred to [71, 252, 45, 41, 42, 39] for further details regarding each of these repositories. The RAEM finds mappings between execution plans and resources available on the grid. The High level K-grid layer, built on top of the core grid, mainly includes services used to compose, execute and validate the specific distributed data mining operations. The main services provided by it include data access services, tools and algorithms access services, execution plan management services and a results representation service. Figure 2.5, adapted from [43], illustrates the basic components. The architecture has been implemented in a toolset named VEGA (Visual Environment for Grid Applications) [73]. It is responsible for task composition, consistency checking and generation of the execution plan. A visual interface for the workflow management is an attractive feature. However, it appears that a more fundamental problem is to be able to design an algorithm, or the steps required, to perform distributed workflow management. As of now a distributed workflow management scheme does not exist, and this appears to be a very important need for the Grid community. It would be interesting to see if the workflow manager designed by the authors can be extended to a distributed workflow management scheme. While the authors take particular care in the description of an elaborate architecture, the distributed data mining algorithms implemented on this architecture seem to be too simple and inadequate for the current context. They perform experiments on network intrusion data, the size of which was reported to be about 712 MB. This is a considerably small dataset given that they are trying to make a case for a grid mining scenario. The data mining tasks described also appear to be fairly straightforward and do not include a real distributed algorithm requiring extensive communication or synchronization¹². It would be really useful to see how the current system scales with real distributed algorithms (such as clustering algorithms requiring multiple rounds of communication, or distributed association rule mining algorithms) and larger datasets stored at different geographical locations.

¹² A recent work [72] provides a meta-learning example, but a completely implemented system is still under development.

Figure 2.6: The GridMiner Components

2. The Grid Miner: This project aims to integrate grid services, data mining and On-Line Analytical Processing (OLAP) technologies. The GridMiner-Core framework [127, 220] is built on top of the Open Grid Services Infrastructure (OGSI) and uses the services provided by it. On top of this infrastructure the following components are built: (1) GridMiner Information Service (GMIS): it is responsible for collecting and aggregating service data from all available grid services and has query interfaces for resource discovery and monitoring. (2) GridMiner Logging and Bookkeeping Service (GMLB): it collects scheduling information, resource reservations and allocations, logging and error handling. (3) GridMiner Resource Broker (GMRB): it is responsible for workload management and grid resource management. (4) Grid Data Mediation System (GDMS): this [219] is responsible for accessing and manipulating the data repositories by providing an API that is capable of abstracting the process of data access from repositories to higher level knowledge discovery services. This system comprises several components including: (a) GridMiner Service Factory (GMSF), which provides a service creation facility; (b) GridMiner Service Registry (GMSR), which provides a directory facility for the OGSA-DAI services; (c) GridMiner Data Mining Service (GMDMS), which provides the data mining algorithms and tools; (d) GridMiner PreProcessing Service (GMPPS), which encapsulates the functionality that is needed for pre-processing the data; and (e) GridMiner Presentation Service (GMPRS), which provides facilities for visualization of models. (5) Replica Management: this comprises a GridMiner Replica Manager (GMRM) and a GridMiner Replica Catalog (GMRC). (6) GridMiner Orchestration Service (GMOrchS): the orchestration service is an optional component that is capable of aggregating a sequence of data mining operations into a job. It acts as a workflow engine that executes the steps involved in the complete data mining task (either sequentially or in parallel). It provides an easy mechanism for handling long running jobs. Figure 2.6, adapted from [127], illustrates the components of the GridMiner architecture. A prototype application, Distributed Grid-enabled Induction of Decision Trees (DIGIDT) [128, 127], has been developed to run on the GridMiner. It is based on concepts introduced in SPRINT [244] but has a modified data partitioning scheme and workload management strategy. The interested reader is referred to [127, 220] for details of the implementation of the algorithm and the performance results presented therein. It must be noted that this algorithm closely resembles a truly distributed scenario as described in section 2.3, but appears to be capable of handling homogeneously partitioned data only.
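The flavor of distributed decision tree induction over homogeneously (horizontally) partitioned data can be conveyed by the following generic Python sketch - not the DIGIDT or SPRINT algorithm itself - in which each site computes class counts for a candidate split attribute and only these statistics, never the tuples, travel to a coordinator:

from collections import Counter

def local_counts(rows, attribute):
    """Per-site class counts for each value of a candidate split attribute."""
    counts = {}
    for r in rows:
        counts.setdefault(r[attribute], Counter())[r["label"]] += 1
    return counts

def merge_counts(per_site):
    """Coordinator: add up count tables; only statistics cross the network."""
    merged = {}
    for table in per_site:
        for value, ctr in table.items():
            merged.setdefault(value, Counter()).update(ctr)
    return merged

site1 = [{"color": "red", "label": 1}, {"color": "blue", "label": 0}]
site2 = [{"color": "red", "label": 1}, {"color": "red", "label": 0}]
stats = merge_counts([local_counts(s, "color") for s in (site1, site2)])
# stats now supports computing gini index or entropy for the split centrally.
print(stats)

The coordinator can evaluate every candidate split from the merged counts, which is what makes such schemes communication-efficient for homogeneous partitions.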

3. Discovery Net: The Discovery Net (DNET) [1, 264] project aims to build a platform for scientific discovery from data collected by high throughput devices. The infrastructure is being used by scientists from three different application domains including Life Sciences, Environmental Monitoring and Geo-hazard Modelling. The DNET architecture develops High Throughput Sensing (HTS) applications by using the Kensington Discovery Platform on top of the Globus services. The knowledge discovery process is based on a workflow model. Services provided are treated as black boxes with known input and output and are then strung together into a sequence of operations. The architecture allows users to construct their own workflows and integrate data and analysis services. A unique feature is the Discovery Process Markup Language (DPML) [143], which is an XML based representation of the workflows. Processes created by DPML are re-usable and can be shared as a new service on the Grid by other scientists. Abstraction of workflows and encapsulating them as new services can be achieved by the Discovery Net Deployment Tool. A workflow warehouse [143] acts as a repository of user workflows and allows querying and meta-analysis of workflows. Another interesting feature is the InfoGrid [200] infrastructure, which allows dynamic access and integration of various data sets in the workflows. Interfaces to SQL databases, OGSA-DAI sources, Oracle databases and custom designed wrappers are built to enable data integration. The interested reader is referred to [1, 264, 200, 143] for detailed descriptions of the architecture, components and applications of DNET.

4. TeraGrid: The TeraGrid project aims to provide a "CyberInfrastructure" [258, 122] by making use of resources available at four main sites - the San Diego Supercomputer Center (SDSC), Argonne National Laboratory (ANL), Caltech and the National Center for Supercomputing Applications (NCSA).

The architecture of the TeraGrid project makes use of existing Grid software technologies and builds a "virtual" system that is comprised of independent resources at different sites. It consists of two different layers - the basic software components (Grid services) and the application services (TeraGrid Application Services) implemented using these components. The objective is to build a knowledge grid [26] throughout the science and engineering community. For example, the Biomedical Informatics Research Group has been developed to allow researchers at geographically different locations to share and access brain image data and extract useful patterns and models from them. This enables TeraGrid to act as a knowledge grid in the biomedical informatics domain. A knowledge grid, thus conceived, is "the convergence of a comprehensive computational infrastructure along with scientific data collections and applications for routinely supporting the synthesis of knowledge from that data" [26]. In September 2004, the deployment of TeraGrid was completed, enabling access to 40 teraflops of computing power, 2 petabytes of storage, specialized data analysis and visualization schemes and high speed network access.

5. Algorithm Development and Mining (ADaM): The ADaM toolkit [233, 238], conceptualized on NASA's Information Power Grid, has been developed by the Information Technology and Systems Center (ITSC) at the University of Alabama in Huntsville. It consists of over 100 data mining and image processing components and is primarily designed for scientific and remote sensing data. The ADaM toolkit has been grid enabled by making use of the Globus and Condor-G frameworks. Several projects including Modeling Environment for Atmospheric Discovery (MEAD) [229] and Linked Environments for Atmospheric Discovery (LEAD) [239] have made use of the grid enabled toolkit for data mining operations including classification, clustering, association rule mining, optimization, image processing and segmentation, shape detection and filtering schemes. In MEAD the goal is to develop a cyber infrastructure for storm and hurricane research, allowing users to configure, model and mine simulated data, perform retrospective analysis of meteorological phenomena, and visualize large models.

6. DataCutter: The DataCutter project enables processing of scientific datasets stored in archival storage systems across a wide-area network. It has a core set of services on top of which application developers can build services on a need basis. The main design objective is to enable range queries and custom defined aggregations and transformations on distributed subsets of data. The system is modular in nature and contains client components that interact with clients and obtain multi-dimensional range queries from them. The data access services enable low level I/O support and provide access to archival storage systems. The indexing module allows hierarchical multidimensional indexing on datasets, including R-trees and their variants and other sophisticated spatial indexing schemes. The purpose of the filtering module is to provide an effective way of subsetting and data aggregation. The DataCutter project, however, does not provide support for extensive distributed data mining facilities. It also does not support distributed stream based applications.

Grid Enabled WEKA

Research has also been done to Grid enable WEKA [273], a popular Java based machine learning toolkit. Some of the projects working with this objective include Weka4WS [274, 86], GridWeka [231, 114], the Federated Analysis Environment for Heterogeneous Intelligent Mining (FAEHIM) [94] and WekaG [184]. We briefly summarize the contribution of each of these projects.

1. Weka4WS: The goal of this project is to support the execution of data mining algorithms on remote Grid nodes by exposing the Weka library as web services using the Web Services Resource Framework (WSRF)¹³. The architecture of Weka4WS comprises three kinds of nodes - storage nodes, which contain the datasets to be mined; compute nodes, on which the data mining algorithms are run; and user nodes, which are the local machines of users. Local data mining tasks at a grid node are computed using the Weka library resident at that particular node, while remote computations are routed through the user nodes. The compute nodes contain web services compliant with WSRF and are therefore capable of exposing the data mining algorithms implemented in the Weka library as a service. GridFTP servers are executed on each storage node to allow data transfer. While the architecture for grid enabling Weka used here is interesting, it appears to have some disadvantages. First, the authors have limited themselves to the use of Weka data mining algorithms only. Thus they are unable to use a truly distributed data mining algorithm such as those described in section 2.3, and the framework appears to be running centralized data mining algorithms at different grid nodes. It is also unclear how these algorithms adapt in the case of a heterogeneous data partitioning scheme as described in section 2.2.2. Second, the use of a GridFTP server on the storage nodes is also restrictive, since it does not allow complete flexibility in the type of data resources used¹⁴.

2. GridWeka: This is ongoing work at the University of Dublin, which aims to distribute data mining Weka algorithms (in particular Weka classifiers) over computers in an ad-hoc Grid. A client-server architecture is proposed such that all machines that are part of the Weka Grid have to implement the Weka Server. The client is responsible for accepting a learning task and input data, distributing the task of learning, load balancing, monitoring fault tolerance of the system and crash recovery mechanisms. Tasks that can be done using GridWeka include building a classifier on a remote machine, labelling a dataset using a previously built classifier, cross validation and testing. Needless to say, the client-server architecture is not ideally suited for the Grid, and there are no service oriented schemes in place yet for making use of the basic grid features such as security, resource management, etc.

3. FAEHIM: The Federated Analysis Environment for Heterogeneous Intelligent Mining (FAEHIM) project [8] aims to provide a data mining toolkit using web services and the Triana problem solving environment [255, 260].

¹⁴ For example, it is unclear how unstructured data is handled by the GridFTP scheme.

The primary data mining activities supported include classification, clustering, association rule mining and visualization modules. The basic functionality is derived from the Weka library and converted into a set of web services. Thus the toolkit consists of a set of data mining services, tools to interact with the services, and a workflow management system to assemble the services and tools. This project appears to be very similar to the Weka4WS project mentioned above.

4. WekaG: The WekaG toolkit aims at adapting the Weka toolkit for the Grid using a client-server architecture. The server side implements the data mining algorithms, while the WekaG client is responsible for the creation of instances of grid services and acts as an interface to users. The authors describe WekaG as an implementation of a more general architecture called the Data Mining Grid Architecture (DMGA), which is geared towards coupling data sources and provides facilities for authorization, data discovery based on metadata, and planning and scheduling of resources. A prototype for this toolkit has been developed using the Apriori algorithm integrated into the Globus Toolkit 3. Future work for the project is aimed at the development of other data mining algorithms for the Grid and compatibility with the WSRF technology.

Next generation grid based systems are moving towards P2P and Semantic Grids [218, 254, 253, 138, 44]. However, we will leave this area virtually untouched in the proposal, considering that work in this area is still in the nascent stages. Another interesting direction of research is Privacy Preserving Data Mining on Grids. While there is a need to solve problems in this arena, little [21] work has been done.

The next section explores architectures and applications of data streams on the grid.

Grid Data Streams

Application areas like e-science (AstroGrid [15], GridPP [113]), e-health (Telemedicine [256], Mobile Medical Monitoring [194]) and e-business (INWA [139]) produce distributed streaming data. This data needs to be analyzed, and one way to ensure that researchers have easy access to streams is to port them to the grid [230]. In this way, individuals do not need to set up mobile devices or expensive equipment such as telescopes and satellites, but can access interesting data streams published on the Grid. In the Equator Medical Devices Project [50], for example, the authors have adapted the Globus Grid Toolkit (GT3) to support remote medical monitoring applications on the Grid. Two different medical devices - the monitoring jacket and the blood glucose monitor - are made available on the grid as services. Data miners can access the data for the purpose of knowledge discovery and pattern recognition without having to go through the trouble of setting up an environment for collecting the data, or even owning the equipment themselves. Several other advantages of porting data streams to the grid include sharing of data on-the-fly, easy storage of large streams and reduction in network traffic.

In recent years, different architectures have been proposed for porting data streams to the grid [259, 221, 223, 222, 25, 61, 176]. We briefly review related architectures and applications, making note of the fact that none of these architectures incorporates distributed data mining facilities on grid data streams.

Figure 2.7: The Virtual Stream Store

Plale et al. ([222, 221]) propose a model for bringing data streams to the grid, based on the ability of stream systems to act as a data resource. They argue that it is possible to treat each stream source as a grid service, but that the approach may not scale to the entire range of data stream generation devices, from large hadron colliders in physics experiments to tiny motes in sensor networks. Thus an architecture for porting streams to the grid must cater to the needs of different types of data stream generation devices. The model proposed is based on three main assumptions: (1) data streams can be aggregated; (2) they can be accessed through database operations and query languages; (3) it is possible to access streams as grid services. The main motivation for proposing such a model comes from the fact that data streams can be viewed as indefinite sequences of time-sequenced events and can be treated as a data resource like a database. This enables the querying of global snapshots of streams and the development of a virtual stream store as the architectural basis of stream resources. A virtual stream store is defined as follows: "...collection of distributed, domain-related data streams and set of computational resources capable of executing queries on the streams. The virtual stream store is accessed through a grid service. The grid service provides query access to the data streams for clients." [221] The concept of the virtual stream store is illustrated in Figure 2.7, adapted from [221]. It comprises nine data streams and computational resources. The computational resources are located very close to the streams (indicated by S in the Figure), but they could also act as stream generators. In general, it is not necessary for the generators to be a part of the virtual stream store. The model can act like a database system for data stream stores, having access to modified SQL-type query languages.

This architecture has been integrated into the dQUOB [223] and Calder [201, 277, 202] systems. While it is a first step towards the integration of data streams into the grid, it has some drawbacks. It appears that this architecture does not consider heterogeneity of stream data, and it is unclear how continuous queries can deal with heterogeneous streaming data¹⁵.

The Grid-based AdapTive Execution on Streams (GATES) project [61] aims to design and develop middleware for processing distributed data streams. The system is built using the Open Grid Services Architecture (OGSA) and GT3. It offers a high-level interface that allows the users to specify the algorithm(s) and the steps involved in processing data streams without being concerned with resource discovery mechanisms, or with scheduling or allocating grid-based services. Hence the system is "self-resource-discovering". The authors also refer to the system as "self-adapting", since a high degree of accuracy is obtained in analyzing data streams by tweaking certain parameters such as sampling rates, summary structures or algorithms. The goal of the self adaptation algorithm is to provide real time stream processing while keeping the analysis as precise as possible. This is obtained by maintaining a queuing network model of the system. It appears to be very close to the dQUOB system [223], although their scheme has capabilities for resource discovery and adaptation in distributed environments.

In StreamGlobe [230], the authors propose the processing and sharing of data streams on Grid-based P2P infrastructures. The motivation is derived from an astrophysical e-science application. The key features of the system include: (1) publishing data and retrieving information by interactively registering peers, data streams, and subscriptions; (2) sharing of existing data streams in the network, thereby providing optimization and routing facilities; (3) network traffic management capabilities by preventing overloading. This is a relatively new project, and future work in the area is geared towards providing support for subscriptions with multiple input data streams and joins.

Benford et al. [25] describe their experiences in monitoring life processes in frozen lakes in the Antarctic. They deploy remote monitoring devices on lakes of interest, which send data to base stations over satellite phone networks. Integration of the sensing devices into a grid infrastructure as services enables archiving of sensor measurements. The complete system consists of several components including: (1) the Antarctic sensing device deployed on the icy surface; (2) a satellite telephony network to a base computer where the raw data is pre-processed; (3) an OGSA compliant web service that makes the sensing device and its data available on the Grid; (4) the data archived in a Grid accessible data repository; (5) the data analysis and visualization components of interest to the Antarctic scientist. While hurdles like erroneous remote sensor readings and software and hardware failures due to extreme weather conditions are still being sorted out, this system provides a proof of concept of data analysis on streams ported to the Grid.

This section emphasizes the idea that much interest has gone into developing architectures for supporting streams on the Grid. While the architectures themselves are in a nascent stage, even less research has been done to develop data mining algorithms on grid data streams.
However, the need for knowledge discovery mechanisms is inevitable.

Keeping this in mind, we explore the relatively new area of distributed data mining in the next section.

2.3 Distributed Data Mining

2.3.1 Introduction

A primary motivation for Distributed Data Mining (DDM), discussed in the literature and in this proposal, is that a lot of data is inherently distributed. Merging remote data at a central site to perform data mining results in unnecessary communication overhead and algorithmic complexities. As pointed out in [226], "Building a monolithic database, in order to perform non-distributed data mining, may be infeasible or simply impossible" (pg 4). For example, consider the NASA Earth Observing System Data and Information System (EOSDIS), which manages data from earth science research satellites and field measurement programs. It provides data archiving, distribution, and information management services, and holds more than 1450 datasets that are stored and managed at many sites throughout the United States. It manages extraordinary rates and volumes of scientific data: the Terra spacecraft alone produces 194 gigabytes (GB) per day, with data downlink at 150 Megabits/sec. A centralized data mining system may not be adequate in such a dynamic, distributed environment. Indeed, the resources required to transfer and merge the data at a centralized site may become implausible at such a rapid rate of data arrival. Data mining techniques that minimize communication between sites are therefore quite valuable.

Simply put, DDM is data mining where the data and computation are spread over many independent sites. For some applications, the distributed setting is more natural than the centralized one because the data is inherently distributed. Typically, in a DDM environment, each site has its own data source on which data mining algorithms operate, producing local models. Each local model represents knowledge learned from the local data source, but could lack globally meaningful knowledge. Thus the sites need to communicate by message passing over a network in order to keep track of the global information. For example, a DDM environment could have sites representing independent organizations whose operation and data collection have nothing to do with each other and which communicate over the Internet. Typically, communication is a bottleneck. Since communication is assumed to be carried out exclusively by message passing, a primary goal of many DDM methods in the literature is to minimize the number of messages sent. Some methods also attempt to load-balance across sites, to prevent performance from being dominated by the time and space usage of any individual site.

In the following sections we briefly review DDM algorithms for classification and clustering. In a subsequent section we also give an overview of stream data mining, which has been receiving increasingly more attention in the last ten years.

Since the focus of this proposal is DDM on the grid infrastructure, with emphasis on clustering, classification, and data streams, we will leave many areas of DDM virtually untouched. For different perspectives on this exciting field, the reader is referred to [154], [156], [214], [278], and [279].

2.3.2 Classification

Distributed classification is closely related to ensemble-based classifier learning [213]. Ensemble-based classifiers work by generating a collection of base models and combining their outputs using some pre-specified scheme. Typically, voting (weighted or unweighted) schemes are employed to combine the outputs of the base classifiers. A large volume of research reports that ensemble classifier models often perform better than any of the base classifiers used [79, 207, 23, 190, 166].

Two popular ensemble models of this kind are Bagging [34] and Boosting [103, 104, 240]. Both Bagging and Boosting build multiple classifiers from different subsets of the original training data set. However, they take substantially different approaches to sampling subsets and combining classifications. In Bagging, each training set is constructed by taking a bootstrap replicate of the original training set. This means that, given a training set T that contains m tuples, the new training set is constructed by uniformly sampling (with replacement) from T. Classifiers are built on the new training sets and the results are aggregated by majority voting. In Boosting, a set of weights is maintained over the original training set T, and the weights are adjusted after classifiers are learned using a learning algorithm. The adjustment procedure increases the weight of tuples that are mis-classified by the base learning algorithm and decreases the weight of those that are correctly classified. There are two different ways in which the weights can be used to form the new training set: boosting by sampling (tuples are drawn with replacement from the training set with probability proportional to their weight) and boosting by weighting (the learning algorithm takes a weighted training set directly).
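As a concrete illustration of the bagging procedure just described, the following minimal sketch builds decision trees on bootstrap replicates and aggregates their outputs by unweighted majority vote. It uses scikit-learn's DecisionTreeClassifier as the base learner purely for convenience; any classifier could be substituted.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_trees=25, seed=0):
        rng = np.random.default_rng(seed)
        m = len(X)
        trees = []
        for _ in range(n_trees):
            # Bootstrap replicate: sample m tuples uniformly with replacement.
            idx = rng.integers(0, m, size=m)
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def bagging_predict(trees, X):
        # Unweighted majority vote over the base classifiers' outputs
        # (assumes non-negative integer class labels).
        votes = np.stack([t.predict(X) for t in trees])
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)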

Other ensemble-based approaches include Stacking [276, 95], Random Forests [35], and, in more recent work, Rotation Forests [232].

Stacking [276, 95] learns from pre-partitioned training sets $T_i$ and validation sets $V_i$. Given a set of learning algorithms $L = \{L_1, L_2, \ldots, L_k\}$, it builds a classification model in two stages. During the first stage, a set of classifiers $C_1, C_2, \ldots, C_k$ is learned from $T_i$. In the second stage, each instance $(x, y) \in V_i$ is classified by the classifiers learned in the previous stage, which results in a meta-level tuple $(C_1(x), C_2(x), \ldots, C_k(x), y)$. By applying this process repeatedly for all the $V_i$-s, a new training set is formed, which is then used to create the so-called meta-classifier. Classification of an unseen data instance $x$ is also made in two stages: a meta-level testing instance $(C_1(x), C_2(x), \ldots, C_k(x))$ is first formed, which is then passed to the meta-classifier for the final classification.

In Random Forests, a forest of classification trees is grown as follows: (1) If the training set has $T$ tuples, sample $T$ cases at random (with replacement) from the original data. (2) If there are $M$ attributes, a number $m \ll M$ is specified such that at each node, $m$ variables are selected at random out of the $M$, and the best split on these $m$ is used to split the node. The value of $m$ is held constant during forest growing. (3) Each tree is grown to the largest extent possible; there is no pruning. The error rate of the Random Forest depends on the correlation between any two trees in the forest and on the strength of each individual tree in the forest. While Random Forests have been shown to have good classification accuracy, the major disadvantage is that the size (total number of nodes in the trees of the forest) of the model built can be very large.

A more recent method, Rotation Forest [232], builds classifier ensembles based on feature selection. In order to create the training data set for a base classifier, the attribute space is randomly split into K (a parameter of the algorithm) subsets and Principal Component Analysis (PCA) is applied to each subset, retaining all the principal components. There are K axis rotations creating a new feature space for the base classifiers. The objective of this method is to increase the individual accuracy of the classifiers and the diversity within the ensemble. Experimental results reported in the work claim that Rotation Forest can provide better accuracy than Random Forest, Bagging, and Boosting.

All of the above ensemble learning schemes can be directly adapted for distributed classification. The individual sites can produce the base models from local data, and ensemble-based aggregation schemes [164, 171, 172, 217] can be used for producing the final result.

Figure 2.8: The Meta-Learning Framework

Homogeneously Distributed Classifiers: A slightly different ensemble learning technique for homogeneously distributed data is the meta-learning framework [55, 53, 54, 58, 59, 56, 57]. It follows three steps: (1) Build the base classifiers using learning algorithms on the local data at each site. (2) Collect and combine the base classifiers at a central site; produce the meta-level data by using a validation set and the predictions of the base classifiers on this set. (3) Build the meta-level classifier from the meta-level data. Figure 2.8 illustrates the process.
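A minimal sketch of these three steps, assuming homogeneously partitioned data held as a list of (X, y) site partitions. The combining strategy shown is a learned combiner (a logistic regression over base-classifier outputs), one of several options discussed below; the function names are illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    def meta_learn(site_data, X_val, y_val):
        # Step 1: build a base classifier from the local data at each site.
        base = [DecisionTreeClassifier().fit(X, y) for X, y in site_data]
        # Step 2: form meta-level data from base predictions on a validation set.
        meta_X = np.column_stack([clf.predict(X_val) for clf in base])
        # Step 3: build the meta-level classifier (the "combiner" strategy).
        meta = LogisticRegression().fit(meta_X, y_val)
        return base, meta

    def meta_predict(base, meta, X):
        # Classify by first forming the meta-level instance, then combining.
        return meta.predict(np.column_stack([clf.predict(X) for clf in base]))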

Different strategies for combining multiple predictions from classifiers in the meta-learning phase can be adopted. These include:

1. Voting: Each classifier is assigned a single vote and the majority wins. A variation of this process is Weighted Voting, where some classifiers are given preferential treatment depending on their performance on some common validation set.

2. Arbitration: The arbiter acts like a "judge" whose predictions are selected if the participating classifiers themselves cannot reach a consensus. Thus the arbiter is itself a classifier which chooses a final outcome based on the predictions of the other classifiers.

3. Combiner: The combiner makes use of knowledge about how the classifiers behave with respect to one another, and thereby enables meta-learning. There are several ways in which the combiner can be learnt. One way is to use the base classifiers and their outputs. Yet another option is to learn the combiner from data comprising training examples, correct classifications, and base classifier outputs.

The meta-learning framework is implemented in a system called Java Agents for Meta-learning (JAM) [247, 248]. In general, meta-learning helps improve performance by executing in parallel, and ensures better predictive capabilities by combining learners with different inductive biases, such as search space, representation schemes, and search heuristics. Other meta-learning based distributed classification schemes include [118] (Distributed Learning and Knowledge Probing) and [111].

Heterogeneously Distributed Classifiers: The problem of learning over heterogeneously partitioned data is inherently difficult, since different sites observe different attributes of the original data set. Traditional ensemble-based approaches generate high-variance local models and are not adept at identifying correlated features distributed over different sites. Hence the problem of learning over heterogeneously partitioned data is challenging.

Park and his colleagues [210] have addressed the problem of learning from heterogeneous sites using an evolutionary approach. Their work first identifies a portion of the data that none of the local classifiers can learn with a great deal of confidence. This subset of the data is merged at the central site and another new classifier is built from it. When a new instance cannot be classified with high confidence by a combination of the local classifiers, the central classifier is used. The approach produces better results than simple aggregation of local models. However, the algorithm is sensitive to the confidence threshold selected.

An algorithm to construct decision trees over heterogeneously partitioned data has been proposed by Giannella et al. [107]. The algorithm is designed using random-projection based dot product estimation and a message sharing strategy. Their work assumes that each site has the same number of tuples, ordered to facilitate matching, i.e., the $i$-th tuple on each site corresponds to the same observation and carries the same class label. The aim is to construct a decision tree using attributes from all the sites. The problem boils down to estimating the information gain offered by the attributes when making splitting decisions. To reduce communication, the information gain estimation is approximated using a random-projection based approach. It must be noted that the decision tree obtained from the distributed framework may not be identical to the one obtained if all the data were centralized. However, increasing the number of messages exchanged can make the distributed tree arbitrarily close to the centralized tree. Also, the distributed algorithm requires more local computation than the centralized algorithm. Thus the overall benefit of the algorithm is based on a trade-off: increased local computation for reduced communication. The work does not take into account actual communication delays in the network in the distributed setting, and thus could benefit from a more detailed timing study of the centralized and distributed settings.
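The random-projection trick at the core of this kind of estimator can be sketched in a few lines: two sites multiply their attribute columns by the same random matrix (generated from a shared seed) and exchange only the k-dimensional projections, whose scaled inner product is an unbiased estimate of the original dot product. This is a generic sketch, not Giannella et al.'s code; the dimensions and seed are illustrative.

    import numpy as np

    def project(x, k, seed):
        # Both sites build the same k x n random matrix from a shared seed.
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((k, len(x)))
        return R @ x

    def estimated_dot(x_proj, y_proj):
        # E[(Rx) . (Ry)] = k * (x . y), so only k numbers cross the network.
        return x_proj @ y_proj / len(x_proj)

    x = np.random.rand(10_000)          # attribute column at site A
    y = np.random.rand(10_000)          # attribute column at site B
    k = 500
    est = estimated_dot(project(x, k, seed=42), project(y, k, seed=42))
    print(est, x @ y)                   # estimate vs. exact value

In the decision-tree setting, dot products of this kind feed the cross-site statistics needed for the information-gain estimates.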

The above problem of constructing decision trees over heterogeneously partitioned data has also been addressed in work by Caragea et al. [48, 47, 49]. However, their work focuses on producing an exact distributed learning algorithm, whose output matches a decision tree constructed on centralized data. Thus it is fundamentally different from the inexact, random-projection based approach described in [107].

An order-statistics based approach to combining classifiers generated from heterogeneous sites has been presented in [261]. The technique works by ordering the predictions of the classifiers and provides mechanisms for selecting an appropriate order statistic and forming a linear combination of the ordered predictions. The work provides an analytical framework for quantifying the reduction in error obtained when an order-statistics based ensemble is used. The experimental results suggest that when there is significant variability among the classifiers, order-statistics based approaches perform better than ordinary combiners.

Collective Data Mining: The framework of Collective Data Mining (CDM) has been proposed by Kargupta and his colleagues [150]. It has its roots in the theory of communications, machine learning, statistics, and distributed databases. The main objective of CDM is to ensure that the partial models produced from local data at the sites are correct and can be used as building blocks for forming the global model. The steps involved are: (1) Choose an appropriate orthonormal representation for the type of data model to be built. (2) Construct approximate orthonormal basis coefficients at each local site. (3) If required, transfer a sample of the datasets from each local site to a coordinator site and generate approximate basis coefficients corresponding to the non-linear cross terms. (4) Combine the local models and output the final model using a user-specified representation format (such as decision trees).

A major contribution is the use of the CDM approach to construct decision trees from data through Fourier analysis. It has been pointed out that there are several different techniques to do this. One possibility is to use the Non-uniform Fourier Transformation (NFT) [36]. Computation of the Fourier coefficients requires all the members of the domain to be available. However, in most learning frameworks, a training set is used to learn the model, which is then tested on a validation (test) set. In such a framework, estimating the Fourier coefficients is not easy. The Non-uniform Fourier Transformation provides a potential solution.
If it is assumed that the class label is zero for all members of the domain that are not in the learning set, then the Fourier spectrum exactly represents the data.

This is called the NFT, and it can be shown that one can exactly construct the decision tree from the NFT of the data. However, this approach has potential drawbacks. It does not guarantee a polynomial-size description or exponentially decaying magnitudes of the coefficients. Also, communicating the NFT of the data may require substantial overhead.

Another approach is to estimate the Fourier spectrum of the tree directly from the data, instead of the NFT of the data. This method has several advantages, including: (1) Decision trees with bounded depth are generally useful for data mining. (2) The Fourier representation of bounded-depth (say $d$) decision trees has a polynomial number of non-zero coefficients; coefficients corresponding to partitions involving more than $d$ features are zero. (3) If the number of defining features determines the order of a partition, then the magnitude of the Fourier coefficients decays exponentially with the order of the corresponding partition. The existence of these properties guarantees that if there is a straightforward way to estimate the coefficients themselves, then decision trees can be built over distributed data using very little communication. An iterative approach to modelling the error obtained in approximating the Fourier coefficients obtained from non-local sites has been proposed; a more detailed analysis of this work is presented in [211]. The Fourier representation of decision trees and the procedure to reconstruct trees from the Fourier spectrum have been studied in much detail [209, 211, 212, 149, 152, 151].

Removing Redundancies from Ensembles: Existing ensemble-learning techniques work by combining (usually via a linear combination) the outputs of the base classifiers. They do not structurally combine the classifiers themselves. As a result, the base classifiers often share a lot of redundancies. The Fourier representation of decision trees, referred to in the discussion above, offers a unique way to fundamentally aggregate the trees and perform further analysis to construct an efficient representation. The work on Orthogonal Decision Trees [153, 89, 119] focuses on this issue.

Consider a matrix $D$ where $D_{i,j} = f_j(x_i)$ is the output of the tree $f_j$ for input $x_i \in \Omega$. $D$ is an $|\Omega| \times k$ matrix, where $|\Omega|$ is the size of the input domain and $k$ is the total number of trees in the ensemble. An ensemble classifier that combines the outputs of the base classifiers can be viewed as a function defined over the set of all rows in $D$. If $D_{\cdot,j}$ denotes the $j$-th column of $D$, then the ensemble classifier can be viewed as a function of $D_{\cdot,1}, D_{\cdot,2}, \ldots, D_{\cdot,k}$. When the ensemble classifier is a linear combination of the outputs of the base classifiers, we have $F = a_1 D_{\cdot,1} + a_2 D_{\cdot,2} + \cdots + a_k D_{\cdot,k}$, where $F$ is the column matrix of the overall ensemble output. Since the base classifiers may have redundancy, it is possible to construct a compact low-dimensional representation of the matrix $D$. However, explicit construction and manipulation of the matrix $D$ is difficult, since most practical applications deal with a very large domain. We can try to construct an approximation of $D$ using only the available training data. One such approximation of $D$ and its Principal Component Analysis-based projection is reported elsewhere [190]. That technique performs PCA of the matrix $D$, projects the data into the representation defined by the eigenvectors of the covariance matrix of $D$, and then performs linear regression for computing the coefficients $a_1, a_2, \ldots, a_k$. While the approach is interesting, it has a serious limitation.
First of all, the construction of an approximation of $D$, even over just the training data, is computationally prohibitive for most large-scale data mining applications.

Moreover, this is only an approximation, since the matrix is computed over the observed data set rather than the entire domain. In recent work [153, 89, 119], a novel way to perform a PCA of the matrix containing the Fourier spectra of the trees has been reported. The approach works without explicitly generating the matrix $D$. It is important to note that the PCA-based regression scheme [190] offers a way to find the weighting for the members of the ensemble; it does not offer any way to aggregate the tree structures themselves and construct a new representation of the ensemble, which the current approach does.

Now consider a matrix $W$ where $W_{i,j} = w_j(\lambda_i)$, i.e., the coefficient corresponding to the $i$-th member of the partition set $\Lambda$ from the spectrum of the tree $f_j$. It can be shown that the covariance matrices of $D$ and $W$ are identical [119]. Note that $W$ is a $|\Lambda| \times k$ dimensional matrix, and for most practical cases $|\Lambda| \ll |\Omega|$. Therefore, analyzing $W$ using techniques like PCA is significantly easier. PCA of the covariance matrix of $W$ produces a set of eigenvectors $v_1, v_2, \ldots, v_k$. The eigenvalue decomposition constructs a new representation of the underlying domain. Since the eigenvectors are nothing but linear combinations of the original column vectors of $W$, each of them also forms a Fourier spectrum, and we can reconstruct a decision tree from each such spectrum. Moreover, since the eigenvectors are orthogonal to each other, the trees constructed from them also maintain the orthogonality condition. The analysis presented above offers a way to construct the Fourier spectra of a set of functions that are orthogonal to each other and therefore redundancy-free. These functions also define a basis and can be used to represent any given decision tree in the ensemble in the form of a linear combination. Orthogonal decision trees can be defined as an immediate extension of this framework. We present the theoretical definitions and experimental results on the performance of Orthogonal Decision Trees (ODTs) in Section 3.2.
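A minimal sketch of this eigen-analysis step, assuming the coefficient matrix W is already available (random stand-in data below): each principal component is a linear combination of the columns of W and hence itself a Fourier spectrum, from which an orthogonal tree could be rebuilt. The tree-reconstruction step and the exact normalization used in the cited work are omitted.

    import numpy as np

    def orthogonal_spectra(W):
        """W: |Lambda| x k matrix; column j holds the Fourier coefficients
        of tree j over a common partition set Lambda."""
        Wc = W - W.mean(axis=1, keepdims=True)     # subtract the mean spectrum
        cov = Wc.T @ Wc / (W.shape[0] - 1)         # k x k covariance across trees
        vals, vecs = np.linalg.eigh(cov)
        order = np.argsort(vals)[::-1]
        # Each output column is a linear combination of the original spectra,
        # hence itself a Fourier spectrum (of an orthogonal tree).
        return W @ vecs[:, order], vals[order]

    W = np.random.rand(64, 10)    # stand-in: 10 trees, 64 retained coefficients
    spectra, energies = orthogonal_spectra(W)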

In this section we discussed several algorithms and techniques for distributed classification. The following section introduces distributed clustering.

2.3.3 Clustering

Distributed clustering algorithms can be broadly divided into two categories: (1) methods requiring multiple rounds of message passing, and (2) centralized ensemble methods [38, 141]. Algorithms in the first category require a significant amount of synchronization. The second category consists of methods that build local clustering models and transmit them (asynchronously) to a central site, which forms a combined global model. These methods require only a single round of message passing and hence have modest synchronization requirements. In the next two subsections we discuss these issues in some detail.

Multiple Communication Round Algorithms

Kargupta et al. [157] develop a principal component analysis (PCA) based clustering technique within the CDM framework for heterogeneously distributed data. Each local site performs PCA, projects the local data along the principal components, and applies a known clustering algorithm. Having obtained these local clusters, each site sends a small set of representative data points to a central site. This site carries out PCA on the collected data and computes global principal components. The global principal components are sent back to the local sites. Each site projects its data along the global principal components and applies its clustering algorithm. A description of the locally constructed clusters is then sent to the central site, which combines the cluster descriptions using different techniques, such as nearest neighbor methods.

Klusch et al. [160] consider kernel-density based clustering over homogeneously distributed data. They adopt the definition of a density-based cluster from [126]: data points which can be connected by an uphill path to a local maximum, with respect to the kernel density function over the whole dataset, are deemed to be in the same cluster. Their algorithm does not find a clustering of the entire dataset. Instead, each local site finds a clustering of its local data based on the kernel density function computed over all the data. In principle, their approach could be extended to produce a global clustering by transmitting the local clusterings to a central site and combining them. However, carrying out this extension in a communication-efficient manner is a non-trivial task and is not discussed by Klusch et al.

Eisenhardt et al. [90] develop a distributed method for document clustering. They extend k-means with a "probe and echo" mechanism for updating cluster centroids. Each synchronization round corresponds to a k-means iteration, and each site carries out the following algorithm in each iteration. One site initiates the process by marking itself as engaged and sending a probe message to all its neighbors. The message also contains the cluster centroids currently maintained at the initiator site. The first time a node receives a probe (from a neighbor site p with centroids C), it marks itself as engaged, sends a probe message (along with C) to all its neighbors (except the origin of the probe), and updates the centroids in C using its local data, while also computing a weight for each centroid based on the number of data points associated with it. If a site receives an echo from a neighbor p (with centroids C and weights W), it merges C and W with its current centroids and weights. Once a site has received either a probe or an echo from all of its neighbors, it sends an echo, along with its local centroids and weights, to the neighbor from which it received its first probe. When the initiator has received echoes from all its neighbors, it has centroids and weights which take into account the datasets at all sites, and the iteration terminates.
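The centroid-merging step at the heart of the probe-and-echo scheme can be sketched as a weighted, dimension-wise average; message handling and network topology are omitted, and the centroid values below are illustrative.

    import numpy as np

    def merge_centroids(C_local, w_local, C_remote, w_remote):
        """Weighted merge of two centroid sets of equal size k.
        Weights are the number of data points each centroid summarizes."""
        w = w_local + w_remote
        C = (C_local * w_local[:, None] + C_remote * w_remote[:, None]) / w[:, None]
        return C, w

    C1 = np.array([[0.0, 0.0], [4.0, 4.0]]); w1 = np.array([10.0, 30.0])
    C2 = np.array([[1.0, 1.0], [5.0, 3.0]]); w2 = np.array([20.0, 10.0])
    print(merge_centroids(C1, w1, C2, w2))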
Dhillon and Modha [78] develop a parallel implementation of the k-means clustering algorithm for homogeneously distributed data. A similar approach is taken by Forman and Zhang [105] to extend it to the problem of k-harmonic means.

The problem of clustering on P2P networks has been addressed in recent work [2, 236]. The algorithm works as follows: each node in the P2P network is provided with a random number generator (the same for all the sites) that produces the same set of initial centroid seeds when the algorithm begins. The points in the local data are first assigned to the nearest centroid. Then the centroids are updated to the dimension-wise means of the assigned points. If there is a drastic change in the centroids (measured by a user-defined parameter), a flag is raised indicating a change in centroids. A particular node N will then poll neighboring nodes for their centroids. The choice of neighborhood is determined in two different ways: uniform sampling and immediate neighborhood. The node N computes the weighted mean of the centroids it receives together with its local centroids to produce the final set of centroids for that iteration.

While this is the first known P2P clustering algorithm, it appears to be an asynchronous algorithm. Moreover, it does not deal with dynamic network topology, as is common in peer-to-peer networks. A further extension of this algorithm, taking into consideration large dynamic networks, has been studied by Datta et al. [236]. It must be noted that while all the algorithms mentioned in this category require multiple rounds of message passing, [157] and [160] require only two rounds; the others require as many rounds as the algorithm iterates.

Centralized Ensemble-Based Methods

These algorithms typically have low synchronization requirements and potentially offer two other nice properties: (1) if the local models are much smaller than the local data, transmitting them results in excellent message load requirements; (2) sharing only the local models may be a reasonable solution to privacy constraints in some situations [188]. A brief survey of the literature is presented below.

Johnson and Kargupta [145] develop a distributed hierarchical clustering algorithm for heterogeneously distributed data. It first generates local cluster models and then combines these into a global model. At each local site, the chosen hierarchical clustering algorithm is applied to generate local dendrograms, which are then transmitted to a central site. Using statistical bounds, a global dendrogram is generated. Samatova et al. [237] develop a method for merging hierarchical clusterings from homogeneously distributed, real-valued data. Lazarevic et al. [167] consider the problem of combining spatial clusterings to produce a global regression-based classifier. They assume homogeneously distributed data and that the clustering produced at each site has the same number of clusters. Each local site computes the convex hull of each cluster and transmits the hulls, along with a regression model for each cluster, to a central site. The central site averages the regression models in overlapping regions of the hulls.

Strehl and Ghosh [250] develop methods for combining cluster ensembles in a centralized setting (they did not explicitly consider distributed data). They argue that the best overall clustering maximizes the average normalized mutual information over all clusters in the ensemble. However, they report that finding a good approximation directly is very time-consuming. Instead, they develop three more efficient algorithms which are not theoretically shown to maximize mutual information, but are empirically shown to do a decent job.

Fred and Jain [102] develop a method for combining clusterings in a centralized setting. Given r clusterings of n data points, their method first constructs an n x n co-association matrix (the same as the one described in [250]). Next, a merge algorithm is applied to the matrix using a single-link, threshold-based hierarchical clustering technique: for each pair (i, j) whose co-association entry is greater than a predefined threshold, the clusters containing these points are merged. In principle, both Strehl and Ghosh's ideas and Fred and Jain's approach can be readily adapted to heterogeneously distributed data. However, for Strehl and Ghosh's ideas to be adapted to a distributed setting, the problem of constructing an accurate centralized representation of the ensemble using few messages needs to be addressed. In order for Fred and Jain's approach to be adapted to a distributed setting, the problem of building an accurate co-association matrix in a message-efficient manner must be addressed.
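The co-association step can be sketched directly: count how often each pair of points is co-clustered across the ensemble, then merge pairs whose co-association exceeds a threshold. A simple union-find stands in for single-link hierarchical clustering here, and the labelings are toy data.

    import numpy as np

    def co_association(labelings):
        # labelings: r x n array, one row per clustering of the same n points.
        r, n = labelings.shape
        M = np.zeros((n, n))
        for lab in labelings:
            M += (lab[:, None] == lab[None, :])
        return M / r

    def merge_by_threshold(M, threshold=0.5):
        # Union-find over pairs whose co-association exceeds the threshold.
        parent = list(range(len(M)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]; i = parent[i]
            return i
        for i in range(len(M)):
            for j in range(i + 1, len(M)):
                if M[i, j] > threshold:
                    parent[find(i)] = find(j)
        return [find(i) for i in range(len(M))]

    labelings = np.array([[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]])
    print(merge_by_threshold(co_association(labelings)))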

Merugu and Ghosh [188] develop a method for combining generative models (a generative model is a weighted sum of multi-dimensional probability density functions, i.e., components) produced from homogeneously distributed data. Each site produces a generative model from its own local data. The goal is for a central site to find a global model, from a predefined family (e.g., multivariate, 10-component Gaussian mixtures), which minimizes the average Kullback-Leibler distance over all local models. They prove this to be equivalent to finding a model from the family which minimizes the KL distance from the mean model over all local models (the point-wise average of all local models). They assume that this mean model is computed at some central site. Finally, the central site computes an approximation to the optimal model using an EM-style algorithm along with Markov chain Monte Carlo sampling. They do not discuss how the centralized mean model is computed; but, since the local models are likely to be considerably smaller than the actual data, transmitting the models to a central site seems to be a reasonable approach.

Januzaj et al. [144] extend a density-based centralized clustering algorithm, DBSCAN, by one of the authors, to a homogeneously distributed setting. Each site carries out the DBSCAN algorithm; a compact representation of each local clustering is transmitted to a central site; a global clustering representation is produced from the local representations; and finally this global representation is sent back to each site. A clustering is represented by first choosing a sample of data points from each cluster. The points are chosen such that (i) each point has enough neighbors in its neighborhood (determined by fixed thresholds) and (ii) no two points lie in the same neighborhood. Then k-means clustering is applied to all points in the cluster, using each of the sample points as an initial centroid. The final centroids, along with the distance to the furthest point in their k-means cluster, form the representation (a collection of point-radius pairs). The DBSCAN algorithm is applied at the central site on the union of the local representative points to form the global clustering. This algorithm requires an epsilon parameter defining a neighborhood; the authors set this parameter to the maximum of all the representation radii.

Methods [144], [188], and [237] are representatives of the up-and-coming class of distributed clustering algorithms, centralized ensemble-based methods. These algorithms focus on transmitting compact representations of a local clustering to a central site, which combines them to form a global clustering representation. The key to this class of methods is the local model (clustering) representation: a good one faithfully captures the local clusterings, requires few messages to transmit, and is easy to combine.

Two other techniques solve a closely related but different problem (which they also call "distributed clustering"): they address the problem of forming clusters of distributed datasets, where each cluster is a collection of datasets, not a collection of tuples from datasets. McClean et al. [186] consider clustering a collection of data cubes. Parthasarathy and Ogihara [234] consider clustering homogeneously distributed tables.

Having discussed distributed classification and clustering algorithms, we now focus our attention on distributed data stream mining. The following section introduces the topic.

2.3.4 Distributed Data Stream Mining

Distributed Data Stream Mining (DDSM) is becoming an area of active research ([20, 181, 163, 27, 148]) due to the emergence of geographically distributed stream-oriented systems such as online network intrusion detection applications, sensor networks, vehicle monitoring systems, web click streams, and systems analyzing multimedia data. In these applications, data streams originating from multiple remote sources need to be monitored. A central stream processing system will not provide a good solution, since streaming data rates may exceed the capacity of the storage, communication, and processing infrastructure [20]. Thus there arises a need for distributed data stream mining. (There also exists a significant amount of work on stream data architectures ([22, 177]), query processing ([206, 60, 178, 19]), stream-based programming languages and algebras ([208, 243, 74]), and applications ([235, 204]). The current proposal will not focus on these problems; a reader interested in data streams is referred to [22] for a detailed overview.) Several projects deal with data mining on streams, including [117, 131, 87, 97, 106]. While these are closely related, it is not clear whether all of them can be directly applied in a distributed setting. In this section we provide a brief review of current work in the field of distributed data stream mining.

Babcock and Olston [20] describe a distributed top-k monitoring algorithm, designed to continuously report the k largest values from distributed data streams (top-k monitoring queries). Such queries are particularly useful in tracking atypical behavior, such as distributed denial-of-service attacks, exceptionally large or small values in telephone call records, auction bidding patterns, and web usage statistics. The approach to solving the problem is as follows. The coordinator maintains the top-k set initially. It installs arithmetic constraints at each monitor node over partial data values. As updates occur in the distributed streams, the arithmetic constraints should always remain satisfied. If there is a conflict between the coordinator and the monitor nodes, a conflict resolution scheme is invoked. Thus distributed communication is only needed when the constraints imposed on the system are violated. The main drawback of this scheme seems to be that the procedure of updating and reallocation is not instantaneous, and thus the overall conflict resolution scheme may not happen in real time.

The problem of mining frequent item sets from multiple distributed data streams has been studied by Manjhi et al. [181]. A naive solution to the problem is to combine frequency counts from the distributed nodes. However, as the number of nodes increases, a large number of data structures needs to be stored. The authors suggest a solution based on the precision of the frequency count maintained at each node. They introduce a hierarchical communication structure that maintains an error tolerance for frequency counts at each level. This is referred to as the precision gradient. The setting of the precision gradient is posed as an optimization problem with the objective of (1) minimizing the load on the central node to which answers are delivered, or (2) minimizing the worst-case load on any communication link in the hierarchical structure.
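A sketch of the precision-gradient idea just described: each node prunes item counts that fall below its level's error tolerance before forwarding them up the hierarchy, so the load on the upper links shrinks at the cost of a bounded undercount. The tolerances and the two-leaf topology below are purely illustrative, not Manjhi et al.'s optimized settings.

    from collections import Counter

    def forward_counts(counts, n_seen, epsilon):
        # Keep only items whose frequency exceeds this level's tolerance;
        # the pruned mass is the source of the (bounded) undercount.
        return Counter({x: c for x, c in counts.items() if c >= epsilon * n_seen})

    def aggregate(children_counts, n_seen, epsilon):
        total = Counter()
        for c in children_counts:
            total.update(c)
        return forward_counts(total, n_seen, epsilon)

    # Two leaf nodes with a tighter tolerance than their parent (the gradient).
    leaf1 = forward_counts(Counter(a=50, b=3, c=20), n_seen=100, epsilon=0.05)
    leaf2 = forward_counts(Counter(a=40, c=2, d=30), n_seen=100, epsilon=0.05)
    root = aggregate([leaf1, leaf2], n_seen=200, epsilon=0.1)
    print(root)   # items that remain frequent at the root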

Kotecha et al. [163] address the problem of distributed classification of multiple targets in a wireless sensor network. They cast it as a hypothesis testing problem. The major concern in multi-target classification is that as the number of targets increases, the number of hypotheses increases exponentially. The authors propose to re-partition the hypothesis space to reduce this exponential complexity.

Ghoting and Parthasarathy [11] present algorithms for mining distributed streams with interactive response times. Their work performs a Directed Acyclic Graph (DAG) based decomposition of queries over distributed streams and makes use of this scheme to perform k-median clustering. They introduce a way to effectively update clustering parameters (such as k) through distributed interactive operator scheduling. A ticket-based scheduling algorithm is presented, along with an optimal distributed operator allocation for interactive data stream processing in a distributed setting. The authors adapt a graph partitioning scheme for stream query decomposition. While this is an interesting approach, it remains to be seen how well it can scale in large real-time systems.

The VEhicle DAta Stream Mining (VEDAS) project [148] is an experimental system for monitoring vehicle data streams in real time. It is one of the very early distributed data mining systems that perform most of the data analysis and knowledge discovery operations on onboard computing devices. The data collected from onboard monitoring devices such as PDAs are subjected to principal component analysis for dimensionality reduction. Since performing PCA in a resource-constrained environment may be expensive, the authors present ways to monitor changes in the covariance matrix, which is useful for incremental PCA and avoids recomputing the entire PCA. The fault detection module of the application handles vehicle health data. It makes use of incremental clusters to represent safe regimes of operation and can automatically flag outliers in new vehicle data. The paper also provides mechanisms for drunk-driver detection, which can be viewed as locating deviations from normal or characteristic behavior.
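The covariance-tracking idea used in VEDAS (watching how the covariance matrix drifts so that the full PCA need not be recomputed for every window) can be sketched with streaming updates of the sufficient statistics. The drift test and threshold below are illustrative, not the VEDAS implementation.

    import numpy as np

    class CovTracker:
        """Streaming sufficient statistics for an incremental covariance."""
        def __init__(self, dim):
            self.n = 0
            self.s = np.zeros(dim)            # running sum of x
            self.ss = np.zeros((dim, dim))    # running sum of x x^T

        def update(self, x):
            self.n += 1
            self.s += x
            self.ss += np.outer(x, x)

        def cov(self):
            mu = self.s / self.n
            return self.ss / self.n - np.outer(mu, mu)

    def drifted(old_cov, new_cov, tol=0.1):
        # Re-run PCA only when the covariance has moved enough (Frobenius norm).
        return (np.linalg.norm(new_cov - old_cov, "fro")
                > tol * np.linalg.norm(old_cov, "fro"))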

In this section, we discussed several algorithms and applications for distributed stream mining. We argue that since data repositories on the grid are heterogeneous in nature, and can be static or streaming, distributed data mining can play an important role in extracting patterns from repositories on the grid. In the next section we analyze the challenges for distributed data mining on the grid.

2.4 The Challenges

A Data Grid can be thought of as a distributed system having the following characteristics:

1. It comprises several resources (computers, sensors, etc.) storing data repositories (relational and XML databases, flat files, etc.).

2. The resources do not share a common memory or a clock.

3. They can communicate with one another by exchanging messages over a communication network.

4. Each resource has its own memory and can perform limited or extensive data-intensive tasks.

5. The data repositories owned and controlled by a resource are said to be local to it, while repositories owned by other machines are considered remote.

6. Accessing remote resources in the network is more expensive than accessing local resources, since it incurs communication delays and CPU overhead to process communication protocols.

7. The resources are capable of forming virtual organizations amongst themselves. Members of a virtual organization are allowed to share data under local policies which specify what is shared, who is allowed to share, and the conditions for sharing. Sharing amongst disparate virtual organizations is allowed, although the policies for sharing could be guided by different rules. Thus, a Data Grid can be conceived such that there is either (1) a hierarchy amongst virtual organizations, or (2) complete de-centralization amongst virtual organizations ([253, 254]).

Given the characteristics of the Data Grid, let us examine what a service-oriented architecture for distributed data mining on the grid requires.

1. Distributed Data Integration: The purpose of this is to integrate heterogeneous data repositories. Schema integration is a difficult problem, given that the data repositories contain different types of data (relational, unstructured, data streams) and different attributes, the indices are not all aligned, and the criteria required to integrate them may be complex. We illustrate with examples from astronomy.

Example 1 (1) Consider the catalogs P and Q shown in Table 2.2. For the sake of discussion, we assume Catalog P has X and A attributes and Catalog Q has X and B attributes. We further assume that X is the join attribute.

    Tuple ID  Join Attribute (X)  A        Tuple ID  Join Attribute (X)  B
    p1        x1                  a1       q1        x2                  b1
    p2        x2                  a2       q2        x1                  b2
    p3        x3                  a3       q3        x1                  b3
                                           q4        x4                  b4

Table 2.2: Catalog P (Left) and Catalog Q (Right).

(2) Table 2.1 illustrates one possibility for aligning the catalogs. Notice that the join value x1 pairs a1 with both b2 and b3; thus x1 appears twice in the matched catalog. Also, if either Catalog P or Q has a join value that the other does not have, that tuple will not show up in the matched catalog.

    Join Attribute (X)  A   B
    x1                  a1  b2
    x1                  a1  b3
    x2                  a2  b1

Table 2.1: Matched Catalog P and Q.

Example 2 Consider that Catalog A stores Cartesian coordinates $(x_1, y_1, z_1)$ and Catalog B stores coordinates $(x_2, y_2, z_2)$, both representing the spatial positions of astronomical objects (stars, galaxies, etc.).

It is required to perform a join between the two catalogs based on a probabilistic calculation that minimizes the parameter $\chi^2$ in the following equation:

$\chi^2 = \sum_k \alpha_k \left[ (x - x_k)^2 + (y - y_k)^2 + (z - z_k)^2 \right] - \lambda \left( x^2 + y^2 + z^2 - 1 \right)$   (2.1)

Note that $\alpha_k$ is a weighting parameter calculated from the astrometric precision of the survey, and $\lambda$ is the Lagrange multiplier in the minimization, introduced to ensure that $(x, y, z)$ is a unit vector. The coordinates from Catalogs A and B for which the value of $\chi^2$ is minimized are chosen to be the cross-matched vertices and form the integrated table.
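A sketch of this computation: for a candidate pairing of positions (unit vectors on the sky), the constrained minimization of Equation 2.1 has the closed-form solution that the best direction is proportional to the weighted sum of the catalog positions, and the residual value scores the pairing. The weights and coordinates below are illustrative stand-ins, not survey values.

    import numpy as np

    def best_match_direction(positions, weights):
        """positions: m x 3 unit vectors of the same candidate object seen in
        m catalogs; weights: astrometric-precision weights (illustrative)."""
        v = (weights[:, None] * positions).sum(axis=0)
        v /= np.linalg.norm(v)               # Lagrange constraint: unit vector
        chi2 = (weights * ((positions - v) ** 2).sum(axis=1)).sum()
        return v, chi2

    def radec_to_unit(ra_deg, dec_deg):
        ra, dec = np.radians(ra_deg), np.radians(dec_deg)
        return np.array([np.cos(dec) * np.cos(ra),
                         np.cos(dec) * np.sin(ra),
                         np.sin(dec)])

    a = radec_to_unit(150.001, 2.200)        # object as seen in catalog A
    b = radec_to_unit(150.002, 2.201)        # candidate counterpart in catalog B
    v, chi2 = best_match_direction(np.stack([a, b]), np.array([1.0e8, 5.0e7]))

The candidate pairing with the smallest residual is accepted as the cross-match and contributes a row to the integrated table.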

Complex data integration on the grid may also require manipulation of indexing schemes (e.g., spatial indices such as B-trees or R-trees) and integration of XML data sources. Very little work can be found in the literature that specifically deals with schema integration on grids. The OGSA-DAI and OGSA-DQP [205] projects deal with service-based architectures for Data Access and Integration and Distributed Query Processing; however, these projects do not explicitly address schema integration issues. Carmela Comito and Domenico Talia propose the Grid Data Integration System (GDIS) [70] for integrating heterogeneous XML data sources. This is a de-centralized, service-based integration architecture that handles semantic heterogeneity over data repositories. Alexander Wohrer et al. [6, 5] propose a Grid Data Mediation Service, which serves as a data integration system for the GridMiner framework (described in section 2.2.4). The main purpose of the mediation service is to present distributed, heterogeneous data sources as one virtual data source on the grid, using a flexible mapping scheme. It supports both structured and unstructured data. The InfoGrid [200] infrastructure of the Discovery Net project offers another interesting approach to data integration. However, none of the previously mentioned approaches deals with the integration or updating of the indexing schemes inherently present in the data repositories. Many scientific databases, such as huge astronomy catalogs, typically depend upon spatial indexing mechanisms (B-trees, R-trees and their variants, or the Hierarchical Triangular Mesh (HTM)). Attempts to integrate data repositories should also consider incorporating these indexing schemes in the virtual tables created.

2. Distributed Data Preprocessing: Data normalization and dealing with missing values are some of the basic data pre-processing operations that need to be performed on the repositories. In order to enable uniformity in the policies chosen across all repositories, some amount of communication needs to take place amongst the resources. Services designed to handle distributed data preprocessing will be particularly useful.

3. Distributed Data Mining: The DDM algorithms implemented on the grid infrastructure are expected to have the following features:

(a) Scalability: The Data Grid was envisioned to support data on the order of tera- or petabytes. Centralization of data is generally not an option. Thus DDM algorithms for the grid should be designed such that they have low communication cost and are independent of the resources present in the grid.

(b) Decentralized: The algorithms should be able to run in the absence of a central coordinator.

(c) Locality Sensitive: The DDM algorithms on the grid must communicate with members of the virtual organization. This necessitates that they be local algorithms, communicating only within a certain neighborhood. However, if the Data Grid is organized in such a manner that all the resources belong to a single virtual organization, then this restriction can be relaxed; in this particular case, the distributed algorithms can afford to be global algorithms having complete knowledge of the network.

(d) Asynchronous: Since data repositories are of different sizes and have different schemas and query mechanisms, distributed data mining operations on the resources may take varying amounts of time. Also, the communication (bandwidth) delay in the network is unpredictable. Thus synchronized algorithms may not suit the requirements of Data Grids.

(e) Fault Tolerance: Process failures and communication link failures on the grid necessitate that the DDM algorithms be fault tolerant.

(f) Privacy and Security: The algorithms should honor the privacy of the individual resources in a Data Grid.

4. Distributed Workflow Composition: In order to enable distributed schema and data integration, execution of algorithms, coordination of partial results, and visualization and graphical representation of DDM results, it is important to be able to successfully perform the composition of grid-based services. Workflow composition has thus been an area of active research [52, 82, 81, 80, 198, 169, 142]. There are several well-known web services flow specification languages, such as BPEL4WS and the Web Services Choreography Interface (WSCI).

Related workflow execution engines include IBM's Business Process Execution Language for Web Services Java Run Time (BPWS4J), the Collaxa BPEL4WS Server, and the Self-Serv environment for web service composition [24].

Triana [66, 179, 180] is a workflow-based Problem Solving Environment (PSE) that has been used by scientists for a range of tasks, including signal, text, and image processing. It aims to provide seamless access to distributed services, connect heterogeneous grids, and abstract the core capabilities needed for service-based computing (in P2P, grid computing, or web services). Some of the interesting features of the Triana framework include: (1) easy GUI-based composition of web services; (2) distributed execution of composite web services; (3) facilities for sensitivity analysis; and (4) support for annotating workflows and maintaining provenance information. The web service composition comprises several mechanisms, including: (1) service discovery, which allows the location of relevant services (querying the UDDI or importing a WSDL document are two possibilities); (2) service composition using a GUI; (3) transparent execution methods and distribution of workflows across P2P or grid frameworks; and (4) transparent publishing of services. A data mining toolkit that enables web service composition has been developed by Ali, Rana and Taylor [7]. It is built on top of WEKA and uses the Triana workflow environment. The toolkit is able to handle classification, clustering, and association rule mining.

Although current technologies provide the basic foundations for web service composition, there are still many open research problems. We list some of these issues, noting that these challenges need to be addressed before a composite-service based distributed data mining toolkit can be built on the grid:

(a) Service composition still requires a significant amount of low-level programming on the part of developers and users. Thus it requires a lot of overhead for development, testing, and maintenance.

(b) The number of services to be composed can be large and dynamic, and the services may require significant communication among themselves.

(c) Very few [24, 66] of the existing technologies deal with distributed workflows.

In this section, we outlined the challenges in the development of a distributed data mining system on a grid infrastructure. The following chapter examines preliminary work.

Chapter 3

Preliminary Work

3.1 Introduction

The primary purpose of this proposal is to motivate research in distributed data mining on a grid infrastructure. While substantial work is being done in the fields of distributed data mining (see section 2.3) and grid mining (see section 2.2.4) separately, there seems to have been very little effort in bringing the two technologies together. In the astronomy community, for example, several sky surveys (such as SDSS [242], 2MASS [3], and POSS [224]) are coming online as part of the National Virtual Observatory [203]. An effort is being made to bring the NVO to the grid, using the TeraGrid framework. While mining across several surveys is becoming an appealing idea, architectures for distributed data mining on the grid are still evolving.

Consider the problem of classification of galaxies and stars across multiple sky surveys. To exploit information from different catalogs, there needs to be a mapping between an object in one catalog and an object in another (a cross-match), thus creating a virtual matched table. However, the different resolutions of the surveys can cause two close objects in one survey to appear as a single object in another. This problem can be addressed by probabilistic associations between objects. Once a cross-match has been performed, a single classifier or an ensemble may be used for building the model. A related problem is that of learning, in large NASA astrophysics mission data streams, to automatically discover merging or colliding galaxies. In order to achieve this, architectures need to be developed for incorporating data streams into the grid, and algorithms for distributed data stream mining have to be built.

In this section we first discuss Orthogonal Decision Trees (introduced in section 2.3.2) as a method for removing redundancies in ensemble classifiers. We briefly review the Fourier representation of decision tree ensembles introduced elsewhere [151, 211, 212, 209], provide a technique to construct redundancy-free Orthogonal Decision Trees (ODTs) based on eigen-analysis of the ensemble, and offer experimental results to document the performance of ODTs on grounds of accuracy and model complexity.

Next, we examine the application of ODTs to a data stream scenario (we use a physiological health monitoring data stream for illustration). Finally, we examine the feasibility of distributed data mining on federated astronomy catalogs. We describe the framework of the National Virtual Observatory, propose an architecture for the Distributed Exploration of Massive Astronomy Catalogs (DEMAC) system, and discuss its integration with the grid.

3.2 Orthogonal Decision Trees

Decision tree [227] ensembles are frequently used in data mining and machine learning applications. Boosting [103, 88], Bagging [34], Stacking [276], and Random Forests [35] are some of the well-known ensemble-learning techniques. Many of these techniques produce large ensembles that combine the outputs of a large number of trees to produce the overall output. Ensemble-based classification and outlier detection techniques are also frequently used in mining continuous data streams [96, 249].

Large ensembles pose several problems to a data miner. They are difficult to understand, and the overall functional structure of the ensemble is not very actionable, since it is difficult to manually combine the physical meaning of the different trees in order to produce a simplified set of rules that can be used in practice. Moreover, in many time-critical applications, such as monitoring data streams in resource-constrained environments [151], maintaining a large ensemble and using it for continuous monitoring are computationally challenging. So it would be useful if we could develop a technique to construct a redundancy-free, meaningful, compact representation of large ensembles. This section presents a technique to construct redundancy-free decision-tree ensembles by constructing orthogonal decision trees. The technique first constructs an algebraic representation of the trees using multivariate discrete Fourier bases. The new representation is then used for eigen-analysis of the covariance matrix generated by the decision trees in their Fourier representation. The proposed approach then converts the corresponding principal components to decision trees. These trees are defined in the original attribute space and are functionally orthogonal to each other. The orthogonal trees are in turn used for accurate (in many cases, with improved accuracy) and redundancy-free (in the sense of an orthogonal basis set) compact representation of large ensembles.

The main motivation behind this approach is to create an algebraic framework for meta-level analysis of the models produced by ensemble learning, data stream mining, distributed data mining, and other related techniques. Most existing techniques treat discrete model structures such as the decision trees in an ensemble primarily as black boxes: only the outputs of the models are considered and combined in order to produce the overall output. Fourier bases offer a compact representation of a discrete structure that allows algebraic manipulation of decision trees. For example, we can literally add two different trees, produce a weighted average of the trees themselves, or perform eigen-analysis of an ensemble of trees. The Fourier representation of decision trees may offer something philosophically similar to what the spectral representation of graphs [65] offers: an algebraic representation that allows deep analysis of discrete structures.

The Fourier representation allows us to bring in the rich volume of well-understood techniques from linear algebra and linear systems theory. This opens up many exciting possibilities for future research, such as quantifying the stability of an ensemble classifier, or mining and monitoring mission-critical data streams using properties of the eigenvalues of the ensemble. The following section reviews the Fourier representation of decision trees.

3.2.1 Decision Trees and the Fourier Representation

This section reviews the Fourier representation of decision tree ensembles, introduced elsewhere [149, 152].

Decision Trees as Numeric Functions

The approach described in this section makes use of a linear algebraic representation of the trees. In order to do that, we first need to convert each tree into a numeric tree, in case the attributes are symbolic. A decision tree defined over a domain of categorical attributes can be treated as a numeric function. First note that a decision tree is a function that maps its domain members to a range of class labels. Sometimes it is a symbolic function, where attributes take symbolic (non-numeric) values. However, a symbolic function can easily be converted to a numeric function by simply replacing the symbols with numeric values in a consistent manner. Since the proposed approach for constructing orthogonal trees uses this representation only as an intermediate stage, and the physical tree is eventually converted back, the exact scheme for replacing the symbols (if any) does not matter as long as it is consistent. Once the tree is converted to a discrete numeric function, we can also apply any appropriate analytical transformation as necessary. Fourier transformation is one such interesting possibility. The Fourier representation of a function is a linear combination of the Fourier basis functions. The weights, called Fourier coefficients, completely define the representation. Each coefficient is associated with a Fourier basis function that depends on a certain subset of the features defining the domain.

A Brief Review of the Fourier Basis in the Boolean Domain

Fourier bases are orthogonal functions that can be used to represent any discrete function; in other words, they form a functionally complete representation. Consider the set of all $\ell$-dimensional feature vectors, where the $m$-th feature can take $\lambda_m$ different categorical values. The Fourier basis set that spans this space comprises $\prod_{m=1}^{\ell} \lambda_m$ basis functions. Each Fourier basis function is defined as

$\psi_{\mathbf{j}}^{\boldsymbol{\lambda}}(\mathbf{x}) = \frac{1}{\sqrt{\prod_{m=1}^{\ell} \lambda_m}} \prod_{m=1}^{\ell} \exp\left(\frac{2 \pi i\, x_m j_m}{\lambda_m}\right),$

where $\mathbf{x}$ and $\mathbf{j}$ are vectors of length $\ell$; $x_m$ and $j_m$ are the $m$-th attribute values in $\mathbf{x}$ and $\mathbf{j}$, respectively; $x_m, j_m \in \{0, 1, \ldots, \lambda_m - 1\}$; and $\boldsymbol{\lambda} = (\lambda_1, \lambda_2, \ldots, \lambda_\ell)$ is the feature-cardinality vector. $\psi_{\mathbf{j}}^{\boldsymbol{\lambda}}(\mathbf{x})$ is called the $\mathbf{j}$-th basis function. The vector $\mathbf{j}$ is called a partition, and the order of a partition is the number of non-zero feature values it contains. A Fourier basis function depends on some $x_m$ only when the corresponding $j_m \neq 0$. If a partition $\mathbf{j}$ has exactly $\alpha$ non-zero values, then we say the partition is of order $\alpha$, since the corresponding Fourier basis function depends only on the $\alpha$ variables that take non-zero values in the partition.
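In the Boolean case (all $\lambda_m = 2$) the basis reduces, up to normalization, to $\psi_{\mathbf{j}}(\mathbf{x}) = (-1)^{\mathbf{x} \cdot \mathbf{j}}$, and the full spectrum of a small function can be computed by brute force. A throwaway sketch, using one common normalization convention (the coefficients are averaged over the domain):

    import numpy as np
    from itertools import product

    def boolean_spectrum(f, n):
        """All 2^n Fourier coefficients of f: {0,1}^n -> R, using the
        convention w_j = 2**-n * sum_x f(x) * (-1)**(x . j)."""
        domain = list(product((0, 1), repeat=n))
        return {j: sum(f(x) * (-1) ** np.dot(x, j) for x in domain) / 2 ** n
                for j in domain}

    # A depth-2 "decision tree": f(x) = x1 AND x2. Coefficients of partitions
    # involving the unused feature x3 come out exactly zero, illustrating the
    # bounded-depth property discussed below.
    tree = lambda x: x[0] & x[1]
    for j, w in boolean_spectrum(tree, 3).items():
        print(j, round(w, 3))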

A function $f$ that maps an $\ell$-dimensional discrete domain to a real-valued range can be represented using the Fourier basis functions:

$f(\mathbf{x}) = \sum_{\mathbf{j}} w_{\mathbf{j}}\, \psi_{\mathbf{j}}^{\boldsymbol{\lambda}}(\mathbf{x}),$

where $w_{\mathbf{j}}$ is the Fourier coefficient (FC) corresponding to the partition $\mathbf{j}$, and $\overline{\psi_{\mathbf{j}}^{\boldsymbol{\lambda}}(\mathbf{x})}$ is the complex conjugate of $\psi_{\mathbf{j}}^{\boldsymbol{\lambda}}(\mathbf{x})$:

$w_{\mathbf{j}} = \sum_{\mathbf{x}} f(\mathbf{x})\, \overline{\psi_{\mathbf{j}}^{\boldsymbol{\lambda}}(\mathbf{x})}.$

The Fourier coefficient $w_{\mathbf{j}}$ can be viewed as the relative contribution of the partition $\mathbf{j}$ to the function value of $f(\mathbf{x})$. Therefore, the absolute value of $w_{\mathbf{j}}$ can be used as the significance of the corresponding partition: if the magnitude of some $w_{\mathbf{j}}$ is very small compared to the other coefficients, we may consider the $\mathbf{j}$-th partition insignificant and neglect its contribution. The order of a Fourier coefficient is simply the order of the corresponding partition. We shall often use terms like "high order" or "low order" coefficients to refer to sets of Fourier coefficients whose orders are relatively large or small, respectively. The energy of a spectrum is defined by the summation $\sum_{\mathbf{j}} w_{\mathbf{j}}^2$.

Let us also define the inner product between two spectra $\mathbf{w}^{(1)}$ and $\mathbf{w}^{(2)}$, where each $\mathbf{w}^{(i)}$ is the column matrix of all Fourier coefficients in an arbitrary but fixed order (the superscript $T$ denotes the transpose operation, and the dimension of this vector is the total number of coefficients in the spectrum):

$\langle \mathbf{w}^{(1)}, \mathbf{w}^{(2)} \rangle = \sum_{\mathbf{j}} w^{(1)}_{\mathbf{j}} w^{(2)}_{\mathbf{j}}.$

We will also use the inner product between a pair of real-valued functions defined over some domain $\Omega$:

$\langle f_1(\mathbf{x}), f_2(\mathbf{x}) \rangle = \sum_{\mathbf{x} \in \Omega} f_1(\mathbf{x}) f_2(\mathbf{x}).$

The following section considers the Fourier spectrum of decision trees and discusses some of its useful properties.

Properties of Decision Trees in the Fourier Domain

For almost all practical purposes, decision trees have bounded depths. This section will therefore consider decision trees of finite depth bounded by some constant. The underlying functions in such decision trees are computable by a constant-depth Boolean AND/OR circuit (or, equivalently, an AC^0 circuit). Linial et al. [174] noted that the Fourier spectrum of an AC^0 circuit has very interesting properties, and proved the following lemma.

Lemma 1 (Linial, 1993) Let $M$ and $d$ be the size and depth of an AC^0 circuit. Then

$\sum_{\mathbf{j}:\, o(\mathbf{j}) > k} w_{\mathbf{j}}^2 \;\leq\; 2 M\, 2^{-k^{1/d}/20},$

where $o(\mathbf{j})$ denotes the order (the number of non-zero variables) of partition $\mathbf{j}$ and $k$ is a non-negative integer. The term on the left-hand side of the inequality represents the energy of the spectrum captured by the coefficients of order greater than the given constant $k$.

energy of the spectrum captured by the coefficients with order greater than a given constant $k$. The lemma essentially states the following properties about decision trees:

1. High order Fourier coefficients are small in magnitude.

2. The energy preserved in all high order Fourier coefficients is also small.

The key aspect of these properties is that the energy of the higher order Fourier coefficients decays exponentially. This observation suggests that the spectrum of a Boolean decision tree (or, equivalently, of a bounded depth function) can be approximated by computing only a small number of low order Fourier coefficients. So the Fourier basis offers an efficient numeric representation of a decision tree in terms of an algebraic function that can be easily stored and manipulated. The exponential decay property of the Fourier spectrum also holds for non-Boolean decision trees; the complete proof is given in the appendix.

There are two additional important characteristics of the Fourier spectrum of a decision tree that we will use in this section:

1. The Fourier spectrum of a decision tree can be efficiently computed [151].

2. The Fourier spectrum can be directly used for constructing the tree.

In other words, we can go back and forth between the tree and its spectrum. This is philosophically similar to switching between the time and frequency domains in the traditional application of Fourier analysis for signal processing. These two issues will be discussed in detail later in this proposal. However, before that we would like to make a note of one additional property: Fourier transformation of decision trees preserves the inner product. The functional behavior of a decision tree is defined by the class labels it assigns. Therefore, if $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{|X|}$ are the members of the domain $X$, then the functional behavior of a decision tree $f(\mathbf{x})$ can be captured by the vector $F = [f(\mathbf{x}_1), f(\mathbf{x}_2), \dots, f(\mathbf{x}_{|X|})]^T$, where the superscript $T$ denotes the transpose operation. The following lemma proves that the inner product between two such vectors is identical to the inner product between their respective Fourier spectra.

Lemma 2 Given two functions $f_1(\mathbf{x}) = \sum_{\mathbf{j}} w^{(1)}_{\mathbf{j}} \psi_{\mathbf{j}}(\mathbf{x})$ and $f_2(\mathbf{x}) = \sum_{\mathbf{j}} w^{(2)}_{\mathbf{j}} \psi_{\mathbf{j}}(\mathbf{x})$ in Fourier representation, $\langle f_1(\mathbf{x}), f_2(\mathbf{x}) \rangle = \langle \mathbf{w}^{(1)}, \mathbf{w}^{(2)} \rangle$.

Proof:

$$\langle f_1(\mathbf{x}), f_2(\mathbf{x}) \rangle = \sum_{\mathbf{x} \in X} f_1(\mathbf{x})\, \overline{f_2(\mathbf{x})} = \sum_{\mathbf{x} \in X} \Big( \sum_{\mathbf{j}} w^{(1)}_{\mathbf{j}} \psi_{\mathbf{j}}(\mathbf{x}) \Big) \Big( \sum_{\mathbf{k}} \overline{w^{(2)}_{\mathbf{k}} \psi_{\mathbf{k}}(\mathbf{x})} \Big) = \sum_{\mathbf{j}} \sum_{\mathbf{k}} w^{(1)}_{\mathbf{j}}\, \overline{w^{(2)}_{\mathbf{k}}} \sum_{\mathbf{x} \in X} \psi_{\mathbf{j}}(\mathbf{x})\, \overline{\psi_{\mathbf{k}}(\mathbf{x})} = \sum_{\mathbf{j}} w^{(1)}_{\mathbf{j}}\, \overline{w^{(2)}_{\mathbf{j}}} = \langle \mathbf{w}^{(1)}, \mathbf{w}^{(2)} \rangle.$$

The fourth step is true since the Fourier basis functions are orthonormal.
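To make these definitions concrete, the following minimal Python sketch (our own illustration, not part of the proposal; the function names are ours) evaluates the basis over a small discrete domain, computes a spectrum by brute force, and numerically checks the inner-product preservation of Lemma 2:

import numpy as np
from itertools import product

def psi(x, j, lam):
    # Fourier basis function over a discrete domain with cardinalities lam
    x, j, lam = map(np.asarray, (x, j, lam))
    return np.prod(np.exp(2j * np.pi * x * j / lam) / np.sqrt(lam))

def spectrum(f, lam):
    # Brute-force Fourier coefficients: w_j = sum_x f(x) * conj(psi_j(x))
    domain = list(product(*[range(c) for c in lam]))
    return {j: sum(f(x) * np.conj(psi(x, j, lam)) for x in domain)
            for j in product(*[range(c) for c in lam])}

# Check Lemma 2 on a tiny Boolean domain (lambda_m = 2 for every feature).
lam = (2, 2, 2)
f1 = lambda x: float(x[0] and x[1])   # behaves like a depth-2 decision tree
f2 = lambda x: float(x[0])
w1, w2 = spectrum(f1, lam), spectrum(f2, lam)
lhs = sum(f1(x) * f2(x) for x in product(range(2), repeat=3))
rhs = sum(w1[j] * np.conj(w2[j]) for j in w1)
assert np.isclose(lhs, rhs.real)      # <f1, f2> equals the spectra inner product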

Figure 3.1: A Boolean decision tree.

Computing the Fourier Transform of a Decision Tree

The Fourier spectrum of a given tree can be computed efficiently by traversing the tree. This section first reviews an algorithm to do that. It then discusses aggregation of the multiple spectra computed from the base classifiers of an ensemble, and extends the technique to deal with non-Boolean class labels.

Kushilevitz and Mansour [165] considered the issue of learning the low order Fourier spectrum of a target function (represented by a Boolean decision tree) from a data set with uniformly distributed observations. Note that the current contribution is fundamentally different from their goal. We do not try to learn the spectrum directly from the data; rather, we consider the problem of computing the spectrum from the decision tree generated from the data.

Schema Representation of a Decision Path

For the sake of simplicity, let us consider a Boolean decision tree as shown in Figure 3.1. The Boolean class labels correspond to positive and negative instances of the concept class. We can express a Boolean decision tree as a function $f: \{0,1\}^\ell \to \{0,1\}$. The function maps positive and negative instances to one and zero, respectively. A node in a tree is labeled with a feature $x_m$, and a downward link from the node is labeled with an attribute value of the $m$-th feature. The path from the root node to a successor node represents the subset of data that satisfies the different feature values labeled along the path. These subsets of the domain are essentially similarity-based equivalence classes, and we shall call them schemata (schema in singular form). If $h$ is a schema, then $h \in \{0, 1, *\}^\ell$, where $*$ denotes a wildcard that matches any value of the corresponding feature. For example, a path in Figure 3.1 that fixes $x_1 = 0$ and $x_2 = 1$ represents the schema $01{*}$, since all members of the data subset at the final node of this path take feature values 0 and 1 for $x_1$ and $x_2$, respectively. We often use the term order to denote the number of non-wildcard values in a schema. The following section describes an algorithm to extract Fourier coefficients from a tree.
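As a small illustration (our own sketch, with Python's None standing in for the wildcard $*$), a schema and its membership test can be written as:

def matches(schema, x):
    # An instance x belongs to a schema if it agrees at every fixed position.
    return all(s is None or s == v for s, v in zip(schema, x))

h = (0, 1, None)                      # the schema 01*
assert matches(h, (0, 1, 0)) and matches(h, (0, 1, 1))
assert not matches(h, (1, 1, 0))      # disagrees at the first fixed position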

Extracting and Calculating Significant Fourier Coefficients from a Tree

Considering a decision tree as a function, the Fourier transform of a decision tree can be defined as

$$w_{\mathbf{j}} = \sum_{\mathbf{x} \in X} f(\mathbf{x})\, \overline{\psi_{\mathbf{j}}(\mathbf{x})} = \sum_{\mathbf{x} \in X_1} f(\mathbf{x})\, \overline{\psi_{\mathbf{j}}(\mathbf{x})} + \cdots + \sum_{\mathbf{x} \in X_L} f(\mathbf{x})\, \overline{\psi_{\mathbf{j}}(\mathbf{x})} = \sum_{l=1}^{L} |X_l|\, \bar{f}(h_l)\, \overline{\psi_{\mathbf{j}}(h_l)}, \tag{3.1}$$

where $X$ denotes the complete instance space, $X_l$ is the instance subspace covered by the $l$-th leaf node, and $h_l$ is the schema defined by the path to that leaf. (Note that any path to a node in a decision tree is essentially a subspace or hyperplane; thus it is a schema.)

Lemma 3 For any non-zero partition $\mathbf{j} \neq \mathbf{0}$, $\sum_{\mathbf{x} \in X} \psi_{\mathbf{j}}(\mathbf{x}) = 0$.

Proof: Since the Fourier basis functions form an orthogonal set and $\psi_{\mathbf{0}}$, the zero-th Fourier basis function, is a non-zero constant for all $\mathbf{x}$, we have for $\mathbf{j} \neq \mathbf{0}$

$$\sum_{\mathbf{x} \in X} \psi_{\mathbf{j}}(\mathbf{x}) \;\propto\; \sum_{\mathbf{x} \in X} \psi_{\mathbf{j}}(\mathbf{x})\, \overline{\psi_{\mathbf{0}}(\mathbf{x})} = 0.$$

Lemma 4 Let $h_l$ be a schema defined by the path to a leaf node $l$, and let $X_l$ be the subset that $h_l$ covers. If $\mathbf{j}$ has a non-zero attribute value at a position where $h_l$ has no value (a wildcard), then $\sum_{\mathbf{x} \in X_l} \psi_{\mathbf{j}}(\mathbf{x}) = 0$; otherwise $\sum_{\mathbf{x} \in X_l} \psi_{\mathbf{j}}(\mathbf{x}) = |X_l|\, \psi_{\mathbf{j}}(h_l)$.

Proof: Write $\mathbf{x} = (\mathbf{x}_{(f)}, \mathbf{x}_{(w)})$, where $\mathbf{x}_{(f)}$ are the features fixed in $h_l$ and $\mathbf{x}_{(w)}$ are the wildcard features. Since all values of $\mathbf{x}_{(f)}$ are fixed in $h_l$, their contribution to $\psi_{\mathbf{j}}(\mathbf{x})$ is constant for all $\mathbf{x} \in X_l$, and $X_l$ forms (multiples of) the complete domain with respect to $\mathbf{x}_{(w)}$. If $\mathbf{j}$ takes a non-zero value at a wildcard position, the sum over that feature's full range of values vanishes by the argument of Lemma 3. Otherwise, for a leaf node $l$,

$$\sum_{\mathbf{x} \in X_l} \psi_{\mathbf{j}}(\mathbf{x}) = \psi_{\mathbf{j}}(h_l) \sum_{\mathbf{x} \in X_l} 1 = |X_l|\, \psi_{\mathbf{j}}(h_l).$$

Lemma 5 For any Fourier coefficient $w_{\mathbf{j}}$ whose order is greater than the depth of every leaf node, $w_{\mathbf{j}} = 0$. In particular, if the order of $w_{\mathbf{j}}$ is greater than the depth of the tree, then $w_{\mathbf{j}} = 0$.

Proof: The proof immediately follows from Lemma 4: if the order of $\mathbf{j}$ exceeds the depth of a leaf, then $\mathbf{j}$ must have a non-zero value at some wildcard position of that leaf's schema, so the leaf's contribution to $w_{\mathbf{j}}$ in Equation 3.1 is zero.
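Equation 3.1 together with Lemmas 3-5 yields a direct computation of a tree's non-zero coefficients: each leaf contributes only to the partitions supported on its fixed positions. The following sketch (ours, continuing the earlier helpers; the flat list-of-leaves representation is a simplifying assumption, not the proposal's data structure) implements this:

import numpy as np
from itertools import product

def psi_schema(h, j, lam):
    # psi_j evaluated on a schema h; wildcard positions must have j_m = 0.
    val = 1.0 + 0j
    for hm, jm, lm in zip(h, j, lam):
        assert hm is not None or jm == 0
        phase = 1.0 if hm is None else np.exp(2j * np.pi * hm * jm / lm)
        val *= phase / np.sqrt(lm)
    return val

def tree_spectrum(leaves, lam):
    # leaves: list of (schema, label) pairs; schema uses None for wildcards.
    # Implements Equation 3.1; by Lemmas 4-5 only partitions supported on the
    # fixed positions of some leaf can have non-zero coefficients.
    coeffs = {}
    for h, label in leaves:
        size = int(np.prod([lm for hm, lm in zip(h, lam) if hm is None]))  # |X_l|
        fixed = [m for m, hm in enumerate(h) if hm is not None]
        for vals in product(*[range(lam[m]) for m in fixed]):
            j = [0] * len(lam)
            for m, v in zip(fixed, vals):
                j[m] = v
            coeffs[tuple(j)] = coeffs.get(tuple(j), 0) + \
                size * label * np.conj(psi_schema(h, tuple(j), lam))
    return coeffs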

Thus, for a FC $w_{\mathbf{j}}$ to be non-zero, there should exist at least one schema $h$ that has non-wildcard attribute values at all the non-zero positions of $\mathbf{j}$. In other words, with every schema $h$ there is an associated set of non-zero FCs. This observation leads us to a direct way of detecting and calculating all non-zero FCs of a decision tree: for each schema $h$ (or path) from the root, we can detect all non-zero FCs by enumerating the FCs associated with $h$.

Before describing the details of the algorithm, let us define some notation. The operator $\oplus$ is defined over a vector $h$ and an attribute-value pair $(k, i)$: if $x_k$ is the $k$-th attribute (or feature), $h \oplus (k, i)$ outputs a new vector obtained by replacing the $k$-th value of $h$ with $i$. For example, for $h = 1{*}{*}$ and the feature $x_1$, $h \oplus (1, 0) = 10{*}$. Here we assume indexing starts from zero. $\oplus$ operates on both schemata and partitions.

Next, let us define a function $\Gamma(x_k)$, which takes $x_k$ as input and outputs a set of $(k, i)$ pairs, where $k$ denotes that $x_k$ is the $k$-th attribute and $i$ is a non-zero value of $x_k$. If $x_k$ has cardinality $\lambda_k$, then $\Gamma(x_k) = \{(k, 1), (k, 2), \dots, (k, \lambda_k - 1)\}$.

Let us also define an operator $\otimes$ over a set of partitions $S$ and $\Gamma(x_k)$. It outputs a new set of partitions by applying $\oplus$ over all possible pairs between the members of $S$ and $\Gamma(x_k)$; this can be considered a Cartesian product of $S$ and $\Gamma(x_k)$. For example, let $S = \{000, 010\}$ and let $x_2$ be the third attribute, with cardinality two (so its only possible non-zero value is 1). Then $\Gamma(x_2) = \{(2, 1)\}$ and $S \otimes \Gamma(x_2) = \{000 \oplus (2, 1),\, 010 \oplus (2, 1)\} = \{001, 011\}$.

Finally, let us consider a non-leaf node $n$ that has $d$ children; in other words, there exist $d$ disjoint subtrees below $n$. If $x_k$ is the feature appearing at $n$, then $avg_n(i)$ denotes the average output value over the domain members covered by the subtree reachable through the $i$-th child of $n$. Note that $avg_n(i)$ is equivalent to the average of the schema $h$ denoting the path (from the root node) to the $i$-th subtree of the node where $x_k$ appears.

The algorithm starts by pre-calculating all the $avg_n(i)$ values (essentially a recursive tree-visit operation). Initially, $S = \{00\cdots0\}$ and the corresponding coefficient $w_{00\cdots0}$ is calculated from the overall average output; in Figure 3.2 this is the mean of the two subtree averages at the root. The algorithm then continues to extract all remaining non-zero FCs in a recursive fashion from the root. If we assume that the tree in Figure 3.2 is built from data with three attributes $x_1$, $x_2$, and $x_3$, then $S \otimes \Gamma(x_1) = \{100\}$, and $w_{100}$ is computed using Equation 3.1 from the subtree averages shown in Figure 3.2. For $x_2$, $S = \{000, 100\}$ and $S \otimes \Gamma(x_2) = \{010, 110\}$; $w_{010}$ and $w_{110}$ are computed similarly to $w_{100}$. The pseudo code of the algorithm is presented in Figure 3.3.

Figure 3.2: An instance of a Boolean decision tree showing the average output values at each subtree.

Fourier Spectrum of an Ensemble Classifier

The Fourier spectrum of an ensemble classifier that consists of multiple decision trees can be computed by aggregating the spectra of the individual base models. Let $f(\mathbf{x})$ be the underlying function computed by a tree-ensemble whose output is a weighted linear combination of the outputs of the base tree-classifiers:

$$f(\mathbf{x}) = a_1 f_1(\mathbf{x}) + a_2 f_2(\mathbf{x}) + \cdots + a_n f_n(\mathbf{x}) = a_1 \sum_{\mathbf{j} \in J_1} w^{(1)}_{\mathbf{j}} \psi_{\mathbf{j}}(\mathbf{x}) + \cdots + a_n \sum_{\mathbf{j} \in J_n} w^{(n)}_{\mathbf{j}} \psi_{\mathbf{j}}(\mathbf{x}),$$

where $f_i(\mathbf{x})$ and $a_i$ are the $i$-th decision tree and its weight, respectively; $J_i$ is the set of partitions with non-zero Fourier coefficients detected in the $i$-th decision tree, and $w^{(i)}_{\mathbf{j}}$ is a Fourier coefficient in that spectrum. The ensemble can therefore be written as $f(\mathbf{x}) = \sum_{\mathbf{j} \in J} w_{\mathbf{j}} \psi_{\mathbf{j}}(\mathbf{x})$, where $w_{\mathbf{j}} = \sum_{i=1}^{n} a_i w^{(i)}_{\mathbf{j}}$ and $J = \bigcup_{i=1}^{n} J_i$.
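Since spectra add linearly, aggregating an ensemble's spectrum is a one-pass merge over the base models' coefficient tables; a minimal sketch (ours, reusing the dict representation from the sketches above):

def aggregate_spectra(spectra, weights):
    # Ensemble spectrum: w_j = sum_i a_i * w_j^(i), over the union of partitions.
    combined = {}
    for a, spec in zip(weights, spectra):
        for j, w in spec.items():
            combined[j] = combined.get(j, 0) + a * w
    return combined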

The following section extends the Fourier spectrum-based approach for representing and aggregating decision trees to domains with multiple class labels.

Fourier Spectrum of Multi-Class Decision Trees

A multi-class decision tree has $k > 2$ different class labels. In general, we can assume that each label is assigned a unique integer value. Since such decision trees are also functions that map an instance vector to a numeric value, the Fourier representation of such a tree is essentially no different. However, the Fourier spectrum cannot be directly applied to represent an ensemble of decision trees that uses voting as its aggregation scheme: the Fourier spectrum faithfully represents functions in closed form, and voting-based ensemble classifiers are not such functions. Therefore, we need a different approach to model multi-class decision trees with the Fourier basis.

Let us consider a decision tree with $k$ classes. Define $W^{(c)}$ to be the Fourier spectrum of the decision tree whose class labels are all set to zero except the $c$-th class; in other words, we treat the tree as having a Boolean classification with respect to the $c$-th class label. If we define $f^{(c)}(\mathbf{x})$ to be the partial function that computes the inverse Fourier transform using $W^{(c)}$, the classification of an input vector $\mathbf{x}$ is written as:

$$f(\mathbf{x}) = c_1 f^{(1)}(\mathbf{x}) + c_2 f^{(2)}(\mathbf{x}) + \cdots + c_k f^{(k)}(\mathbf{x}),$$

where each $c_i$ is the numeric value mapped to the $i$-th class. Note that if $\mathbf{x}$ belongs to the $j$-th class, then $f^{(i)}(\mathbf{x}) = 1$ when $i = j$, and 0 otherwise.

Now let us consider an ensemble of decision trees in weighted linear combination form. Then $f^{(c)}(\mathbf{x})$ can be written as

$$f^{(c)}(\mathbf{x}) = a_1 f_1^{(c)}(\mathbf{x}) + a_2 f_2^{(c)}(\mathbf{x}) + \cdots + a_n f_n^{(c)}(\mathbf{x}),$$

where $a_i$ and $f_i^{(c)}(\mathbf{x})$ represent the weight of the $i$-th tree in the ensemble and its partial function for the $c$-th classification, respectively. Finally, the classification of an ensemble of decision trees that adopts voting as its aggregation scheme can be defined as

$$f(\mathbf{x}) = \arg\max_{c}\, f^{(c)}(\mathbf{x}).$$

1  Function ExtractFS(input: partition set S, node, schema h)
2    x_k ← feature appearing at node
3    T ← S ⊗ Γ(x_k)
4    k ← position of x_k
5    size ← |X_node| / |X|
6    for each partition j ∈ T
7      for each possible value i of attribute x_k
8        h' ← h ⊕ (k, i)
9        w_j ← w_j + (size / λ_k) · avg_node(i) · ψ_j(h')
10     end
11   end
12   S ← S ∪ T
13   for the i-th child node_i of node
14     h' ← h ⊕ (k, i)
15     ExtractFS(S, node_i, h')
16   end
17 end

Figure 3.3: Algorithm for obtaining the Fourier spectrum of a decision tree. λ_k denotes the cardinality of attribute x_k, |X_node| denotes the size of the subspace covered by node, and |X| is the size of the complete instance space. k in line 4 denotes that x_k is the k-th attribute.

In this section, we discussed the Fourier representation of decision trees. We showed that the Fourier spectrum of a decision tree is very compact in size. In particular, we proved that the exponential decay property also holds for the Fourier spectrum of non-Boolean decision trees. In the next section, we describe how the Fourier spectrum of an ensemble can be used to construct a single tree.

Construction of a Decision Tree from the Fourier Spectrum

This section discusses an algorithm to construct a tree from the Fourier spectrum of an ensemble of decision trees. The following section first shows that the information gain needed to choose an attribute at the decision nodes can be efficiently computed from the Fourier coefficients.

Schema Average and Information Gain

Consider a classification problem with Boolean class labels $\{0, 1\}$. Recall that a schema $h$ denotes a path to a node in a decision tree. In order to compute the information gain introduced by splitting the node using a particular attribute, we first need to compute the entropy of the class distribution at that node. We do that by introducing a quantity called the schema average. Let us define the schema average function value as follows:

$$\bar{f}(h) = \frac{1}{|h|} \sum_{\mathbf{x} \in h} f(\mathbf{x}), \tag{3.2}$$

where $f(\mathbf{x})$ is the classification value of $\mathbf{x}$ and $|h|$ denotes the number of members of schema $h$. Note that the schema average $\bar{f}(h)$ is nothing but the frequency of instances of the schema $h$ with classification value 1; similarly, the frequency of the tuples with classification value 0 is $1 - \bar{f}(h)$. It can therefore be used to compute the entropy at the node:

$$\mathrm{confidence}(h) = \max\big(\bar{f}(h),\; 1 - \bar{f}(h)\big),$$

$$\mathrm{entropy}(h) = -\bar{f}(h)\,\log \bar{f}(h) - \big(1 - \bar{f}(h)\big)\,\log\big(1 - \bar{f}(h)\big).$$

The computation of $\bar{f}(h)$ using the above expression for a given ensemble is not practical, since we would need to evaluate $f(\mathbf{x})$ for all $\mathbf{x} \in h$. Instead we can use the following expression, which computes $\bar{f}(h)$ directly from the given Fourier spectrum (FS):

$$\bar{f}(h) = \sum_{j_{q_1} = 0}^{\lambda_{q_1} - 1} \cdots \sum_{j_{q_d} = 0}^{\lambda_{q_d} - 1} w_{\mathbf{j}}\, \psi_{\mathbf{j}}(h), \tag{3.3}$$

where $h$ is a schema with $d$ non-wildcard values at positions $q_1, \dots, q_d$, and $\mathbf{j}$ ranges over the partitions that are zero everywhere except (possibly) at those positions. A similar Walsh analysis-based approach for analyzing the behavior of genetic algorithms can be found elsewhere [110]. Note that the summations in Equation 3.3 are defined only over the fixed (non-wildcard) positions, which correspond to the features defining the path to the node.

Using Equation 3.3 as a tool to obtain information gain, it is relatively straightforward to come up with a version of ID3- or C4.5-like algorithms that work using the Fourier spectrum. However, a naive approach may be computationally inefficient. The computation of $\bar{f}(h)$ requires a number of FCs that is exponential in the order of $h$; thus, the cost involved in computing $\bar{f}(h)$ increases exponentially as the tree becomes deeper. Moreover, since the Fourier spectrum of the ensemble is very compact in size, most Fourier coefficients involved in computing $\bar{f}(h)$ are zero. Therefore, the direct evaluation of $\bar{f}(h)$ using Equation 3.3 is not only inefficient but also involves unnecessary computations.
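A direct (naive) implementation of Equation 3.3 is short; the following sketch (ours, reusing psi_schema from the earlier sketch) sums only over the partitions supported on the fixed positions of h:

import numpy as np

def schema_average(h, coeffs, lam):
    # Schema average from the spectrum (Equation 3.3): only partitions whose
    # non-zero entries fall on the fixed positions of h contribute.
    fixed = {m for m, hm in enumerate(h) if hm is not None}
    total = sum(w * psi_schema(h, j, lam) for j, w in coeffs.items()
                if all(jm == 0 or m in fixed for m, jm in enumerate(j)))
    return float(np.real(total))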

1  Function TCFS(input: Fourier Spectrum FS)
2    initialize candidate feature set CFSET
3    create root node
4    h ← (***...***)
5    root ← Build(h, FS, CFSET)
6    return root
7  end

Figure 3.4: Algorithm for constructing a decision tree from a Fourier spectrum (TCFS).

Construction of a more efficient algorithm to compute $\bar{f}(h)$ is possible by taking advantage of the recursive and decomposable nature of Equation 3.3. When computing the average of an order-$o$ schema $h$, we can reduce some computational steps if any of the order-$(o-1)$ schemata that subsume $h$ has already been evaluated. For a simple example in the Boolean domain, consider the evaluation of $\bar{f}({*}1{*}1)$, and assume that $\bar{f}({*}1{*}{*})$ has been pre-calculated. Then $\bar{f}({*}1{*}1)$ is obtained by simply adding the newly supported terms $w_{0001}\,\psi_{0001}(h)$ and $w_{0101}\,\psi_{0101}(h)$ to $\bar{f}({*}1{*}{*})$. This observation leads us to an efficient algorithm to evaluate schema averages. Recall that the path to a node from the root in a decision tree is represented as a schema. Then, choosing an attribute for the next node is essentially the same as selecting the best schema among those candidate schemata that are subsumed by the current schema and whose orders are just one higher. In the following section, we describe a tree construction algorithm that is based on these observations.

Bottom-up Approach to Construct a Tree

Before describing the algorithm, we need to introduce some notation. Let $h_{k \leftarrow i}$ and $h$ be two schemata, where the order of $h_{k \leftarrow i}$ is one higher than that of $h$: schema $h_{k \leftarrow i}$ is identical to $h$ except at one position, where the $k$-th feature is set to the value $i$. For example, consider the schemata $h = ({*}1{*}{*}2)$ and $h_{3 \leftarrow 1} = ({*}1{*}12)$. Here we assign an integer-based ordering among the features (zero for the leftmost feature). $\mathcal{J}(h)$ denotes the set of partitions required to compute $\bar{f}(h)$ (see Equation 3.3). A $k$-fixed partition is a partition with a non-zero value at the $k$-th position. Let $O_1(k)$ be the set of order-one $k$-fixed partitions, and let $\bar{f}_k(h_{k \leftarrow i})$ be the partial sum of Equation 3.3 for $h_{k \leftarrow i}$ which includes only $k$-fixed partitions. Now the information gain achieved by choosing the $k$-th feature with a given $h$ is redefined using these new notations:

$$\mathrm{Gain}(h, k) = \mathrm{entropy}(h) - \frac{1}{\lambda_k} \sum_{i=0}^{\lambda_k - 1} \mathrm{entropy}(h_{k \leftarrow i}),$$

$$\mathrm{entropy}(h_{k \leftarrow i}) = -\bar{f}(h_{k \leftarrow i})\, \log \bar{f}(h_{k \leftarrow i}) - \big(1 - \bar{f}(h_{k \leftarrow i})\big)\, \log\big(1 - \bar{f}(h_{k \leftarrow i})\big),$$

$$\bar{f}(h_{k \leftarrow i}) = \bar{f}(h) + \bar{f}_k(h_{k \leftarrow i}), \qquad \bar{f}_k(h_{k \leftarrow i}) = \sum_{\mathbf{j} \in \mathcal{J}(h) \otimes \Gamma(x_k)} w_{\mathbf{j}}\, \psi_{\mathbf{j}}(h_{k \leftarrow i}),$$

where $\otimes$ is the Cartesian product defined earlier and $\lambda_k$ is the cardinality of the $k$-th feature.
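With the schema average available, the redefined entropy and gain translate directly into code; a sketch (ours, for the binary-class case, building on schema_average above):

import math

def schema_entropy(h, coeffs, lam):
    # Entropy of the class distribution at the node reached by schema h.
    p = min(max(schema_average(h, coeffs, lam), 1e-12), 1.0 - 1e-12)
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gain(h, k, coeffs, lam):
    # Information gain of splitting the node h on feature k (uniform domain).
    child_entropies = []
    for i in range(lam[k]):
        child = list(h)
        child[k] = i
        child_entropies.append(schema_entropy(tuple(child), coeffs, lam))
    return schema_entropy(h, coeffs, lam) - sum(child_entropies) / lam[k]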

1  Function Build(input: schema h, Fourier Spectrum FS, candidate feature set CFSET)
2    create root node
3    odr ← order(h) + 1
4    Marked ← ∅
5    for each Fourier coefficient w_j of order at most odr in FS
6      ft ← intersect(h, j, CFSET)
7      if ft is not null
8        for each possible value i of ft
9          update f̄_ft(h_{ft←i}) with w_j
10       end
11       add w_j to Marked
12     end
13   end
14   if Marked is ∅
15     set the label of root using the average of h
16     return root
17   end
18   for each feature k in CFSET
19     gain_k ← Gain(h, k)
20   end
21   remove the feature k with the maximum gain from CFSET
22   root ← k
23   FS ← FS − Marked
24   for each possible branch i of k
25     h_{k←i} ← update h with the k-th feature set to i
26     branch_i ← Build(h_{k←i}, FS, CFSET)
27   end
28   add k back into CFSET
29   add Marked back into FS
30   return root
31 end

Figure 3.5: Algorithm for constructing a decision tree from a Fourier spectrum (TCFS). order(h) returns the order of schema h; intersect(h, j, CFSET) returns the feature to be updated using w_j if such a feature exists, and null otherwise.

Now we are ready to describe the Tree Construction from Fourier Spectrum (TCFS) algorithm, which exploits the decomposable definition of $\bar{f}(h_{k \leftarrow i})$ and focuses on computing the $\bar{f}(h_{k \leftarrow i})$ values. Note that, given $h$ (the current path), selecting the next feature is essentially identical to choosing the $k$-th feature that achieves the maximum $\mathrm{Gain}(h, k)$. Therefore, the basic idea of TCFS is to associate the most up-to-date $\bar{f}(h_{k \leftarrow i})$ values with the $k$-th feature. In other words, when TCFS selects the next node (after some value $i$ is chosen, so that $h_k = i$), $h_{k \leftarrow i}$ becomes the new $h$.

Then it identifies the set of FCs (we call these the appropriate FCs) that are required to compute all the $\bar{f}(h_{k \leftarrow i})$ values for each feature, and computes the corresponding entropies. This process can be viewed as updating each $\bar{f}(h_{k \leftarrow i})$ for the corresponding $k$-th feature as if that feature were selected; such computations are needed anyway if the feature is to be selected in the future along the current path. This is essentially updating the $\bar{f}(h_{k \leftarrow i})$ values for a feature $k$ in a bottom-up manner (following the flavor of dynamic programming). Note that $\bar{f}(h_{k \leftarrow i})$ is, in fact, computable by adding $\bar{f}_k(h_{k \leftarrow i})$ to $\bar{f}(h)$; here the $\bar{f}_k(h_{k \leftarrow i})$ values are partial sums to which only the current appropriate FCs contribute. Detecting all appropriate FCs requires a scan over the FS. However, they are split off from the FS once they have been used in a computation, since they are no longer needed for the calculation of higher order schemata. Thus it takes much less time to compute higher order schemata; note that this is just the opposite of what we encountered in the naive implementation.

The algorithm stops growing a path when either the original FS becomes an empty set or the minimum confidence level is achieved. The depth of the resulting tree can also be set to a pre-determined bound. A pictorial description of the algorithm is shown in Figure 3.6; pseudo code is presented in Figures 3.4 and 3.5.

TCFS uses the same splitting criterion to construct a tree as C4.5. Both require a number of information-gain tests that grows exponentially with respect to the depth of the tree; in that sense, the asymptotic running time of TCFS is the same as that of C4.5. However, while C4.5 uses the original data to compute information gains, TCFS uses a Fourier spectrum. Therefore, in practice, a comparison of the running times of the two approaches will depend on the relative sizes of the original data and of the Fourier spectrum. The following section presents an extension of TCFS for handling non-Boolean class labels.

Extension of TCFS to Multi-Class Decision Trees

The extension of the TCFS algorithm to multi-class problems is immediately possible by redefining the entropy function, which should be modified to capture the entropy over multiple class labels. For this, let us first define $\bar{f}^{(c)}(h)$ to be a schema average function that uses only $W^{(c)}$ (see the discussion of multi-class spectra above). Note that it computes the average occurrence of the $c$-th class label in $h$. Then the entropy of a schema is redefined as follows:

$$\mathrm{entropy}(h) = -\sum_{c=1}^{k} \bar{f}^{(c)}(h)\, \log \bar{f}^{(c)}(h),$$

where $k$ is the number of class labels. This expression can be directly used for computing the information gain when choosing the decision nodes of a tree for classifying domains with non-Boolean class labels.

In this section, we discussed a way to assign a confidence to a node in a decision tree, and considered a method to estimate information gain using it. Consequently, we showed that decision tree construction from the Fourier spectrum is possible. In particular, we devised the TCFS algorithm, which exploits the recursive and decomposable nature of the tree building process in the spectrum domain, thus constructing a decision tree efficiently. In the following section, we discuss empirical verification of the proposed Fourier spectrum-based aggregation approach.

Figure 3.6: Illustration of the Tree Construction from Fourier Spectrum (TCFS) algorithm. The constructed tree is shown on the left. The schemata evaluated at different orders are shown in the middle. The rightmost part shows the splitting of the set of all Fourier coefficients, which makes the process of looking up the appropriate coefficients efficient.

Removing Redundancies from Ensembles

Existing ensemble-learning techniques work by combining (usually through a linear combination) the outputs of the base classifiers. They do not structurally combine the classifiers themselves. As a result, the base classifiers often share a lot of redundancies. The Fourier representation offers a unique way to fundamentally aggregate the trees and to perform further analysis to construct an efficient representation. Let $f(\mathbf{x})$ be the underlying function representing the ensemble of $n$ different decision trees, where the output is a weighted linear combination of the outputs of the base classifiers. Then we can write

$$f(\mathbf{x}) = a_1 f_1(\mathbf{x}) + a_2 f_2(\mathbf{x}) + \cdots + a_n f_n(\mathbf{x}) = a_1 \sum_{\mathbf{j} \in J_1} w^{(1)}_{\mathbf{j}} \psi_{\mathbf{j}}(\mathbf{x}) + \cdots + a_n \sum_{\mathbf{j} \in J_n} w^{(n)}_{\mathbf{j}} \psi_{\mathbf{j}}(\mathbf{x}),$$

where $a_i$ is the weight of the $i$-th decision tree and $J_i$ is the set of all partitions with non-zero Fourier coefficients in its spectrum. Therefore $f(\mathbf{x}) = \sum_{\mathbf{j} \in J} w_{\mathbf{j}} \psi_{\mathbf{j}}(\mathbf{x})$, where $w_{\mathbf{j}} = \sum_{i=1}^{n} a_i w^{(i)}_{\mathbf{j}}$ and $J = \bigcup_{i=1}^{n} J_i$; that is, the Fourier spectrum of $f(\mathbf{x})$ (a linear ensemble classifier) is simply the weighted sum of the spectra of the member trees.

Consider the matrix $D$, where $D_{ij} = f_j(\mathbf{x}_i)$ is the output of tree $f_j$ for input $\mathbf{x}_i \in X$. $D$ is a $|X| \times n$ matrix, where $|X|$ is the size of the input domain and $n$ is the total number of trees in the ensemble. An ensemble classifier that combines the outputs of the base classifiers can be viewed as a function defined over the set of all rows of $D$. If $D_{\cdot,j}$ denotes the $j$-th column of $D$, then the ensemble classifier can be viewed as a function of $D_{\cdot,1}, D_{\cdot,2}, \dots, D_{\cdot,n}$. When the ensemble classifier is a linear combination of the outputs of the base classifiers, we have $F = a_1 D_{\cdot,1} + a_2 D_{\cdot,2} + \cdots + a_n D_{\cdot,n}$, where $F$ is the column matrix of the overall ensemble output.

Since the base classifiers may have redundancy, we would like to construct a compact, low-dimensional representation of the matrix $D$. However, explicit construction and manipulation of $D$ is difficult, since most practical applications deal with a very large domain. We can try to construct an approximation of $D$ using only the available training data. One such approximation of $D$ and its Principal Component Analysis-based projection is reported elsewhere [190]. That technique performs PCA of the matrix $D$, projects the data onto the representation defined by the eigenvectors of the covariance matrix of $D$, and then performs linear regression to compute the coefficients $a_1, a_2, \dots, a_n$. While the approach is interesting, it has serious limitations. First of all, the construction of an approximation of $D$, even for the training data alone, is computationally prohibitive for most large scale data mining applications. Moreover, it is only an approximation, since the matrix is computed over the observed data set rather than the entire domain. In the following we demonstrate a novel way to perform a PCA of the matrix containing the Fourier spectra of the trees; the approach works without explicitly generating the matrix $D$. It is also important to note that the PCA-based regression scheme [190] offers a way to find the weights of the members of the ensemble; it does not offer any way to aggregate the tree structures themselves and construct a new representation of the ensemble, which the current approach does.

The following analysis assumes that the columns of the matrix $D$ are mean-zero; this restriction can easily be removed with a simple extension of the analysis. Note that the covariance matrix of $D$ is $D^T D$. The $(i, j)$-th entry of this matrix is

$$\langle D_{\cdot,i}, D_{\cdot,j} \rangle = \langle f_i(\mathbf{x}), f_j(\mathbf{x}) \rangle = \sum_{\mathbf{k}} w^{(i)}_{\mathbf{k}}\, \overline{w^{(j)}_{\mathbf{k}}} = \langle \mathbf{w}^{(i)}, \mathbf{w}^{(j)} \rangle, \tag{3.4}$$

where the middle step is true by Lemma 2. Now let us consider the matrix $W$, where $W_{ij} = w^{(j)}_{\mathbf{j}_i}$, i.e., the coefficient corresponding to the $i$-th member of the partition set $J$ from the spectrum of tree $f_j$. Equation 3.4 implies that the covariance matrices of $D$ and $W$ are identical. Note that $W$ is a $|J| \times n$ matrix, and for most practical cases $|J| \ll |X|$. Therefore, analyzing $W$ using techniques like PCA is significantly easier. The following discourse outlines a PCA-based approach.
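The key practical point is that the PCA runs on the small coefficient matrix $W$ rather than on $D$. A sketch of this step (ours; it assumes real-valued spectra stored as the dicts used in the earlier sketches):

import numpy as np

def odt_component_spectra(spectra, n_components):
    # Build W (|J| x n), diagonalize its covariance W^T W (identical to that of
    # D by Equation 3.4), and return each leading component as a spectrum from
    # which an orthogonal tree can be reconstructed with TCFS.
    partitions = sorted({j for s in spectra for j in s})
    W = np.array([[float(np.real(s.get(j, 0.0))) for s in spectra]
                  for j in partitions])
    eigvals, eigvecs = np.linalg.eigh(W.T @ W)
    top = np.argsort(eigvals)[::-1][:n_components]
    comps = W @ eigvecs[:, top]          # columns are Fourier spectra themselves
    return [dict(zip(partitions, comps[:, c])) for c in range(comps.shape[1])]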

PCA of the covariance matrix of $W$ produces a set of eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_n$. The eigenvalue decomposition constructs a new representation of the underlying function space. Note that since the eigenvectors are nothing but linear combinations of the original column vectors of $W$, each of them also forms a Fourier spectrum, and we can reconstruct a decision tree from each such spectrum. Moreover, since they are orthogonal to each other, the trees constructed from them also maintain the orthogonality condition. The following section defines orthogonal decision trees, which make use of these eigenvectors.

Orthogonal Decision Trees

The analysis presented in the previous sections offers a way to construct the Fourier spectra of a set of functions that are orthogonal to each other and therefore redundancy-free. These functions also define a basis and can be used to represent any given decision tree in the ensemble as a linear combination. Orthogonal decision trees can be defined as an immediate extension of this framework. A pair of decision trees $f_a(\mathbf{x})$ and $f_b(\mathbf{x})$ are orthogonal to each other if and only if $\langle f_a(\mathbf{x}), f_b(\mathbf{x}) \rangle = 0$ when $a \neq b$ and $\langle f_a(\mathbf{x}), f_b(\mathbf{x}) \rangle = 1$ otherwise. (The second condition is actually a slightly special case of orthogonality: the orthonormality condition.) A set of trees is pairwise orthogonal if every possible pair of members of the set satisfies the orthogonality condition. The orthogonality condition guarantees that the representation is not redundant. These orthogonal trees form a basis that spans the entire function space of the ensemble. The overall output of the ensemble is computed from the outputs of these orthogonal trees. The specific details of the ensemble-output computation depend on the technique adopted to compute the overall output of the original ensemble; for the most popular cases considered here, it boils down to computing the average output. If we choose weighted averages, we may also compute the coefficient corresponding to each orthogonal tree by simply performing linear regression.

Experimental Results

This section reports the experimental performance of orthogonal decision trees on the following data sets: SPECT, NASDAQ, DNA, House of Votes, and Contraceptive Method Usage. For each data set, the following three experiments are performed using known classification techniques (a rough scikit-learn analogue is sketched after the list):

1. C4.5: The C4.5 classifier is built on training data and validated over test data.

2. Bagging: Bagging, a popular ensemble classification technique, is used to test the classification accuracy on the data set.

3. Random Forest: Random forests are built on the training data, using approximately half the number of features in the original data set. The number of trees in the forest is identical to that used in the bagging experiment. (We used the WEKA implementations of Bagging and Random Forests.)
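For readers who want to reproduce the flavor of these baselines outside WEKA, a rough scikit-learn analogue might look as follows (ours; it uses CART rather than C4.5, and the dataset loading is assumed to happen elsewhere):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

def run_baselines(X_train, y_train, X_test, y_test, n_trees=40):
    models = {
        "single tree": DecisionTreeClassifier(),
        "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=n_trees),
        # roughly half the features considered per split, as in the proposal
        "random forest": RandomForestClassifier(n_estimators=n_trees, max_features=0.5),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, "error:", 1 - model.score(X_test, y_test))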

We then perform another set of experiments to compare the techniques described in the previous sections in terms of classification error and tree complexity:

1. Reconstructed Fourier Tree (RFT): The training set is uniformly sampled with replacement, and C4.5 trees are built on each sample. The Fourier representation of each individual tree is obtained, preserving a certain percentage (e.g., 90%) of the energy. This representation of a tree is then used to reconstruct a decision tree using the TCFS algorithm described earlier. The performance of a reconstructed Fourier tree is compared with that of the original C4.5 tree, and the classification error and tree complexity of each of the reconstructed trees is reported. The purpose of this experiment is to study the effect of representing a tree by its Fourier spectrum, and how much accuracy is lost in the entire cycle of summarizing a decision tree by its Fourier spectrum and then re-learning the tree from the spectrum.

2. Aggregated Fourier Tree (AFT): The training set is uniformly sampled with replacement, and C4.5 decision trees are built on each sample (this is identical to bagging). A Fourier representation of each tree is obtained (preserving a certain percentage of the total energy), and these are aggregated with uniform weighting to obtain the spectrum of an Aggregated Fourier Tree (AFT). The AFT is reconstructed using the TCFS algorithm described before, and the classification accuracy and tree complexity of this aggregated Fourier tree are reported.

3. Orthogonal Decision Trees (ODTs): The matrix containing the Fourier coefficients of the decision trees is subjected to principal component analysis. Orthogonal trees are built corresponding to the principal components. In most cases it is found that the first principal component captures most of the variance, and thus the orthogonal decision tree constructed from this principal component is of particular interest; we therefore report the classification error and tree complexity of the orthogonal decision tree obtained from the first principal component. We also perform experiments where we keep the k most significant components. (We select the value of k such that the total variance captured is more than 90%. One could potentially use cross-validation to obtain a suitable value of k, as pointed out in [189], but this is beyond the current scope of the work and will be explored in the future.) The trees are combined by weighting them according to the coefficients obtained from a least square regression. (Several other regression techniques, such as ridge regression and principal component regression, could also be tried; this is left as future work.) For this, we allow all the orthogonal decision trees to individually produce their classifications on the test set; each ODT thus produces a column vector of its classification estimates. Since the class labels in the test set are already known, we use least square regression to obtain the weight to assign to each ODT. The accuracy of the orthogonal decision trees combined in this manner is reported as ODT-LR (ODTs combined using Least Square Regression).

In addition to reporting the classification error, we also report the tree complexity, that is, the total number of nodes in the tree. Similarly, the term ensemble complexity refers to the total number of nodes across all the trees in an ensemble. A smaller ensemble complexity implies a more compact representation of an ensemble and is therefore desirable. Our experiments show that ODTs usually offer a significantly reduced ensemble complexity without any reduction in accuracy.

The following section presents the results for the SPECT data set.

SPECT Data Set

This section illustrates the idea of orthogonal decision trees using a well known binary data set. The dataset, available from the University of California Irvine Machine Learning Repository, describes the diagnosis of cardiac Single Proton Emission Computed Tomography (SPECT) images into two categories, normal or abnormal. The database of 267 SPECT image sets (patients) was processed to extract features that summarize the original SPECT images. As a result, 44 continuous feature patterns were obtained for each patient, which were further processed to obtain 22 binary feature patterns. The training data set consists of 80 instances and 22 attributes. All the features are binary, and the class label is also binary (depending on whether a patient is deemed normal or abnormal). The test data set consists of 187 instances and 22 attributes.

Table 3.1: Classification error for SPECT data.

    Method of classification            Error Percentage
    C4.5
    Bagging
    Random Forest
    Aggregated Fourier Tree (AFT)       19.78%
    ODT from 1st PC                     8.02%
    ODT-LR                              8.02%

Table 3.2: Tree complexity for SPECT data.

    Method of classification                           Tree Complexity
    C4.5
    Bagging (average of 40 trees)                      5.06
    Random Forest (average of 40 trees)
    Aggregated Fourier Tree (AFT) (40 trees)           3
    Orthogonal Decision Tree from 1st PC               17
    Orthogonal Decision Trees (average of 15 trees)    4.3

Table 3.1 shows the error percentage obtained with each of the different classification schemes. The root mean squared error and standard deviation for the 10-fold cross validation in the C4.5 experiment were also recorded. For bagging, the number of trees in the ensemble is chosen to be forty; our experiments reveal that a further increase in the number of trees in the ensemble causes a decrease in the classification accuracy of the ensemble, possibly due to over-fitting of the data. For the experiments with random forests, a forest of 40 trees, each constructed while considering 12 random features, is built; its average out-of-bag error was also recorded.

Figure 3.7: The accuracy (left) and tree complexity (right) of C4.5 and RFT for SPECT data, plotted for each tree in the ensemble.

Figure 3.7 (left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum. The results reveal that if all of the spectrum is preserved, the accuracies of the original C4.5 tree and the RFT are identical. When the higher order Fourier coefficients are removed, this becomes equivalent to pruning a decision tree; this explains the higher accuracy of the reconstructed Fourier tree preserving 90% of the energy of the spectrum. Figure 3.7 (right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble.

In order to construct the orthogonal decision trees, the coefficient matrix is projected onto the fifteen most significant principal components. The most significant principal component captures the bulk of the variance, and the tree complexity of the ODT constructed from this component is 17, with an accuracy of 91.97%. Figure 3.8 shows the variance captured by each of the fifteen principal components. Table 3.2 lists the tree complexities for this data set; the orthogonal trees are found to be smaller in complexity, thus reducing the complexity of the ensemble.

NASDAQ Data Set

The NASDAQ data set is a semi-synthetic data set with 1000 instances and 100 discrete attributes. The original data set contains three years of NASDAQ stock quote data. It is preprocessed and transformed to discrete data by encoding the percentage changes in stock quotes between consecutive days. For these experiments we assign 4 discrete values denoting the levels of change. The class labels predict whether the Yahoo stock is likely to increase or decrease based on the attribute values of the other 99 stocks. We randomly select 200 instances for training, and the remaining 800 instances form the test data set. Table 3.3 lists the classification accuracies of the different experiments performed on this data set.

Figure 3.8: Percentage of variance captured by the principal components for SPECT data.

Table 3.3: Classification error for NASDAQ data.

    Method of classification            Error Percentage
    C4.5
    Bagging
    Random Forest
    Aggregated Fourier Tree (AFT)
    ODT from 1st PC                     31.12%
    ODT-LR                              31.12%

Table 3.4: Tree complexity for NASDAQ data.

    Method of classification                           Tree Complexity
    C4.5
    Bagging (average of 60 trees)                      17
    Random Forest (average of 60 trees)
    Aggregated Fourier Tree (AFT) (60 trees)           15.2
    Orthogonal Decision Tree from 1st PC               3
    Orthogonal Decision Trees (average of 10 trees)    6.2

The root mean squared error and standard deviation for the 10-fold cross validation in the C4.5 experiment were also recorded. C4.5 has the best classification accuracy, though the tree built also has the highest tree complexity. For the bagging experiment, C4.5 trees are built on the dataset such that the size of each bag (used to build a tree) is 40% of the data set. A random forest of 60 trees, each constructed while considering 50 random features, is built on the training data and tested with the test data set.

The average out-of-bag error was also recorded.

Figure 3.9: The accuracy (left) and tree complexity (right) of C4.5 and RFT for NASDAQ data, plotted for each tree in the ensemble.

Figure 3.9 (left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum. Figure 3.9 (right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble. For the orthogonal trees, we project the data along the ten most significant principal components. Figure 3.10 illustrates the percentage of variance captured by the ten most significant principal components.

Figure 3.10: Percentage of variance captured by the principal components for NASDAQ data.

Table 3.4 presents the tree-complexity information for this set of experiments. Both the aggregated Fourier tree and the orthogonal trees performed better than the single C4.5 tree or bagging. The tree-complexity result appears to be quite interesting: while a single C4.5 tree had twenty nine nodes, the orthogonal tree from the first principal component requires just three nodes, which is clearly a much more compact representation.

DNA Data Set

The DNA data set is a processed (StatLog) version of the corresponding data set available from the UC Irvine repository. The processed StatLog version replaces the symbolic attribute values representing the nucleotides (only A, C, T, G) by 3 binary indicator variables; thus the original 60 symbolic attributes become 180 binary attributes, with each of the nucleotides A, C, G, T given a distinct 3-bit indicator value. The data set has three class values, 1, 2, and 3, corresponding to exon-intron boundaries (sometimes called acceptors), intron-exon boundaries (sometimes called donors), and the case when neither is true. We further process the data so that there are only two class labels, i.e., class 1 represents either donors or acceptors, while class 0 represents neither. The training set consists of 2000 instances and 180 attributes, of which 47.45% belong to class 1 while the remaining 52.55% belong to class 0. The test data set consists of 1186 instances and 180 attributes, of which 49.16% belong to class 0 while the remaining 50.84% belong to class 1.

Table 3.5 reports the classification error. The root mean squared error and standard deviation for the 10-fold cross validation in the C4.5 experiment were also recorded. A random forest of 10 trees, each constructed while considering 8 random features, is built on the training data and tested with the test data set; its average out-of-bag error was also recorded.

Table 3.5: Classification error for DNA data.

    Method of classification            Error Percentage
    C4.5
    Bagging
    Random Forest
    Aggregated Fourier Tree (AFT)       8.347%
    ODT from 1st PC                     10.70%
    ODT-LR                              10.70%

It may be interesting to note that the first five eigenvectors are used in this experiment. Figure 3.11 shows the variance captured by these components. As before, the redundancy-free trees are combined using the weights obtained from least square regression. Table 3.6 reports the tree complexity for this data set.

Table 3.6: Tree complexity for DNA data.

    Method of classification                          Tree Complexity
    C4.5
    Bagging (average of 10 trees)                     34
    Random Forest (average of 10 trees)
    Aggregated Fourier Tree (AFT) (10 trees)          3
    Orthogonal Decision Tree from 1st PC              25
    Orthogonal Decision Trees (average of 5 trees)    7.4

Figure 3.11: Percentage of variance captured by the principal components for DNA data.

Figure 3.12 (left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum. Figure 3.12 (right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble.

House of Votes Data Set

The 1984 United States Congressional Voting Records Database is obtained from the University of California Irvine Machine Learning Repository. This data set includes the votes of each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA, including water project cost sharing, adoption of the budget resolution, the MX missile, immigration, etc. It has 435 instances, 16 Boolean valued attributes, and a binary class label (democrat or republican). Our experiments use the first 335 instances for training and the remaining 100 instances for testing. In our experiments, missing values in the data are replaced by one.

The results of classification are shown in Table 3.7, while the tree complexity is shown in Table 3.8. The root mean squared error and standard deviation for the 10-fold cross validation in the C4.5 experiment were also recorded.

Figure 3.12: The accuracy (left) and tree complexity (right) of C4.5 and RFT for DNA data, plotted for each tree in the ensemble.

For bagging, fifteen trees are constructed from the dataset, since this produced the best classification results; the size of each bag was 20% of the training data set. A random forest of fifteen trees, each constructed by considering 8 random features, was also built, and its average out-of-bag error recorded. The classification accuracy and tree complexity of the original C4.5 and RFT ensembles are illustrated on the left and right hand sides of Figure 3.13, respectively. For the orthogonal trees, the coefficient matrix is projected onto the five most significant principal components. Figure 3.14 (left) illustrates the amount of variance captured by each of the principal components.

Table 3.7: Classification error for House of Votes data.

    Method of classification            Error Percentage
    C4.5
    Bagging                             11.0%
    Random Forest                       5.6%
    Aggregated Fourier Tree (AFT)       11%
    ODT from 1st PC                     11%
    ODT-LR                              11%

Contraceptive Method Usage Data Set

This dataset is obtained from the University of California Irvine Machine Learning Repository and is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey.

Figure 3.13: The accuracy (left) and tree complexity (right) of C4.5 and RFT for House of Votes data, plotted for each tree in the ensemble (i varies from 1 to 15).

The samples are married women who were either not pregnant or did not know whether they were pregnant at the time of the interview. The problem is to predict the current contraceptive method choice of a woman based on her demographic and socio-economic characteristics. There are 1473 instances and 10 attributes, including a binary class label. All attributes are processed so that they are binary. Our experiments use 1320 instances for the training set, while the rest form the test data set.

The results of classification are tabulated in Table 3.9, while Table 3.10 shows the tree complexity. The root mean squared error and standard deviation for the 10-fold cross validation in the C4.5 experiment were also recorded. A random forest built with 10 trees, considering 5 random features, produces an average classification error of about 45.88%; its average out-of-bag error was also recorded. Figure 3.15 (left) compares the accuracy of the original C4.5 ensemble with that of the Reconstructed Fourier Tree (RFT) ensemble preserving 90% of the energy of the spectrum.

Table 3.8: Tree complexity for House of Votes data.

    Method of classification                          Tree Complexity
    C4.5                                              9
    Bagging (average of 15 trees)
    Random Forest (average of 15 trees)
    Aggregated Fourier Tree (AFT) (15 trees)          5
    Orthogonal Decision Tree from 1st PC              5
    Orthogonal Decision Trees (average of 5 trees)    3

Figure 3.14: Percentage of variance captured by the principal components for (left) House of Votes data and (right) Contraceptive Method Usage data.

Figure 3.15 (right) compares the tree complexity of the original C4.5 ensemble with that of the RFT ensemble. For the ODTs, the data is projected along the first ten principal components. Figure 3.14 (right) shows the amount of variance captured by each principal component. It is interesting to note that the first principal component captures only about 61.85% of the variance, and thus the corresponding ODT generated from the first principal component has a relatively high tree complexity.

3.3 DDM on Data Streams

Introduction

Several challenging new applications demand the ability to do data mining on resource constrained devices. One such application is the monitoring of physiological data streams obtained from wearable sensing devices. Such monitoring has applications in pervasive healthcare management, be it for seniors, emergency response personnel, soldiers on the battlefield, or athletes.

Table 3.9: Classification error for Contraceptive Method Usage data.

    Method of classification            Error Percentage
    C4.5
    Bagging
    Random Forest
    Aggregated Fourier Tree (AFT)       33.98%
    ODT from 1st PC                     46.40%
    ODT-LR                              46.40%

Table 3.10: Tree complexity for Contraceptive Method Usage data.

    Method of classification                           Tree Complexity
    C4.5
    Bagging (average of 10 trees)                      24.8
    Random Forest (average of 10 trees)
    Aggregated Fourier Tree (AFT) (10 trees)           55
    Orthogonal Decision Tree from 1st PC               15
    Orthogonal Decision Trees (average of 10 trees)    6.6

Figure 3.15: The accuracy (left) and tree complexity (right) of C4.5 and RFT for Contraceptive Method Usage data, plotted for each tree in the ensemble (i varies from 1 to 10).

A key requirement is that the monitoring system be able to run on resource constrained handheld or wearable devices. Orthogonal decision trees (ODTs), introduced earlier in this proposal, offer an effective way to

construct a redundancy-free, accurate, and meaningful representation of the large decision-tree ensembles often created by popular techniques such as Bagging, Boosting, Random Forests, and many distributed and data stream mining algorithms. This section discusses various properties of ODTs and their suitability for monitoring physiological data streams in a resource-constrained environment. It offers experimental results documenting the performance of orthogonal trees in terms of accuracy, model complexity, and other characteristics in a resource-constrained mobile environment. In closing, we argue that this application would benefit significantly from integration with a grid infrastructure.

Physiological Data Stream Monitoring

We draw two scenarios to illustrate the potential uses of physiological data stream monitoring. Both cases involve a situation where a potentially complex decision space has to be examined, and yet the resources available on the devices that will run the decision process are not sufficient to maintain and use ensembles.

Consider a real time environment to monitor the health effects of environmental toxins or disease pathogens on humans. Significant advances are being made today in biochemical engineering to create extremely low cost sensors for various toxins [162] that could constantly monitor the environment and generate data streams over wireless networks. It is not unreasonable to assume that similar sensors could be developed to detect disease causing pathogens. In addition, most state health and environmental agencies and federal government entities such as the CDC and EPA have mobile labs and response units that can test for the presence of pathogens or dangerous chemicals. The mobile units will have handheld devices with wireless connections on which to send the data and/or their analysis. In addition, each hospital today generates reports on admissions and discharges, and often reports these to various monitoring agencies. Given these disparate data streams, one could analyze them to see if correlates can be found, alerting experts to potential cause-effect relations (Pfiesteria found in Chesapeake Bay, while hospitals report many people with upset stomachs who recently had seafood), potential epidemiological events (field units report dead infected birds while elderly patients check in with viral fever symptoms, indicating the need for West Nile virus tests and preventive spraying), and, more pertinent in present times, low grade chemical and biological attacks (sensors detect particular toxins, mobile units find contaminated sites, and hospitals show people who work at or near those sites being admitted with unexplained symptoms). At present, much of this analysis is done post facto: experts hypothesize on possible causes of ailments, then gather the data from disparate sources to confirm their hypotheses. Clearly, a more proactive environment which could mine these diverse data streams to detect emergent patterns would be extremely useful.

This scenario, of course, has some futuristic elements. On a more present day note, there are now several wearable sensors on the market, such as the SenseWear armband from BodyMedia [30], the Wearable West [272], and the LifeShirt Garment from Vivometrics [266], that can be used to monitor a person's vital signs, such as temperature, heart rate, and heat flux.

Figure 3.16: The BodyMedia SenseWear armband and the Vivometrics LifeShirt Garment.

The figure on the left hand side shows the SenseWear armband that was used to collect the data. The sensors in this band are capable of measuring the following:

1. Heat flux: the amount of heat dissipated by the body.

2. Accelerometer: motion of the body.

3. Galvanic skin response: electrical conductivity between two points on the wearer's arm.

4. Skin temperature: the temperature of the skin, generally reflective of the body's core temperature.

5. Near-body temperature: the air temperature immediately around the wearer's armband.

The subjects were expected to wear the armband as they went about their daily routine, and were required to timestamp the beginning and end of each activity. For example, before starting to take a jog, they could press the timestamp button, and when finished, they could press the button again to record the end of the activity. This body monitoring device can be worn continuously, and can store up to 5 days of physiological data before it has to be retrieved.

The LifeShirt Garment, shown on the right hand side of Figure 3.16, is another example of an easy to wear shirt that allows measurement of pulmonary functions via sensors woven into the shirt, along with a heart monitor. Subjects are able to record symptoms, moods, activities, and several other physiological characteristics.

Analyzing these vital signs in real time using small form factor wearable computers has several valuable near term applications. For instance, one could monitor senior citizens living in assisted or independent housing, to alert physicians and support personnel if the signs point to distress. Similarly, one could monitor athletes during games or practice; given the recent high profile deaths of athletes at both the professional and high school levels during practice, the importance of such an application is fairly apparent. Other potential applications include battlefield monitoring of soldiers, or monitoring first responders such as firefighters.

Experimental Results

In order to perform on-line monitoring of physiological data using wearable or handheld (PDA, cellphone) devices, data streams are sent to them from sensors using short range wireless networks such as PANs. Precomputed orthogonal decision trees and bagging ensembles (based on training data obtained previously) are kept on these devices. The data streams are classified using these precomputed models, which are updated on a periodic basis. It must be noted that while the monitoring is in real time, the model computation is done off-line using stored data.

This section documents the performance of orthogonal decision trees on a physiological data set. It makes use of a publicly available data set in order to offer benchmarked results. This dataset was obtained from the Physiological Data Modeling Contest held as part of the International Conference on Machine Learning, 2004. It comprises several months of data from more than a dozen subjects and was collected using BodyMedia wearable body monitors. In our experiments, the training set consisted of 50,000 instances and 11 continuous and discrete valued attributes (gender, galvanic skin temperature, heat flux, near body temperature, pedometer, skin temperature, readings from the longitudinal and transverse accelerometers, and the time recorded for an activity, called the session time). The test set had 32,673 instances. The continuous valued attributes were discretized using the WEKA software, so the final training and test data sets had all discrete valued attributes. A binary classification problem was formulated, which monitored whether an individual was engaged in a particular activity (class label = 1) or not (class label = 0) depending on the physiological sensor readings.

C4.5 decision trees were built on data blocks of 150 instances each, and their classification accuracy and tree complexity were noted. These were then used to compute the trees' Fourier spectra, and the matrix of Fourier coefficients was subjected to principal component analysis. Orthogonal trees were built corresponding to the significant components and combined using a uniform aggregation scheme. The accuracy and size of the orthogonal trees are noted and compared with the corresponding results generated by bagging using the same number of decision trees in the ensemble.
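The deployment setting described at the start of this section amounts to a simple loop on the device: a precomputed, compact model scores each incoming reading and raises an alert on sustained abnormal output. A hypothetical sketch (ours; the streaming source, the model object's predict interface, and the alert threshold are all assumptions):

from collections import deque

def monitor(stream, model, window=150, alert_fraction=0.5):
    # Classify a physiological stream with a precomputed classifier and yield
    # an alert whenever more than alert_fraction of the last `window` readings
    # are labeled abnormal (label 1).
    recent = deque(maxlen=window)
    for reading in stream:                   # each reading is a feature vector
        recent.append(model.predict([reading])[0])
        if len(recent) == window and sum(recent) / window > alert_fraction:
            yield "alert"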

Figure 3.17: Decision trees built from four different samples of the physiological data set.

Figure 3.18: An orthogonal decision tree.

Figure 3.17 illustrates four decision trees built on the uniformly sampled training data set (each sample of size 150). The first decision tree has a complexity of 7 and considers the attributes transverse accelerometer reading, session time, and near body temperature as ideal for splits. Before pruning, only two instances are misclassified, giving an error of 1.3%; after pruning, there is no change in the structure of the tree, and the estimated error percentage is 4.9%. The second, third, and fourth decision trees have complexities 5, 7, and 3, respectively. An illustration of an orthogonal decision tree obtained from the first principal component is shown in Figure 3.18.

Figure 3.19 illustrates the distribution of tree complexity and classification error for the original C4.5 trees used to construct an ODT ensemble. The total number of nodes in the original C4.5 trees varied between three and thirteen, and the trees had an error of less than 25%.

Figure 3.19: Histogram of tree complexity (left) and classification error (right) for the original C4.5 trees.

Figure 3.20: Histogram of classification error in the ODT ensemble.

In comparison, the average complexity of the orthogonal decision trees was found to be 3 for all the different ensemble sizes. In fact, for this particular dataset, the sensor reading corresponding to the transverse accelerometer attribute was found to be the most interesting: all the orthogonal decision trees used this attribute as the root node. Figure 3.20 illustrates the distribution of classification error for an ODT ensemble of 75 trees.

Figure 3.21: Comparison of classification error for the aggregated ODT versus bagging.

Figure 3.22: Plot of the Tree Complexity Ratio versus the number of trees in the ensemble.

We compared the accuracy obtained from an aggregated orthogonal decision tree to that obtained from a bagging ensemble (using the same number of trees in each case). Figure 3.21 plots the classification error of the aggregated ODT and bagging versus the number of decision trees in the ensemble. We found that the classification from an aggregated orthogonal decision tree was better than bagging when the number of trees in the ensemble was smaller; as the number of trees in the ensemble increased, bagging provided slightly better accuracy. It must be noted, however, that in constrained environments such as pocket PCs, personal digital assistants and sensor network settings, arbitrarily increasing the number of trees in the ensemble may not be feasible due to memory constraints.
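A minimal sketch of this comparison follows, assuming `odt_trees` and `bagged_trees` are lists of fitted classifiers of equal length and that both ensembles are combined by an unweighted majority vote, as in the uniform aggregation scheme described earlier; the names and the size sweep are illustrative.

import numpy as np

def ensemble_error(trees, X_test, y_test):
    """Classification error of a 0/1 majority-vote ensemble on a test set."""
    votes = np.mean([t.predict(X_test) for t in trees], axis=0)
    return np.mean((votes >= 0.5).astype(int) != y_test)

# Sweep the ensemble size, mirroring Figure 3.21:
# for k in range(5, 80, 10):
#     print(k, ensemble_error(odt_trees[:k], X_test, y_test),
#              ensemble_error(bagged_trees[:k], X_test, y_test))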

Figure 3.23: Variance captured by the first principal component versus the number of trees in the ensemble.

In resource-constrained environments it is often necessary to keep track of the amount of memory used to store the ensemble. In the current implementation, storing a node data structure in a tree requires approximately 1 KB of memory. Consider an ensemble of 20 trees: if the average number of nodes per tree is 7, then we are required to store 140 KB of data. Orthogonal decision trees, on the other hand, are smaller in size, with less redundancy; in the experiments we performed, they typically have a complexity of 3 nodes, which means we need to store only 3 KB of data. We define the Tree Complexity Ratio (TCR) as the total number of nodes in the ODT versus the total number of nodes in the bagging ensemble. Figure 3.22 plots the variation of the TCR as the number of trees in the ensemble increases. It may be noted that in resource-constrained environments one can opt for meaningful trees of smaller size and comparable accuracy, as opposed to larger ensembles with slightly better accuracy. An orthogonal decision tree also helps in the feature selection process and indicates which attributes are more important than others in the data set. Figure 3.23 shows the variance captured by the first principal component as the number of trees in the ensemble was varied from 5 to 75. As expected, as the number of trees in the ensemble increases, the first principal component captures most of the variance, while the variance captured by the second and third components gradually decreases. The following section illustrates the response time for classification on a pocket PC using a bagging ensemble and an equivalent orthogonal decision tree ensemble.
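As a back-of-the-envelope check of the memory figures above, the following sketch reproduces the 140 KB versus 3 KB comparison and the corresponding TCR; the ~1 KB-per-node cost is the approximation stated in the text.

NODE_KB = 1.0                     # approximate memory per stored tree node
bagging_nodes = 20 * 7            # 20 trees with 7 nodes each, on average
odt_nodes = 3                     # typical complexity of an ODT here
tcr = odt_nodes / bagging_nodes   # Tree Complexity Ratio, as defined above
print(f"bagging: {bagging_nodes * NODE_KB:.0f} KB, "
      f"ODT: {odt_nodes * NODE_KB:.0f} KB, TCR = {tcr:.3f}")
# -> bagging: 140 KB, ODT: 3 KB, TCR = 0.021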

Figure 3.24: Plot of response time for bagging and the equivalent ODT ensemble versus the number of trees in the ensemble.

Monitoring in Resource Constrained Environments

Resource-constrained environments such as personal digital assistants, pocket PCs and cell phones are often used to monitor the physiological conditions of subjects. These devices present additional challenges for monitoring owing to their limited battery power, memory restrictions and small displays. The previous section indicated that an aggregated orthogonal decision tree was small in size and captured an accuracy better than or comparable to that of bagging when the ensemble size was small. Although bagging was found to perform better in larger ensembles, the number of trees that needed to be stored was considerably larger, which is clearly not an option in resource-constrained environments. A tradeoff therefore exists between memory usage and accuracy. In order to test the response time for monitoring, we performed classification experiments on an HP iPAQ Pocket PC. We assumed that physiological data blocks of 40 instances each were sent to the handheld device. Using training data obtained previously, we precomputed C4.5 decision trees. The Fourier spectra of the trees were evaluated (preserving approximately 99% of the total energy), and the coefficient matrix was projected onto the most significant principal components. Since the time required for computation is of considerable importance in resource-constrained environments, we estimated the response time of the bagging ensemble versus the equivalent ODT ensemble. We define response time as the time required to produce an accuracy estimate from all the available instances using the specified classification scheme. Figure 3.24 illustrates the response time for a bagging ensemble and an equivalent ODT ensemble. Clearly, the equivalent orthogonal decision tree produces classification results faster than a bagging ensemble; this may be attributed to the fact that much of the redundancy in the bagging ensemble has been removed in the ODT ensemble. Our method thus offers a computationally efficient approach to classification on resource-constrained devices.
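A minimal sketch of the response-time measurement, assuming blocks of 40 labelled instances arrive at the device and `models` is either the bagging ensemble or the equivalent ODT ensemble (majority vote in both cases); the timer and the vote rule are illustrative choices.

import time
import numpy as np

def response_time(models, X_block, y_block):
    """Seconds to produce an accuracy estimate for one incoming data block."""
    start = time.perf_counter()
    votes = np.mean([m.predict(X_block) for m in models], axis=0)
    accuracy = np.mean((votes >= 0.5).astype(int) == y_block)
    return time.perf_counter() - start, accuracy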
