Project Participants

Similar documents
Annual Report:

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Clustering Part 4 DBSCAN

Dimension Reduction CS534

Computing Data Cubes Using Massively Parallel Processors

After completing this course, participants will be able to:

Data Warehousing & Data Mining

University of Florida CISE department Gator Engineering. Clustering Part 4

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Chapter 1, Introduction

Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials *

Overview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer

6. Concluding Remarks

Summary. 4. Indexes. 4.0 Indexes. 4.1 Tree Based Indexes. 4.0 Indexes. 19-Nov-10. Last week: This week:

Outline. The History of Histograms. Yannis Ioannidis University of Athens, Hellas

CHAPTER 8: ONLINE ANALYTICAL PROCESSING(OLAP)

Chapter 17 Indexing Structures for Files and Physical Database Design

Building a Concept Hierarchy from a Distance Matrix

ADVANCED MACHINE LEARNING. Mini-Project Overview

Performance impact of dynamic parallelism on different clustering algorithms

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

Co-clustering for differentially private synthetic data generation

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

OLAP Introduction and Overview

Visualization with Data Clustering DIVA Seminar Winter 2006 University of Fribourg

Non-linear dimension reduction

Mining of Web Server Logs using Extended Apriori Algorithm

Efficient Range Query Processing on Uncertain Data

Jun Li, Ph.D. School of Computing and Information Sciences Phone:

6234A - Implementing and Maintaining Microsoft SQL Server 2008 Analysis Services

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Tribhuvan University Institute of Science and Technology MODEL QUESTION

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

An Efficient Approach for Color Pattern Matching Using Image Mining

Question Bank. 4) It is the source of information later delivered to data marts.

Funded Project Final Survey Report

CS570: Introduction to Data Mining

Social Behavior Prediction Through Reality Mining

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

An Improved Apriori Algorithm for Association Rules

A Novel Extreme Point Selection Algorithm in SIFT

Network Based Hard/Soft Information Fusion Data Association Process Gregory Tauer, Kedar Sambhoos, Rakesh Nagi (co-pi), Moises Sudit (co-pi)

Using Natural Clusters Information to Build Fuzzy Indexing Structure

CARPENTER Find Closed Patterns in Long Biological Datasets. Biological Datasets. Overview. Biological Datasets. Zhiyu Wang

Architecture Conscious Data Mining. Srinivasan Parthasarathy Data Mining Research Lab Ohio State University

CIE L*a*b* color model

10778A: Implementing Data Models and Reports with Microsoft SQL Server 2012

Nearest Neighbor Search by Branch and Bound

MATRIX BASED INDEXING TECHNIQUE FOR VIDEO DATA

NIH Research Performance Progress Report

Horizontal Aggregations in SQL to Prepare Data Sets Using PIVOT Operator

Hierarchical Document Clustering

Closed Pattern Mining from n-ary Relations

Web Security Vulnerabilities: Challenges and Solutions

Mining Distributed Frequent Itemset with Hadoop

IMPROVING APRIORI ALGORITHM USING PAFI AND TDFI

Deccansoft Software Services Microsoft Silver Learning Partner. SSAS Syllabus

1 (eagle_eye) and Naeem Latif

Measuring and Evaluating Dissimilarity in Data and Pattern Spaces

Quotient Cube: How to Summarize the Semantics of a Data Cube

Available online at ScienceDirect. Procedia Computer Science 89 (2016 )

Event Detection using Archived Smart House Sensor Data obtained using Symbolic Aggregate Approximation

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

Office of Space Management. U-Space User Guide. Version 1.5 March, 2011

Appropriate Item Partition for Improving the Mining Performance

An Algorithm for Frequent Pattern Mining Based On Apriori

SQL Server Analysis Services

Progress Report: Collaborative Filtering Using Bregman Co-clustering

Monotone Constraints in Frequent Tree Mining

SOMSN: An Effective Self Organizing Map for Clustering of Social Networks

DATA MINING AND WAREHOUSING

Distance-based Methods: Drawbacks

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Parallelizing Frequent Itemset Mining with FP-Trees

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

Courtesy of Prof. Shixia University

Dimension Reduction of Image Manifolds

CPSC 340: Machine Learning and Data Mining. Multi-Dimensional Scaling Fall 2017

Using R-trees for Interactive Visualization of Large Multidimensional Datasets

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Last week. Multi-Frame Structure from Motion: Multi-View Stereo. Unknown camera viewpoints

The application of OLAP and Data mining technology in the analysis of. book lending

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo

An Efficient Clustering Method for k-anonymization

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Improving 3D Shape Retrieval Methods based on Bag-of Feature Approach by using Local Codebooks

Collection Guiding: Multimedia Collection Browsing and Visualization. Outline. Context. Searching for data

Predictive Analytics: Demystifying Current and Emerging Methodologies. Tom Kolde, FCAS, MAAA Linda Brobeck, FCAS, MAAA

Efficient Algorithm for Frequent Itemset Generation in Big Data

ECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1

Contents. Preface to the Second Edition

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Density Based Clustering using Modified PSO based Neighbor Selection

Background. Parallel Coordinates. Basics. Good Example

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining

Table Of Contents: xix Foreword to Second Edition

Technical Report. Title: Manifold learning and Random Projections for multi-view object recognition

Transcription:

Annual Report for Period:10/2004-10/2005 Submitted on: 06/21/2005 Principal Investigator: Yang, Li. Award ID: 0414857 Organization: Western Michigan Univ Title: Projection and Interactive Exploration of Large Relational Data Senior Personnel Name: Yang, Li Project Participants Post-doc Graduate Student Name: Sanver, Mustafa Mustafa Sanver is a Ph.D. student who has been working on research and development of the data visualization component. Name: Yang, Young-ik Young-ik Yang has worked on the database component. Undergraduate Student Technician, Programmer Other Participant Research Experience for Undergraduates Organizational Partners Other Collaborators or Contacts Activities and Findings Research and Education Activities: (See PDF version submitted by PI at the end of the report) Findings: (See PDF version submitted by PI at the end of the report) Training and Development: Page 1 of 3

Two graduate students have worked on this project. They are: 1. Mustafa Sanver 2. Young-Ik Yang Outreach Activities: No outreach activities to report in the first year. Journal Publications Li Yang, "Distance-preserving projection of high dimensional data for nonlinear dimensionality reduction", IEEE Transactions on Pattern Analysis and Machine Intelligence, p. 1243, vol. 26, (2004). Published Li Yang, "Pruning and visualizing generalized association rules in parallel coordinates", IEEE Transactions on Knowledge and Data Engineering, p. 60, vol. 17, (2005). Published Li Yang, "Building k-edge-connected neighborhood graphs for distance-based data projection", Pattern Recognition Letters, p., vol. 26, (2005). Accepted Li Yang, "Building k edge-disjoint spanning trees of minimum total length for isometric data embedding", IEEE Transactions on Pattern Analysis and Machine Intelligence, p., vol. 27, (2005). Accepted Li Yang, "Data Embedding Techniques and Applications", Proceedings of the 2nd International Workshop on Computer Vision meets Databases (CVDB'2005), Baltimore, MD, June 2005, p. 29, vol., (2005). Published Li Yang, "Building Connected Neighborhood Graphs for Isometric Data Embedding", Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'2005), Chicago, IL, August 2005, p., vol., (2005). Accepted Books or Other One-time Publications Web/Internet Site URL(s): http://www.cs.wmich.edu/~yang Description: The URL provided is the PI's home page. A web site dedicated to this project will be setup once we finish the development of the first release of the proposed software. Other Specific Products Contributions within Discipline: Our primary contribution in this first year of the project is the idea of using high dimensional indexing tree as a data structure to store and organize data aggregations on the secondary storage for the purpose of visual exploration of high dimensional data. Our attention has focused on the k-d-b tree and its variations. In addition to information kept in internal pages of a k-d-b tree for range research, we use each internal node of the tree to keep aggregation information of data points in the region represented by Contributions Page 2 of 3

the node. Each node is partitioned such that all nodes at a certain depth level of the tree contain approximately the same number of data points. Therefore, data required by the visualization component can be provided by visiting internal nodes (each represents a hyper-rectangle in the high dimensional space) of the k-d-b tree at a certain depth level. Sibling internal pages of the tree are linked together for the purpose of visual panning. Zooming is supported by accessing data aggregation values at another depth level of the tree. In the data projection component, four methods have been proposed to build connected neighborhood graphs. Using these methods, connectivity of the constructed neighborhood graphs is guaranteed. These methods have made data embedding techniques applicable to under-sampled data and non-uniformly distributed data, for example, data distributed among clusters. We have discovered that many-to-many generalized association rules can be visually explored by a novel use of parallel coordinates. An item taxonomy tree is displayed at each coordinate. Each rule is visualized as a sequence of Bezier curves connecting items on the coordinates. The item taxonomy tree can be partly displayed by user interaction and such interaction effectively introduces a display border in the itemset lattice. Strategies for summarizing and pruning association rules have been developed. This approach allows us to visually explore many frequent itemsets and many-to-many association rules. Contributions to Other Disciplines: The k-d-b data aggregation tree may potentially be used in the optimization of traditional database queries such as multi-attribute join, data aggregation and OLAP queries. These exciting issues are currently under investigation. Contributions to Human Resource Development: The PI and two graduate students have gained great research experience in working on this project. Contributions to Resources for Research and Education: Contributions Beyond Science and Engineering: Special Requirements Special reporting requirements: None Change in Objectives or Scope: None Unobligated funds: less than 20 percent of current funds Animal, Human Subjects, Biohazards: None Categories for which nothing is reported: Organizational Partners Any Book Any Product Contributions: To Any Resources for Research and Education Contributions: To Any Beyond Science and Engineering Page 3 of 3

Major Findings Data Visualization Component: The two-step bootstrapping strategy of footprint splatting is applicable if the data are not aggregated into evenly partitioned data cubes. This enables us to visually explore high dimensional data by directly visualizing aggregated data stored in internal pages of a high dimensional indexing tree. Zooming is easily supported through accessing (multi-resolution) data aggregations at different levels in the internal nodes of the tree. For visual exploration of association rules, we have proposed to visualize generalized association rules by a novel use of parallel coordinates where each coordinate is used to display an item taxonomy tree. The display of child nodes of each node can be toggled on/off by user interaction. Such an interaction effectively introduces a display border in the itemset lattice. Only frequent itemsets or rules that are on the border are visualized. This technique allows us to visually explore many frequent itemsets and many-to-many association rules. Database Component: We have found that most database research issues (multi-resolution data aggregation, interaction-driven query optimization, high dimensional data indexing) listed in the original proposal can be properly addressed if we us a k-d-b data aggregation tree as an index structure to organize data aggregations on the secondary storage. The data aggregation tree is used for two purposes: (1) to store data aggregations at multiple resolutions; (2) to serve as an indexing mechanism for high dimensional search and range query. Together with the bootstrapping strategy of footprint splatting, visual exploration of high dimensional data can be supported by interactively browsing internal pages of the data aggregation tree. Panning and zooming are directly supported by the index. The k-d-b data aggregation tree extends the original k-d-b tree in the following major ways: (1) values of data aggregation (count, sum, max, min, etc.) are stored in internal pages of the tree; (2) sibling internal pages are linked with each other for the purpose of visual panning. For data insertion and deletion, we have modified the k-d-b tree s page splitting strategy: when a page is full, it splits by following the first partition of the hyperrectangle represented by the page. This may create unbalanced tree and the minimum occupancy of each page cannot be guaranteed. We are investigating techniques to avoid this problem by using data histogram and by re-balancing the tree after bulk-loading of data records. The k-d-b data aggregation tree has potential applications to accelerate data aggregation queries and OLAP queries. These issues are currently under investigation. Data Projection Component: Four methods (k-mst, Min-k-ST, k-ec and k-vc) have been proposed to build connected neighborhood graphs. The first three methods build k-edge-connected neighborhood graphs and k-vc builds k-connected neighborhood graphs. These methods can be used for both geodesic-distance-based data embedding and localitypreserving data embedding. They make the data embedding algorithms applicable to non-uniformly distributed data, including data distributed among clusters. Geodesic distance estimation using neighborhood graph performs better in estimating long geodesic distances than in estimating short ones. Geodesic-distance-based data embedding methods (e.g. Isomap) work well in keeping global structure of data. Locality-preserving embedding methods (e.g. LLE) support bidirectional mapping between high dimensional data and their projections. They also support incremental mapping of new data records. A challenge is how to find a method that keeps advantages of both geodesic-distance-preserving methods and locality preserving methods. Such a method should be scalable to large data sets.

Major Research and Education Activities The proposed project consists of three components: data visualization, database support, and data projection. Activities in each of the three components are reported in the following: Data Visualization Component: Footprint splatting of n-d aggregated data has been implemented by using a two-step splatting strategy: the first step calculates 2D footprint of a n-d cube by partitioning the n-d cube into smaller cells and by integrating the footprints of the cells, each of which is approximated by a Gaussian kernel; the second step visualizes the n-d aggregated data by integrating the footprints of all data cubes, where the color and the opacity of each footprint are set according to the aggregation values of the corresponding data cube. This two-step bootstrapping strategy can be applied to footprint calculation of arbitrarily shaped n-d regions. The strategy is still applicable if the high dimensional data are not aggregated into evenly partitioned data cubes. This enables us to visually explore high dimensional data by interactively browsing data aggregations stored in the internal nodes (which represent hyper-rectangles) of a high dimensional indexing tree. We did experiments of using 3D footprints for footprint splatting. We have developed techniques for pruning and visualizing frequent itemsets and many-to-many association rules. A paper on related algorithms is published on the IEEE Transactions on Knowledge and Data Engineering. Database Component: Visual exploration of large high dimensional data requires that the data are aggregated at multiple resolutions. Various data structures for multidimensional indexing and data aggregation have been studied. We have focused particularly on the k-d-b tree and its variations. We have extended the k-d-b tree so that it can be used to store aggregation information of high dimensional data. The original k-d-b tree is modified in the following ways: (1) values of data aggregation (count, sum, max, min, etc.) are stored at internal pages; (2) sibling internal pages are linked to each other for the purpose of visual panning. Data insertion and bulk loading to the k-d-b tree have also been studied. When an internal page is full and has to split, it is divided by following the first partition in the page. This avoids cascading splitting of descendent pages. However, this may create unbalanced tree and the minimum occupancy of each page cannot be guaranteed. We are investigating techniques to re-balance the tree after bulk loading of all data records. In a dynamic database, the indexing tree needs to be re-balanced periodically. Performance of Unix memory-mapped I/O (mmap()) has been tested. Data Projection Component: Existing data embedding techniques have been evaluated. Their advantages and disadvantages have been identified. Four methods have been proposed to build connected neighborhood graphs for the purpose of defining neighbors of each data point. These methods have been implemented in Matlab and experiments have been conducted on different kinds of datasets. Papers on these methods for constructing connected neighborhood graphs have been published on IEEE Transactions on Pattern Analysis and Machine Intelligence, Pattern Recognition Letters, and on the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2005). Education Activities: Visual data exploration will be taught as one of the major parts of CS603 Knowledge Discovery and Data Mining in Fall 2005. Course material has been prepared.