Grid-Based Data Mining and the KNOWLEDGE GRID Framework

Similar documents
Tools and Services for Distributed Knowledge Discovery on Grids

Knowledge Discovery Services and Tools on Grids

Design of Distributed Data Mining Applications on the KNOWLEDGE GRID

Architecture, Metadata, and Ontologies in the Knowledge Grid

Programming Knowledge Discovery Workflows in Service-Oriented Distributed Systems

Managing Heterogeneous Resources in Data Mining Applications on Grids Using XML-Based Metadata

PARALLEL AND DISTRIBUTED KNOWLEDGE DISCOVERY ON THE GRID: A REFERENCE ARCHITECTURE

WSRF Services for Composing Distributed Data Mining Applications on Grids: Functionality and Performance

THE GLOBUS PROJECT. White Paper. GridFTP. Universal Data Transfer for the Grid

Incremental DataGrid Mining Algorithm for Mobility Prediction of Mobile Users

Peer-to-Peer Metadata Management for Knowledge Discovery Applications in Grids

Weka4WS: a WSRF-enabled Weka Toolkit for Distributed Data Mining on Grids

A Cloud Framework for Big Data Analytics Workflows on Azure

High Performance Computing Course Notes Grid Computing I

Introduction to Grid Technology

Grid Programming Models: Current Tools, Issues and Directions. Computer Systems Research Department The Aerospace Corporation, P.O.

Grid Architectural Models

Just as the Internet is shifting its focus from information and communication to a

Grid Computing Fall 2005 Lecture 5: Grid Architecture and Globus. Gabrielle Allen

Grid Computing Systems: A Survey and Taxonomy

Functional Requirements for Grid Oriented Optical Networks

Introduction to GT3. Introduction to GT3. What is a Grid? A Story of Evolution. The Globus Project

Parameter Sweeping Programming Model in Aneka on Data Mining Applications

Chapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT.

Parallel and Distributed Mining of Association Rule on Knowledge Grid

Day 1 : August (Thursday) An overview of Globus Toolkit 2.4

Metadata, Ontologies and Information Models for Grid PSE Toolkits based on Web Services

Computational Web Portals. Tomasz Haupt Mississippi State University

Grid Computing. MCSN - N. Tonellotto - Distributed Enabling Platforms

A Federated Grid Environment with Replication Services

GT-OGSA Grid Service Infrastructure

A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme

A SEMANTIC MATCHMAKER SERVICE ON THE GRID

A P2P Approach for Membership Management and Resource Discovery in Grids1

Advanced School in High Performance and GRID Computing November Introduction to Grid computing.

Storage Virtualization. Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan

Grid Computing: Status and Perspectives. Alexander Reinefeld Florian Schintke. Outline MOTIVATION TWO TYPICAL APPLICATION DOMAINS

Introduction to Grid Computing

Exploiting peer group concept for adaptive and highly available services

Grid Computing Middleware. Definitions & functions Middleware components Globus glite

Preprocessing, Management, and Analysis of Mass Spectrometry Proteomics Data

Parallelism in Knowledge Discovery Techniques

CHAPTER 2 LITERATURE REVIEW AND BACKGROUND

Juliusz Pukacki OGF25 - Grid technologies in e-health Catania, 2-6 March 2009

Research and Design Application Platform of Service Grid Based on WSRF

MSF: A Workflow Service Infrastructure for Computational Grid Environments

Grid Middleware and Globus Toolkit Architecture

Announcements. me your survey: See the Announcements page. Today. Reading. Take a break around 10:15am. Ack: Some figures are from Coulouris

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013

Dynamic Workflows for Grid Applications

A Distributed Media Service System Based on Globus Data-Management Technologies1

By Ian Foster. Zhifeng Yun

A Web-Services Based Architecture for Dynamic- Service Deployment

ICT-SHOK Project Proposal: PROFI

DataMining with Grid Computing Concepts

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Grid Programming: Concepts and Challenges. Michael Rokitka CSE510B 10/2007

User Tools and Languages for Graph-based Grid Workflows

The Integration of Grid Technology with OGC Web Services (OWS) in NWGISS for NASA EOS Data

Grid Scheduling Architectures with Globus

Kepler Scientific Workflow and Climate Modeling

MINING LARGE DATA SETS ON GRIDS: ISSUES AND PROSPECTS. David Skillicorn. Domenico Talia

Grid Computing. Grid Computing 2

Knowledge-based Grids

ECLIPSE PERSISTENCE PLATFORM (ECLIPSELINK) FAQ

A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme

(9A05803) WEB SERVICES (ELECTIVE - III)

NextData System of Systems Infrastructure (ND-SoS-Ina)

Customized way of Resource Discovery in a Campus Grid

Grid Computing. Lectured by: Dr. Pham Tran Vu Faculty of Computer and Engineering HCMC University of Technology

SwinDeW-G (Swinburne Decentralised Workflow for Grid) System Architecture. Authors: SwinDeW-G Team Contact: {yyang,

GRIDS INTRODUCTION TO GRID INFRASTRUCTURES. Fabrizio Gagliardi

A Survey Paper on Grid Information Systems

OGCE User Guide for OGCE Release 1

Grid Computing with Voyager

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

Personal Grid. 1 Introduction. Zhiwei Xu, Lijuan Xiao, and Xingwu Liu

STATUS UPDATE ON THE INTEGRATION OF SEE-GRID INTO G- SDAM AND FURTHER IMPLEMENTATION SPECIFIC TOPICS

A Grid-Enabled Component Container for CORBA Lightweight Components

Meta-learning in Grid-based Data Mining Systems

Cloud Computing. Up until now

Visual Component Composition Environments

A Generalized MapReduce Approach for Efficient mining of Large data Sets in the GRID

UNIT IV PROGRAMMING MODEL. Open source grid middleware packages - Globus Toolkit (GT4) Architecture, Configuration - Usage of Globus

Regular Forum of Lreis. Speechmaker: Gao Ang

The Design of a DLS for the Management of Very Large Collections of Archival Objects

Technical Overview. Access control lists define the users, groups, and roles that can access content as well as the operations that can be performed.

GRID Stream Database Managent for Scientific Applications

S.No QUESTIONS COMPETENCE LEVEL UNIT -1 PART A 1. Illustrate the evolutionary trend towards parallel distributed and cloud computing.

My View Grid Computing: An Overview

Design The way components fit together

On the Use of Peer-to-Peer Architectures for the Management of Highly Dynamic Environments

UNICORE Globus: Interoperability of Grid Infrastructures

Top-down definition of Network Centric Operating System features

Java Development and Grid Computing with the Globus Toolkit Version 3

An Engineering Computation Oriented Visual Grid Framework

Introduction to Web Services & SOA

Using Resources of Multiple Grids with the Grid Service Provider. Micha?Kosiedowski

Engineering Interoperable Computational Collaboratories on the Grid 1

A Taxonomy and Survey of Grid Resource Management Systems

Transcription:

Grid-Based Data Mining and the KNOWLEDGE GRID Framework DOMENICO TALIA (joint work with M. Cannataro, A. Congiusta, P. Trunfio) DEIS University of Calabria ITALY talia@deis.unical.it Minneapolis, September 18, 2003

OUTLINE Introduction Parallel and Distributed Data Mining on Grids The KNOWLEDGE GRID KNOWLEDGE GRID Architecture KNOWLEDGE GRID Services KNOWLEDGE GRID Tools VEGA Current Work Conclusion 2

PARALLEL & DISTRIBUTED DATA MINING Data mining is often a compute intensive task. When large data sets are coupled with geographic distribution of data, users, and systems, it is necessary to combine different technologies for implementing high-performance distributed knowledge discovery systems (PDKD). Distributed data mining tools are available but most of them do not run on Grids. 3

WHAT IS A GRIDS? By providing scalable, secure, high-performance mechanisms for discovering and negotiating access to remote resources, the Grid promises to make it possible for scientific collaborations to share resources on an unprecedented scale, and for geographically distributed groups to work together in ways that were previously impossible Ian Foster 4 4

PARALLEL & DISTRIBUTED DM ON GRIDS Grid middleware targets technical challenges in areas such as communication, scheduling, security, information and data access, and fault detection. Efforts are needed for the development of knowledge discovery tools and services on the Grid. Grid-aware PDKD systems 5

PARALLEL & DISTRIBUTED DM ON GRIDS The basic principles that motivate the architecture design of the grid-aware PDKD systems Data heterogeneity and large data size Algorithm integration and independence Grid awareness Openness Scalability Security and data privacy. 6

WHAT THE GRID OFFERS Grid infrastructure tools, such as the Globus Toolkit and Legion, provide basic services that can be effectively used in the development of a data mining applications. Data Grid middleware (e.g. Globus Data Grid) implements data management architectures based on two main services: storage system and metadata management. Data Grids are useful, but are not sufficient for data mining. 7

THE KNOWLEDGE GRID KNOWLEDGE GRID - a PDKD architecture that integrates data mining techniques and computational Grid resources. In the KNOWLEDGE GRID architecture data mining tools are integrated with lower-level Grid mechanisms and services and exploit Data Grid services. This approach benefits from "standard" Grid services and offers an open PDKD architecture that can be configured on top of generic Grid middleware. 8

KNOWLEDGE GRID ENVIRONMENT A KNOWLEDGE GRID application uses: A set of KNOWLEDGE GRID-enabled computers - K-GRID nodes declaring their availability to participate to some PDKD computation, that are connected by A Grid infrastructure offering basic grid-services (authentication, data location, service level negotiation) and implementing the KNOWLEDGE GRID services. 9

KNOWLEDGE GRID ENVIRONMENT KNOWLEDGE GRID services Basic Grid Infrastucture Cluster Element K-GRID tools Grid Middleware LAN Cluster Element Cluster containing data sets and/or DM algorithms K-GRID node Cluster Element Grid Middleware Local Resources Generic Grid node K-GRID tools Grid Middleware Local Resources K-GRID node 10

KNOWLEDGE GRID SERVICES The KNOWLEDGE GRID services are organized in two hierarchic layers : Core K-Grid layer and High-level K-Grid layer. The former refers to services directly implemented on the top of generic Grid services. The latter is used to describe, develop, and execute PDKD computations over the KNOWLEDGE GRID. 11

KNOWLEDGE GRID ARCHITECTURE High level K-Grid layer DAS Data Access Service Core K-Grid layer TAAS Tools and Algorithms Access Service EPMS Execution Plan Management Service Resource Metadata Execution Plan Metadata Model Metadata RPS Result Presentation Service K N O W L E D G E KMR KDS Knowledge Directory Service KEPR RAEMS Resource Alloc. Execution Mng. KBR G R I D Generic Grid Services 12

KNOWLEDGE GRID SERVICES Core K-Grid layer services: Knowledge directory service (KDS). Extends the basic Globus MDS and GIS services to maintain a description of all data and tools used in the KNOWLEDGE GRID. Resource allocation and execution management service (RAEMS). RAEMS services are used to find a mapping between an execution plan and available resources. The Core K-Grid layer manages metadata describing features of data sources, third party data mining tools, data management, and data visualization tools and algorithms. 13

KNOWLEDGE GRID SERVICES High-level K-grid layer services: Data Access Search, selection (Data search services), extraction, transformation and delivery (Data extraction services) of data to be mined. Tools and algorithms access Search, selection, and downloading of data mining tools and algorithms. Execution Plan Management Generation of a set of different execution plans that satisfy user, data, and algorithms requirements and constraints. Results presentation Specifies how to generate, present and visualize the PDKD results (rules, associations, models, classification, etc.). 14

KNOWLEDGE GRID OBJECTS We use the Globus MDS model only for generic Grid resources, but extended it with an XML metadata model to manage specific KNOWLEDGE GRID resources. Metadata describing relevant K-Grid objects, such as data sources and data mining tools, are implemented using both LDAP and XML. The (Knowledge Metadata Repository) KMR is implemented by LDAP entries and XML documents. The LDAP portion is used as a first point of access to more specific information represented by XML documents. 15

APPLICATION COMPOSITION STEPS Metadata about K-grid resources DAS / TAAS Search and selection of resources KMRs EPMS Metadata about the selected K-grid resources Design of the PDKD computation TMR Execution Plan KEPR 16

APPLICATION EXECUTION STEPS Execution Plan RAEMS Execution Plan optimization and translation KEPR RSL script GRAM Execution of the PDKD computation RPS Computation results Results presentation KBR 17

A TOOL : VEGA A prototype version f the KNOWLEDGE GRID architecture have been implemented using Java and the Globus Toolkit 2.x. To allow a user to build a grid-based data mining application, we developed a toolset named VEGA (a Visual Environment for Grid Applications). VEGA offers users support for : task composition - definition of the entities involved in the computation and specification of relations among them; checking of the consistency of the planned task; generation of the execution plan for a data mining task. execution of the execution plan through the resource allocation manager of the underlying grid. 18

VEGA : OBJECTS and LINKS Objects: Hosts Software Data Links: File Transfer Execute Input Output Objects represent resources Links represent relations among resources 19

VEGA Hosts pane Resources pane 20

VEGA A KGrid application can be composed of several workspaces 21

XML METADATA in a KMR... <Software> <name>autoclass</name> <description>unsupervised Bayesian Classifier </description> <release> <number major= 3 minor= 3 patch= 3 /> <date>01 May 00</date> </release> <author>nasa Ames Research Center</author> <hostname>icarus.isi.cs.cnr.it</hostname> <executablepath>/share/software/autoclass-c/autoclass </executablepath> <manualpath>/share/software/autoclass-c/read-me.text </manualpath>... </Software> 22

XML EXECUTION PLAN <ExecutionPlan>... <Task ep:label="ws1_dt2"> <DataTransfer> <Source ep:href="g1../unidb.xml" ep:title="unidb on g1.isi.cs.cnr.it"/> <Destination ep:href="k2../unidb.xml ep:title="unidb on k2.deis.unical.it"/>... </DataTransfer> </Task>... <Task ep:label="ws2_c2"> <Computation> <Program ep:href="k2../iminer.xml" ep:title="iminer on k2.deis.unical.it"/> <Input ep:href="k2../unidb.xml" ep:title="unidb on k2.deis.unical.it"/>... <Output ep:href="k2../iminer.out.xml" ep:title="iminer.out on k2.deis.unical.it"/> </Computation> </Task>... <TaskLink ep:from="ws1_dt2" ep:to="ws2_c2"/>... </ExecutionPlan> 23

A GENERATED RSL SCRIPT +... (&(resourcemanagercontact=g1.isi.cs.cnr.it) (subjobstarttype=strict-barrier) (label=ws1_dt2) (executable=$(globus_location)/bin/globus-url-copy) (arguments=-vb notpt gsiftp://g1.isi.cs.cnr.it/.../unidb gsiftp://k2.deis.unical.it/.../unidb ) )... (&(resourcemanagercontact=k2.deis.unical.it) (subjobstarttype=strict-barrier) (label=ws2_c2) (executable=.../iminer)... ) )... 24

APPLICATION EXECUTION 25

ON GOING WORK : OTHER TOOLS Some things we have done recently VEGA : Support for more complex computation layouts, Execution plan optimization, Abstract resources definition and use. KNOWLEDGE GRID : A peer-to-peer system for presence management and resource discovery on the Grid, A tool for optimized file transfer on the Grid based on GridFTP, A data mining ontology and an associated tool. 26

ON GOING WORK OGSA and KNOWLEDGE DISCOVERY SERVICES The KNOWLEDGE GRID is an abstract service-based Grid architecture that does not limit the user in developing and using service-based knowledge discovery applications. We are defining a set of Grid Services that export functionalities and operations of the KNOWLEDGE GRID. Each of the KNOWLEDGE GRID services is exposed as a persistent service, using the OGSA conventions and mechanisms. We intend to offer those OGSA-Compliant services for impementing distributed Data Mining applications and Knowledge Discovery processes on Grids. 27

CONCLUSION Parallel and distributed data mining suites and computational grid technology are two critical elements of future highperformance computing environments for e-science (data-intensive experiments) e-business (on-line services) virtual organizations support (virtual teams, virtual enterprises) Knowledge Grids will enable entirely new classes of advanced applications for dealing with the data deluge. The Grid is not yet another distributed computing system: it is a medium to dynamically share heterogeneous resources, services, and knowledge. 28

CONCLUSION Grids are coupling computation-oriented services with dataoriented services and knowledge-based services. This trend enlarges the Grid application scenario and offer new opportunities for high-level applications. We are much more able to store data than to extract knowledge from it. The KNOWLEDGE GRID is a framework for the unification of knowledge discovery and grid technologies helping us to climb some mountain of data. 29

MAIN REFERENCES M. Cannataro, D. Talia, The Knowledge Grid, Communications of the ACM, 46(1), 2003. M Cannataro, D. Talia, P. Trunfio, Distributed Data Mining on the Grid, Future Generation Computer Systems, 18(8), 2002. D. Talia, The Open Grid Services Architecture-Where the Grid Meets the Web, IEEE Internet Computing, 6(6), 2002. www.icar.cnr.it/kgrid 30

THANKS 31