Knowledge Discovery Services and Tools on Grids

Similar documents
Grid-Based Data Mining and the KNOWLEDGE GRID Framework

Tools and Services for Distributed Knowledge Discovery on Grids

Design of Distributed Data Mining Applications on the KNOWLEDGE GRID

Architecture, Metadata, and Ontologies in the Knowledge Grid

PARALLEL AND DISTRIBUTED KNOWLEDGE DISCOVERY ON THE GRID: A REFERENCE ARCHITECTURE

THE GLOBUS PROJECT. White Paper. GridFTP. Universal Data Transfer for the Grid

Programming Knowledge Discovery Workflows in Service-Oriented Distributed Systems

High Performance Computing Course Notes Grid Computing I

Chapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT.

Grid Computing Fall 2005 Lecture 5: Grid Architecture and Globus. Gabrielle Allen

Managing Heterogeneous Resources in Data Mining Applications on Grids Using XML-Based Metadata

Introduction to Grid Technology

Grid Architectural Models

WSRF Services for Composing Distributed Data Mining Applications on Grids: Functionality and Performance

Grid Programming Models: Current Tools, Issues and Directions. Computer Systems Research Department The Aerospace Corporation, P.O.

Grid Computing. MCSN - N. Tonellotto - Distributed Enabling Platforms

Day 1 : August (Thursday) An overview of Globus Toolkit 2.4

A Distributed Media Service System Based on Globus Data-Management Technologies1

Computational Web Portals. Tomasz Haupt Mississippi State University

Introduction to Grid Computing

A Federated Grid Environment with Replication Services

Grid Computing Systems: A Survey and Taxonomy

A Cloud Framework for Big Data Analytics Workflows on Azure

Knowledge-based Grids

UNICORE Globus: Interoperability of Grid Infrastructures

Grid Computing Middleware. Definitions & functions Middleware components Globus glite

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

Announcements. me your survey: See the Announcements page. Today. Reading. Take a break around 10:15am. Ack: Some figures are from Coulouris

Weka4WS: a WSRF-enabled Weka Toolkit for Distributed Data Mining on Grids

The Integration of Grid Technology with OGC Web Services (OWS) in NWGISS for NASA EOS Data

Grid Computing. Lectured by: Dr. Pham Tran Vu Faculty of Computer and Engineering HCMC University of Technology

A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme

Peer-to-Peer Metadata Management for Knowledge Discovery Applications in Grids

Research and Design Application Platform of Service Grid Based on WSRF

Design The way components fit together

Data Management 1. Grid data management. Different sources of data. Sensors Analytic equipment Measurement tools and devices

Grid Computing: Status and Perspectives. Alexander Reinefeld Florian Schintke. Outline MOTIVATION TWO TYPICAL APPLICATION DOMAINS

Grid Middleware and Globus Toolkit Architecture

Grid Scheduling Architectures with Globus

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster

Grid Programming: Concepts and Challenges. Michael Rokitka CSE510B 10/2007

Design The way components fit together

Introduction to GT3. Introduction to GT3. What is a Grid? A Story of Evolution. The Globus Project

Juliusz Pukacki OGF25 - Grid technologies in e-health Catania, 2-6 March 2009

A SEMANTIC MATCHMAKER SERVICE ON THE GRID

Introduction to Distributed Systems

CHAPTER 2 LITERATURE REVIEW AND BACKGROUND

Just as the Internet is shifting its focus from information and communication to a

Functional Requirements for Grid Oriented Optical Networks

By Ian Foster. Zhifeng Yun

Parallel and Distributed Mining of Association Rule on Knowledge Grid

Assignment 5. Georgia Koloniari

Incremental DataGrid Mining Algorithm for Mobility Prediction of Mobile Users

MINING LARGE DATA SETS ON GRIDS: ISSUES AND PROSPECTS. David Skillicorn. Domenico Talia

Discovery Net : A UK e-science Pilot Project for Grid-based Knowledge Discovery Services. Patrick Wendel Imperial College, London

Metadata, Ontologies and Information Models for Grid PSE Toolkits based on Web Services

GRIDS INTRODUCTION TO GRID INFRASTRUCTURES. Fabrizio Gagliardi

Exploiting peer group concept for adaptive and highly available services

Cloud Computing. Up until now

PPERFGRID: A GRID SERVICES-BASED TOOL FOR THE EXCHANGE OF HETEROGENEOUS PARALLEL PERFORMANCE DATA JOHN JARED HOFFMAN

SwinDeW-G (Swinburne Decentralised Workflow for Grid) System Architecture. Authors: SwinDeW-G Team Contact: {yyang,

Layered Architecture

system of systems: such as a cloud of clouds, a grid of clouds, or a cloud of grids, or inter-clouds as a basic SOA architecture.

Dynamic Workflows for Grid Applications

Multiple Broker Support by Grid Portals* Extended Abstract

Distributed Systems. Bina Ramamurthy. 6/13/2005 B.Ramamurthy 1

Research on the Interoperability Architecture of the Digital Library Grid

The University of Oxford campus grid, expansion and integrating new partners. Dr. David Wallom Technical Manager

Grid Challenges and Experience

The GridWay. approach for job Submission and Management on Grids. Outline. Motivation. The GridWay Framework. Resource Selection

Distributed Data Management with Storage Resource Broker in the UK

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013

Kenneth A. Hawick P. D. Coddington H. A. James

A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme

Workflow, Planning and Performance Information, information, information Dr Andrew Stephen M c Gough

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015

User Tools and Languages for Graph-based Grid Workflows

S.No QUESTIONS COMPETENCE LEVEL UNIT -1 PART A 1. Illustrate the evolutionary trend towards parallel distributed and cloud computing.

Introduction to SRM. Riccardo Zappi 1

UNIT IV PROGRAMMING MODEL. Open source grid middleware packages - Globus Toolkit (GT4) Architecture, Configuration - Usage of Globus

MSF: A Workflow Service Infrastructure for Computational Grid Environments

(9A05803) WEB SERVICES (ELECTIVE - III)

Text mining on a grid environment

A Survey Paper on Grid Information Systems

Grid Computing. Grid Computing 2

M. Antonioletti, EPCC December 5, 2007

GT-OGSA Grid Service Infrastructure

Globus Toolkit Firewall Requirements. Abstract

A Location Model for Ambient Intelligence

Database Assessment for PDMS

ICT-SHOK Project Proposal: PROFI

Future Developments in the EU DataGrid

Resolving Load Balancing Issue of Grid Computing through Dynamic Approach

ICENI: An Open Grid Service Architecture Implemented with Jini Nathalie Furmento, William Lee, Anthony Mayer, Steven Newhouse, and John Darlington

Customized way of Resource Discovery in a Campus Grid

Web-based access to the grid using. the Grid Resource Broker Portal

Kepler Scientific Workflow and Climate Modeling

A Data-Aware Resource Broker for Data Grids

Personal Grid. 1 Introduction. Zhiwei Xu, Lijuan Xiao, and Xingwu Liu

An Evaluation of Alternative Designs for a Grid Information Service

Transcription:

Knowledge Discovery Services and Tools on Grids DOMENICO TALIA DEIS University of Calabria ITALY talia@deis.unical.it Symposium ISMIS 2003, Maebashi City, Japan, Oct. 29, 2003

OUTLINE Introduction Grid and Knowledge Discovery Towards Knowledge Discovery Services on the Grid The KNOWLEDGE GRID u u u KNOWLEDGE GRID Architecture and Services KNOWLEDGE GRID Tools VEGA Research issues and trends Conclusions

PARALLEL & DISTRIBUTED KNOWLEDGE DISCOVERY Knowledge Discovery is often a compute intensive process. When large data sets are coupled with geographic distribution of data, users, and systems, it is necessary to combine different technologies for implementing high-performance distributed knowledge discovery systems. Distributed data mining and knowledge discovery tools are available but most of them do not run on Grids.

WHAT IS THE GRID? The Grid is a new distributed computing infrastructure. Its main goal is: Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations Grid computing is focusing on large-scale resource sharing, innovative applications, and high-performance orientation.

WHAT IS THE GRID? By providing scalable, secure, high-performance mechanisms for discovering and negotiating access to remote resources, the Grid promises to make it possible for scientific collaborations to share resources on an unprecedented scale, and for geographically distributed groups to work together in ways that were previously impossible Ian Foster

WHAT IS THE GRID? Grid Computing By M. Mitchell Waldrop May 2002 Hook enough computers together and what do you get? A new kind of utility that offers supercomputer processing on tap. Is Internet history about to repeat itself?

PROGRAMMING THE GRID A grid programmer has to manage a computation in an environment that is typically heterogeneous and dynamic in composition with a deepening memory and bandwidth latency hierarchy. There are several general problems that are to be solved by Grid programming models Portability and Interoperability Adaptivity and Discovery Performance Fault Tolerance and Security.

PROGRAMMING THE GRID We can envision three main levels in programming applications on the Grid. 1. Using Grid middleware mechanisms (e.g., Globus RSL) 2. Using Grid programming languages 3. Using Grid-based application-oriented environments (PSEs). ApplicationOrientedEnvs Programming Languages Globus mechanisms

Globus SERVICES Grid Security Infrastructure (GSI): secure communication, mutual authentication and single sign-on, run-anywhere authentication service Monitoring and Discovery Service (MDS): a framework for publishing and accessing information about the system components Globus Resource Allocation Manager (GRAM): facilities for resource allocation and process creation, monitoring, and management Dynamically-Updated Resource Online Co-allocator (DUROC): manages multi-requests of resources, delivers requests to different GRAMs and provides timebarrier mechanisms among jobs GridFTP: implements a high-performance, secure data transfer mechanism based on an extension of the FTP protocol Replica Catalog and Replica Management: facilities for managing data replicas, i.e. multiple copies of data stored in different systems to improve access across geographically-distributed Grids

MOTIVATIONS TOWARDS KNOWLEDGE SERVICES Computation and Data Grids are available today. The Knowledge Grid DATA TO KNOWLEDGE The Information Grid CONTROL The Data & Computation Grid As an advancement of the Data Grid concept, it is imperative to develop Information and Knowledge Grids.

TOWARDS KNOWLEDGE SERVICES SCIENTIFIC OBJECTIVES This objective can be achieved through development of techniques and tools for supporting data intensive applications and integration of Data and Computation Grids with Information and Knowledge Grids. to support the process of unification of data management and knowledge discovery systems with Grid technologies for providing knowledge-based Grid services.

PARALLEL & DISTRIBUTED KDD ON GRIDS Grid middleware targets technical challenges in areas such as communication, scheduling, security, information, data access, and fault detection. More recently, data grid software was proposed. Efforts are needed for the development of knowledge discovery tools and services on the Grid. Grid-aware Knowledge Discovery systems

PARALLEL & DISTRIBUTED KDD ON GRIDS The basic principles that motivate the architecture design of the grid-aware PDKD systems Data heterogeneity and large data size management Algorithm integration and independence Grid awareness Openness Scalability Security and data privacy.

WHAT THE GRID OFFERS Grid tools, such as the Globus Toolkit, Legion and UNICORE, provide basic services that can be effectively exploited in the development of distributed data mining applications. Data Grid middleware (e.g. Globus Data Grid) implements data management architectures based on two main services: storage system and metadata management. We designed an architecture for Grid-enabled knowledge discovery.

THE KNOWLEDGE GRID KNOWLEDGE GRID - a distributed knowledge discovery architecture that integrates data mining techniques and computational Grid resources. In the KNOWLEDGE GRID architecture data mining tools are integrated with lower-level Grid mechanisms and services and exploit Data Grid services. This approach benefits from "standard" Grid services and offers an open architecture that can be configured on top of generic Grid middleware.

THE KNOWLEDGE GRID KDD application examples include gene databases, network access and intrusion data, astronomy data files, web usage and content data. Knowledge discovery processes in all these applications typically require the creation and management of complex, dynamic, multi-step workflows. By the KNOWLEDGE GRID such workflows are defined and mapped on agrid assigning its nodes to the Grid hosts and using interconnections for communication among the workflow components (nodes).

THE KNOWLEDGE GRID D 3 S 1 H 3 D 2 S 3 H 2 S 2 D 1 H 1 Component Selection D 1 D 3 H 2 D 4 S 1 H 3 D 4 S 3 Application Workflow Composition Application Execution on the Grid

KNOWLEDGE GRID ENVIRONMENT A KNOWLEDGE GRID application uses: A set of KNOWLEDGE GRID-enabled computers (K-GRID nodes) declaring their availability to participate to some PDKD computation, that are connected by A Grid infrastructure offering basic grid-services (authentication, data location, service level negotiation) and implementing the KNOWLEDGE GRID services.

KNOWLEDGE GRID SERVICES The KNOWLEDGE GRID hierarchic layers : Core K-Grid layer and services are organized in two High-level K-Grid layer. The former refers to services directly implemented on the top of generic Grid services. The latter is used to describe, develop, and execute PDKD computations over the KNOWLEDGE GRID.

KNOWLEDGE GRID ARCHITECTURE High level K-Grid layer DAS Data Access Service Core K-Grid layer TAAS ToolsandAlgorithms Access Service EPMS Execution Plan ManagementService Resource Metadata Execution Plan Metadata Model Metadata RPS Result PresentationService K N O W L E D G E KMR KDS Knowledge Directory Service KEPR RAEMS Resource Alloc. Execution Mng. KBR G R I D Generic Grid Services

KNOWLEDGE GRID ARCHITECTURE High level K-Grid layer DAS Data Access Service Core K-Grid layer KMR TAAS ToolsandAlgorithms Access Service KDS Knowledge Directory Service EPMS Execution Plan ManagementService KEPR Resource Metadata Execution Plan Metadata Model Metadata RAEMS Resource Alloc. Execution Mng. RPS Result PresentationService KBR High level K-Grid layer High level K-Grid layer DAS Data Access Service Core K-Grid layer KMR TAAS ToolsandAlgorithms Access Service KDS Knowledge Directory Service K NEPMS O W L Execution Plan ManagementService KEPR E D G E Resource Metadata Execution Plan Metadata Model Metadata Resource Metadata Execution Plan Metadata Model Metadata RAEMS Resource Alloc. Execution Mng. RPS Result PresentationService KBR High level K-Grid layer DAS Data Access Service Core K-Grid layer TAAS ToolsandAlgorithms Access Service EPMS Execution Plan ManagementService Resource Metadata Execution Plan Metadata Model Metadata RPS Result PresentationService DAS Data Access Service Core K-Grid layer KMR TAAS ToolsandAlgorithms Access Service KDS Knowledge Directory Service EPMS Execution Plan ManagementService KEPR RAEMS Resource Alloc. Execution Mng. G R I KBR D RPS Result PresentationService KMR KDS Knowledge Directory Service KEPR RAEMS Resource Alloc. Execution Mng. KBR High level K-Grid layer Resource Metadata Execution Plan Metadata Model Metadata DAS Data Access Service TAAS ToolsandAlgorithms Access Service EPMS Execution Plan ManagementService RPS Result PresentationService Core K-Grid layer KMR KDS Knowledge Directory Service KEPR RAEMS Resource Alloc. Execution Mng. KBR

KNOWLEDGE GRID SERVICES Core K-Grid layer services: Knowledge directory service (KDS). Extends the basic Globus MDS and GIS services to maintain a description of all data and tools used in the KNOWLEDGE GRID. Resource allocation and execution management service (RAEMS). RAEMS services are used to find a mapping between an execution plan and available resources. The Core K-Grid layer manages metadata describing features of data sources, third party data mining tools, data management, and data visualization tools and algorithms.

KNOWLEDGE GRID SERVICES High-level K-grid layer services: Data Access Search, selection (Data search services), extraction, transformation and delivery (Data extraction services) of data to be mined. Tools and algorithms access Search, selection, and downloading of data mining tools and algorithms. Execution Plan Management Generation of a set of different execution plans that satisfy user, data, and algorithms requirements and constraints. Results presentation Specifies how to generate, present and visualize the PDKD results (rules, associations, models, classification, etc.).

Heterogeneous Resources in GRID-Based KDD Resource involved distributed KDD processes are: Computational resources Data sources (structured or unstructured) Mining tools and algorithms Knowledge extracted (models) Visualization tools and algorithms Heterogeneity arises from the large variety of resources within each category (e.g. software can run on particular host machines, data can be extracted by different data management systems etc.)

A METADATA MODEL A metadata model can play a main role for an efficient management of heterogeneity. Metadata should: document in a simple and human-readable fashion the features of a data mining application. allow for the effective search of resources. provide an efficient way to access resources. be accessed by software tools that support the visual building of Knowledge Grid computations.

A METADATA MODEL In the KNOWLEDGE GRID single resources and workflows are stored using a XML-based metadata model. XML metadata are used to categorize and represent the following resources: data mining software and other tools used in a KDD process data sources discovered knowledge models execution plans.

METADATA MODEL IMPLEMENTATION We use the Globus MDS model only for generic Grid resources, but extended it with an XML metadata model to manage specific KNOWLEDGE GRID resources. Metadata describing relevant K-Grid objects, such as data sources and data mining tools, are implemented using both LDAP and XML. The (Knowledge Metadata Repository) KMR is implemented by LDAP entries and XML documents. The LDAP portion is used as a first point of access to more specific information represented by XML documents.

APPLICATION COMPOSITION STEPS Metadata about K-grid resources DAS / TAAS Search and selection of resources KMRs EPMS Metadata about the selected K-grid resources Design of the PDKD computation TMR Execution Plan KEPR

APPLICATION EXECUTION STEPS Execution Plan RAEMS Execution Plan optimization and translation KEPR RSL script GRAM Execution of the PDKD computation RPS Computation results Results presentation KBR

A TOOL : VEGA A prototype version of the KNOWLEDGE GRID architecture have been implemented using Java and the Globus Toolkit 2.x. To allow a user to build a grid-based data mining application, we developed a toolset named VEGA (a Visual Environment for Grid Applications). VEGA offers users support for : u u u task composition -definition of the entities involved in the computation and specification of relations among them; checking of the consistency of the planned task; generation of the execution plan for a data mining task. u execution of the execution plan through the resource allocation manager of the underlying grid.

VEGA : Visual Design Rather than devise a set of customized syntactical rules, VEGA makes available a visual language to express relations among resources and composition of workflow that define a KDD Grid application. Basic elements: Links Workspaces Graphical objects

VEGA : OBJECTS and LINKS Objects: Hosts Software Data Links: File Transfer Execute Input Output Objects represent resources Links represent relations among resources

VEGA : HOSTS and RESOURCES Hosts pane Resources pane

VEGA WORKSPACES

VEGA WORKSPACES A KGrid application can be composed of several workspaces

XML METADATA in a KMR... <Software> <name>autoclass</name> <description>unsupervised Bayesian Classifier </description> <release> <number major= 3 minor= 3 patch= 3 /> <date>01 May 00</date> </release> <author>nasa Ames Research Center</author> <hostname>icarus.isi.cs.cnr.it</hostname> <executablepath>/share/software/autoclass-c/autoclass </executablepath> <manualpath>/share/software/autoclass-c/read-me.text </manualpath>... </Software>

XML EXECUTION PLAN <ExecutionPlan>... <Task ep:label="ws1_dt2"> <DataTransfer> <Source ep:href="g1../unidb.xml" ep:title="unidb on g1.isi.cs.cnr.it"/> <Destination ep:href="k2../unidb.xml ep:title="unidb on k2.deis.unical.it"/>... </DataTransfer> </Task>... <Task ep:label="ws2_c2"> <Computation> <Program ep:href="k2../iminer.xml" ep:title="iminer on k2.deis.unical.it"/> <Input ep:href="k2../unidb.xml" ep:title="unidb on k2.deis.unical.it"/>... <Output ep:href="k2../iminer.out.xml" ep:title="iminer.out on k2.deis.unical.it"/> </Computation> </Task>... <TaskLink ep:from="ws1_dt2" ep:to="ws2_c2"/>... </ExecutionPlan>

A GENERATED RSL SCRIPT +... (&(resourcemanagercontact=g1.isi.cs.cnr.it) (subjobstarttype=strict-barrier) (label=ws1_dt2) (executable=$(globus_location)/bin/globus-url-copy) (arguments=-vb notpt gsiftp://g1.isi.cs.cnr.it/.../unidb gsiftp://k2.deis.unical.it/.../unidb ) )... (&(resourcemanagercontact=k2.deis.unical.it) (subjobstarttype=strict-barrier) (label=ws2_c2) (executable=.../iminer)... ) )...

APPLICATION EXECUTION

MODELS, PROJECTS, and PROTOTYPES 4KNOWLEDGE GRID 4Discovery Net 4DataCentric Grid 4ADaM EPSRC s project at Imperial College (e-science) Queen s University project/model for immovable data Algorithm Develop. and Mining to mine hydrology data 4TeraGrid Project 4Terra Wide Data Mining Testbed 4Terabyte Challenge Testbed Projects and Testbeds of the National Center for Data Mining (NCDM) at UIC. 4Global Discovery Network

OGSA The Open Grid Services Architecture (OGSA) is an effort to align Grid technologies with Web services technologies. OGSA defines the Grid service concept. A Grid service is apotentially transient Web service with specified interfaces and behaviors. Grid Services implement a service oriented architecture, allowing Grid functionality to be incorporated into a Web services framework.

OGSA KNOWLEDGE GRID SERVICES The KNOWLEDGE GRID is an abstract service-based Grid architecture that does not limit the user in developing and using service-based knowledge discovery applications. We are defining a set of Grid Services that export functionality and operations of the KNOWLEDGE GRID. Each of the KNOWLEDGE GRID services is exposed as a persistent service, using the OGSA conventions and mechanisms.

KNOWLEDGE GRID SERVICES Research Topics: XML-based Grid resource information system; Ontologies for Grid Computing and Knowledge Discovery; Peer-To-Peer Models and Services for resource discovery and presence management; Grid Portals: high-level Problem Solving Environment (PSEs) for Knowledge Discovery on the Grid. Knowledge Management on Grids.

TOWARDS P2P GRIDS P2P and GRIDS are both distributed computing models that depart from the classic client/server model. If we analyze both models we could discover that Grids are in essence P2P systems. But today their implementations differ significantly.

P2P RESOURCE DISCOVERY AND PRESENCE MANAGEMENT P2P mechanisms for resource publication and discovery on Globus. P2P Information System for the KNOWLEDGE GRID. P2P Presence Management tool for Grids implemented in JXTA.

ISSUES in KNOWLEDGE GRIDS Research Issues : Naming, representation, and information modelling of Grid entities and resources. Efficient data transfer and data replica management protocols and services on Grids. Discovery, retrieval, and combination of information regarding available resources, services, software, policies, use cases and best practices. Query mechanisms and intelligent agents for query formation. Development of data brokers, to coordinate the access of data resources across Grid Systems.

ISSUES in KNOWLEDGE GRIDS Setting-up end-user requirements for new knowledge-oriented Grid technologies and preparing test-beds. Use of agent based systems to allow knowledge management across distributed virtual organizations. Development of Semantic Grid models and techniques for data intensive applications. Definition of ontology-based services and applications for data management and knowledge discovery. Evaluating and validating new knowledge discovery strategies and develop grid business models for them.

GRID WISH LIST Despite the exiciting results achieved in the latest years there are some features of future Grids that today are not available: Easy to program, Adaptive, Human-centric, Secure, Reliable, Pervasive, and Knowledge-based.

CONCLUSIONS Parallel and distributed data mining suites and computational grid technology are two critical elements of future high-performance computing environments for e-science (data-intensive experiments) e-business (on-line services) virtual organizations support (virtual teams, virtual enterprises) Knowledge Grids will enable entirely new classes of advanced applications for dealing with the distributed massive data sets. The Grid is not yet another distributed computing system: it is a medium to dynamically share heterogeneous resources, services, and knowledge.

CONCLUSIONS Grids are coupling computation-oriented services with dataoriented services and knowledge-based services. This trend enlarges the Grid application scenario and offer new opportunities for high-level applications. We are much more able to store data than to extract knowledge from it. The integration of knowledge discovery and Grid technologies can help us in this task.

THANKS www.icar.cnr.it/kgrid