Projects at SDSC. Chaitan Baru Richard Marciano Data Intensive Computing Group. San Diego Supercomputer Center


Projects at SDSC Chaitan Baru Richard Marciano {baru,marciano}@sdsc.edu Data Intensive Computing Group

Projects at SDSC
- National Archives and Records Administration (NARA): Persistent Archives and Electronic Records
- NHPRC/NARA
- XML and GIS: axiomap
- I2T: An Information Integration Testbed for Digital Government

Projects at SDSC (cont.)
- AMICO: in conjunction with the California Digital Library (CDL); part of the NSF DLI-2 project
- ESRI
- Community of Science, Inc.
- Networked Earthquake Engineering Simulation (NEES): NSF program

NARA & NSF
NPACI is a Highly Leveraged National Partnership of Partnerships
- 47 institutions, up from 37
- 20 states, up from 18
- 4 countries, up from 1
- 5 national labs
- Many projects (new and old)
- Vendors and industry
- Government agencies

Information Based Computing
- Applications
- Data Storage, Archival Storage
- Information Management
- Applications: Digital Sky, Neuroscience, Protein Data Bank, Molecular Structures, Earth Systems Science
- Digital Library Collection Building
- Digital Libraries: CDL, UCB - Elib, UCSB - ADL, Stanford - SDLIP, U Michigan - UMDL

Information Management Hierarchy
- Persistent Archives: storage of information model, data model, along with data
- Data Grid: access to data in a different administration domain
- Digital Library: presentation / information discovery (Interlib - ADEPT, UC Berkeley Digital Library)
- Data Collection: Extensible Meta-data Catalog - EMCAT
- Data handling: SDSC Storage Resource Broker - SRB
- Archival Storage: High Performance Storage System - HPSS

Common Information Model
- eXtensible Markup Language (XML)
  - Use tags to define semantic context for components of the data set
- Document Type Definition (DTD)
  - Provides semi-structured representation for organizing tags that can be applied to groups of digital objects
- Development of standards for tags
  - Digital sky, Protein Data Bank, Neuroscience brain images
  - California Digital Library - Art Museum Image Consortium
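As a small illustration of tags carrying semantic context, the sketch below parses one record with Python's standard `xml.etree.ElementTree` and reads the tag names back as field labels. The record and its tag names (`image_record`, `museum`, `title`, `format`) are hypothetical and are not taken from any SDSC or AMICO DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tag names are illustrative only.
record_xml = """
<image_record>
  <museum>AIC_</museum>
  <title>Untitled study</title>
  <format>tiff</format>
</image_record>
"""

def semantic_context(xml_text):
    """Return {tag: text} for the direct children of the root element."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}

context = semantic_context(record_xml)
print(context)
```

Because the labels travel with the data, a reader who has only the file can still tell which value is the museum code and which is the format.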

Hierarchy of Information Contexts
- Digital object context
  - Meta-data to define the structure of the object
  - When publishing a digital object, must also publish the context of the object
- Use collections to organize objects
  - Meta-data to define the structure of the collection
  - When publishing a collection, must also publish the information needed to organize the collection
- Use presentation context to control access
  - Meta-data to define structure of presentation

Persistent Object Preservation for Archives (POP-A)
Electronic Records Archives: Conceptual View
[Diagram labels] Accessioning Workbench; Archival Repository; Reference Workbench; GSI; SRB/EMCAT; MIX; Tapes; Disks; Internet; Records Schedules; Archival Research Catalog; Order Fulfillment System; Accession; Collection; Verify Collection; Wrap & Containerize; Describe; Metadata; Collection Rebuild; Query; Present
Legend:
- EMCAT: Extensible Meta-data Catalog
- GSI: Grid Security Infrastructure
- MIX: Mediation of Information using XML
- SRB: Storage Resource Broker

Collections Studied

Collection                      Source
E-Mail Postings                 SDSC
Tiger/Line 92                   Census
104th Congress Bills            House
105th Congress Bills            House
Electronic Archive Project      NARA
Combat Area Casualties File     NARA
Patent Data                     USPTO
Image Collection (AMICO)        CDL Calif.
Joint Interoperability Test     Defense

E-mail Postings Demo

NARA_article_begin :
Path: news.sdsc.edu!newshub.csu.net!newshub.sdsu.edu!newsfeed.berkeley.edu!news.cis.ohiostate.edu!news.rootsweb.com!rootsweb-gw
From: Casivers@aol.com
Newsgroups: soc.genealogy.hispanic
Subject: Passenger Lists for Ships from Spain To Cuba
Date: 22 Mar 1999 16:20:37 -0800
Organization: RootsWeb Genealogical Data Cooperative
Lines: 7
Message-ID: <2376321.36f6de03@aol.com>
NNTP-Posting-Host: localhost
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Trace: bl-1.rootsweb.com 922148437 3147 127.0.0.1 (23 Mar 1999 00:20:37 GMT)
X-Complaints-To: usenet@news.rootsweb.com
NNTP-Posting-Date: 23 Mar 1999 00:20:37 GMT
Xref: news.sdsc.edu soc.genealogy.hispanic:3156

Does anyone know where I can get passengers lists for ships that transported Spaniards to Cuba circa 1860's? Any help would be appreciated.
Thanks,
Cheryl Sanchez-Sivers
NARA_article_end :
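One way to picture ingesting such a posting is to cut the article out from between its `NARA_article_begin :` and `NARA_article_end :` markers, parse the headers, and wrap the result in XML. The sketch below does this with Python's standard `email` and `xml.etree.ElementTree` modules. The XML element names (`article`, `headers`, `header`, `body`) and the sample message are assumptions made for illustration; the slides do not show the tagging actually used.

```python
from email import message_from_string
import xml.etree.ElementTree as ET

BEGIN, END = "NARA_article_begin :", "NARA_article_end :"

def article_to_xml(raw):
    """Wrap one delimited news article as an XML element (hypothetical tag names)."""
    start = raw.index(BEGIN) + len(BEGIN)
    end = raw.index(END, start)
    # Headers run up to the first blank line; the rest is the message body.
    msg = message_from_string(raw[start:end].strip() + "\n")
    article = ET.Element("article")
    headers = ET.SubElement(article, "headers")
    for name, value in msg.items():
        ET.SubElement(headers, "header", name=name).text = value
    ET.SubElement(article, "body").text = msg.get_payload().strip()
    return article

# Invented sample posting, shortened from the layout shown above.
sample = """NARA_article_begin :
From: someone@example.org
Newsgroups: soc.genealogy.hispanic
Subject: Passenger lists
Lines: 1

Does anyone know where to find passenger lists?
NARA_article_end :
"""

xml_article = article_to_xml(sample)
print(ET.tostring(xml_article, encoding="unicode"))
```

Keeping every header as a named element, rather than discarding the ones that look unimportant, preserves the context of the record along with its content.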

E-mail Postings Demo
http://srb.npaci.edu
40 million E-mail records: ingested, archived, dynamically rebuilt within a month

NHPRC: National Historical Publications & Records Commission
Methodologies for the Long-Term Preservation of and Access to Software-Dependent Electronic Records
Funded January 1, 2000; a 3-year project

Introduction
A project funded by the National Historical Publications and Records Commission (NHPRC) to:
- conduct research on the long-term preservation of and access to software-dependent data objects (SD-DO), and
- develop prototypes that will lead to the creation of useful tools for archivists to preserve and provide access to electronic records over the long term.

Goals
The organizing principles we adhere to are:
- electronic records need to be input as infrastructure independent digital objects (II-DO)
- context information about relationships (collection information) needs to be maintained
- the structure of a collection can be derived through an adaptive process
- new electronic records can then be validated against the collection structure
- error handling mechanisms can be developed
- mechanisms to handle the natural evolution of a collection's structure must be developed

Issues Investigated
- II-DO creation based on the analysis of text, compound, and spatial E-records
- II-DO creation & management automation
- Archivists Workbench (AW) framework: prototype of infrastructure independent management
- Issues: robustness, structural change, scalability

Approach
- Use of markup standards (SGML, XML, ...)
- XML advantages:
  - allows us to model SD-DO as semistructured information
  - more flexible
  - tolerates structural variations
  - allows for a common information interchange format across the WWW

Plan of Work
Research key functions of an AW using different classes of SD-DO:
- textual records: ASCII, US Congress, HTML
- compound records: E-mail, PDF, Word, Excel
- spatial records: GIS data
Key functionality includes:
- Ingestion
- Structural inference (DTD creation)
- II-DO validation
- Error handling
- DTD evolution
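The sequence of structural inference, validation, error handling, and evolution can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the project's method. The inferred "structure" is only a map from each element to the child tags seen under it, whereas a real DTD also records order and cardinality. The function names and the sample bill records are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def infer_structure(docs):
    """Structural inference: map each element tag to the child tags seen under it."""
    structure = defaultdict(set)
    for text in docs:
        for elem in ET.fromstring(text).iter():
            structure[elem.tag].update(child.tag for child in elem)
    return dict(structure)

def validate(text, structure):
    """Validation: list (parent, child) pairs the inferred structure does not allow."""
    errors = []
    for elem in ET.fromstring(text).iter():
        allowed = structure.get(elem.tag)
        for child in elem:
            if allowed is None or child.tag not in allowed:
                errors.append((elem.tag, child.tag))
    return errors

def evolve(structure, errors):
    """Evolution: accept the reported variations by widening the structure."""
    updated = {tag: set(children) for tag, children in structure.items()}
    for parent, child in errors:
        updated.setdefault(parent, set()).add(child)
        updated.setdefault(child, set())
    return updated

# Invented sample collection and a new record with an unseen element.
collection = [
    "<bill><title>A</title><sponsor>B</sponsor></bill>",
    "<bill><title>C</title></bill>",
]
structure = infer_structure(collection)
new_record = "<bill><title>D</title><amendment>E</amendment></bill>"
errors = validate(new_record, structure)
evolved = evolve(structure, errors)
print(errors, validate(new_record, evolved))
```

In this sketch error handling amounts to reporting the offending pairs; whether a variation is rejected or folded into the structure is left as a decision for the archivist.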

Major Tasks

                            a. text   b. compound   c. spatial
I.   Input                  I.a.      I.b.          I.c.
II.  DTD Creation           II.a.     II.b.         II.c.
III. Document Validation    III.a.    III.b.        III.c.
IV.  Error Handling         IV.a.     IV.b.         IV.c.
V.   DTD Evolution          V.a.      V.b.          V.c.

Three-year Timeline

Technology Demos
Year 1, Year 2, Year 3: AW for:
Deliverables:
- text without evolution
- html without evolution
- text with DTD evolution
- compound docs without DTD evolution
- preliminary error handling
- ingestion of spatial objects
- all types of information

Objectives: Advisory Board & SDSC
- exchange information
- develop a working relationship
Feedback of interest:
- attributes for long-term preservation of SD-DO
- relationships to be maintained across DO & collections
- sources of data to drive project (evolution, ...)
- usability of tools
- dissemination of results in the archival and records management community

axiomap: Application of XML for Interactive Online Mapping

Spatial XML Markup Languages
- Metadata standards
  - FGDC / XML DTD
  - ANZMETA / XML DTD
- Geography Markup Language (GML) 1.0: OGC Working Draft, 17-Jan-2000
  - Web Mapping Testbed (WMT): NIMA, USACoE, FGDC, NASA, USDA, USGS...
  - Digital Earth (www.digitalearth.gov)
- AXL (ArcXML): pre-release, part of ESRI ArcIMS, 31-Jan-2000

GML
- GML: XML specification to encode geographic information
- For both data storage & data transport
- Initial release deals with OGC Simple Features: vector geodata, e.g. digital map info (streets, population, land use zones, property lines, watersheds, etc.)
- GML is not concerned with the visualization of geographic features (drawing of maps)
[Diagram labels] GML in XML; Direct rendering; Graphic format; Transformation into a vector graphics rendering format (SVG, VML, VRML); Direct routing w.o. viz.; Numerical model
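The transformation path from encoded geometry to a vector graphics rendering format can be illustrated with a short Python sketch that turns a coordinate list into an SVG `polyline`. The input is a simplified, hypothetical GML-like fragment and does not follow the GML 1.0 working draft schema. The `Street` element, the `height` parameter, and the styling are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical GML-like fragment: "x,y" pairs separated by spaces.
gml_fragment = """
<Street name="Main St">
  <coordinates>0,0 10,5 20,5</coordinates>
</Street>
"""

def feature_to_svg(gml_text, height=10):
    """Render one line feature as an SVG polyline string."""
    feature = ET.fromstring(gml_text)
    pairs = [p.split(",") for p in feature.findtext("coordinates").split()]
    # SVG's y axis points down, so flip y against an assumed drawing height.
    points = " ".join(f"{float(x):g},{height - float(y):g}" for x, y in pairs)
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg")
    ET.SubElement(svg, "polyline", points=points, fill="none", stroke="black")
    return ET.tostring(svg, encoding="unicode")

svg_text = feature_to_svg(gml_fragment)
print(svg_text)
```

The geometry stays in the data encoding while styling choices such as stroke color live only in the transformation, which mirrors the separation the slide draws between GML and visualization.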

Web Vector Graphics: VML & SVG
- April 98: Adobe, IBM, Netscape, Sun: PGML (Precision Graphics Markup Language)
- May 98: Hewlett-Packard, Macromedia, Microsoft, Visio: VML (Vector Markup Language)
- ==> October 98: SVG working group formed
- SVG contains six predefined objects: rectangle, circle, ellipse, polyline, polygon, & line
- ==> Adobe: Illustrator, Photoshop, GoLive

Selected VML Demos
http://www.qolsandiego.net/maproom/northcity/main.htm
http://www.elzaresearch.com/gis/primaries/

I2T: An Information Integration Testbed for Digital Government
To be funded by the NSF Digital Government program at $720K for 3 years (unofficial)
Personnel
- Chaitan Baru, SDSC, PI
- Yannis Papakonstantinou, CSE/UCSD, co-PI
- Amarnath Gupta, SDSC, co-PI
- Bob Hollebeek, U.Penn, co-PI
- Richard Rockwell, U.Michigan, co-PI
- Bertram Ludaescher, Richard Marciano, Ilya Zaslavsky, Senior Personnel

Government Partners
- U.S. Census Bureau
- NARA
- USGS
- Department of Community and Economic Development, State of Penna
- Department of Labor and Industry, State of Penna
- SANDAG (San Diego Association of Governments), San Diego County

Research Topics
- Extending the MIX system to support integration and mediation of geospatial information sources
  - Modeling of GIS information in XML
  - Spatial extensions to XML query languages
  - Dealing with heterogeneity in accuracy, resolution, feature space, schema
- XML-based representations of GIS to support long-term archiving

Research Topics
- Tools for DTD-guided wrapping of unstructured text
  - Specifically, investigate conversion of Census codebooks to XML based on the Data Documentation Initiative (DDI) DTD
- Novel applications of the I2T infrastructure
  - Census Integrated Statistical Information system (FERRETT)
  - Sociology workbench: access to remote survey information. Ability to read DDI-encoded survey codebooks and other XML data sets
  - Distributed decision support and data mining applications

The I2T Testbed
- Users: government, research, education, general public
- Research partners: SDSC; UCSD CSE; U.Penn, NSCP; U.Michigan, ICPSR
- Applications: Census Integrated Info. Systems, BBQ GUI, Data Mining, Archive Access, Sociology Workbench
- Mediation: Spatial Mediator, Mediator, Conflation Mediator, Data Analysis and other plug-ins
- Agency information sources: Census, USGS, NARA, PA State, SANDAG, ... (Federal, State, Local); various surveys
- Wrapped sources from multiple government levels

The AMICO Digital Library Project
http://www.amico.org
http://www.npaci.edu/dice/amico
Art Museum Image Consortium
- 55,146 objects (750 MB)
- 53,763 thumbnail images (319 MB)
- 57,609 full tiff images (180 GB)

AMICO
Consortium of 26 museums

AGO_   Art Gallery of Ontario
AIC_   Art Institute of Chicago
AKAG   Albright-Knox Art Gallery, Buffalo, NY
ASIA   Asia Society
BMFA   Boston Museum of Fine Arts
CCP_   Center for Creative Photography, U. Arizona
CMA_   The Cleveland Museum of Art
DMCC   Davis Museum and Cultural Center, Wellesley College, MA
FASF   Fine Arts Museums of San Francisco
GEH_   George Eastman House, Rochester, NY
JPGM   J. Paul Getty Museum, Los Angeles, CA
LACM   Los Angeles County Museum of Art
LOC_   Library of Congress
MACM   Musée d'art contemporain de Montréal
MBAM   Musée des beaux-arts de Montréal
MCAS   Museum of Contemporary Art, San Diego
MIA_   The Minneapolis Institute of Arts
MMA_   The Metropolitan Museum of Art
NGC_   National Gallery of Canada, Ottawa/Ontario
NMAA   National Museum of American Art, Smithsonian Institution
PMA_   Philadelphia Museum of Art
SFMO   San Francisco Museum of Modern Art
SJMA   San Jose Museum of Art
TFC_   The Frick Collection, NY
WAC_   Walker Art Center, Minneapolis, MN
WMAA   Whitney Museum of American Art, NY

Raw Metadata Structure

- catdata: 8 files
    16,604  year1.d990429
    14,430  year1.d990512
    22,938  year1.d990520
    54,303  year1.d990627
        15  year1.d990708
    54,298  year1.d990731
        93  year1.d990806
       657  year1.d990813

- tiffmetadata: 23 files
     2963  AGO_.tiffmetadata.txt
     1016  AIC_.tiffmetadata.txt
      894  AKAG.tiffmetadata.txt
      187  ASIA.tiffmetadata.txt
     7591  BMFA.tiffmetadata.txt
      401  CCP_.tiffmetadata.txt
     1455  CMA_.tiffmetadata.txt
       56  DCMC.tiffmetadata.txt
      470  DMCC.tiffmetadata.txt
    10141  FASF.tiffmetadata.txt
     2137  GEH_.tiffmetadata.txt
     1459  JPGM.tiffmetadata.txt
     1013  LACM.tiffmetadata.txt
    20654  LOC_.tiffmetadata.txt
       86  MACM.tiffmetadata.txt
       50  MBAM.tiffmetadata.txt
       31  MCAS.tiffmetadata.txt
     1440  MIA_.tiffmetadata.txt
      550  MMA_.tiffmetadata.txt
     1507  NGC_.tiffmetadata.txt
     1416  NMAA.tiffmetadata.txt
      154  PMA_.tiffmetadata.txt
      158  SFMO.tiffmetadata.txt
       86  SJMA.tiffmetadata.txt
       68  Such.tiffmetadata.txt
      396  WAC_.tiffmetadata.txt
    37069  replacements.txt
    57499  replacements2.txt

- thumbmeta: 52,689 files
    AGO_.1016.25_thum.met*
    AGO_.1016.32_thum.met*
    AGO_.1016.39_thum.met*
    ...
    WAC_.994C_thum.met
    WAC_.996C_thum.met
    WAC_.998C_thum.met
    WAC_.99C_thum.met*
    WMAA.1557_56_thum.met
    WMAA.31_426_thum.met

Tape Data Structure
Six DLT7000 tapes. Each tape contains a tar file of TIFF images of approximately 30 GB.
- Tape 1: AGO_, BMFA, LACM
- Tape 2: CMA_, LOC_, AIC_, AKAG, ASIA
- Tape 3: CCP_, GEH_, MACM, MCAS, MMA_, NMAA, SFMO, TFC, WMAA, DMCC, JPGM, MBAM, MIA_, NGC_, PMA_, SJMA, WAC_
- Tape 4: FASF
- Tape 5: FASF
- Tape 6: FASF

AMICO Metadata Conversion Steps
1. Tape read: raw metadata files
   - catdata (8 files)
   - tiffmetadata (23 files)
   - thumbmeta (52,689 files)
2. Merge: consolidated metadata files
   - 1 catdata
   - 1 tiffmetadata
   - 1 thumbmeta
3. Convert to XML: 3 XML files
   - 1 catdata
   - 1 tiffmetadata
   - 1 thumbmeta
4. Split by museum: 1 XML file per museum
5. Split by file size: multiple XML files per museum
6. Split by machine: multiple museum XML files per machine
7. excelon Dump & Load Utility: load into the excelon Data Servers
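The two splitting steps can be sketched as follows, assuming a consolidated catalog whose records carry a museum code. This is not the project's tooling. The `catalog` and `record` element names, the `museum` attribute, the output file naming, and the use of a record count (`max_records`) as a stand-in for a file-size limit are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def split_by_museum(catalog_xml, max_records=2):
    """Group <record> elements by museum code, then chunk each group.

    Returns {file name: XML text}. A record count stands in for the
    file-size limit of the size-based split.
    """
    root = ET.fromstring(catalog_xml)
    groups = defaultdict(list)
    for record in root.findall("record"):
        groups[record.get("museum")].append(record)
    files = {}
    for museum, records in groups.items():
        for i in range(0, len(records), max_records):
            part = ET.Element("catalog", museum=museum)
            part.extend(records[i:i + max_records])
            name = f"{museum}.{i // max_records + 1}.xml"
            files[name] = ET.tostring(part, encoding="unicode")
    return files

# Invented consolidated catalog with four records from two museums.
catalog = (
    '<catalog>'
    '<record museum="AGO_" id="1"/><record museum="AGO_" id="2"/>'
    '<record museum="AGO_" id="3"/><record museum="WAC_" id="4"/>'
    '</catalog>'
)
files = split_by_museum(catalog)
print(sorted(files))
```

Splitting first by museum and then by size keeps each output file small enough to load independently, so the later per-machine distribution reduces to assigning whole files to machines.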

California Digital Library (CDL) Prototype
The Art Museum Image Consortium (AMICO)
[Diagram labels] BBQ Interface (slide carousel interface); XMAS query; XML doc; MIXm; View based on AMICO DTD; Wrapper; MARC Database; AMICO XML Database (two instances); Request for image (X.509); tif file; SRB/MCAT; HPSS
XMAS: XML Matching and Structuring query language

Architecture for interactive archives
[Diagram labels] excelon; Oracle 8i; DB2; Metadata server; Storage Area Network (SAN); Network attached disk; TB capacity; HPSS

Current catalog metadata count (per museum)

Catalog & Image count (per museum)

Average tiff size in MB (per museum)

Community of Science, Inc.
www.cos.com
- Specifying XML DTD standards for Current Research Information Systems (CRIS)
- Investigate tools for:
  - wrapping unstructured text
  - mapping from source formats to CRIS DTD
- Enable creation of a warehouse of research information and enable e-commerce

ESRI
- ESRI develops the ArcInfo, ArcView, and ArcIMS GIS products
- Project involves evaluating the ArcXML (AXL) standard
  - Keeping AXL abreast of developments in the Geography Markup Language (GML) standard proposed by the OpenGIS Consortium
  - Identifying issues in mapping AXL to other XML standards, e.g. WAP (Wireless Application Protocol)

NEES
- Proposal to NSF's Networked Earthquake Engineering Simulation (NEES) program
- Develop NeesML, an XML-based standard for representing earthquake engineering simulation metadata and data
- NeesML will facilitate the creation of a NEES Curated Database, a warehouse of earthquake engineering simulation information