
CU Boulder Research Cyberinfrastructure Plan [1]

Advanced information technology provides for the creation of robust new tools, organized and coordinated seamlessly, allowing the free flow of information, ideas, and results. Fully realizing this goal requires resources that span from the individual faculty member, through medium-scale campus-layer resources, to large national centers such as the National Science Foundation-funded XSEDE and the Department of Energy Leadership Computing Facilities. This complex mix of advanced computing resources, people, and capabilities is sometimes referred to as cyberinfrastructure (CI), which we define as follows:

Cyberinfrastructure consists of computational systems, data and information management, advanced instruments, visualization environments, and people, all linked together by software and advanced networks to improve scholarly productivity and enable knowledge breakthroughs and discoveries not otherwise possible. [2]

The mission of the research computing group is to provide leadership in developing, deploying, and operating such an integrated CI, allowing CU Boulder to achieve further preeminence as a research university.

Research Computing Data Center Facility

A centrally managed data center will reduce campus energy costs and carbon footprint by offering a more energy-efficient alternative to locally maintained server farms (currently estimated at about 40 throughout CU Boulder), as well as house mission-critical IT infrastructure. In turn, this will recapture office and lab space for teaching and research.

Requirements for the data center:
1. Support high power density. Standard server racks now draw about 30 kW, and the power density can double or triple for GPU-based racks (see the power sketch following this list).
2. Support next-generation power infrastructure, e.g., 480 VAC and possibly DC systems.
3. Provide electrical failover and redundancy.
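As a rough illustration of how these requirements interact, the sketch below works through the rack-power arithmetic. Only the 30 kW figure and the double-to-triple GPU multiplier come from the requirements above; the rack counts, the PUE value, and the function name are illustrative assumptions.

```python
# Back-of-envelope power arithmetic for requirement 1. Rack counts and the
# PUE value are illustrative assumptions, not figures from the plan.
CPU_RACK_KW = 30                  # "standard server racks now draw about 30 kW"
GPU_RACK_KW = 3 * CPU_RACK_KW     # density "can double or triple" for GPU racks

def facility_load_kw(cpu_racks: int, gpu_racks: int, pue: float = 1.3) -> float:
    """Total facility draw: IT load scaled by power usage effectiveness (PUE),
    which folds in cooling and distribution overhead."""
    it_load_kw = cpu_racks * CPU_RACK_KW + gpu_racks * GPU_RACK_KW
    return it_load_kw * pue

# Hypothetical row: 10 standard racks plus 2 GPU racks.
print(f"{facility_load_kw(10, 2):.0f} kW")   # -> 624 kW
```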

Centralized Research Storage

Personnel (1 FTE): 1 Storage Engineer
Initial storage investment: 2 PB raw (1 PB at each location), about $1,000,000, funded through the NSF MRI PetaLibrary grant
Servers: $80,000 for CIFS and NFS gateway servers

Background and technical description

Reliable, high-performance, professionally managed raw data storage is a fundamental building block of CU Boulder's research cyberinfrastructure. Simply stated, while users of computing (e.g., via their own clusters, through a campus condo cluster, an NSF computing center, or commercial clouds) can tolerate periods of downtime or limited availability, raw storage must be highly reliable and persistently available.

The research computing group is currently working on an implementation of a scalable (multi-petabyte) storage infrastructure with different service levels, since different researchers can tolerate different levels of risk in storing their data. We will provide the following services:
- Non-replicated, RAID-protected storage
- Replicated storage
- Replicated storage with tape backup

At the most fundamental level, our storage will provide access to the following types of clients:
- Authenticated workstations and laptops on the CU network
- Laboratory- and department-owned clusters
- Primary/secondary storage for data-intensive instruments
- Higher-level data preservation services (see the section on data management)
- Very high performance shared resource facilities, such as the high-performance computing facility

It is clear that data storage requirements are not uniform across campus, researchers, or labs. CU research storage with stable base funding provides the critical infrastructure, know-how, monitoring, and maintenance for a defined volume of storage that can grow over time. If, for example, each faculty member were allocated 1 TB of long-term storage this year, at the end of 10 years that number could grow to 20 TB per faculty member. However, this does not answer the question: how does a single researcher store 100 TB today?

To solve this issue, we will operate CU research storage as a storage condo. By that we mean the following: researchers could write extramural grants to fund the acquisition of physical storage to meet their needs, plus a fraction of the administration, security, and monitoring costs that scale above the basic personnel costs outlined above. It would be part of the final governance to determine appropriate cost-recovery rates, but the research computing advisory committee will consider at least two different cost scenarios:
- Condo storage (above the basic allocation) has a lifetime limited to the warranty of the project-purchased storage.
- Condo storage has a lifetime equivalent to the core storage (unlimited).

In other words, costing should be calculated on the basis of both a limited lifetime and an infinite lifetime for long-term preservation of large-scale research data; the sketch below works through both scenarios.
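The arithmetic implied above can be made concrete. In the sketch below, only the 1 TB starting allocation, the 20 TB ten-year figure, and the two lifetime scenarios come from the plan; the dollar figure, warranty period, and planning horizon are illustrative assumptions, since actual cost-recovery rates are deliberately left to future governance.

```python
import math

# Implied growth: 1 TB per faculty member today, 20 TB in 10 years.
growth = (20 / 1) ** (1 / 10) - 1
print(f"implied storage growth: {growth:.1%} per year")        # ~34.9%

# The two condo cost scenarios, for a hypothetical $50,000 purchase.
purchase, warranty_yrs, horizon_yrs = 50_000, 5, 15            # assumptions
scenario1_total = purchase                                     # storage retires with its warranty
generations = math.ceil(horizon_yrs / warranty_yrs)            # 3 hardware generations
scenario2_total = purchase * generations                       # refreshed indefinitely, like core storage
print(f"limited lifetime:   ${scenario1_total:,} (service ends after {warranty_yrs} years)")
print(f"unlimited lifetime: ${scenario2_total:,} over {horizon_yrs} years")
```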

Research Data Management (institutional stewardship of research data)

Background

Members of the CU Boulder research community routinely produce large amounts of data that need to be stored, analyzed, and preserved. These research data sets and their derivative output (e.g., publications, visualizations, etc.) represent the intellectual capital of the University; they have inherent and enduring value and must be preserved and made readily accessible for reuse by future researchers.

Today's interdisciplinary research challenges cannot be addressed without the ability to combine data from disparate disciplines. Researchers need to know: (1) what relevant data exist, (2) how to retrieve them, (3) how to combine them, and (4) how to mine and analyze them using the latest data mining, analysis, and visualization tools. Granting agencies understand this fundamental scientific need and are increasingly making it a condition of funding that researchers have a plan for preserving their data and for making it discoverable and available for reuse by other researchers. To keep CU Boulder competitive, the research computing group will develop baseline data services that respond to these new realities.

The proposed CU Research Data Archive is a suite of three core services designed to support the needs of modern researchers:
- Data Analysis and Visualization
- Data Curation
- Data Discovery and Integration

These services complement each other and provide a horizontal stack of data services that covers both active contemporary use and preservation for future use.

Data Analysis and Visualization

Baseline data analysis and visualization services will be provided to all CU researchers as one of the core services of the CU Research Data Archive. Discovering the knowledge that is buried in large data collections is a team effort that includes the researchers who created or gathered the data, the staff who host the data, and the specialists who can analyze and visualize the data. There are several aspects to this work:
- Data migration, upload, metadata creation, and management: bringing data into active disk areas where they can be accessed, synthesized, and used (a minimal metadata record example appears at the end of this subsection).
- Interface creation: adding front ends for either the data owners or their designated audiences to access and manipulate the data.
- Data analysis and mining: providing services that use advanced statistical and database processes to create usable data sets out of raw data.
- Database/management tools selection (Oracle, MySQL, SRB, etc.): helping data owners and users understand the options at their disposal and choose the most appropriate tools for their needs.
- Distributed data management: working with data owners and researchers who have data scattered across different sources and in different locations, synthesizing them into a more coherent working environment.
- Database application tuning and database optimization: providing ongoing advanced database support for a myriad of activities.
- Schema design and SQL query tuning: helping with advanced data searching services for a wide variety of data.

These tasks are all necessarily active in nature, and involve researchers and service providers working directly with the data on a nearly continuous basis. Only by doing this can they provide users with the ability to organize, process, and manage large quantities of research data into collections for data-driven discovery. The visualization services at CU will provide users with a wide range of tools and services to aid in their scientific research.
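As a minimal illustration of the metadata-creation task listed above, the sketch below shows the kind of descriptive record such a service might maintain for an incoming data set. The field names (loosely modeled on Dublin Core) and all values are assumptions for illustration; the plan does not prescribe a schema.

```python
# Hypothetical descriptive record created when a data set is staged to
# active disk. Field names are illustrative, loosely following Dublin Core.
record = {
    "title": "Example sensor time series",            # hypothetical data set
    "creator": ["Researcher, A."],
    "description": "Raw instrument output staged for analysis and reuse.",
    "subject": ["atmospheric science"],
    "format": "application/x-netcdf",
    "date": "2012-01-15",
    "identifier": None,    # persistent ID assigned later by the discovery portal
    "rights": "Campus-restricted pending publication",
}
```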

Data Curation

Data curation encompasses the following three concepts:
- Curation: the activity of managing and promoting the use of data from their creation, to ensure they are fit for contemporary use and available for discovery and reuse. For dynamic data sets this may mean continuous updating or monitoring to keep them fit for future research. Higher levels of curation can involve maintaining links and annotations with published materials.
- Archiving: a curation activity that ensures that data are properly selected, appraised, stored, and made accessible, and that the logical and physical integrity of the data are maintained over time, including security and authenticity (a fixity-checking sketch appears at the end of this section).
- Preservation: an archiving activity in which specific items or collections are maintained over time so that they can be accessed and remain viable in subsequent technology environments.

It is important to note that archiving and preservation are subsets of the larger curation process, which is a much broader, planned, and interactive process. Data curation is critically important for a research institution because it provides two vital services needed to ensure data longevity:
- Data are not merely stored, but are preserved to overcome the technical obsolescence inherent and unavoidable in any storage system.
- Data are documented in such a way that they can be linked in scientific publications and meet the requirements of funding agencies.

Staff of the CU Libraries and CU research computing would jointly provide the curation service component of the CU Research Data Archive. The CU Libraries would provide curatorial oversight and bibliographic control and integration services. CU research computing staff would provide the back-end technology services needed to actively maintain the data and the storage systems holding them. Staff from both organizations will provide the metadata services necessary to ensure that data remain discoverable and accessible.
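A minimal sketch of the fixity checking that underpins the integrity guarantees described under Archiving above: each file's digest is recorded in a manifest and periodically re-verified. The manifest format, function names, and chunk size are illustrative assumptions; the plan does not name a specific tool.

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Stream a file through SHA-256 so multi-terabyte data sets never
    need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # 1 MiB at a time
            h.update(chunk)
    return h.hexdigest()

def verify(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the files whose current digest no longer matches the manifest,
    i.e., candidates for repair from a replica."""
    return [name for name, digest in manifest.items()
            if sha256(root / name) != digest]
```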

The data itself would be housed on campus, in the CU campus storage facility. It should be noted that this is merely the first level of storage needed. For true long-term preservation, it is essential to plan for storage that is not on campus; otherwise the data remain dependent on a single point of failure and are thus highly vulnerable.

Baseline investments are required to establish geographically distributed replicas of data. Data are inextricably dependent on a mediating technological infrastructure and are subject to loss from environmental, organizational, or technological disruptions. It is therefore imperative that vital campus research data be replicated at no fewer than two remote sites that are geographically, organizationally, and technically independent of each other, and that the entire enterprise be anchored in a reliable and predictable baseline source of revenue, since even a temporary interruption of proactive curation activities can lead to irreparable loss. For this reason, another layer of service is required that stores exact duplicates of the data off site. Research computing is working with NCAR to provide this type of storage.

Data Discovery and Integration

The Research Data Depot would provide a portal to facilitate the discovery of, and access to, the research data held in the Research Data Archive. This service would include facilities for the registration and description of collections, services to support the submission of collections, and assistance with the use, reuse, and amalgamation of data sets for further research. The portal would assign persistent identifiers to each of the data collections; provide the ability to search across all registered collections; link the data collections to their authors, to the resultant analyses and visualizations, and to their published output through the integration of portal content with traditional library discovery tools and databases. Where appropriate, the contents of the CU Research Data Archive would be offered for harvesting and crawling by external discovery tools such as Google or disciplinary content aggregators.
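The sketch below illustrates two of the portal behaviors just described: assigning a persistent identifier to a registered collection and linking it to its authors and outputs. The in-memory registry, identifier scheme, function names, and example DOI are all assumptions for illustration; a production portal would use Handles or DOIs backed by a real database.

```python
import uuid

registry: dict[str, dict] = {}   # stand-in for the portal's collection registry

def register_collection(title: str, authors: list[str]) -> str:
    """Register a collection and assign it a persistent identifier."""
    pid = f"cu-rda:{uuid.uuid4()}"   # stand-in for a Handle or DOI
    registry[pid] = {"title": title, "authors": authors,
                     "analyses": [], "publications": []}
    return pid

def link_output(pid: str, kind: str, reference: str) -> None:
    """Attach a derived analysis, visualization, or publication to a collection."""
    registry[pid][kind].append(reference)

# Hypothetical usage:
pid = register_collection("Boulder Creek hydrology data", ["Researcher, A."])
link_output(pid, "publications", "doi:10.xxxx/example")   # hypothetical DOI
```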

Research Computing Network

Personnel: 1 FTE in support of the network.

Background

The current CU Boulder network is a 1 Gb/s (gigabit, i.e., 10^9 bits per second) fiber core with Cat 5 or higher wiring throughout campus. We have created a 10 Gb/s research computing core to connect the supercomputer to storage, bring individual dedicated 10 Gb/s circuits to various locations as needed, and create a 40 Gb/s circuit between the supercomputer and the storage to be located at NCAR. CU Boulder participates in I2 (Internet2, the higher-education, government, and vendor research computing consortium) and is an active member of regional gigapops (high-speed networking points of presence, or PoPs) and other networks; see the descriptions in the following subsections. The NSO will be able to get access to fiber to support all connectivity needs within the Boulder area and to the national networking infrastructure.

CU Boulder's research computing group will provide all campus researchers with a leading-edge network that meets their needs and facilitates collaboration, high-performance data exchange, access to colocation facilities, remote mounts to storage, and real-time communications. To that end, we are currently building a Research Cyberinfrastructure Network (RCN) that will be used by every research lab whose requirements go beyond the standard production network. The RCN will complement the standard production network and will be designed for ultra-high performance. It should be the first campus environment for implementing newer technologies before they are adopted into the standard production network. The funding and access philosophy should aim to encourage use of the network.

Computational Resources

Supercomputer

The Dell supercomputer is currently listed as the 31st-fastest computer in the world on top500.org. The machine was set up in Texas, where Michael Obert ran the High Performance Linpack benchmark, a program that solves a very large system of equations, to obtain a performance of 152.2 teraflops (one teraflop is a trillion floating-point operations per second). The computer consists of 342 blade chassis of cloud edge servers; each chassis holds 4 servers, for a total of 1,368 servers. Each server has 2 sockets with six CPU cores per socket, for a total of 16,416 cores (1,368 servers x 12 cores).

Condo model

We are working on a pilot project with the Institute of Cognitive Sciences (ICS). The current approach is to incentivize ICS to shut down its existing resource in the CINC building and move to a condo cluster managed by Research Computing (RC). We are developing with ICS the policies that will move researchers from their local resources to the centralized RC resources.

Notes

[1] The Blueprint for the Digital University (http://idi.ucsd.edu/_files/blueprint.pdf) was used to guide this document.
[2] Developing Coherent Cyberinfrastructure from Campus to the National Facilities (http://casc.org/papers/epo0906.pdf).