CU Boulder Research Cyberinfrastructure Plan [1]

Advanced information technology provides for the creation of robust new tools, organized and coordinated seamlessly, allowing the free flow of information, ideas, and results. Fully realizing this goal requires resources that span from the individual faculty member, through medium-scale campus-layer resources, to large national centers such as the National Science Foundation-funded XSEDE and the Department of Energy Leadership Computing Facilities. This complex mix of advanced computing resources, people, and capabilities is sometimes referred to as cyberinfrastructure (CI), which we define as follows: Cyberinfrastructure consists of computational systems, data and information management, advanced instruments, visualization environments, and people, all linked together by software and advanced networks to improve scholarly productivity and enable knowledge breakthroughs and discoveries not otherwise possible [2]. The mission of the research computing group is to provide leadership in developing, deploying, and operating such an integrated CI to allow CU Boulder to achieve further preeminence as a research university.

Research Computing Data Center Facility

A centrally managed data center will reduce campus energy costs and carbon footprint by offering a more energy-efficient alternative to locally maintained server farms (currently estimated at about 40 throughout CU-Boulder), as well as house mission-critical IT infrastructure. In turn, this will recapture office and lab space for teaching and research.

Requirements for the data center:
1. Support high power density. Standard server racks now draw about 30 kW, and power density can double or triple for GPU-based racks.
2. Support next-generation power infrastructure, e.g., 480 VAC and possibly DC systems.
3. Electrical failover and redundancy.

Centralized Research Storage

Personnel (1 FTE): 1 Storage Engineer

[1] Used Blueprint for the Digital University (http://idi.ucsd.edu/_files/blueprint.pdf) to guide this document.
[2] Developing Coherent Cyberinfrastructure from Campus to the National Facilities (http://casc.org/papers/epo0906.pdf)
Initial storage investment: 2 PB raw (1 PB at each location), about $1,000,000, funded through the NSF MRI-funded PetaLibrary grant.
Servers: $80,000 for CIFS and NFS gateway servers.

Background and technical description

Reliable, high-performance, professionally managed raw data storage is a fundamental building block of CU Boulder's research cyberinfrastructure. Simply stated, while users of computing (e.g., via their own clusters, through a campus condo cluster, an NSF computing center, or commercial clouds) can tolerate periods of downtime or limited availability, raw storage must be highly reliable and persistently available. The research computing group is currently working on an implementation of a scalable (multi-petabyte) storage infrastructure with different service levels. Different researchers can tolerate different levels of risk in storing their data, so we will provide the following services:
- Non-replicated, RAID-protected storage
- Replicated storage
- Replicated storage with tape backup

At the most fundamental level, our storage will provide access to the following types of clients:
- Authenticated workstations and laptops on the CU network
- Laboratory- and department-owned clusters
- Primary/secondary storage for data-intensive instruments
- Higher-level data preservation services (see the section on data management)
- Very high performance shared resource facilities, such as the high performance computing facility

It is clear that data storage requirements are not uniform across campus, researchers, or labs. CU research storage with stable base funding provides the critical infrastructure, know-how, monitoring, and maintenance for a defined volume of storage that can grow over time. If, for example, each faculty member were allocated 1 TB of long-term storage this year, at the end of 10 years that number could grow to 20 TB per faculty member. However, this does not answer the question: how does a single researcher store 100 TB today?
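The growth assumption above can be made concrete with quick arithmetic. This sketch derives the compound annual growth rate implied by going from 1 TB to 20 TB per faculty member over 10 years (the faculty headcount used for the aggregate figure is an illustrative assumption, not a number from this plan):

```python
# Rough check of the per-faculty storage growth assumption:
# 1 TB today growing to 20 TB per faculty member after 10 years.
base_tb = 1.0
target_tb = 20.0
years = 10

# Implied compound annual growth rate: base_tb * rate**years == target_tb
rate = (target_tb / base_tb) ** (1 / years)
print(f"implied annual growth: {(rate - 1) * 100:.0f}%")  # ~35% per year

# Aggregate year-10 demand for a hypothetical 1,000 faculty members
faculty = 1000  # assumption for illustration only
print(f"year-10 total: {target_tb * faculty / 1000:.0f} PB")
```

In other words, the base allocation alone implies roughly 35% annual growth in managed capacity, before any condo purchases are added on top.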
To solve this issue, we will operate CU research storage as a storage condo, by which we mean the following: researchers could write extramural grants to fund the acquisition of physical storage to meet their needs, plus a fraction of the administration, security, and monitoring costs that scale above the basic personnel costs outlined above. It would be part of the final governance to determine appropriate cost-recovery rates, but the research computing advisory committee will consider at least two cost scenarios:
- Condo storage (above the basic allocation) has a lifetime limited to the warranty of the project-purchased storage.
- Condo storage has a lifetime equivalent to the core storage (unlimited).
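The two scenarios above can be compared with a simple amortization sketch. The warranty period and planning horizon are assumptions for illustration; the $500/TB figure follows roughly from the plan's $1,000,000 for 2 PB raw:

```python
# Sketch of the two condo cost scenarios. Assumptions (not from the
# plan): 5-year hardware warranty, and hardware refreshed at the same
# price when the warranty expires.
price_per_tb = 500.0   # ~$1,000,000 / 2,000 TB raw
warranty_years = 5     # assumed warranty lifetime

# Scenario 1: warranty-limited lifetime -- a one-time purchase,
# amortized only over the warranty period.
limited_cost_per_tb_year = price_per_tb / warranty_years
print(limited_cost_per_tb_year)   # 100.0 $/TB-year

# Scenario 2: unlimited lifetime -- the same amortized cost recurs
# indefinitely as hardware is refreshed each warranty cycle.
horizon_years = 20     # example planning horizon
unlimited_total_per_tb = price_per_tb * (horizon_years / warranty_years)
print(unlimited_total_per_tb)     # 2000.0 $/TB over 20 years
```

The point of the sketch: the per-TB-year rate is the same in both scenarios, but the unlimited-lifetime scenario commits the service to an open-ended refresh obligation that must be priced into cost recovery.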
In other words, costing should be calculated on the basis of both limited lifetime and infinite lifetime for long-term preservation of large-scale research data.

Research Data Management (institutional stewardship of research data)

Background

Members of the CU Boulder research community routinely produce large amounts of data that need to be stored, analyzed, and preserved. These research data sets and their derivative output (e.g., publications, visualizations) represent the intellectual capital of the University; they have inherent and enduring value and must be preserved and made readily accessible for reuse by future researchers. Today's interdisciplinary research challenges cannot be addressed without the ability to combine data from disparate disciplines. Researchers need to know: (1) what relevant data exist, (2) how to retrieve them, (3) how to combine them, and (4) how to mine and analyze them using the latest data mining, analysis, and visualization tools. Granting agencies understand this fundamental scientific need and are increasingly making it a condition of funding that researchers have a plan for preserving their data and for making it discoverable and available for reuse by other researchers. To keep CU Boulder competitive, the research computing group will develop baseline data services that respond to these new realities. The proposed CU Research Data Archive is a suite of three core services designed to support the needs of modern researchers:
- Data Analysis and Visualization
- Data Curation
- Data Discovery and Integration

These services complement each other and provide a horizontal stack of data services that covers both active contemporary use and preservation for future use.

Data Analysis and Visualization

Baseline data analysis and visualization services will be provided to all CU researchers as one of the core services of the CU Research Data Archive.
Discovering the knowledge that is buried in large data collections is a team effort that includes the researchers who created or gathered the data, the staff who host the data, and the specialists who can analyze and visualize the data. There are several aspects to this work:
- Data migration, upload, metadata creation, and management: bringing data into active disk areas where they can be accessed, synthesized, and used.
- Interface creation: adding front ends for either the data owners or their designated audiences to access and manipulate the data.
- Data analysis and mining: providing services that use advanced statistical and database processes to create usable data sets out of raw data.
- Database/management tools selection (Oracle, MySQL, SRB, etc.): helping data owners and users understand the options at their disposal and choose the most appropriate tools for their needs.
- Distributed data management: working with data owners and researchers who have data scattered across different sources and locations, synthesizing it to form a more coherent working environment.
- Database application tuning and database optimization: providing ongoing advanced database support for a myriad of activities.
- Schema design and SQL query tuning: helping with advanced data-searching services for a wide variety of data.

These tasks are all necessarily active in nature, and involve researchers and service providers working directly with the data on a nearly continuous basis. Only by doing this can they provide users with the ability to organize, process, and manage large quantities of research data into collections for data-driven discovery. The visualization services at CU will provide users with a wide range of tools and services to aid in their scientific research.

Data Curation

Data curation encompasses the following three concepts:
- Curation: the activity of managing and promoting the use of data from their creation, to ensure they are fit for contemporary use and available for discovery and reuse. For dynamic data sets this may mean continuous updating or monitoring to keep them fit for future research. Higher levels of curation can involve maintaining links and annotations with published materials.
- Archiving: a curation activity that ensures that data are properly selected, appraised, stored, and made accessible, and that the logical and physical integrity of the data are maintained over time, including security and authenticity.
- Preservation: an archiving activity in which specific items or collections are maintained over time so that they can be accessed and remain viable in subsequent technology environments.
It is important to note that archiving and preservation are subsets of the larger curation process, which is a much broader, planned, and interactive process. Data curation is critically important for a research institution because it provides two vital services needed to ensure data longevity:
- Data are not merely stored, but are preserved to overcome the technical obsolescence inherent and unavoidable in any storage system.
- Data are documented in such a way that they can be linked in scientific publications and meet the requirements of funding agencies.

Staff of the CU Libraries and CU research computing would jointly provide the curation service component of the CU Research Data Archive. The CU Libraries would provide curatorial
oversight and bibliographic control and integration services. CU research computing staff would provide the back-end technology services needed to actively maintain the data and the storage systems holding them. Staff from both organizations will provide the metadata services necessary to ensure that data remain discoverable and accessible.

The data itself would be housed on campus, in the CU campus storage facility. It should be noted that this is merely the first level of storage needed. For true long-term preservation, it is essential to plan for storage that is not on campus; otherwise, data always depend on a single point of failure and are thus highly vulnerable. Baseline investments are required to establish geographically distributed replicas of the data. Because data are inextricably dependent on a mediating technological infrastructure, and subject to loss from environmental, organizational, or technological disruptions, vital campus research data must be replicated in at least two remote sites that are geographically, organizationally, and technically independent of each other. The entire enterprise must also be anchored in a reliable and predictable baseline source of revenue, as even a temporary interruption of proactive curation activities can lead to irreparable loss. For this reason, another layer of service is required that stores exact duplicates of the data offsite. Research computing is working with NCAR to allow for this type of storage.

Data Discovery and Integration

The Research Data Depot would provide a portal to facilitate the discovery of, and access to, the research data held in the Research Data Archive. This service would include facilities for the registration and description of collections, services to support the submission of collections, and assistance for the use, reuse, and amalgamation of data sets for further research.
The portal would assign persistent identifiers to each of the data collections, provide the ability to search across all registered collections, link the data collections to their author(s), link them to the resultant analyses and visualizations, and link them to their published output through the integration of portal content with traditional library discovery tools and databases. Where appropriate, the contents of the CU Research Data Archive would be offered for harvesting and crawling by external discovery tools, such as Google, or by disciplinary content aggregators.

Research Computing Network

Personnel: 1 FTE in support of the network.

Background
The current CU Boulder network is a 1 Gb/s (gigabit per second, i.e., 10^9 bits per second) fiber core with Cat 5 or higher wiring throughout campus. We have created a 10 Gb/s research computing core to connect the supercomputer to storage, bring individual dedicated 10 Gb/s circuits to various locations as needed, and create a 40 Gb/s circuit between the supercomputer and the storage to be located at NCAR. CU Boulder participates in I2 (Internet2, the higher education, government, and vendor research computing consortium) and is an active member of regional gigapops (high-speed networking points of presence, or PoPs) and other networks, as described in the following subsections. The NSO will be able to get access to fiber to support all connectivity needs necessary at that time within the Boulder area and to the national networking infrastructure.

CU Boulder's research computing group will provide all campus researchers with a leading-edge network that meets their needs and facilitates collaboration, high-performance data exchange, access to colocation facilities, remote mounts to storage, and real-time communications. To that end, we are currently building a Research Cyberinfrastructure Network (RCN) that will be used by every research lab whose requirements go beyond the standard production network. The RCN will complement the standard production network and will be designed for ultra-high performance. It should be the first campus environment for implementing newer technologies before they are adopted into the standard production network. Funding and access policies should aim to encourage usage of the network.

Computational Resources

Supercomputer

The Dell supercomputer is currently listed as the 31st-fastest computer in the world on top500.org. The machine was set up in Texas, where Michael Obert ran the High Performance Linpack benchmark, a program that solves a very large system of equations, to obtain a performance of 152.2 teraflops (one teraflop is a trillion floating-point operations per second).
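For scale, the network tiers described above translate directly into bulk-transfer times for research data sets. The sketch below is idealized: wire rate only, ignoring protocol overhead and disk bottlenecks:

```python
# Time to move a 1 TB data set at the wire rate of each network tier
# described in this plan (idealized: no protocol overhead, no disk
# bottleneck).
DATA_BYTES = 1e12  # 1 TB

for name, gbps in [("1 Gb/s production core", 1),
                   ("10 Gb/s research core", 10),
                   ("40 Gb/s NCAR circuit", 40)]:
    seconds = DATA_BYTES * 8 / (gbps * 1e9)
    print(f"{name}: {seconds / 60:.1f} minutes")
```

At wire rate, 1 TB takes roughly 2.2 hours on the 1 Gb/s production core, about 13 minutes at 10 Gb/s, and a little over 3 minutes at 40 Gb/s, which is why labs with large instrument data streams need the RCN rather than the production network.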
The computer consists of 342 blade chassis ("cloud edge" servers); each chassis holds 4 servers, for a total of 1,368 servers. Each server has 2 sockets with six CPU cores per socket (12 cores per server), for a total of 16,416 cores.

Condo model

We are working on a pilot project with the Institute of Cognitive Sciences (ICS). The current approach is to incentivize ICS to shut down its existing resource in the CINC building and move to a condo cluster managed by Research Computing. We are developing, with ICS, policies that will move researchers from their local resources to the centralized Research Computing resources.
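The supercomputer figures quoted above are internally consistent, which can be verified with quick arithmetic (the per-core figure is derived here for illustration, not stated in the plan):

```python
# Cross-check of the supercomputer figures quoted in this plan.
chassis = 342
servers_per_chassis = 4
sockets_per_server = 2
cores_per_socket = 6

servers = chassis * servers_per_chassis                   # 1368
cores = servers * sockets_per_server * cores_per_socket   # 16416
print(servers, cores)

# Measured HPL performance spread across all cores
hpl_tflops = 152.2
print(f"{hpl_tflops * 1e12 / cores / 1e9:.1f} GFLOPS per core")  # ~9.3
```

That works out to roughly 9.3 GFLOPS per core sustained on the Linpack benchmark.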