CODE AND DATA MANAGEMENT
Toni Rosati, Lynn Yarmey
Data Management is Important!
Because:
- Reproducibility is the foundation of science
- Journals are starting to require data deposit
- You want to get credit for producing data (data citations)
- Others can use and build on your work (data reuse)
- Recreating a figure from a 2006 paper shouldn't be painful
- Funders tell us so (see NSF, NIH, NOAA, etc.)
Outline
- Back up often
- Sharing code
- File naming
- Metadata
- Sharing data
- A data search tool
Back up
Tips:
- 1 working copy on your computer
- 1 copy on infrastructure near you
- 1 copy on infrastructure far away
But why would you only back up when you can do so much more? SHARE!!
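The three-copy tip above can be sketched in a few lines of Python. This is only an illustration, not a tool from the talk: the function names `sha256_of` and `back_up` and the idea of passing one "near" and one "far" directory are assumptions made for the example. The sketch copies the working file to each backup location and compares checksums so a silently corrupted copy is caught.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(working_copy: Path, backup_dirs: list[Path]) -> list[Path]:
    """Copy the working file into each backup directory and verify each copy.

    backup_dirs is a hypothetical list, e.g. one mounted lab server
    ("near") and one synced off-site folder ("far").
    """
    expected = sha256_of(working_copy)
    copies = []
    for directory in backup_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / working_copy.name
        shutil.copy2(working_copy, target)  # copy2 also keeps timestamps
        if sha256_of(target) != expected:
            raise IOError(f"Checksum mismatch for {target}")
        copies.append(target)
    return copies
```

A call such as `back_up(Path("results.csv"), [Path("/mnt/lab_server/backup"), Path("/mnt/offsite/backup")])` would then leave one working copy and two verified backups; both paths here are placeholders.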
Why Share Code?
- Good backup
- Collaboration
- People don't have to contact you to get and understand the code
- Faster and easier than other options (emailing individuals or sharing on servers)
Why Share Code?
- Version control
- Commenting gives a brief, public history
- Work on multiple computers with the same code: flexibility in where you work (no USB drive necessary)
- Keep code with metadata/user instructions
- No bureaucracy
- FREE!
What is Git?
"Git is a distributed revision control and source code management (SCM) system capable of dealing with nonlinear workflows. As with most other distributed revision control systems, and unlike most client-server systems, every Git working directory is a full-fledged repository with complete history and full version tracking capabilities, independent of network access or a central server." (Wikipedia)
GitHub
Sharing Code GitHub.com
Sharing Code GitHub.com GitHub serves as the location of record for VIC at: https://github.com/uw-hydro/vic
File Naming
Make names unique and meaningful! Include (as appropriate):
- Project name or acronym
- Study title
- Location
- Data type
- Researcher initials
- Date
- Data stage
- Version number
- File type
Think long-term
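One way to keep names consistent is to build them from their parts rather than typing them by hand. The sketch below is an illustration only: the function `build_file_name`, the order of the parts, the underscore separator, and the `YYYYMMDD` date format are all assumptions chosen for the example, not a convention prescribed here. The example values (location, initials, date, stage) are made up.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """Replace runs of non-alphanumeric characters with a single hyphen."""
    return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")


def build_file_name(project: str, location: str, data_type: str,
                    initials: str, collected: date, stage: str,
                    version: int, extension: str) -> str:
    """Join the naming components with underscores (hypothetical scheme)."""
    parts = [
        slug(project),
        slug(location),
        slug(data_type),
        slug(initials),
        collected.strftime("%Y%m%d"),  # sorts chronologically as text
        slug(stage),
        f"v{version:02d}",             # zero-padded so v02 sorts before v10
    ]
    return "_".join(parts) + "." + extension.lstrip(".")


# Made-up example values:
print(build_file_name("VIC", "Arctic", "soil moisture", "TR",
                      date(2013, 5, 1), "raw", 2, ".csv"))
# prints: VIC_Arctic_soil-moisture_TR_20130501_raw_v02.csv
```

Putting the date in year-month-day order and zero-padding the version number means a plain alphabetical listing also sorts files by time and version.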
Metadata
What would someone unfamiliar with your data need in order to evaluate, understand, and reuse them? How about someone:
- who works in your lab?
- from a different lab in your field?
- who is in a related interdisciplinary field?
- who researches a completely different area?
- who works for a newspaper? Congress?
Metadata is the difference between:
Metadata is Data about Data
- Units?
- Resolution?
- What do the column names mean?
- Caveats?
- Known data issues or missing values?
- How were the data collected?
- Where did the forcing data come from?
- How many layers were used in this model?
"Information that describes the content, quality, condition, origin, and other characteristics of data or other pieces of information. Metadata for spatial data may describe and document its subject matter; how, when, where, and by whom the data was collected; availability and distribution information; its projection, scale, resolution, and accuracy; and its reliability with regard to some standard. Metadata consists of properties and documentation. Properties are derived from the data source (for example, the coordinate system and projection of the data), while documentation is entered by a person (for example, keywords used to describe the data)." (Esri)
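The questions above can be answered in a small metadata file stored next to the data. The following is a minimal sketch, assuming a JSON "sidecar" file and a hypothetical set of required fields; the field names, the `.metadata.json` suffix, and the functions `missing_fields` and `write_sidecar` are inventions for illustration, not a metadata standard.

```python
import json
from pathlib import Path

# Hypothetical minimum set of fields; adapt to your own data.
REQUIRED_FIELDS = ("title", "columns", "missing_value",
                   "collection_method", "known_issues")


def missing_fields(metadata: dict) -> list[str]:
    """List required fields that are absent, None, or an empty string."""
    return [name for name in REQUIRED_FIELDS
            if metadata.get(name) in (None, "")]


def write_sidecar(data_file: Path, metadata: dict) -> Path:
    """Write metadata as JSON beside the data file; refuse if incomplete."""
    gaps = missing_fields(metadata)
    if gaps:
        raise ValueError("Metadata is missing: " + ", ".join(gaps))
    sidecar = data_file.with_name(data_file.name + ".metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar


# Made-up example record:
example = {
    "title": "Example model output",
    "columns": {"time": "days since start of run",
                "runoff": "surface runoff, mm/day"},
    "missing_value": -9999,
    "collection_method": "model simulation",
    "known_issues": [],  # an empty list records "none known"
}
```

Calling `write_sidecar(Path("run01.csv"), example)` would create `run01.csv.metadata.json` in the same folder, so the description travels with the data.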
Metadata
What happens without good metadata?
- You have no idea what the data mean
- You think you understand the data, so you use them, but you use them totally wrong
- You waste hours (or days) trying to find out more about the data
Sharing Data
"These days, Dr. Hodes said, the old model in which researchers jealously guarded their data is no longer applicable."
http://www.nytimes.com/2011/04/04/health/04alzheimer.html
Sharing/Finding Data www.nsidc.org/acadis/search
Organize now... or...
Thank you!
Data Reuse
Our team enables Arctic sciences by ensuring datasets are well documented and can be understood by re-users. The trick with data re-use is to find the dataset, then become familiar enough with it to combine it with other data and extract accurate results.
Data Curation
- Metadata
- Usability
- Documentation
- Training
- Re-use
- Tools
- A little marketing
- Partnering
- Consensus building
- Data management plans for grant proposals
- Integrating social and physical sciences
- Data quality checks
- Data analysis
DOIs and Citations
Digital Object Identifiers (DOIs) officially name a resource. A DOI is essentially a stable, permanent URL.
"Information about a digital object may change over time, including where to find it, but its DOI name will not change. The DOI System provides a framework for persistent identification, managing intellectual content, managing metadata, linking customers with content suppliers, facilitating electronic commerce, and enabling automated management of media." (DataCite.org)
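In practice, a DOI becomes a clickable link by prefixing it with the `https://doi.org/` resolver. The sketch below shows one way to normalize the common ways a DOI is written into that form. The function name `doi_to_url` is an assumption for illustration, the validity check is deliberately crude, and the DOI in the example is a made-up placeholder, not a real dataset.

```python
from urllib.parse import quote

# Common ways a DOI is written in papers and emails.
_PREFIXES = ("https://doi.org/", "http://doi.org/",
             "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ resolver link for a DOI string."""
    cleaned = doi.strip()
    for prefix in _PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    # Crude sanity check only: DOIs start with "10." and contain a slash.
    if not cleaned.startswith("10.") or "/" not in cleaned:
        raise ValueError(f"Does not look like a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL, keeping slashes.
    return "https://doi.org/" + quote(cleaned, safe="/")


# Placeholder DOI, not a real dataset:
print(doi_to_url("doi:10.1234/example.data"))
# prints: https://doi.org/10.1234/example.data
```

Because the link goes through the resolver rather than to the hosting site directly, it keeps working if the data later move to a different location.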
Beyond ACADIS: Other Resources
General info and help
- Earth Science Information Partners (ESIP): http://wiki.esipfed.org/
- UVA Libraries: http://www2.lib.virginia.edu/brown/data/
Data Management Plan and other tools
- DMP Tool: https://dmp.cdlib.org/
- DataOne: https://www.dataone.org/cattools/data%20and%20metadata%20Management
Metadata
- Excel plug-in tool (in development): http://www.cdlib.org/cdlinfo/2011/09/01/facilitating-data-management-dcxl/
- Lists of standards (not complete!) for bio, climate, ecology, oceanography: http://marinemetadata.org/conventions
- Stanford-based portal for medical/bio: http://bioportal.bioontology.org/resources