Educating a New Breed of Data Scientists for Scientific Data Management


Jian Qin
School of Information Studies, Syracuse University
Microsoft eScience Workshop, Chicago, October 9, 2012

Talk points
  Data science (DS) and data scientists in the context of scientific data
  An iSchool version of the DS curriculum
  Findings and lessons from implementing the DS curriculum
  A new breed of data scientists: the iSchool approach
10/16/2012 MICROSOFT ESCIENCE WORKSHOP 2012 2

What is data science?
  "An emerging area of work concerned with the collection, presentation, analysis, visualization, management, and preservation of large collections of information."
  Stanton, J. (2012). Introduction to Data Science. http://ischool.syr.edu/media/documents/2012/3/datasciencebook1_1.pdf

Data science and scientific research
  Plan, design, consult for, implement, and evaluate data management projects and services
  Ingest, store, organize, merge, filter, and transform data and create analysis-ready data

What data scientists are expected to do: the job market
  Scientific Data Management Specialist
    Design, develop, implement, and manage high-throughput automatic data processing infrastructure for large databases in a mature system
    Develop and improve the infrastructure supporting this system
    Interface with multiple data providers to design, build, and maintain their customized databases
    Clarify requirements, feature requests, and bug reports for software developers and assist in testing code
    http://www.bioinformatics.org/forums/forum.php?forum_id=9670
  Laboratory Data Management Specialist
    Administer operational database
    Assure the quality of database content
    Interact closely with researchers, lab managers, and platform coordinators
    Track deliverables against budget and prepare data reports
    Collaborate closely with IT and bioinformatics colleagues
    Assist IT in gathering workflow requirements
    Test changes and updates in IT systems
    Create and maintain app documentation
  Data Modeling/Management Specialist
    Work closely with the high performance computing and the IT manager
    Develop a data model for complex multi-scale rocks
    Design and organize a database and complex queries
    Integrate and manage multiscale rocks subjected to large-scale scientific computing applications
    http://www.ingrainrocks.com/data-managementspecialist/

"We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others."
Loukides, M. (2011). What is data science? Sebastopol, CA: O'Reilly.

What data scientists are expected to do: the difference from tradition
Data scientists are more likely to be involved across the data lifecycle:
  Acquiring new data sets: 33%
  Parsing data sets: 29%
  Filtering and organizing data: 40%
  Mining data for patterns: 30%
  Advanced algorithms to solve analytical problems: 29%
  Representing data visually: 38%
  Telling a story with data: 34%
  Interacting with data dynamically: 37%
  Making business decisions based on data: 40%
http://mashable.com/2012/01/13/career-of-the-future-data-scientist-infographic/

How should educational programs address the challenge?
The case of the CAS in Data Science program at the Syracuse iSchool

Story 1: cognitively demanding workflows and data management
  Domain: Thermochronology and tectonics
  What's involved: rock samples from drilling and field observation, sliced and grained rock samples
  Data types: Excel data files (lots of them), spectrum and microscopic images, annotations
  Analysis: modeling and sensemaking by combining data from multiple data files with specialized software
  Bottleneck problem: manually matching, merging, and filtering data is extremely cumbersome, and the problem is compounded by the difficulty of finding the right data files
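
The manual matching, merging, and filtering described in Story 1 is the kind of step that a short script can take over. The following is a minimal sketch only, not the workflow used in the project: the column names (`sample_id`, `rock_type`, `age_ma`), the sample values, and the helper functions are all hypothetical, and the two in-memory CSV strings stand in for files exported from Excel.

```python
import csv
import io

# Hypothetical stand-ins for two spreadsheets exported as CSV.
SAMPLES_CSV = """sample_id,depth_m,rock_type
S1,120.5,granite
S2,340.0,gneiss
S3,95.2,gneiss
"""

AGES_CSV = """sample_id,age_ma
S1,12.4
S3,8.9
S4,20.1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_on_key(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    index = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            # Combine the columns of both rows into one record.
            merged.append({**row, **match})
    return merged


def filter_rows(rows, column, value):
    """Keep only the rows whose column equals the given value."""
    return [row for row in rows if row[column] == value]


samples = read_rows(SAMPLES_CSV)
ages = read_rows(AGES_CSV)
merged = merge_on_key(samples, ages, "sample_id")
granite = filter_rows(merged, "rock_type", "granite")
```

Samples that appear in only one file (S2 and S4 here) drop out of the inner join, which is also a quick way to spot records that still need to be matched by hand.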

Story 2: highly automated workflows
  Domain: Astrophysics: gravitational wave detection
  What's involved: data ingestion from laser interferometers, raw data calibration and segmentation, workflow management, provenance
  Data types: streaming data from the laser interferometers, images
  Analysis: detection of events
  Bottleneck problem: tracking of data and processes and the relationships between them
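
The bottleneck in Story 2, tracking data, processes, and the relationships between them, can be illustrated with a very small provenance log. This is a sketch under stated assumptions rather than the system used in the gravitational wave project: the `ProvenanceLog` class, the process names, and the data identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceLog:
    # Maps each output data ID to (process name, list of input data IDs).
    derived_from: dict = field(default_factory=dict)

    def record(self, process, inputs, output):
        """Note that `process` produced `output` from `inputs`."""
        self.derived_from[output] = (process, list(inputs))

    def lineage(self, data_id):
        """Return every upstream data ID that data_id depends on."""
        seen = []
        stack = [data_id]
        while stack:
            current = stack.pop()
            _, inputs = self.derived_from.get(current, (None, []))
            for parent in inputs:
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen


# Hypothetical three-step pipeline: calibrate, segment, detect.
log = ProvenanceLog()
log.record("calibrate", ["raw_frame_001"], "calibrated_001")
log.record("segment", ["calibrated_001"], "segment_001a")
log.record("detect_events", ["segment_001a"], "candidates_001")

upstream = log.lineage("candidates_001")
```

Recording one entry per processing step is enough to answer the question "which raw data does this result come from?", which is the relationship the slide identifies as hard to track.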

What is expected of data scientists?
  Ability to use a wide variety of tools for documentation, analysis, and reporting of data
  Knowledge of a subject domain
  Data modeling, database and query design
  Collaboration, communication, and coordination
  OS, programming languages
  Content and repository systems
  Encoding languages

Analytical skills: domain modeling
  Requirements analysis
  Workflow analysis
  Data modeling
  Data transformation needs analysis
  Data provenance needs analysis
  Interview skills, analysis and generalization skills
  Ability to capture components and sequences in workflows
  Ability to translate domain analysis into data models
  Ability to envision the data model within the larger system architecture

Analytical skills: from data sources to patterns, relationships, and trends
  Analytical tools
  Hacking
  Knowledge
  Data products

Data management skills: data lifecycle and infrastructural services
  Metadata standards
  Encoding language
  Semantic control
  Identity management
  Infrastructural services
  Processed, transformed, derived, calculated data
  Common data format
  Image formats
  Matrix formats
  Microarray file formats
  Communication protocols
  Data source discovery
  Data curation
  Data preservation
  Data integration and mashup
  Data citation, publication, and distribution
  Data linking and interoperability

Technology skills with excellent communication skills
  Technology skills
    Operating systems
    Repository systems
    Database systems
    Programming languages
    Encoding languages
    Specialized programming
  Communication skills
    Interviews
    Ice breaking
    Community building
    Institutionalization
    Stakeholder buy-in

No superman model for beginning data scientists
  Core: applied data science, databases
  Data analytics
  General system management
  Data storage and management
  Data visualization

The CAS in Data Science program at SU
  Required
    Data Administration Concepts and Database Management
    Applied Data Science
  Elective: Data Analytics
    Data Mining
    Basics of Information Retrieval Systems
    Natural Language Processing
    Advanced Information Analytics
    Research Methods
    Statistical Methods
  Elective: Data Storage and Management
    Technologies for Web Content Management
    Foundations of Digital Data
    Creating, Managing, and Preserving Digital Assets
    Data Warehousing
    Advanced Database Management
  Elective: Data Visualization
    Information Architecture for Internet Services
    Information Visualization
  Elective: General Systems Management
    Enterprise Technologies
    Managing Information Systems Projects
    Information Systems Analysis

What we learned from the program development
  Data science is a moving target with multiple focal points
    Versions from statistics, computer science, and library and information science
  Skills vs. theories
    Students are anxious to learn skills but not so interested in theories
    Theories help build visions
  Sufficient hands-on time for technologies and tools
  Authentic learning through real-world data management projects

Reconciliation of the two views of data science
  "An emerging area of work concerned with the collection, presentation, analysis, visualization, management, and preservation of large collections of information."
  Stanton, J. (2012). Introduction to Data Science. http://ischool.syr.edu/media/documents/2012/3/DataScienceBook1_1.pdf
  "We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others."
  Loukides, M. (2011). What is data science? Sebastopol, CA: O'Reilly.

The iSchool's version of data science education
Eventually the iSchool data science program will build the foundation for super data scientists:
  Ability to use a wide variety of tools for documentation, analysis, and reporting of data
  Collaboration, communication, and coordination
  Knowledge of a subject domain
  Content and repository systems
  Data modeling, database and query design
  Encoding languages
  OS, programming languages

eScience Librarianship Curriculum Project: http://eslib.ischool.syr.edu/
Science Data Literacy Project: http://sdl.syr.edu/
CAS in Data Science: http://ischool.syr.edu/future/cas/datascience.aspx