Data Quality and Cleaning


1 Data Quality and Cleaning: A Case of Mobile Phone Survey Data
Inna Kouper, Data to Insight Center, School of Informatics and Computing, Indiana University
September,

2 Why DQ
Data becomes: big, frequent, heterogeneous, collaborative, integrated, reusable, shared.

3 Agricultural Decision Making and Food Security in Africa
Decisions: when to plant and harvest; what to plant; how to grow; what the weather looks like.
Survey questions:
  Are all of your maize fields planted now?
  Did it rain on your fields this week?
  Did you plant any maize in the last 7 days?
  How many 50kg bags of maize do you have in storage now?
  What seed variety did you plant?

4 Common activities impacting DQ
Source: Redman, Thomas C. Data Quality: The Field Guide. Digital Press.

5 DQ - Database / industry approach
  Accuracy: The data was recorded correctly.
  Completeness: All relevant data was recorded.
  Uniqueness: Entities are recorded once.
  Timeliness: The data is kept up to date.
  Consistency: The data agrees with itself.
Source: Exploratory Data Mining and Data Quality, T. Dasu and T. Johnson, Wiley, 2004.

6 DQ - Government approach
  Utility: The data is useful to the public.
  Objectivity: The data is accurate, clear, complete, and unbiased. The data is documented and/or reproducible. The data is subject to peer review.
  Integrity: The data is protected from corruption and unauthorized action.

7 DQ - Research lifecycle approach
Validity, accuracy, consistency, integrity, completeness, context.

8 Factors affecting mobile data quality
Sampling; Medium / message; No interviewer; Technical (e.g., poor network signal, discharged device); Social / individual; Errors; Non-response; Policy / economical (e.g., literacy level, no funds for texting); Limitations of mobile and texting platforms.

9 Completeness

10 Accuracy 1
Did it rain this week? Please answer yes or no:
  Yes, yes, YES, YESS, YAS
  No, no, No!, N0, NO, no rain
  50
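The variants on this slide suggest how such replies might be normalized. The following is a minimal sketch, not the project's actual cleaning code: the function name `normalize_yes_no` and the two sets of accepted spellings are assumptions built only from the examples shown above, and anything outside them is returned as `None` for manual review rather than guessed.

```python
import re

# Hypothetical lookup sets, built only from the variants shown on the slide.
YES_FORMS = {"yes", "y", "yess", "yas"}
NO_FORMS = {"no", "n", "n0"}


def normalize_yes_no(raw):
    """Return 'yes', 'no', or None when the reply cannot be interpreted."""
    # Lowercase and keep only alphanumeric tokens ("No!" becomes ["no"]).
    tokens = re.findall(r"[a-z0-9]+", raw.strip().lower())
    if not tokens:
        return None
    first = tokens[0]  # "no rain" is read through its first word
    if first in YES_FORMS:
        return "yes"
    if first in NO_FORMS:
        return "no"
    return None  # e.g. "50" is not an answer to a yes/no question


replies = ["Yes", "yes", "YES", "YESS", "YAS",
           "No", "no", "No!", "N0", "NO", "no rain", "50"]
cleaned = [normalize_yes_no(r) for r in replies]
```

Keeping the accepted spellings in explicit sets makes every correction auditable; a fuzzy matcher would accept more variants but could also silently misread an unrelated reply.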

11 Accuracy / Consistency
Did it rain this week? Please answer yes or no:
  No
  Yes

12 Consistency
How many 50kg bags of maize do you have in storage now?
  Week 1: 30
  Week 2: 25
  Week 3: 25
  Week 4: 20
  Week 5: 60
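One way to surface the Week 5 jump automatically is a week-over-week check. This is a sketch under a stated assumption: that stored maize should not increase between reports unless a harvest or purchase occurred. The function name `flag_jumps` and the `max_increase` parameter are hypothetical.

```python
def flag_jumps(series, max_increase=0):
    """Return (label, previous, current) for each report that rose by more
    than max_increase bags compared with the report before it."""
    flagged = []
    for (_, prev), (label, cur) in zip(series, series[1:]):
        if cur - prev > max_increase:
            flagged.append((label, prev, cur))
    return flagged


# Values from the slide above.
storage = [("Week 1", 30), ("Week 2", 25), ("Week 3", 25),
           ("Week 4", 20), ("Week 5", 60)]
flagged = flag_jumps(storage)
```

A flagged report is a candidate for follow-up, not necessarily an error: the increase could be a real harvest, which is why context matters for interpretation.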

13 Context: Can you interpret the data?

14 Approaches to data quality
  Preemptive: processes (data management); metadata and domain expertise
  Diagnostic: statistics; databases (data mining)
  Retrospective: data cleaning

15 Processes
  Decide where to store raw data and products
  Standardize content and formats
  Assign responsibility: data stewards
  Monitor data
  Archive data

16 Data Processing Pipeline
Diagram labels: TextIt Server; Quality Control; Metadata Management; Cleaning and anomaly detection; Build products; Development Server; Test Server; Production Server; MongoDB.

17 Data monitoring

18

19 Metadata
From the platform / From the team
Metadata fields: Country, Season, Creator, Date created, Run start date, Run start time, Run end date, Run end time, Flow type, List of questions
Example values (JSON excerpt):
  "uuid": "f07a bc-ab11-572f4b562aa9",
  "name": "harvest flow 25 Apr 2016",
  "runs": 674,
  "completed_runs": 252,
  "label": "coll_fuelwood"
  "label": "harvest"
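The run counts in this metadata can feed a simple completeness indicator. The sketch below assumes the excerpt has already been parsed into a dictionary with the `runs` and `completed_runs` keys shown on the slide; the function name `completion_rate` is hypothetical and is not part of any platform API.

```python
def completion_rate(flow):
    """Share of started runs that were completed, or None if no runs exist."""
    runs = flow.get("runs", 0)
    if not runs:
        return None  # avoid dividing by zero for a flow nobody started
    return flow.get("completed_runs", 0) / runs


# Values copied from the slide; the other fields are omitted here.
flow = {"name": "harvest flow 25 Apr 2016", "runs": 674, "completed_runs": 252}
rate = completion_rate(flow)
summary = f"{flow['name']}: {rate:.1%} of runs completed"
```

For the flow shown, roughly 37% of runs were completed, which is the kind of figure a monitoring step could track per flow over time.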

20 Data cleaning
  Identify variable types
  Check for invalid / incorrect values and missing values
  Conduct frequency analysis (mean, median, min, max, STD)
  Identify outliers
  Identify what can be corrected
  Decide how to treat outliers and missing values
  Evaluate time needed for automated and manual cleaning
  Use tools or manual correction
Are there patterns in missing data?
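The descriptive statistics and outlier steps listed above can be sketched with Python's standard `statistics` module. This is an illustration, not the project's pipeline: the outlier rule used here (a modified z-score based on the median absolute deviation, with the commonly used 3.5 cutoff) is a swapped-in technique that the slides do not name, and the trailing `None` is a hypothetical missing reply added to the weekly storage values from the Consistency slide.

```python
import statistics


def summarize(values):
    """Basic descriptive statistics, counting None as a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "std": statistics.stdev(present),  # sample standard deviation
    }


def mad_outliers(values, threshold=3.5):
    """Values whose modified z-score (median/MAD based) exceeds threshold."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    mad = statistics.median(abs(v - med) for v in present)
    if mad == 0:
        return []  # no spread around the median; the rule is not usable
    return [v for v in present if 0.6745 * abs(v - med) / mad > threshold]


bags = [30, 25, 25, 20, 60, None]  # slide values plus one hypothetical gap
stats = summarize(bags)
outliers = mad_outliers(bags)
```

A median-based rule is used because, with only a handful of reports per respondent, a single extreme value inflates the mean and standard deviation enough to hide itself.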

21 Using Google (Open) Refine to clean data
  String to number
  Remove words ("bags")
  Cluster similar values ("NO", "no")
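The three operations on this slide can be approximated in plain Python for readers without the tool at hand. This sketch is loosely modeled on key-collision ("fingerprint") clustering and is not OpenRefine's own implementation; the names `fingerprint`, `cluster`, and `to_number`, and the default word list, are assumptions.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Normalized key: trimmed, lowercased, punctuation removed, and
    unique whitespace-separated tokens sorted."""
    text = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group raw values that share the same fingerprint key."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


def to_number(value, drop_words=("bags",)):
    """Remove the listed words, then convert to int; None if that fails."""
    text = value.lower()
    for word in drop_words:
        text = text.replace(word, " ")
    try:
        return int(text.strip())
    except ValueError:
        return None


clusters = cluster(["NO", "no", "No!", "Yes", "yes "])
counts = [to_number(v) for v in ["30", "25 bags", "bags"]]
```

Clustering only proposes groups; a person still decides which spelling each group is merged into.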

22 Difficulties in cleaning
How many 50kg bags of maize do you expect to harvest?
  !50
  You cant tell now rains have jast staoted
  5O KGS ONLY
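Replies like these resist full automation, so one cautious option is to propose a value and mark it for manual review. The sketch below is an assumption-laden illustration: the function name `propose_bag_count` is hypothetical, and reading a letter "O" next to a digit as a zero is a guess a reviewer should confirm, especially since "5O KGS" may refer to kilograms rather than bags.

```python
import re


def propose_bag_count(raw):
    """Return (value, needs_review). value is None when nothing is usable."""
    text = raw.strip()
    if re.fullmatch(r"\d+", text):
        return int(text), False  # clean number, accepted as-is
    # Treat a letter O directly next to a digit as a probable zero.
    repaired = re.sub(r"(?<=\d)[oO]|[oO](?=\d)", "0", text)
    numbers = re.findall(r"\d+", repaired)
    if len(numbers) == 1:
        return int(numbers[0]), True  # plausible value, but confirm by hand
    return None, True  # free text, or several numbers: leave to a person


replies = ["!50", "You cant tell now rains have jast staoted", "5O KGS ONLY"]
proposals = [propose_bag_count(r) for r in replies]
```

The free-text reply carries real information (the respondent cannot estimate yet), so it is better stored as a separate non-response reason than discarded.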

23 Questions for discussion
  How can we define data quality to reflect differences in types of data and its uses and the dynamic nature of research?
  What indicators can help to track quality processes and improvements?
  Who is responsible for ensuring high data quality?
  What preemptive techniques can help to improve the quality of mobile data?


More information

Data Mining. Jeff M. Phillips. January 9, 2013

Data Mining. Jeff M. Phillips. January 9, 2013 Data Mining Jeff M. Phillips January 9, 2013 Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? Data

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu September 10, 2013 2: Data Pre-Processing Getting to know your data Basic Statistical Descriptions of Data

More information

Global Initiatives in Support of Measurements of SDGs

Global Initiatives in Support of Measurements of SDGs Global Initiatives in Support of Measurements of SDGs UN Statistics division Taking Collective Action to Accelerate Transformation of Official Statistics for Agenda 2030 27 28 March 2017, Bangkok 48 th

More information

Some Big Data Challenges

Some Big Data Challenges Some Big Data Challenges 2,500,000,000,000,000,000 Bytes (2.5 x 10 18 ) of data are created every day! (2012) or 8,000,000,000,000,000,000 (8 exabytes) of new data were stored globally by enterprises in

More information

Rural/Urban Divides in Mobile Coverage Expansion

Rural/Urban Divides in Mobile Coverage Expansion Rural/Urban Divides in Mobile Coverage Expansion Pierre Biscaye & C. Leigh Anderson Evans School Policy Analysis & Research Group (EPAR) Evans School of Public Policy & Governance, University of Washington,

More information

Data Governance in Mass upload processes Case KONE. Finnish Winshuttle User Group , Helsinki

Data Governance in Mass upload processes Case KONE. Finnish Winshuttle User Group , Helsinki Data Governance in Mass upload processes Case KONE Finnish Winshuttle User Group 6.11.2014, Helsinki Just IT Mastering the Data Just IT is a Finnish company focusing on Data Governance and Data Management.

More information

10th Tranche Development Account Programme on Statistics and Data (DA10)

10th Tranche Development Account Programme on Statistics and Data (DA10) 10th Tranche Development Account Programme on Statistics and Data (DA10) United Nations Statistics Division Regional Seminar on the Implementation of the SDG Indicators 3-4 April 2017, Santiago, Chile

More information

MAPR DATA GOVERNANCE WITHOUT COMPROMISE

MAPR DATA GOVERNANCE WITHOUT COMPROMISE MAPR TECHNOLOGIES, INC. WHITE PAPER JANUARY 2018 MAPR DATA GOVERNANCE TABLE OF CONTENTS EXECUTIVE SUMMARY 3 BACKGROUND 4 MAPR DATA GOVERNANCE 5 CONCLUSION 7 EXECUTIVE SUMMARY The MapR DataOps Governance

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Managing the Evolution of Dataflows with VisTrails

Managing the Evolution of Dataflows with VisTrails Managing the Evolution of Dataflows with VisTrails Juliana Freire http://www.cs.utah.edu/~juliana University of Utah Joint work with: Steven P. Callahan, Emanuele Santos, Carlos E. Scheidegger, Claudio

More information

Summary. Machine Learning: Introduction. Marcin Sydow

Summary. Machine Learning: Introduction. Marcin Sydow Outline of this Lecture Data Motivation for Data Mining and Learning Idea of Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication and Regression Examples Data:

More information

The What, Why, Who and How of Where: Building a Portal for Geospatial Data. Alan Darnell Director, Scholars Portal

The What, Why, Who and How of Where: Building a Portal for Geospatial Data. Alan Darnell Director, Scholars Portal The What, Why, Who and How of Where: Building a Portal for Geospatial Data Alan Darnell Director, Scholars Portal What? Scholars GeoPortal Beta release Fall 2011 Production release March 2012 OLITA Award

More information

A Data Modeling Process. Determining System Requirements. Planning the Project. Specifying Relationships. Specifying Entities

A Data Modeling Process. Determining System Requirements. Planning the Project. Specifying Relationships. Specifying Entities Chapter 3 Entity-Relationship Data Modeling: Process and Examples Fundamentals, Design, and Implementation, 9/e A Data Modeling Process Steps in the data modeling process Plan project Determine requirements

More information

Development of a Social Extension for Real-Time Communication in CAD Software

Development of a Social Extension for Real-Time Communication in CAD Software Development of a Social Extension for Real-Time Communication in CAD Software Markus Müller, 2.11.2015 (Bachelor s Thesis, final presentation) Software Engineering for Business Information Systems (sebis)

More information

Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity

Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity Wendy Foslien, Honeywell Labs Valerie Guralnik, Honeywell Labs Steve Harp, Honeywell Labs William Koran, Honeywell Atrium

More information

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment Shigeo Sugimoto Research Center for Knowledge Communities Graduate School of Library, Information

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Rutgers Master Gardener Program of Somerset County Graduating Class of 2019 POSITION DESCRIPTION

Rutgers Master Gardener Program of Somerset County Graduating Class of 2019 POSITION DESCRIPTION Rutgers Master Gardener Program of Somerset County Graduating Class of 2019 POSITION DESCRIPTION TITLES Rutgers Master Gardener Intern: Currently part of the Rutgers Master Gardener training class or volunteering

More information