DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua Li

Similar documents
DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua Li

DS595/CS525: Urban Network Analysis --Urban Mobility Prof. Yanhua Li

DS504/CS586: Big Data Analytics Data acquisition and measurement Prof. Yanhua Li

DS504/CS586: Big Data Analytics Data Management Prof. Yanhua Li

AN IMPROVED TAIPEI BUS ESTIMATION-TIME-OF-ARRIVAL (ETA) MODEL BASED ON INTEGRATED ANALYSIS ON HISTORICAL AND REAL-TIME BUS POSITION

TrajAnalytics: A software system for visual analysis of urban trajectory data

ITS Canada Annual Conference and General Meeting. May 2013

Constructing Popular Routes from Uncertain Trajectories

Visual Traffic Jam Analysis based on Trajectory Data

CrowdPath: A Framework for Next Generation Routing Services using Volunteered Geographic Information

arxiv: v1 [stat.ml] 29 Nov 2016

Course Syllabus. Course Information

Social Network Mining An Introduction

Spatio-Temporal Routing Algorithms Panel on Space-Time Research in GIScience Intl. Conference on Geographic Information Science 2012

Spatiotemporal Access to Moving Objects. Hao LIU, Xu GENG 17/04/2018

SMART MAPS BASED ON GPS PATH LOGS

MOBILE WI-FI Policy enforced Wi-Fi for moving vehicle applications

Visualization and modeling of traffic congestion in urban environments

Creating your Blank Shell for Blackboard Learn

CLIENT ONBOARDING PLAN & SCRIPT

MicroStrategy Academic Program

DS504/CS586: Big Data Analytics Graph Mining Prof. Yanhua Li

CLIENT ONBOARDING PLAN & SCRIPT

Database Management Systems CS Spring 2017

The Application of Concepts from Multiple Courses in Creating a Useful App for the University

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

M Thulasi 2 Student ( M. Tech-CSE), S V Engineering College for Women, (Affiliated to JNTU Anantapur) Tirupati, A.P, India

DS504/CS586: Big Data Analytics Graph Mining Prof. Yanhua Li

Clicking on Analytics will bring you to the Overview page.

Activity-Based Human Mobility Patterns Inferred from Mobile Phone Data: A Case Study of Singapore

BEST BIG DATA CERTIFICATIONS

TomTom Navigation app for iphone/ipad Reference Guide

Home Office Pro User Guide

Agenda. Traffic Sensor Data. TransDec. Tasks. Moving Objects. TRANSDEC: Transportation Decision Making

Detect tracking behavior among trajectory data

Network Analyst: Performing Network Analysis

User s Manual. Version

Collegiate Link FAQ for Club and Student Organization Leaders

Data Visualization for Educational Research. Taylor Martin Active Learning Lab Utah State University

Distracted Work. Kentucky Crushed Stone Association - Annual Safety & Education Seminar February Robert Jameson, Central Mine Services, Inc.

2015 Entrinsik, Inc.

Creating New Ads QUICK START GUIDE

Here Comes The Bus. Downloading The App Download the app from Google Play Store or from the Apple App Store

Chapter 8: GPS Clustering and Analytics

Mining massive geographic data. Jameson Toole & Yingxiang Yang! Human Mobility and Networks Lab MIT

A Novel Method for Activity Place Sensing Based on Behavior Pattern Mining Using Crowdsourcing Trajectory Data

Welcome to CS61A! Last modified: Thu Jan 23 03:58: CS61A: Lecture #1 1

Data Models and Data processing in GIS

Welcome to! 4DN4! Advanced Internet Communications"

Taking the Path More Travelled SAS Visual Analytics and Path Analysis

ECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1

Based on Big Data: Hype or Hallelujah? by Elena Baralis

Introduction to Cisco IoT Tools for Developers IoT 101

Survey on Recommendation of Personalized Travel Sequence

MicroStrategy Academic Program

Data Analyst Nanodegree Syllabus

Introduction to Data Management. Lecture #1 (Course Trailer )

HHH Instructional Computing Fall

Introduction to Data Management for Ocean Science Research

CS 3516: Advanced Computer Networks

Networks in GIS. Geometric Network. Network Dataset. Utilities and rivers One-way flow. Traffic data 2-way flow Multi-modal data

Welcome to the Student Success Network

Detecting Anomalous Trajectories and Traffic Services

CHANGE-ONLY MODELING IN NAVIGATION GEO-DATABASES

CSC 261/461 Database Systems. Fall 2017 MW 12:30 pm 1:45 pm CSB 601

Keys to Your Web Presence Checklist

Bellevue Community College Summer 2009 Interior Design 194 SPECIAL TOPIC: SKETCHUP

Utility Network Management in ArcGIS: Migrating Your Data to the Utility Network. John Alsup & John Long

Graduate Certificate in Internet Business

Statistical Graphs & Charts

Public or Private (1)

CSE 701: LARGE-SCALE GRAPH MINING. A. Erdem Sariyuce

Network Analyst: An Introduction

Connect to CCPL

Student Getting Started Guide

Introduction to Data Management. Lecture #1 (Course Trailer ) Instructor: Chen Li

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Power Aware Data Driven Distributed Simulation on Micro-Cluster Platforms

Welcome to CS120 Fall 2012

Trip Reconstruction and Transportation Mode Extraction on Low Data Rate GPS Data from Mobile Phone

Big Data Analytics. Description:

: Semantic Web (2013 Fall)

Scalable Selective Traffic Congestion Notification

RouteSavvy Online User Guide

Intelligent Transportation Traffic Data Management

Why You Shouldn t Trust comscore s Numbers for Search Engine Market Share [Data]

feel free to poke around and change things. It's hard to break anything in a Moodle course, and even if you do it's usually easy to fix it.

CS535: Interactive Computer Graphics

Pulaski County Special School District: New Student Online Registration Manual for Parents

AMNH Gerstner Scholars in Bioinformatics & Computational Biology Application Instructions

Mistakes to Avoid when Open Sourcing Proprietary Tech

DATA VISUALIZATION Prepare the data for visualization Data presentation architecture (DPA) is a skill-set that seeks to identify, locate, manipulate,

Outlier Detection With SQL And R. Kevin Feasel, Engineering Manager, ChannelAdvisor Moderated By: Satya Jayanty

CS 200, Section 1, Programming I, Fall 2017 College of Arts & Sciences Syllabus

1. The Basics of Internet Marketing Web Analytics... 23

Elementary Computing CSC /01/2015 M. Cheng, Computer Science 1

Bengal Success Portal

Highway Advertising Sign Mobile Data Capture and Permit Management GIS OAPRT Mobile and OAPRT Web-based Applications

An Android Smartphone Application for Collecting, Sharing and Predicting the Niagara Frontier Border Crossings Waiting Time

Scalable Data Analysis (CIS )

Transcription:

Welcome to DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua Li Time: 6:00pm 8:50pm R Location: AK 232 Fall 2016

The Data Equation Oceans of Data Ocean Biodiversity Informatics, Hamburg Praia de Forte, Brazil

Data Quality Dimensions v Accuracy Errors in data Example: Jhn vs. John v Currency Lack of updated data Example: Residence (Permanent) Address: out-dated vs. up-to-dated v Consistency Discrepancies into the data Example: ZIP Code and City consistent v Completeness Lack of data Partial knowledge of the records in a table

Geographic outliers - GIS Country, State, named district, etc. Gazetteer of Brazilian localities Ocean Biodiversity Informatics, Hamburg

What do we mean by Data Quality? An essential or distinguishing characteristic necessary for data to be fit for use. SDTS 02/92 The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data. Chrisman, 1991

Loss of data quality Loss of data quality can occur at many stages: v At the time of collection v During digitisation v During documentation v During storage and archiving v During analysis and manipulation v At time of presentation v And through the use to which they are put Don t underestimate the simple elegance of quality improvement. Other than teamwork, training, and discipline, it requires no special skills. Anyone who wants to can be an effective contributor. (Redman 2001).

Data Cleaning v Data cleaning tasks Accuracy: Smooth out noisy data Currency: Update the records Consistency: Correct inconsistent data Completeness: Fill in missing values

Map matching

Map-matching v Problem: (Sampled data) GPS trajectory = a sequence of GPS locations with time stamps Map a GPS trajectory onto a road network a sequence of GPS points à a sequence of road segments

Spatial Data v e 4 e1 e 2 e 3.start e 3 e 3.end

Map-Matching v Why it is important A fundamental step in many transportation applications Navigation and driving Traffic analysis Taxi dispatching and recommendations Examples: Find the vehicles passing Institute Road Calculate the average travel time from WPI campus to MIT campus When will the Bus 3 arrive at stop Highland St & North Ashland St

Map-Matching v Simple solution for high-sampling-rate data Weighted distance Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Map-Matching (for low sampling rate) v Why difficult?????? (a) Parallel roads b) Overpass c) Spur

Map-Matching v According to the additional information used Geometric Topological Probabilistic Advanced techniques v According to the range of sampling points Local/incremental Global Yu Zheng. Trajectory Data Mining: An Overview. ACM Transaction on Intelligent Systems and Technology, 6, 3, 2015.

Map-matching v Insights Consider both local and global information Incorporating both spatial and temporal features p b p a p c Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Map-matching framework Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Map-matching v Solution (incorporating spatial information) (Observation Probability) Model local possibility e i 3 c i 3 c i 2 c i 1 p i e i 1 N(c j i )= p 1 e (x j i 2 2 2 µ) 2 p i 1 e i 2 (Transmission Probability) Considering context (global) c i 2 p i c i 1 p i+1 Spatial analysis function V (c t i 1! c s i )= d i 1!i w (i 1,t)!(i,s) F s (c t i 1! c s i )=V (c t i 1! c s i ) N(c s i ) Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Map-matching Solution (Cosine Similarity) Temporal analysis function (Considering temporal information) Shortest path is used. P k F t (c t i 1! c s u=1 i )= (e0 u.v v (i 1,t)!(i,s) ) q Pk q Pk u=1 (e0 u.v) 2 u=1 v2 (i 1,t)!(i,s) A Highway P i-1 P i A Service Road v (i 1,t)!(i,s) = P k u=1 l u t i 1!i Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Map-matching Aggregating Spatial and temporal information Local and global information Dynamic programing P 1 's candidates P 2 's candidates P n 's candidates c 1 1 c 1 1 c 2 1 c 2 1 c n 1 c 1 2 Spatio-temporal function c 1 3 c 1 3 c 2 2 c 2 2 c n 2 F (c t i 1! c s i )=F s (c t i 1! c s i ) F t (c t i 1! c s i ), 2 apple i apple n Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Map-matching Path Selection F (P c )=N(c s 1 1 )+ n X Dynamic programing i=2 P = argmax Pc F (P c ) F (c s i 1 i 1! cs i i ) P 1 's candidates P 2 's candidates P n 's candidates c 1 1 c 1 1 c 2 1 c 2 1 c n 1 c 1 2 c 1 3 c 1 3 c 2 2 c 2 2 c n 2 Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Map-matching Example Path Selection

Map-matching framework Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Localized ST-Matching Strategy Path Selection Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Evaluations A N = #Correctly Matched Road Seg #all road segments A L = P Length Matched Road Seg Length of the trajectory Yin Lou, Chengyang Zhang, Yu Zheng, et al. Map-Matching for Low-Sampling-Rate GPS Trajectories. In ACM SIGSPATIAL GIS 2009

Course Project 25

Project 1 directions What is your project goal? v What new story you want to tell? v New contents to sample? v New sampling methods via API? v New statistics of YouTube, view count distribution, dynamics, or # uploaders/active users? v Analysis on other websites, Twitter, Facebook, Foursquare, Yelp, with API interfaces Broad impacts? (Keep in mind) v How YouTube is evolving? More business or personal videos? How to distinguish the two How special events, e.g., NBA game, breaking news, affect the uploading, viewing behaviors v Online Marketing, advertising?

A Few Words on Course Project I Project I: Collecting and Measuring Online Data v Team work; each team 3-4 students. v Starting date: Week 3 (9/8 R) v Proposal Due: Week 4 (9/15 R) 2 pages roughly v Due date/time: Before Class on Week 8 (10/13 R) v Presentation date/time: Class on Week 8 (10/13 R) Selected teams only v Requiring Programming in C/C++, Java, Python, and, etc v v v v v v Choose one online site/service with APIs to download data, or use existing datasets. Examples: (1) estimate site statistics, or (2) applying machine learning methods to predict future trends, or (3) perform time-series analysis to capture dynamic patterns, or something else, as long as your work can potentially bring research value to the community. 27

A Few Words on Course Project I v Group meeting with Prof Li by appointment) Week 3 (9/8 R), Starting date Week 4 (9/15 R), Proposal Due: 2 pages roughly (upload it to discussion board) Week 5 (9/22 R), Methodology due (upload it to discussion board) Week 6 (9/29 R), Results due (upload it to discussion board) Week 7 (10/6 R), Conclusion due (upload it class discussion board) Week 8 (10/13 R), Final Report due at 11:59pm EST & Self and Cross-evaluation due at 11:59pm EST Week 8 (10/13 R), In-class Presentation (10 min) (Selected teams only) Transport Layer 3-28

v Course Project II 29 Projects will be in groups! v 3-5 students per group, depending on enrollment v Topics on your choice (related to big data analytics) v Application-driven v Fundamental data analytics research v Data sources on course website http://wpi.edu/~yli15/courses/ds504fall16/ Resources.html Talk to me once you have an idea.

Next Class: Data Management v Do assigned readings before class 30 v v v Be prepared, read and review required readings on your own in advance! Do literature survey: find and read related papers if any Bring your questions to the class and look for answers during the class. v Submit reviews/critiques v In mywpi before class v Bring 2 hardcopies to the class v Hand in one copy, and keep one copy with you. Review Writing: http://users.wpi.edu/~yli15/courses/ds504fall16/critiques.html v Attend in-class discussions v v Please ask and answer questions in (and out of) class! Let s try to make the class interactive and fun!