CSE6242 / CX4242: Data and Visual Analytics
Georgia Tech, Fall 2015
Homework 1: Analyzing The Movie DB dataset; SQLite; D3 Warmup; OpenRefine
Due: Friday, 11 September, 2015, 11:55 PM EST
Prepared by Meera Kamath, Gopi Krishnan, Siddharth Raja, Ramakrishnan Kannan, Akanksha, Polo Chau

Submission Instructions:

It is important that you read the following instructions carefully, as well as those about the deliverables at the end of each question, or you may lose points.

Submit a single zipped file, called HW1-{YOUR_LAST_NAME}-{YOUR_FIRST_NAME}.zip, containing all the deliverables, including source code/scripts, data files, and readme. Example: HW1-Doe-John.zip if your name is John Doe. Only .zip is allowed (no .rar, etc.).

You may collaborate with other students on this assignment, but each student must write up their own answers in their own words, and must write down their collaborators' names on T-Square's submission page. Any suspected plagiarism and academic misconduct will be reported to and handled directly by the Office of Student Integrity (OSI).

If you use any slip days, you must write down the number of days used on the T-Square submission page. For example: Slip days used: 1

At the end of this assignment, we have provided a folder structure that describes how we expect the files to be organized in the single zipped file. Please make sure you submit your files in the format specified; 5 points will be deducted for not following this strictly.

Wherever you are asked to write down an explanation for the task you perform, please stay within the word limit or you may lose points.

In your final zip file, please do not include any intermediate files you may have generated while working on a task, unless your script is absolutely dependent on them to produce the final result (which it ideally should not be).

Part 1: Collecting and visualizing The Movie DB (TMDb) data

* We anticipate the time needed to complete this part to be around 3 hours for Q1 and 1 hour for Q2.

1. [25 pt] Use The Movie DB API to (1) download data about movies and (2) for each movie, download its 5 related movies.

a. How to use TheMovieDB API:

Create a TMDb account at https://www.themoviedb.org/account/signup. More detailed instructions can be found here. After you have signed up, you need to request an API key: log into your account page on TMDb and generate a new key from within the "API" section found on the left sidebar. When requesting an API key (select "Developer"), you may enter:
    Type of Use: Education
    URL: http://poloclub.gatech.edu/cse6242/
    Summary: For the course CSE 6242
Once your request for the API key is approved, you can view the API key by clicking the "Details" tab under the API section.

Read the documentation (http://docs.themoviedb.apiary.io/#) to learn how to use the API.

Note: The API allows you to make 40 requests every 10 seconds. Set appropriate timeout intervals in the code while making requests. The API endpoint may return different results for the same request.

b. [10 pt] Search for movies with the keyword "life" in the title field and retrieve the first 300 movies. The documentation for movie search based on keywords: http://docs.themoviedb.apiary.io/#reference/search/searchmovie/get

Save the results in movie_id_name.txt. Each line in the file should describe one movie, in the format:
    <movie ID, movie name>

c. [15 pt] For each of the 300 movies, use the API to find its 5 most similar movies (some movies might have fewer than 5 similar movies). The documentation for obtaining similar movies: http://docs.themoviedb.apiary.io/#reference/movies/movieidsimilar

Save the results in movie_id_sim_movie_id.txt. Each line in the file should describe one pair of similar movies, in the format:
    <movie ID, similar movie ID>

Note: You should remove all duplicate pairs. That is, if both the pairs (A, B) and (B, A) are present, keep only (A, B) where A < B. For example, if movie A has three similar movies X, Y, and Z, and movie X has two similar movies A and B, then there should be only four lines in the file:
    A, X
    A, Y
    A, Z
    X, B
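
You may use any language for this question; as one possibility, the following is a minimal Python sketch of steps b and c. It assumes the third-party requests library is installed, and the endpoint paths (/search/movie, /movie/{id}/similar), parameter names, and pacing are assumptions based on TMDb API v3 conventions; verify them against the documentation linked above and replace API_KEY with your own key. The duplicate-pair handling follows the worked example (a pair is written only the first time it is seen in either order).

    # Hypothetical sketch for Q1 (b) and (c); verify endpoints and parameters against the TMDb docs.
    import time
    import requests

    API_KEY = "YOUR_TMDB_API_KEY"          # assumption: your personal v3 API key
    BASE = "https://api.themoviedb.org/3"  # assumption: TMDb API v3 base URL

    def get(path, **params):
        """GET a TMDb endpoint, pausing briefly to stay under ~40 requests per 10 seconds."""
        params["api_key"] = API_KEY
        resp = requests.get(BASE + path, params=params, timeout=10)
        resp.raise_for_status()
        time.sleep(0.3)  # crude rate limiting
        return resp.json()

    # (b) First 300 movies whose title matches the keyword "life" (20 results per page -> 15 pages).
    movies = []
    for page in range(1, 16):
        data = get("/search/movie", query="life", page=page)
        movies.extend(data.get("results", []))
    movies = movies[:300]

    with open("movie_id_name.txt", "w", encoding="utf-8") as f:
        for m in movies:
            # one movie per line: <movie ID, movie name>
            f.write("{}, {}\n".format(m["id"], m["title"]))

    # (c) Up to 5 similar movies per movie, skipping self-pairs and mirror duplicates.
    seen = set()
    with open("movie_id_sim_movie_id.txt", "w", encoding="utf-8") as f:
        for m in movies:
            similar = get("/movie/{}/similar".format(m["id"])).get("results", [])[:5]
            for s in similar:
                key = frozenset((m["id"], s["id"]))
                if m["id"] != s["id"] and key not in seen:
                    seen.add(key)
                    # one pair per line: <movie ID, similar movie ID>
                    f.write("{}, {}\n".format(m["id"], s["id"]))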

Deliverables: Create a directory named Q1 and place all the files listed below, including README.txt, into it.

Code: Write your own program/script to generate the two data files mentioned in steps b and c. You may use any language you like (e.g., shell script, Python, Java). You must include a README.txt file that states: (1) which platform you have tested it on (e.g., Linux, Windows 8, etc.), and (2) instructions on how to run your code/script. Note: The submitted code should run as is (no extra installation or configuration should be required).

Two data files: movie_id_name.txt and movie_id_sim_movie_id.txt, produced in steps b and c respectively.

2. [20 pt] We are going to analyze the data (from TMDb) in the files below as an undirected graph (network). Please download the following files:
    http://poloclub.gatech.edu/cse6242/2015fall/hw1/movie_id_name_gephi.csv
    http://poloclub.gatech.edu/cse6242/2015fall/hw1/movie_id_sim_movie_id_gephi.csv

Each movie in movie_id_name_gephi.csv is represented as a node in the graph, and each movie pair in movie_id_sim_movie_id_gephi.csv represents an undirected edge. You will visualize the graph of similar movies using Gephi (available for free download at http://gephi.org). Import all the edges in movie_id_sim_movie_id_gephi.csv with the "create missing nodes" option checked, since many of the IDs in movie_id_sim_movie_id_gephi.csv may not be present in movie_id_name_gephi.csv.

a. Gephi quick start guide: https://gephi.org/users/quick-start/

b. [10 pt] Visualize the graph and submit a snapshot of a visually meaningful (and hopefully beautiful) view of this graph. This is fairly open ended and left to your preference. Here are some general guidelines of what constitutes a visually meaningful graph:
    Keep edge crossings to a minimum.
    Keep edge lengths roughly the same; avoid having one or two very long edges.
    Keep the edges as straight lines without any bends.
    Keep the graph compact and symmetric if possible.
    All the nodes and edges should be properly visible.
    Make it easy to determine (or see) which nodes any given node is connected to, and to count its degree.
    Make it easy to perform other tasks a user might want, such as finding the shortest path from one node to another, or spotting clusters of nodes with higher inter-connectedness.
Experiment with Gephi's features, such as graph layouts, changing node size and color, edge thickness, etc.

c. [6 pt] Using Gephi's built-in functions, compute and report the following metrics for your graph:
    Average node degree
    Diameter of the graph
    Average path length
Briefly explain the intuitive meaning of each metric in your own words. (We will discuss these metrics in upcoming lectures on graphs.)

d. [4 pt] Run Gephi's built-in PageRank algorithm on your graph. Submit an image showing the distribution of the PageRank scores. (The distribution is generated by Gephi's built-in PageRank algorithm, where the horizontal axis is the PageRank score and the vertical axis is the number of nodes with a given PageRank score.) List the top 5 movies (movie title and ID) with the highest PageRank scores.

Deliverables: Create a directory named Q2 and place all the files listed below into it.

Result for part b: An image file named movie_graph.png containing the snapshot of the graph of the TMDb data, and a text file named graph_explanation.txt describing your choice of visualization (e.g., layout, color), using fewer than 50 words.

Result for part c: A text file named movie_graph_metrics.txt containing the three metrics and your intuitive explanation of each of them, using fewer than 150 words total.

Result for part d: An image file named movie_pagerank_distribution.png containing the distribution of the PageRank scores; also append the movies with the 5 highest PageRank scores to the file movie_graph_metrics.txt from part c.
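
The metrics in parts c and d must be reported from Gephi, but it can help to see how they are defined by computing them independently. The sketch below is optional and purely illustrative; it assumes the third-party pandas and networkx packages (not required by the assignment) and that the first two columns of movie_id_sim_movie_id_gephi.csv are the edge endpoints. Gephi's numbers may differ slightly depending on its settings.

    # Optional sanity check of the Q2 c/d metrics, outside Gephi (assumptions noted above).
    import pandas as pd
    import networkx as nx

    # Build an undirected graph from the edge file (first two columns = endpoint IDs).
    edges = pd.read_csv("movie_id_sim_movie_id_gephi.csv")
    G = nx.Graph()
    G.add_edges_from(edges.iloc[:, :2].itertuples(index=False, name=None))

    # Average node degree: 2 * |E| / |V|.
    print("average degree:", 2.0 * G.number_of_edges() / G.number_of_nodes())

    # Diameter and average path length are defined on connected graphs,
    # so restrict to the largest connected component if the graph is disconnected.
    largest_cc = G.subgraph(max(nx.connected_components(G), key=len))
    print("diameter (largest component):", nx.diameter(largest_cc))
    print("average path length (largest component):",
          nx.average_shortest_path_length(largest_cc))

    # PageRank: which nodes are most "central" under random-walk importance.
    pr = nx.pagerank(G)
    top5 = sorted(pr.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print("top 5 nodes by PageRank:", top5)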

Part 2: Using SQLite

* We anticipate the time to complete this question to be around 2 hours.

3. [35 pt] Elementary database manipulation with SQLite (http://www.sqlite.org/):

a. [2 pt] Import data: Create an SQLite database called rt.db.
    Import the movie data from http://poloclub.gatech.edu/cse6242/2015fall/hw1/movie_name_score.txt into a new table (in rt.db) called movies, with the schema:
        movies(id integer, name text, score integer)
    Import the movie cast data from http://poloclub.gatech.edu/cse6242/2015fall/hw1/movie_cast.txt into a new table (in rt.db) called moviecast, with the schema:
        moviecast(movie_id integer, cast_id integer, cast_name text)
    Provide the SQL code (and the SQLite commands used).
    Hint: you can use SQLite's built-in feature to import CSV (comma-separated values) files: https://www.sqlite.org/cli.html#csv

b. [2 pt] Build indexes: Create two indexes over the table movies. The first index, named movies_name_index, is for the attribute name. The second index, named movies_score_index, is for the attribute score.

c. [2 pt] Find average movie score: Calculate the average movie score over all movies that have scores >= 1.
    Output format: average_score

d. [4 pt] Finding poor films: Find the 5 worst (lowest-scoring) movies that have scores > 80. Sort by score from lowest to highest, then by name in alphabetical order.
    Output format: id, name, score

e. [4 pt] Finding laid-back actors: List 5 cast members (alphabetically by cast_name) with exactly 3 movie appearances.
    Output format: cast_id, cast_name, movie_count

f. [6 pt] Getting aggregate movie scores: Find the top 10 (distinct) cast members who have the highest average movie scores. Sort by score (from high to low); in case of a tie in the score, sort the results by the cast member's name in alphabetical order. Filter out movies with score < 1. Exclude cast members who have appeared in fewer than 3 movies.
    Output format: cast_id, cast_name, average_score

g. [7 pt] Creating views: Create a view (virtual table) called good_collaboration that lists pairs of stars who appeared in movies together. Each row in the view describes one pair of stars who have appeared in at least three movies together AND whose shared movies have an average score >= 75. The view should have the format:
        good_collaboration(cast_member_id1, cast_member_id2, avg_movie_score, movie_count)
    Exclude self-pairs (cast_member_id1 == cast_member_id2). Keep symmetrical or mirror pairs; for example, keep both (A, B) and (B, A).
    Hint: Remember that creating a view produces no output, so you should test your view with a few simple SELECT statements during development. Joins will likely be necessary.

h. [3 pt] Finding the best collaborators: Get the 5 cast members with the highest average good_collaboration scores from the view made in part g.
    Output format: cast_id, cast_name, average_good_collab_score

i. [4 pt] Full-Text Search (FTS) (https://www.sqlite.org/fts3.html): Import the movie overview data from http://poloclub.gatech.edu/cse6242/2015fall/hw1/movie_overview.txt into a new FTS table (in rt.db) called movie_overview, with the schema:
        movie_overview(id integer, name text, year integer, overview text, popularity decimal)
    1. Find the count of all the movies that have the word "best" or "worst" in the overview field.
    2. List the IDs of the movies that contain the terms "life" and "about" in the overview field with not more than 10 intervening terms between them.

j. [1 pt] Explain your understanding of FTS performance in comparison with a SQL LIKE query, and why one may perform better than the other. Please restrict your response to 50 words and save it in a file titled fts.txt.
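
Because Q3.SQL.txt is fed to the sqlite3 command-line shell (see the deliverables below), it can mix shell dot-commands with SQL statements. The fragment below only sketches the mechanics for parts a, b, and i: the CSV-import hint, index syntax, and an FTS table pattern. The choice of fts4, the assumption that the .txt data files are comma-separated without a header row, and the MATCH/NEAR syntax shown in the comments are assumptions to verify against the linked SQLite documentation; the graded queries for parts c-h are intentionally not shown.

    -- Sketch only; verify against https://www.sqlite.org/cli.html#csv and the FTS docs.
    -- (a) create the tables, then use the CLI's CSV import (dot-commands are allowed in Q3.SQL.txt)
    CREATE TABLE movies (id integer, name text, score integer);
    CREATE TABLE moviecast (movie_id integer, cast_id integer, cast_name text);
    .mode csv
    .import movie_name_score.txt movies
    .import movie_cast.txt moviecast

    -- (b) index syntax
    CREATE INDEX movies_name_index ON movies (name);
    CREATE INDEX movies_score_index ON movies (score);

    -- (i) FTS virtual table; fts4 is assumed here (fts3 also exists, per the linked docs)
    CREATE VIRTUAL TABLE movie_overview USING fts4
        (id integer, name text, year integer, overview text, popularity decimal);
    .import movie_overview.txt movie_overview

    -- FTS query patterns (illustrative only, not the graded answers):
    -- SELECT count(*) FROM movie_overview WHERE overview MATCH 'best OR worst';
    -- SELECT id FROM movie_overview WHERE overview MATCH 'life NEAR/10 about';

    -- blank-line separator required after each graded query (see deliverables below)
    select '';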

Deliverables: Create a directory named Q3 and place all the files listed below into it.

Code: A text file named Q3.SQL.txt containing all the SQL commands and queries you have used to answer questions a-j, in the appropriate sequence. We will test its correctness in the following way:
    $ sqlite3 rt.db < Q3.SQL.txt
Assume that the data files are present in the current directory.

Important: to make it easier for us to grade your code, after each question's query append the following command to the txt file (it prints a blank line):
    select '';
or
    select null;
Here is an example txt file:
    Query for question a
    select '';
    Query for question b
    select '';
    Query for question c
    ...

Answers: A text file named Q3.OUT.txt containing the answers to questions a-j. This file should be created in the following manner:
    $ sqlite3 rt.db < Q3.SQL.txt > Q3.OUT.txt
We will compare your submitted text file to the text created in the above manner.

Part 3: D3 Warmup and Tutorial

* We anticipate the time to complete this question to be around 1 hour.

4. [10 pt] Go through the D3 tutorial here. Chad Stolper will give a lecture on D3 that covers the aspects used here and in Homework 2. Please complete steps 01-08 (complete through "08. Drawing Divs"). This is an important tutorial which lays the groundwork for Homework 2.

Hint: We recommend using Firefox or Chrome, since they have relatively robust built-in developer tools.

Deliverables:

Code: an entire D3 project directory. When run in a browser, it will display 5 bars (as in step 8) with a different color for each bar (unlike the tutorial, where all bars have the same color) and your name, which can appear above or below the bar chart. The files and folder structure for Q4 should be as specified at the end of the assignment.

Part 4: OpenRefine

* We anticipate the time to complete this question to be around 45 minutes.

5. [10 pt] Basic usage of OpenRefine:

a. Download OpenRefine: http://openrefine.org/download.html

b. Import Dataset: Launch OpenRefine. It opens in a browser (127.0.0.1:3333). Download the dataset from here: http://poloclub.gatech.edu/cse6242/2015fall/hw1/menu.csv
Choose "Create Project" > This Computer > "menu.csv". Click "Next". You will now see a preview of the dataset. Click "Create Project" in the upper right corner.

c. Clean/Refine the data:
Note: OpenRefine maintains a log of all changes. You can undo changes; see the "Undo/Redo" button in the upper left corner.
    i. [3 pt] Clean the "Event" and "Venue" columns (select these columns to be Text Facets, and cluster the data). Write down your observations in fewer than 75 words.
    ii. [2 pt] Use the Google Refine Expression Language (GREL) to represent the dates in the "date" column in the format mm/dd/yyyy.
    iii. [1 pt] List a column in the dataset that contains only nominal data, and another column that contains ordinal data.
    iv. [1 pt] Create a new column "URL" which contains a link to the dishes of a particular menu. Format of the new column entries: http://api.menus.nypl.org/dishes/139476, where 139476 is the id.
    v. [3 pt] Experiment with OpenRefine, and list a feature (apart from the ones used above) you could additionally use to clean/refine the data, and comment on how it would be beneficial, in fewer than 50 words. Basic operations like editing a cell or deleting a row do not count.

Deliverables: Create a directory named Q5 and place all the files listed below into it.

Export the final table to a file named "Q5Menu.csv".

Submit a list of changes, changes.json, made to the file, in JSON format. Use the "Extract Operation History" option under the Undo/Redo tab to create this file.

A text file Q5Observations.txt with answers to parts c(i), c(iii), and c(v).

Survey

We would greatly appreciate it if you spent a few minutes completing this survey about the time you spent on this assignment: https://docs.google.com/forms/d/1uxk2psygmpeffdjaqqmw3xb4pcqg4nmpgawfn4prmo/viewform
Your responses will help us give a more reliable estimate of the time needed to complete the assignment in the future. Thanks!

Important Instructions on Folder Structure

We will execute the following commands to validate that the submission is in order. If some files are missing from our inspection of the zip folder, marks will be deducted. For example, if Q1/movie_id_name.txt is missing, 10 pt could be deducted.
    unzip -l HW1-LastName-FirstName.zip | grep -c txt
should give 9 files. Similarly for the html, json, csv, and js files.

The directory structure should look like:

HW1-LastName-FirstName/
    Q1/
        Your Script (script.py or script.js, etc.)
        movie_id_name.txt
        movie_id_sim_movie_id.txt
        README.txt
    Q2/
        graph_explanation.txt
        movie_pagerank_distribution.png
        movie_graph.png
        movie_graph_metrics.txt
    Q3/
        Q3.OUT.txt
        Q3.SQL.txt
        fts.txt
    Q4/
        index.html
        d3/
            d3.v3.min.js
    Q5/
        changes.json
        Q5Menu.csv
        Q5Observations.txt