MovieNet: A Social Network for Movie Enthusiasts

Similar documents
MovieNet: A Social Network for Movie Enthusiasts

745: Advanced Database Systems

Outline. Databases and DBMS s. Recent Database Applications. Earlier Database Applications. CMPSCI445: Information Systems.

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

CGS 3066: Spring 2017 SQL Reference

Course Web Site. 445 Staff and Mailing Lists. Textbook. Databases and DBMS s. Outline. CMPSCI445: Information Systems. Yanlei Diao and Haopeng Zhang

CMPSCI445: Information Systems

Course Modules for MCSA: SQL Server 2016 Database Development Training & Certification Course:

Database Systems: Design, Implementation, and Management Tenth Edition. Chapter 7 Introduction to Structured Query Language (SQL)

Chapter 7. Introduction to Structured Query Language (SQL) Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel

EECS 647: Introduction to Database Systems

ISYS1055/1057 Database Concepts Week 9: Tute/Lab SQL Programming

SQL functions fit into two broad categories: Data definition language Data manipulation language

CS450 - Database Concepts Fall 2015

MySQL for Developers. Duration: 5 Days

Oral Questions and Answers (DBMS LAB) Questions & Answers- DBMS

MySQL for Developers. Duration: 5 Days

Sql Server Syllabus. Overview

Databases (MariaDB/MySQL) CS401, Fall 2015

MySQL 5.0 Certification Study Guide

Course Outline. MySQL Database Administration & Design. Course Description: Pre-requisites: Course Content:

Manual Trigger Sql Server 2008 Update Inserted Rows

Interview Questions on DBMS and SQL [Compiled by M V Kamal, Associate Professor, CSE Dept]

MySQL Database Administrator Training NIIT, Gurgaon India 31 August-10 September 2015

8) A top-to-bottom relationship among the items in a database is established by a

Workbooks (File) and Worksheet Handling

Mysql Information Schema Update Time Null >>>CLICK HERE<<< doctrine:schema:update --dump-sql ALTER TABLE categorie

PHP & PHP++ Curriculum

T-SQL Training: T-SQL for SQL Server for Developers

Keyword Search in Databases

Midterm Review. Winter Lecture 13

Basic SQL. Basic SQL. Basic SQL

Mastering phpmyadmiri 3.4 for

CERTIFICATE IN WEB PROGRAMMING

COMP3311 Database Systems

Careerarm.com. 1. What is MySQL? MySQL is an open source DBMS which is built, supported and distributed by MySQL AB (now acquired by Oracle)

Introduction to MySQL /MariaDB and SQL Basics. Read Chapter 3!

Data about data is database Select correct option: True False Partially True None of the Above

MySQL for Beginners Ed 3

Connectivity: Utilizing Technology To Create Visibility For Your Chapter

DATABASE SYSTEMS. Database programming in a web environment. Database System Course, 2016

Database Systems CSE 414

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1

Oracle Schema Create Date Index Trunc >>>CLICK HERE<<<

ACS-3902 Fall Ron McFadyen 3D21 Slides are based on chapter 5 (7 th edition) (chapter 3 in 6 th edition)

Mysql Create Table With Multiple Index Example

MySQL 5.1 Past, Present and Future MySQL UC 2006 Santa Clara, CA

Deccansoft softwareservices-microsoft Silver Learing Partner. SQL Server Syllabus

EE221 Databases Practicals Manual

Unit 27 Web Server Scripting Extended Diploma in ICT

Short List of MySQL Commands

SQL Interview Questions

MySQL for Developers Ed 3

Hello everyone! Page 1. Your folder should look like this. To start with Run your XAMPP app and start your Apache and MySQL.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2017 Quiz I

NULL. The special value NULL could mean: Unknown Unavailable Not Applicable

Covering indexes. Stéphane Combaudon - SQLI

Database Technology. Topic 3: SQL. Olaf Hartig.

Kathleen Durant PhD Northeastern University CS Indexes

Introduction to Data Management. Lecture #1 (Course Trailer ) Instructor: Chen Li

II B.Sc(IT) [ BATCH] IV SEMESTER CORE: RELATIONAL DATABASE MANAGEMENT SYSTEM - 412A Multiple Choice Questions.

CSE 344 Final Review. August 16 th

2.9 Table Creation. CREATE TABLE TableName ( AttrName AttrType, AttrName AttrType,... )

Lecture 8. Database Management and Queries

Perl Dbi Insert Hash Into Table >>>CLICK HERE<<<

1. Data Definition Language.

App Engine: Datastore Introduction

INTERMEDIATE SQL GOING BEYOND THE SELECT. Created by Brian Duffey

Participant Center User s Guide

ITCertMaster. Safe, simple and fast. 100% Pass guarantee! IT Certification Guaranteed, The Easy Way!

Model Question Paper. Credits: 4 Marks: 140

of websites on the internet are WordPress

20762B: DEVELOPING SQL DATABASES

Personalized Keyword Search Drawbacks found ANNE JERONEN, ARMAND NOUBISIE, YUDIT PONCE

IBM emessage Version 9 Release 1 February 13, User's Guide

Module 9: Managing Schema Objects

Oracle 1Z0-882 Exam. Volume: 100 Questions. Question No: 1 Consider the table structure shown by this output: Mysql> desc city:

Performance Optimization for Informatica Data Services ( Hotfix 3)

Networks and Web for Health Informatics (HINF 6220)

Introduction to Data Management. Lecture #1 (Course Trailer )

DATABASE MANAGEMENT SYSTEMS

1z0-882.exam.55q 1z0-882 Oracle Certified Professional, MySQL 5.6 Developer

SQL Server Development 20762: Developing SQL Databases in Microsoft SQL Server Upcoming Dates. Course Description.

SQL - Subqueries and. Schema. Chapter 3.4 V4.0. Napier University

"Charting the Course... Intermediate PHP & MySQL Course Summary

Microsoft. [MS20762]: Developing SQL Databases

Automated SQL Ownage Techniques. OWASP October 30 th, The OWASP Foundation

Database Technology. Topic 2: Relational Databases and SQL. Olaf Hartig.

Chapter 9: Working With Data

Previously everyone in the class used the mysql account: Username: csci340user Password: csci340pass

Database Applications (15-415)

Nested Queries. Dr Paolo Guagliardo. Aggregate results in WHERE The right way. Fall 2018

REPORT ON SQL TUNING USING INDEXING

MySQL for Developers Ed 3

Chapter 8: Working With Databases & Tables

COSC 304 Introduction to Database Systems SQL DDL. Dr. Ramon Lawrence University of British Columbia Okanagan

Developing SQL Databases

MANAGING DATA(BASES) USING SQL (NON-PROCEDURAL SQL, X401.9)

Techno India Batanagar Computer Science and Engineering. Model Questions. Subject Name: Database Management System Subject Code: CS 601

Lecture 19 Query Processing Part 1

Transcription:

MovieNet: A Social Network for Movie Enthusiasts 445 Course Project Yanlei Diao UMass Amherst

Overview MovieNet is a social network for movie enthusiasts, containing a database of movies, actors/actresses, directors, etc., and a social network of movie enthusiasts. Users of MovieNet can search for movies and artists. They can also rate movies. They can further be friends with each other based on movie interests or other similarities. An online movie store can pay to join MovieNet and publish ads that are customized to each user based on her movie interest. 2

Project Requirements Schema design, data cleaning and loading (20%) Basic functionalities of a movie database (40%) Searches: from simple to complex User login and ratings DB admin: updates, direct SQL queries Constraints: schema design, data loading, user ratings Performance requirements (20%) A few seconds for most queries Normalization, physical tuning with indexes Web interface and extensions beyond above (20%) Social networking among movie fans Customized advertisements 3

Step 1: ER Diagram Entity sets Movies, Actors/Actresses, Directors, Producers, MPAA ratings, Genres, Keywords Deduplication: e.g., John Smith I, John Smith II Users (login, name, age ), Friend network Online movie company? Relationship sets A movie has a cast, a director (optional), a producer (optional), genres (optional), keywords (optional), ratings (optional), reviews (optional) Users can rate and write reviews for movies, and be friends with each other Constraints A user can rate a movie once? Age constraint? 4

Data Sets Title<TAB>the title Year<TAB>the year Running Time<TAB>length in minutes MPAA Rating<TAB>the rating Genres<TAB>genre1<TAB>genre2...h Key Words<TAB>word1<TAB>word2 Producers<TAB>producer1<TAB>producer2 Directors<TAB>director1<TAB>director2 Editors<TAB>editor1<TAB>editor2 Actor<TAB>actor name1<tab>role name1 Actor<TAB>actor name2<tab>role name2 Actress<TAB>actress name1<tab>role name3 Actress<TAB>actress name2<tab>role name4 <EMPTY LINE> Title<TAB>the title of another movie Year<TAB>the year of another movie <title,year> is unique Title The Dark Knight Rises Year 2012 Running Time 165 MPAA Rating PG-13 Genres Action Crime Thriller Key Words action-hero bat billionaire bomb car-crash catwoman Producers Nolan, Christopher (I) Roven, Charles Thomas, Emma (I) Directors Nolan, Christopher (I) Editors Smith, Lee (II) Actor Bale, Christian Bruce Wayne Actor Oldman, Gary Commissioner Gordon Actress Hathaway, Anne Selina Actress Cotillard, Marion Miranda Title Life of Pi Year 2012 5

Data Sets Users email<tab>user_name<tab>password<tab>age<tab>gender<tab>location email<tab>user_name<tab>password<tab>age<tab>gender<tab>location sheldon@def.com John Smith 45G5x$dd 21 Male Massachusetts jsmith@abc.com John Smith johnpass 50 Male New York Ratings user email<tab>title<tab>year<tab>user_rating user email<tab>title<tab>year<tab>user_rating sheldon@def.com The Shawshank Redemption 1994 6 sheldon@def.com The Dark Knight Rises 2012 10 MPAA MPAA rating<tab>definition<tab>description MPAA rating<tab>definition<tab>description G General Audiences. All Ages Admitted. A G-rated motion picture contains 6

Notes on the Data Set A movie can be uniquely identified by the combination of its title and year. A movie always has a line for Title, and a line for Year. For some movies, the year is 0, which means the actual year is unknown. All the attribute lines, except Title and Year, could be missing for a movie. For example, the directors or producers may be unknown for a movie. A movie may have multiple values for Genres, and the values are listed in the same line. The same for Key Words, Producers, Directors and Editors. The actors/actresses of a movie are listed in separate lines. Each line contains the name of the actor/actress after the first TAB, and the name of the roll after the second TAB. An actor/actress may play multiple rolls in a movie. Multiple actors/ actresses may play the same roll in a movie. If a roll name is \N, it means the actual roll name is unknown. A person can be an actor/actress, a producer, a director and an editor at the same time in the movies (even in the same movie). A person can be uniquely identified by her name. In the case that multiple persons have the same name, special tags (such as roman numbers) have been appended to the name of each person in the data set to remove ambiguity. 7

Upcoming Deadlines Groups formed on Feb 10 Please email help-cs445@edlab-mail.cs.umass.edu Proposal due on Feb 24 1. Extension beyond basic functionality and performance requirements 2. E/R diagram for your entire application 8

Raw Dataset (1) movies.txt (393,289,550 bytes) Schema: Each movie consists of multiple lines. An empty line separates two movies. Title<TAB>the title Year<TAB>the year Running Time<TAB>length in minutes MPAA Rating<TAB>the rating Genres<TAB>genre1<TAB>genre2... Key Words<TAB>word1<TAB>word2... Producers<TAB>producer name1<tab>producer name2... Directors<TAB>director name1<tab>director name2... Editors<TAB>editor name1<tab>editor name2... Actor<TAB>actor name1<tab>role name1 Actor<TAB>actor name2<tab>role name2 Actress<TAB>actress name1<tab>role name3 Actress<TAB>actress name2<tab>role name4 <EMPTY LINE>... 9

Title The Dark Knight Rises Year 2012 Running Time 165 MPAA Rating PG-13 Genres Action Crime Thriller Key Words action-hero bat billionaire bomb car-crash catwoman Producers Nolan, Christopher (I) Roven, Charles Thomas, Emma (I) Directors Nolan, Christopher (I) Editors Smith, Lee (II) Actor Bale, Christian Bruce Wayne Actor Oldman, Gary Commissioner Gordon Actress Hathaway, Anne Selina Actress Cotillard, Marion Miranda Title Life of Pi Year 2012 10

Data Characteristics A movie can be uniquely identified by the combination of its title and year. A movie always has a line for Title, and a line for Year. For some movies, the year is 0, which means the actual year is unknown. All the attribute lines, except Title and Year, could be missing for a movie. For example, the directors or producers may be unknown for a movie. A movie may have multiple values for Genres, and the values are listed in the same line. The same for Key Words, Producers, Directors and Editors. The actors/actresses of a movie are listed in separate lines. Each line contains the name of the actor/actress after the first TAB, and the name of the roll after the second TAB. An actor/actress may play multiple rolls in a movie. Multiple actors/ actresses may play the same roll in a movie. If a roll name is \N, it means the actual roll name is unknown. A person can be an actor/actress, a producer, a director and an editor at the same time in the movies (even in the same movie). A person can be uniquely identified by her name. In the case that multiple persons have the same name, special tags (such as roman numbers) have been appended to the name of each person in the data set to remove ambiguity. 11

Raw Dataset (2) users.txt (64,753,924 bytes) Schema: <email, name, password, age, gender, location> Format: All the fields are separated by TAB. 12

13

Raw Dataset (3) ratings.txt (137,156,280 bytes) Schema: <user email, movie title, movie year, rating> Format: All the fields are separated by TAB. Interesting metrics Popularity: number of ratings of a movie Goodness: average rating of a movie Most people want to know about popular and good movies! 14

15

Raw Dataset (4) mpaa.txt (4,378 bytes) Schema: <MPAA rating, definition, description> Format: All the fields are separated by TAB. 16

17

Datasets Available at Small sample files for ER modeling: Edlab: /usr/net/ftp/pub/cs445/project/sample_data/ Complete data sets: Edlab: /usr/net/ftp/pub/cs445/project/data/ 19

Step 2: Relational Model CREATE TABLE RelName { AttrName Type } An entity set translated to a table Set the primary key Can introduce an ID to be the key; use auto_increment A relationship set translated to a table or embedded in an existing table Attributes (Not Null?) Primary key Foreign keys Implement complex constraints using constraints or in PHP Triggers: http://www.wikivs.com/wiki/mysql_vs_postgresql#constraints, http://dev.mysql.com/doc/refman/5.4/en/create-trigger.html, textbook Ch 5.8 PHP: program as you want 20

Step 3: Parse & Load Data Set Raw data sets Raw data format: see the data description above Location on Edlab: /usr/net/ftp/pub/cs445/project/data/ Project database in MySQL: Database name: FirstLetterOfEachLastName, sorted in lexicographic order mysql h cs445sql p u $username $group_db_name Tasks for parsing and loading into your defined tables 1. Parse data into the format defined in your schema 2. Change permission of directory and files to be publicly accessible 3. Load parsed data files into your tables in MySQL LOAD DATA INFILE 'complete_path_of_data_file' INTO TABLE $table_name FIELDS TERMINATED BY '\t'; 21

Tips on Data Loading Prepare a table in a text file outside MySQL. DO NOT: run MySQL queries to preprocess the data, or insert a tuple at a time Where to place your temporary data files: Under your 445 directory (not your home directory), which gives you enough space Make sure to make you data files publicly accessible 22

More about Data Loading If you decide to use movie_id (or person_id) Declare the field to be auto_increment ; Load movies into table M; movie ids are auto generated; Load each other data file (e.g., Acting) first without the movie id field (say, into Table A1). Join A1 with the movie table M; insert all result tuples into a new table (A2) to be the real Acting table; Delete the old table (A1). 23

If Data Loading is Slow By default, MySQL (InnoDB) builds a clustered index on the primary key and a secondary index on each foreign key Using B+Tree insertion algorithm We may have to run OPTIMIZE TABLE later Having enough memory to hold B+trees helps reduce random I/Os Turn off a few operations if data loading is still slow set autocommit = 0 set unique_checks = 0 set foreign_key_checks = 0 24

If Data Loading Hangs Background processes may hold locks of tables, and block new processes. Some useful commands: Show background processes show processlist; (You can get process id, and status info of the processes) Kill a process kill process_id; (Killing a long-running process may take long) Get some idea of the progress of loading show table status; The Data_length field shows the current size of a table. It should be increasing during the loading process. 25

Project Update on Data Loading (Mar 10) 1. List of tables created, including attributes, primary keys, foreign keys, and other relevant constraints Can show the CREATE TABLE commands 2. Basic data characteristics of the tables, including the number of rows in each table, and the size of each table SHOW TABLE STATUS; For details, see http://dev.mysql.com/doc/refman/5.1/en/show-table-status.html 26

Step 4: Test Queries & DB Tuning Test queries are released on the project web page Selection Join Group by aggregation Nested queries Test your database using the test queries. If answer not correct, check schema & queries. 27

Query 1. Find all movies that have Life as a prefix in the title. (1,007 rows) Query 2. Find all movies that Pitt, Brad acted in. (159 rows) Query 3. Find the performers who acted in at least one movie directed by Spielberg, Steven. (2,665 row) Query 4. Show the number of user ratings for each rating value. (10 rows) Query 5. Find the directors of all movies that are rated 10 by DERRICK MYERS whose age is 36. Show the name of each director. (4 rows) Query 6. For each movie that is rated more than 5,000 times, show the title, year and the average age of the users who rate the movie. (17 rows) Query 7. Find the users who are 17 or under 17 and rate at least one NC-17 movie. Show the email, name, age and location of each user. (1783 rows) 28

Query 8. Find the 5 best movies. The 5 best movies are the 5 movies with the highest average rating. Only the movies rated more than 1,000 times are considered. Show the title, year and average rating of each best movie. (5 rows) Query 9. Find all the users of good taste. A user is of good taste if and only if the user rates at least 2 of the 5 best movies, and the user never rates the 5 best movies with rating less than 9. Show the email of each user. (238 rows) Query 10. Now we consider only the ratings from the users of good taste. Find the 10 movies with the highest average rating. Only the movies rated more than 2 times by the users of good taste are considered. Show the title, year and average rating of each best movie. (10 rows) 29

Performance Tuning If performance is not good, Add indexes: CREATE INDEX index_name ON table_name ( column_name(s) )! Analyze table: collect more precise statistics! Rewrite queries: eliminate nested subqueries Use joins, but not subqueries, if possible If needed, use derived tables in the From clause Store redundant information: Adding additional columns or tables (physically stored) Make sure that redundant information is consistent with base data Adding temporal tables within a connection CREATE TEMPORARY TABLE FROM SELECT! Revisit schema design: normalization vs. denormalization 30

Report due on April 9 Extend your report with the set of test queries written in your schema, the performance issues encountered, the techniques employed to overcome performance issues, the execution time of test queries using MySQL Our performance target: <= 20 seconds 32

Step 5: Web Application Development PHP/MySQL tutorial http://avid.cs.umass.edu/courses/445/s2015/systemsupport.html#php! PHP on Edlab http://avid.cs.umass.edu/courses/445/s2015/systemsupport.html#phpedlab Group PHP directory: /courses/cs400/cs445/php-dirs/cs445_x_s15/www,! where x is the 2 or 3 letter group name, all in the upper case.! Upload your PHP files Group URL: http://cs445.cs.umass.edu/php-wrapper/cs445_x_s15/file_name.php! Need username/password to access your group URL (Will be distributed via the course mailing list) 33

Implementing the Web Interface Search interface Search by title, year, actor name, director name, etc. User activities User login, rating of movies, update of avg. rating Admin interface SQL queries are sent, including updates Results are displayed with the response time More advanced features are welcome! 34

Implementing Your Extension Many ideas are possible: Social networking among movie fans Customized advertisements Integrating with youtube, trailors May need to generate new data, communication with the DB backend Start early 35