
Number: _______________  Name: _______________

INSTITUTO SUPERIOR TÉCNICO
GESTÃO E TRATAMENTO DE INFORMAÇÃO

Exam 2 - solution
30 January 2015

The duration of this exam is 2.5 hours. You can access your own written materials, but the exam is to be done individually. You are not allowed to use computers, tablets, or mobile phones. The maximum grade of the exam is 20 pts. Write your answers below the questions. Write your number and name at the top of each page. Present all calculations performed. You may leave the room no earlier than one hour after the exam starts, and only after handing in your exam.

The following table is to be used by instructors ONLY:

Question:  1   2   3   4   5   SUM
Points:    4   4   4   4   4   20


1. (4 pts) XML Data Management Technology

Consider the following XML document:

<dvdcollection>
  <dvd>
    <title>Good Night, and Good Luck</title>
    <release-year>2005</release-year>
    <director>George Clooney</director>
    <actors>
      <actor>George Clooney</actor>
      <actor>Jeff Daniels</actor>
      <actor>David Strathairn</actor>
    </actors>
  </dvd>
  <dvd>
    <title>They Live</title>
    <release-year>1988</release-year>
    <director>John Carpenter</director>
    <actors>
      <actor>Roddy Piper</actor>
      <actor>Keith David</actor>
      <actor>Meg Foster</actor>
    </actors>
  </dvd>
  <!-- list of remaining dvds -->
</dvdcollection>

1.1. (2.5 pts) Present XPath expressions that, using the XML document, answer the following information needs:

1.1.1. What are the titles of movies directed by John Carpenter, where Roddy Piper was the leading actor (i.e., the first actor appearing in the list of actors)?

//dvd[./director="John Carpenter"][.//actor[1]="Roddy Piper"]/title

1.1.2. Who are the actors, in the XML dataset, that are also directors of movies released after 1995?

//actor[ text() = //dvd[./release-year > 1995]/director ]

1.1.3. Who is the director of the oldest movie featuring Jeff Daniels as an actor?

//dvd[.//actor="Jeff Daniels" and ./release-year = min(//dvd[.//actor="Jeff Daniels"]/release-year)]/director
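As a quick sanity check (a sketch, not part of the exam solution), the expression in 1.1.1 can be run with Python's lxml library, which implements XPath 1.0; the min() call in 1.1.3 is XPath 2.0, so that query would instead need an engine such as BaseX or Saxon.

from lxml import etree

# A cut-down version of the document above, enough to exercise query 1.1.1.
doc = etree.XML(
    b"<dvdcollection>"
    b"<dvd><title>They Live</title><release-year>1988</release-year>"
    b"<director>John Carpenter</director>"
    b"<actors><actor>Roddy Piper</actor><actor>Keith David</actor></actors>"
    b"</dvd></dvdcollection>")

titles = doc.xpath('//dvd[./director="John Carpenter"]'
                   '[.//actor[1]="Roddy Piper"]/title/text()')
print(titles)  # ['They Live']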

1.2. (1 pt) Present an XQuery expression that, using the XML document, lists all movies that were directed by actors in the movie entitled "Good Night, and Good Luck". Movies in the results should be sorted according to the release year, from oldest to newest.

let $a := //dvd[./title="Good Night, and Good Luck"]//actor
for $m in //dvd
where $m/director/text() = $a/text()
order by $m/release-year ascending
return $m

1.3. (0.5 pt) Present an XQuery updating expression for changing the XML document, deleting all but the leading actor in the movies that were released prior to 1990, and adding an attribute rating = "awesome" to the dvd elements corresponding to movies directed by John Carpenter.

(
  for $m in //dvd[release-year < 1990]
  let $a := $m/actors/actor[position() > 1]
  return delete nodes $a,
  for $m in //dvd[director="John Carpenter"]
  return insert node attribute rating { "awesome" } into $m
)

2. (4 pts) Web Data Extraction

Consider the following trees, representing two data records encoding information about a family tree.

[Figure with the two labeled trees; not reproduced in this transcription.]

2.1. (2.5 pts) Compute the similarity (i.e., the number of matching nodes) using the Simple Tree Matching (STM) algorithm, considering that two nodes can be aligned if they share the same label.

[The worked STM matrices, with the backtracking marked in pink, appear only in the original answer figure, not reproduced here.]

2.2. (1 pt) Compute the alignment between the trees, using the calculations performed for the previous question (make clear the backtracking process that reaches the specified alignment).

The backtracking is shown in pink in the matrices of the previous question.

2.3. (0.5 pt) Knowing that the STM algorithm is a simplification of a more general tree matching algorithm, give an example of two HTML trees containing a data record that would not be captured by STM, but could be captured if the general algorithm was used. Explain why this would happen.

Consider, for example, HTML pages containing data records with information on books where, in some cases, the title is encoded using <strong> and, in others, using <em>. This could be captured by the general algorithm but not by STM, since STM discards nodes with different labels.
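Since the worked matrices for 2.1 survive only in the missing figure, a compact sketch of the STM recurrence may help. It assumes trees encoded as (label, children) pairs, an encoding chosen here purely for illustration:

# Simple Tree Matching: counts the matching nodes between two trees.
def stm(a, b):
    if a[0] != b[0]:              # STM never aligns nodes with different labels
        return 0
    m, n = len(a[1]), len(b[1])
    # w[i][j]: best total match between the first i children of a and the
    # first j children of b (a DP over child sequences, preserving sibling order)
    w = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w[i][j] = max(w[i - 1][j],
                          w[i][j - 1],
                          w[i - 1][j - 1] + stm(a[1][i - 1], b[1][j - 1]))
    return w[m][n] + 1            # +1 for the matched roots

t1 = ("person", [("name", []), ("child", [("name", [])])])
t2 = ("person", [("name", []), ("child", [])])
print(stm(t1, t2))  # 3: person, name, and child match; the nested name does not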

3. (4 pts) Data Integration

Suppose a data source S storing the following tables:

Movie(movie_name, year, director_name)
Play(movie_name, person_name)
Person(person_name, nationality)

3.1. (2.5 pts)

3.1.1. Rewrite the following SQL query as a conjunctive query:

SELECT movie_name, director_name
FROM Movie m, Play p, Person a
WHERE m.movie_name = p.movie_name
  AND p.person_name = a.person_name
  AND a.nationality = 'Portuguese'
UNION ALL
SELECT movie_name, director_name
FROM Movie m
WHERE m.year = 1995

Q(m, d) :- Movie(m, y, d), Play(m, p), Person(p, 'Portuguese')
Q(m, d) :- Movie(m, 1995, d)

3.1.2. Suppose you have the following mediated schema M:

Portuguese-movies(movie_name, year)

which represents the names and years of movies whose actors are Portuguese or whose director is Portuguese. Write a global-as-view mapping between the mediated schema M and the data source schema S.

Portuguese-movies(m, y) = Movie(m, y, d), Play(m, p), Person(p, 'Portuguese')
Portuguese-movies(m, y) = Movie(m, y, d), Person(d, 'Portuguese')

3.1.3. Write a conjunctive query in terms of the mediated schema that returns the names of Portuguese movies directed after 1995. Then, unfold it and rewrite it in terms of the tables of data source S.

Q'(m) :- Portuguese-movies(m, y), y >= 1995

Unfolding:

Q'(m) :- Movie(m, y, d), Play(m, p), Person(p, 'Portuguese'), y >= 1995

Q'(m) :- Movie(m, y, d), Person(d, 'Portuguese'), y >= 1995

3.2. (1 pt) Suppose you have a pre-computed view:

Portuguese-Person(m, p) :- Play(m, p), Person(p, 'Portuguese')

How would you write the conjunctive query of Question 3.1.2 using the view Portuguese-Person?

Portuguese-movies(m, y) = Movie(m, y, d), Portuguese-Person(m, p)
Portuguese-movies(m, y) = Movie(m, y, d), Person(d, 'Portuguese')

3.3. (0.5 pt) For the following pair of queries, state which relationship exists (equivalence or containment) between them. Justify.

Q1(A, B, E) :- T(A, B, C), R(C, E), T(A, B, E), R(E, C)
Q2(U, V, Z) :- T(U, V, Z), R(Z, 5)

There is no relationship. No containment mapping exists in either direction: a mapping from Q2 into Q1 would have to send R(Z, 5) to a subgoal of Q1, but Q1's subgoals R(C, E) and R(E, C) contain no constant; and a mapping from Q1 into Q2 must send the head variables A, B, E to U, V, Z, which forces C to map to Z (because of T(A, B, C)), so R(C, E) would have to map to R(Z, Z), which is not a subgoal of Q2.
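The "no relationship" answer can also be checked mechanically. Below is a brute-force containment-mapping test, sketched under the assumption that a conjunctive query is encoded as a (head, body) pair of tuples, with variables as strings and constants as non-strings; Q is contained in Q' iff there is a containment mapping from Q' into Q.

from itertools import product

def contained_in(q1, q2):
    # True iff q1 is contained in q2, i.e., some mapping of q2's variables
    # into q1's terms sends q2's head to q1's head and every subgoal of q2
    # to a subgoal of q1 (constants map to themselves).
    head1, body1 = q1
    head2, body2 = q2
    vars2 = sorted({t for atom in [head2] + body2 for t in atom[1:] if isinstance(t, str)})
    terms1 = {t for atom in [head1] + body1 for t in atom[1:]}
    for images in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, images))
        subst = lambda atom: (atom[0],) + tuple(h.get(t, t) for t in atom[1:])
        if subst(head2) == head1 and all(subst(a) in body1 for a in body2):
            return True
    return False

q1 = (("Q", "A", "B", "E"),
      [("T", "A", "B", "C"), ("R", "C", "E"), ("T", "A", "B", "E"), ("R", "E", "C")])
q2 = (("Q", "U", "V", "Z"),
      [("T", "U", "V", "Z"), ("R", "Z", 5)])
print(contained_in(q1, q2), contained_in(q2, q1))  # False False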

4. (4 pts) Data Cleaning and Integration

4.1. (2.5 pts) Suppose the following two tuples of a table with schema Movies(movie_name, year, director, actors, review):

movie_name: Good Night, and Good Luck
year:       2005
director:   George Clooney
actors:     George Cloony, Jeff Daniels, David Strathairn
review:     nice well directed exceptional actors

movie_name: Good Night Good Luck
year:       2006
director:   George Clooney
actors:     Jeff Daniels and George Clooney and David Strahtairn
review:     wonderful nicely directed good actors

The goal is to automatically detect that the two tuples refer to the same movie.

4.1.1. Which string matching algorithm would you use to compare the movie names? Justify. Would you use the same string matching algorithm to compare the reviews? Justify.

We could use edit distance, for instance, because movie names are medium-sized strings. To compare the reviews, edit distance would not give good results, because the same words can occur in different positions. A possibility is to use TF/IDF.

4.1.2. Now, imagine you want to identify whether the lists of actors of the two tuples are similar. Would you apply a string matching algorithm directly to the two strings that represent the actors in each record? If not, what would you do?

We cannot apply a string matching algorithm directly to the two strings, because the actor names are separated by different separators and do not occur in the same order. It would be better to first split the actors field into one tuple per actor and store the actor tuples in a distinct table. A string matching algorithm could then be applied to the individual actor names.

4.1.3. Which string matching algorithm is appropriate to compare person names? Use that algorithm to compute the similarity between Clooney and Cloony in the two tuples, and between Strahtair and Strathair. Do the two comparisons return the same value? Why?

The Jaro measure is well suited to short strings such as person names:
Jaro(x, y) = 1/3 [ c/|x| + c/|y| + (c - t/2)/c ], where c is the number of common characters and t the number of transposed characters.

Jaro(Clooney, Cloony): |x| = 7, |y| = 6, c = 6, t = 0
Jaro = 1/3 (6/7 + 6/6 + 6/6) = 1/3 (0.857 + 1 + 1) ≈ 0.95

Jaro(Strahtair, Strathair): |x| = 9, |y| = 9, c = 9, t = 2
Jaro = 1/3 (9/9 + 9/9 + (9 - 1)/9) = 1/3 (1 + 1 + 0.889) ≈ 0.96

No. Although in both pairs the number of common characters equals the length of one of the strings, one of the pairs has 2 transposed characters, which decreases the similarity value, so the two comparisons yield different values.
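For reference, a minimal sketch of the Jaro measure used above (the match window and the transposition count follow the usual definition; this is an illustration, not the course's reference implementation):

def jaro(x, y):
    if not x or not y:
        return 0.0
    window = max(len(x), len(y)) // 2 - 1
    mx, my = [False] * len(x), [False] * len(y)
    c = 0
    for i, ch in enumerate(x):                    # find common characters
        lo, hi = max(0, i - window), min(len(y), i + window + 1)
        for j in range(lo, hi):
            if not my[j] and y[j] == ch:
                mx[i] = my[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    xs = [ch for i, ch in enumerate(x) if mx[i]]  # matched characters, in order
    ys = [ch for j, ch in enumerate(y) if my[j]]
    t = sum(a != b for a, b in zip(xs, ys))       # transposed characters
    return (c / len(x) + c / len(y) + (c - t / 2) / c) / 3

print(round(jaro("Clooney", "Cloony"), 2))       # 0.95
print(round(jaro("Strahtair", "Strathair"), 2))  # 0.96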

4.2. (1 pt) Consider now only the possible values of the attribute review. Besides the two values represented above (denoted t1 and t2, respectively), which correspond to positive reviews, consider that you have another two instances, denoted t3 and t4, that correspond to negative reviews. Suppose as well that the review attribute values have undergone a normalization process. The resulting set of reviews is as follows:

t1: {nice, well, directed, exceptional, actor} - positive
t2: {wonderful, nice, directed, good, actor} - positive
t3: {medium, film, terrible, direction, actor} - negative
t4: {poor, directed, medium, film} - negative

Now, suppose we have another table with schema T(Y), with one tuple of that table, <nice, well, actor, good, directed>. Use a Naive Bayes learner to learn from the four instances of the review attribute of the Movies table (t1, t2, t3, and t4) and then predict whether the value of attribute Y refers to a positive or a negative review.

d = {nice, well, actor, good, directed}

P(positive|d) = P(d|positive) P(positive) / P(d)
P(negative|d) = P(d|negative) P(negative) / P(d)
c_d = argmax_ci [ P(d|ci) P(ci) ], where ci is positive or negative.

P(ci) is the proportion of training instances with label ci:
P(positive) = 0.5, P(negative) = 0.5

Total word counts per class: N(positive) = 10, N(negative) = 9

P(d|positive) = P(nice|positive) P(well|positive) P(actor|positive) P(good|positive) P(directed|positive)
P(nice|positive) = n(nice, positive)/N(positive) = 2/10
P(well|positive) = 1/10
P(actor|positive) = 2/10
P(good|positive) = n(good, positive)/N(positive) = 1/10
P(directed|positive) = 2/10
P(d|positive) = (2/10)(1/10)(2/10)(1/10)(2/10) = 8/10^5

P(d|negative) = P(good|negative) P(nice|negative) P(actor|negative) P(well|negative) P(directed|negative)
P(good|negative) = n(good, negative)/N(negative) = 0
P(nice|negative) = 0
P(actor|negative) = 1/9
P(well|negative) = 0
P(directed|negative) = 1/9
P(d|negative) = 0

So the answer is: positive review.
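The same prediction can be reproduced with a short multinomial Naive Bayes sketch (no smoothing, mirroring the hand computation above; with Laplace smoothing the zero probabilities for the negative class would disappear):

from collections import Counter
from math import prod

train = [(["nice", "well", "directed", "exceptional", "actor"], "positive"),
         (["wonderful", "nice", "directed", "good", "actor"], "positive"),
         (["medium", "film", "terrible", "direction", "actor"], "negative"),
         (["poor", "directed", "medium", "film"], "negative")]

def predict(doc):
    scores = {}
    for lab in {l for _, l in train}:
        counts = Counter(w for ws, l in train if l == lab for w in ws)
        n = sum(counts.values())                       # N(label): total tokens
        prior = sum(l == lab for _, l in train) / len(train)
        scores[lab] = prior * prod(counts[w] / n for w in doc)
    return max(scores, key=scores.get)

print(predict(["nice", "well", "actor", "good", "directed"]))
# positive (score 0.5 * 8/10**5 = 4e-05, versus 0 for negative)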

4.3. (0.5 pt) Suppose that you have 1 million tuples stored in the Movies table. Which method do you suggest to optimize the time needed to find all the tuples that refer to the same movie? Describe it briefly and point out one limitation of the method.

The sorted neighborhood method. It consists of a first phase, where a key composed of parts of every attribute is chosen; a second phase, where the tuples are sorted according to this key; and a third, where a fixed-size window slides over the set of tuples and only the tuples within the window are compared, using a set of matching rules. One limitation of this method is the possibility of losing matches.

5. (4 pts) Miscellaneous

5.1. (1.5 pts) In this course you have seen dynamic programming at work in several algorithms/techniques. In string matching, what is dynamic programming used for? How does it work? Explain in your own words. Use a diagram or example if needed, but do not copy content from the slides.

Answer: In string matching, dynamic programming is used to calculate the (minimum) edit distance between two given strings, where the possible edit operations are insertion, deletion, or substitution of characters. Basically, we build a matrix, and in each cell of that matrix we consider the possibility of using each of those edit operations, but only with respect to the neighboring cells (the neighbor on top, the neighbor on the left, and the neighbor on the diagonal top-left). Usually, each edit operation is defined as having a cost of 1 (one); the cost is 0 (zero) if there is a match between the characters in both strings. As we build the matrix (by filling in the value in each cell), we choose the option that yields the minimum accumulated cost. Once the matrix is fully built, we backtrack over those options to find the corresponding edit operations (which gives us the alignment between both strings).

5.2. (1.5 pts) In Hidden Markov Models (HMMs), what is dynamic programming used for? How does it work? Explain in your own words. Use a diagram or example if needed, but do not copy content from the slides.

Answer: In HMMs, dynamic programming is used to find the most likely sequence of states for a given observed sequence of symbols. This is called the Viterbi algorithm. Basically, we need to find which state generated each symbol. At first sight, it could seem that we would have to consider every possible combination of states for the symbols in the observed sequence. However, there are transition probabilities between states (and symbol emission probabilities in each state), so the probability of the best path ending in a given state at symbol i+1 can be computed from the probabilities of the best paths ending at symbol i. Therefore, at each step we keep, for each state, only the most likely path that reaches it (instead of keeping all possible combinations of transitions). Once we reach the end of the sequence, we backtrack to recover the sequence of states that yields the highest total probability.

5.3. (1 pt) Now that you have seen dynamic programming at work in different places, what is the essence of dynamic programming? How would you describe it in general terms? What is so special about dynamic programming that makes it a good choice to solve certain problems? What do these problems have in common?

Answer: In string matching, dynamic programming allows us to find a globally optimal alignment by performing a local minimization of the accumulated cost between neighboring cells. In HMMs, dynamic programming allows us to find a globally optimal sequence of states by performing a local maximization of the transition (and symbol emission) probabilities between consecutive states. Therefore, dynamic programming can be applied to those problems where a globally optimal solution can be found by a series of locally optimal decisions over subproblems whose solutions can be reused.
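To make the two answers above concrete, here are minimal sketches of both dynamic programs (illustrations only; the toy HMM probabilities are invented for the example):

def edit_distance(x, y):
    # Levenshtein distance: cost 1 for insertion/deletion/substitution, 0 for a match.
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of x[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion (neighbor on top)
                          d[i][j - 1] + 1,        # insertion (neighbor on the left)
                          d[i - 1][j - 1] + sub)  # match/substitution (diagonal)
    return d[m][n]

def viterbi(obs, states, start, trans, emit):
    # Most likely state sequence; v[s] is the probability of the best path
    # ending in state s, and back stores the predecessors for backtracking.
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev = {s: max(states, key=lambda r: v[r] * trans[r][s]) for s in states}
        v = {s: v[prev[s]] * trans[prev[s]][s] * emit[s][o] for s in states}
        back.append(prev)
    path = [max(states, key=v.get)]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return list(reversed(path))

print(edit_distance("Clooney", "Cloony"))  # 1 (one deletion)

states = ["Rain", "Sun"]
start = {"Rain": 0.6, "Sun": 0.4}
trans = {"Rain": {"Rain": 0.7, "Sun": 0.3}, "Sun": {"Rain": 0.4, "Sun": 0.6}}
emit = {"Rain": {"walk": 0.1, "umbrella": 0.9}, "Sun": {"walk": 0.8, "umbrella": 0.2}}
print(viterbi(["umbrella", "walk", "umbrella"], states, start, trans, emit))
# ['Rain', 'Sun', 'Rain']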