INSTITUTO SUPERIOR TÉCNICO GESTÃO E TRATAMENTO DE INFORMAÇÃO

Number:                Name:

Exam 2 - solution
30 January 2015

The duration of this exam is 2.5 hours.
You can access your own written materials, but the exam is to be done individually.
You are not allowed to use computers, tablets, nor mobile phones.
The maximum grade of the exam is 20 pts.
Write your answers below the questions.
Write your number and name at the top of each page.
Present all calculations performed.
After the exam starts, you can leave the room one hour after delivering the exam.

The following table is to be used by instructors ONLY:

    1    2    3    4    5    SUM
    4    4    4    4    4    20


1. (4 pts) XML Data Management Technology

Consider the following XML document:

    <dvdcollection>
      <dvd>
        <title>good Night, and Good Luck</title>
        <release-year>2005</release-year>
        <director>george Clooney</director>
        <actors>
          <actor>george Clooney</actor>
          <actor>jeff Daniels</actor>
          <actor>david Strathairn</actor>
        </actors>
      </dvd>
      <dvd>
        <title>they Live</title>
        <release-year>1988</release-year>
        <director>john Carpenter</director>
        <actors>
          <actor>roddy Piper</actor>
          <actor>keith David</actor>
          <actor>meg Foster</actor>
        </actors>
      </dvd>
    </dvdcollection>

1.1. (2.5 pts) Present XPath expressions that, using the XML document, answer the following information needs:

1.1.1. What are the titles of movies directed by John Carpenter, where Roddy Piper was the leading actor (i.e., the first actor appearing in the list of actors)?

    //dvd[./director="john Carpenter"][.//actor[1]="roddy Piper"]/title

1.1.2. Who are the actors, in the XML dataset, that are also directors of movies released after 1995?

    //actor[text() = //dvd[./release-year > 1995]/director]

1.1.3. Who is the director of the oldest movie featuring Jeff Daniels as an actor?

    //dvd[.//actor="jeff Daniels" and ./release-year = min(//dvd[.//actor="jeff Daniels"]/release-year)]/director
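As a cross-check of the three XPath answers in 1.1, the same information needs can be computed over the example document in plain Python. This is only a sketch: it uses the standard library's `xml.etree.ElementTree` (which supports just a subset of XPath) together with ordinary Python filtering, and the helper names `actors` and `year` are introduced here for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<dvdcollection>
  <dvd>
    <title>good Night, and Good Luck</title>
    <release-year>2005</release-year>
    <director>george Clooney</director>
    <actors>
      <actor>george Clooney</actor>
      <actor>jeff Daniels</actor>
      <actor>david Strathairn</actor>
    </actors>
  </dvd>
  <dvd>
    <title>they Live</title>
    <release-year>1988</release-year>
    <director>john Carpenter</director>
    <actors>
      <actor>roddy Piper</actor>
      <actor>keith David</actor>
      <actor>meg Foster</actor>
    </actors>
  </dvd>
</dvdcollection>"""

root = ET.fromstring(XML)
dvds = root.findall("dvd")


def actors(dvd):
    """Actor names of one dvd element, in document order (hypothetical helper)."""
    return [a.text for a in dvd.findall("actors/actor")]


def year(dvd):
    """Release year of one dvd element as an integer (hypothetical helper)."""
    return int(dvd.findtext("release-year"))


# 1.1.1: titles directed by john Carpenter with roddy Piper as first actor
q1 = [d.findtext("title") for d in dvds
      if d.findtext("director") == "john Carpenter"
      and actors(d)[:1] == ["roddy Piper"]]

# 1.1.2: actors who also directed a movie released after 1995
recent_directors = {d.findtext("director") for d in dvds if year(d) > 1995}
q2 = [a for d in dvds for a in actors(d) if a in recent_directors]

# 1.1.3: director of the oldest movie featuring jeff Daniels
with_daniels = [d for d in dvds if "jeff Daniels" in actors(d)]
oldest = min(year(d) for d in with_daniels)
q3 = [d.findtext("director") for d in with_daniels if year(d) == oldest]

print(q1, q2, q3)
```

On the two-movie document above, each query returns a single value.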

1.2. (1 pt) Present an XQuery expression that, using the XML document, lists all movies that were directed by actors in the movie entitled Good Night, and Good Luck. Movies in the results should be sorted according to the release year, from oldest to newest.

    let $a := //dvd[./title="good Night, and Good Luck"]//actor
    for $m in //dvd
    where $m/director/text() = $a/text()
    order by $m/release-year ascending
    return $m

1.3. (0.5 pt) Present an XQuery updating expression for changing the XML document, deleting all but the leading actor in the movies that were released prior to 1990, and adding an attribute rating = "awesome" to the dvd elements corresponding to movies directed by John Carpenter.

    (
      for $m in //dvd[release-year < 1990]
      let $a := $m/actors/actor[position() > 1]
      return delete nodes $a,
      for $m in //dvd[director="john Carpenter"]
      return insert node attribute rating { "awesome" } into $m
    )
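The effect of the two XQuery answers (1.2 and 1.3) on the example document can be imitated in Python for checking purposes. The sketch below assumes the two-movie document from Question 1 and uses `xml.etree.ElementTree`; it is not XQuery, and it applies the updates one after the other, which gives the same result here only because the delete and the insert touch different nodes.

```python
import xml.etree.ElementTree as ET

XML = """<dvdcollection>
  <dvd>
    <title>good Night, and Good Luck</title>
    <release-year>2005</release-year>
    <director>george Clooney</director>
    <actors>
      <actor>george Clooney</actor>
      <actor>jeff Daniels</actor>
      <actor>david Strathairn</actor>
    </actors>
  </dvd>
  <dvd>
    <title>they Live</title>
    <release-year>1988</release-year>
    <director>john Carpenter</director>
    <actors>
      <actor>roddy Piper</actor>
      <actor>keith David</actor>
      <actor>meg Foster</actor>
    </actors>
  </dvd>
</dvdcollection>"""

root = ET.fromstring(XML)
dvds = root.findall("dvd")

# 1.2: movies directed by an actor of "good Night, and Good Luck",
# sorted by release year from oldest to newest
cast = {a.text
        for d in dvds if d.findtext("title") == "good Night, and Good Luck"
        for a in d.findall("actors/actor")}
directed_by_cast = sorted(
    (d for d in dvds if d.findtext("director") in cast),
    key=lambda d: int(d.findtext("release-year")),
)
titles_1_2 = [d.findtext("title") for d in directed_by_cast]

# 1.3: keep only the leading actor in movies released before 1990,
# and add rating="awesome" to movies directed by john Carpenter
for d in dvds:
    if int(d.findtext("release-year")) < 1990:
        actors_el = d.find("actors")
        for extra in actors_el.findall("actor")[1:]:
            actors_el.remove(extra)
    if d.findtext("director") == "john Carpenter":
        d.set("rating", "awesome")

print(titles_1_2)
print(ET.tostring(root, encoding="unicode"))
```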

2. (4 pts) Web Data Extraction

Consider the following trees, representing two data records encoding information about a family tree.

2.1. (2.5 pts) Compute the similarity (i.e., the number of matching nodes), using the Simple Tree Matching (STM) algorithm, and considering that two nodes can be aligned if they share the same label.
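For reference, the STM recurrence used in 2.1 can be sketched in Python. The trees of this exam are not reproduced in this text, so the two trees below (`tree_a`, `tree_b`) are hypothetical examples, and the representation of a node as a `(label, children)` pair is an assumption made for this sketch.

```python
def stm(a, b):
    """Simple Tree Matching: number of matching nodes between trees a and b.

    Each tree is a (label, children) pair, where children is a list of trees.
    Nodes can be aligned only if they share the same label.
    """
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0
    m, n = len(kids_a), len(kids_b)
    # M[i][j]: best matching between the first i subtrees of a
    # and the first j subtrees of b, preserving sibling order
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = stm(kids_a[i - 1], kids_b[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1  # +1 for the matching roots


# Hypothetical trees, not the ones from the exam figure
tree_a = ("a", [("b", []), ("c", [("d", [])])])
tree_b = ("a", [("c", [("d", [])]), ("e", [])])

similarity = stm(tree_a, tree_b)
print(similarity)
```

In this hypothetical example the matching nodes are a, c and d, so the similarity is 3.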

2.2. (1 pt) Compute the alignment between the trees, using the calculations performed for the previous question (make clear the backtracking process that reaches the specified alignment).

The backtracking is shown in pink in the previous question.

2.3. (0.5 pt) Knowing that the STM algorithm is a simplification of a more general tree matching algorithm, give an example of two HTML trees containing a data record that would not be captured by STM, but could be captured if the general algorithm was used. Explain why this would happen.

Consider, for example, HTML pages containing data records with information on books, where in some cases the title is encoded using <strong> and in others using <em>. This could be captured by the general algorithm but not by STM, since STM discards nodes with different labels.

3. (4 pts) Data Integration

Suppose a data source S storing the following tables:

    Movie(movie_name, year, director_name)
    Play(movie_name, person_name)
    Person(person_name, nationality)

3.1. (2.5 pts)

3.1.1. Rewrite the following SQL query as a conjunctive query:

    SELECT m.movie_name, m.director_name
    FROM Movie m, Play p, Person a
    WHERE m.movie_name = p.movie_name
      AND p.person_name = a.person_name
      AND a.nationality = 'Portuguese'
    UNION ALL
    SELECT m.movie_name, m.director_name
    FROM Movie m
    WHERE m.year = 1995

    Q(m, d) :- Movie(m, y, d), Play(m, p), Person(p, 'Portuguese')
    Q(m, d) :- Movie(m, 1995, d)

3.1.2. Suppose you have the following mediated schema M:

    Portuguese-movies(movie_name, year)

which represents the names and years of movies whose actors are Portuguese or whose director is Portuguese. Write a global-as-view mapping between the mediated schema M and the data source schema S.

    Portuguese-movies(m, y) = Movie(m, y, d), Play(m, p), Person(p, 'Portuguese')
    Portuguese-movies(m, y) = Movie(m, y, d), Person(d, 'Portuguese')

3.1.3. Write a conjunctive query in terms of the mediated schema that returns the names of Portuguese movies directed after 1995. Then, unfold it and rewrite it in terms of the tables of data source S.

    Q(m) :- Portuguese-movies(m, y), y > 1995

Unfolding:

    Q(m) :- Movie(m, y, d), Play(m, p), Person(p, 'Portuguese'), y > 1995
    Q(m) :- Movie(m, y, d), Person(d, 'Portuguese'), y > 1995

3.2. (1 pt) Suppose you have a pre-computed view:

    Portuguese-Person(m, p) :- Play(m, p), Person(p, 'Portuguese')

How would you write the conjunctive query of Question 3.1.2 using the view Portuguese-Person?

    Portuguese-movies(m, y) = Movie(m, y, d), Portuguese-Person(m, p)
    Portuguese-movies(m, y) = Movie(m, y, d), Person(d, 'Portuguese')

3.3. (0.5 pt) For the following pair of queries, state which relationship exists (equivalence or containment) between them. Justify.

    Q1(A, B, E) :- T(A, B, C), R(C, E), T(A, B, E), R(E, C)
    Q2(U, V, Z) :- T(U, V, Z), R(Z, 5)

There is no relationship: neither query is contained in the other. For Q1 to be contained in Q2, a containment mapping from Q2 to Q1 would have to send R(Z, 5) to an atom of Q1, but no atom of Q1 contains the constant 5. For Q2 to be contained in Q1, the head forces A, B, E to map to U, V, Z, and T(A, B, C) then forces C to map to Z, so R(C, E) would map to R(Z, Z), which is not an atom of Q2.
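The containment test for the last pair of queries (Q1 and Q2) can be checked mechanically by searching for a containment mapping. The following is a minimal brute-force sketch, assuming conjunctive queries without comparison predicates; the encoding (variables as Python strings, constants as non-strings such as the integer 5) and the function name `contained_in` are choices made for this illustration only.

```python
from itertools import product


def is_var(term):
    """In this sketch, variables are strings and constants are non-strings."""
    return isinstance(term, str)


def contained_in(q1, q2):
    """True if q1 is contained in q2, i.e. a containment mapping q2 -> q1 exists.

    A query is a pair (head, body): head is a tuple of terms and body is a
    list of (predicate, args) atoms.
    """
    head1, body1 = q1
    head2, body2 = q2
    vars2 = sorted({t for _, args in body2 for t in args if is_var(t)}
                   | {t for t in head2 if is_var(t)})
    terms1 = sorted({t for _, args in body1 for t in args} | set(head1), key=str)
    atoms1 = {(pred, tuple(args)) for pred, args in body1}
    # Try every assignment of q2's variables to terms of q1
    for image in product(terms1, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        def sub(t):
            return h.get(t, t)  # constants map to themselves
        if tuple(sub(t) for t in head2) != tuple(head1):
            continue
        if all((pred, tuple(sub(t) for t in args)) in atoms1
               for pred, args in body2):
            return True
    return False


Q1 = (("A", "B", "E"),
      [("T", ("A", "B", "C")), ("R", ("C", "E")),
       ("T", ("A", "B", "E")), ("R", ("E", "C"))])
Q2 = (("U", "V", "Z"),
      [("T", ("U", "V", "Z")), ("R", ("Z", 5))])

print(contained_in(Q1, Q2), contained_in(Q2, Q1))
```

The search is exponential in the number of variables, which is acceptable for queries of this size.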

4. (4 pts) Data Cleaning and Integration

4.1. (2.5 pts) Suppose the following two tuples:

    Good Night, and Good Luck | 2005 | George Clooney | George Cloony, Jeff Daniels, David Strathairn | nice well directed exceptional actors
    Good Night Good Luck | 2006 | George Clooney | Jeff Daniels and George Clooney and David Strahtairn | wonderful nicely directed good actors

of a table with schema:

    Movies(movie_name, year, director, actors, review)

The goal is to automatically detect that the two tuples refer to the same movie.

4.1.1. Which string matching algorithm would you use to compare the movie names? Justify. Would you use the same string matching algorithm to compare the reviews? Justify.

We could use edit distance, for instance, because the movie names are medium-sized strings. To compare the reviews, edit distance would not give good results, because the same words can occur in different positions. A possibility is to use TF/IDF.

4.1.2. Now, imagine you want to identify if the lists of actors of the two tuples are similar. Would you apply a string matching algorithm directly to the two strings that represent the actors in each record? If no, what would you do?

We cannot apply a string matching algorithm directly to the two strings, because the actor names are separated by different separators and do not occur in the same order. It would be better to first split the actors field into one tuple per actor and store the actor tuples in a distinct table. Then a string matching algorithm could be applied.

4.1.3. Which string matching algorithm is appropriate to compare person names? Use that algorithm to compute the similarity between Clooney and Cloony in the two tuples and between Strahtair and Strathair. Do they return the same value? Why?

The Jaro measure is appropriate for short strings such as names:

    Jaro(x, y) = 1/3 [ c/|x| + c/|y| + (c - t/2)/c ]

where c is the number of common characters and t is the number of transposed characters.

Jaro(Clooney, Cloony):

    |x| = 7, |y| = 6
    Common chars: c = 6
    Transposed: t = 0
    Jaro = 1/3 (6/7 + 6/6 + 6/6) = 1/3 (0.857 + 1 + 1) = 0.95

Jaro(Strahtair, Strathair):

    |x| = 9, |y| = 9
    Common chars: c = 9
    Transposed: t = 2
    Jaro = 1/3 (9/9 + 9/9 + (9 - 1)/9) = 1/3 (1 + 1 + 0.889) = 0.96

The two values are not the same. Although in both pairs the number of common characters equals the length of one of the words, the second pair has 2 transposed characters, which decreases its similarity value.

4.2. (1 pt) Consider now only the possible values of the attribute review. Besides the two values represented above (denoted t1 and t2, respectively) that correspond to positive reviews, consider that you have another two instances denoted t3 and t4 that correspond to negative reviews. Suppose as well that the review attribute values have undergone a normalization process. The resulting set of reviews is as follows:

    t1: {nice, well, directed, exceptional, actor}    positive
    t2: {wonderful, nice, directed, good, actor}      positive
    t3: {medium, film, terrible, direction, actor}    negative
    t4: {poor, directed, medium, film}                negative

Now, suppose we have another table with schema T(Y) and we have one tuple of that table <nice, well, actor, good, directed>. Use a Naïve Bayes Learner to learn with the four possible instances of the review attribute of the Movies table (t1, t2, t3, and t4) and then to predict whether the value of attribute Y refers to a positive or a negative review.

    d: {nice, well, actor, good, directed}

    P(positive | d) = P(d | positive) P(positive) / P(d)
    P(negative | d) = P(d | negative) P(negative) / P(d)

    c_d = arg max_ci [ P(d | ci) P(ci) ], where ci is positive or negative

We need to estimate P(d | ci) and P(ci).

P(ci) is the portion of the training instances with label ci:

    P(positive) = 0.5
    P(negative) = 0.5

Number of words in the instances of each class:

    n(positive) = 10
    n(negative) = 9

    P(d | positive) = P(nice | positive) P(well | positive) P(actor | positive) P(good | positive) P(directed | positive)

    P(nice | positive) = n(nice, positive)/n(positive) = 2/10
    P(good | positive) = n(good, positive)/n(positive) = 1/10
    P(actor | positive) = 2/10
    P(well | positive) = 1/10
    P(directed | positive) = 2/10

    P(d | positive) P(positive) = 0.5 * 8/(10*10*10*10*10) = 4 * 10^-5

    P(d | negative) = P(good | negative) P(nice | negative) P(actor | negative) P(well | negative) P(directed | negative)

    P(good | negative) = n(good, negative)/n(negative) = 0
    P(nice | negative) = 0
    P(actor | negative) = 1/9
    P(well | negative) = 0
    P(directed | negative) = 1/9

    P(d | negative) = 0

So the answer is: positive review.

4.3. (0.5 pt) Suppose that you have 1 million tuples stored in the Movies table. Which method do you suggest to use to optimize the time needed to find all the tuples that refer to the same movie? Describe it briefly and point out one limitation of the method.

Sorted neighborhood method. It consists of a first phase where a key composed of parts of every attribute is chosen, a second phase where the tuples are sorted according to this key, and a third phase where a fixed-size window slides over the sorted tuples and only the tuples within the window are compared, using a set of matching rules. One limitation of this method is the possibility of losing matches: two tuples that refer to the same movie are never compared if their keys sort them further apart than the window size.
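The arithmetic in 4.1.3 (Jaro) and 4.2 (Naïve Bayes) can be re-checked with a short script. This is a sketch under assumptions: the Jaro function uses the common matching window of floor(max(|x|, |y|)/2) - 1 positions, and the Naïve Bayes scores use plain relative frequencies without smoothing, as in the computation above. The function names `jaro` and `nb_scores` are introduced here for illustration.

```python
from collections import Counter


def jaro(s1, s2):
    """Jaro similarity: 1/3 * (c/|s1| + c/|s2| + (c - t/2)/c)."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    matched1 = []
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    matched2 = [s2[j] for j in range(len(s2)) if used[j]]
    c = len(matched1)
    if c == 0:
        return 0.0
    # t counts common characters that appear in a different order
    t = sum(a != b for a, b in zip(matched1, matched2))
    return (c / len(s1) + c / len(s2) + (c - t / 2) / c) / 3


def nb_scores(doc, training):
    """Unsmoothed Naive Bayes score P(d | c) * P(c) for each class c."""
    class_counts = Counter(label for _, label in training)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in training:
        word_counts[label].update(tokens)
    scores = {}
    for label, count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = count / len(training)  # prior P(c)
        for word in doc:
            score *= word_counts[label][word] / total_words
        scores[label] = score
    return scores


training = [
    (["nice", "well", "directed", "exceptional", "actor"], "positive"),
    (["wonderful", "nice", "directed", "good", "actor"], "positive"),
    (["medium", "film", "terrible", "direction", "actor"], "negative"),
    (["poor", "directed", "medium", "film"], "negative"),
]
doc = ["nice", "well", "actor", "good", "directed"]

j1 = jaro("Clooney", "Cloony")
j2 = jaro("Strahtair", "Strathair")
scores = nb_scores(doc, training)
prediction = max(scores, key=scores.get)
print(round(j1, 2), round(j2, 2), scores, prediction)
```

Because no smoothing is applied, a single unseen word makes the score of a class zero, which is what happens for the negative class here.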

5. (4 pts) Miscellaneous

5.1. (1.5 pts) In this course you have seen dynamic programming at work in several algorithms/techniques. In string matching, what is dynamic programming used for? How does it work? Explain in your own words. Use a diagram or example if needed, but do not copy content from the slides.

Answer: In string matching, dynamic programming is used to calculate the (minimum) edit distance between two given strings, where the possible edit operations are insertion, deletion, or substitution of characters. Basically, we build a matrix and in each cell of that matrix we consider the possibility of using each of those edit operations, but only with respect to the neighboring cells (the neighbor on top, the neighbor on the left, and the neighbor on the diagonal top-left). Usually, each edit operation is defined as having a cost of 1. The cost is 0 if there is a match between the characters in both strings. As we build the matrix (by filling in the value in each cell), we choose the option that yields the minimum accumulated cost. Once the matrix is fully built, we backtrack over those options to find the corresponding edit operations (which gives us the alignment between both strings).

5.2. (1.5 pts) In Hidden Markov Models (HMMs), what is dynamic programming used for? How does it work? Explain in your own words. Use a diagram or example if needed, but do not copy content from the slides.

Answer: In HMMs, dynamic programming is used to find the most likely sequence of states for a given observed sequence of symbols. This is called the Viterbi algorithm. Basically, we need to find which state generated each symbol. At first sight, it could seem that we would have to consider every possibility of each state generating each symbol in the observed sequence. However, there are transition probabilities between states (and symbol emission probabilities in each state), so if we know which state generated symbol i, we can determine which state is more likely to have generated symbol i+1. Therefore, at each step we keep, for each state, only the most probable way of reaching it (instead of keeping all possible transitions). Once we reach the end of the sequence, we can backtrack over the sequence of states which yields the highest total probability.

5.3. (1 pt) Now that you have seen dynamic programming at work in different places, what is the essence of dynamic programming? How would you describe it in general terms? What is so special about dynamic programming that makes it a good choice to solve certain problems? What do these problems have in common?

Answer: In string matching, dynamic programming allows us to find a globally optimal alignment by doing a local minimization of the accumulated cost between neighboring cells. In HMMs, dynamic programming allows us to find a globally optimal sequence of states by doing a local maximization of the transition (and symbol emission) probabilities between consecutive states. Therefore, it seems that dynamic programming can be applied to those problems where a globally optimal solution can be found by a series of locally optimal decisions.
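The two uses of dynamic programming discussed in 5.1 and 5.2 can be put side by side in a short sketch. The edit distance below uses unit costs for insertion, deletion and substitution, as described in 5.1. The HMM used to exercise the Viterbi function (states `Healthy`/`Fever` and their probabilities) is a hypothetical textbook-style example, not something taken from this exam.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions (cost 1 each)."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # neighbor on top
                          D[i][j - 1] + 1,         # neighbor on the left
                          D[i - 1][j - 1] + cost)  # diagonal neighbor
    return D[m][n]


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for obs, and its probability."""
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # keep only the most probable way of reaching state s
            prev = max(states, key=lambda r: best[-1][r] * trans_p[r][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):  # backtrack from the last step
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical HMM, for illustration only
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

dist = edit_distance("Clooney", "Cloony")
path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(dist, path, prob)
```

In both functions each cell (or each state at each step) is computed once from a small set of previously computed neighbors, which is the shared structure the answer to 5.3 points at.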