INSTITUTO SUPERIOR TÉCNICO Gestão e Tratamento de Informação

Similar documents
INSTITUTO SUPERIOR TÉCNICO GESTÃO E TRATAMENTO DE INFORMAÇÃO

INSTITUTO SUPERIOR TÉCNICO Administração e optimização de Bases de Dados

INSTITUTO SUPERIOR TÉCNICO Administração e optimização de Bases de Dados

Data Integration. Lecture 23. Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems. CompSci 516: Data Intensive Computing Systems

INSTITUTO SUPERIOR TÉCNICO Administração e optimização de Bases de Dados

Computer Science 425 Fall 2006 Second Take-home Exam Out: 2:50PM Wednesday Dec. 6, 2006 Due: 5:00PM SHARP Friday Dec. 8, 2006

Practice Questions for Midterm

CIS 550 Fall Final Examination. December 13, Name: Penn ID:

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

Repetition Algorithms

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Birkbeck (University of London)

CSE 344 Midterm. Wednesday, February 19, 2014, 14:30-15:20. Question Points Score Total: 100

Large-Scale Duplicate Detection

CS145 Midterm Examination

TABLE OF CONTENTS PAGE TITLE NO.

Problem Pts Score Grader Problem Pts Score Grader

Data Analytics. Qualification Exam, May 18, am 12noon

Parts of Speech, Named Entity Recognizer

Gestão e Tratamento da Informação

CSE 344 Midterm. November 9, 2011, 9:30am - 10:20am. Question Points Score Total: 100

CS145 Midterm Examination

COP5725 Database Management Systems Final Fall 2005

PrepAwayExam. High-efficient Exam Materials are the best high pass-rate Exam Dumps

CSE-6490B Final Exam

Fast Contextual Preference Scoring of Database Tuples

Graham Kemp (telephone , room 6475 EDIT) The examiner will visit the exam room at 09:30 and 11:30.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

CSE 444 Final Exam. August 21, Question 1 / 15. Question 2 / 25. Question 3 / 25. Question 4 / 15. Question 5 / 20.

CSE 494: Information Retrieval, Mining and Integration on the Internet

CSE 344 Midterm. November 9, 2011, 9:30am - 10:20am. Question Points Score Total: 100

Computer Science 597A Fall 2008 First Take-home Exam Out: 4:20PM Monday November 10, 2008 Due: 3:00PM SHARP Wednesday, November 12, 2008

Information Integration

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

Database Management Systems (COP 5725) (Spring 2007)

CSE 158, Fall 2017: Midterm

CS40 Exam #2 November 14, 2001

ˆ The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

Section A. 1. a) Explain the evolution of information systems into today s complex information ecosystems and its consequences.

Entity Linking. David Soares Batista. November 11, Disciplina de Recuperação de Informação, Instituto Superior Técnico

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CSE332 Summer 2010: Final Exam

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

EECS-3421a: Test #2 Queries

Predict the box office of US movies

XML Problem. Specification of the Publication Entity:

To earn the extra credit, one of the following has to hold true. Please circle and sign.

Question Score Points Out Of 25

Final Exam. Introduction to Artificial Intelligence. CS 188 Spring 2010 INSTRUCTIONS. You have 3 hours.

CSCI-6421 Final Exam York University Fall Term 2004

NJIT Department of Computer Science PhD Qualifying Exam on CS 631: DATA MANAGEMENT SYSTEMS DESIGN. Summer 2012

Exam 2 Study Guide. Denny Hood Computer Science 101

Relational Model, Key Constraints

CS371R: Final Exam Dec. 18, 2017

CPSC 340: Machine Learning and Data Mining. Multi-Class Classification Fall 2017

CS 474, Spring 2016 Midterm Exam #2

Databasesystemer, forår 2005 IT Universitetet i København. Forelæsning 8: Database effektivitet. 31. marts Forelæser: Rasmus Pagh

Oracle Database 12c: Use XML DB

CSC 126 FINAL EXAMINATION Spring Total Possible TOTAL 100

15-780: Problem Set #2

The University of British Columbia

10708 Graphical Models: Homework 2

Wild Mushrooms Classification Edible or Poisonous

MID II Tuesday, 1 st April 2008

Content-based Recommender Systems

Stat 342 Exam 3 Fall 2014

Exam I Computer Science 420 Dr. St. John Lehman College City University of New York 12 March 2002

Chapter 2: Frequency Distributions

Data Mining. 2.4 Data Integration. Fall Instructor: Dr. Masoud Yaghini. Data Integration

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

Name: Database Systems ( 資料庫系統 ) Midterm exam, November 15, 2006

Tutorial: Input Grades in Blackboard

King Fahd University of Petroleum and Minerals. Write clearly, precisely, and briefly!!

STA 4273H: Statistical Machine Learning

Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage. Yun Zhou. AMBN-2015, Yokohama, Japan 16/11/2015

CSE 344 Midterm. Monday, Nov 4, 2013, 9:30-10:20. Question Points Score Total: 100

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1

This is a set of practice questions for the final for CS16. The actual exam will consist of problems that are quite similar to those you have

Instructor: Craig Duckett. Lecture 03: Tuesday, April 3, 2018 SQL Sorting, Aggregates and Joining Tables

CSE 344 Midterm. Wednesday, February 19, 2014, 14:30-15:20. Question Points Score Total: 100

Robust and Efficient Fuzzy Match for Online Data Cleaning. Motivation. Methodology

CSE 190D Spring 2017 Final Exam

Key Skills IT Level 3 AFFORDABLE ARTWORK. Classroom Activity MY SKETCHES

Fall 2018 CSE 482 Big Data Analysis: Exam 1 Total: 36 (+3 bonus points)

Computer Science E-119 Practice Midterm

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

Collective Entity Resolution in Relational Data

CPSC 310: Database Systems / CSPC 603: Database Systems and Applications Exam 2 November 16, 2005

Final Examination. Winter Problem Points Score. Total 180

Web Services Interfaces

CS222P Fall 2017, Final Exam

Semi-structured Data 11 - XSLT

Introduction to Algorithms May 14, 2003 Massachusetts Institute of Technology Professors Erik Demaine and Shafi Goldwasser.

McGill April 2009 Final Examination Database Systems COMP 421

Introduction to Computer Science. Homework 1

Database System Concepts

1 (10) 2 (8) 3 (12) 4 (14) 5 (6) Total (50)

Written Exam Data Warehousing and Data Mining course code:

Math 355: Linear Algebra: Midterm 1 Colin Carroll June 25, 2011

Transcription:

-------------------------------------------------------------------------------------------------------------- INSTITUTO SUPERIOR TÉCNICO Gestão e Tratamento de Informação Exam 1 16 January 2011 -------------------------------------------------------------------------------------------------------------- The duration of this exam is 2 Hours. You can access your own paper material, but the exam is to be done individually. You are not allowed to use computers nor mobile phones. The maximum grade of the exam is 20 pts. Write your answers below the questions. Write your number and name at the top of each page. Present all calculations performed. After the exam starts, you can leave the room one hour after and after delivering the exam. To be used by instructors, ONLY: 1 2 3 4 5 SUM 4 4 4 4 4 20 1

1. (4 pts) extensible Markup Language (XML), XPath and XQuery Consider the following example XML document, encoding information about plastic artists and their works of art. <list-artists-works> <artist id="1"> <name>vincent van Gogh</name> <genre>expressionism</genre> <genre>post-impressionism</genre> </artist> <artist id="2"> <name>claude Monet</name> <genre>impressionism</genre> </artist> <artist id="3"> <name>pablo Picasso</name> <genre>modernism</genre> <genre>cubism</genre> <genre>expressionism</genre> </artist> <!-- list of remaining artists --> <work artist="1">os Comedores de Batata</work> <work artist="1">os Girassóis</work> <work artist="1">a Noite Estrelada</work> <work artist="2">impressão, nascer do sol</work> <work artist="3">dora Maar au chat</work> <!-- list of remaining works --> </list-artists-works> 1.1. (2,5 pts) Present an XQuery expression for returning the following: List the names of artists from a genre related with "impressionism" (i.e., with a genre containing "impressionism" as part of its name), sorted, in ascending order, by the number of their works. 2

1.2. (1 pt) Present an XPath expression for answering the following information need: List the names of all the works of art authored by artists that have worked in more than two different genres. 1.3. (0,5 pt) Present an XQuery Update expression for modifying the XML document in order to: o add, to each artist element, a new element entitled "number-works" containing the number of works authored by that artist, and o change each of the "work" elements in order to replace the value of the "artist" attribute by the name of the corresponding artist. The same XQuery Update expression should perform both modifications. 3

2. (4 pts) Web data and information extraction Consider the following character strings: VODKALIMA MESKALINA 2.1. (2,5 pts) Using the dynamic programming algorithm, compute the edit distance between both strings. 2.2. (1 pt) Present a minimal alignment between the strings. You must present the corresponding backtracking path on the distance matrix computed for the previous question. 4

2.3. (0,5 pt) Assume the cost of replacing a character is twice the cost of inserting/removing a character. What would be the edit distance in this case? Justify your answer. 5

3. (4 pts) Data integration 3.1. (2,5 pts) Consider a mediator schema with the following two relations: Movie(title, year, type) Schedule(cinema, title, hour) a) Write in Datalog and using the mediator schema the query: which cinemas show movies of type comedy? b) Now suppose the following views were defined over the mediator: Movies70(TT,Y) :- Movie(TT,Y,-), Y >= 1970 MoviesMonumental(TT) :- Schedule(C,TT,-), C = Monumental How would you express the following query using the views: In which year were produced the movies that are shown in Monumental? 3.2. (1 pt) Unfold the query of question 3.1 b) so that the query is expressed only in terms of the base relations. 6

3.3. (0,5 pt) Explain what is the purpose of the schema matching phase of a schema mapping process. Give an example of two techniques that can be used for schema matching. 7

4. (4 pts) Data cleaning 4.1. (2,5 pts) Compute the value of the Jaro-Winkler measure for the two names: Jonhatan and Johnny. Use 0.1 as the weight to be given to the prefix. 4.2. (1 pt) Compute the value of the Jaccard measure for the same two names, assuming 2- grams. What can you conclude about the use of these two measures for computing the similarity between these two names? 8

4.3.(0,5 val) Explain why methods for optimizing the similarity join between two sets of strings, when the number of strings in each set is very large, were proposed. Give an example of one such method. Similarity join is an operation that accepts as input two sets of objects (e.g., strings), a similarity measure sim(), and a threshold value t. It returns all pairs of objects whose similarity value is greater than or equal to t. 9

5. (4 pts) Miscelaneous 5.1. (1 pt) Explain how the string edit distance measure can be used to extract data from Web pages. Illustrate your explanation with an example. 5.2. (1 pt) Write an XQuery function for computing the Jaccard coefficient between two sets of words, considering the following signature: jaccard-coefficient ( $s1 as xs:string*, $s2 as xs:string* ) as xs:double Explain how you could change the function in order to compute a "soft" similarity score between sets of words (i.e., a "soft-jaccard" score that would take into account similar words in the sets), using an internal comparator that you consider to be appropriate to this task. 10

5.3. (1 pt) Consider the following sequence of example sentences and the corresponding classifications: "the movie is great" -> positive opinion "it's a crap movie" -> negative opinion "really wonderful movie" -> positive opinion "really awful movie" -> negative opinion "I liked the movie, it is great" -> positive opinion Use the Naive Bayes algorithm to classify the sentence "really great movie" as expressing a positive or a negative opinion, with basis on the example sentences that were presented. 11

5.4. (1 pt) Explain briefly the main steps of a data cleaning process that are usually required. 12