
CIS 550 Fall 2013 Final Examination
December 13, 2013

Name:
Penn ID:
Email:

My signature below certifies that I have complied with the University of Pennsylvania's Code of Academic Integrity in completing this examination. (Exams without signatures will not be graded.)

Signature                              Date

Instructions: This is an open-book, open-notes, no-device exam: you may not make use of any electronic devices, not even calculators. Your mobile phones and MP3 players must be turned off and stored away.

You have 110 minutes to answer all of the questions. The entire exam is worth 120 points, giving you a guideline of spending approximately 1 minute per point per problem. Partial credit will be given. Do not spend disproportionate time on any one question. All correct answers are short. Meandering answers or brain dumps will not be given full points.

Write your answers in the spaces provided: you must turn in this printed exam. The back side of each page may be used as a scratch pad. Good luck!

Score
1-15: 30pts
16: 20pts
17: 15pts
18: 15pts
19: 15pts
20: 15pts
21: 10pts
Total Score:

[2 pts each] Please answer Questions 1-15 on the ScanTron sheet, NOT this sheet of paper. Use a Number 2 Pencil, and be careful to fully erase if you change your answer.

1. XML solves the most difficult issues in data interchange:

2. XML Schema enables key and foreign key constraints to be specified:

3. JSON can be parsed without knowing anything about the tags:

4. XML is in 3NF:

5. SQL is converted to relational algebra, which is then converted into relational calculus to be executed in the database query execution engine:

6. Local-as-view refers to schema mappings that are defined as queries over the mediated schema:

7. What is true about the results of evaluating an XPath?
   a. XPath returns an unordered set of nodes
   b. XPath returns an ordered multiset of nodes
   c. XPath returns an ordered set of nodes
   d. XPath returns an ordered multiset of nodes

8. Which of the following can reduce the possibility of SQL injection attacks:
   a. Prepared statements
   b. Dynamic SQL
   c. Views
   d. None of the above

9. NoSQL databases:
   a. Never use SQL, hence the name
   b. Always perform better than SQL databases
   c. Are especially appropriate for transactions
   d. None of the above

10. Virtual data integration or enterprise information integration means:
   a. The cloud is used to integrate data
   b. A central database is used to integrate data
   c. Extract-transform-load (ETL) scripts are used to integrate the data
   d. A virtual mediated schema is used
   e. None of the above

11. The hash join algorithm needs:
   a. A join condition (theta) that is a range condition
   b. Input relations that are sorted on their primary keys
   c. Input relations that are sorted on the join key
   d. Input relations that are unsorted

12. Sorting and hashing can be used to:
   a. Reduce the amount of data that needs to be considered in a multiple-pass algorithm
   b. Reduce the size of query results
   c. Project data early
   d. Convert data using MapReduce

13. An SQL query optimizer, such as that in Oracle or DB2, optimizes:
   a. The number of tuples according to cardinality estimates
   b. The number of requests according to workload
   c. The number of users according to predictions
   d. The estimated cost of the query according to a cost model

14. Which of the following is an ACID property:
   a. Concurrency
   b. Atomicity
   c. Idempotence
   d. Delivery

15. Cloud storage systems typically do not provide full ACID semantics because of:
   a. The number of clients being handled by the cloud
   b. The latency of communications across multiple servers
   c. The sizes of the databases
   d. The number of CEOs and CIOs who associate ACID with drugs and get a negative impression of the capability

16. [20pts] Given the document:

<items>
  <item>
    <type>book</type>
    <isbn>978-0385349949</isbn>
    <author key="121">Sheryl Sandberg</author>
    <title>Lean In</title>
  </item>
  <item>
    <type>book</type>
    <isbn>978-0385537858</isbn>
    <author key="123">Dan Brown</author>
    <title>Inferno</title>
  </item>
  <item>
    <type>movie</type>
    <star key="149">Steve Carell</star>
    <star key="300">Kristin Wiig</star>
    <director key="3">Chris Renaud</director>
    <director key="99">Pierre Coffin</director>
    <title>Despicable Me 2</title>
  </item>
</items>

a. Write an XPath to return all book titles by Dan Brown.

/items/item[type="book"][author="Dan Brown"]/title

or

/items/item[type="book"][author="Dan Brown"]/title/text()
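The answer to part (a) can be checked with a short script. This is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath; the path is written relative to the root `<items>` element rather than as an absolute path, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The sample document from Question 16, with well-formed tags.
DOC = """
<items>
  <item>
    <type>book</type>
    <isbn>978-0385349949</isbn>
    <author key="121">Sheryl Sandberg</author>
    <title>Lean In</title>
  </item>
  <item>
    <type>book</type>
    <isbn>978-0385537858</isbn>
    <author key="123">Dan Brown</author>
    <title>Inferno</title>
  </item>
  <item>
    <type>movie</type>
    <star key="149">Steve Carell</star>
    <star key="300">Kristin Wiig</star>
    <director key="3">Chris Renaud</director>
    <director key="99">Pierre Coffin</director>
    <title>Despicable Me 2</title>
  </item>
</items>
"""

root = ET.fromstring(DOC)  # root is the <items> element

# Same two predicates as the XPath answer: type = "book" and author = "Dan Brown".
# ElementTree paths are relative to the element they are called on, hence "./item".
title_nodes = root.findall("./item[type='book'][author='Dan Brown']/title")

# Equivalent of appending /text() to the XPath.
titles = [node.text for node in title_nodes]
print(titles)
```

Both predicates must hold for the same `item`, which is why the movie and the other book are filtered out.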

b. Write an XQuery to convert the books and authors into a relation-like form with foreign keys. (The author's key attribute contains the key value, and recall that the @ prefix in XPath can be used to query for an attribute.) Your code should be generic, but its output, over the sample data above, should look like:

<books>
  <author key="121">
    <name>Sheryl Sandberg</name>
  </author>
  <author key="123">
    <name>Dan Brown</name>
  </author>
  <book>
    <isbn>978-0385349949</isbn>
    <title>Lean In</title>
    <author-id>121</author-id>
  </book>
  <book>
    <isbn>978-0385537858</isbn>
    <title>Inferno</title>
    <author-id>123</author-id>
  </book>
</books>

<books> {
  for $k in distinct-values(doc("input.xml")/items/item/author/@key)
  let $a := (doc("input.xml")/items/item/author[@key = $k])[1]
  return <author key="{ $k }"><name>{ $a/text() }</name></author>,
  for $b in doc("input.xml")/items/item[type = "book"]
  return
    <book>
      { $b/isbn, $b/title }
      { for $a in $b/author return <author-id>{ string($a/@key) }</author-id> }
    </book>
} </books>
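To make the intended restructuring in part (b) concrete, here is a minimal Python sketch of the same transformation using `xml.etree.ElementTree`. It is not the XQuery answer itself; the function name `to_relational` and the shortened sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the Question 16 sample (the movie keeps only one star).
SAMPLE = """
<items>
  <item><type>book</type><isbn>978-0385349949</isbn>
    <author key="121">Sheryl Sandberg</author><title>Lean In</title></item>
  <item><type>book</type><isbn>978-0385537858</isbn>
    <author key="123">Dan Brown</author><title>Inferno</title></item>
  <item><type>movie</type><star key="149">Steve Carell</star>
    <title>Despicable Me 2</title></item>
</items>
"""

def to_relational(items_root):
    """Build <books> with one <author> per distinct key, then one <book> per book item."""
    books = ET.Element("books")
    book_items = [i for i in items_root.findall("item")
                  if i.findtext("type") == "book"]

    # "Author relation": emit each author key once, in first-seen order.
    seen_keys = set()
    for item in book_items:
        for a in item.findall("author"):
            key = a.get("key")
            if key not in seen_keys:
                seen_keys.add(key)
                author = ET.SubElement(books, "author", key=key)
                ET.SubElement(author, "name").text = a.text

    # "Book relation": the author is referenced by key only (the foreign key).
    for item in book_items:
        book = ET.SubElement(books, "book")
        ET.SubElement(book, "isbn").text = item.findtext("isbn")
        ET.SubElement(book, "title").text = item.findtext("title")
        for a in item.findall("author"):
            ET.SubElement(book, "author-id").text = a.get("key")
    return books

result = to_relational(ET.fromstring(SAMPLE))
print(ET.tostring(result, encoding="unicode"))
```

Deduplicating on the key rather than on the author name mirrors the foreign-key idea: two books by the same author would share a single `<author>` element.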

17. [15pts] Given the schema:

Users(login, first, last, address, email)
Friends(login1, login2) where login1, login2 reference Users

Convert the following query into a relational algebra expression:

SELECT user, F2.login2 AS recommendation
FROM Users U, Friends F1, Friends F2
WHERE U.email = 'me@me.com' AND U.login = F1.login1 AND F1.login2 = F2.login1

18. [15pts] Explain briefly what a clustered index means with respect to what data is stored in intermediate nodes, leaf nodes, and so on.

In a clustered index, data records are stored in the same order as the key of the index. Typically this means intermediate nodes in the index hold pivot values on the key, and the leaf nodes contain records. In unusual cases we can have a secondary index that's clustered, in that the leaf nodes contain pointers to data records that show up in the same exact order. (Consider a clustered primary index on lastname, firstname and a secondary clustered index on lastname.)
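For Question 17, the relational algebra answer is not reproduced in this text, so the following is only one possible translation, offered as a sketch. It assumes that the selected column `user` refers to `U.login` (the schema has no attribute named `user`), and it applies the selection on `email` before the joins, which is an equivalent rewriting rather than a required form.

```latex
\pi_{\,U.login \rightarrow user,\; F_2.login2 \rightarrow recommendation}
\Big(
  \big(
    \sigma_{email = \text{'me@me.com'}}\big(\rho_{U}(Users)\big)
    \bowtie_{U.login = F_1.login1}
    \rho_{F_1}(Friends)
  \big)
  \bowtie_{F_1.login2 = F_2.login1}
  \rho_{F_2}(Friends)
\Big)
```

The two renamings of Friends correspond to the aliases F1 and F2 in the FROM clause, and each join predicate is taken directly from the WHERE clause.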

19. [15pts] Given the B+ Tree:

R: 10 20 30 81   (children: A, B, C, D, I)
  D: 36 42 51   (children: E, F, G, H)
    E: 30* 31*
    F: 36* 38*
    G: 42* 43*
    H: 51* 52* 56* 60*
  I: 94 98   (children: J, K, L)
    J: 81* 82*
    K: 94* 95* 96* 97*
    L: 98* 99* 100* 105*

Show the effects of inserting 71* and 93*. You may draw over the figure, or redraw the labeled nodes (R, D ... L) below.

D: 36, 42, 51, 56
H: 51*, 52*
H': 56*, 60*, 71*
J: 81*, 82*, 93*
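The leaf-level steps behind the Question 19 answer can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a full B+ tree: the leaf capacity of 4 is inferred from the fullest leaves in the figure, the split convention (smaller half stays left, first key of the right leaf is copied up) is chosen because it reproduces the answer above, and `insert_into_leaf` is a hypothetical helper name.

```python
import bisect

LEAF_CAPACITY = 4  # assumption: inferred from the fullest leaves in the figure

def insert_into_leaf(leaf, key, capacity=LEAF_CAPACITY):
    """Insert key into a sorted leaf.

    Returns (left, right, separator). If the leaf does not overflow,
    right and separator are None. On overflow the leaf is split and the
    first key of the new right leaf is copied up as the separator.
    """
    entries = list(leaf)
    bisect.insort(entries, key)
    if len(entries) <= capacity:
        return entries, None, None
    mid = len(entries) // 2  # 5 entries: 2 stay left, 3 move right
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]

# Inserting 71* overflows leaf H, so it splits and 56 is copied up into D.
h_left, h_right, separator = insert_into_leaf([51, 52, 56, 60], 71)
print(h_left, h_right, separator)  # [51, 52] [56, 60, 71] 56

d_keys = [36, 42, 51]
bisect.insort(d_keys, separator)
print(d_keys)  # [36, 42, 51, 56]

# Inserting 93* fits in leaf J, so no split is needed.
j_left, j_right, j_sep = insert_into_leaf([81, 82], 93)
print(j_left, j_right, j_sep)  # [81, 82, 93] None None
```

Node D has room for the new separator, so the split does not propagate to the root R.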

20. [15pts] Given the following costs:

Page random access time = 5msec
Page sequential read time = 0.05msec
R tuples/page = 20
Cardinality of R = 2000 tuples
S tuples/page = 10
Cardinality of S = 1000 tuples
R and S are sorted on the join key

Which join algorithm (nested loops or merge) should we choose, and why? Assume every page is filled to capacity and that the buffer pool (cache) is 2 pages.

Merge join (one pass through each of the tables)

21. [10pts] Explain briefly where key-value stores offer advantages over relational database systems.

Key-value stores (KVSs) offer benefits when concurrent updates are unlikely to touch the same key and transactions don't touch multiple keys; when the data is not naturally tabular; when queries are not content-related.
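The reasoning behind the Question 20 answer can be made concrete by counting page reads. The sketch below is a back-of-the-envelope comparison, not a definitive cost model: it assumes a page-oriented nested loops join with one buffer page per input, a merge join that never has to rescan for duplicate join keys, and it ignores the cost of writing output. How many reads are random versus sequential depends on disk layout, so both extremes are shown for each algorithm.

```python
RANDOM_MS = 5.0       # page random access time, from the question
SEQUENTIAL_MS = 0.05  # page sequential read time, from the question

R_PAGES = 2000 // 20  # 2000 tuples at 20 tuples/page
S_PAGES = 1000 // 10  # 1000 tuples at 10 tuples/page

def merge_join_page_reads(r_pages, s_pages):
    # Inputs are already sorted on the join key, so each page is read once.
    return r_pages + s_pages

def nested_loops_page_reads(outer_pages, inner_pages):
    # With a 2-page buffer: read each outer page once, and scan the whole
    # inner relation once per outer page.
    return outer_pages + outer_pages * inner_pages

def elapsed_ms(page_reads, ms_per_page):
    # Simplification: every page read is charged the same per-page cost.
    return page_reads * ms_per_page

merge_reads = merge_join_page_reads(R_PAGES, S_PAGES)
nlj_reads = nested_loops_page_reads(R_PAGES, S_PAGES)

print(merge_reads)  # 200
print(nlj_reads)    # 10100

for label, cost in (("all sequential", SEQUENTIAL_MS), ("all random", RANDOM_MS)):
    print(label,
          "merge:", round(elapsed_ms(merge_reads, cost), 2), "ms",
          "nested loops:", round(elapsed_ms(nlj_reads, cost), 2), "ms")
```

Under either per-page cost, merge join reads roughly 50 times fewer pages here, which is consistent with the answer of one pass through each table.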