CSE 344 Final Review. August 16 th

Similar documents
CSE 344 MAY 7 TH EXAM REVIEW

CSE 414 Final Examination. Name: Solution

Goals for Today. CS 133: Databases. Final Exam: Logistics. Why Use a DBMS? Brief overview of course. Course evaluations

Introduction to Data Management. Lecture #26 (Transactions, cont.)

Rajiv GandhiCollegeof Engineering& Technology, Kirumampakkam.Page 1 of 10

CMSC 461 Final Exam Study Guide

CSE344 Final Exam Winter 2017

Administrivia. CS186 Class Wrap-Up. News. News (cont) Top Decision Support DBs. Lessons? (from the survey and this course)

CSE 344 Final Examination

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Administration Naive DBMS CMPT 454 Topics. John Edgar 2

CSE 344 Final Examination

Introduction to Data Management. Lecture #25 (Transactions II)

Distributed Data Analytics Transactions

CSE 530A ACID. Washington University Fall 2013

CSE 344 Final Examination

CSE 344 MARCH 25 TH ISOLATION

Transaction Management: Concurrency Control, part 2

Locking for B+ Trees. Transaction Management: Concurrency Control, part 2. Locking for B+ Trees (contd.) Locking vs. Latching

DATABASE MANAGEMENT SYSTEMS

Database Systems CSE 414

Introduction to Data Management CSE 344

CPSC 421 Database Management Systems. Lecture 19: Physical Database Design Concurrency Control and Recovery

Database Systems CSE 414

Final Review. CS634 May 11, Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

CS Final Exam Review Suggestions

DATABASE TRANSACTIONS. CS121: Relational Databases Fall 2017 Lecture 25

CSE 190D Spring 2017 Final Exam

CSE 444: Database Internals. Lectures Transactions

Database Ph.D. Qualifying Exam Spring 2006

Distributed Data Management Transactions

CSE 190D Database System Implementation

Transactions and Isolation

CSE 562 Final Exam Solutions

Database Management Systems CSEP 544. Lecture 9: Transactions and Recovery

SYED AMMAL ENGINEERING COLLEGE

Introduction to Data Management CSE 344

Database Systems CSE 414

Database Management Systems Paper Solution

Database Systems CSE 414

Database Architectures

A7-R3: INTRODUCTION TO DATABASE MANAGEMENT SYSTEMS

Introduction to Databases, Fall 2005 IT University of Copenhagen. Lecture 10: Transaction processing. November 14, Lecturer: Rasmus Pagh

Transactions. Kathleen Durant PhD Northeastern University CS3200 Lesson 9

Total No. of Questions :09] [Total No. of Pages : 02. II/IV B.Tech. DEGREE EXAMINATIONS, NOV/DEC Second Semester CSE/IT DBMS

Announcements. Motivating Example. Transaction ROLLBACK. Motivating Example. CSE 444: Database Internals. Lab 2 extended until Monday

Huge market -- essentially all high performance databases work this way

CSC 261/461 Database Systems Lecture 20. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

Techno India Batanagar Computer Science and Engineering. Model Questions. Subject Name: Database Management System Subject Code: CS 601

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL

CSE 444: Database Internals. Lectures 13 Transaction Schedules

CSE 344 Final Examination

CSE 190D Spring 2017 Final Exam Answers

Distributed Databases: SQL vs NoSQL

CSE 344 MARCH 21 ST TRANSACTIONS

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Final Review. May 9, 2017

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

Final Review. May 9, 2018 May 11, 2018

Database Tuning and Physical Design: Execution of Transactions

Data Analysis. CPS352: Database Systems. Simon Miner Gordon College Last Revised: 12/13/12

Final Exam Review. Kathleen Durant PhD CS 3200 Northeastern University

CSC 261/461 Database Systems Lecture 21 and 22. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

Midterm 2: CS186, Spring 2015

Introduction to Data Management CSE 344

Introduction to NoSQL

CISC437/637 Database Systems Final Exam

CSE 344 JULY 30 TH DB DESIGN (CH 4)

CS5412: TRANSACTIONS (I)

Intro to Transaction Management

Column Stores vs. Row Stores How Different Are They Really?

Transaction Management Exercises KEY

Topics in Reliable Distributed Systems

Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing

CSE 344 Final Examination

TRANSACTION MANAGEMENT

CSE 414 Database Systems section 10: Final Review. Joseph Xu 6/6/2013

Introduction to Data Management CSE 344

CSE 344 MARCH 9 TH TRANSACTIONS

Motivating Example. Motivating Example. Transaction ROLLBACK. Transactions. CSE 444: Database Internals

Lecture 12. Lecture 12: The IO Model & External Sorting

CSE 344 MARCH 5 TH TRANSACTIONS

Administrivia. Physical Database Design. Review: Optimization Strategies. Review: Query Optimization. Review: Database Design

Introduction to Data Management CSE 344

5/17/17. Announcements. Review: Transactions. Outline. Review: TXNs in SQL. Review: ACID. Database Systems CSE 414.

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

CISC437/637 Database Systems Final Exam

Final Exam CSE232, Spring 97

Physical DB design and tuning: outline

Introduction to Data Management. Lecture #17 (Physical DB Design!)

Query Processing. Lecture #10. Andy Pavlo Computer Science Carnegie Mellon Univ. Database Systems / Fall 2018

CS 445 Introduction to Database Systems

Database Systems CSE 414

Department of Information Technology B.E/B.Tech : CSE/IT Regulation: 2013 Sub. Code / Sub. Name : CS6302 Database Management Systems

Advanced Data Management Technologies Written Exam

10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

COURSE 1. Database Management Systems

L i (A) = transaction T i acquires lock for element A. U i (A) = transaction T i releases lock for element A

Questions about the contents of the final section of the course of Advanced Databases. Version 0.3 of 28/05/2018.

6.830 Lecture PS1 Due Next Time (Tuesday!) Lab 1 Out end of week start early!

Transcription:

CSE 344 Final Review August 16 th

Final In class on Friday One sheet of notes, front and back cost formulas also provided Practice exam on web site Good luck!

Primary Topics Parallel DBs parallel join algorithms parallel query optimization MapReduce DB Design E/R diagrams constraints normalization

Primary Topics Transactions ACID properties serial vs serializable vs conflict serializable (strict) 2PL locking First-half material also fair game!

Review...

Motivation important uses of databases big data almost every company has too much for 1 machine very important data critical that the data is not lost or corrupted complex queries hard to answer them efficiently modern DBs solve these problems

What is a database? collection of related data provides languages to describe, query, and update data checks that new data satisfies constraints allows you to change the schema very common & hard problem easy languages for querying data also efficient implementation provides high levels of reliability hides many (physical) details

SQL (everywhere) best language we have easy for non-programmers to learn can express almost any query you have works well for parallel DBs just as well

Relational Algebra all of SQL simplifies to just a few operations union, intersect, select, projection, join, & aggregation not as useful for users harder than writing SQL have to say how to implement query very useful for DB implementers similar to intermediate language of compiler implementers call RA query plans

Datalog very different way of writing relation query closer to logic explanation for why it can answer any question (connections to AI) only slightly more expressive than SQL / RA only adds recursion! w/out recursion can convert back and forth need to write safe rules unsafe rules generate infinite results

NoSQL relational data is not natural lists are natural to users but not 1NF JSON is more natural early systems: key-val pairs & extensible records huge scale for OLTP workloads BUT reduced functionality limited data model no joins (ouch!) modern systems have JSON + full functionality

SQL++ supports querying non-1nf data all you need is un-nesting e.g., world x, x.rivers y pretty easy to add to other systems also allows working with lists directly both input and output remove restriction on subquery location can have >1 rows in result in SELECT clause more convenient for users

Internals logical plans: RA physical plans: choice of op implementations e.g., join algorithms pipelining allows tuples to go to the user more quickly don t need to wait for all tuples to be ready no need to store intermediate results no disk cost for selection & projection

Internals cont. indexing clustered vs unclustered hash vs B+ tree vs other disks are unbelievably slow hard disks are mechanical devices reading 1-2% is as slow as reading whole file for speed: store in memory of many machines becoming increasingly common

Parallel DBs shared memory & shared disk work with smaller amounts of data BUT modern systems are shared nothing workloads: OLAP vs OLTP OLAP is big read-only queries OLTP is many read/write queries, each accessing only a small amount of data can t support both at scale! best solution is to execute OLAP on old data (multi-version)

Parallel DBs cont. easy solutions for OLAP & OLTP are different partitioning & replication we want both! need to use partitioning (for OLTP) then figure out how to execute OLAP queries on partitioned data...

Parallel query plans can still use cost-based optimization could be disk cost or network cost (only network if no disk involved) only new operation is reshuffle! all other work on a single machine can use in-memory operations at no cost cost will generally be worse with more reshuffling where have I seen reshuffle before...

Parallel query plans cont. map reduce steps: input > map > shuffle > reduce > output you provide map and reduce parts framework provides (re)shuffle all steps use (key,value) pairs as data format framework only looks at key value is opaque framework handles many low-level details restarts workers that fail reassigns finished workers to straggling jobs

DB Design Getting the data right is half(?) the problem E/R is a great way to communicate design higher level than SQL, a picture! some new complexities multi-way relationships subclasses weak entities

DB Design cont. People really do make schema design errors Normalization is a great way to find them! BCNF is the standard 4NF might also be useful Functional dependencies are constraints

DB Design cont. Constraints are really important how can you ensure they are always satisfied without identifying them? DB automatically checks many constraints even auto-fixes broke FK constraints cascade, set null, etc. Rest are up to you only need to check before committing

DB Performance Main choices that affect performance indexes materialized views Indexing includes choice of clustering e.g., if multiple keys, which is primary? primary key becomes clustering indexes improve queries but slow updates also create more lock contention queries are most important though...

DB Performance cont. Views are tables computed from others you give the query used to compute them can then refer to the view by name without giving the query ways to implement: re-compute on demand just substitute the query store and updated called materialized views

Transactions ACID properties let us write correct apps (you saw this in HW8) other models are difficult consistency and durability are easy consistency: check constraints before commit durability; write to (multiple) disks ideally, geographically-separated disks atomicity and isolation are harder locking or MVCC provide these

Locking serial schedules are isolated by definition serializable schedules are more general just as good: identical behavior allows parallelism conflict serializability special type of serializable schedules can prove serializability by simple swaps

Locking cont. 2PL ensures conflict serializability strict 2PL gives atomicity & isolation trouble for 2PL is rollback (atomicity) phantom tuples are still an issue easiest solution: lock part of the index can also use predicate locks (hard) ways to fine tune performance lock modes lock granularity admission control sometimes actually helps!

Locking tradeoff between correctness & performance saw how to get ACID but at substantial cost DBs default to non-acid SQL Server allows phantoms in that case, correctness is up to you need to think through whether phantoms will break any of your apps this is very hard! especially as the code is changing especially with many programmers

Final notes Study transactions: CS, locking DB design: E/R, BCNF parallel DBs: network cost estimation see Wednesday lecture Briefly review other materials