Big Data Processing Technologies. Chentao Wu Associate Professor Dept. of Computer Science and Engineering

Similar documents
Lecture 2: Introduction to SQL

Introduction to SQL Part 1 By Michael Hahsler based on slides for CS145 Introduction to Databases (Stanford)

Lecture 3: SQL Part II

DS Introduction to SQL Part 2 Multi-table Queries. By Michael Hahsler based on slides for CS145 Introduction to Databases (Stanford)

DS Introduction to SQL Part 1 Single-Table Queries. By Michael Hahsler based on slides for CS145 Introduction to Databases (Stanford)

Lectures 2&3: Introduction to SQL

SQL. Assit.Prof Dr. Anantakul Intarapadung

Introduction to Database Systems CSE 444

Polls on Piazza. Open for 2 days Outline today: Next time: "witnesses" (traditionally students find this topic the most difficult)

CMPT 354: Database System I. Lecture 3. SQL Basics

SQL. Structured Query Language

INF 212 ANALYSIS OF PROG. LANGS SQL AND SPREADSHEETS. Instructors: Crista Lopes Copyright Instructors.

Agenda. Discussion. Database/Relation/Tuple. Schema. Instance. CSE 444: Database Internals. Review Relational Model

Database Systems CSE 303. Lecture 02

Database Systems CSE 303. Lecture 02

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2006 Lecture 3 - Relational Model

CSE 344 JANUARY 5 TH INTRO TO THE RELATIONAL DATABASE

Before we talk about Big Data. Lets talk about not-so-big data

Introduction to Databases CSE 414. Lecture 2: Data Models

COMP 430 Intro. to Database Systems

MTAT Introduction to Databases

Announcements. Agenda. Database/Relation/Tuple. Discussion. Schema. CSE 444: Database Internals. Room change: Lab 1 part 1 is due on Monday

Announcements. Agenda. Database/Relation/Tuple. Schema. Discussion. CSE 444: Database Internals

Lecture 3 Additional Slides. CSE 344, Winter 2014 Sudeepa Roy

Introduction to Database Systems CSE 414. Lecture 3: SQL Basics

Introduction to Data Management CSE 344. Lecture 2: Data Models

CSE 344 JANUARY 8 TH SQLITE AND JOINS

COMP 430 Intro. to Database Systems

Introduction to Relational Databases

Principles of Database Systems CSE 544. Lecture #2 SQL The Complete Story

Introduction to Database Systems CSE 414. Lecture 3: SQL Basics

CS 564 Midterm Review

Data Informatics. Seon Ho Kim, Ph.D.

Keys, SQL, and Views CMPSCI 645

Big Data Processing Technologies. Chentao Wu Associate Professor Dept. of Computer Science and Engineering

Introduction to Data Management CSE 414

Announcements. Using Electronics in Class. Review. Staff Instructor: Alvin Cheung Office hour on Wednesdays, 1-2pm. Class Overview

Introduction to Database Systems. The Relational Data Model

Introduction to Database Systems. The Relational Data Model. Werner Nutt

Database Systems CSE 414

Operating systems fundamentals - B07

The Relational Data Model

CSE 544 Principles of Database Management Systems

CSC 261/461 Database Systems Lecture 13. Fall 2017

Lecture 16. The Relational Model

Announcements. Multi-column Keys. Multi-column Keys (3) Multi-column Keys. Multi-column Keys (2) Introduction to Data Management CSE 414

CSE 544 Principles of Database Management Systems

Announcements. Multi-column Keys. Multi-column Keys. Multi-column Keys (3) Multi-column Keys (2) Introduction to Data Management CSE 414

The Relational Model. Chapter 3. Comp 521 Files and Databases Fall

The Entity-Relationship Model (ER Model) - Part 2

The Relational Model. Chapter 3. Comp 521 Files and Databases Fall

CSC 261/461 Database Systems Lecture 5. Fall 2017

Introduction to SQL Part 2 by Michael Hahsler Based on slides for CS145 Introduction to Databases (Stanford)

Review. The Relational Model. Glossary. Review. Data Models. Why Study the Relational Model? Why use a DBMS? OS provides RAM and disk

Lecture 5. Lecture 5: The E/R Model

Announcements (September 14) SQL: Part I SQL. Creating and dropping tables. Basic queries: SFW statement. Example: reading a table

The Relational Model. Week 2

CMPT 354: Database System I. Lecture 2. Relational Model

The Relational Model. Roadmap. Relational Database: Definitions. Why Study the Relational Model? Relational database: a set of relations

CS275 Intro to Databases

SQL: Data Definition Language

Principles of Database Systems CSE 544. Lecture #1 Introduction and SQL

SQL DATA DEFINITION LANGUAGE

Data Modeling. Yanlei Diao UMass Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke

SQL DATA DEFINITION LANGUAGE

Introduction to Data Management. Lecture #4 (E-R Relational Translation)

SQL: Part III. Announcements. Constraints. CPS 216 Advanced Database Systems

Lecture 4. Lecture 4: The E/R Model

Administrivia. The Relational Model. Review. Review. Review. Some useful terms

Lecture 2 SQL. Instructor: Sudeepa Roy. CompSci 516: Data Intensive Computing Systems

Principles of Database Systems CSE 544. Lecture #1 Introduction and SQL

Relational Databases BORROWED WITH MINOR ADAPTATION FROM PROF. CHRISTOS FALOUTSOS, CMU /615

The Relational Model. Why Study the Relational Model? Relational Database: Definitions

Database Systems ( 資料庫系統 )

CPSC 504 Background (aka, all you need to know about databases for this course in two lectures) Rachel Pottinger January 8 and 12, 2015

Review: Where have we been?

Relational data model

Database Applications (15-415)

The Relational Model of Data (ii)

Introduction to Data Management CSE 344

The Relational Model. Relational Data Model Relational Query Language (DDL + DML) Integrity Constraints (IC)

The Relational Model. Outline. Why Study the Relational Model? Faloutsos SCS object-relational model

Chapter 6 The database Language SQL as a tutorial

SQL: The Query Language Part 1. Relational Query Languages

The Relational Model Constraints and SQL DDL

The Relational Model. Chapter 3

What s a database system? Review of Basic Database Concepts. Entity-relationship (E/R) diagram. Two important questions. Physical data independence

The Relational Model

L07: SQL: Advanced & Practice. CS3200 Database design (sp18 s2) 1/11/2018

The Relational Model. Chapter 3. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

SQL DATA DEFINITION LANGUAGE

The Relational Data Model. Data Model

CS143: Relational Model

SQLite, Firefox, and our small IMDB movie database. CS3200 Database design (sp18 s2) Version 1/17/2018

SQL: Part II. Announcements (September 18) Incomplete information. CPS 116 Introduction to Database Systems. Homework #1 due today (11:59pm)

Administrative notes. Levels of Abstraction. Overview of the next two classes. Schema and Instances. Conceptual Database Design

The Relational Model 2. Week 3

Announcements (September 18) SQL: Part II. Solution 1. Incomplete information. Solution 3? Solution 2. Homework #1 due today (11:59pm)

Announcements. Outline UNIQUE. (Inner) joins. (Inner) Joins. Database Systems CSE 414. WQ1 is posted to gradebook double check scores

Database Technology Introduction. Heiko Paulheim

Transcription:

Big Data Processing Technologies Chentao Wu Associate Professor Dept. of Computer Science and Engineering wuct@cs.sjtu.edu.cn

Schedule (1) Storage system part (first eight weeks) lec1: Introduction on big data and cloud computing Iec2: Introduction on data storage lec3: Data reliability (Replication/Archive/EC) lec4: Data consistency problem lec5: Block level storage and file storage lec6: Object-based storage lec7: Distributed file system lec8: Metadata management

Schedule (2) Reading & Project part (middle two/three weeks) Database part (last five weeks) lec9: Introduction on database lec10: Relational database (SQL) lec11: Relational database (NoSQL) lec12: Distributed database lec13: Main memory database

Collaborators

1 SQL Introduction & Definitions

What you will learn about in this section 1. What is SQL? 2. Basic schema definitions 3. Keys & constraints intro 4. ACTIVITY: CREATE TABLE statements

SQL Motivation Dark times 5 years ago. Are databases dead? Now, as before: everyone sells SQL Pig, Hive, Impala Not-Yet-SQL?

SQL Introduction SQL is a standard language for querying and manipulating data SQL is a very high-level programming language This works because it is optimized well! Many standards out there: ANSI SQL, SQL92 (a.k.a. SQL2), SQL99 (a.k.a. SQL3),. Vendors support various subsets NB: Probably the world s most successful parallel programming language (multicore?) SQL stands for Structured Query Language

SQL is a Data Definition Language (DDL) Define relational schemata Create/alter/delete tables and their attributes Data Manipulation Language (DML) Insert/delete/modify tuples in tables Query one or more tables discussed next!

Tables in SQL Product PName Price Manufacturer Gizmo $19.99 GizmoWorks Powergizmo $29.99 GizmoWorks SingleTouch $149.99 Canon MultiTouch $203.99 Hitachi A relation or table is a multiset of tuples having the attributes specified by the schema Let s break this definition down

Tables in SQL Product PName Price Manufacturer Gizmo $19.99 GizmoWorks Powergizmo $29.99 GizmoWorks SingleTouch $149.99 Canon MultiTouch $203.99 Hitachi A multiset is an unordered list (or: a set with multiple duplicate instances allowed) List: [1, 1, 2, 3] Set: {1, 2, 3} Multiset: {1, 1, 2, 3} i.e. no next(), etc. methods!

Tables in SQL Product PName Price Manufacturer Gizmo $19.99 GizmoWorks Powergizmo $29.99 GizmoWorks An attribute (or column) is a typed data entry present in each tuple in the relation SingleTouch $149.99 Canon MultiTouch $203.99 Hitachi NB: Attributes must have an atomic type in standard SQL, i.e. not a list, set, etc.

Tables in SQL Product PName Price Manufacturer Gizmo $19.99 GizmoWorks Powergizmo $29.99 GizmoWorks SingleTouch $149.99 Canon MultiTouch $203.99 Hitachi A tuple or row is a single entry in the table having the attributes specified by the schema Also referred to sometimes as a record

Tables in SQL Product PName Price Manufacturer Gizmo $19.99 GizmoWorks Powergizmo $29.99 GizmoWorks SingleTouch $149.99 Canon The number of tuples is the cardinality of the relation MultiTouch $203.99 Hitachi The number of attributes is the arity of the relation

Data Types in SQL Atomic types: Characters: CHAR(20), VARCHAR(50) Numbers: INT, BIGINT, SMALLINT, FLOAT Others: MONEY, DATETIME, Every attribute must have an atomic type Hence tables are flat

Table Schemas The schema of a table is the table name, its attributes, and their types: Product(Pname: string, Price: float, Category: string, Manufacturer: string) A key is an attribute whose values are unique; we underline a key Product(Pname: string, Price: float, Category: string, Manufacturer: string)

Key constraints A key is a minimal subset of attributes that acts as a unique identifier for tuples in a relation A key is an implicit constraint on which tuples can be in the relation i.e. if two tuples agree on the values of the key, then they must be the same tuple! Students(sid:string, name:string, gpa: float) 1. Which would you select as a key? 2. Is a key always guaranteed to exist? 3. Can we have more than one key?

NULL and NOT NULL To say don t know the value we use NULL NULL has (sometimes painful) semantics, more detail later Students(sid:string, name:string, gpa: float) sid name gpa 123 Bob 3.9 143 Jim NULL Say, Jim just enrolled in his first class. In SQL, we may constrain a column to be NOT NULL, e.g., name in this table

General Constraints We can actually specify arbitrary assertions E.g. There cannot be 25 people in the DB class In practice, we don t specify many such constraints. Why? Performance! Whenever we do something ugly (or avoid doing something convenient) it s for the sake of performance

Summary of Schema Information Schema and Constraints are how databases understand the semantics (meaning) of data They are also useful for optimization SQL supports general constraints: Keys and foreign keys are most important We ll give you a chance to write the others

2 Single-table queries

SQL Query Basic form (there are many many more bells and whistles) SELECT <attributes> FROM <one or more relations> WHERE <conditions> Call this a SFW query.

Simple SQL Query: Selection Selection is the operation of filtering a relation s tuples on some condition PName Price Category Manufacturer Gizmo $19.99 Gadgets GizmoWorks Powergizmo $29.99 Gadgets GizmoWorks SingleTouch $149.99 Photography Canon MultiTouch $203.99 Household Hitachi SELECT * FROM Product WHERE Category = Gadgets PName Price Category Manufacturer Gizmo $19.99 Gadgets GizmoWorks Powergizmo $29.99 Gadgets GizmoWorks

Simple SQL Query: Projection Projection is the operation of producing an output table with tuples that have a subset of their prior attributes PName Price Category Manufacturer Gizmo $19.99 Gadgets GizmoWorks Powergizmo $29.99 Gadgets GizmoWorks SingleTouch $149.99 Photography Canon MultiTouch $203.99 Household Hitachi SELECT Pname, Price, Manufacturer FROM Product WHERE Category = Gadgets PName Price Manufacturer Gizmo $19.99 GizmoWorks Powergizmo $29.99 GizmoWorks 24

Notation Input schema Product(PName, Price, Category, Manfacturer) SELECT Pname, Price, Manufacturer FROM Product WHERE Category = Gadgets Output schema Answer(PName, Price, Manfacturer)

A Few Details SQL commands are case insensitive: Same: SELECT, Select, select Same: Product, product Values are not: Different: Seattle, seattle Use single quotes for constants: abc - yes abc - no

LIKE: Simple String Pattern Matching SELECT * FROM Products WHERE PName LIKE %gizmo% s LIKE p: pattern matching on strings p may contain two special symbols: % = any sequence of characters _ = any single character

DISTINCT: Eliminating Duplicates SELECT DISTINCT Category FROM Product Versus SELECT Category FROM Product Category Gadgets Photography Household Category Gadgets Gadgets Photography Household

ORDER BY: Sorting the Results SELECT PName, Price, Manufacturer FROM Product WHERE Category= gizmo AND Price > 50 ORDER BY Price, PName Ties are broken by the second attribute on the ORDER BY list, etc. Ordering is ascending, unless you specify the DESC keyword.

3 Multi-table queries

Foreign Key constraints Suppose we have the following schema: Students(sid: string, name: string, gpa: float) Enrolled(student_id: string, cid: string, grade: string) And we want to impose the following constraint: Only bona fide students may enroll in courses i.e. a student must appear in the Students table to enroll in a class Students Enrolled sid name gpa 101 Bob 3.2 123 Mary 3.8 student_id cid grade 123 564 A 123 537 A+ student_id alone is not a key- what is? We say that student_id is a foreign key that refers to Students

Declaring Foreign Keys Students(sid: string, name: string, gpa: float) Enrolled(student_id: string, cid: string, grade: string) CREATE TABLE Enrolled( student_id CHAR(20), cid CHAR(20), grade CHAR(10), PRIMARY KEY (student_id, cid), FOREIGN KEY (student_id) REFERENCES Students(sid) )

Foreign Keys and update operations Students(sid: string, name: string, gpa: float) Enrolled(student_id: string, cid: string, grade: string) What if we insert a tuple into Enrolled, but no corresponding student? INSERT is rejected (foreign keys are constraints)! What if we delete a student? 1. Disallow the delete 2. Remove all of the courses for that student 3. SQL allows a third via NULL (not yet covered) DBA chooses (syntax in the book)

Keys and Foreign Keys Company CName StockPrice Country GizmoWork s Product 25 USA Canon 65 Japan Hitachi 15 Japan PName Price Category Manufacturer Gizmo $19.99 Gadgets GizmoWorks Powergizmo $29.99 Gadgets GizmoWorks SingleTouch $149.99 Photography Canon MultiTouch $203.99 Household Hitachi What is a foreign key vs. a key here?

Joins Product(PName, Price, Category, Manufacturer) Company(CName, StockPrice, Country) Ex: Find all products under $200 manufactured in Japan; return their names and prices. SELECT PName, Price FROM Product, Company WHERE Manufacturer = CName AND Country= Japan AND Price <= 200 Note: we will often omit attribute types in schema definitions for brevity, but assume attributes are always atomic types

Joins Product(PName, Price, Category, Manufacturer) Company(CName, StockPrice, Country) Ex: Find all products under $200 manufactured in Japan; return their names and prices. SELECT PName, Price FROM Product, Company WHERE Manufacturer = CName AND Country= Japan AND Price <= 200 A join between tables returns all unique combinations of their tuples which meet some specified join condition

Joins Product(PName, Price, Category, Manufacturer) Company(CName, StockPrice, Country) Several equivalent ways to write a basic join in SQL: SELECT PName, Price FROM Product, Company WHERE Manufacturer = CName AND Country= Japan AND Price <= 200 SELECT PName, Price FROM Product JOIN Company ON Manufacturer = Cname AND Country= Japan WHERE Price <= 200 A few more later on

Joins Product PName Price Category Manuf Gizmo $19 Gadgets GWorks Powergizmo $29 Gadgets GWorks SingleTouch $149 Photography Canon MultiTouch $203 Household Hitachi Company Cname Stock Country GWorks 25 USA Canon 65 Japan Hitachi 15 Japan SELECT PName, Price FROM Product, Company WHERE Manufacturer = CName AND Country= Japan AND Price <= 200 PName Price SingleTouch $149.99

Tuple Variable Ambiguity in Multi-Table Person(name, address, worksfor) Company(name, address) SELECT DISTINCT name, address FROM Person, Company WHERE worksfor = name Which address does this refer to? Which name s??

Tuple Variable Ambiguity in Multi-Table Person(name, address, worksfor) Company(name, address) Both equivalent ways to resolve variable ambiguity SELECT DISTINCT Person.name, Person.address FROM Person, Company WHERE Person.worksfor = Company.name SELECT DISTINCT p.name, p.address FROM Person p, Company c WHERE p.worksfor = c.name 40

Meaning (Semantics) of SQL Queries SELECT x 1.a 1, x 1.a 2,, x n.a k FROM R 1 AS x 1, R 2 AS x 2,, R n AS x n WHERE Conditions(x 1,, x n ) Almost never the fastest way to compute it! Answer = {} for x 1 in R 1 do for x 2 in R 2 do.. for x n in R n do if Conditions(x 1,, x n ) then Answer = Answer {(x 1.a 1, x 1.a 2,, x n.a k )} return Answer Note: this is a multiset union

An example of SQL semantics A 1 3 B C 2 3 3 4 3 5 Cross Product SELECT R.A FROM R, S WHERE R.A = S.B A B C 1 2 3 1 3 4 1 3 5 3 2 3 3 3 4 3 3 5 Output Apply Selections / Conditions A 3 3 A B C 3 3 4 3 3 5 Apply Projection

Note the semantics of a join SELECT R.A FROM R, S WHERE R.A = S.B 1. Take cross product: X = R S 2. Apply selections / conditions: Y = r, s X r. A == r. B} 3. Apply projections to get final output: Z = (y. A, ) for y Y Recall: Cross product (A X B) is the set of all unique tuples in A,B Ex: {a,b,c} X {1,2} = {(a,1), (a,2), (b,1), (b,2), (c,1), (c,2)} = Filtering! = Returning only some attributes Remembering this order is critical to understanding the output of certain queries (see later on ) 43

Note: we say semantics not execution order The preceding slides show what a join means Not actually how the DBMS executes it under the covers

A Subtlety about Joins Product(PName, Price, Category, Manufacturer) Company(CName, StockPrice, Country) Find all countries that manufacture some product in the Gadgets category. SELECT Country FROM Product, Company WHERE Manufacturer=CName AND Category= Gadgets

A subtlety about Joins Product PName Price Category Manuf Gizmo $19 Gadgets GWorks Powergizmo $29 Gadgets GWorks SingleTouch $149 Photography Canon Company Cname Stock Country GWorks 25 USA Canon 65 Japan Hitachi 15 Japan MultiTouch $203 Household Hitachi SELECT Country FROM Product, Company WHERE Manufacturer=Cname AND Category= Gadgets What is the problem? What s the solution? Country??

An Unintuitive Query SELECT DISTINCT R.A FROM R, S, T WHERE R.A=S.A OR R.A=T.A What does it compute?

An Unintuitive Query SELECT DISTINCT R.A FROM R, S, T WHERE R.A=S.A OR R.A=T.A S T R But what if S = f? Computes R (S T) Go back to the semantics!

An Unintuitive Query Recall the semantics! 1. Take cross-product 2. Apply selections / conditions 3. Apply projection SELECT DISTINCT R.A FROM R, S, T WHERE R.A=S.A OR R.A=T.A If S = {}, then the cross product of R, S, T = {}, and the query result = {}! Must consider semantics here. Are there more explicit way to do set operations like this?

Summary Relational model has well-defined query semantics Modern SQL extends pure relational model (some extra goodies for duplicate row, non-atomic types more in next lecture) Typically, many ways to write a query DBMS figures out a fast way to execute a query, regardless of how it is written.

Support SQLFiddle

References Leis V. Query Processing and Optimization in Modern Database Systems[J]. Datenbanksysteme für Business, Technologie und Web (BTW 2017), 2017. Azqueta-Alzúaz A, Patiño-Martinez M, Brondino I, et al. Massive Data Load on Distributed Database Systems over HBase[C]//Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE Press, 2017: 776-779. Gessert F, Wingerath W, Friedrich S, et al. NoSQL database systems: a survey and decision guidance[j]. Computer Science-Research and Development, 2017, 32(3-4): 353-365.

Thank you!