Lecture 4. Relational Algebra

By now, we know that (a) all data in a relational DB is stored in tables, (b) there are several tables in each DB (because of normalization), and (c) the information we need is often divided between different tables. Therefore we need a logical method to combine data in different tables, to select some subset of the data, etc. Here, we learn how to do so using the algebra of relational tables.

What is an algebra?

An algebra is a formal system for manipulating symbols in order to deal with general statements of relations. Algebraic systems are abstract, in the sense that a system only provides a set of symbols, a set of rules to construct valid expressions, and a set of rules by which expressions and symbols can be manipulated.

The algebra we learn in high school is the algebra of real numbers. All symbols for variables represent real numbers, and all manipulations of symbols give relations between real numbers. Later we learnt the algebra of complex numbers, which works in much the same way. Now we learn the algebra of relations, where every relational table is a set of tuples, and each tuple is an ordered sequence of attribute values.

Just as real algebra has operations (+, -, x, /), Relational Algebra (RA) also has operations.

The main RA operations: Select, Project, Join, Divide, together with the set operations Union, Intersection and Difference.

Notice that the result of any operation in real algebra is also a real number. Thus, for real numbers x, y: (x + y) is real. So is (x - y). So is (x / y) for all values where / is defined. [When is it not defined?]

Similarly, in RA, whenever an operation is defined, the result of that operation is a relational table.

Why is this important? Because it allows us to combine a sequence of RA operations in arbitrary order! [Why is it important to combine arbitrary sequences of RA operations?]
Now we shall see that extraction/modification of ANY information in a relational schema can be done by some combination of RA operations. We shall demonstrate with the following tables:

EMPLOYEE
Name      ID   SupervisorID  DeptNo
John      111  222           5
Frankie   222  777           4
Alice     333  444           4
Jennifer  444  777           4
William   555  222           5
Joyce     666  222           5
James     777  222           5
John      888  null          1

WORKS_ON
IDno  ProjNo  Hours
111   1       2.5
111   2       1.5
555   3       2.5
666   1       3.5
666   2       3.5
222   2       2.5
222   3       5.5
222   4       5.5
222   5       1.5

DEPENDENT
EmpID  DepName  Relationship
222    Jack     Son
222    Jill     Wife
222    John     Son
444    Ted      Son
111    Mike     Son
111    Anita    Daughter
Select Operation

A relational table is composed of an unordered set of tuples (rows). The SELECT operation allows us to select a SUBSET of the tuples of a relational table, namely those that satisfy some specified conditions.

SELECT [conditions] ( TABLE )

RESULT = SELECT [DeptNo = 5] ( EMPLOYEE )

RESULT
Name     ID   SupervisorID  DeptNo
John     111  222           5
William  555  222           5
Joyce    666  222           5
James    777  222           5

All tuples in relational table EMPLOYEE for which the condition [DeptNo = 5] was TRUE were placed in RESULT.

NOTES:
1. SELECT looks at each tuple of its argument and evaluates the specified conditions. If the conditions are true, then that tuple is placed in RESULT; otherwise the tuple is rejected.
2. The result of the SELECT operation is also a Relational Table! In fact, it is a subset of the tuples of its argument.
3. Selection conditions must always evaluate to a logical value: TRUE or FALSE. Hence all conditions are EXPRESSIONS, connected by LOGICAL operators (AND, OR, NOT).

RESULT = SELECT [ (DeptNo != 4) AND ( (ID = 222) OR (ID = 111) ) ] ( EMPLOYEE )

RESULT
Name  ID   SupervisorID  DeptNo
John  111  222           5
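The SELECT operation described above can be sketched in a few lines of Python. This is only an illustration, not part of the RA notation: the list-of-dicts representation of a table and the function name select are assumptions made for this example.

```python
# Assumed representation: a table is a list of dicts, one dict per tuple.
COLUMNS = ("Name", "ID", "SupervisorID", "DeptNo")
EMPLOYEE = [
    dict(zip(COLUMNS, row))
    for row in [
        ("John", 111, 222, 5),
        ("Frankie", 222, 777, 4),
        ("Alice", 333, 444, 4),
        ("Jennifer", 444, 777, 4),
        ("William", 555, 222, 5),
        ("Joyce", 666, 222, 5),
        ("James", 777, 222, 5),
        ("John", 888, None, 1),
    ]
]


def select(condition, table):
    """Hypothetical helper: keep the tuples for which condition(row) is true."""
    return [row for row in table if condition(row)]


# SELECT [DeptNo = 5] ( EMPLOYEE )
dept5 = select(lambda r: r["DeptNo"] == 5, EMPLOYEE)
print([r["ID"] for r in dept5])

# SELECT [ (DeptNo != 4) AND ( (ID = 222) OR (ID = 111) ) ] ( EMPLOYEE )
combined = select(
    lambda r: r["DeptNo"] != 4 and (r["ID"] == 222 or r["ID"] == 111),
    EMPLOYEE,
)
print([(r["Name"], r["ID"]) for r in combined])
```

Because select returns a table in the same representation as its argument, its output can be fed to any other operation, which mirrors the closure property of RA.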
Project Operation

The SELECT operation outputs a subset of the rows of a Relational Table. In contrast, the PROJECT operation outputs a subset of the columns of the Relational Table.

PROJECT [attribute-list] ( TABLE )

RESULT = PROJECT [Name, ID] ( EMPLOYEE )

RESULT
Name      ID
John      111
Frankie   222
Alice     333
Jennifer  444
William   555
Joyce     666
James     777
John      888

NOTES:
1. Once again, the result of the PROJECT operation is another Relational Table.
2. Since a Relational Table is a set of tuples, PROJECT will not output identical tuples twice! In the example below, the two tuples <222, Son> (for Jack and for John) collapse into a single tuple.

RESULT = PROJECT [EmpID, Relationship] ( DEPENDENT )

RESULT
EmpID  Relationship
222    Son
222    Wife
444    Son
111    Son
111    Daughter
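The duplicate elimination performed by PROJECT can be made concrete with a short Python sketch. As an assumption for illustration, a table is a list of dicts and the helper is called project; neither is part of the RA notation.

```python
# Assumed representation: a table is a list of dicts, one dict per tuple.
COLUMNS = ("EmpID", "DepName", "Relationship")
DEPENDENT = [
    dict(zip(COLUMNS, row))
    for row in [
        (222, "Jack", "Son"),
        (222, "Jill", "Wife"),
        (222, "John", "Son"),
        (444, "Ted", "Son"),
        (111, "Mike", "Son"),
        (111, "Anita", "Daughter"),
    ]
]


def project(attributes, table):
    """Hypothetical helper: keep only the listed attributes, dropping duplicates."""
    seen = set()
    result = []
    for row in table:
        key = tuple(row[a] for a in attributes)
        if key not in seen:  # a relation is a set: emit each tuple once
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


# PROJECT [EmpID, Relationship] ( DEPENDENT )
relationships = project(["EmpID", "Relationship"], DEPENDENT)
for row in relationships:
    print(row["EmpID"], row["Relationship"])
```

DEPENDENT has six tuples, but the projection has only five, because <222, Son> occurs for both Jack and John.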
Combinations of RA operations

Similar to real algebra, RA operations can be used in arbitrary combinations:

RESULT = PROJECT [Name, ID] ( SELECT [SupervisorID = 222] ( EMPLOYEE ) )

This gets evaluated in two steps: first, the SELECT returns an intermediate table RESULT1, and then the PROJECT is applied to RESULT1 and returns the final RESULT.

Step 1:

RESULT1
Name     ID   SupervisorID  DeptNo
John     111  222           5
William  555  222           5
Joyce    666  222           5
James    777  222           5

and then Step 2:

RESULT
Name     ID
John     111
William  555
Joyce    666
James    777

Join Operations

The JOIN operation combines the information in two Relational Tables.

JOIN [conditions] ( TABLE1, TABLE2 )

In the above, TABLE1 and TABLE2 can possibly be the same table. JOIN forms combinations of the tuples of TABLE1 and TABLE2. The output is another table, with all the attributes of TABLE1 and all the attributes of TABLE2.
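The two-step evaluation of the combined PROJECT ( SELECT ( ... ) ) expression above can be sketched in Python as ordinary function composition. The helpers select and project and the list-of-dicts table representation are assumptions made for this illustration only.

```python
# Assumed representation: a table is a list of dicts, one dict per tuple.
COLUMNS = ("Name", "ID", "SupervisorID", "DeptNo")
EMPLOYEE = [
    dict(zip(COLUMNS, row))
    for row in [
        ("John", 111, 222, 5),
        ("Frankie", 222, 777, 4),
        ("Alice", 333, 444, 4),
        ("Jennifer", 444, 777, 4),
        ("William", 555, 222, 5),
        ("Joyce", 666, 222, 5),
        ("James", 777, 222, 5),
        ("John", 888, None, 1),
    ]
]


def select(condition, table):
    """Hypothetical helper: keep the tuples for which condition(row) is true."""
    return [row for row in table if condition(row)]


def project(attributes, table):
    """Hypothetical helper: keep only the listed attributes, dropping duplicates."""
    seen = set()
    result = []
    for row in table:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


# Step 1: the inner SELECT produces an intermediate table.
result1 = select(lambda r: r["SupervisorID"] == 222, EMPLOYEE)
# Step 2: the outer PROJECT is applied to that intermediate table.
result = project(["Name", "ID"], result1)

# The same thing written as one nested expression, as in the RA notation.
nested = project(
    ["Name", "ID"], select(lambda r: r["SupervisorID"] == 222, EMPLOYEE)
)
print(result == nested)
```

Nesting works only because every helper both accepts and returns a table, which is the point made earlier about RA results always being relational tables.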
How it works:

Since Relational Tables are sets of tuples, first form the CARTESIAN PRODUCT of the two tables. If TABLE1 has A rows and B columns, and TABLE2 has N rows and M columns, then the Cartesian product will have A*N tuples, and each tuple will have (B + M) attributes. The result of the JOIN contains every tuple of the Cartesian product for which [conditions] evaluates to TRUE.

RESULT = JOIN [ID = IDno] ( EMPLOYEE, WORKS_ON )

RESULT
Name     ID   SupervisorID  DeptNo  IDno  ProjNo  Hours
John     111  222           5       111   1       2.5
John     111  222           5       111   2       1.5
Frankie  222  777           4       222   2       2.5
Frankie  222  777           4       222   3       5.5
Frankie  222  777           4       222   4       5.5
Frankie  222  777           4       222   5       1.5
William  555  222           5       555   3       2.5
Joyce    666  222           5       666   1       3.5
Joyce    666  222           5       666   2       3.5

NOTES:
1. The result of a JOIN with a general condition (often called a THETA-JOIN) is also a Relational Table.
2. DOT-convention: Sometimes, attributes in TABLE1 and TABLE2 have the same name. Whenever there is confusion, we shall refer to such attributes in the result as TABLE_NAME.Attribute. Thus, the attribute Name in RESULT can equivalently be called EMPLOYEE.Name. Likewise, the attributes can all be named as: EMPLOYEE.Name, EMPLOYEE.ID, ..., WORKS_ON.IDno, ..., WORKS_ON.Hours.
3. NATURAL-JOIN: A special case of the JOIN operation is often used. In a NATURAL-JOIN, the join condition is that attributes of TABLE1 and TABLE2 that have the SAME NAME must be equal in value.

Data loss in a JOIN:
In a JOIN operation, if there is no tuple from TABLE2 matching the conditions with a tuple of TABLE1, then that tuple of TABLE1 does not occur in the result. For example, Alice (ID 333) has no rows in WORKS_ON, so she does not appear in JOIN [ID = IDno] ( EMPLOYEE, WORKS_ON ).

Sometimes, when performing JOIN operations, we may specifically require that all tuples of TABLE1 occur at least once in the result. If no matching tuples are found in TABLE2, then we just enter NULL values for the attributes that come from TABLE2. This operation is called a LEFT-OUTER-JOIN.

RESULT = LEFT-OUTER-JOIN [ID = EmpID] ( EMPLOYEE, DEPENDENT )

RESULT
Name      ID   SupervisorID  DeptNo  EmpID  DepName  Relationship
John      111  222           5       111    Mike     Son
John      111  222           5       111    Anita    Daughter
Frankie   222  777           4       222    Jack     Son
Frankie   222  777           4       222    Jill     Wife
Frankie   222  777           4       222    John     Son
Alice     333  444           4       null   null     null
Jennifer  444  777           4       444    Ted      Son
William   555  222           5       null   null     null
Joyce     666  222           5       null   null     null
James     777  222           5       null   null     null
John      888  null          1       null   null     null

Similarly, we can define a RIGHT-OUTER-JOIN, where all tuples of TABLE2 appear at least once, with null values for the attributes of TABLE1 when there is no match.
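The "Cartesian product, then filter" description of JOIN, and the null-padding rule of LEFT-OUTER-JOIN, can be sketched in Python as follows. The helper names join and left_outer_join and the list-of-dicts representation are assumptions for illustration; the sketch also assumes the two tables have no attribute names in common, so it does not implement the DOT-convention.

```python
from itertools import product

# Assumed representation: a table is a list of dicts, one dict per tuple.
EMPLOYEE = [
    dict(zip(("Name", "ID", "SupervisorID", "DeptNo"), row))
    for row in [
        ("John", 111, 222, 5), ("Frankie", 222, 777, 4),
        ("Alice", 333, 444, 4), ("Jennifer", 444, 777, 4),
        ("William", 555, 222, 5), ("Joyce", 666, 222, 5),
        ("James", 777, 222, 5), ("John", 888, None, 1),
    ]
]
DEPENDENT = [
    dict(zip(("EmpID", "DepName", "Relationship"), row))
    for row in [
        (222, "Jack", "Son"), (222, "Jill", "Wife"), (222, "John", "Son"),
        (444, "Ted", "Son"), (111, "Mike", "Son"), (111, "Anita", "Daughter"),
    ]
]


def join(condition, table1, table2):
    """Hypothetical helper: filter the Cartesian product by the condition."""
    result = []
    for r1, r2 in product(table1, table2):  # every pairing of tuples
        combined = {**r1, **r2}             # all attributes of both tables
        if condition(combined):
            result.append(combined)
    return result


def left_outer_join(condition, table1, table2):
    """Hypothetical helper: like join, but unmatched table1 tuples are kept."""
    table2_attrs = list(table2[0]) if table2 else []
    result = []
    for r1 in table1:
        matches = [{**r1, **r2} for r2 in table2 if condition({**r1, **r2})]
        if not matches:
            # No partner in table2: pad its attributes with nulls (None).
            matches = [{**r1, **dict.fromkeys(table2_attrs)}]
        result.extend(matches)
    return result


def same_employee(row):
    return row["ID"] == row["EmpID"]


inner = join(same_employee, EMPLOYEE, DEPENDENT)
outer = left_outer_join(same_employee, EMPLOYEE, DEPENDENT)
unmatched = [r["ID"] for r in outer if r["EmpID"] is None]
print(len(inner), len(outer), unmatched)
```

The inner join drops every employee without dependents, while the left outer join keeps them with None in the DEPENDENT columns, matching the table shown above.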
Set theoretic Operations:

Since relational tables are sets of tuples, common set operations can be easily defined. These include Union, Intersection, Difference, Division.

Union: If two tables have the same attributes, you can perform a union.

UNION ( TABLE1, TABLE2 )

X = UNION ( ( SELECT [EmpID = 222] ( DEPENDENT ) ), ( SELECT [EmpID = 444] ( DEPENDENT ) ) )

X
EmpID  DepName  Relationship
222    Jack     Son
222    Jill     Wife
222    John     Son
444    Ted      Son

Intersection: Performs set intersection on the tables, provided they have the same attributes.

INTERSECTION ( TABLE1, TABLE2 )

X = INTERSECTION ( ( SELECT [EmpID = 222] ( DEPENDENT ) ), ( SELECT [Relationship = 'Son'] ( DEPENDENT ) ) )

X
EmpID  DepName  Relationship
222    Jack     Son
222    John     Son

Difference: Performs set difference on the two tables (every element of the first set which is NOT a member of the second set is output); the two tables should have an identical set of attributes.
DIFFERENCE ( TABLE1, TABLE2 )

Y = DIFFERENCE ( ( SELECT [EmpID = 222] ( DEPENDENT ) ), ( SELECT [Relationship = 'Son'] ( DEPENDENT ) ) )

Y
EmpID  DepName  Relationship
222    Jill     Wife

Note that DIFFERENCE is not commutative: in general, DIFFERENCE ( A, B ) != DIFFERENCE ( B, A ).

DivideBy: You may also perform a set division operation on two tables. The operation is described using the simple tables and example that follow.

DIVIDEBY ( TABLE1, TABLE2 )

The result is defined as follows:
1. The attributes of TABLE2 must be a proper subset of the attributes of TABLE1.
2. Let the attributes of TABLE2 be {B1, ..., Bn}, and those of TABLE1 be {A1, ..., Am, B1, ..., Bn}.
3. The output of the DIVIDEBY is a table with attributes {A1, ..., Am}.
4. The output contains all tuples with values <A1i, ..., Ami> such that for every distinct tuple <B1j, ..., Bnj> in TABLE2, there is a tuple <A1i, ..., Ami, B1j, ..., Bnj> in TABLE1.
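Since tables are sets of tuples, the set-theoretic operations and the four-part DIVIDEBY definition above map closely onto Python's built-in set type. The sketch below is an illustration under stated assumptions: tables are sets of plain tuples, the helper divide_by is hypothetical, and the attributes of TABLE2 are assumed to be the trailing n columns of TABLE1. The WORKS_ON and PROJECTS data are the small tables used in this lecture's division example.

```python
# Assumed representation: a table is a Python set of tuples.
DEPENDENT = {
    (222, "Jack", "Son"), (222, "Jill", "Wife"), (222, "John", "Son"),
    (444, "Ted", "Son"), (111, "Mike", "Son"), (111, "Anita", "Daughter"),
}

of_222 = {t for t in DEPENDENT if t[0] == 222}  # SELECT [EmpID = 222]
of_444 = {t for t in DEPENDENT if t[0] == 444}  # SELECT [EmpID = 444]
sons = {t for t in DEPENDENT if t[2] == "Son"}  # SELECT [Relationship = 'Son']

union = of_222 | of_444
intersection = of_222 & sons
difference_ab = of_222 - sons  # DIFFERENCE(A, B)
difference_ba = sons - of_222  # DIFFERENCE(B, A): a different table


def divide_by(table1, table2, n):
    """Hypothetical helper for DIVIDEBY.

    Assumes each table1 tuple is (A1..Am, B1..Bn) and each table2 tuple
    is (B1..Bn), i.e. the shared attributes are the last n columns.
    """
    candidates = {t[:-n] for t in table1}  # all <A1..Am> values
    return {
        a for a in candidates
        if all(a + b in table1 for b in table2)  # paired with EVERY table2 tuple
    }


WORKS_ON = {(111, 1), (111, 2), (222, 2), (222, 3), (222, 1), (333, 2)}
PROJECTS = {(1,), (2,), (3,)}

works_on_all = divide_by(WORKS_ON, PROJECTS, 1)
print(sorted(union))
print(sorted(difference_ab), sorted(difference_ba))
print(works_on_all)
```

The two differences are visibly unequal, which illustrates the non-commutativity note, and divide_by checks the "for every tuple of TABLE2" condition of step 4 directly with all(...).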
Consider the following tables (a simplified WORKS_ON table is used for this example):

WORKS_ON
EmployeeID  ProjectNo
111         1
111         2
222         2
222         3
222         1
333         2

PROJECTS
ProjectNo
1
2
3

We compute RESULT = DIVIDEBY ( WORKS_ON, PROJECTS )

1. RESULT will be a table with those attributes of WORKS_ON that are not in PROJECTS, namely, {EmployeeID}.
2. From the PROJECTS table, there are three distinct values of <ProjectNo>: <1>, <2> and <3>.
3. From WORKS_ON, there is exactly one value of {EmployeeID}, namely <222>, such that there are rows in WORKS_ON corresponding to <222, 1>, <222, 2> and <222, 3>. Note that <111> does not qualify since there is no tuple in WORKS_ON with values <111, 3>; likewise, <333> does not qualify, since neither <333, 1> nor <333, 3> is in WORKS_ON.
4. Thus, we get the result:

RESULT
EmployeeID
222

Notice that the above operation answers the question: Which employees work on all the projects?

RA is an elegant mathematical tool for accessing and manipulating data stored in relational schemas. It can be shown that RA is complete, in the sense that any modification you would like to do on any subset of a set of relational tables can be done using some combination of RA commands. Thus, RA can be used to construct a Data Manipulation Language (DML) for relational DBs; however, the de facto standard DML used by all DBMSs is SQL, which we shall learn next. SQL is based on a mathematical system called Relational Calculus (which we will not learn here).