Outline Introduction to the Course Introduction Distributed DBMS Architecture Distributed Database Design Query Processing Transaction Management Issues in Distributed Databases and an Example Objectives of the course, deadlines, exam Transparency Intuitively Definition of Distributed Computing and Distributed Databases Transparency in Distributed Databases Distributed Concurrency Control Distributed DBMS Reliability Parallel Database Systems Spring, 2006 Arturas Mazeika Page 2 Issues in Distributed Databases An Example of a Distributed Database Architecture and Design How to fragment data How to replicate data Query and Transaction Management Convert user transactions into data manipulation queries Convert user queries into parallel queries Optimize queries Concurrency Control Synchronization of concurrent accesses Deadlock Management Reliability How to make the system resilient to failures Atomicity and Durability Database consists of two relations: employees and projects The relations are distributed What are the problems with queries, transactions, concurrency, and reliability? Spring, 2006 Arturas Mazeika Page 3 Spring, 2006 Arturas Mazeika Page 4
Objectives Teaching Format Course: Present and compare the key aspects of the distributed databases. I expect a high participation of the students in the course Labs: Select a topic of the distributed databases or extend your Data Warehousing and Data Mining project Lectures Project-based programming in the labs Spring, 2006 Arturas Mazeika Page 5 April 24th. The course starts Deadlines May 1st (one week) The students forms the groups, choose a mini-project, and defines the problem. The groups should consists of two to four people. The group sends an email to the professor of the course with the list of the members (together with their email addresses) and the problem definition June 8th (five weeks) The presentation of the projects. The group makes a five minutes presentation on the project. The presentation should include (i) the problem definition, (ii) example of the extended algorithm June 9th (five weeks) Mini-project report deadline. The students must send a PDF version of the report to the professor of the course. Also the students should hand in a hard copy of the report to the secretaries of the faculty. The report must define the problem (0.5-1 page), give a solution (0.5-1 page), illustrate the solution by examples (2-3 pages), give a manual to the command line interface (0.5-1 page) Spring, 2006 Arturas Mazeika Page 6 Exam Organization of the Exam: The exam is oral. The exam consists of two parts: (i) the defense of the project, and (ii) a question from the DDB course. The student will have to explain his project by an example. The student will have to explain the aspect of the distributed databases and show the positive and negative characteristics of the aspect Each student will be examined individually for 15 minutes This is an open book exam, i.e. the student is allows to use any material in the exam. However, the student should freely operate with the concepts and know definitions of the course Grading of the Exam: The student can earn up to 18 points for the project part (60 per cent of the grade) The student can earn up to 12 points for the theory part (40 per cent of the grade) The students of a group can earn up to 6 points for the presentation in the DB seminar series (optional) The final grade is a sum of 1 3. However, the maximum grade for the course is 30 points (+cum laude) Spring, 2006 Arturas Mazeika Page 7 Spring, 2006 Arturas Mazeika Page 8
Exam Cont. Data Independence: Evolution of Programs 1/3 Our Advice: Prepare a plan for each question of the theory part. Absence of a plan typically introduces some degree of disorder in the presentation, and lowers the final grade Answer the questions in the exam precisely and shortly. Giving all information that is just relevant to the asked question does not increase the grade of the exam The successful pass of the exam depends on two things: the knowledge of the course, and the ability to present the knowledge. Make sure you learn the subject and learn how to present the subject Programs can store data in regular files and achieve data independence (transparency) Spring, 2006 Arturas Mazeika Page 9 Data Independence: Evolution of Programs 2/3 Spring, 2006 Arturas Mazeika Page 10 Data Independence: Evolution of Programs 3/3 Programs can store data with a help of DBMS and fully achieve data independence (transparency) Goal: achieve data distribution transparency Integration of databases and networking does not mean centralization (in fact quite opposite) Spring, 2006 Arturas Mazeika Page 11 Spring, 2006 Arturas Mazeika Page 12
Definition of a Distributed Computation Synonymous Terms A distributed database is a collection of autonomous processing elements that are interconnected by a computer network. The elements cooperate in order to perform the assigned task Synonymous terms: distributed function distributed data processing multiprocessors/multicomputers satellite processing back-end processing dedicated/special purpose computers timeshared systems functionally modular systems The term distributed is very broadly used. The exact meaning of the word depends on the context Spring, 2006 Arturas Mazeika Page 13 What Can Be Distributed? Spring, 2006 Arturas Mazeika Page 14 Definition of DDB and DDBMS The following can be distributed: Processing logic Functions Data Control A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network A distributed database management system (DDBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users I ll use the term DDBMS and DDBS interchangeably Spring, 2006 Arturas Mazeika Page 15 Spring, 2006 Arturas Mazeika Page 16
What is not a DDBS? DDBS Environment Shared Memory Shared Disk Shared Nothing Central Databases These systems are parallel database systems and are quite different (though related to) distributed DB systems Spring, 2006 Arturas Mazeika Page 17 Implicit Assumptions Spring, 2006 Arturas Mazeika Page 18 Applications Data stored at a number of sites each site logically consists of a single processor Processors at different sites are interconnected by a computer network (we do not consider multiprocessors in DDBMS, cf. parallel systems) DDBS is a database, not a collection of files (cf. relational data model). Placement and query of data is impacted by the access patterns of the user DDBMS is a collections of DBMSes (not a remote file system) Manufacturing, especially multi-plant manufacturing Military command and control Airlines Hotel chains Any organization which has a decentralized organization structure Spring, 2006 Arturas Mazeika Page 19 Spring, 2006 Arturas Mazeika Page 20
Motivation for DDBS Motivation for DDBS (Higher Reliability) Distributed Database Systems delivers: Higher reliability Partially improved performance System expansion Transparency Replication components No single points of failure Communications links Spring, 2006 Arturas Mazeika Page 21 Motivation for DDBS (Partial Improved Performance) Spring, 2006 Arturas Mazeika Page 22 Motivation for DDBS (System Expansion) Proximity of data to its points of use Requires some support for fragmentation and replication Parallelism in execution Inter-query parallelism Intra-query parallelism Update and read-only queries influences the design of DDBS substantially Issue is database scaling Emergence of microprocessor and workstation technologies Data communication cost versus telecommunication cost Increasing database size Spring, 2006 Arturas Mazeika Page 23 Spring, 2006 Arturas Mazeika Page 24
Motivation for DDBS (Transparency) Transparency User Wants to See One Database Programmer Sees Many Databases Transparent system hides the implementation details from the users Distribution transparency Replication transparency Fragmentation transparency Network transparency Location transparency Naming transparency Transaction transparency Concurrency transparency Failure transparency Performance transparency Spring, 2006 Arturas Mazeika Page 25 Transparency (Distribution Transparency) Spring, 2006 Arturas Mazeika Page 26 Transparency (Naming Transparency 1/3) Distribution transparency allows user to perceive database as single, logical entity If DDBMS exhibits distribution transparency, user does not need to know: data is fragmented (fragmentation transparency) location of data items (location transparency) With replication transparency, user is unaware of replication of fragments Each item in a DDB must have a unique name DDBMS must ensure that no two sites create a database object with same name One solution is to create central name server. However, this results in: loss of some local autonomy central site may become a bottleneck low availability. If the central site fails remaining sites cannot create any new objects Spring, 2006 Arturas Mazeika Page 27 Spring, 2006 Arturas Mazeika Page 28
Transparency (Naming Transparency 2/3) Transparency (Naming Transparency 3/3) Alternative solution - prefix object with identifier of site that created it For example, branch created at site S1 might be named S1.BRANCH Also need to identify each fragment and its copies Thus, copy 2 of fragment 3 of Branch created at site S1 might be referred to as S1.BRANCH.F3.C2 An approach that resolves these problems uses aliases for each database object Thus, S1.BRANCH.F3.C2 might be known as Local Branch by user at site S1 DDBMS has task of mapping an alias to appropriate database object Spring, 2006 Arturas Mazeika Page 29 Transparency (Transaction Transparency) Spring, 2006 Arturas Mazeika Page 30 Transparency (Concurrency Transparency 1/2) Ensures that all distributed transactions maintain integrity and consistency of the DDB Distributed transaction accesses data at more than one location Each transaction is divided into a number of sub-transactions (a sub-transaction for each site that has relevant data) DDBMS must ensure the indivisibility of both the global transaction and each of the sub-transactions All transactions must execute independently and be logically consistent. The result of a set of the (parallel) execution should be the same, as if the transactions were executed in some arbitrary serial order Same fundamental principles as for centralized DBMS DDBMS must ensure both global and local transactions do not interfere with each other Similarly, DDBMS must ensure consistency of all sub-transactions of global transaction Spring, 2006 Arturas Mazeika Page 31 Spring, 2006 Arturas Mazeika Page 32
Transparency (Concurrency Transparency 2/2) Transparency (Failure Transparency) Replication makes concurrency more complex If a copy of a replicated data item is updated, update must be propagated to all copies Could propagate changes as part of original transaction, making it an atomic operation However, if one site holding copy is not reachable, then transaction is delayed until site is reachable Could limit update propagation to only those sites currently available. Remaining sites updated when they become available again DDBMS must ensure atomicity and durability of the global transaction, i.e., the sub-transactions of the global transaction either all commit or all abort Thus, DDBMS must synchronize global transaction to ensure that all sub-transactions have completed successfully before recording a final COMMIT for the global transaction The solution should be robust in presence of site and network failures Could allow updates to copies to happen asynchronously, sometime after the original update. Delay in regaining consistency may range from a few seconds to several hours Spring, 2006 Arturas Mazeika Page 33 Transparency (Performance Transparency 1/2) Spring, 2006 Arturas Mazeika Page 34 Transparency (Performance Transparency 1/2) Distributed Query Processor (DQP) maps data request into an ordered sequence of operations on local databases DQP must consider fragmentation, replication, and allocation schemas DDBMS must perform as if it were a centralized DBMS DDBMS should not suffer any performance degradation due to the distributed architecture DDBMS should determine most cost-effective strategy to execute a request DQP has to decide: which fragment to access which copy of a fragment to use which location to use DQP produces execution strategy optimized with respect to some cost function Typically, costs associated with a distributed request include: I/O cost CPU cost communication cost Spring, 2006 Arturas Mazeika Page 35 Spring, 2006 Arturas Mazeika Page 36
Complicating Factors Related Technologies/Problems Complexity Cost Security Integrity control more difficult Lack of standards Lack of experience Database design more complex Distributed database design Distributed query processing Distributed directory management Distributed concurrency control Distributed deadlock management Reliability Heterogeneous databases Spring, 2006 Arturas Mazeika Page 37 Relationship Between Issues Spring, 2006 Arturas Mazeika Page 38 Conclusion A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network Data stored at a number of sites, the sites are connected by a network. DDB supports the relational model. DDB is not a remote file system Transparent system hides the implementation details from the users Distribution transparency Network transparency Transaction transparency Performance transparency Programming a distributed database involves: Distributed database design Distributed query processing Distributed directory management Distributed concurrency control Distributed deadlock management Reliability Spring, 2006 Arturas Mazeika Page 39 Spring, 2006 Arturas Mazeika Page 40