8. Secondary and Hierarchical Access Paths



Theo Härder

Main references:
- Theo Härder, Erhard Rahm: Datenbanksysteme: Konzepte und Techniken der Implementierung, Springer, 2001, Chapter 8.
- Patrick O'Neil, Elizabeth O'Neil: Database: Principles, Programming, Performance, 2nd edition, Morgan Kaufmann Publ., 2001, Chapter 8.

2 AG DBIS of Database Systems SS 2

Secondary and Hierarchical Access Paths: Goals

- Design principles for access to all qualified records of a table
- Evaluation of search predicates by set-theoretic operations
- Mapping choices for hierarchical access requirements
- Access via secondary keys
  - Entry structure and link structure
  - Use of pointer lists and compressed bit lists
  - Run-length, null-sequence, and Golomb coding
  - Multi-mode compression, block compression, Huffman codes
- Merged access (generalized access path structure)
- Join and path index

Connection Structures for Record Sets

Materialized storage
1. Physical contiguity of records (clustering, lists)
2. Chaining of records

Referenced storage
3. Physical contiguity of pointers (inversion)
4. Chaining of pointers

[Figure: each of the four variants shown for record 1, record 2, record 3]

Access Paths for Secondary Keys

- Search for records having given values of non-identifying attributes (secondary keys)
- The result is a record set

[Figure: table with columns Eno, Dno, Loc, Salary and two indexes, one on Dno and one on Loc, whose entries point to the matching records]

- Access path: entry structure + link structure
- Primary-key access paths are applicable as entry structure to record sets
- In principle, all connection structures can be used for record sets; most frequently used are B*-trees and inversion techniques
- The standard solution for inversion is sequential reference lists (often called OID lists or TID lists)
  - efficient processing of set operations
  - cost-effective maintenance
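The remark that sequential TID lists allow efficient set operations can be illustrated with a merge over two ascending lists. The following is only a sketch: TIDs are modeled as plain integers, and the two small indexes (`dno_index`, `loc_index`) are hypothetical data loosely modeled on the Dno/Loc example, not taken from the slides.

```python
def intersect(a, b):
    """AND of two predicates: merge-intersect two ascending TID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR of two predicates: merge two ascending TID lists without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


# Hypothetical inverted lists: secondary-key value -> ascending TID list
dno_index = {"A2": [1, 2, 4, 6], "A3": [3, 5]}
loc_index = {"KL": [1, 3, 6], "MA": [2, 4], "F": [5]}

a2_and_kl = intersect(dno_index["A2"], loc_index["KL"])
a3_or_ma = union(dno_index["A3"], loc_index["MA"])
```

Both operations touch every list element at most once, which is why keeping the reference lists sorted pays off for conjunctive and disjunctive predicates.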

Access Paths for Secondary Keys (2)

Frequent realization: inversion
- Separation of access path data and data records (referenced storage)
- Reference Z realized as TID, DBK/PPP, ...

Two representation methods are possible:
a) Combined representation of lookup structure and pointer lists
   [Figure: entries of the form key, list length, pointer list; for example, one key with 4 references Z and another with 3 references Z]
   Relatively short pointer lists are assumed!
b) In the lookup structure there exists (similar to access paths for primary keys) only a single reference per key value, which points to a list of references to records (the pointer list)
   [Figure: entries of the form key, list length, reference to a separately stored pointer list of references Z]
   Pointer lists are managed in separate containers.

Access Paths for Secondary Keys (3)

[Figure: index I_Emp(Dno) as a B*-tree on table Emp (Eno, Name, Dno, ...); inner nodes hold separator keys, and each leaf entry holds a key value, the number n of references, and the list TID_1, TID_2, ..., TID_n]

B*-tree as access path for secondary key Dno
- Represents the sort order of the secondary keys and provides forward/backward chaining
- Complex search operations:
  - Range search
  - Generic search
  - Mask search (LIKE)
  - Phonetic search

Access Paths for Secondary Keys (4)

Use for information retrieval
- Unformatted data: documents
- Inversion by means of descriptors (no assignment to attributes!)

[Figure: a descriptor such as "system" with either a list of references Z to documents D or a bit list]

- Very many and very few references are possible

Inversion using bit lists
- Addressing of data records or documents
  - Via allocation table AT
  - Directly in case of fixed length and contiguous storage
- Markings in the bit list correspond to entries of AT or to computable addresses (b records per page)
- Attribute A has j attribute values a_1, ..., a_j

Access Paths for Secondary Keys (5)

[Figure: bit matrix for A with one column per attribute value a_1, a_2, ..., a_j and one row per record]

- Storage as vertical bit lists enables indexing of multi-valued attributes (example: shopping cart with products)
- Bit lists of fixed length: j_i bit lists for attribute A_i
  - Simple update operations
  - Fast comparison
  - Very space consuming
  - Only for small j
- Often long null sequences occur
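A minimal sketch of inversion by uncompressed bit lists, under the assumption that records are addressed by their position (the "computable address" case). Python integers serve as bit lists, and the attribute values in `dno` and `loc` are illustrative only.

```python
def build_bit_lists(values):
    """Return {attribute value: bit list}; bit i is set if record i carries that value."""
    bit_lists = {}
    for pos, value in enumerate(values):
        bit_lists[value] = bit_lists.get(value, 0) | (1 << pos)
    return bit_lists


def positions(bits):
    """Decode a bit list into the ascending list of marked record positions."""
    out = []
    pos = 0
    while bits:
        if bits & 1:
            out.append(pos)
        bits >>= 1
        pos += 1
    return out


# Hypothetical column values of six records (position = record number)
dno = ["A2", "A2", "A3", "A2", "A3", "A2"]
loc = ["KL", "MA", "KL", "MA", "F", "KL"]

dno_bits = build_bit_lists(dno)   # one bit list per Dno value
loc_bits = build_bit_lists(loc)   # one bit list per Loc value

# Dno = 'A2' AND Loc = 'KL' is a single bitwise AND of two bit lists
hits = positions(dno_bits["A2"] & loc_bits["KL"])
```

The comparison is one machine-level operation per word, which matches the "fast comparison" point. The cost is one full-length bit list per attribute value, which is why the technique is listed as suitable only for small j.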

Access Paths for Secondary Keys (6)

Compressed bit lists of variable length
- Space saving
- Reduction of I/O time
- Additional overhead for coding and decoding
- Fast comparison
- Ponderous update operations

Application areas of compression
- Data warehouse (inversion of the fact table)
- Transfer/storage of
  - Multimedia objects (image, audio, video, ...)
  - Sparse matrices
  - Objects in Geo-DBs, ...

Many compression techniques are available.

Compression of Bit Lists

Run-length compression
- A run is a bit sequence of uniform bit marks. The uncompressed bit list is divided into successive alternating sequences of 0s and 1s.
- The technique represents each run in a coding sequence by its length (stored as a binary number).
- A coding sequence can be composed of several coding units of fixed length (k bits). If a run is longer than 2^k - 1 bits, a coding sequence with more than one coding unit has to be used for the mapping.
- Compression of a run of length L with (n-1)(2^k - 1) < L ≤ n(2^k - 1), n = 1, 2, ..., requires n coding units. The first n-1 coding units are completely filled with 0s (low value), which makes it possible to recognize that the subsequent coding units belong to the same coding sequence.
- Checking each coding unit for low values needs an extra test during decompression; such an implicit continuation mark prevents the method from failing for sequences of lengths > 2^k.

[Figure: example of run-length coding of a list of marks, k = 6]
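The run-length scheme described above can be sketched as follows. The assumptions of this sketch, not stated on the slide, are that coding units are kept as small integers rather than packed into k-bit fields, and that the value of the first run is stored separately as `first_bit`.

```python
def rle_encode(bits, k):
    """Return (first_bit, units); every unit is an integer that fits into k bits."""
    if not bits:
        return None, []
    max_len = (1 << k) - 1            # longest run a single coding unit can express
    units = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        run = j - i
        # n-1 low-value units (all 0s) act as continuation marks,
        # each standing for max_len bits of the current run
        while run > max_len:
            units.append(0)
            run -= max_len
        units.append(run)             # final unit: 1 <= run <= max_len
        i = j
    return bits[0], units


def rle_decode(first_bit, units, k):
    """Rebuild the bit list from the coding units produced by rle_encode."""
    max_len = (1 << k) - 1
    bits = []
    current = first_bit
    pending = 0
    for unit in units:
        if unit == 0:                 # extra test for low values: run continues
            pending += max_len
        else:
            bits.extend([current] * (pending + unit))
            pending = 0
            current ^= 1              # runs alternate between 0s and 1s
    return bits


example = [0] * 10 + [1] * 3 + [0]
first, coded = rle_encode(example, k=3)
```

With k = 3 a unit covers at most 7 bits, so the run of ten 0s needs two units (a continuation unit plus the rest 3), in line with the condition (n-1)(2^k - 1) < L ≤ n(2^k - 1) for n = 2.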

Compression of Bit Lists (2)

Null-sequence compression
- A null sequence is a sequence of 0-bits between two 1-bits in the uncompressed bit list.
- The basic idea of the method is to represent the bit list only by its successive null sequences, where a 1-bit is implicitly expressed in each case.
- Because a null sequence of length L = 0 can now occur, the following coding can be chosen (k = 6), which corresponds to the addition of binary numbers ≤ 2^k - 1:
  [Table: length of null sequence and its coding]
- Because a coding sequence can be composed additively of several coding units, null sequences of arbitrary length can be represented. n coding units are required if (n-1)(2^k - 1) ≤ L < n(2^k - 1), n = 1, 2, ...

[Figure: example of null-sequence coding of a list of marks, k = 6]

Compression of Bit Lists (3)

Golomb coding (for null-sequence compression)
- A null sequence of length L is represented by a coding sequence consisting of a variable-length prefix, a separator bit, and a remainder field of fixed length using log2(m) bits.
- The prefix is composed of ⌊L/m⌋ identical bits followed by a single bit of the opposite value as separator.
- The remainder contains (as a binary number) the number of remaining bits of the null sequence: L - m * ⌊L/m⌋.
- This method enables the compression of arbitrarily long null sequences (improved by Exp-Golomb), independent of the chosen parameters.
- If p is the 0-bit probability in the bit list, parameter m should be chosen such that p^m ≈ 0.5.

[Figure: example with m = 4 showing a null sequence split into groups of m bits and the resulting prefix, separator, and remainder]
[Figure: example of Golomb coding of a list of marks, m = 8]
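Both codings of null sequences can be sketched in a few lines. Assumptions of this sketch: the Golomb prefix uses 1-bits with a 0-bit as separator (the usual convention), m is a power of two so that the remainder needs exactly log2(m) bits, and code words are shown as strings of '0'/'1' for readability.

```python
def null_sequences(bits):
    """Lengths of the 0-runs preceding each 1-bit (trailing 0s are not represented)."""
    lengths, run = [], 0
    for b in bits:
        if b:
            lengths.append(run)
            run = 0
        else:
            run += 1
    return lengths


def null_units(length, k):
    """Null-sequence coding: an all-ones unit means 'add 2^k - 1 and continue'."""
    full = (1 << k) - 1
    units = []
    while length >= full:
        units.append(full)
        length -= full
    units.append(length)                   # 0 <= length < 2^k - 1
    return units


def golomb_encode(length, m):
    """Golomb code word: unary prefix, separator, remainder of log2(m) bits."""
    r_bits = m.bit_length() - 1            # log2(m) for a power of two
    q, r = divmod(length, m)
    remainder = format(r, "b").zfill(r_bits) if r_bits else ""
    return "1" * q + "0" + remainder


def golomb_decode(code, m):
    """Decode concatenated Golomb code words into null-sequence lengths."""
    r_bits = m.bit_length() - 1
    lengths, i = [], 0
    while i < len(code):
        q = 0
        while code[i] == "1":              # count the unary prefix
            q += 1
            i += 1
        i += 1                             # skip the separator bit
        r = int(code[i:i + r_bits], 2) if r_bits else 0
        i += r_bits
        lengths.append(q * m + r)
    return lengths


lengths = null_sequences([0, 0, 1, 1, 0, 0, 0, 1, 0])
stream = "".join(golomb_encode(n, 4) for n in [9, 0])
```

For k = 6, a null sequence of length 62 fits into one unit while length 63 needs two, which matches the condition (n-1)(2^k - 1) ≤ L < n(2^k - 1). The Golomb prefix grows linearly with L/m, so a well-chosen m keeps typical prefixes short.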

Compression of Bit Lists (4)

Multi-mode compression
- Some bits of a coding sequence of fixed length k are reserved as so-called type bits to mark different modes of a coding sequence.
- A single type bit enables two modes:
  - Bit-pattern mode: k-1 bits of the sequence are stored as a bit pattern.
  - Null-sequence mode: up to 2^(k-1) - 1 bits of a null sequence are expressed by a binary number.

[Figure: example coding of a list of marks with k = 6 and a single type bit]

Compression of Bit Lists (5)

Multi-mode compression (cont.)
- Because k is restricted, long null sequences require several successive coding sequences.
- Furthermore, isolated 1s in a bit list need a separate coding sequence that codes them as a bit pattern.
- Reserving a further type bit enables greater flexibility with, for example, the following four modes:
  - k-2 bits of the sequence are stored as a bit pattern;
  - up to 2^(k-2) - 1 bits are encoded as a sequence of 1s by a binary number;
  - up to 2^(k-2) - 1 bits are encoded as a null sequence by a binary number;
  - up to 2^(2k-2) - 1 bits are encoded as a null sequence in a doubled coding sequence.
- If a coding sequence is large enough to compress any null sequence, isolated 1s could be implicitly expressed.

[Figure: example coding of a list of marks with k = 8 and two type bits]
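A sketch of the two-mode variant with a single type bit. Several choices here are assumptions rather than part of the slides: type bit 0 marks a bit pattern and type bit 1 a null sequence, the encoder greedily prefers null-sequence mode whenever a 0-run is longer than one bit pattern, and the original length n is kept so that padding of the last pattern can be cut off.

```python
def mm_encode(bits, k):
    """Return (type_bit, payload) units; every payload fits into k-1 bits."""
    width = k - 1
    max_null = (1 << width) - 1
    units = []
    i = 0
    while i < len(bits):
        run = 0
        while i + run < len(bits) and bits[i + run] == 0 and run < max_null:
            run += 1
        if run > width:
            units.append((1, run))                      # null-sequence mode
            i += run
        else:
            chunk = bits[i:i + width]
            chunk = chunk + [0] * (width - len(chunk))  # pad the last pattern
            units.append((0, int("".join(map(str, chunk)), 2)))
            i += width
    return units


def mm_decode(units, k, n):
    """Rebuild the first n bits from the coding units produced by mm_encode."""
    width = k - 1
    bits = []
    for type_bit, payload in units:
        if type_bit == 1:
            bits.extend([0] * payload)
        else:
            bits.extend(int(c) for c in format(payload, "b").zfill(width))
    return bits[:n]


example = [0] * 10 + [1, 0, 1] + [0, 0, 1]
coded = mm_encode(example, k=4)
```

With k = 4 the sixteen input bits become four coding sequences. The ten leading 0s are split into one null-sequence unit of 7 and a literal pattern 000, which illustrates why a restricted k forces long null sequences to be spread over several coding sequences.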

Compression of Bit Lists (6)

Block compression
- The uncompressed bit list is divided into blocks of length k.
- A first method replaces the individual blocks by codes of variable length. If the probabilities of specific bit patterns are known or can be estimated, Huffman codes can be used. With block length k, 2^k different patterns require 2^k code words of variable length (use of a translation table with optimally assigned code words).
- A second method stores only blocks in which at least one 1-bit occurs. To mark the blocks that are not stored (low-value blocks), a second bit list is used as a directory, in which each 1-mark corresponds to a stored block.
- Because long null sequences may occur in the directory, it can in turn be compressed using null-sequence or multi-mode compression.
- Applying block compression again to the directory leads to hierarchical block compression. It can be continued recursively until the elimination of null sequences is no longer worthwhile. Starting from the highest hierarchy level, the uncompressed bit list (index depth d) can easily be reconstructed.

Compression of Bit Lists (7)

[Figure: hierarchical block compression with root level, inner-node level, and leaf level for node size l = 4 and index depth d = 3, showing an indexed set S of eight elements and its physical storage]
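Hierarchical block compression can be sketched as a bottom-up construction of directories. The sketch assumes positions in the range 0 .. l^d - 1, represents every stored node as a tuple of l bits, and uses a hypothetical member set; the set from the slide figure is not reproduced here.

```python
def hb_compress(members, l, d):
    """Return the stored nodes per level, root level first."""
    levels = []
    marks = set(members)
    for _ in range(d):
        blocks = sorted({p // l for p in marks})        # blocks with at least one 1-bit
        levels.append([tuple(int(b * l + off in marks) for off in range(l))
                       for b in blocks])
        marks = set(blocks)                             # directory marks one level up
    levels.reverse()
    return levels


def hb_decompress(levels, l):
    """Rebuild the sorted member positions, starting from the highest level."""
    blocks = [0]                                        # the root is block 0
    for nodes in levels:
        marks = []
        for block, node in zip(blocks, nodes):
            marks.extend(block * l + off for off, bit in enumerate(node) if bit)
        blocks = marks                                  # marks name the stored blocks below
    return blocks


members = [2, 3, 9, 20, 30, 41]                         # hypothetical indexed set
levels = hb_compress(members, l=4, d=3)
stored_bits = sum(len(node) for nodes in levels for node in nodes)
```

For this set, 9 nodes of 4 bits each are stored instead of the 64 bits of the uncompressed list. The gain depends on how strongly the 1-bits cluster, since every stored node costs l bits regardless of how many marks it holds.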

Optimal Codes

- Extended binary trees with minimal external path length can be used to design optimal codes for n+1 characters.
- Sequence to be coded: A A B C A A B B C A D B A B A (15 characters)
- Codes of fixed length, 2 bits: A = 00, ..., D = 11
  C_2Bit = 15 * 2 = 30
- Are there better codings?

[Table: character, frequency, code; no code word is a prefix of another one; E_w = C_Code]

- Decoding can be performed with the same extended binary tree that was used to determine the codes.

[Figure: decoding of the coded bit string back into A A B C A ...]

Huffman Algorithm

- The minimal coding can be derived using extended binary trees with minimal weighted external path length. The resulting codes are called Huffman codes.
- Algorithm for the construction of binary trees with minimal weighted external path length
  - Given: a list of trees which initially consists of n external nodes as roots. The frequencies q_i are carried by the roots of the trees.
  - Idea: determine the two trees with the lowest frequencies and remove them from the list. By means of a new root, both trees are combined as left and right subtree into a new tree, which is inserted into the list.
  - n external nodes, n-1 combined trees = n-1 internal nodes

Algorithm: Huffman (TreeList list, int n)
    for (i = 1; i < n; i += 1) {
        p1 = smallest element from list
        remove p1 from list
        p2 = smallest element from list
        remove p2 from list
        create node p
        attach p1 and p2 as subtrees to p
        determine the weight of p as sum of the weights of p1 and p2
        insert p into list
    }
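The construction can be tried out on the sample sequence of 15 characters. This sketch replaces the linearly searched list by a binary heap (Python's `heapq`), which is a substitution made here for brevity and not the list-based procedure of the slide; the function names are illustrative.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(freq):
    """Build Huffman codes from {symbol: frequency}; symbols must not be tuples."""
    tie = count()                          # tie-breaker so trees are never compared
    heap = [(w, next(tie), sym) for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                   # n-1 merge steps
        w1, _, t1 = heapq.heappop(heap)    # smallest element
        w2, _, t2 = heapq.heappop(heap)    # second smallest element
        heapq.heappush(heap, (w1 + w2, next(tie), (t1, t2)))
    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):        # internal node: (left, right)
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                              # external node: a character
            codes[tree] = prefix or "0"

    walk(heap[0][2], "")
    return codes


def decode(bits, codes):
    """Decode a bit string; works because no code word is a prefix of another."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


text = "AABCAABBCADBABA"
freq = Counter(text)
codes = huffman_codes(freq)
encoded = "".join(codes[ch] for ch in text)
```

The concrete bit patterns depend on how the two subtrees are ordered at each merge, but the code lengths do not: the most frequent character A gets a 1-bit code, and the whole sequence needs fewer bits than the 30 bits of the fixed-length coding.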

Huffman Algorithm (2)

Execution example
[Figure: step-by-step merging of the trees T1 to T5 with frequencies q_i and the resulting weighted external path length E_w]

Cost: each of the n-1 merge steps searches the remaining list, whose length shrinks by one per step, so the cost term per step is proportional to (n - i) and the total cost is O(n^2).

Assignment of Huffman Codes

Example
[Tables: two code assignments, each with the columns bitstring, L_i, O_i, and value range; the value ranges are arranged symmetrically around zero, and the L_i values decrease toward the ranges near zero]

Hierarchical Access Paths

- Access paths for functional relationships between two record types
- Owner -> Member: Set types according to the network model
- Each instance of an Owner record type is linked to 0..n instances of the Member record type
- Logical view: illustration of navigation options

[Figure: Owner record type Dept (Dno, Mno, D-Loc) and Member record type Emp (Eno, Dno, Loc, Salary), connected by FIRST/NEXT, LAST/PRIOR, and OWNER pointers]

- Three implementations for different performance requirements

Hierarchical Access Paths: Implementation

Sequential list based on pages
[Figure: SET OWNER followed by SET MEMBER 1 to SET MEMBER 4 in physical sequence, with an optional Last pointer]

Chained list
[Figure: SET OWNER chained to SET MEMBER 1 to SET MEMBER 4, with optional Last/PRIOR pointers]

Hierarchical Access Paths: Implementation (2)

Pointer array structure
[Figure: SET OWNER refers to a POINTER-ARRAY whose ENTRY fields refer to SET MEMBER 1 to SET MEMBER 4; some pointers are optional]

Hierarchical Access Paths: Evaluation of Implementation Techniques

Pointer array
- Stable performance behavior
- Behavior independent of Set growth and Set sequence
- Standard method in case of imprecise information concerning Set size and access frequency

Sequential list
- Restricted to a single Set type per Member record type (clustering)
- Fast location / insertion in Set sequence
- Updates more expensive than for pointer array

Chained list
- Advantages in case of membership of the Member record type in several Sets
- Cheap switch to other Set occurrences
- Sequential access faster than for pointer array
- Only useful in small Set occurrences
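The difference between the chained list and the pointer array can be made concrete with two small in-memory classes. This is only a sketch: object references stand in for TIDs or page pointers, and the class and method names (`ChainedSet`, `PointerArraySet`, `nth`) are illustrative assumptions.

```python
class Record:
    def __init__(self, key):
        self.key = key
        self.next = None       # NEXT pointer, used by the chained list only
        self.owner = None      # optional OWNER pointer


class ChainedSet:
    """Set occurrence as a chain: owner -> first member -> next member -> ..."""

    def __init__(self, owner_key):
        self.owner = Record(owner_key)
        self.first = None
        self.last = None       # optional LAST pointer for cheap appends

    def insert(self, member):
        member.owner = self.owner
        if self.first is None:
            self.first = member
        else:
            self.last.next = member
        self.last = member

    def members(self):
        node = self.first
        while node is not None:          # every step follows one pointer
            yield node.key
            node = node.next


class PointerArraySet:
    """Set occurrence as an array of references kept with the owner."""

    def __init__(self, owner_key):
        self.owner = Record(owner_key)
        self.entries = []

    def insert(self, member):
        member.owner = self.owner
        self.entries.append(member)

    def members(self):
        return (m.key for m in self.entries)

    def nth(self, i):
        return self.entries[i].key       # direct positioning, no chain traversal


chain = ChainedSet("K2")
array = PointerArraySet("K2")
for eno in ["E1", "E2", "E3"]:
    chain.insert(Record(eno))
    array.insert(Record(eno))
```

Reaching the i-th member costs i pointer dereferences in the chain but a single array lookup in the pointer array, which is one way to read the "stable performance behavior" attributed to the pointer array.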

Generalized Access Path Structure

Idea: shared exploitation of an index structure (B*-tree) for several record types for which the relationships (1:1, 1:n, n:m) are defined over the same domain (e.g. for Dno) and represented by equality of attribute values

Use of the index structure for
- primary key access, e.g. as I_Dept(Dno)
- secondary key access, e.g. as I_Emp(Dno)
- hierarchical access, e.g. from Dept(Dno) to Emp(Dno) or vice versa
- join operations, e.g. on Dept.Dno = Emp.Dno

[Figure: tables Dept, Emp, Mgr, and Equipment; all tables carry an attribute (e.g. Dno) which is defined on domain Deptno]

Combined realization of primary key, secondary key, and hierarchical access using an extended B*-tree
- Inner tree nodes remain unchanged
- Leaves contain references for primary and secondary access

B*-Tree as Combined Access Path Structure

[Figure: index I_Emp(Dno), whose leaf entries hold a key, the number n of references, and TID_1 ... TID_n, next to the combined index I_Emp/Dept(Dno), whose leaf entries additionally hold the TID of the Dept record]

The structure contains the indexes for Dept and Emp and the link for Dept-Emp, with direct access from
1. the OWNER to each MEMBER,
2. each MEMBER to each other MEMBER,
3. each MEMBER to the OWNER.

B*-Tree as Generalized Access Path Structure

[Figure: index I_Emp/Dept/Mgr/Equip(Dno); each leaf entry holds the key followed by the TIDs for Dept, the TIDs for Mgr, the TIDs for Emp, and the TIDs for Equipment, with PRIOR/NEXT chaining of the leaves and an optional reference to an overflow page]

The access path structure comprises
- 4 index structures
- 6 link structures

Generalized Access Path Structure: Evaluation

- Keys are stored only once
  - Saving of storage space
- Uniform structure for all access path types
  - Simplification of implementation
- Support of join operations and certain statistical queries
- Simple checking of referential integrity and further integrity constraints (e.g., cardinality restrictions)
- Increased number of leaf pages
  - More page accesses in case of scanning all records of a record type in sort order
- Height of the tree remains stable in most cases
  - Similar performance behavior for locating data and update
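The idea of one leaf entry per key value with separate TID lists per record type can be sketched with a dictionary in place of the B*-tree. The class `CombinedIndex`, its method names, and the sample keys and TIDs are hypothetical; a real implementation would keep the entries in sorted leaf pages.

```python
from collections import defaultdict


class CombinedIndex:
    """One entry per key value holding a TID list for every record type."""

    def __init__(self, record_types):
        self.record_types = tuple(record_types)
        self.leaves = defaultdict(lambda: {t: [] for t in self.record_types})

    def insert(self, record_type, key, tid):
        self.leaves[key][record_type].append(tid)

    def lookup(self, record_type, key):
        """Primary or secondary key access for one record type."""
        entry = self.leaves.get(key)
        return list(entry[record_type]) if entry else []

    def join(self, left, right):
        """Equi-join on the shared key: TID pairs taken from the same leaf entry."""
        return [(lt, rt)
                for key in sorted(self.leaves)
                for lt in self.leaves[key][left]
                for rt in self.leaves[key][right]]

    def dangling(self, member, owner):
        """Keys that have member records but no owner record."""
        return [key for key in sorted(self.leaves)
                if self.leaves[key][member] and not self.leaves[key][owner]]


idx = CombinedIndex(["Dept", "Emp"])
idx.insert("Dept", "K55", "d1")
idx.insert("Emp", "K55", "e1")
idx.insert("Emp", "K55", "e2")
idx.insert("Emp", "K60", "e3")          # no Dept record with key K60
```

Because owner and member references sit in the same entry, the join and the referential integrity check both reduce to a single pass over the entries, without touching the data records.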

Join and Path Indexes

Join index
- The join index VI between two tables V and S (not necessarily disjoint) with the join attributes A and B is defined as follows:
  VI = {(v.tid, s.tid) | f(v.A, s.B) is TRUE, v ∈ V, s ∈ S}
- f denotes a Boolean function which defines the join predicate and which may be very complex. In particular, θ-joins (θ ∈ {=, ≠, <, ≤, >, ≥}) can be specified in this way.
- Application of selection predicates and parallelism for the join

[Figure: logical view of VI as pairs (TID_v, TID_s), stored in two copies: one with an index on TID_V and one with an index on TID_S]

Join and Path Indexes (2)

Multi-join index
- Generalization of the idea to efficiently process join operations via a statically computed join index (compile time instead of runtime)
- The index for a two-way join is used to determine the join partners in a third table T and to extend the index table by a column for the TIDs t_i.
- If two index tables for VS and ST already exist, they can be combined immediately into an extended index table VST.
- If the VST join should contain only attributes of V and T, a VT index can be created. Column S is indispensable for the join computation.

Multi-join index (example)
[Figure: logical view of the index tables VS and ST and of the combined index table VST with the columns TID_v, TID_s, TID_t]
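The definition of VI and the combination of two index tables into VST can be written down directly. This sketch computes the join index by a nested loop over small hypothetical tables given as lists of (TID, attribute value) pairs; the function names are illustrative.

```python
def join_index(v_table, s_table, f):
    """VI = {(v.tid, s.tid) | f(v.A, s.B)}; tables are lists of (tid, value)."""
    return [(v_tid, s_tid)
            for v_tid, a in v_table
            for s_tid, b in s_table
            if f(a, b)]


def multi_join_index(vs, st):
    """Combine the index tables VS and ST into VST via the shared S column."""
    by_s = {}
    for s_tid, t_tid in st:
        by_s.setdefault(s_tid, []).append(t_tid)
    return [(v, s, t) for v, s in vs for t in by_s.get(s, [])]


def project_vt(vst):
    """VT index: drop column S after the join and remove duplicate pairs."""
    return sorted({(v, t) for v, _, t in vst})


# Hypothetical tables: (TID, join attribute value)
V = [("v1", 10), ("v2", 20)]
S = [("s1", 10), ("s2", 20), ("s3", 10)]

vs = join_index(V, S, lambda a, b: a == b)          # equi-join as a special theta-join
st = [("s1", "t1"), ("s3", "t1"), ("s3", "t2")]     # existing index table for S-T
vst = multi_join_index(vs, st)
vt = project_vt(vst)
```

Column S is needed to match the two index tables and can only be dropped afterwards, which reflects the remark that it is indispensable for the join computation even when the result is a VT index.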

Join and Path Indexes (4)

Example
- Given are the tables Dept, Emp, Proj, and EP (Eno, Jno), which embodies an (n:m) relationship between Emp (Eno, Dno, ...) and Proj (Jno, ..., Loc).

  Q2: SELECT D.Dno, D.ANAME
      FROM Dept D, Emp E, EP M, Proj J
      WHERE D.Dno = E.Dno
        AND E.Eno = M.Eno
        AND M.Jno = J.Jno
        AND J.Loc = :X

- Extension to n tables is possible

Path index
- Integration of an index on Loc into the multi-join index DEMJ
- Enables evaluation of special queries on the index
- Assumption: multi-valued reference attributes in ORDBMS
- Path expression analogous to Q2: Dept.Employs-Emp.Works-at.Loc = :X

  Dept     Emp      EP       Proj     Loc
  TID_a1   TID_p1   TID_m1   TID_j1   Berlin
  TID_a1   TID_p2   TID_m3   TID_j1   Berlin
  TID_a1   TID_p2   TID_m4   TID_j2   Köln
  TID_a2   TID_p3   TID_m5   TID_j3   Bonn

Summary

Access paths for secondary keys
- Entry structure: B*-tree etc.
- Link structure: pointer lists, bit lists
- Many methods available
- Support of set-theoretic operations

Compression of bit lists
- Support of variable-length keys and entries required
- Bit lists are highly efficient in case of low domain cardinality
- Huffman codes allow for flexible adaptation to value distributions

Hierarchical access paths
- Support of join operations (relational model)
- Efficient processing of Set operations (network model)
- Link structure: chains, pointer lists, sequential lists (adjustment to special workloads)

Generalized access path structure
- Support of primary-key, secondary-key, and hierarchical accesses
- Also applicable as special join index

Join and path indexes
- Explicit construction of join results and their indexing
- Path indexes only enable optimization of special queries
