Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking


Records
Records may be fixed length or variable length. Records contain fields, which have values of a particular type (e.g., amount, date, time, age). Fields themselves may be fixed length or variable length. Variable-length fields can be mixed into one record; separator characters or length fields are needed so that the record can be parsed.

Blocking
Blocking refers to storing a number of records in one block on the disk. The blocking factor (bfr) is the number of records per block. There may be empty space in a block if an integral number of records does not fit in one block. Spanned records are records that exceed the size of one or more blocks and hence span a number of blocks.
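The blocking factor can be worked out with a few lines of arithmetic. The following is a minimal sketch for fixed-length, unspanned records; the function names and the 512-byte block size used in the usage note are assumptions made for illustration, not values from the slides.

```python
import math


def blocking_factor(block_size: int, record_size: int) -> int:
    """Records per block for fixed-length, unspanned records: floor(B / R)."""
    if record_size > block_size:
        # Such a record cannot be stored unspanned; it would have to span blocks.
        raise ValueError("record is larger than a block; spanned organization needed")
    return block_size // record_size


def unused_bytes_per_block(block_size: int, record_size: int) -> int:
    """Empty space left when an integral number of records does not fill the block."""
    return block_size - blocking_factor(block_size, record_size) * record_size


def blocks_needed(num_records: int, block_size: int, record_size: int) -> int:
    """Number of blocks required to hold the whole file."""
    return math.ceil(num_records / blocking_factor(block_size, record_size))
```

With an assumed block size of 512 bytes and 71-byte records, `blocking_factor(512, 71)` gives 7 records per block, `unused_bytes_per_block(512, 71)` gives 15 bytes, and a file of 1000 records needs `blocks_needed(1000, 512, 71)` = 143 blocks.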

Files of Records
A file is a sequence of records, where each record is a collection of data values (or data items). A file descriptor (or file header) includes information that describes the file, such as the field names and their data types, and the addresses of the file blocks on disk. Records are stored on disk blocks. The blocking factor bfr for a file is the (average) number of file records stored in a disk block. A file can have fixed-length records or variable-length records.

Figure: Three record storage formats. (a) A fixed-length record with six fields and a size of 71 bytes. (b) A record with two variable-length fields and three fixed-length fields. (c) A variable-field record with three types of separator characters.

Files of Records (continued)
File records can be unspanned (no record can span two blocks) or spanned (a record can be stored in more than one block). The physical disk blocks that are allocated to hold the records of a file can be contiguous, linked, or indexed. In a file of fixed-length records, all records have the same format. Usually, unspanned blocking is used with such files. Files of variable-length records require additional information to be stored in each record, such as separator characters and field types. Usually, spanned blocking is used with such files.
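To make the separator-based format concrete, here is a small sketch of a variable-field record that uses three kinds of separator characters: one between a field name and its value, one between fields, and one that ends the record. The particular characters ("=", ";" and a newline) and the function names are assumptions chosen for readability; field names and values are assumed not to contain them.

```python
NAME_VALUE_SEP = "="   # separates a field name from its value (assumed character)
FIELD_SEP = ";"        # separates one field from the next (assumed character)
RECORD_END = "\n"      # terminates the record (assumed character)


def encode_record(fields: dict) -> str:
    """Serialize one record; only the fields that are present are stored."""
    for name, value in fields.items():
        for text in (name, value):
            if any(sep in text for sep in (NAME_VALUE_SEP, FIELD_SEP, RECORD_END)):
                raise ValueError("field text must not contain a separator character")
    body = FIELD_SEP.join(f"{n}{NAME_VALUE_SEP}{v}" for n, v in fields.items())
    return body + RECORD_END


def decode_records(data: str) -> list:
    """Parse a run of encoded records back into dictionaries."""
    records = []
    for raw in data.split(RECORD_END):
        if not raw:
            continue  # skip the empty piece after the final terminator
        record = {}
        for field in raw.split(FIELD_SEP):
            name, value = field.split(NAME_VALUE_SEP, 1)
            record[name] = value
        records.append(record)
    return records
```

Because every record carries its own field names and terminator, two records in the same file may have different fields and different lengths, which is why such files need the extra per-record information described above.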

Figure: Types of record organization. (a) Unspanned. (b) Spanned.

Unordered Files
Also called a heap or a pile file. New records are inserted at the end of the file. To search for a record, a linear search through the file records is necessary. This requires reading and searching half the file blocks on average, and is hence quite expensive. Record insertion is quite efficient. Reading the records in order of a particular field requires sorting the file records.

Ordered Files
Also called a sequential file. File records are kept sorted by the values of an ordering field. Insertion is expensive: records must be inserted in the correct order. It is common to keep a separate unordered overflow (or transaction) file for new records to improve insertion efficiency; this is periodically merged with the main ordered file. A binary search can be used to search for a record on its ordering field value. For a file of b blocks, this requires reading and searching about log2(b) blocks on average, an improvement over the b/2 blocks of a linear search. Reading the records in order of the ordering field is quite efficient.
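The difference between the two search costs can be illustrated with a short sketch. It models an ordered file as a list of non-empty blocks, each a sorted list of ordering-field values, and counts how many blocks a binary search reads. The function names and the in-memory representation are assumptions made for illustration.

```python
import math


def linear_search_blocks(b: int) -> float:
    """Average number of blocks read by a linear search that finds its record."""
    return b / 2


def binary_search_blocks(b: int) -> float:
    """Approximate number of blocks read by a binary search over b blocks."""
    return math.log2(b)


def binary_search_file(blocks: list, key: int) -> tuple:
    """Binary search over an ordered file; returns (found, blocks_read)."""
    lo, hi = 0, len(blocks) - 1
    blocks_read = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # models reading one block from disk
        blocks_read += 1
        if key < block[0]:
            hi = mid - 1             # key can only be in an earlier block
        elif key > block[-1]:
            lo = mid + 1             # key can only be in a later block
        else:
            return key in block, blocks_read
    return False, blocks_read
```

For a file of 1024 blocks, the linear search reads 512 blocks on average, while the binary search reads roughly 10.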

Ordered Files: Average Access Times
The following table shows the average access time to access a specific record for a given type of file. (The table is not included in this transcription.)

Collision Resolution
There are numerous methods for collision resolution, including the following:

Open addressing: Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.

Chaining: For this method, various overflow locations are kept, usually by extending the array with a number of overflow positions. In addition, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.

Multiple hashing: The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing or applies a third hash function and then uses open addressing if necessary.
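The first two methods can be sketched as follows. This is a minimal illustration, not a definitive implementation: the hash function `key % M`, the use of non-negative integer keys, and all names are assumptions. In the chaining variant a new record is linked at the end of the existing chain, which is one of several possible choices.

```python
def insert_open_addressing(table: list, key: int) -> int:
    """Open addressing: probe subsequent positions until an empty one is found."""
    m = len(table)
    start = key % m
    for step in range(m):
        pos = (start + step) % m
        if table[pos] is None or table[pos] == key:
            table[pos] = key
            return pos
    raise RuntimeError("table is full")


def search_open_addressing(table: list, key: int):
    """Follow the same probe sequence; an empty position ends the search."""
    m = len(table)
    start = key % m
    for step in range(m):
        pos = (start + step) % m
        if table[pos] is None:
            return None
        if table[pos] == key:
            return pos
    return None


class ChainedTable:
    """Chaining: M hash positions extended with a number of overflow positions."""

    def __init__(self, m: int, overflow_positions: int):
        self.m = m
        self.keys = [None] * (m + overflow_positions)
        self.next = [None] * (m + overflow_positions)  # pointer field per location
        self.free = m                                   # next unused overflow position

    def insert(self, key: int) -> int:
        pos = key % self.m
        if self.keys[pos] is None:
            self.keys[pos] = key
            return pos
        while self.next[pos] is not None:               # walk to the end of the chain
            pos = self.next[pos]
        if self.free >= len(self.keys):
            raise RuntimeError("overflow area is full")
        new_pos = self.free
        self.free += 1
        self.keys[new_pos] = key
        self.next[pos] = new_pos                        # link the overflow location
        return new_pos

    def search(self, key: int):
        pos = key % self.m
        while pos is not None and self.keys[pos] is not None:
            if self.keys[pos] == key:
                return pos
            pos = self.next[pos]
        return None
```

With a table of 7 positions, the keys 10, 17 and 24 all hash to position 3. Open addressing places them in positions 3, 4 and 5; chaining with three overflow positions places them in positions 3, 7 and 8.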

External Hashing
Hashing for disk files is called external hashing. The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M-1. Typically, a bucket corresponds to one disk block (or a fixed number of disk blocks). One of the file fields is designated to be the hash key of the file. The record with hash key value K is stored in bucket i, where i = h(K) and h is the hashing function. Search is very efficient on the hash key. Collisions occur when a new record hashes to a bucket that is already full. An overflow file is kept for storing such records. Overflow records that hash to each bucket can be linked together.

To reduce overflow records, a hash file is typically kept 70-80% full. The hash function h should distribute the records uniformly among the buckets; otherwise, search time will be increased because many overflow records will exist.

Main disadvantages of static external hashing:
- The fixed number of buckets M is a problem if the number of records in the file grows or shrinks.
- Ordered access on the hash key is quite inefficient (it requires sorting the records).
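A static hash file of this kind can be sketched in a few lines. The class below is a hypothetical in-memory model: the hash function `key % M`, the integer keys, and the cost model (one block read for the primary bucket plus one per overflow record examined) are all assumptions made for illustration.

```python
class StaticHashFile:
    """M fixed buckets; records that do not fit go to a per-bucket overflow chain."""

    def __init__(self, num_buckets: int, bucket_capacity: int):
        self.m = num_buckets
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(num_buckets)]
        self.overflow = [[] for _ in range(num_buckets)]  # linked overflow records

    def h(self, key: int) -> int:
        return key % self.m                               # assumed hash function

    def insert(self, key: int, record: str) -> None:
        i = self.h(key)
        if len(self.buckets[i]) < self.capacity:
            self.buckets[i].append((key, record))
        else:
            self.overflow[i].append((key, record))        # collision: bucket is full

    def search(self, key: int) -> tuple:
        """Return (record or None, number of block reads under the assumed cost model)."""
        i = self.h(key)
        reads = 1                                         # read the primary bucket
        for k, rec in self.buckets[i]:
            if k == key:
                return rec, reads
        for k, rec in self.overflow[i]:
            reads += 1                                    # follow one overflow pointer
            if k == key:
                return rec, reads
        return None, reads

    def load_factor(self) -> float:
        stored = sum(len(b) for b in self.buckets) + sum(len(o) for o in self.overflow)
        return stored / (self.m * self.capacity)
```

The sketch shows why overflow chains hurt: a record in the primary bucket costs one read, while every overflow record on the way to the target adds another. Because M is fixed at construction time, the only remedy for long chains is to rebuild the file with a larger M.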

Figure: Overflow handling (chaining).

Dynamic and Extendible Hashing Techniques
Hashing techniques are adapted to allow the dynamic growth and shrinking of the number of file records. These techniques include the following: dynamic hashing, extendible hashing, and linear hashing. Both dynamic and extendible hashing use the binary representation of the hash value h(K) in order to access a directory. In dynamic hashing the directory is a binary tree. In extendible hashing the directory is an array of size 2^d, where d is called the global depth.

Dynamic and Extendible Hashing (continued)
The directories can be stored on disk, and they expand or shrink dynamically. Directory entries point to the disk blocks that contain the stored records. An insertion in a disk block that is full causes the block to split into two blocks, and the records are redistributed between the two blocks. The directory is updated appropriately. Dynamic and extendible hashing do not require an overflow area. Linear hashing does require an overflow area but does not use a directory. Blocks are split in linear order as the file expands.
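The splitting and directory-doubling behaviour of extendible hashing can be sketched as follows. This is a simplified in-memory model under stated assumptions: keys are distinct non-negative integers, the hash value is the key itself, and the directory is indexed by the low-order d bits of the hash value (indexing by the leading bits is an equally valid convention). The per-bucket local depth and all class and method names are illustrative choices.

```python
class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth   # number of hash bits this bucket depends on
        self.items = {}


class ExtendibleHashFile:
    def __init__(self, bucket_capacity: int = 2):
        self.capacity = bucket_capacity
        self.global_depth = 0            # d: the directory has 2**d entries
        self.directory = [Bucket(0)]

    def _index(self, key: int) -> int:
        # Assumption: h(K) = K, and the low-order d bits select the directory entry.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key: int):
        return self.directory[self._index(key)].items.get(key)

    def insert(self, key: int, record: str) -> None:
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = record
                return
            self._split(bucket)          # full bucket: split, then retry

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.global_depth:
            # Double the directory; entry i + 2**d initially mirrors entry i.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)
        # Entries that pointed to the old bucket and have the new bit set move over.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = new_bucket
        # Redistribute the old bucket's records between the two buckets.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._index(k)].items[k] = v
```

Inserting the keys 0 through 7 with a bucket capacity of 2 grows the directory from 1 entry to 4 (global depth 2) without any overflow area: each full bucket is split in place, and the directory doubles only when the bucket being split already uses all d bits.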

Extendible Hashing