Biostatistics 615 / 815

Similar documents
Hashing. Biostatistics 615 Lecture 12

Lecture 3. Review. CS 141 Lecture 3 By Ziad Kobti -Control Structures Examples -Built-in functions. Conditions: Loops: if( ) / else switch

Tables. The Table ADT is used when information needs to be stored and acessed via a key usually, but not always, a string. For example: Dictionaries

Table ADT and Sorting. Algorithm topics continuing (or reviewing?) CS 24 curriculum

Announcements. Biostatistics 615/815 Lecture 8: Hash Tables, and Dynamic Programming. Recap: Example of a linked list

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

Data Structures And Algorithms

Hash Table and Hashing

CS 310 Hash Tables, Page 1. Hash Tables. CS 310 Hash Tables, Page 2

Hashing for searching

Hashing Algorithms. Hash functions Separate Chaining Linear Probing Double Hashing

Outline. 1 Hashing. 2 Separate-Chaining Symbol Table 2 / 13

CS 223: Data Structures and Programming Techniques. Exam 2

Lecture Notes on Interfaces

CSE100. Advanced Data Structures. Lecture 21. (Based on Paul Kube course materials)

Midterm 3 practice problems

Hash Tables. Hashing Probing Separate Chaining Hash Function

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

Tirgul 7. Hash Tables. In a hash table, we allocate an array of size m, which is much smaller than U (the set of keys).

Hash Tables. Computer Science S-111 Harvard University David G. Sullivan, Ph.D. Data Dictionary Revisited

Biostatistics 615/815 - Lecture 2 Introduction to C++ Programming

4 Hash-Based Indexing

CS 223: Data Structures and Programming Techniques. Exam 2. April 19th, 2012

CS 315 Data Structures mid-term 2

Random Number Generation. Biostatistics 615/815 Lecture 16

Computer Science 302 Spring 2007 Practice Final Examination: Part I

Hashing. Manolis Koubarakis. Data Structures and Programming Techniques

Unit #5: Hash Functions and the Pigeonhole Principle

Chapter 20 Hash Tables

Algorithms in Systems Engineering ISE 172. Lecture 12. Dr. Ted Ralphs

8/5/10 TODAY'S OUTLINE. Recursion COMP 10 EXPLORING COMPUTER SCIENCE. Revisit search and sorting using recursion. Recursion WHAT DOES THIS CODE DO?

Merge Sort. Biostatistics 615/815 Lecture 11

cs3157: another C lecture (mon-21-feb-2005) C pre-processor (3).

CMSC 341 Hashing (Continued) Based on slides from previous iterations of this course

Lecture 12 Hash Tables

Merge Sort. Biostatistics 615/815 Lecture 9

Data Structures and Algorithms 2018

CSL 201 Data Structures Mid-Semester Exam minutes

DATA STRUCTURES/UNIT 3

C-1. Overview. CSE 142 Computer Programming I. Review: Computer Organization. Review: Memory. Declaring Variables. Memory example

CS 137 Part 8. Merge Sort, Quick Sort, Binary Search. November 20th, 2017

C:\Temp\Templates. Download This PDF From The Web Site

Sample Problems for Quiz # 3

Summer Final Exam Review Session August 5, 2009

Lecture 16. Reading: Weiss Ch. 5 CSE 100, UCSD: LEC 16. Page 1 of 40

Hashing. Hashing Procedures

Lecture 6: Hashing Steven Skiena

CS3157: Advanced Programming. Outline

CSE 214 Computer Science II Searching

Topics Applications Most Common Methods Serial Search Binary Search Search by Hashing (next lecture) Run-Time Analysis Average-time analysis Time anal

Structured programming

Today: Finish up hashing Sorted Dictionary ADT: Binary search, divide-and-conquer Recursive function and recurrence relation

CS1020 Data Structures and Algorithms I Lecture Note #15. Hashing. For efficient look-up in a table

CS 206 Introduction to Computer Science II

Successful vs. Unsuccessful

Binary Search Trees. Manolis Koubarakis. Data Structures and Programming Techniques

HO #13 Fall 2015 Gary Chan. Hashing (N:12)

Introduction to Computers and Programming. Today

You must include this cover sheet. Either type up the assignment using theory5.tex, or print out this PDF.

Quick Sort. Biostatistics 615 Lecture 9

CMSC 132: Object-Oriented Programming II. Hash Tables

ECE 242 Data Structures and Algorithms. Hash Tables II. Lecture 25. Prof.

EL2310 Scientific Programming

The C Programming Language Guide for the Robot Course work Module

Quiz 0 Answer Key. Answers other than the below may be possible. Multiple Choice. 0. a 1. a 2. b 3. c 4. b 5. d. True or False.

Introduction hashing: a technique used for storing and retrieving information as quickly as possible.

Dictionary. Dictionary. stores key-value pairs. Find(k) Insert(k, v) Delete(k) List O(n) O(1) O(n) Sorted Array O(log n) O(n) O(n)

CSCE 2014 Final Exam Spring Version A

About this exam review

Hashing 1. Searching Lists

C Language Part 2 Digital Computer Concept and Practice Copyright 2012 by Jaejin Lee

C Arrays. Group of consecutive memory locations Same name and type. Array name + position number. Array elements are like normal variables

Solutions to Chapter 8

My malloc: mylloc and mhysa. Johan Montelius HT2016

Algorithm Efficiency & Sorting. Algorithm efficiency Big-O notation Searching algorithms Sorting algorithms

Lecture 12 Notes Hash Tables

Solution for Data Structure

CSE 332 Winter 2015: Midterm Exam (closed book, closed notes, no calculators)

CSE548, AMS542: Analysis of Algorithms, Fall 2012 Date: October 16. In-Class Midterm. ( 11:35 AM 12:50 PM : 75 Minutes )

1 CSE 100: HASH TABLES

Algorithms, Data Structures, and Problem Solving

Common Misunderstandings from Exam 1 Material

Algorithms & Data Structures

Shell Sort. Biostatistics 615/815

Section 1: True / False (1 point each, 15 pts total)

TABLES AND HASHING. Chapter 13

BBM371& Data*Management. Lecture 6: Hash Tables

Hashing. Dr. Ronaldo Menezes Hugo Serrano. Ronaldo Menezes, Florida Tech

Chapter 11 Introduction to Programming in C

Lecture 4. Hashing Methods

Basic C Programming (2) Bin Li Assistant Professor Dept. of Electrical, Computer and Biomedical Engineering University of Rhode Island

MA/CSSE 473 Day 23. Binary (max) Heap Quick Review

Biostatistics 615/815 Lecture 6: Linear Sorting Algorithms and Elementary Data Structures

Parallel Random-Access Machines

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ.! Instructor: X. Zhang Spring 2017

Algorithms with numbers (2) CISC4080, Computer Algorithms CIS, Fordham Univ. Acknowledgement. Support for Dictionary

Programming in C++ Prof. Partha Pratim Das Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Lecture 6 Sorting and Searching

Algorithms and Data Structures

Chapter 6. Hash-Based Indexing. Efficient Support for Equality Search. Architecture and Implementation of Database Systems Summer 2014

Transcription:

Hashing Biostatistics 615 / 815 Lecture 9

Scheduling Lectures on Hashing: October 6 & 8, 2008 Review session: October 14, 2008 Mid-term (in class): October 16, 2008

Last Lecture Merge Sort Bottom-Up Top-Down Divide and conquer sort with guaranteed N log N running time Requires additional auxiliary storage

Today Hashing Algorithms Fast way to organize data prior to searching Trade savings in computing time for Trade savings in computing time for additional memory use

Almost Trivia Short detour Finding gprimes How do we find all prime numbers less How do we find all prime numbers less than some number?

Eratosthenes Sieve List all numbers less than N Ignore 0 and 1 Find the smallest number in the list Mark this number as prime Remove all its multiples from the list Repeat previous step until list is empty

void list_primes() { int i, j, a[n]; The Sieve in C for (i = 2; i < N; i++) a[i] = 1; for (i = 2; i < N; i++) ) if (a[i]) for (j = i * i; j < N; j += i) a[j] = 0; for (i = 2; i < N; i++) if (a[i]) printf("%4d is prime\n", i); }

Notes on Prime Finding The algorithm is extremely fast Takes <1 sec to find all primes <1,000,000 Performance can be improved by tweaking the inner loop Can you suggest a way? Illustrates useful idea: Illustrates useful idea: Use values an indices into an array where items denote presence / absence of the value in a set.

Idea If all items are integers within a short range speed up search operations avoid having to sort data How?

Even better! With this strategy Adding an item to the collection takes constant time Searching through the collection takes constant time Independent of the number of objects in the collection!

Previous Search Strategies Place data into an array O(N) Sort array containing data O(N log N) Search for items of interest Search for items of interest log N per search

Using Items as Array Indexes Place data into an array O(N) Sort array containing data Search for items of interest Sea c o te s o te est O(1) per search

Another Example Consider a sorted array with N elements Assume elements uniformly distributed between 0 and 1 Using binary search, checking for a specific element is about O(log 2 N) Can we do better by taking into account data is Can we do better by taking into account data is uniformly distributed?

Improving on Binary Search Let N = 100 What are likely locations for the values 0.0001 0.4281 0.9941 In fact, we can improve on binary search so that we need only about ½log 2 N comparisons.

General Principle If the elements we are searching for provide information about their location in the array, we can reduce search times We will describe a way to convert an arbitrary element of interest into a likely array location In fact, a series of locations!

Hashing Method for converting arbitrary items into array indexes Items can always serve as array indexes A different approach to searching Not (primarily) based on comparisons between elements

Time Space Trade Off If memory were no issue Could allocate arbitrarily large array so that each possible item could be a unique index If computing time were no issue Could use linear search to identify matches Hashing balances these two extremes

Components of Hashing Hash Function Generates table address for individual key Collision-Resolution Strategy Deals with keys for which the Hash Function generates identical addresses

Desirable Hash Functions Poor Hash Function: Data is clustered Freque ency Good Hash Functions: Data is Evenly Distributed Hash Value

Hash Function #1 Assume indexes are in range [0, M 1] Items are floating point values between 0..1 Multiply by M and round

Hash Function #1 In general, if items take Minimum value min Maximum value max Define hash function as (item min) / (max min) * M

Unfortunately If the items are not randomly distributed within their range Hash function will generate a lot of collisions. Better strategies exist

Hash Function #2 For integers Ensure that t table size M is prime Define hash function as item modulus M Note: The modulus operators is % (in C)

Hash Function #2 For floating point values First map item into 0.. 1 range As before Multiply result by large integer (say 2 K ) and truncate Calculate modulus M (where M is prime)

Two Simple Hash Functions int hash_int(int item, int M) { return item % M; } int hash_double(double item, int M) { return (int) ((item min)/(max-min)* LARGE_NUMBER) % M; }

Hashing for Strings It is also possible to hash strings One way is to convert each string to a number List all possible characters that could occur Assign the value 1 to one of them, 2 to another, Although the conversion sounds cumbersome Although the conversion sounds cumbersome, it is built into the way C represents strings Each character is also a number

A Hash Function for Strings int hash_string(char * s, int M) { int hash = 0, mult = 127, i; for (i = 0; s[i]!= 0; i++) hash += (s[i] + hash*mult) % M; return hash; h } In C characters can be treated as numbers A string is an array of In C, characters can be treated as numbers. A string is an array of characters that terminates with the element zero.

Conflict Resolution: Separate Chaining What to do with items where the hash function returns the same value? One option is to make each entry in the hash table an array or list Each entry corresponds to a "chain of items"

Separate Chaining Example Key Hash Value Chains

Chaining in R (Relatively Simple Code) # Create a hash table as a list of vectors table <- vector(m, mode = "list) for (i in 1:M) table[[i]] <- c() # Add an element to the table h <- hash(item, M) table[[h]] <- c(item, table[[h]]) # Check if an element is in the table item %in% table[[hash(item, M)]]

Chaining in C++ (Certainly More Complicated!) #include "stdio.h" #include "stdlib.h" #define EMPTY -1 #define M 997 #define N 50 class ValueAndPointer { public: int value; ValueAndPointer * next; }; ValueAndPointer * hash[m]; void InitializeTable() { for (int i = 0; i < M; i++) hash[i] = NULL; }

Chaining in C++ (Adding an integer ) void Insert(int value) { int h = value % M; ValueAndPointer ** pointer = &(hash[h]); while (*pointer!= NULL) { if ((**pointer).value == value) // Value is already in the chain... return; pointer = &(**pointer).next; } // Value not found, add a new link to the chain *pointer = (ValueAndPointer *) malloc(sizeof(valueandpointer *)); (**pointer).next = NULL; (**pointer).value = value; }

Chaining in C++ (Finding an integer ) bool Find(int value) { int h = value % M; ValueAndPointer ** pointer = &(hash[h]); while (*pointer!= NULL) { if ((**pointer).value i == value) ) return 1; pointer = &(**pointer).next; } return false; }

Chaining in C++ (Checking the previous functions) int main(int argc, char ** argv) { srand(123456); InitializeTable(); for (int i = 0; i < N; i++) { int value = rand() % (N * 10); printf("inserting the value %d into table...\n", value); Insert(value); } for (int i = 0; i < N; i++) { int value = rand() % (N * 10); printf("checking for value %d in table...\n", value); if (Find(value)) printf(" FOUND!!!\n"); else printf(" Not found.\n"); } }

Properties of Separate Chaining If the hashing function results in random indexes N M expected number of entries in each chain N k 1 M k 1 1 M N k number of entries follows Binomial distribution (well approximated with ihpoisson distribution) ib i

Interesting Known Properties In most slots, no. of entries is close to average e α probability of an empty slot ~ 1.25 M M 1 M i= 1 i number of items before first collision "the birthday problem" number of items before all slots have one item "the coupon collector problem" α = N / M is the load factor

Notes on Hashing Good performance when Searching for elements Inserting elements Ineffective when Selecting elements based on rank Sorting elements

Recommended Reading Sedgewick, Chapter 14 Peterson W. W. (1957) IBM Journal of Research and Development 1:130-146 146