Randomized Algorithms Part Four

Announcements Problem Set Three due right now. Due Wednesday using a late day. Problem Set Four out, due next Monday, July 29. Play around with randomized algorithms! Approximate NP-hard problems! Explore a recent algorithm and why hashing matters! Handout: Guide to Randomized Algorithms also released.

Outline for Today Chained Hash Tables How can you compactly store a small subset of a large set of elements? Universal Hash Functions Groups of functions that distribute elements nicely.

Associative Structures The data structures we've seen so far are linear: Stacks, queues, priority queues, lists, etc. In many cases, we want to store data in an unordered fashion. Queries like Add element x. Remove element x. Is element x contained?

Bitvectors A bitvector is a data structure for storing a set of integers in the range {0, 1, 2, 3, …, Z − 1}. Store as an array of Z bits. If the bit at position x is 0, x does not appear in the set. If the bit at position x is 1, x appears in the set.

Bitvectors 000000000000000000000000000000000000000000

Bitvectors 010000000000000000000000000000000000000000

Bitvectors 010100000000000000000000000000000000000000

Bitvectors 010100010000000000000000000000000000000000

Analyzing Bitvectors What is the runtime for Inserting an element? Removing an element? Checking if an element is present? How much space is used if the bitvector contains all Z possible elements? How much space is used if the bitvector contains n of the Z possible elements?
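
To make these questions concrete, here is a minimal bitvector sketch in Python (the lecture doesn't prescribe a language; this is purely illustrative, and it uses a list of booleans where a real bitvector would pack actual bits). Each operation touches a single position, and the array always holds Z entries no matter how few elements are stored.

```python
class Bitvector:
    """Stores a subset of {0, 1, ..., Z - 1} using one flag per possible element."""
    def __init__(self, Z):
        self.bits = [False] * Z       # Theta(Z) space, even when the set is empty

    def insert(self, x):
        self.bits[x] = True           # O(1): turn on the bit for x

    def remove(self, x):
        self.bits[x] = False          # O(1): turn off the bit for x

    def contains(self, x):
        return self.bits[x]           # O(1): read the bit for x

# Mirrors the insertion sequence pictured above: insert 1, then 3, then 7.
s = Bitvector(42)
s.insert(1); s.insert(3); s.insert(7)
print(s.contains(3), s.contains(4))   # True False
```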

Another Idea Store elements in an unsorted array. To determine whether x is contained, scan over the array elements and return whether x is found. To add x, check to see if x is contained and, if not, append x. To remove x, check to see if x is contained and, if so, remove x.

Analyzing this Approach How much space is used if the array contains all Z possible elements? How much space is used if the array contains n of the Z possible elements? What is the runtime for Inserting an element? Removing an element? Checking if an element is present?
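
And here is the matching sketch of the unsorted-array approach (again an illustrative Python sketch): every operation has to scan the array, but only the n elements actually present are stored.

```python
class UnsortedArraySet:
    """Stores elements in an unsorted array; space is proportional to n."""
    def __init__(self):
        self.items = []               # only the elements actually in use

    def contains(self, x):
        return x in self.items        # O(n): linear scan

    def insert(self, x):
        if x not in self.items:       # O(n) membership check
            self.items.append(x)      # O(1) append

    def remove(self, x):
        if x in self.items:
            self.items.remove(x)      # O(n): scan, then shift elements down
```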

The Tradeoff Bitvectors are fast because we know where to look to find each element. Bitvectors are space-inefficient because we store one bit per possible element. Unsorted arrays are slow because we have to scan every element. Unsorted arrays are space-efficient because we only store the elements we use. This is a time-space tradeoff: we can improve performance by using more space.

Combining the Approaches Bitvectors always use a fixed amount of space and support fast lookups. Good when number of possible elements is low, bad when number of possible elements is large. Unsorted arrays use variable space and don't support fast lookups. Good when number of used elements is low, bad when number of used elements is large.

Chained Hash Tables Suppose we have a universe U consisting of all possible elements that we could want to store. Create m buckets, numbered {0, 1, 2, …, m − 1}, as an array of length m. Each bucket is an unsorted array of elements. Find a rule associating each element in U with some bucket. To see if x is contained, look in the bucket x is associated with and see if x is there. To add x, see if x is contained and add it to the appropriate bucket if it's not. To remove x, see if x is contained and remove it from its bucket if it is.
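
Putting those pieces together, here is a minimal chained hash table in Python (an illustrative sketch; the bucket rule is passed in as a function rather than fixed). The usage at the bottom reproduces the "(length of first name) mod 4" rule from the example that follows.

```python
class ChainedHashTable:
    """m buckets, each an unsorted list; a rule maps every element to a bucket."""
    def __init__(self, m, bucket_rule):
        self.m = m
        self.bucket_rule = bucket_rule            # associates elements with buckets
        self.buckets = [[] for _ in range(m)]

    def _bucket(self, x):
        return self.buckets[self.bucket_rule(x) % self.m]

    def contains(self, x):
        return x in self._bucket(x)               # scan only one bucket

    def insert(self, x):
        if x not in self._bucket(x):
            self._bucket(x).append(x)

    def remove(self, x):
        if x in self._bucket(x):
            self._bucket(x).remove(x)

# The rule from the example below: (length of first name) mod 4.
table = ChainedHashTable(4, lambda name: len(name.split()[0]))
table.insert("Mary Ann Evans")
table.insert("Samuel Clemens")
print(table.contains("Samuel Clemens"))   # True
```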

(Figure: the nine names Mary Ann Evans, Piers Jacob, Julius Marx, Malcolm Little, Jean-Baptiste Poquelin, Moses Horwitz, Samuel Clemens, Theodore Geisel, and David Cornwell distributed across Buckets 0–3. Association rule: (length of first name) mod 4.)

(Figure: the same nine names, this time all crowded together into a single bucket. Association rule: Party in bucket 1!)

Analyzing Runtime The three basic operations on a hash table (insert, remove, lookup) all run in time O(1 + X), where X is the total number of elements in the bucket visited. (Why is there a 1 here?) Runtime depends on how well the elements are distributed. If n elements are distributed evenly across all the buckets, runtime is O(1 + n / m). If there are n elements distributed all into the same bucket, runtime is O(n).

Hash Functions Chained hash tables only work if we have a mechanism for associating elements of the universe with buckets. A hash function is a function h : U → {0, 1, 2, …, m − 1}. In other words, for any x ∈ U, the value of h(x) is the bucket that x belongs to. Since h is a mathematical function, it's defined for all inputs in U and always produces the same output given the same input. For simplicity, we'll assume hash functions can be computed in O(1) time.

Choosing Good Hash Functions The efficiency of a hash table depends on the choice of hash function. In the upcoming analysis, we will assume |U| ≫ m (that is, there are vastly more elements in the universe than there are buckets in the hash table). Assume at least |U| > mn, but probably more.

A Problem Theorem: For any hash function h, there is a series of n values that, if stored in the table, all hash to the same bucket. Proof: Because there are m buckets, under the assumption that |U| > mn, by the pigeonhole principle there must be at least n + 1 elements that hash to the same bucket. Inserting any n of those elements into the hash table places all those elements into the same bucket.

A Problem No matter how clever we are with our choice of hash function, there will always be an input that will degenerate operations to worst-case Ω(n) time. Theoretically, limits the worst-case effectiveness of chained hashing. Practically, leads to denial-of-service attacks.
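
The proof is constructive, and a small sketch shows how easy it is to exploit (illustrative Python with a made-up fixed bucket rule, not a recommended hash function): for any fixed, publicly known rule, an adversary can simply scan inputs until some bucket has collected n of them.

```python
from collections import defaultdict

def find_colliding_inputs(h, n, universe):
    """For ANY fixed hash function h, return n inputs that all share a bucket.
    By the pigeonhole argument above, this succeeds once more than m*(n - 1)
    distinct inputs have been scanned."""
    buckets = defaultdict(list)
    for x in universe:
        buckets[h(x)].append(x)
        if len(buckets[h(x)]) == n:
            return buckets[h(x)]
    return None   # universe was too small

# Example with an arbitrary fixed rule over a 10-bucket table:
bad_inputs = find_colliding_inputs(lambda x: (7 * x + 3) % 10,
                                   n=100, universe=range(10**6))
print(len(bad_inputs))   # 100 values that all land in the same bucket
```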

Randomness to the Rescue For any fixed hash function, there is a degenerate series of inputs. The hash function itself cannot involve randomness. (Why?) However, what if we choose which hash function to use at random?

A (Very Strong) Assumption Let's suppose that when we create our hash table, we choose a totally random function h : U → {0, 1, 2, …, m − 1} as our hash function. This has some issues; more on that later. Under this assumption, what would the expected cost of the three major hash table operations be?

Some Notation As before, let n be the number of elements in a hash table. Let those elements be x₁, x₂, …, xₙ. Suppose that the element that we're looking up is the element z. Perhaps z is in the list; perhaps it's not.

Analyzing Efficiency Suppose we perform an operation (insert, lookup, delete) on element z. The runtime is proportional to the number of elements in the same bucket as z. For any xᵢ, let Cᵢ be an indicator variable that is 1 if xᵢ and z hash to the same bucket (i.e. h(xᵢ) = h(z)) and is 0 otherwise. Let the random variable X be equal to the number of elements in the same bucket as z. Then X = C₁ + C₂ + ⋯ + Cₙ.

Analyzing Efficiency E[X] = E[C₁ + C₂ + ⋯ + Cₙ] = E[C₁] + E[C₂] + ⋯ + E[Cₙ] = P(h(x₁) = h(z)) + ⋯ + P(h(xₙ) = h(z)) = n · (1 / m) = n / m. So the expected cost of an operation is O(1 + E[X]) = O(1 + n / m).
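
A quick empirical check of this derivation (an illustrative Python simulation of the "totally random function" model, treating z as an element distinct from x₁, …, xₙ):

```python
import random

def average_collisions_with_z(n, m, trials=20000):
    """Estimate E[X], where X counts stored elements sharing z's bucket and a
    fresh uniformly random hash function is drawn on every trial."""
    total = 0
    for _ in range(trials):
        h_values = [random.randrange(m) for _ in range(n)]   # h(x_1), ..., h(x_n)
        h_z = random.randrange(m)                            # h(z)
        total += sum(1 for v in h_values if v == h_z)        # X for this trial
    return total / trials

# Should print a number close to n / m = 100 / 25 = 4.
print(average_collisions_with_z(n=100, m=25))
```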

Analyzing Efficiency Assuming we choose a function uniformly at random from all functions, the expected cost of a hash table operation is O(1 + n / m). What's the space usage? O(m) space for buckets. O(n) space for elements. Some unknown amount of space to store the hash function.

A Problem We assume h is chosen uniformly at random from all functions from U to {0, 1, …, m − 1}. There are m^|U| possible functions from U to {0, 1, …, m − 1}. (Why?) How much memory does it take to store h? If we assign k bits to store h, there are only 2^k possible combinations of those bits, so we need 2^k ≥ m^|U|. That means we need at least |U| log₂ m bits to store h. Question: How can we get this performance without the huge space penalty?
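
For a sense of scale (illustrative numbers, not from the slides): with 64-bit keys, so |U| = 2⁶⁴, and m = 1024 buckets, storing an arbitrary function would take at least 2⁶⁴ · log₂ 1024 = 2⁶⁴ · 10 bits, which is roughly 2.3 × 10¹⁹ bytes, about 23 exabytes, vastly more memory than the table itself.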

Analyzing Efficiency E[X] = E[C₁ + C₂ + ⋯ + Cₙ] = E[C₁] + ⋯ + E[Cₙ] = P(h(x₁) = h(z)) + ⋯ + P(h(xₙ) = h(z)) = n · (1 / m) = n / m. So the expected cost of an operation is O(1 + E[X]) = O(1 + n / m). Notice that the only place the "totally random function" assumption is used is the step P(h(xᵢ) = h(z)) = 1 / m; any way of choosing h that makes each pairwise collision probability at most 1 / m gives the same bound.

Universal Hash Functions A set H of hash functions from U to {0, 1, …, m − 1} is called a universal family of hash functions iff for any x, y ∈ U where x ≠ y, if h is drawn uniformly at random from H, then P(h(x) = h(y)) ≤ 1 / m. In other words, the probability of a collision between two elements is at most 1 / m as long as we choose h from H uniformly at random.
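
The definition can be checked by brute force on toy examples. The sketch below (illustrative Python) computes the collision probability of every pair of distinct elements; as a sanity check, it confirms on a tiny universe the claim made two slides from now, that the family of all possible functions is universal.

```python
from fractions import Fraction
from itertools import combinations, product

def is_universal(family, universe, m):
    """Check the definition directly: for every pair x != y, the fraction of
    functions in the family that collide on x and y must be at most 1/m."""
    for x, y in combinations(universe, 2):
        collisions = sum(1 for h in family if h[x] == h[y])
        if Fraction(collisions, len(family)) > Fraction(1, m):
            return False
    return True

# Toy example: ALL functions from U = {0, 1, 2} to {0, 1}, each stored as the
# tuple of its outputs, so h[x] is the bucket that x is sent to.
U, m = range(3), 2
all_functions = list(product(range(m), repeat=len(U)))   # 2^3 = 8 functions
print(is_universal(all_functions, U, m))                  # True
```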

Universal Hashing E[X] = E[C₁ + C₂ + ⋯ + Cₙ] = E[C₁] + ⋯ + E[Cₙ] = P(h(x₁) = h(z)) + ⋯ + P(h(xₙ) = h(z)) ≤ n · (1 / m) = n / m. So the expected cost of an operation is O(1 + E[X]) = O(1 + n / m).

Universal Hash Functions The set of all possible functions from U to {0, 1, …, m − 1} is a universal family of hash functions. However, it requires Ω(|U| log m) space. For certain types of elements, we can find families of universal hash functions that we can evaluate in O(1) time and store in O(1) space. The Good News: The intuitions behind these functions are quite nice. The Bad News: Formally proving that they're universal requires number theory and/or field theory, which is beyond the scope of this class.

Simple Universal Hash Functions We'll start with a simplifying assumption and generalize from there. Assume U = {0, 1, 2, …, m − 1} and that m is prime. (We'll relax this later.) Let H be the set of all functions of the form h(x) = ax + b (mod m), where a, b ∈ {0, 1, 2, …, m − 1}. Claim: H is universal.
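
A sketch of how this family is used in practice (illustrative Python; the particular prime is an arbitrary choice): draw a and b once when the table is created, then hash with the resulting function.

```python
import random

def random_linear_hash(m):
    """Sample h(x) = (a*x + b) mod m uniformly from the family H described
    above. Assumes the universe is {0, 1, ..., m - 1} and that m is prime."""
    a = random.randrange(m)
    b = random.randrange(m)
    return lambda x: (a * x + b) % m

h = random_linear_hash(10007)   # 10007 is prime (an illustrative choice)
print(h(42), h(137))            # two bucket indices in {0, ..., 10006}
```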

Showing Universality We'll show H is universal by showing it obeys a stronger property called 2-independence: For any x₁, x₂ ∈ U where x₁ ≠ x₂, if h is chosen uniformly at random from H, then for any y₁ and y₂ we have P(h(x₁) = y₁ and h(x₂) = y₂) = 1 / m². (The probability that you can guess where any two distinct elements will be hashed is 1 / m².) Claim: Any 2-independent family of hash functions is universal.
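
Filling in the claim with a one-line argument (implicit in the slide, spelled out here): if H is 2-independent, then for any x₁ ≠ x₂ we have P(h(x₁) = h(x₂)) = Σ over y of P(h(x₁) = y and h(x₂) = y) = m · (1 / m²) = 1 / m, which is exactly the collision bound that universality requires.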

h(x) = ax + b

Showing Universality If h(x) = ax + b (mod m), knowing two points on the line determines the entire line. You can only guess the outputs at two points by guessing both coefficients, so the probability is 1 / m²! It takes some more advanced math to formalize why this works; the argument revolves around the fact that the integers modulo a prime m form a finite field.

Generalizing the Result This hash function only works if m is prime and |U| = m. Suppose we can break apart any x ∈ U into k integer blocks x₁, x₂, …, xₖ, where each block is between 0 and m − 1. Then the set H of all hash functions of the form h(x) = a₁x₁ + a₂x₂ + ⋯ + aₖxₖ + b (mod m) is universal. Intuitively, after evaluating k − 1 of the products, you're left with a linear function in one remaining block and the same argument applies.
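
A sketch of the generalized family in Python (illustrative; the block decomposition at the bottom, splitting a short ASCII string into one block per character, is just one possible choice):

```python
import random

def random_block_hash(m, k):
    """Sample h(x) = (a_1*x_1 + ... + a_k*x_k + b) mod m from the family H
    described above. The input must be given as k integer blocks, each in
    {0, ..., m - 1}; m is assumed to be prime."""
    a = [random.randrange(m) for _ in range(k)]
    b = random.randrange(m)
    def h(blocks):
        assert len(blocks) == k
        return (sum(ai * xi for ai, xi in zip(a, blocks)) + b) % m
    return h

# Example: hash 8-character ASCII strings, one block per character, m = 257 > 255.
h = random_block_hash(m=257, k=8)
print(h([ord(c) for c in "hashable"]))   # a bucket index in {0, ..., 256}
```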

A Quick Aside Most programming languages associate a hash code with each object: Java: Object.hashCode Python: hash C++: std::hash Unless special care is taken, there always exists the possibility of extensive hash collisions!
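
As a concrete illustration (a CPython implementation detail, not part of the Python language specification, and not something from the slides): CPython reduces integer hash codes modulo sys.hash_info.modulus (2⁶¹ − 1 on 64-bit builds), so collisions among integers are trivial to construct.

```python
import sys

# Any two integers that differ by the hash modulus receive identical hash codes.
M = sys.hash_info.modulus
print(hash(5), hash(5 + M), hash(5) == hash(5 + M))   # 5 5 True (on CPython)
```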

Looking Forward This is not the only type of hash table; others exist as well: Dynamic perfect hash tables have worst-case O(1) lookup times and O(n) total storage space, but use a bit more memory. Open addressing hash tables avoid chaining and have better locality, but require stronger guarantees on the hash function. Hash functions have lots of applications beyond hash tables; you'll see one in the problem set.

Next Time Greedy Algorithms Interval Scheduling