Search Engine Report
Momin Irfan, Luke Wood
May 02 2016
Description of Data Structures:

AVL Tree: An AVL Tree is a specialization of the binary search tree. Like a binary search tree, it stores data in nodes that relate to one another based on the data they contain. By convention, if the data of a node being added to the structure is less than the data of an existing node, it is stored to the left of that node; conversely, if it is greater, it is stored to the right. This allows the computer to discard close to half of the remaining data at each step of a search or insertion, provided this property (known as the binary search property) holds. This basic idea leads to the conclusion that searching and inserting in a binary search tree is O(lg(n)). However, given a certain data set, it is possible to degenerate this relationship to O(n). Say, for example, a user enters values into a tree in increasing order. Each value will always be added to the right of the previous one, and therefore the tree will be no different from a linked list. To combat this, there exists the AVL tree. The AVL tree keeps itself from becoming lopsided, or node heavy on one end. It accomplishes this by maintaining the height of each node. If the heights of a node's two subtrees differ by 2 or more, the AVL tree rebalances while still adhering to the binary search property. It does this in one of four ways: rotating the alpha node (the unbalanced node) with its left child, with its right child, doubly with its left child, or doubly with its right child. All of these operations restore a balanced tree. This guarantees O(lg(n)) for both insertion and search, as no degeneration can occur.
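The insertion and rebalancing logic described above can be sketched as follows. This is a minimal illustrative Python translation, not the project's actual implementation; the names (Node, insert, rotate_left, rotate_right) are our own for this sketch.

```python
class Node:
    """One node of an AVL tree, tracking its own height."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1

def height(n):
    return n.height if n else 0

def update(n):
    n.height = 1 + max(height(n.left), height(n.right))

def balance(n):
    # positive => left-heavy, negative => right-heavy
    return height(n.left) - height(n.right)

def rotate_right(y):
    # rotate the alpha node with its left child
    x = y.left
    y.left = x.right
    x.right = y
    update(y)
    update(x)
    return x

def rotate_left(x):
    # rotate the alpha node with its right child
    y = x.right
    x.right = y.left
    y.left = x
    update(x)
    update(y)
    return y

def insert(root, key):
    """BST insert, then rebalance on the way back up."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    update(root)
    b = balance(root)
    if b > 1 and key < root.left.key:      # single rotation with left child
        return rotate_right(root)
    if b < -1 and key > root.right.key:    # single rotation with right child
        return rotate_left(root)
    if b > 1:                              # double rotation with left child
        root.left = rotate_left(root.left)
        return rotate_right(root)
    if b < -1:                             # double rotation with right child
        root.right = rotate_right(root.right)
        return rotate_left(root)
    return root
```

Inserting the keys 1 through 7 in increasing order, the worst case for a plain binary search tree, produces a perfectly balanced tree of height 3 rooted at 4 rather than a 7-node linked list.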
Hash Table: A hash table is a structure that works very similarly to an array. Each element has an index, determined by the order of addition or by various other schemes. Where arrays and hash tables differ is in how that index is chosen. Whereas in an array an element's address is computed as base + index * element size (given homogeneous typing), a hash table applies a hash function to the key, producing a large number. The hash table takes advantage of the fact that this function is repeatable: it mods (%) that value by its table size to find an index into the array. To locate the value again, you simply hash the key a second time, which leads back to the same index. This raises the question of how to deal with collisions, that is, two elements mapping to the same bucket. One approach is linear probing, which simply moves the new element to the next available spot. Although this is a simple solution, it creates clusters in the hash table, and probing linearly through the array is inefficient. To combat this we use separate chaining, which makes each slot of the hash table an instance of some secondary data structure; in this project we use the AVL Trees we created. If we can guarantee that these AVL Trees grow no larger than a certain size, we have a data structure that searches and inserts in O(1) time. Although re-hashing makes an occasional insertion expensive, amortizing that cost still yields an O(1) search time and an O(1) insert time.
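The hash-then-mod scheme and separate chaining can be sketched as below. For brevity, plain Python lists stand in for the report's AVL-tree buckets, and the class and method names are our own, so this is an assumed simplification rather than the project's code.

```python
class ChainedHashTable:
    """A hash table that resolves collisions by separate chaining."""

    def __init__(self, size=101):
        self.size = size
        # one bucket (here a list; an AVL tree in the project) per slot
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # hash the key, then mod (%) by the table size to pick a bucket
        return hash(key) % self.size

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def search(self, key):
        # hashing the key a second time leads back to the same bucket
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None
```

As long as each bucket stays small (the role the bounded AVL trees play in the project), both insert and search touch only one bucket and run in O(1) time.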
[Figure: a Hash Table that uses separate chaining, with our implemented AVL Trees as the buckets.]

AVL Tree vs. Hash Table

We expect constant-time access for the Hash Table versus lg(n)-based access for the AVL Tree, so we predict the Hash Table will be the more efficient data structure. Based on the data we collected, this theory seems to hold. The following table times the parsing of each data set into each structure:

Data Set    AVL Tree Time (s)    Hash Table Time (s)
1000        2                    3
10000       10                   12
100000      140                  110
~300000     192                  120
What this data shows is that as the data set gets larger, the Hash Table becomes the more efficient structure; for small data sets, however, the AVL Tree wins. This is most likely because, although the Hash Table is O(1), its hidden constant is very large: the hash function is expensive to compute, and allocating space in an array costs more than simply dereferencing and traversing a few pointers. An AVL Tree has a smaller hidden constant, making it better on smaller data sets. So although Hash Tables are asymptotically the more efficient data structure, they only pull ahead once the data set passes a certain size. The hidden constant in the big O accounts for these subtle effects.