Balancing Trees and Hash Tables: Data Structures for Efficient Searching, Study notes of Computer Science

Class notes from day 6 of a Computer Science II course, focusing on additional data structures: balancing trees and hash tables. Balancing trees ensures efficient searching as search trees grow larger, while hash tables use key-to-address transformations for quick access. The notes also cover various types of key-to-address transformations, their advantages and disadvantages, and specific techniques such as digit analysis and folding and adding.

Uploaded on 02/25/2010

koofers-user-a0p
COP 3503 – Computer Science II – CLASS NOTES - DAY #6
Additional Data Structures
Balancing Trees
As search trees get large, it becomes important to ensure that the tree is
balanced; otherwise the time required by the various tree operations (primarily
searching) will increase to a worst case of O(N).
Later in the term, we will examine several different variants of trees and see
how they are balanced. Some trees require that balance be maintained by all
operations on the tree while other trees allow balancing to occur only after the
tree has become unbalanced to the point of requiring too much time for
individual operations on the tree.
Recall that a binary tree is height-balanced (or simply balanced) if the heights
of the two subtrees of every node differ by at most one. A perfectly balanced
tree is a balanced tree in which all leaf nodes are found on one or two levels.
For example, a perfectly balanced binary tree of 10,000 nodes has height
⌈log₂(10,001)⌉ = ⌈13.288⌉ = 14. In practical terms, this means that if 10,000
elements are stored in a perfectly balanced tree, then at most 14 nodes need to
be checked to locate a specific element. This is a substantial difference
compared with the worst case of checking all 10,000 elements in a list!
Therefore, in trees which are to be used primarily for searching, it is worth the
effort to either build the tree so that it is balanced or modify the existing tree so
that it is balanced.
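
The height arithmetic above can be checked with a short Python sketch. The function name `balanced_height` is only an illustrative choice, and the formula assumes the height is counted in levels (nodes on the longest root-to-leaf path), as in the 10,000-node example.

```python
import math


def balanced_height(n: int) -> int:
    """Levels in a perfectly balanced binary tree holding n nodes (n >= 1)."""
    # A tree with h full levels holds at most 2**h - 1 nodes,
    # so the smallest sufficient h is ceil(log2(n + 1)).
    return math.ceil(math.log2(n + 1))


if __name__ == "__main__":
    n = 10_000
    print("balanced tree, nodes checked at most:", balanced_height(n))
    print("unsorted list, elements checked at most:", n)
```

For 10,000 nodes this gives 14 comparisons at most, against 10,000 for a linear scan.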
A binary tree is height-balanced (or simply balanced) if the difference in
height of both subtrees of any node in the tree is either zero or one. A
tree is said to be perfectly balanced if it is balanced and all of the leaves
are found on one or two levels.
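
As a rough illustration of the height-balance definition, the following Python sketch checks whether a tree is balanced. The `Node` class and function names are assumptions made for this example and do not come from the course materials.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """Hypothetical binary tree node used only for this sketch."""
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def _height_and_balance(node: Optional[Node]) -> Tuple[int, bool]:
    """Return (height, balanced) for the subtree rooted at node."""
    if node is None:
        return 0, True
    left_height, left_ok = _height_and_balance(node.left)
    right_height, right_ok = _height_and_balance(node.right)
    # Balanced: both subtrees are balanced and their heights differ by 0 or 1.
    balanced = left_ok and right_ok and abs(left_height - right_height) <= 1
    return 1 + max(left_height, right_height), balanced


def is_height_balanced(root: Optional[Node]) -> bool:
    return _height_and_balance(root)[1]


if __name__ == "__main__":
    bushy = Node(2, Node(1), Node(3))
    chain = Node(1, None, Node(2, None, Node(3)))
    print(is_height_balanced(bushy), is_height_balanced(chain))
```

The chain-shaped tree is the degenerate case that behaves like a list and gives the O(N) worst case.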

Hash Tables

Hash functions are a specific case of a more general technique known as key-to-address transformations (KTA transformations). Many different KTA transformation techniques are possible. Figure 1 illustrates the hierarchy of KTA transformations.

[Figure 1 – Key-to-address transformation hierarchy. Labels in the figure: Key-to-address Transformations; Known Key Distribution; Unknown Key Distribution; Deterministic Transformations; Probabilistic Transformations; Sequence Maintaining Transformation; Hashing Techniques; Exponential Transform; Piecewise Linear Transform; Digit Analysis; Remainder of Division; XOR; Folding and Adding.]

Distribution dependent transformations depend on at least approximate knowledge of the key values that will be expected. The benefits that can be gained by distribution dependent techniques depend on open addressing, bucket size, file density, and the appropriateness of the transformation itself. For a small bucket size and a good distribution algorithm, the improvement over randomizing transformations can be significant. On the other hand, the liabilities of distribution dependent transformations are major, since a change in the key distribution can cause these methods to generate many more collisions than a randomization would generate for the same data.

A benefit of some distribution dependent KTA transforms is that they can maintain sequentiality. Such sequence maintaining transforms produce addresses that increase with increasing key value, which makes serial access possible. Otherwise, a direct file does not generally support serial access.

In Figure 1, there are two distribution ...

... full address; otherwise combinations of other digit positions (perhaps taken modulo 10 or as appropriate) can be tested.

A sequence maintaining transformation function can be obtained by taking a simplified inverse of the distribution of keys found. The addresses are generated to maintain sequentiality with respect to the source key. In a piecewise linear transformation, the observed distribution is approximated, either automatically or manually, by simple line segments. This approximation is then used to distribute the addresses in a complementary manner.

The remainder of division (modulo operation) of the key by a divisor equal to the number of record spaces allocated in the file can be used to obtain the desired address. Division is in some sense similar to taking the low-order digits, but when the divisor is not a multiple of the base of the number system of the key (or the hardware), information from the high-order portions of the key is included; this additional information has a positive effect on the number of addresses generated and thus on the uniformity of the generated addresses. Large prime numbers are generally used as divisors, since their remainders are well distributed even when parts of the keys are not. In general, divisors with no small prime factors (≤ 19) are adequate. Empirical data has shown that division tends to preserve preexisting uniform distributions better than other methods, especially uniformity due to sequences of low-order digits in assigned identification numbers. The remainder does not preserve sequentiality.

The problem with division lies in the capability of the available division operation itself. Frequently the key field to be transformed is larger than the largest dividend the divide operation can accept, and some hardware does not have division instructions that provide a remainder (although this is rare).
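
The effect of the divisor choice can be seen in a small Python sketch. The key values are made up for illustration (identification numbers that all end in the digits 123), and the function name `division_hash` is an assumption of this example.

```python
def division_hash(key: int, table_size: int) -> int:
    """Remainder-of-division KTA transformation: address = key mod table_size."""
    return key % table_size


# Hypothetical keys that share the same three low-order decimal digits.
keys = [1000 * i + 123 for i in range(1, 9)]

# Divisor 1000 is a multiple of the base (10): only the low-order digits matter,
# so every key lands on the same address.
addresses_base_multiple = {division_hash(k, 1000) for k in keys}

# Divisor 997 is prime: high-order digits also influence the address.
addresses_prime = {division_hash(k, 997) for k in keys}

if __name__ == "__main__":
    print("distinct addresses with divisor 1000:", len(addresses_base_multiple))
    print("distinct addresses with divisor 997:", len(addresses_prime))
```

With these eight keys, the divisor 1000 yields a single address (seven collisions), while the prime divisor 997 yields eight distinct addresses.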
When this occurs, the remainder (address) can be calculated according to the expression:

$$\text{address} = \text{key} - \left\lfloor \frac{\text{key}}{m} \right\rfloor \times m$$

where m is the divisor. The floor operation is necessary to prevent a smart optimizer from generating address = 0 for every key, which would lead to an extreme number of collisions (n − 1 if n records are to be stored).

In the exclusive-or technique, the key digit string is segmented into parts that match the required address size. The segments are then exclusive-ORed together to produce the address. This operation produces random patterns for random binary inputs. Segment sizes need to be chosen carefully so that they have no common divisor relative to word sizes. This is among the faster KTA transformations available and is widely used.
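
A minimal Python sketch of both ideas follows. It assumes non-negative integer keys and, for the exclusive-or technique, segments the binary representation of the key rather than a decimal digit string. The function names are illustrative only.

```python
def remainder_via_floor(key: int, m: int) -> int:
    """Remainder computed without a remainder instruction: key - floor(key / m) * m."""
    # Integer floor division (//) plays the role of the floor operation.
    return key - (key // m) * m


def xor_fold(key: int, segment_bits: int) -> int:
    """Split key into segment_bits-wide pieces and exclusive-OR them together."""
    mask = (1 << segment_bits) - 1
    address = 0
    while key:
        address ^= key & mask   # fold in the lowest remaining segment
        key >>= segment_bits    # move on to the next segment
    return address


if __name__ == "__main__":
    print(remainder_via_floor(12345, 997))
    print(hex(xor_fold(0xABCD, 8)))
```

The result of `xor_fold` always fits in `segment_bits` bits, so the segment width fixes the address size.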

Folding and adding of the key digit string produces a shorter string as the address and is a commonly used hashing technique. Alternate segments of the key digit string are bit-reversed.

• Static hashing and dynamic hashing are two very different problems and result in two very different structures to support them. Static hashing is most suitable for “internal hash structures”, which are relatively small structures that fit into main memory in their entirety, while dynamic hashing is most suitable for “external hash structures”, which are relatively large structures held on secondary memory.
• A hash table provides dynamic searching capabilities based upon name alone.
• It avoids two problems of the BST: (1) the O(N) worst-case search of an unbalanced tree, and (2) the repetitive memory maintenance of the BST, which requires reorganization of the tree after every insertion and deletion.
• A hashing function is associated with the table; it converts an input value (a key value) into an integer value that represents an address within the table (a location in the hash table).
• A data collision occurs whenever the hash function yields an address for a new input value that is already occupied by an existing data value. Unless the collision is resolved, the new input value is simply lost!
• Many different collision resolution techniques have been developed, including open addressing (linear probing or rehashing), chaining, multiple hash functions, and buckets.
• Searching the hash table is an O(1) operation in optimal situations.
• Hash tables are used in search engines and extensively by compilers and assemblers.
• Hash tables are very useful any time a fast lookup is needed.
• The search space (file space) is of size M, the key space (set of all possible key values) is K, and the number of expected key values is n. The relationship that must hold between these three parameters (for internal hash structures) is: K >> M > n
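
The following Python sketch combines a fold-and-add hash function with chaining as the collision resolution technique. Several details are assumptions made for illustration: keys are decimal digit strings, alternate segments are digit-reversed (a decimal stand-in for the bit reversal mentioned above), the sum is reduced modulo the table size, and the class and method names are hypothetical.

```python
def fold_and_add(key: str, segment_len: int, table_size: int) -> int:
    """Split a decimal digit string into segments, reverse every other one, and add."""
    total = 0
    for start in range(0, len(key), segment_len):
        segment = key[start:start + segment_len]
        if (start // segment_len) % 2 == 1:
            segment = segment[::-1]      # reverse alternate segments
        total += int(segment)
    return total % table_size            # assumption: reduce the sum to a table address


class ChainedHashTable:
    """Static hash table that resolves collisions by chaining."""

    def __init__(self, size: int = 11, segment_len: int = 3):
        self.size = size
        self.segment_len = segment_len
        self.buckets = [[] for _ in range(size)]

    def _address(self, key: str) -> int:
        return fold_and_add(key, self.segment_len, self.size)

    def insert(self, key: str, value) -> None:
        chain = self.buckets[self._address(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # overwrite an existing entry
                return
        chain.append((key, value))       # a colliding key joins the chain, nothing is lost

    def find(self, key: str):
        for existing_key, value in self.buckets[self._address(key)]:
            if existing_key == key:
                return value
        return None


if __name__ == "__main__":
    table = ChainedHashTable()
    table.insert("123456789", "first")   # 123 + 654 + 789 = 1566, and 1566 mod 11 = 4
    table.insert("4", "second")          # also address 4: a collision
    print(table.find("123456789"), table.find("4"), table.find("999"))
```

Both keys map to address 4, yet both remain retrievable because each table slot holds a chain rather than a single value.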

[Figure: Example of dynamic hashing showing tree expansion and 3-bit key value.]

Priority Queues

• This data structure supports access only to the item which has the highest priority (that is, the minimum priority value).
• Three operations are supported:

  1. insertion – a normal queue insertion.
  2. deleteMin – deletes the item in the queue with minimum priority value.
  3. findMin – searches for the item in the queue with minimum priority value.

• Worst-case performance is faster than the BST's (O(log₂ N) in the worst case).
• Less pointer overhead than with the BST.
• The findMin operation is O(1).
• The deleteMin operation is O(log₂ N).
• Insert is O(1) on average and O(log₂ N) in the worst case.
• The basic priority queue with these three operations is called a binary heap.

Basic Operations on the Heap

Insert:
  1. Insert a node into the next available spot (i.e., in the bottom ply).
  2. Compare the key value of the new node with its parent’s key value; if the new node’s key value is less than its parent’s, interchange the nodes.
  3. “Percolate” the node up into its correct position by recursively applying step 2.

[Figure: Example – final tree maintaining structure and ordering properties.]

DeleteMin:
  1. Get the key value from the root node.
  2. Locate the bottom, rightmost child and interchange it with the root.
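
The heap operations listed above can be sketched in Python as follows. The notes break off after the interchange step of DeleteMin, so the percolate-down step shown here is the standard binary heap technique and is an assumption rather than the notes' own wording. Storing the tree in a Python list, using loops in place of recursion, and the class and method names are also choices made for this example.

```python
class BinaryHeap:
    """Min-heap stored in a list: children of index i are at 2*i + 1 and 2*i + 2."""

    def __init__(self):
        self.items = []

    def find_min(self):
        """O(1): the minimum is always at the root (raises IndexError if empty)."""
        return self.items[0]

    def insert(self, key) -> None:
        # Step 1: place the new node in the next available spot on the bottom level.
        self.items.append(key)
        child = len(self.items) - 1
        # Steps 2 and 3: percolate up while the new key is smaller than its parent's.
        while child > 0:
            parent = (child - 1) // 2
            if self.items[child] >= self.items[parent]:
                break
            self.items[child], self.items[parent] = self.items[parent], self.items[child]
            child = parent

    def delete_min(self):
        # Step 1: get the key value from the root.
        smallest = self.items[0]
        # Step 2: move the bottom, rightmost node to the root.
        last = self.items.pop()
        if self.items:
            self.items[0] = last
            self._percolate_down(0)   # assumed step: restore the ordering property
        return smallest

    def _percolate_down(self, parent: int) -> None:
        size = len(self.items)
        while True:
            child = 2 * parent + 1
            if child >= size:
                break
            # Pick the smaller of the two children.
            if child + 1 < size and self.items[child + 1] < self.items[child]:
                child += 1
            if self.items[parent] <= self.items[child]:
                break
            self.items[parent], self.items[child] = self.items[child], self.items[parent]
            parent = child


if __name__ == "__main__":
    heap = BinaryHeap()
    for priority in [5, 3, 8, 1, 9, 2]:
        heap.insert(priority)
    print(heap.find_min())
    print([heap.delete_min() for _ in range(6)])
```

Repeated calls to `delete_min` return the keys in increasing priority order, since each call moves a node along at most one root-to-leaf path.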


| Data Structure | Access is to | Comments |
|---|---|---|
| Stack | only the most recently inserted item; pop = O(1) | very, very fast |
| Queue | only the least recently inserted item; dequeue = O(1) | very, very fast |
| Linked List | any item | O(N) |
| Search Tree | any item by name or ranking; O(log₂ N) average case | worst case is O(N) |
| Static Hash Table | any named item = O(1) | collision rate affects performance |
| Priority Queue | findMin = O(1); deleteMin = O(log₂ N) | insert is O(1) on average and O(log₂ N) in worst case |

Table summarizing the basic data structures