From charlesreid1

Notes

Goodrich Chapter 12

Chapter 12 - sorting and selection

Merge sort

  • Merge sort as an example of divide-and-conquer
  • Running time of merge sort
  • Merge sort and recurrence relations

Sorting through an algorithmic lens

  • lower bound
  • linear time sorting
  • comparing sort functions

Skiena Chapter 4

Lots of good information here.

See Skiena Chapter 4 Questions for end of chapter questions.

Sorting

Applications of sorting:

  • Searching
  • Closest pairs
  • Element uniqueness
  • Frequency distribution
    • To find how often an arbitrary element k occurs, look up k with a binary search, then walk left from that point until the first element that is not k, and do the same on the right side
  • Selection
  • Convex hulls
    • Once you have the points sorted by x-coordinate, the points can be inserted from left to right into the hull. Total time is linear after the sort is done.
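The frequency-distribution trick can be sketched with Python's bisect module; as a variant on the walk described above, the left/right walks are replaced by two binary searches (bisect_left/bisect_right), which stays O(log n) even when k is frequent:

```python
from bisect import bisect_left, bisect_right

def count_occurrences(sorted_items, k):
    """Count how often k occurs in a sorted list in O(log n) time."""
    # bisect_left finds the first index holding a value >= k;
    # bisect_right finds the first index holding a value > k.
    return bisect_right(sorted_items, k) - bisect_left(sorted_items, k)

data = sorted([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])
```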

Sorting lies at the heart of many algorithms. Sorting data is one of the first things any algorithm designer should try in the quest for efficiency.

Finding intersection of two sets of size m and n (m << n):

  • Sorting the big set first costs O(n log n) time, and doing a binary search for each of the m elements in n costs O(m log n), for a total cost of O((n+m) log n)
  • Sorting the small set first costs O(m log m) time, and doing a binary search for each of the n elements to find it in m costs O(n log m), for a total cost of O((m+n) log m)
  • Sorting both sets, and comparing each element of the two sets starting with the smallest. This costs O(n log n + m log m + n + m)

Overall the small-set sorting is the fastest.

Expected linear time can be achieved by hashing - build a hash table from one set and look up each element of the other - and this is probably a better solution.
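A minimal sketch of the hashing approach, assuming we build the table from the smaller set:

```python
def intersect(small, big):
    """Expected linear-time set intersection via hashing.

    Build a hash set from the smaller collection (O(m) expected),
    then scan the larger one (O(n) expected), for O(m + n) total.
    """
    lookup = set(small)
    return [x for x in big if x in lookup]
```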

Pragmatics

Several practical considerations when sorting:

  • Increasing (ascending) or decreasing (descending) order?
  • Sorting just the key, or the entire record?
  • What do you do when you get a tie?
    • Certain efficient sort algorithms (like quicksort) run into quadratic performance trouble unless engineered to deal with ties
  • Non-numerical data

The general approach should be to use an external comparison function (a comparator) that takes pointers to records and returns the result of the comparison. In general, compare(a,b) will return negative if a comes before b, positive if b comes before a, and 0 if they are equal.
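A sketch of the comparator idea in Python, using functools.cmp_to_key to adapt a three-way comparator to the sort function; the record layout here is hypothetical:

```python
from functools import cmp_to_key

def compare(a, b):
    """Return negative if a comes before b, positive if b comes before a, 0 on a tie."""
    return (a > b) - (a < b)

records = [("carol", 3), ("alice", 1), ("bob", 2)]
# Sort records by their second field, delegating the ordering to the comparator.
by_key = sorted(records, key=cmp_to_key(lambda r, s: compare(r[1], s[1])))
```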

Why Sort

Why focus so much on sorting?

  • Knuth estimated 25% of compute time is spent sorting
  • Learning sorting is not just about learning sorting - the design techniques you learn in sort are important for other algorithmic problems

Sorting can be used to illustrate many algorithm design paradigms - data structure techniques, divide-and-conquer, randomization, and incremental construction.

Heap Sort

NOTE: Heap sort is only for sorting - not for searching!

Heap sort: the practical implementation is with a priority queue

O(1) time to remove a particular item from an unsorted array once it is located, but O(n) time to locate it. Priority queue improves this to O(log n) time to find items, instead of O(n). Selection sort is also sped up from O(n^2) to O(n log n).

Heap sort is nothing but an implementation of selection sort using the right data structure.

Heaps maintain weaker-than-sorted but stronger-than-random order

We trade a fast O(1) add, which slows to an O(log n) add for the heap, in exchange for improving the O(n) cost of retrieving the minimum to O(log n) with a heap

Hierarchical organization - the minimum item is at the top; a (min-)heap guarantees that any node (level i) is larger than or equal to its parent (level i-1)

Implementation:

  • Heaps can be implemented using a linked structure, but because they are filled from left to right, level by level, they are easy to implement as an array as well
  • We store the keys of a complete binary tree from left to right, level by level; the 2^{k-1} keys of level k are stored in positions 2^{k-1} to 2^k - 1 (with the root at level 1).
  • Assume array index starts at 1, to simplify matters
  • can store any binary tree in an array without pointers, usually easiest option for heaps
  • Why do we use this method for storing the heap? Space efficiency. Filling in from the bottom ensures no holes in the tree.
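Assuming the 1-indexed array layout above (position 0 left unused), navigating the tree is just index arithmetic - a sketch:

```python
def parent(i):
    """Parent of the node at 1-indexed position i; the root (i=1) has no parent."""
    return i // 2 if i > 1 else None

def left_child(i):
    """Left child of the node at 1-indexed position i."""
    return 2 * i

def right_child(i):
    """Right child of the node at 1-indexed position i."""
    return 2 * i + 1
</antml>```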

Since all but the last level is always filled, the height h of an n-element heap is logarithmic:

Each level k holds at most 2^k nodes, so:


\sum_{k=0}^{h} 2^k = 2^{h+1} - 1 \geq n

and, since every level above the leaves is completely filled, 2^h \leq n; therefore


h = \lfloor \log_{2}{(n)} \rfloor

Constructing Heaps

To construct heaps, one option is to add elements at the top, left to right, as you go. This requires re-enforcing the heap order after each insertion, running the up-bubbling routine on a tree of steadily growing size. A better approach is to think in reverse - if we move through the tree from the bottom up, looking for unformed heaps, we know that at any given level a node's children will already be well-formed heaps. This performs n/2 calls to the bubble-down procedure, for O(n) calls, times the upper bound on the cost of each bubble-down, which is O(log n) (the worst case is traversing the entire height of the tree each time), for a total runtime of O(n log n).

While this matches the big-O cost of constructing a heap by top-down insertion with up-bubbling, that O(n log n) is only a ceiling. Moving through the tree from the bottom up, calling bubble-down on nodes whose children are already well-formed heaps, means we rarely hit the O(log n) worst case per call - the cost of each down-heap is proportional to the height of the heaps being merged, and working bottom-up, most of these heaps are small. The last level has n/2 leaf nodes; the next level up has n/4 nodes; the next, n/8 nodes; etc.
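The bottom-up construction can be sketched as follows - a minimal min-heap sketch using the 1-indexed array layout, with heap[0] unused:

```python
def bubble_down(heap, i, n):
    """Restore the min-heap property at 1-indexed position i in a heap of n elements."""
    while 2 * i <= n:
        child = 2 * i
        # Pick the smaller of the two children, if a right child exists.
        if child + 1 <= n and heap[child + 1] < heap[child]:
            child += 1
        if heap[i] <= heap[child]:
            break
        heap[i], heap[child] = heap[child], heap[i]
        i = child

def make_heap(items):
    """Bottom-up O(n) heap construction: bubble down from the last internal node."""
    heap = [None] + list(items)   # index 0 unused to simplify the arithmetic
    n = len(items)
    for i in range(n // 2, 0, -1):
        bubble_down(heap, i, n)
    return heap

heap = make_heap([5, 3, 8, 1, 9, 2])
```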

Heap Construction Cost Analysis

Let's examine the computational cost of assembling the heap from the bottom up.

A heap on n elements has at most \lceil n / 2^{h+1} \rceil nodes at height h, for 0 \leq h \leq \log n. To get the cost of constructing the heap, we sum the cost of building each level. Recall that if we are at height h of the heap, using a bottom-up construction process, all children will be well-formed heaps, and the cost of merging them will be O(h), where h is the height of the heaps. So if we sum the cost h times the number of nodes at height h, for h ranging from 0 to log n (the maximum height of our heap), we get:


\sum_{h=0}^{\log{(n)}} \dfrac{n}{2^{h+1}} h \leq n \sum_{h=0}^{\log{(n)}} \dfrac{h}{2^h} \leq 2n

This is just an inequality constructed from observations, plus a tiny bit of algebra.
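The "tiny bit of algebra" is bounding the finite sum by the corresponding infinite geometric-style series, which converges:


\sum_{h=0}^{\log{(n)}} \dfrac{h}{2^h} \leq \sum_{h=0}^{\infty} \dfrac{h}{2^h} = 2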

Therefore, we get linear construction time, instead of the O(n log n) cost of heap construction using the top-down method.

Insertion Sort

Incremental insertion provides another sorting method - put each element into its proper position in the sorted set.

This is part of a more general class of incremental insertion techniques whereby we build a complicated structure on n items by first building it on n-1 items, then making changes to the last item.

Incremental insertion is particularly useful for geometric algorithms.
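A minimal sketch of the incremental insertion idea:

```python
def insertion_sort(items):
    """Sort by inserting each element into its proper place in the sorted prefix."""
    a = list(items)
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # Shift larger elements of the sorted prefix right to open a slot for key.
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a
```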

Merge Sort Cost Analysis

Description of merge sort - classic divide-and-conquer algorithm.

Recursive algorithm to reduce the larger sorting problem into a smaller one.

We split an array in half, and call merge sort on the successive halves, until we have a trivial problem; we then solve the trivial problem, and merge the solutions.

Total running time of merge sort? Think about how much work is done at each level of the tree. We will have \log{n} levels of merge sort calls, with 2^k calls at level k. At the kth level the merging part of the algorithm merges two lists of size \frac{n}{2^{k+1}} into a list of size \frac{n}{2^k}. This involves a linear number of comparisons - at most \frac{n}{2^k} - 1 per merge.

This leads to a total of n - 2^k comparisons max.

Merge sort does linear work in total at each level of the recursion tree to merge the elements.

To get the overall cost of the algorithm, we must divide the array successively into halves - for a maximum of log(n) halving steps - and perform the linear merge step for each halving step, for a total of O(n log n).
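The halving-and-merging scheme can be sketched as:

```python
def merge_sort(a):
    """Divide-and-conquer: split in half, sort each half recursively, merge linearly."""
    if len(a) <= 1:
        return list(a)          # the trivial problem
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    merged, i, j = [], 0, 0
    # Merge step: repeatedly take the smaller head of the two sorted halves.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```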

Quick Sort - Randomization

Quick sort - pick a random item p from the items we seek to sort, and partition the other n-1 items into two piles: less than p, and greater than p.

Advantages:

  • Pivot p ends up in correct (final) array position
  • After partitioning, nothing will move from one side to the other - so the left and right sides can be sorted independently.
  • This lends itself to a recursive problem

Quicksort can be thought of as building a recursion tree, splitting the n-element range into two sub-ranges. The difference with mergesort is that quicksort does not enforce limits on where the split happens, so the worst case scenario is that quicksort always picks the smallest next item, and one partition is always size 1.

Best case: quick sort will always pick the median element, and we will only perform log n recursions.

Worst case, we always pick the smallest (or biggest) element in the sub-array. Each partition spends linear work but shrinks the problem by only a single element, so it takes n-1 levels to cut the array down to one element, and the worst-case time is \Theta(n^2).

Quicksort, like hash tables, has reasonable expected performance, but really bad worst-case performance. This calls for randomization.

Randomization - avoids problems with quick sort algorithm caused by data being in a certain order/format, by randomizing it before sorting.
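A minimal sketch of randomized quicksort; grouping ties with the pivot also sidesteps the quadratic-ties issue noted under Pragmatics:

```python
import random

def quicksort(a):
    """Randomized quicksort: a random pivot makes the worst case unlikely on any input."""
    if len(a) <= 1:
        return list(a)
    p = random.choice(a)
    # Partition into three piles; ties go with the pivot, which lands in its final position.
    less    = [x for x in a if x < p]
    equal   = [x for x in a if x == p]
    greater = [x for x in a if x > p]
    # The two sides are sorted independently - the recursive structure described above.
    return quicksort(less) + equal + quicksort(greater)
```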

Quick Sort Cost Analysis

What is the likelihood of the best case, that we select the median element each time? For a subproblem of size k, the chance of picking the exact median is 1/k, for a total likelihood that gets really small really fast:


\prod_{k=1}^{n} \dfrac{1}{k} = \dfrac{1}{n!}

If we decide a key is close enough whenever it lies within the middle half of the array - items n/4 through 3n/4 - then the odds of picking a good key are a much better 1/2.

Let's see what happens in the worst possible best-case scenario - i.e., we have picked a key that is "close enough," but it is just barely close enough, at 3n/4.

Now we have 3n/4 items preceding that point in one portion of the array.

Normally, if we successively split an array of size n into halves, we have a number of halving steps equal to:


n = 2^h

or,


h = \log_{2}{(n)}

if instead we split the array into a different proportion, like 3n/4, we have a number of halving steps (denoted h_g) equal to:


n = \left(\frac{4}{3}\right)^{h_g}

so this gives the good choice height:


h_{g} = \log_{\frac{4}{3}}{(n)}

Logs with non-2 bases can always be rewritten in terms of log2, so this is still O(log n) cost for the worst-case height, if we have chosen a key that we deem "good enough".
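Concretely, the change-of-base identity gives:


\log_{\frac{4}{3}}{(n)} = \frac{\log_{2}{(n)}}{\log_{2}{(4/3)}} \approx 2.41 \, \log_{2}{(n)} = O(\log{n})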

What if we don't pick a key that is "good enough"? Remember, we don't get to pick - we're stuck with whichever pivot we draw. But since a good pivot turns up half the time on average, bad picks will at worst roughly double the height of the recursion tree, so we still have


h \approx 2 h_{g}

This means the height of the recursion tree stays O(\log n); with linear work per level, the overall cost of the algorithm is \Theta(n \log n) with high probability - meaning the worst-case cost won't kill performance.

Summary: we want to go from this:

"Quicksort runs in \Theta(n \log n) time, with high probability, if you give me randomly ordered data to sort."

to this:

"Randomized quicksort runs in \Theta(n \log n) time on any input, with high probability."

Randomization Technique

The randomization technique makes proper analysis more complicated (requires probability analysis), but overall there are some principles to keep in mind:

  • Random sampling - useful if you want an idea of the median value but don't have time to look at all the items
  • Randomized hashing - hashing can be used to implement dictionary operations in O(1) expected time; to make this high probability, can choose a random hash function - make your hash function random so your keys don't have to be

Actually Sorting Quickly

Three solid O(n log n) sort algorithms:

  • merge sort
  • heap sort
  • quick sort

Bucket Sort

The idea behind bucket sort is to group things into buckets, based on their distribution, and sort within each bucket.

Examples of good cases: random data, random points, etc.

Examples of bad cases: phone book (Shifflett example), extreme tails, can kill performance
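A minimal sketch of bucket sort, assuming values uniformly distributed in [0, 1) (the good case above):

```python
def bucket_sort(values, num_buckets=10):
    """Bucket sort for values in [0, 1): distribute by value, sort buckets, concatenate."""
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # Values are assumed uniform, so each bucket stays small on average.
        buckets[int(v * num_buckets)].append(v)
    result = []
    for b in buckets:
        result.extend(sorted(b))   # sort within each bucket
    return result
```

On skewed data (the phone-book/Shifflett case) most values land in a few buckets, and the within-bucket sorts dominate the cost.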

Lower Bound on Sorting

Any comparison-based sorting algorithm has a worst-case lower bound of \Omega(n \log n), shown by observing that the algorithm must behave differently during execution on each of the possible n! permutations of n keys; its decision tree therefore has at least n! leaves, and hence height at least \log{(n!)}. Because \log{(n!)} = \Theta(n \log n), the lower bound is \Omega(n \log n).

Sorting has one of the few nontrivial lower bounds among algorithm problems
