Sorting | Lecture Notes | Foundations of Software Engineering | Civil and Environmental Engineering

Topics

Introduction
Performance Criteria
Selection Sort
Insertion Sort
Shell Sort
Quicksort
Choosing a Sorting Algorithm

1. Introduction

Sorting techniques have a wide variety of applications. Computer-Aided Engineering systems often use sorting algorithms to help reason about geometric objects, process numerical data, rearrange lists, etc. In general, therefore, we will be interested in sorting a set of records containing keys, so that the keys are ordered according to some well defined ordering rule, such as numerical or alphabetical order. Often, the keys will form only a small part of the record. In such cases, it will usually be more efficient to sort a list of keys without physically rearranging the records.

2. Performance Criteria

There are several criteria to be used in evaluating a sorting algorithm:

Running time. Typically, an elementary sorting algorithm requires O(N²) steps to sort N randomly arranged items. More sophisticated sorting algorithms require O(N log N) steps on average. Algorithms differ in the constant that appears in front of the N² or N log N. Furthermore, some sorting algorithms are more sensitive to the nature of the input than others. Quicksort, for example, requires O(N log N) time in the average case, but requires O(N²) time in the worst case.
Memory requirements. The amount of extra memory required by a sorting algorithm is also an important consideration. In place sorting algorithms are the most memory efficient, since they require practically no additional memory. Linked list representations require an additional N words of memory for a list of pointers. Still other algorithms require sufficent memory for another copy of the input array. These are the most inefficient in terms of memory usage.
Stability. This is the ability of a sorting algorithm to preserve the relative order of equal keys in a file.

Examples of elementary sorting algorithms are: selection sort, insertion sort, shell sort and bubble sort. Examples of sophisticated sorting algorithms are quicksort, radix sort, heapsort and mergesort. We will consider a selection of these algorithms which have widespread use. In the algorithms given below, we assume that the array to be sorted is stored in the memory locations a[1],a[2],...,a[N]. The memory location a[0] is reserved for special keys called sentinels, which are described below.

3. Selection Sort

This "brute force'' method is one of the simplest sorting algorithms.

Approach

Find the smallest element in the array and exchange it with the element in the first position.
Find the second smallest element in the array and exchange it with the element in the second position.
Continue this process until done.

Here is the code for selection sort:

Selection.cpp

#include "Selection.h" // Typedefs ItemType.

inline void swap(ItemType a[], int i, int j) {
    ItemType t = a[i];
    a[i] = a[j];
    a[j] = t;
}

void selection(ItemType a[], int N) {
int i, j, min;

    for (i = 1; i < N; i++) {
        min = i;
        for (j = i+1; j <= N; j++)
            if (a[j] < a[min])
                min = j;
        swap(a,min,i);
    }
}

Selection sort is easy to implement; there is little that can go wrong with it. However, the method requires O(N²) comparisons and so it should only be used on small files. There is an important exception to this rule. When sorting files with large records and small keys, the cost of exchanging records controls the running time. In such cases, selection sort requires O(N) time since the number of exchanges is at most N.

4. Insertion Sort

This is another simple sorting algorithm, which is based on the principle used by card players to sort their cards.

Approach

Choose the second element in the array and place it in order with respect to the first element.
Choose the third element in the array and place it in order with respect to the first two elements.
Continue this process until done.

Insertion of an element among those previously considered consists of moving larger elements one position to the right and then inserting the element into the vacated position.

Here is the code for insertion sort:

Insertion.cpp

#include "Insertion.h" // Typedefs ItemType.

void insertion(ItemType a[], int N) {
int i, j;
ItemType v;

    for (i = 2; i <= N; i++) {
        v = a[i];
        j = i;
        while (a[j-1] > v) {
            a[j] = a[j-1];
            j--;
        }
       a[j] = v;
    }
}

It is important to note that there is no test in the while loop to prevent the index j from running out of bounds. This could happen if v is smaller than a[1],a[2],...,a[i-1]. To remedy this situation, we place a sentinel key in a[0], making it at least as small as the smallest element in the array. The use of a sentinel is more efficient than performing a test of the form while (j > 1 && a[j-1] > v). Insertion sort is an O(N²) method both in the average case and in the worst case. For this reason, it is most effectively used on files with roughly N < 20. However, in the special case of an almost sorted file, insertion sort requires only linear time.

5. Shell Sort

This is a simple, but powerful, extension of insertion sort, which gains speed by allowing exchanges of non-adjacent elements.

Definition

An h-sorted file is one with the property that taking every hth element (starting anywhere) yields a sorted file.

Approach

Choose an initial large step size, h_K, and use insertion sort to produce an h_K-sorted file.
Choose a smaller step size, h_K-1, and use insertion sort to produce an h_K-1-sorted file, using the h_K-sorted file as input.
Continue this process until done. The last stage uses insertion sort, with a step size h₁ = 1, to produce a sorted file.

Each stage in the sorting process brings the elements closer to their final positions. The method derives its efficiency from the fact that insertion sort is able to exploit the order present in a partially sorted input file; input files with more order to them require a smaller number of exchanges. It is important to choose a good sequence of increments. A commonly used sequence is (3^K-1)/2,...,121,40,13,4,1, which is obtained from the recurrence h_k = 3 h_k+1+1. Note that the sequence obtained by taking powers of 2 leads to bad performance because elements in odd positions are not compared with elements in even positions until the end.

Here is the complete code for shell sort:

Shell.cpp

#include "Shell.h" // Typedefs ItemType.

void shell(ItemType a[], int N) {
int i, j, h;
ItemType v;

for (h = 1; h < = N/9; h = 3*h+1);

    for (; h > 0; h /= 3)
        for (i = h+1; i <= N; i++) {
            v = a[i];
            j = i;
            while (j > h && a[j-h] > v) {
                a[j] = a[j-h];
                j -= h;
            }
            a[j] = v;
        }
}

Shell sort requires O(N^3/2) operations in the worst case, which means that it can be quite effectively used even for moderately large files (say N < 5000).

6. Quicksort

This divide and conquer algorithm is, in the average case, the fastest known sorting algorithm for large values of N. Quicksort is a good general purpose method in that it can be used in a variety of situations. However, some care is required in its implementation. Since the algorithm is based on recursion, we assume that the array (or subarray) to be sorted is stored in the memory locations a[left],a[left+1],...,a[right]. In order to sort the full array, we simply initialize the algorithm with left = 1 and right = N.

Approach

Partition the subarray a[left],a[left+1],...,a[right] into two parts, such that
- element a[i] is in its final place in the array for some i in the interval [left,right].
- none of the elements in a[left],a[left+1],...,a[i-1] are greater than a[i].
- none of the elements in a[i+1],a[i+2],...,a[right] are less than a[i].
Recursively partition the two subarrays, a[left],a[left+1],...,a[i-1] and a[i+1],a[i+2],...,a[right], until the entire array is sorted.

How to partition the subarray a[left],a[left+1],...,a[right]:

Choose a[right] to be the element that will go into its final position.
Scan from the left end of the subarray until an element greater than a[right] is found.
Scan from the right end of the subarray until an element less than a[right] is found.
Exchange the two elements which stopped the scans.
Continue the scans in this way. Thus, all the elements to the left of the left scan pointer will be less than a[right] and all the elements to the right of the right scan pointer will be greater than a[right].
When the scan pointers cross we will have two new subarrays, one with elements less than a[right] and the other with elements greater than a[right]. We may now put a[right] in its final place by exchanging it with the leftmost element in the right subarray.

Here is the complete code for quicksort:

Quicksort.cpp

// inline void swap() is the same as for selection sort.

void quicksort(ItemType a[], int left, int right) {
int i, j;
ItemType v;

    if (right > left) {
        v = a[right];
        i = left - 1;
        j = right;
        for (;;) {
            while (a[++i] < v);
            while (a[--j] > v);
            if (i >= j) break;
            swap(a,i,j);
        }
        swap(a,i,right);
        quicksort(a,left,i-1);
        quicksort(a,i+1,right);
    }
}

Note that this code requires a sentinel key in a[0] to stop the right-to-left scan in case the partitioning element is the smallest element in the file. Quicksort requires O(N log N) operations in the average case. However, its worst case performance is O(N²), which occurs in the case of an already sorted file! There are a number of improvements which can be made to the basic quicksort algorithm.

Using the median of three partitioning method makes the worst case far less probable, and it eliminates the need for sentinels. The basic idea is as follows. Choose three elements, a[left], a[middle] and a[right], from the left, middle and right of the array. Sort them (by direct comparison) so that the median of the three is in a[middle] and the largest is in a[right]. Now exchange a[middle] with a[right-1]. Finally, we run the partitioning algorithm on the subarray a[left+1],a[left+2],...,a[right-2] with a[right-1] as the partitioning element.
Another improvement is to remove recursion from the algorithm by using an explicit stack. The basic idea is as follows. After partitioning, push the larger subfile onto the stack. The smaller subfile is processed immediately by simply resetting the parameters left and right (this is known as end-recursion removal). With the explicit stack implementation, the maximum stack size is about log₂ N. On the other hand, with the recursive implementation, the underlying stack could be as large as N.
A third improvement is to use a cutoff to insertion sort whenever small subarrays are encountered. This is because insertion sort, albeit an O(N²) algorithm, has a sufficiently small constant in front of the N² to be more efficient than quicksort for small N. A suitable value for the cutoff subarray size would be approximately in the range 5 ~ 25.

7. Choosing a Sorting Algorithm

Table 1 summarizes the performance characteristics of some common sorting algorithms. Shell sort is usually a good starting choice for moderately large files N < 5000, since it is easily implemented. Bubble sort, which is included in Table 1 for comparison purposes only, is generally best avoided. Insertion sort requires linear time for almost sorted files, while selection sort requires linear time for files with large records and small keys. Insertion sort and selection sort should otherwise be limited to small files. Quicksort is the method to use for very large sorting problems. However, its performance may be significantly affected by subtle implementation errors. Furthermore, quicksort performs badly if the file is already sorted. Another possible disadvantage is that quicksort is not stable i.e. it does not preserve the relative order of equal keys. All of the above sorting algorithms are in-place methods. Quicksort requires a small amount of additional memory for the auxiliary stack. There are a few other sorting methods which we have not considered. Heapsort requires O(N log N) steps both in the average case and the worst case, but it is about twice as slow as quicksort on average. Mergesort is another O(N log N) algorithm in the average and worst cases. Mergesort is the method of choice for sorting linked lists, where sequential access is required.

Table 1: Approximate running times for various sorting algorithms

Grading information.
METHOD	# COMPARISONS - AVERAGE CASE	# COMPARISONS - WORST CASE	# EXCHANGES - AVERAGE CASE	# EXCHANGES - WORST CASE
Selection sort Insertion sort Bubble sort Shell sort Quicksort	N²/2 N²/4 N²/2 ~N^.1.25 2 N ln N (1.38 N log₂ N)	N²/2 N²/2 N²/2 N^3/2 N²/2	N N²/8 N²/2 ? N	N N²/4 N²/2 ? N