Sunday, 1 September 2013

Median of medians

Median of Medians Algorithm

Quickselect is linear-time on average, but it can require quadratic time with poor pivot choices. This is because quickselect is a decrease and conquer algorithm, with each step taking O(n) time in the size of the remaining search set. If the search set decreases exponentially quickly in size (by a fixed proportion), this yields a geometric series times the O(n) factor of a single step, and thus linear overall time. However, if the search set decreases slowly in size, such as linearly (by a fixed number of elements, in the worst case only reducing by one element each time), then a linear sum of linear steps yields quadratic overall time. For example, the worst case occurs when pivoting on the smallest element at each step, such as applying quickselect for the maximum element to already sorted data and taking the first element as pivot each time.

If one instead consistently chooses "good" pivots, this is avoided and one always gets linear performance even in the worst case. A "good" pivot is one for which we can establish that a constant proportion of elements fall both below and above it, as then the search set decreases at least by a constant proportion at each step, hence exponentially quickly, and the overall time remains linear. The median is a good pivot – the best for sorting, and the best overall choice for selection – decreasing the search set by half at each step. Thus if one can compute the median in linear time, this only adds linear time to each step, and thus the overall complexity of the algorithm remains linear.

The median-of-medians algorithm does not actually compute the exact median, but computes an approximate median, namely a point that is guaranteed to be between the 30th and 70th percentiles . Thus the search set decreases by a fixed proportion at each step, namely at least 30% (so at most 70% left). Lastly, the overhead of computing the pivot consists of recursing in a set of size 20% the size of the original search set, plus a linear factor, so at linear cost at each step, the problem is reduced to 90% (20% and 70%) the original size, which is a fixed proportion smaller, and thus by induction the overall complexity is linear in size.


Median-of-medians algorithm:
  • Line up elements in groups of five (this number 5 is not important, it could be e.g. 7 without changing the algorithm much). Call each group S[i], with i ranging from 1 to n/5.
  • Find the median of each group. (Call this x[i]). This takes 6 comparisons per group, so 6n/5 total (it is linear time because we are taking medians of very small subsets).
  • Find the median of the x[i], using a recursive call to the algorithm. If we write a recurrence in which T(n) is the time to run the algorithm on a list of n items, this step takes time T(n/5). Let M be this median of medians.
  • Use M to partition the input and call the algorithm recursively on one of the partitions, just like in quickselect.
The chosen pivot is both less than and greater than half of the elements in the list of medians, which is around n/10 elements (½×n/5) for each half. Each of these elements is a median of 5, making it less than 2 other elements and greater than 2 other elements outside the block. Hence, the pivot is less than 3(n/10) elements outside the block, and greater than another 3(n/10) elements inside the block. Thus the chosen median splits the elements somewhere between 30%/70% and 70%/30%, which assures worst-case linear behavior of the algorithm. To visualize:
One iteration on the list {0,1,2,3,...99}

12
15
11
2
9
5
0
7
3
21
44
40
1
18
20
32
19
35
37
39

13
16
14
8
10
26
6
33
4
27
49
46
52
25
51
34
43
56
72
79
Medians 17
23
24
28
29
30
31
36
42
47
50
55
58
60
63
65
66
67
81
83

22
45
38
53
61
41
62
82
54
48
59
57
71
78
64
80
70
76
85
87

96
95
94
86
89
69
68
97
73
92
74
88
99
84
75
90
77
93
98
91
(red = "(one of the two possible) median of medians", gray = "number < red", white = "number > red")

Proof of O(n) running time

The median-calculating recursive call does not exceed worst-case linear behavior because the list of medians is 20% of the size of the list, while the other recursive call recurses on at most 70% of the list, making the running time
T(n) \leq T(n \cdot 2/10) + T(n \cdot 7/10) + c \cdot n.
The O(n) term c n is for the partitioning work (we visited each element a constant number of times, in order to form them into n/5 groups and take each median in O(1) time).
From this, using induction – or summing the geometric series, obtaining 1/(1-9/10) = 1/(1/10) = 10 for overall scaling factor – one can easily show that
T(n) \leq 10 \cdot c \cdot n \in O(n).

Code
// selects the median of medians in an array
int select(int *a, int s, int e, int k){
    // if the partition length is less than or equal to 5
    // we can sort and find the kth element of it
    // this way we can find the median of n/5 partitions
    if(e-s+1 <= 5){
        sort(a+s, a+e);
        return s+k-1;
    }
    
    // if array is bigger 
    // we partition the array in subarrays of size 5
    // no. of partitions = n/5 = (e+1)/5
    // iterate through each partition
    // and recursively calculate the median of all of them
    // and keep putting the medians in the starting of the array
    for(int i=0; i<(e+1)/5; i++){
        int left = 5*i;
        int right = left + 4;
        if(right > e) right = e;
        int median = select(a, 5*i, 5*i+4, 3);
        swap(a[median], a[i]);
    }
    
    // now we have array 
    // a[0] = median of 1st 5 sized partition
    // a[1] = median of 2nd 5 sized partition
    // and so on till n/5
    // to find out the median of these n/5 medians
    // we need to select the n/10th element of this set (i.e. middle of it)
    return select(a, 0, (e+1)/5, (e+1)/10);
}

int main(){
    int a[] = {6,7,8,1,2,3,4,5,9,10};
    int n = 10;
    
    int mom = select(a, 0, n-1, n/2);
    cout<<"Median of Medians: " << a[mom] << endl;
    return 0;
}