8

**This article has been co-authored by Aakriti**

Suppose you are a teacher of a class of 1000 students. There is a scholarship offer for the top 50 students of your class. Now, you as a teacher, want to find those top 50 students. How would you go about doing this?

A) There is an easy approach you could follow. You just sort the students according to their ranks and pick up the top 50 from this sorted list. But you will realize that you have wasted a lot of effort in sorting the other 950 students of your class. The best known sorting algorithm takes $O(n*log_2(n))$, where $n$ is the number of students in the class.

B) Another approach which is slightly clever is as follows. Pick up the student with highest rank from the list by going through all of them. Then, remove him from the list and find out the highest rank from the remaining students, which turns out to be the student with second highest rank. Repeat the process 50 times and you have the top 50 students with you. Upon trying to analyze the time complexity of this approach, it would take $O(k*n)$ time, where $k$ is 50 in this case.

So, is this really better than the first? Well, it depends! You will notice that the comparison here is between $k$ and $log_2(n)$. If you had to find just 5 students, then clearly, second strategy wins. But if you were trying to find let's say 500 students out of 1000, then you should have gone with the first strategy.

Now, can you do better?

Well, the answer is Yes!

C) There are two observations to make. One, you just need to know top 50 rankers, but you don't need them in order. Both of the above methods give you the top 50 students in order. Seems like you are doing extra work, so there might be some other approach which reduces your effort. Second observation is that your job would be done if you just get the student who is ranked $50^{th}$ in the class. Now in another single pass ($O(n)$), you can just compare the marks of the rest of the students with the selected student. All of those students who rank higher than this selected student get the scholarship. The problem now reduces to: Is there any efficient algorithm which can find the student with $k^{th}$ rank? Or in other words, how can we find $k^{th}$ largest number from an unsorted list efficiently? Next, we look at one such algorithm which does the job in $O(n)$, thereby making this complete method $O(n)$ and better than the previous two.

**Selection Algorithm**

The current best known algorithm to find out the $k^{th}$ largest number from a list of n numbers uses the concept of "Median of Medians". So what is median of medians? Let's assume we have a magic algorithm Magic() which finds out the median in linear time ($O(n)$). Now, using this algorithm, we can easily find $k^{th}$ largest number in linear time as explained below :

/* Example : 5,3,7,2,1 is unsorted list and k=2 will return the number 5 ($2^{nd}$ largest) */

// beg indicates the starting index and end indicates the last index of the unsorted list.

Find_kth_largest( int k, int beg, int end ) { i = Magic(); // 'i' is the index of the median element /* Here in a single pass, shift all the elements smaller than A[i] to the left and greater than A[i] to the right. So A[i] occupies its correct place in the list. */ if ( i == k ) Return the element a[i]; else if ( i < k ) Return Find_kth_largest ( k, i+1 , end); else Return Find_kth_largest ( k-(n-i), beg, i ); }

This is an example of divide and conquer paradigm. If you are familiar with binary search algorithm, it works on similar lines. First, it finds the median element of the current list. Then it compares it's index with the value k. If k is equal, we are done. If k is smaller, recursive call is made into the left sub-array. Else, recursive call is made into the right sub-array.

**Complexity Analysis of the above algorithm**

If there are $n$ elements, let's say the complexity of finding the median is $T(n)$. This gives us the following recurrence relation

$T(n) = T(n/2) + n + T(Magic())$

Since, after every iteration, the search space reduces to half(either left or right sub-array), the term $T(n/2)$ appears in the right hand side. Here $n$ is for a single pass required for rearrangement of elements to bring the median to its correct position and the other term, $T(magic())$ is for the call to $Magic()$ algorithm. Given, $T(Magic())$ is $O(n)$, therefore,

$T(n) = T(n/2) + n $

The solution of the above recurrence relation is $2n$, which is nothing but $O(n)$.

After analyzing all this, you are one happy and proud teacher. You seem to have solved the problem in $O(n)$ time! Only thing that's stopping you is - figuring out 'Magic()' algorithm.

Mathematics is on your side today, and in this particular case it is not required that you have the exact median to get this linear time complexity. Here's the proof for that:

Say, you have an approximate median and you are certain that it is greater than 3/10 of all elements and lesser than 3/10 of all elements. Precisely, say this approx. median's correct position in the sorted list is 'x'. You have:

$3n/10 <= x <= 7n/10$

So the recurrence, in the worst case becomes: $T(n) = T(7n/10) + n$

This is because, in the worst case, the search space is reduced to $7n/10$ after each iteration. The solution of the above recurrence relation is also $O(n)$.

Turns out you don't really need pure magic. An approximate trick would do! And now we justify the title of this post: This approximate trick is what is called 'median of medians'.

**Median of Medians**

It starts with dividing the elements into groups of 5. The mystery behind the number 5 will be explained in a short while, no magic here!

So, we get $n/5$ groups of size 5 each, except maybe the last group if $n$ is not a multiple of 5. Let's ignore that for the purpose of this post. We can always add dummy elements to make the total a multiple of 5.

Next, we sort these groups of 5 individually. From our knowledge of sorting we know that it will take $5\log *{ 2 }{ \left( 5 \right) } $ time per group, multiplying this with total number of groups give us the total time: $n*log*2(5)$ which is $O(n)$.

Further, we take out the middle element from each of these groups: which gives us $n/5$ elements. The median of these $n/5$ elements has a very nice property. It is greater than half of the medians, thus $n/10$ elements. These $n/10$ medians are also greater than 2 elements each from their respective groups. So our median of medians is greater than at least $3n/10$ elements. Similar argument justifies that our median of medians is lesser than at least $3n/10$ elements. This proves that we have an approximate median element in our hands.

To find the median of the n/5 medians, we apply Find_kth_largest() again with k as $n/2$.

To summarize, we list the complete magic() algorithm:

Magic() { /* Step 1: Divide in groups of 5 */ /* Step 2: Sort these small groups */ /* Step 3: Extract the median elements of each group */ /* Step 4: Calculate median of these medians by recursive call to Find\_kth\_largest() where k is n/2 */ }

Now with Magic() function defined, the earlier recurrence for Find_kth_largest() becomes:

$T(n) = T(n/5) + T(7n/10) + n$

The solution is O(n)!!!

The magic number 5 can be replaced by any number greater than 5, and the resultant complexity would be the same. But if we go below 5, say 3, you can plug it in the recurrence and let us know the complexity!

Cheers!

51.41

Recommended Discussions

- How do data compression softwares works?
- What are the different types of cost models used in determining the time efficiency of an algorithm?
- Finding the least missing positive integer of a sorted array