ben's notes

Median Finding

Problem: Given a set $S=\{a_1, \cdots, a_n\}$, find the median $a \in S$ such that half of the values in $S$ are greater than $a$ and the other half is less than $a$.

The median is less sensitive to outliers than the mean.

Finding the Median #

A naive solution #

One method of finding the median is simply sorting the list, then getting the middle element. This requires $n\log(n)$ comparisons (see 03 Sorting for more information).

A Divide and Conquer solution #

We can realize that all of the numbers above and below the median do not matter, so if we don’t have to sort them then we might be able to get a better runtime.

We can use divide and conquer to solve a harder problem (analogous to strong induction). This problem is selection: given a set of numbers $S$ and an index $k$, output the $k$th smallest element in $S$.

The main idea: Pick an $a \in S$ and split $S$ into three lists:

  • $S_l$ for elements less than $a$
  • $S_a$ for elements equal to $a$
  • $S_r$ for elements greater than $a$

We can split $S$ into these three lists using linear time ($n$ number of comparisons).

Then:

  • If $k \le |S_l|$, the element must be in $S_l$ so we can re-run the algorithm only on $S_l$.
  • If $k > |S_l| + |S_a|$, then the element must be in $S_r$ so we can re-run the algorithm only on $S_r$.
  • Otherwise, $k$ must be in $S_a$ so return $a$.

Runtime:

The runtime depends on which $a$ is selected.

Bad case: If $a$ is always the largest or smallest element, then the problem size only decreases by $1$ every iteration. This results in a runtime of $O(n) + O(n-1) + \cdots = O(n^2)$.

Good case: If $a$ always splits $S$ in half, then we get the recurrence relation $T(n) = 1 \cdot T(n/2) + O(n) = O(n)$.

The Problem: guaranteeing the good case requires finding the median!

The Solution: pick $a$ at random.

  • Yes, this doesn’t guarantee a good case all the time… but what if we relax the notion of ‘good’ to make it probable?
  • Suppose we say $a$ is ‘good’ if it’s between the 25th and 75th percentiles. This means that the size of $S$ is reduced to at least $\frac{3}{4}$ size per iteration.
  • There is a 50% chance that $a$ is ‘good’, so the expectation of the number of attempts to get a good $a$ at random should be $2$.
  • It takes $O(n)$ time to determine if $a$ is good (check against other elements in list), so finding $a$ takes an expected $2 \cdot O(n)$ time.
  • Therefore, $\mathbb{E}T(n) \le \mathbb{E}T(\frac{3}{4} \cdot n) + 2 O(n)$ and thus by the Master Theorem, this is $O(n)$.