This blog, we still talk about randomized algorithms. Today we will focus on Randomized Selection Problem.
Selection Problem and LazySelect
Given unsorted list $S$ with $n = \vert S \vert$ distinct elements, and ${k \in \{1, \ldots, n\}}$, find ${S_{(k)}}$, which is the $k$th element of S in sorted order.
For $y \in S$, let $r_S(y) := \vert {y’ \in S \mid y’ \leq y} \vert$ be the rank of $y$ in $S$. The equivalent goal is to find $y \in S$ such that $r_S(y) = k$.
Observe that $r_S(S_{(k)}) = k$ and $S_{r_S(y)} = y$. (Because, all the elements are distinct).
Then, we give the pseudocode of LazySelect Algorithm as below
01: function LAZYSELECT(S, k)
02: repeat
03: R ← random ${n^{3/4}}$ elements from S, picked uniformly at random with replacement.
04: Sort R in $O(\vert R \vert \ln \vert R \vert)$ steps.
05: x ← $kn^{-1/4}$, $\ell$ ← ${\max \{\lfloor x - \sqrt{n} \rfloor , 1\}}$, $a$ ← $R_{(\ell)}$, $h$ ← ${\min \{ \lceil x - \sqrt{n} \rceil , 1\}}$, $b$ ← $R_{(h)}$. By comparing a and b to every $s \in S$, find $r_S(a)$ and $r_S(b)$.
06: \(P \leftarrow \begin{cases} \{y \in S \mid y \leq b\} & \text{if } k < n^{3/4} \\ \{y \in S \mid \leq y \leq b\} & \text{if } k \in [n^{3/4}, n - n^{3/4}] \\ \{y \in S \mid a \leq y\} & \text{if } k > n - n^{3/4} \end{cases}\)
07: until $S_{(k)} \in P$ and $\vert P \vert < 4n^{3/4} + 2$
08: Sort P in $O(\vert P \vert \ln \vert P \vert)$ steps.
09: return $P_{(k-r_S(a)+1)}$ // This is $S_{(k)}$.
Explanation of Algorithm
- First, we randomly select ${n^{3/4}}$ elements from ${S}$ and construct list ${R}$.
- Second, we sort ${R}$ in $O(\vert R \vert \ln \vert R \vert)$ time.
- The likely position of ${S_{(k)}}$ in ${R}$ is ${x = n^{3/4} \cdot \frac{k}{n} = k n^{-1/4}}$. So, here we select a range around position ${x}$, that is $\ell$ ← ${\max \{\lfloor x - \sqrt{n} \rfloor , 1\}}$, $h$ ← ${\min \{ \lceil x - \sqrt{n} \rceil , 1\}}$. And, we can pick up corresponding element $a$ ← $R_{(\ell)}$, $b$ ← $ R_{(h)}$.
- Then, we can compare all the elements from $S$ to $a$ and $b$. In the meantime, we can divide $S$ into three categories ${y<a, y\in [a,b],y>b}$.
- Then, we construct list $P$, in fact we don’t need extra computation, we can select it by above step.
- If $k$ is small, that is $k<n^{3/4}$, then we pick up $y \leq b$.
- If $k$ is in the middle, that is $n^{3/4} < k < n - n^{3/4}$, then we pick up $ a \leq y \leq b$.
- If $k$ is large, that is $k> n - n^{3/4}$, then we pick up $y \geq a$.
- If $P$ contains $S_{(k)}$, and size is small, that is $\vert P \vert < 4n^{3/4} + 2$. We will continue following steps.
- Sort ${P}$ in $O(\vert P \vert \ln \vert P \vert)$ time.
- Then, we can get ${S_{(k)}}$ by shifting in list ${P}$, that is $P_{(k-r_S(a)+1)}$.
LazySelect Analysis
Lemma: Let ${X_{(i)} = \vert \{ y \in R \mid y \leq S_{(i)} \}\vert}$. Hence we have \(S_{(i)} < R_{(j)} \Leftrightarrow X_{(i)} < j\)
Proof.
${\Rightarrow}$: If ${S_{(i)} < R_{(j)}}$, we know there at most ${j-1}$ elements in ${R}$ that are less or equal to ${S_{(i)}}$. So, ${X_{(i)} \leq j-1}$, that implies ${X_{(i)} < j}$.
${\Leftarrow}$: If ${X_{(i)} < j}$, we know there at most ${j-1}$ elements in ${R}$ that are less or equal to ${S_{(i)}}$. So, ${S_{(i)} \leq R_{(j-1)}}$, that means ${S_{(i)} < R_{(j)}}$. ${\square}$
**Theorem: ** With probability at least ${1-n^{-1/4}}$, LazySelect can find ${S_{(k)}}$ after running the loop for only once, and thus it does only ${2n+o(n)}$ comparisons.
Proof. First, we can compute the running time of the algorithm. If the algorithm finished in one loop, we take ${2n}$ comparisons from computing ${r_s(a)}$ and ${r_s{(b)}}$ (we need compare ${a,b}$ to each element in ${S}$) and ${\mathcal{O}(n^{3/4}\ln n)}$ for sorting lists ${R}$ and ${P}$ (and ${\mathcal{O}(n^{3/4}\ln n )= \mathcal{o}}(n)$).
Following, we need to show LazySelect can find ${S_{(k)}}$ after running the loop for only once with high probability. Here, we assume ${n^{3/4}}$ is an integer and only prove the cases ${n^{3/4} < k < n - n^{3/4}}$. The other two cases are similar and simpler.
Assume ${n^{3/4} < k < n - n^{3/4}}$, then ${x \in [\sqrt{n},n^{3/4}-\sqrt{n}]}$. And we have two ways to fail:
Type Ⅰ: ${S_{(k)}\notin P}$
Type Ⅱ: ${S_{(k)}\in P \land \vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor}$
So, we have
\[\begin{equation} \begin{aligned} \Pr [\text{multiple runs}] &= \Pr[S_{(k)} \notin P \lor \vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor] \\ & = \Pr[S_{(k)} \notin P] + \Pr[\vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor] - \Pr[S_{(k)} \notin P \land \vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor]\\ &= \Pr[(S_{(k)} \notin P \land \vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor) \lor ((S_{(k)} \in P \land \vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor)] \\ & \quad + \Pr[S_{(k)} \notin P] - \Pr[S_{(k)} \notin P \land \vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor]\\ & = \Pr[(S_{(k)} \notin P \land \vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor)] + \Pr[(S_{(k)} \in P \land \vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor)] \\ & \quad + \Pr[S_{(k)} \notin P] - \Pr[S_{(k)} \notin P \land \vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor]\\ & = \Pr[(S_{(k)} \in P \land \vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor)] + \Pr[S_{(k)} \notin P] \\ & = \Pr[\text{Type I}] + \Pr[\text{Type II}] \end{aligned} \end{equation}\]Hence, we can calculate the upper bound of ${\Pr[\text{Type I}] }$ and ${\Pr[\text{Type II}]}$.
\[\begin{equation} \begin{aligned} \Pr[S_{(k)} \notin P] = \Pr[S_{(k)} < a] + \Pr[S_{(k)} > a] \end{aligned} \end{equation}\]We denote ${k_{\ell} = \max {1, k - 2 n^{3/4}}}$ and ${k_h = \min {k+ 2n^{3/4}, n}}$, then:
\[\begin{equation} \begin{aligned} \Pr[(S_{(k)} \in P \land \vert P \vert > \lfloor 4 n^3/4 + 2 \rfloor)] \leq \Pr[S_{(k_\ell)} > a] + \Pr[S_{(k_h)} < b] \end{aligned} \end{equation}\](Note: here ${k}$ is the median in ${P}$ with high probability, so it’s upper bound is at least ${S_{(k_\ell)} > a}$ or ${S_{(k_h)} < b}$ is true.)
Let ${X_i}$ indicate that the ${i}$th element picked for ${R}$ is less or equal to ${S_{(k)}}$, that is
\[\begin{equation} X_i = \begin{cases} 1 & \text{If } R_{(i)} \leq S_{(k)}, \\ 0 & \text{otherwise.} \end{cases} \end{equation}\]Then ${\Pr[X_i = 1] = \frac{k}{n}}$ and we denote the random variable ${X}$ as ${X_{(k)}}$ (which means the number of elements in ${R}$ that are less than ${S_{(k)}}$). Hence, we have ${}$${X = \sum_{i=1}^{\vert R \vert} X_i = \sum_{i=1}^{n^{3/4}}} X_i$. We know ${X_i}$ are Bernoulli trials with success probability ${p = \frac{k}{n}}$. Thus,
\[\begin{aligned} \mu_X &= \mathbb{E}\left[\sum_{i=1}^{n^{3/4}}X_i \right] = \sum_{i=1}^{n^{3/4}} \mathbb{E}[X_i] = p n^{3/4} = k n^{-1/4} = x\\ \sigma^2_X &= \mathbb{E}[X^2] - \mathbb{E}[X]^2 \\ &= \mathbb{E}\left[(\sum_{i=1}^{n^{3/4}} X_i)^2\right] - (p n^{3/4})^2\\ &=\sum_{i=1}^{n^{3/4}} \mathbb{E}[X_i^2] - (p n^{3/4})^2 + \sum_{i=1}^{n^{3/4}} \sum_{j\neq i} \mathbb{E}[X_i]\mathbb{E}[X_j]\\ &= pn^{3/4} - (p n^{3/4})^2 + p^2n^{3/4}(n^{3/4}-1)\\ & = pn^{3/4} - p^2n^{3/4} \\ & = p(1-p) n^{3/4} \\ & \leq \frac{n^{3/4}}{4} \end{aligned}\]Now, we compute the upper bound of probability of Type Ⅰ.
\[\begin{equation} \begin{aligned} \Pr[S_{(k)}<a] &= \Pr[S_{(k)} < R_{(\ell)}] \quad \text{(By lemma)}\\ &= \Pr[X_{(k)}< \ell]\\ &\leq \Pr[X < \lfloor x -\sqrt{n}\rfloor + 1]\\ &\leq \Pr[X \leq x -\sqrt{n}] \quad (\mu_X = x)\\ & = \Pr[X - \mu_x \leq - \sqrt{n}] \\ & \leq \Pr[\vert X - \mu_X \vert \geq \sqrt{n}] \\ & \leq \Pr[\vert X - \mu_X \vert \geq 2n^{1/8} \cdot \frac{n^{3/8}}{2}] \quad (\sigma^2_X \leq \frac{n^{3/4}}{2})\\ & \leq \Pr[\vert X - \mu_X \vert \geq 2n^{1/8} \sigma_X] \quad (\text{By Chebyshev's inequality})\\ & \leq \frac{1}{(2n^{1/8})^2}\\ & = \frac{1}{4} n^{-1/4} \end{aligned} \end{equation}\]Similarly, we have
\[\begin{equation} \begin{aligned} \Pr[S_{(k)}> b] &= \Pr[S_{(k)} > R_{(h)}] \quad \text{(By lemma)}\\ &= \Pr[X_{(k)} \geq h]\\ & = \Pr[X \geq \lceil x+ \sqrt{n} \rceil]\\ & \leq \Pr[X \geq x+ \sqrt{n} ] \\ &= \Pr[\vert X -\mu_X \vert \geq \sqrt{n}]\\ & \leq \frac{1}{4} n^{-1/4} \end{aligned} \end{equation}\]Now, we can use same strategy to compute Type Ⅱ failure
\[\begin{equation} \begin{aligned} \mu_{X_{(k_{\ell})}} &= \frac{k_{\ell}}{n}n^{3/4} \leq \frac{k - 2n^{3/4}}{n}n^{3/4} = x - 2\sqrt{n} \\ \sigma_{X_{(k_{\ell})}} &= p(1-p) n^{3/4} \leq \frac{n^{3/4}}{4} \end{aligned} \end{equation}\]Hence, we have
\[\begin{equation} \begin{aligned} \Pr[S_{(k_{\ell})}> a] &= \Pr[S_{(k_{\ell})} > R_{(\ell)}] \\ &= \Pr[X_{(k_{\ell})} \geq \ell]\\ & = \Pr[X_{(k_{\ell})} \geq \lfloor x - \sqrt{n} \rfloor]\\ & \leq \Pr[X_{(k_{\ell})} \geq x - \sqrt{n} ] \\ & = \Pr[X_{(k_{\ell})} - \mu_{X_{(k_{\ell})}} \geq \sqrt{n} ] \\ & \leq \Pr[\vert X -\mu_{X_{(k_{\ell})}} \vert \geq \sqrt{n}]\\ & \leq \frac{1}{4} n^{-1/4} \end{aligned} \end{equation}\]Similarly, we have
\[\begin{equation} \begin{aligned} \mu_{X_{(k_{h})}} &= \frac{k_{h}}{n}n^{3/4} \geq \frac{k + 2n^{3/4}}{n}n^{3/4} = x + 2\sqrt{n} \\ \sigma_{X_{(k_{h})}} &= p(1-p) n^{3/4} \leq \frac{n^{3/4}}{4} \end{aligned} \end{equation}\]Thus, we have
\[\begin{equation} \begin{aligned} \Pr[S_{(k_{h})} < b] &= \Pr[S_{(k_{h})} < R_{(h)}] \\ &= \Pr[X_{(k_{h})} < h]\\ & = \Pr[X_{(k_{h})} < \lceil x - \sqrt{n} \rceil]\\ & \leq \Pr[X_{(k_{h})} \leq x - \sqrt{n} ] \\ & = \Pr[X_{(k_{h})} - \mu_{X_{(k_{h})}} \geq \sqrt{n} ] \\ & \leq \Pr[\vert X -\mu_{X_{(k_{h})}} \vert \geq \sqrt{n}]\\ & \leq \frac{1}{4} n^{-1/4} \end{aligned} \end{equation}\]In summary, we have
\[\Pr [\text{multiple runs}] \leq \Pr[\text{Type I}] + \Pr[\text{Type II}] \leq 4*\frac{1}{4}n^{-1/4} = n^{-1/4}\]We have shown that with high probability at least ${1-n^{-1/4}}$, LazySelect does only ${2n + \mathcal{o}(n)}$ comparisons. ${\square}$