Why Linear Search is O(n) – A Bit Mathy

OK this won’t hurt. I don’t want to hurt my readers. But I think it’s time we get a little serious. Alright not serious but lets tackle something a little harder than normal. Let’s analyze linear search? Why not? Oh no, you would rather watch Scandal. I’ll tell you the punchline. The president falls deeply in love with his mistress, Kerry Washington but never marries her as the plot would end.

So linear search is the most fundamental search algorithm. Right? Can we agree on that? Its advantage is that it’s simple and doesn’t require the data to be sorted.

Most proofs fall into four categories. Oh sure there are many more categories of proof than four but most are in those four. They four proof categories are as follows:

  1. Direct proof
  2. Proof by induction
  3. Proof by contradiction
  4. Proof by contraposition

The most common is the direct proof. Contradiction and induction are also both fairly popular. Contraposition would then be next. I’ll describe these proof techniques more in a later post. Just know that the direct proof is your go-to method in most cases. That will be the case today as we prove that linear search is n-th order in complexity (said order n and written as O(n)).

Let’s search an index i in a set \(S=\{s_1,s_2,s_3 … s_n\}\)

Examples are much easier though aren’t they. So, specifically, as an example, let

\[ S=\{3, 7, 24, 5, 23, 66, 99, 25, 2, 6\} \]

Linear search just starts at the first index value, in this case 3, and says is this the number I was looking for. If it was, were done. Otherwise we look at the next element and ask the same question. Rinse and repeat until we are at the last element (in this case 6). If that wasn’t the element we were searching for we conclude it’s not in the list.

Let’s write this in pseudo-code. If you want, you can write pseudo-code in Python. Python is so close to English I often will just write my pseudo-code in it. Or at least in something that looks a lot like it. In the example below I actually use a cross between C and Python. Again pseudo-code can whatever effectively communicates your intent. There is no pseudo-code standard.

So, assuming 1 for the initial index (opposed to 0 as in the C language), we can write the linear search algorithm as:

bool linearSearch(int find){
  for (int i=1 to n){
    if(S[i]==find)
      return true;
    else
      return false;
  }
}

Let’s calculate the order of this algorithm.

First, what is the best case? Well, the best case would occur when we are looking for 3 (the first item in the list). In that case, the order would be 1 or O(1) since it took only one comparison. The worst case would occur when we are trying to find 6, which is located in the array at the last index of 10. The order would be O(10) since it took 10 comparisons.

Next, what is the average number of comparisons for some arbitrary value n? Another way of saying this is, what is the average index value for some arbitrary value of n? To determine that, we note that average is another word for expectation.

This may seem like a detour and in a way it is. Some people with just use prior knowledge and claim it’s a lemma. No I didn’t say lemming. A lemma can be thought of as a baby theorem. Let’s not assuming any prior knowledge and develop it ourself. Why not? We hard-core programmers who love Linux and like to keep it close to the metal. We have big brains and it shows. Big egos too maybe.

So here we go strap yourself in.

Let’s calculate the average index value. Since it starts at 1 and goes to n we can just calculate the average value as a function of the index n.

We all know how to calculate average right. You just sum stuff up and divide by the total number of items you summed up. So in our case with our list of 10 items we would have the average number of comparisons needed to determine if our element is in the list as follows: \(\frac{1+2+3+4+5+6+7+8+9+10}{10}\). I’ll save your from having to compute this. The answer is 5.5. But we are so used to doing averages we often jump to the solution as I just did. But what is an average. It turns out the statistic department has created a handy definition, so lets use it. It may not look familiar to you, but it produces the same averages that we are so used to. The average, in statistics, is called the expectation and is defined as follows:

\[avg = \sum_{i=1}^n P_i * i\]

Where i is the index value and \(P_i\) is the probability of finding your element at the given index value. It turns out that \(P_i = 1/n\) – it is equally likely to be at any index value. Think about it, if you have a list with one element, the probability you could determine if your element was in the list after traversing to element one is 1. It would take you only one comparison to make your determination. It either is in the list and located at index location 1 or it is not. In other words with a probability of 1 (100%) your could determine if it was in the list.

What about a two element list? Well, we check element 1. Is it in the list, if so we got lucky and it only took one comparison. What is the probability that our element will be at index location 1? With a list of size 2 we know the probability will be \(\frac{1}{2}=0.5\) or (50%). Does our little formula produce the same thing. Let’s see,

\[ (P_1 * 1 + P_2 * 2)=1+\frac{1}{2}=1.5 \]

See, as we expected. A two element list will have an average index of 1.5. But you are saying we don’t have a value at 1.5. The index values are only at 1 and 2. So what the heck is this 1.5 funny business. We that is the nature of discrete math. We will deal with that later though. For now just know that the little brainy statisticians develop a definition that seems to work and matches are common sense or intuition.

If it’s not at element 1 we go to element 2. If it is at element 2 we can determine if it is in the list or not as well.

We can write the above equation – the definition of expectation – as,

\[ avg = \frac{1}{n}\sum_{i=1}^n i \]

Since the \(P_i\) term is the same for all indices (i.e., equally likely ) and is simply \(\frac{1}{n}\).

But the sum \(\sum_{i=1}^n i = \frac{n(n+1)}{2}\) (see proof below if interested) so we can write the equation above as,

\[ avg = \frac{1}{n}\frac{n(n+1)}{2} = \frac{n+1}{2} \]

Let’s make n really big, \(\frac{n+1}{2}\) then this equation would be approximately equal to \(\frac{n}{2}\).

Say \(\frac{1000+1}{2}=500.5\), which is almost the same as \(\frac{1000}{2}=5000\). So the latter equation is pretty good estimate – especially when n is large.

With regard to big-O notation, this would simply be written as O(n). But wait your telling me now that I have to even ignore the 2 in the denominator. Yep. It doesn’t really add much value in this case. See we are mostly interested in relative changes. The 2 in the denominator would affect relative changes so for simplicity we can just ignore it.

In other words, if you double n, you double the number of comparisons. Hence this is why it’s simply written as O(n).

Well. There you have it. If your so interested, the proof of our little “lemma” we used is below. (I consider nice dinner party fodder. Tell someone you can add all the number from 1 to 100 in 2 seconds using a calculator. The answer would be \(\frac{100(100+1)}{2}=5050\).) If you have other things to do, I wouldn’t blame you if you skip or change the channel. Maybe your girlfriend or wife is beckoning you to watch Dancing with the Stars with her. I’m so sorry you have to look at sexy half-naked girls on TV with her. You’re relieved … but come back tomorrow!
Aside:
Proof of \(\sum k=\frac{n(n+1)}{2}\) using a direct algebraic proof.

\[ \sum_{k=1}^n=1+2+3+…+(n-1)+n \]

We can write this as,

\[ \sum_{k=1}^n=n+(n-1)+…+2+1 \]

adding these two equations we get,

\[ 2\sum_{k=1}^n=(n+1)+(n+1)+…+(n+1)+(n+1) \]

or

\[ 2\sum_{k=1}^n=n(n+1) \]

so we have,

\[ \sum_{k=1}^n=\frac{n(n+1)}{2} \]

Share: