Lookups – understanding the binary search algorithm

Introduction

Ask ten regular Excel users what their favourite formula function is, eight will probably reply with ‘VLOOKUP’. Ask the same eight if they ever use the approximate match flavour and you will be lucky to get confirmation from one!

You may be wondering why it is necessary to learn the approximate match method when you’ve been quite comfortable using the exact match method for years. There are two reasons:

  • The approximate match method is much faster. The exact match method is often the main reason why your workbooks take ages to calculate.
  • There are other things that you can do with lookup’s!

Excel exposes four lookup functions, namely LOOKUP(), VLOOKUP(), HLOOKUP() and MATCH(). The purpose of this article is to describe how the approximate match method works, otherwise known as the binary search algorithm. Each of the functions referenced supports Excel’s binary search algorithm. This article will not, however, describe detailed behaviour and usage guidance for each of the individual functions.

Linear Search Algorithm

The linear search algorithm, also known as a sequential search is a method for finding a target value within a list, be it ordered or not. It sequentially checks each element of the list for a target value until a match is found or until all elements have been checked. Thus it can be said that the amount of time time it takes a linear search to find an item is proportionate to the position of the target value in the list.

Visualisation of the linear search algorithm where 11 is the target value.

linear search algorithm

Linear search is practical when the list has only a few elements, when performing a single search in an unordered list or for searching a list that cannot be ordered.

Microsoft Excel refers to the linear method as the Exact Match method.

Binary Search Algorithm

The binary search algorithm, also known as a half-interval search is a method for finding a target in an ordered list. It compares a target value to the middle element of a list. If they do not match then the half where the target value cannot exist is eliminated and it repeats the process until it makes a match, or until there are no more elements to compare. It works almost in exactly the same manner that you and I would look for an entry in a directory or dictionary.

Visualisation of the binary search algorithm where 4 is the target value.

binary search algorithm

Binary search is only possible where the list can be ordered according to the target value field, as the primary sort field. The subsequent lessons refers to the binary search algorithm using a list ordered ascending. LOOKUP(), VLOOKUP() and HLOOKUP() only support this algorithm using ascending ordered lists. MATCH(), however, can be used to process both acsending and descending ordered lists.

Microsoft also refer to this method as the Approximate Match method. This is aptly named because should the target value not exist in the list, the binary search will yield a match of the last element remaining after all half-interval eliminations have taken place. For binary search over an ascending order list this will always be the largest item less than the target value.

Visualisation of the binary search algorithm where the target value is not found in the list.

binary search algorithm - approximate match

Microsoft Excel lookup functions return #N/A if the target value is less than the first value in the list.

Visualisation of the binary search algorithm where the target value is less than the first item.

binary search algorithm - no match

Value Range Lookups

An aged analysis is when one categorises balances according to the age of the item. This is a typical exercise when analysing debtors and creditors. For instance, one may have a statement of purchase invoices and rather than review each invoice in turn, one may want to evaluate risk according to age ranges of the debt.

To undertake such an activity requires that one calculate the number of days that each invoice is overdue, and then perform some sort of conditional calculation to determine which age range category each item belongs to. Study the following exhibit:

A B C D E
1 Invoice Number Invoice Date Invoice Amount Days Overdue Age Category
2 INV003746 20/02/2016 $                384.99 129 121 + days
3 INV008655 23/03/2016 $                385.10 97 91 – 120 days
4 INV003443 09/04/2016 $                790.15 80 61 – 90 days
5 INV006588 14/04/2016 $                     8.76 75 61 – 90 days
6 INV007450 28/04/2016 $                265.20 61 61 – 90 days
7 INV008280 28/04/2016 $                849.78 61 61 – 90 days
8 INV001330 04/05/2016 $                     3.96 55 31 – 60 days
9 INV007685 07/05/2016 $                763.64 52 31 – 60 days
10 INV004728 08/05/2016 $                907.03 51 31 – 60 days
11 INV006811 09/05/2016 $                228.87 50 31 – 60 days
12 INV001455 11/05/2016 $                  12.90 48 31 – 60 days
13 INV007335 20/05/2016 $                     6.24 39 31 – 60 days
14 INV009779 31/05/2016 $                     4.03 28 1 – 30 days
15 INV004267 05/06/2016 $                     6.93 23 1 – 30 days
16 INV001209 18/06/2016 $                     5.67 10 1 – 30 days
17 INV007774 20/06/2016 $                     1.44 8 1 – 30 days
18 INV009575 23/06/2016 $                317.41 5 1 – 30 days
19 INV006282 29/06/2016 $                  80.25 0 Current
20 INV009488 03/07/2016 $                  73.22 0 Current
21 INV001120 15/07/2016 $                742.09 0 Current

Purchase Invoices

Worksheet Formulas

Cell Formula
D2 =MAX(0,TODAY()-30-B2)

The formula in D2 (and copied down to the last row) calculates how many days overdue each invoice is. It assumes a standard 30 days payment terms and it returns zero if the invoice is not yet due.

The next challenge is to calculate what age category each invoice belongs to. A somewhat typical approach is to use nested IF statements, such as:

Worksheet Formulas

Cell Formula
E2 =IF(D2>120,”121 days +”,IF(D2>90,”91 to 120 days”,IF(D2>60,”61 to 90 days”,IF(D2>30,”31 to 60 days”,IF(D2>0,”0 to 30 days”,”Current”)))))

There are two problems with this formula.

  1. It is very inflexible. If one wanted to add a new category, e.g. 121-180 days, one would have to amend the formula. Adding or removing categories incurs unnecessary effort.
  2. It is limited. Some older versions of Excel will only allow a maximum of 7 nested functions. Today that limit is 64. Another limit is the length of the formula. Ok so if you are using a recent version of Excel, you are unlikely to hit the limit, but it’s ugly, right?!

A better approach is to use a lookup and specifically to use the approximate match method. This involves constructing a list of lookup values and an associated title for each. Provided that the list is placed in a range, adding or removing age categories becomes a straight forward process of updating the list, not the formula. Assume that we have the following lookup table, in a separate sheet, called “rngAgeBuckets”.

A B
1 Age Title
2 0 Current
3 1 1 – 30 days
4 31 31 – 60 days
5 61 61 – 90 days
6 91 91 – 120 days
7 121 121 + days

Age Buckets

This then allows one to use, in this exhibit, an approximate match method VLOOKUP (which uses the binary search algorithm), such as:

Worksheet Formulas

Cell Formula
E2 =VLOOKUP(D2,rngAgeBuckets,2,TRUE)

The image below describes the binary search calculation order. In a nutshell, the VLOOKUP used in E2 matches the largest item in the lookup table that is less than the target value, and then return the adjacent value from the 2nd column, ‘Title’. Changes to the categories are easy to manage, one merely updates the table in ‘Age Buckets’.

Visualisation of the binary search algorithm used for aged analysis.

binary search algorithm - aged analysis

Exact Match using Binary Search

Users tend not to use the approximate match method for one of two reasons. Either they are not aware that one can construct an approximate match method lookup formula to yield exact match results only, or they do not believe that they can maintain the lookup table’s sort order. With regards to the latter, sort order can be a genuine inhibitor. Using the binary search algorithm requires that the table be ordered according to the target value key / field. Thus one cannot use the approximate match method if the table is ordered according to a different field. A classic case is when the multiple keys are used for multiple lookups.

Sometimes the concern is merely that one does not know how to maintain the sort order. A key solution to this, often, is to import the information into the workbook by using a query. Sort order can be ensured by making sure that the query definition includes sort the appropriate rules. If, however, the data is manually updated then one tends to need to revert to a VBA solution, perhaps utilising an event such as the Worksheet_Change event.

The code below is a demonstration of VBA code one can implement in the desired sheet module to sort inputs written to a table occupying columns A:C.

As already pointed out, an approximate match lookup uses the binary search algorithm, which returns the largest item less that the target value, if the target value does not exist within the list. Worksheet functions such as VLOOKUP(), HLOOKUP() and MATCH() can use either linear or binary algorithm (the final argument specifies which algorithm to use). VLOOKUP() and HLOOKUP() tend to be used to return a value adjacent to the found result, and MATCH() is used to yield the position of the found result in a list. Few are aware, however one can force an exact match result with the binary search method. The logic is as follows:

  1. First use an approximate LOOKUP(), VLOOKUP() or HLOOKUP() to return the found value for a target value.
  2. Compare the result back to the target value. If the key value exists in the table then they will be the same.
  3. If #2 is True then one can invoke the desired lookup formula, using the binary search method, to yield the desired result.
  4. If #2 is False then one can return #N/A instead, or even invoke an linear search lookup approach, if the additional caution is required.

Lets begin by breaking down this process in Excel. We will produce a compound formula afterwards. Assume we have a list of employee ID’s and the associated employee names, in a range called ‘rng_EmpLookup’, such as:

A B
1 Employee ID Full Name
2 000001 Christopher Boyer
3 000002 Kadeem Greer
4 000003 Imani Diaz
5 000004 Sara Caldwell
6 000005 Dorothy Ashley
7 000006 Tucker Flores
8 000007 Isaiah Vargas
9 000008 Lenore Craft
10 000009 Cassandra Harding
11 000010 Kaitlin Peterson
12 000011 Tucker Terry

And now assume we have a requirement to identify three employee names from their employee ID’s, one of which does not exist in the employee lookup table (i.e ID 000016). The following table demonstrates the exact-match-lookup-using-binary-search-algorithm though process:

A B C D E
1 Employee ID Find Employee in lookup table Is it the employee? It it is the return the full name Otherwise return #N/A
2 000008 000008 TRUE Lenore Craft FALSE
3 000011 000011 TRUE Tucker Terry FALSE
4 000016 000011 FALSE FALSE #N/A

 

Worksheet Formulas

Cell Formula
B2 =VLOOKUP(A2,rng_EmpLookup,1,TRUE)

The formula in B2 (and copied down to B4) merely tells us which Employee ID is found in the lookup table for each Employee ID in the results table. If the employee does not exist in the lookup table then the approximate match result is given.

Worksheet Formulas

Cell Formula
C2 =B2=A2

The formula in C2 (and copied down to C4) returns TRUE if the lookup result is the same as the target value, otherwise returns FALSE. A FALSE result indicates that the Employee ID does not exist in the lookup table.

Worksheet Formulas

Cell Formula
D2 =IF(C2,VLOOKUP(A2,rng_EmpLookup,2,TRUE))

The formula in D2 (and copied down to D4) will execute an approximate match (binary search) VLOOKUP to yield the Employee Name, but only if the employee exists in the lookup table.

Worksheet Formulas

Cell Formula
E2 =IF(NOT(C2),NA())

The formula in E2 (and copied down to E4) will return #NA() if the employee is not found in the lookup table.

The entire process can be achieved using a single formula ad single result column, as demonstrated below:

A B
1 Employee ID Employee Name
2 000008 Lenore Craft
3 000011 Tucker Terry
4 000016 #N/A

Demo

 

Worksheet Formulas

Cell Formula
B2 =IF(VLOOKUP(A2,rng_EmpLookup,1,TRUE)=A2,VLOOKUP(A2,rng_EmpLookup,2,TRUE),NA())

The formula demonstrated in B2 is a fairly common construct to ensure an exact match using binary search lookups. You may be wondering “is it worth it”. The simple answer is YES. If you are heavily dependent on lookup functions and your lists are hundreds of rows (or more), then yes absolutely. The binary search algorithm is hundreds of times more efficient, so even implementing two lookups to yield a single result should have considerable benefit.

If there is a risk that the lookup table sort order could be out of kilter then one could construct the formula to test a binary search result. If the test passes then return the result using a binary lookup, otherwise revert to the linear lookup method. For example:

Worksheet Formulas

Cell Formula
B2 =IF(VLOOKUP(A2,rng_EmpLookup,1,TRUE)=A2,VLOOKUP(A2,rng_EmpLookup,2,TRUE),VLOOKUP(A2,rng_EmpLookup,2,FALSE))

Dynamic Ranges with Binary Search

Another great application for binary lookups is to find the first or last item in a table. For example, one may wish to know the position of the last item in a table, i.e. the last row, in order to construct a dynamic named range. The beauty of this method is that sort order isn’t a necessary requirement to yield the correct result.

Study the following table:

A B
1 Yellow 54
2 Green 3
3 Blue 8
4 Black
5 Purple
6 White

A quick glance and you will observe that the last row in column A that occupies a text value is 6. The last row in column A that occupies a numeric value is 3. We can calculate these same results using a binary search, specifically by using the MATCH() worksheet function, which tells us the index position of item in a given list.

This technique involves feeding the lookup an insanely huge target value, such that the half-interval search will always conclude that if the value is to be found, it is to be found in the latter (right-most) half. The process continues until, eventually, a single item remains, the last item in the list.

One must particular attention to the type of information held in the list. One must compare apples with apples, or rather numbers with numbers and text with text. This means that the hugely large target value must conform with the inspected data type.

We define the target values as follows:

Name Formula
xlBigNum =10^308
xlBigText =REPT(“Z”,255)

Going back to the table exhibit of colours and numbers, we can calculate the last rows as follows:

Worksheet Formulas

Formula Result
=MATCH(xlBigText,A:A,1) 6
=MATCH(xlBigNum,B:B,1) 3

This technique can be used for a range of other purposes. MATCH() is great because it also support binary search in descending ordered lists, by changing the final match type argument from 1 (in the previous exhibit) to -1. The image below demonstrates the thought process.

Visualisation of finding the last item in a list, regardless of sort order.

binary last item method

One uses the MATCH() binary method merely to identify the last row in a table. Once known, the row index number can be used as part of a range defining formula. For example, the following describes how to define a dynamic named range for the list of colours:

Worksheet Formulas

Name Formula
drng_Colours =$A$1:INDEX($A:$A,MATCH(xlBigText,$A:$A,1),1)

And the same reference can be used in OFFSET() based dynamic named ranges too.

References