Two VLOOKUPS Are Better Than One

You’ll Never Look At Excel The Same Once You See This Tip!

Hi there! Ever have to use VLOOKUP to match data on one spreadsheet to another? If you have more than, 50,000 records in either spreadsheet, it’s probably going to be slow. Painfully slow if you have 200,000 or more records! What if I told you it doesn’t have to be slow? That even 1,000,000 records can be looked up in under 5 seconds?

Yes, it’s possible. Let me go into the how before I go into the why.

Why two VLOOKUPs are better than one.

Let’s set up some data comparing the hypothetical salaries of the principal actors in the two Guardians of the Galaxy movies. Let’s use the following data, and we are going to place the VLOOKUP function into the GOTG1 tab, looking up SALARY by ACTOR from the GOTG2 tab. This is, of course, just to illustrate the idea.

ACTOR CHARACTER SALARY
Chris Pratt Peter Quill 1,500,000
Zoe Saldana Gamora 100,000
Dave Bautista Drax 200,000
Vin Diesel Groot (voice) 150,000
Bradley Cooper Rocket (voice) 1,500,000
Lee Pace Ronan 175,000
Michael Rooker Yondu Udonta 250,000
Karen Gillan Nebula 140,000
Djimon Hounsou Korath 75,000
John C. Reilly Corpsman Dey 250,000

GOTG1

ACTOR CHARACTER SALARY
Bradley Cooper Rocket (voice) 2,500,000
Chris Pratt Peter Quill / Star-Lord 12,000,000
Chris Sullivan Taserface 150,000
Dave Bautista Drax 2,000,000
Elizabeth Debicki Ayesha 100,000
Karen Gillan Nebula 2,000,000
Kurt Russell Ego 1,500,000
Michael Rooker Yondu 1,700,000
Pom Klementieff Mantis 150,000
Sylvester Stallone Stakar Ogord 500,000
Vin Diesel Baby Groot (voice) 250,000
Zoe Saldana Gamora 2,000,000

GOTG2

First thing we need to do is order the GOTG2 tab by the lookup value, which is ACTOR in this case. I recommend doing this by creating a filter on the data and using the filter to do the sort – Sort A to Z or Smallest to Largest. The sort can take some time depending on  To save space, the GOTG2 table above is pre-sorted.

Now to add the special sauce! Normally we would do something like this:

  • =vlookup( A2, GOTG2!A:C,3, FALSE )

The FALSE argument tells it to return only exact matches. Our vlookups are going to use the TRUE argument to return approximate matches. But don’t worry, the magic formula will ensure that the match is exact before returning any results:

  • =IF( VLOOKUP( A2, GOTG2!A:C,1,TRUE )=A2,VLOOKUP( A2, GOTG2!A:C,3,TRUE ), ERROR.TYPE(7))

This formula is looking up the key (ACTOR) from the GOTG2 tab and comparing it against the ACTOR on the GOTG1 tab. If it matches, it uses a similar VLOOKUP to return the SALARY, otherwise it returns the “#N/A”, which is error type 7. This imitates what a VLOOKUP returns when an exact match is requested but cannot be found. The reason we have to compare the results of the first VLOOKUP against our key (AKA lookup_value) is that in approximate match mode, vlookup will return the closest match in case an exact match could not be found. “What a lovely and convenient behaviour” – said no Excel user, ever.

We can compare the regular VLOOKUP with the MAGIC version to see they return the same results:

ACTOR CHARACTER SALARY VLOOKUP MAGIC
Chris Pratt Peter Quill 1,500,000 12,000,000 12000000
Zoe Saldana Gamora 100,000 2,000,000 2000000
Dave Bautista Drax 200,000 2,000,000 2000000
Vin Diesel Groot (voice) 150,000 250,000 250000
Bradley Cooper Rocket (voice) 1,500,000 2,500,000 2500000
Lee Pace Ronan 175,000 #N/A #N/A
Michael Rooker Yondu Udonta 250,000 1,700,000 1700000
Karen Gillan Nebula 140,000 2,000,000 2000000
Djimon Hounsou Korath 75,000 #N/A #N/A
John C. Reilly Corpsman Dey 250,000 #N/A #N/A

 

Now you’re completely convinced and want to understand what makes this version so much faster! And why did we need to sort that data? The answer lies embedded in the tomes of Knuth’s “The Art of Computer Programming”. But I will summarize it for you.

Internals

When you are looking for a word in the dictionary, let’s pick “finance”, you’ll probably intuitively open the dictionary a third of the way in. From there, based on the words, you will skip a chunk of pages at a time based on where you landed in the dictionary and how “far” it is from the word you’re looking up. My point is, you didn’t have to open the dictionary on the first page and go through page by page, word by word, searching for “finance”. Now imagine if the dictionary wasn’t sorted. Then you would literally have to read every word on every page in order to find “finance”. This is called a linear search, and it’s what you’re telling Excel to do when the last argument of VLOOKUP is FALSE.

Now you’ve probably realized why we sorted the data we’re searching through. It allows Excel to use a faster search algorithm. One of the most common is called Binary Search, which is basically a divide and conquer algorithm, and it works as follows:

Let’s take our example from above and find “finance” out of all the words of the dictionary of 160,000 words. The algorithm looks at the middle word in the list, let’s say it coincidentally turns out to be “middle”, and compare it against “finance”. “Finance” comes before “middle”, so the algorithm discards the latter half of the list since “finance” is obviously in that half. That’s 80,000 words that don’t need to be searched. It then does the same with the first half, each time cutting the remaining list in half. To find a word in a list of 160,000 words requires only 18 of these operations. [Note: this is the Log, base 2, of 160000, which you can check in Excel using function =Log(160000,2)]

Using the approximate match version of VLOOKUP tells Excel to use a binary search algorithm to search through the data!

Let’s do some math now, comparing the complexity (number of computations) with 100k rows looking up against 160k words:

  • Exact match (linear search/FALSE argument): 100,000 rows * 160,000 words = 16,000,000,000, or 16 billion comparisons
  • Approximate match (binary search/TRUE argument): 100,000 rows * 18 comparisons = 1,800,000 = 1.8 million comparisons

1000x faster VLOOKUPS! If you’re not excited by now, you either don’t ever do this type of work, or you don’t understand how blazingly fast this is.

I encourage any of you who have to wait for more than 10 seconds on VLOOKUP calculations to try this tip out. If you need help, come and see me or Roger and we’ll be happy to help.

Des Lee – Systems & Processes

Leave a comment