Excel’s VLOOKUP Function for Crime and Intelligence Analysts

VLOOKUP is one of those functions that causes a lot of grief for analysts. I think it’s because the function is kind of complex to implement and most people don’t use it often enough to commit to memory exactly how it works. In this blog post I am going to explain exactly how to use VLOOKUP with the aid of two analyst-oriented examples.

An Introductory Example

At its core VLOOKUP (the V stands for vertical, by the way) is used for looking up one value based on another … vertically. To demonstrate what I mean, have a look at the screen shot below in which a list of Uniform Crime Report (UCR) codes (column A) is listed side-by-side with the description of the crime (column B).

vlookup_1

Together, column A and column B constitute a “lookup table” because you can use a known value in column A to find the corresponding value in column B (and vice versa). But why do we need lookup tables? We need them because it’s likely that when you export a list of crime stats from your database you’ll see the codes rather than the descriptions. And while the UCR code is shorter and easier to manage, there will eventually come a time when you want to replace the code “1110” with the friendlier “Murder 1st Degree” (maybe when you are preparing a final report to send around), and that’s when you use VLOOKUP to look up the description based on the code.

The key thing to remember about VLOOKUP is that when you’re using it you always have two different sets of data: the first is the data you are working on and the second is the lookup table. The screen shot below shows a sheet with both sets of data; the working data is on the left and the lookup table is on the right in grey. Notice that the data being worked on has an occurrence number and a UCR code but no Crime.

vlookup_2

But how do we use it? Have a look at the above screen shot again, specifically cell C2, and note that the function in that cell is =VLOOKUP(B2, $D$1:$E$16, 2, FALSE). Let’s break that function down into pieces.

  • For the first input to VLOOKUP, we’re instructing Excel to use the value in cell B2 (the known UCR code) to find the description to put into cell C2.
  • For the second input we’re telling VLOOKUP to look for the value of B2 in the cell range $D$1:$E$16, which is the entire range occupied by the lookup table. (As an aside, you’ll notice that the range values all have dollar signs in front of each letter and number. This is because when we fill this function down we don’t want Excel to act ‘intelligently’ and start shifting the references. Using the dollar sign locks the reference so that no matter how we fill it, it will always point to the same set of cells.)
  • For the third input we use the value 2. Notice that our lookup table has only two columns (D and E) in this example, one for the code and one for the crime description. Since we already know the code (from column B), what we want is the description, and since the description is in the second of the two columns in the lookup table we specify that we want the result to come from column 2.
  • The final input to the function is ‘FALSE’. By entering FALSE we are telling Excel that we want an exact match. That means that if Excel can’t find UCR code 1110 it’s going to return #N/A. There are situations where you might not want an exact match but they are less common, and sticking with FALSE will get you what you want 90% of the time.

With the function entered it’s just a matter of filling it down to populate all of the cells in column C with the crime description based on the UCR code. If you click on any of the other cells in column C, as I did in the example below, you’ll see that the lookup table, locked as it is by the dollar sign $ notation, is always referencing the same range of cells. I bring this up again because the VLOOKUP mistake I see most often is that people don’t lock the range and get weird results when they fill the formula down. Make sure to use those dollar signs!

vlookup_3
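To make the fill-down behaviour concrete, here is a quick sketch of what the filled formulas look like, assuming the same layout as above with the lookup table occupying D1:E16:

C2: =VLOOKUP(B2, $D$1:$E$16, 2, FALSE)
C3: =VLOOKUP(B3, $D$1:$E$16, 2, FALSE)
C4: =VLOOKUP(B4, $D$1:$E$16, 2, FALSE)

Notice that the first input shifts with each row (B2, B3, B4) while the locked lookup range stays exactly the same.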

A More Advanced Example

While the example above is the most common usage of VLOOKUP I frequently use it for another purpose: to compare two lists of things (e.g. occurrence numbers, suspect names) to determine if the lists are the same and highlight those elements that are different. The following screen shot demonstrates how I lay out my Excel file to perform this work.

vlookup_4

The two columns of occurrence numbers, in columns A and C, are almost the same; in fact they only differ by three values. To identify those three values we can use VLOOKUP to look up one set of occurrence numbers against the other and see what happens. The above screen shot shows the function in cell B5 to be =VLOOKUP(A5, $C$2:$C$20, 1, FALSE). Just like before we are checking each value against a table of values but this time it just so happens that the lookup table only has one column (instead of the two above) and that’s why we have a “1” as the third input.

If you fill the function down, the matching occurrence numbers get filled in, and wherever VLOOKUP can’t find the column A value in column C Excel outputs #N/A. That last sentence may seem overly precise but it is meant to illustrate an important point: the VLOOKUP detailed above is only checking the values in A against the lookup table and therefore will only catch those values in A that aren’t in C. To get the full picture you also need to perform the reverse lookup, that is, catch the values that are in C but not in A. To do that you can put the following function call =VLOOKUP(C2, $A$2:$A$20, 1, FALSE) in cell D2 and fill down. Now, just like before, we’ll see matching occurrence numbers if the values are in both columns and #N/As for those values that are in C but not in A. Taken together these two VLOOKUPs will help identify how the two lists differ.
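Laid out row by row, the two checks look something like this (a sketch assuming the lists run from row 2 to row 20, as in the formulas above):

B2: =VLOOKUP(A2, $C$2:$C$20, 1, FALSE)   filled down column B; #N/A flags values in A that aren’t in C
D2: =VLOOKUP(C2, $A$2:$A$20, 1, FALSE)   filled down column D; #N/A flags values in C that aren’t in A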

Getting Even More Advanced

One more thing: I really don’t like Excel’s blaring #N/A error value. It’s ugly. Also, I don’t know about you, but I don’t really care about when the occurrence numbers match; I’m more interested in when they don’t match and I’d prefer not to see anything for the positive matches. Let’s see what we can do about fixing this up.

To accomplish this, we’ll need to use the VLOOKUP function in conjunction with another Excel function, ISERROR, that we haven’t talked about. I’m not going to pretend this isn’t getting complicated, because it is, but I’m doing it anyway. In cell B2 I replace the old VLOOKUP function call with this new one:

=IF(ISERROR(VLOOKUP(A2, $C$2:$C$20, 1, FALSE)), "DIFF", "")

I know that looks intimidating but let’s break it down. The part about VLOOKUP we already know, it’s the same as above, but what we’re doing now is feeding the output of VLOOKUP to the function ISERROR. ISERROR’s only purpose is to return TRUE or FALSE based on whether an error occurs within the VLOOKUP and, as we know, three errors occur, one for each of those #N/As. For each of those three #N/As the ISERROR function will return TRUE and for all the rest it will return FALSE. This behaviour is very handy because the IF function checks for TRUE or FALSE when trying to figure out what to do. In this case, if it gets a TRUE from ISERROR (that is, it’s true that there was an error) then it prints the word “DIFF” but if it gets a FALSE, it doesn’t print anything because there was no error.

I know that’s a brainful but have a look at the screen shot below and you can see that each of the cells that was previously an #N/A is now a “DIFF”, because those are the ones that caused a VLOOKUP error, which in turn caused ISERROR to return TRUE, which caused the IF function to print the value for TRUE. All the rest, the ones without VLOOKUP errors, display nothing, just as I wanted.

vlookup_5
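The same treatment can be applied to the reverse check in column D. A sketch, using the same ranges as before, would be to put this in D2 and fill down:

=IF(ISERROR(VLOOKUP(C2, $A$2:$A$20, 1, FALSE)), "DIFF", "")

Now column D shows “DIFF” for the values that are in C but not in A, and nothing for the matches.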

Phew, that’s a lot of Excel-fu for a post that started out being just about VLOOKUP. I hope the post (or at least the first two parts) gives you the confidence to use VLOOKUP more often in your daily work. And I hope that the third part, as intimidating as it looks (but it’s really not, just follow the logic of the TRUEs and FALSEs), will give you ideas about how you can use the function to accomplish more complex tasks. Good luck!

Understanding the Difference Between Mean and Median

As analysts we are often interested in averages. But an average is a tricky thing because the word average actually covers several related concepts. In this blog post I am going to discuss two different concepts of average – the Mean and the Median – and why it’s crucial to understand their differences.

The Mean is what we typically mean when we use the term average. The Mean is calculated by adding everything up and dividing by the number of items. The resulting value is supposed to represent the centre of all the values. People intuitively understand the Mean because a lifetime of being exposed to the concept of the bell curve (or the normal distribution as it is called in math circles) has taught them that the big peak in the middle of the curve is the average.

The Median is less well known but just as easy to conceptualize. Consider all of the items in your collection and sort them from smallest to largest. The Median is then the value of the item that sits in the middle of the sorted list. In other words, it’s the value of the item that splits the collection in half with an equal number of items above and below it.

When dealing with data that follows the bell curve, that is, data that produces a symmetric and unimodal (single peak) distribution, the Mean and the Median have the same value and we don’t have any problems. The issue is that a lot of the data we’re likely to encounter in our day-to-day work, such as call response times, time spent on patrol, or man-hours spent on an occurrence, is not accurately represented by a symmetric bell curve but instead follows a skewed distribution that is weighted more to one side than the other.

But what does this mean for averages? It means that for data that is positively skewed (the bulk of the data is on the left side of the chart) the Mean is going to be larger than the Median, and vice versa for negatively skewed data. The problem arises when the difference between the Mean and Median gets large enough to meaningfully change the stat that you are reporting.

For example, the number of man-hours spent on an occurrence is the kind of data that is going to be positively skewed. This is because of the nature of the data: 0 is the lowest value that can exist but the upper limit is not bounded. If you create a histogram of the data you’re likely to find a whole lot of occurrences with a small to medium amount of time being spent (the peak on the left) but also a few occurrences with a lot of time being spent (the long tail on the right). I’ve created the following simplified data set below to illustrate the scenario.

You can see in the screen shot that the occurrence that took 24 hours is clearly an extreme example but it has a significant impact on the Mean. If a Commander came to you and asked for the average number of man-hours spent on occurrences do you feel that the Mean of 8.8 would be representative, considering that only one occurrence actually took longer than that? Clearly the Median value of 5 hours is more representative of man-hours and is a more accurate average.

This result comes about because of how the two measures are calculated: the Mean is sensitive to the inclusion of extreme values, while the Median, which only concerns itself with the middle value of the data set, is relatively indifferent to them and is therefore more likely to provide an appropriate number that is stable in the face of outliers.

The screen shot also conveniently illustrates the function calls used to calculate both the Mean and Median in Excel. Frustratingly, Microsoft has opted to use the function name AVERAGE for the Mean, further cementing the two terms as synonyms. Thankfully the Median is a straightforward call to the MEDIAN function.

You’ll notice a third function in the list: SKEW. I discussed positive and negative skewness above, and the SKEW function is useful for coming to grips with the direction and magnitude of a data set’s skewness. In the occurrence man-hours example I state that the data is positively skewed, and that’s reflected in the positive (greater than 0) value returned by the SKEW function, which indicates that the Mean will likely sit above the Median. Conversely, if SKEW returns a value that is less than 0 you’ll know that the data is negatively skewed and that the Mean will likely sit below the Median. Finally, if the value is close to 0 your data set likely follows the bell curve or normal distribution and the Mean and Median will be nearly the same.
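For reference, if the man-hours values were sitting in, say, cells A2:A10 (the exact range will depend on your own sheet), the three function calls would look like this:

=AVERAGE(A2:A10)
=MEDIAN(A2:A10)
=SKEW(A2:A10)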

The key takeaway from this post is that it is important to not just blindly take the average (by which I mean the Mean) of a stat without first considering how the data is distributed. So much law enforcement data is skewed one way or the other that it makes sense to take a minute and run the MEDIAN and SKEW functions in Excel to make sure you’re reporting the most appropriate number to your superiors.

IALEIA SWOC November Excel Workshop

I just got back from the November Excel Workshop put on by the South Western Ontario Chapter of the International Association of Law Enforcement Intelligence Analysts. I attended both to present and to learn and I had a great time doing both. I want to give a heartfelt thank you to Mick, Manny and Hazel for inviting me to present on Excel topics for law enforcement analysis and to all the analysts who took time out from their schedules to come and make the profession stronger. I hope we can do it again in the future.

At the workshop I delivered two presentations and I produced a PDF document of the material for each. The first, 30 Excel Functions All Analysts Should Know, is an introductory level document detailing useful Excel functions with examples and screen shots. The second PDF is titled Advanced_Functions_Add_Ins_and_VBA and is a mix of intermediate to advanced level Excel topics; it specifically details how to use Excel’s FREQUENCY function, how to use Microsoft’s Analysis ToolPak for t-Tests, and how to create a simple User Defined Function in Excel. In the coming weeks I will expand on the exercises in the PDFs as blog posts and also blog about those sections that didn’t make the cut for the workshop.

Both lessons also had some supporting Excel files for performing the exercises: 30_Functions_Exercises and Advanced_Functions. Please note, all this data is fictional and is in no way sensitive. In the Advanced Functions workshop I make use of an Add-In named PatternUI.xlam and it can be downloaded from here: http://blogs.office.com/b/microsoft-excel/archive/2007/11/16/chart-pattern-fills.aspx

Once again, thanks to everyone that came out today for making the workshop a success.

Cumulative Distribution Function Charts in Excel

Most analysts know what a histogram is: it’s a chart where each piece of data is put into one of several bins and then the bin counts are plotted in order to gain insight into how the data is distributed. Histograms have an intuitive appeal because it’s easy to see how the data is distributed just by looking at the chart. Of course, they aren’t perfect. The whole binning process is pretty subjective, and depending on how the bins are selected the results can look quite different, which can be misleading. If only there was an alternative that could address these shortcomings…

Well, there is: histograms have an underutilized sibling known as the Cumulative Distribution Function (CDF) chart, which complements and, I think, improves on the histogram by doing away with some of the histogram’s problems and offering some useful innovations of its own: namely, the ability to read off the chart what percentage of the data falls between two points, something that is not possible with a histogram. In this post I will discuss why I think you should make the CDF part of your analytical toolbox and also demonstrate how easy it is to create a CDF in Excel.

Convince me

First, two example charts that describe my dataset: a completely made up list of 30,000 calls-for-service and how long they last from initial call to the closing off of the incident. The first chart is a standard histogram and the second is the corresponding CDF. For both charts the horizontal axis is the same: it’s hours. For the histogram the vertical axis is a count of how many calls are in each bin while the vertical axis of the CDF is a measure of percentage (why percentage will become clear in a bit).

cdf_1

So the histogram is the same old boring story. We have a distribution that is positively skewed (fatter on the left than the right, with a longer tail running off to the right). Looks like most calls take about five hours and only a few go above 12 hours. Yawn.

cdf_2

Ok, the CDF looks great. But what’s it good for? What can it tell us that the histogram can’t? Remember that I mentioned you can use a CDF to determine what percentage of the data falls between two points? Let’s explore that now.

Suppose I want to know how many calls take less than 5 hours (that is, between 0 and 5 hours). I can read that right off the graph by going across the horizontal axis until I reach 5 and then going up until I hit the trend line. I then read across to the vertical axis and see that the value is about 60%. What this means is that 60% of all calls take 5 hours or less. That’s a much firmer number than ‘most calls are around 5 hours’. Similarly, what if the question was how many calls take between 5 and 10 hours? Well, first read off the 10-hour value, which is about 99%, and we know the 5-hour value is 60%, so just take the difference, 99 minus 60, to arrive at 39% of all calls taking between 5 and 10 hours. Let’s see you do that with a histogram (hint: you can’t do it).
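If you’d rather not eyeball the chart, the same percentages can be pulled straight from the raw data with a couple of COUNTIF calls. A sketch, assuming the 30,000 call durations sit in A1:A30000:

=COUNTIF(A1:A30000, "<=5")/30000
=(COUNTIF(A1:A30000, "<=10")-COUNTIF(A1:A30000, "<=5"))/30000

The first gives the fraction of calls taking 5 hours or less and the second the fraction taking between 5 and 10 hours.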

I’m convinced. How do I make one in Excel?

Creating a CDF is very straightforward and the following five steps describe the entire process. I’ve included a quick Excel screen grab to illustrate what the first few rows of the data will look like when you’re done (it’ll make more sense once you’ve done the steps).

cdf_3

Step One: Start with your data in Column A in Excel sorted from smallest value to largest. My data had a lot of zero values, so that is why the column starts off with a series of zeros.

Step Two: In Column C (yes I intentionally skipped B) put the number 1 in C1, 2 in C2, 3 in C3, etc.  Here’s a useful hint to speed this up: use =C1+1 for C2 and fill down.

Step Three: In B1 enter how many data points you have. I had 30,000 so I dropped that number in there.

Step Four: For B2 use the formula =C2/$B$1 and fill down. The idea is that you want to divide the value in Column C by the total count. When you get to the bottom of your data the last value should be 1.

Step Five: Now that the data has been prepped, highlight all of columns A and B and insert a new Scatter chart. What you’ll get is a monotonically increasing (i.e. always going up) trend line. You can edit the styling so that the graph is a line rather than points and make whatever other changes make it look pretty. I recommend fixing the vertical axis maximum at 1 (since the data can never go above 100%) and the horizontal axis minimum at 0, if that’s appropriate. Voila, now you have a CDF chart.
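As a compact recap (assuming the sorted data starts in A1 and there are 30,000 values), the helper cells end up looking like this:

B1: 30000 (the total count)
C1: 1
C2: =C1+1 (fill down to the last row of data)
B2: =C2/$B$1 (fill down; the last value should be 1)

Columns A and B are then what get plotted as the scatter chart.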

But how do CDFs address the histogram’s shortcomings? Did you notice any bins during the calculation? Nope. CDFs use the value of every data point and therefore do away with binning and the loss of information binning introduces. And because there are no bins the shape of the CDF can’t change like it can in a histogram so that source of subjectivity is neatly side-stepped. And you want to know what else is great? It’s easy to compare two data sets with CDFs by simply plotting them on the same chart. Comparing two histograms on the same plot is not particularly enlightening besides noting how the peaks compare.

In conclusion, CDFs are like grilled cheese: great and easy. And like grilled cheese everyone should try one at lunch (or whenever you next do a frequency analysis). The CDF is a great complement to traditional histogram analysis and will hopefully become a valuable tool in your graphic analysis toolbox.

An olive branch to histograms

Making a histogram is actually more work than creating a CDF. But if you really want to create one you can either use Microsoft’s Analysis ToolPak (what happened to the ‘c’ MS?) or perform some more elaborate Excel JuJu. Good luck.

The Poor Man’s Proximity Tool: The Haversine Formula

The haversine formula is a method for finding the great-circle distance between a pair of latitude-longitude coordinates. In crime analysis I use the formula for finding incidents that have taken place within a particular distance, for example within 500 meters, of a location.

My approach uses Excel to calculate the haversine distance between a reference location and the latitude-longitude of every other location in my spreadsheet. I then sort the results to find the locations that are within the distance that I’m interested in. I’ve worked the following example in Excel to illustrate the method.

Imagine that you have a list of 1000 occurrences with the latitudes and longitudes of each of the occurrence locations. Now also imagine that you have a reference location and that you want to determine how many of the 1000 incidents occurred within 500 meters of the reference location. I set my spreadsheet up like this:

haversine_1

Note that in column B I have entered all of the latitudes and in column C I have entered all of the longitudes. I have also entered my reference latitude and longitude in cells E2 and E3 so that they can be referenced by the haversine formula. Also note that both latitudes and longitudes are in decimal degrees as opposed to degrees-minutes-seconds.

In column D I enter the Excel-ified version of the haversine formula which I have reproduced below. It gets a bit complex because Excel likes to deal with radians as opposed to degrees when using the COS and SIN functions so it is necessary to use Excel’s RADIANS function to convert the latitudes and longitudes. The final multiplication by 6371 represents the Earth’s radius in kilometres.

haversine_2
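As a rough guide, a typical Excel rendering of the haversine (assuming, as laid out above, that the reference latitude and longitude are in E2 and E3 and each occurrence’s latitude and longitude are in B2 and C2) looks something like this:

=6371*2*ASIN(SQRT(SIN((RADIANS(B2)-RADIANS($E$2))/2)^2+COS(RADIANS($E$2))*COS(RADIANS(B2))*SIN((RADIANS(C2)-RADIANS($E$3))/2)^2))

Filled down column D, each row gets its great-circle distance, in kilometres, from the reference point.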

A note for pedants: I realize that the Earth is not a perfect sphere and that means that the calculated distances will be somewhat off if you are dealing with large distances. But for the distances we are concerned with (tens of kilometres) the impact is negligible and can be ignored.

By filling column D with the expression I calculate the separation distance between the reference point and each incident. Now, if the distance column is sorted from smallest to largest I can easily see the incidents that occurred within 500 meters (or 0.5 kilometres) of the reference location.

haversine_3

I find this method useful for determining the proximity of all sorts of things. For example, if the reference coordinates are for an intersection the method can be used to find all motor vehicle collisions that occurred within 100 meters. Or, say there was a break and enter: a proximity search can be done against field contacts that occurred within 1 kilometre of the occurrence address. Really, anything that has a recorded latitude and longitude can be used. Best of all, it doesn’t require a GIS, just Excel and the formula.

 

Creating Heat Maps from Excel Pivot Tables with Conditional Formatting

I frequently use Excel’s Pivot Table functionality to examine the relationships between variables. While it is a common tool in the analyst’s toolbox I don’t like the default visualization options. This blog post will discuss an approach for visualizing pivot table data using Excel’s built-in conditional formatting functionality to create heat maps.

A Heat Map is a chart that uses colour to visualize a two-dimensional matrix of values. Since pivot tables make the creation of two-dimensional output so easy the heat map is the perfect tool for visualizing a pivot table. The process is straightforward and I demonstrate it with an example.

I start with some fictitious data that has day-of-week and hour-of-day properties. Using the standard approach I create a pivot table that has day-of-week as the columns and the hour-of-day as the rows.

sample_pivot_table1

Next, I highlight the pivot table data and copy it to a new worksheet (this is for future formatting). I then highlight the entire data range and select the Conditional Formatting button in the ‘Home’ ribbon. Under Conditional Formatting I select Color Scales and one of the colour ramps that are available.

sample_conditional_formatting

This immediately colour codes the entire range and creates an attractive heat map that will dynamically adjust its formatting if the cell values change. With a small amount of additional formatting (some bolding, a border, centre-aligned text) a very nice chart can be produced that can be added into a report.

sample_heat_map1

Contrast the heat map with the following chart that plots the same data using multiple line charts. I find the heat map much more compact, readable and intuitive.

sample_line_chart

Conditional Formatting offers a quick way to produce heat maps for the kind of two-dimensional datasets analysts produce every day. I have found that heat maps are an effective way to visualize a large volume of data in a manner that is easily understood by non-analysts. Best of all, if you have Excel 2007 or higher the tools are already installed on your computer.

Bonus Tip
You may come across the situation where you want to keep the underlying heat map cell colours but delete the numbers in the cells. Unfortunately, deleting the numbers also deletes the colour since this is how conditional formatting works. Frustratingly, copying and pasting in Excel also doesn’t work, not even with ‘Paste Formatting’, as it’s not possible to maintain the format without the numbers. Despite these headaches, I did find a trick for keeping the colours without the numbers.

First, highlight the whole heat map and copy it. Now open Microsoft Word and paste the heat map into a Word document.

Second, select the heat map in Word and copy it. Now open up a new Excel document and paste into the first cell. The heat map should be pasted in its entirety back into Excel and if you delete the numbers the colours should still be there.

Update: Reader Frank Fery writes that it is possible to remove the numbers from a heat map without resorting to the copy-to-Word trick. He provided a link to superuser.com that states that if you change the custom format of the cell to ;;; it will hide the numbers.  I still prefer the Word way though because it allows me to actually lay out different numbers on top of the heat map. Why would you want this? As an example, I once created a heat map where the surface represented calls-for-service volume but the numbers on the chart were numbers of officers working. Thanks for the feedback Frank.