The Art of Filtering Duplicates in Excel

Every day analysts export data from myriad sources in an effort to consolidate raw data into usable intelligence. To make good decisions the data must be as accurate as possible, and one aspect of accuracy is ensuring that our information contains no duplicate records. When working with only a couple dozen rows of data, sussing out duplicates is easy, but when you have thousands of rows visual inspection just doesn't cut it and it becomes necessary to employ some automated techniques. In this blog post I'm going to detail some of the Excel tools and strategies available for handling duplicate data.

The Basic Example
The simplest example is the straight removal of duplicate rows. Below is a stripped-down data set with 3 columns and 5 rows of data.

[Screenshot duplicates_0: the sample data set]
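
If you want to follow along without the screen grab, a layout like the following reproduces the scenario (the occurrence numbers, dates, and types here are made up for illustration):

    Occurrence    Date          Type
    2024-00101    2024-01-05    Theft
    2024-00102    2024-01-06    Assault
    2024-00103    2024-01-07    Mischief
    2024-00104    2024-01-08    Theft
    2024-00104    2024-01-08    Theft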

Clearly the fourth and fifth rows are duplicates. To filter this data we'll use Excel's built-in 'Remove Duplicates' tool, which can be found on the 'Data' ribbon in the 'Data Tools' section (see the screen grab below).

[Screenshot duplicates_1: the 'Remove Duplicates' button in the 'Data Tools' section of the 'Data' ribbon]

To use the tool, first highlight the data in the spreadsheet and then click the 'Remove Duplicates' button; this will call up the 'Remove Duplicates' window. There are some options here, but in the simple case the default values are sufficient. Click the 'OK' button and Excel will delete the duplicate rows, leaving only unique rows.

[Screenshot duplicates_2: the 'Remove Duplicates' window]

Filtering on Columns
In the screen shot of the 'Remove Duplicates' window you can see that each of the data columns has a checkbox beside it. These checkboxes allow you to search for duplicates using only the checked columns as the matching criteria. This may seem a bit esoteric, so an example will help demonstrate how useful filtering on specific columns can be.

The screen shot below shows a data set that contains some occurrence numbers and the addresses connected with those occurrences.

[Screenshot duplicates_3: occurrence numbers with addresses and types]
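
A layout along these lines reproduces the scenario (again, all values are made up for illustration):

    Occurrence    Address        Type
    2024-00201    123 Main St    Theft
    2024-00201    125 Main St    Theft
    2024-00201    17 Oak Ave     Mischief
    2024-00202    900 King St    Assault
    2024-00202    902 King St    Assault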

Imagine that from this data set we want only the unique occurrence numbers and we don't care about the street addresses. To get just the numbers, again highlight all the columns and click the 'Remove Duplicates' button. But this time, when the 'Remove Duplicates' window comes up, uncheck the 'Address' and 'Type' checkboxes and click 'OK'.

[Screenshot duplicates_4: the 'Remove Duplicates' window with only 'Occurrence' checked]

The result is that Excel filters out three rows and leaves two, one for each of the unique occurrence numbers in the data set.

[Screenshot duplicates_5: the filtered result, one row per occurrence number]

Filtering Sequenced Data
One question you may have about filtering by specific columns is how Excel decides which rows to keep and which rows to delete. It's actually pretty straightforward: Excel keeps the first row it finds and dumps the rest, and we can exploit this behaviour to achieve a helpful result.

Have a look at the data set below that lists three occurrences and a series of disposition shifts for each occurrence.

[Screenshot duplicates_6: three occurrences, each with a series of disposition changes]

You'll notice that over time the disposition of an occurrence shifts as new information comes in. For example, many occurrences start as a 'Default' but shift to Report To Follow (RTF) or Unfounded (UNF) as the incident evolves. This is important because when calculating crime stats it's crucial to know the final disposition for an occurrence so that it can be properly reported (especially unfoundeds).

To get the last recorded disposition we can first sort the data by 'Occurrence' ascending and then by 'DateTime' descending. If we then highlight the data and use 'Remove Duplicates' (with only the 'Occurrence' column checked) we can exploit Excel's keep-the-first-row behaviour to remove every row except the first one for each occurrence number. Thanks to the sorting, that first row is the most recent one, which very conveniently leaves us with only the final disposition for each occurrence number.

[Screenshot duplicates_7: the final disposition for each occurrence after sorting and removing duplicates]
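
To make the sort-then-remove trick concrete, here is what one hypothetical occurrence looks like after sorting by 'Occurrence' ascending and 'DateTime' descending (all values invented for illustration):

    Occurrence    DateTime            Disposition
    2024-00301    2024-02-03 16:40    UNF        <- kept (first row for this occurrence)
    2024-00301    2024-02-02 09:15    RTF        <- removed
    2024-00301    2024-02-01 22:30    Default    <- removed

Because the descending date sort puts the most recent row first, 'Remove Duplicates' keeps exactly the final disposition, UNF in this sketch.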

Rolling Your Own
All of the above examples use Excel's built-in 'Remove Duplicates' tool, but it is also possible to identify duplicates manually. Why would you want to do this? Primarily because Excel deletes duplicate rows without giving you the opportunity to review them. Let's look at an example using the data from the Filtering Sequenced Data example.

The first thing to do is sort the data based on the column to be checked for duplicates. While it is possible to use multiple columns, this example will just filter on the 'Occurrence' column. Look at the screen shot below and you'll notice that I added a new column named 'Dupe'. Also notice that in the first cell under 'Dupe' (cell B2 in this example) I put a 0, while in the subsequent cells I used Excel's IF function to check the current row's occurrence number against the previous one. As can be seen in the formula bar of the screen shot, the formula in cell B3 is =IF(A3=A2, 1, 0).

[Screenshot duplicates_8: the 'Dupe' helper column, with the IF formula visible in the formula bar]

This function examines the occurrence number in cell A3 and determines whether it is the same as the occurrence number in cell A2. Recall that the IF function evaluates a statement (in this case, whether A3=A2) and, if the statement is TRUE, returns the second input (1 in this case), while if it is FALSE it returns the third input (0 in this case). With the formula in place in B3, I then filled it down to B6 and Excel populated each cell of 'Dupe' as in the screen shot above.
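
If the screen shot is hard to make out, the layout boils down to this (the occurrence numbers are invented):

    Row    A (Occurrence)    B (Dupe)
    2      2024-00301        0
    3      2024-00301        =IF(A3=A2, 1, 0)    -> 1
    4      2024-00301        =IF(A4=A3, 1, 0)    -> 1
    5      2024-00302        =IF(A5=A4, 1, 0)    -> 0
    6      2024-00302        =IF(A6=A5, 1, 0)    -> 1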

You'll notice that the IF functions have accurately identified which rows are duplicates by putting a 1 in the 'Dupe' column. Just like with Excel's 'Remove Duplicates' tool, only subsequent records are flagged as duplicates. This means that if you sort the data by both 'Occurrence' and 'DateTime' you can obtain the same results as in the Filtering Sequenced Data section.

One thing to note though: if you re-sort your data the 'Dupe' column will break. This is because the IF function always checks the current row against the previous one, and if the occurrence numbers are no longer in the proper order the results won't make any sense. To lock in the results, copy the 'Dupe' column and paste it back as values. This replaces the formulas with their respective 1s and 0s, and now you can sort the data just like you otherwise would. I occasionally use this method to first review the duplicates and then sort on 'Dupe' so that I can delete all the rows with a 1. The end result is a list of unique rows.
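
As a side note, if you would rather not depend on the sort order at all, a running COUNTIF can do the same job. This is a variation of my own rather than anything built into the 'Remove Duplicates' tool: put the following in B2 and fill down.

=IF(COUNTIF($A$2:A2, A2)>1, 1, 0)

Each row asks whether its occurrence number has already appeared anywhere above it, so the flags recalculate sensibly even after a re-sort (though you should still paste as values before deleting rows).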

Wrap Up
Filtering duplicates is a fact of analytical life and it’s important to be proficient at culling excess data so that stats aren’t double counted. I hope this post was helpful in identifying the various ways to get a handle on dupes.

Excel’s VLOOKUP Function for Crime and Intelligence Analysts

VLOOKUP is one of those functions that causes a lot of grief for analysts. I think it's because the function is kind of complex to implement and most people don't use it often enough to commit to memory exactly how it works. In this blog post I am going to explain exactly how to use VLOOKUP with the aid of two analyst-oriented examples.

An Introductory Example

At its core, VLOOKUP (the V stands for vertical, by the way) is used for looking up one value based on another … vertically. To demonstrate what I mean, have a look at the screen shot below, in which a list of Uniform Crime Reporting (UCR) codes (column A) is listed side by side with the descriptions of the crimes (column B).

[Screenshot vlookup_1: UCR codes beside their crime descriptions]

Together, column A and column B constitute a 'lookup table' because you can use a known value in column A to find the corresponding value in column B (and vice versa). But why do we need lookup tables? We need them because when you export a list of crime stats from your database you'll likely see the codes rather than the descriptions. And while the UCR code is shorter and easier to manage, there will eventually come a time when you want to replace the code '1110' with the friendlier 'Murder 1st Degree' (maybe when you are preparing a final report to send around), and that's when you use VLOOKUP to look up the description based on the code.

The key thing to remember about VLOOKUP is that when you're using it you always have two different sets of data: the first is the data you are working on and the second is the lookup table. The screen shot below shows a sheet with both sets of data; the working data is on the left and the lookup table is on the right in grey. Notice that the data being worked on has an occurrence number and a UCR code, but the 'Crime' column is empty.

[Screenshot vlookup_2: the working data on the left, the grey lookup table on the right]

But how do we use it? Have a look at the above screen shot again, specifically cell C2, and note that the function in that cell is =VLOOKUP(B2, $D$1:$E$16, 2, FALSE). Let's break that function down into pieces.

  • For the first input to VLOOKUP, we're instructing Excel to use the value in cell B2 (the known UCR code) to find the description to put into cell C2.
  • For the second input we're telling VLOOKUP to look for the value of B2 in the cell range $D$1:$E$16, which is the entire range occupied by the lookup table. (As an aside, you'll notice that the range values all have dollar signs in front of each letter and number. This is because when we fill this function down we don't want Excel to act 'intelligently' and start shifting the references. Using the dollar signs locks the reference so that no matter how we fill it, it will always point to the same set of cells.)
  • For the third input we use the value 2. Notice that our lookup table has only two columns (D and E) in this example, one for the code and one for the crime. Since we already know the code (from column B), what we want is the description, and since the description is in the second of the two columns in the lookup table, we specify that we want the result to come from column 2.
  • The final input to the function is 'FALSE'. By entering FALSE we are telling Excel that we want an exact match. That means that if Excel can't find UCR code 1110 it's going to return #N/A. There are situations where you might not want exact matches, but they are less common and sticking with FALSE will get you what you want 90% of the time.

With the function input, it's just a matter of filling it down to populate all of the cells in column C with the crime description based on the UCR code. If you click on any of the other cells in column C, as I did in the example below, you'll see that the lookup table, locked as it is by the dollar sign notation, always references the same range of cells. I bring this up again because the VLOOKUP mistake I see most often is that people don't lock the range and get weird results when they fill the formula down. Make sure to use those dollar signs!

[Screenshot vlookup_3: the filled-down VLOOKUP formulas referencing the locked lookup range]
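
Spelled out, the pattern of the fill looks like this (only code 1110 and its description come from the example; the second code is a stand-in):

    Row    B (UCR)    C (formula, as filled down)
    2      1110       =VLOOKUP(B2, $D$1:$E$16, 2, FALSE)    -> Murder 1st Degree
    3      1410       =VLOOKUP(B3, $D$1:$E$16, 2, FALSE)    -> whatever description the table holds for 1410

Notice that the first input shifts from B2 to B3 as the formula fills down, while the locked $D$1:$E$16 range stays put.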

A More Advanced Example

While the example above is the most common usage of VLOOKUP, I frequently use it for another purpose: to compare two lists of things (e.g. occurrence numbers, suspect names) to determine if the lists are the same and highlight those elements that are different. The following screen shot demonstrates how I lay out my Excel file to perform this work.

[Screenshot vlookup_4: two columns of occurrence numbers laid out for comparison]

The two columns of occurrence numbers in columns A and C are almost the same; in fact, they differ by only 3 values. To identify those three values we can use VLOOKUP to look up one set of occurrence numbers against the other and see what happens. The above screen shot shows the function in cell B5 to be =VLOOKUP(A5, $C$2:$C$20, 1, FALSE). Just like before, we are checking each value against a lookup table, but this time the table happens to have only one column (instead of the two above), and that's why we have a '1' as the third input.

If you fill the function down, the occurrence numbers get filled in, and where VLOOKUP can't find a match for the cell in column A within column C, Excel outputs #N/A. That last sentence may seem overly precise, but it is meant to illustrate an important point: the VLOOKUP detailed above only checks the values in A against the lookup table and therefore will only catch those values in A that aren't in C. To get the full picture you also need to perform the reverse lookup, that is, catch the values that are in C but not in A. To do that you can put the function call =VLOOKUP(C2, $A$2:$A$20, 1, FALSE) in cell D2 and fill down. Now, just like before, we'll see matching occurrence numbers where the values are in both columns and #N/As for those values that are in C but not in A. Taken together, these two VLOOKUPs identify how the two lists differ.
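
Laid out with invented occurrence numbers, the two lookups behave like this:

    A (List 1)    B (lookup A in C)    C (List 2)    D (lookup C in A)
    2024-00101    2024-00101           2024-00101    2024-00101
    2024-00102    #N/A                 2024-00103    2024-00103
    2024-00103    2024-00103           2024-00104    #N/A

Column B's #N/A flags 2024-00102 as being in A but not C, while column D's #N/A flags 2024-00104 as being in C but not A.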

Getting Even More Advanced

One more thing: I really don't like Excel's blaring #N/A error value. It's ugly. Also, I don't know about you, but I don't really care when the occurrence numbers match; I'm more interested in when they don't match, and I'd prefer not to see anything for the positive matches. Let's see what we can do about fixing this up.

To accomplish this, we'll need to use the VLOOKUP function in conjunction with another Excel function we haven't talked about: ISERROR. I'm not going to pretend this isn't getting complicated, because it is, but I'm doing it anyway. In cell B2 I replace the old VLOOKUP function call with this new one:

=IF(ISERROR(VLOOKUP(A2, $C$2:$C$20, 1, FALSE)), "Diff", "")

I know that looks intimidating, but let's break it down. The VLOOKUP part we already know; it's the same as above. What we're doing now is feeding the output of VLOOKUP to the function ISERROR. ISERROR's only purpose is to return TRUE or FALSE based on whether its input is an error and, as we know, 3 errors occur, one for each of those #N/As. For each of those three #N/As the ISERROR function will return TRUE, and for all the rest it will return FALSE. This behaviour is very handy because the IF function checks for TRUE or FALSE when trying to figure out what to do. In this case, if it gets a TRUE from ISERROR (that is, it's true that there was an error) it prints the word "Diff", but if it gets a FALSE it doesn't print anything, because there was no error.

I know that's a brainful, but have a look at the screen shot below and you can see that each of the cells that was previously an #N/A is now a "Diff", because those are the ones that caused a VLOOKUP error, which in turn caused ISERROR to be TRUE, which caused the IF function to print the value for TRUE. All the rest, the ones without VLOOKUP errors, display nothing, just as I wanted.

[Screenshot vlookup_5: "Diff" flags in place of the #N/A errors]
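
One design note on the formula above: ISERROR traps every error type (#REF!, #DIV/0!, and so on), not just the #N/A that a failed exact-match VLOOKUP produces. If you want to react only to failed lookups, Excel's ISNA function slots into the same pattern:

=IF(ISNA(VLOOKUP(A2, $C$2:$C$20, 1, FALSE)), "Diff", "")

The logic is identical; it just won't hide an unrelated error (say, a broken reference) behind a blank cell.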

Phew, that's a lot of Excel-fu for a post that started out being just about VLOOKUP. I hope the post (or at least the first two parts) gives you the confidence to use VLOOKUP more often in your daily work. And I hope that the third part, as intimidating as it looks (it's really not, just follow the logic of the TRUEs and FALSEs), will give you ideas about how you can use the function to accomplish more complex tasks. Good luck!

Understanding the Difference Between Mean and Median

As analysts we are often interested in averages. But an average is a tricky thing because the word actually covers several related concepts. In this blog post I am going to discuss two of those concepts, the Mean and the Median, and why it's crucial to understand their differences.

The Mean is what we typically mean when we use the term average. It is calculated by adding everything up and dividing by the number of items, and the resulting value is supposed to represent the centre of all the values. People intuitively understand the Mean because a lifetime of exposure to the bell curve (or the normal distribution, as it is called in math circles) has taught them that the big peak in the middle of the curve is the average.

The Median is less well known but just as easy to conceptualize. Take all of the items in your collection and sort them from smallest to largest. The Median is the value of the item that sits in the middle of the sorted list; in other words, it's the value that splits the collection in half, with an equal number of items above and below it. (If the list has an even number of items, the Median is the Mean of the two middle values.)
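
To make the difference concrete with a tiny set of numbers, take the five values 1, 2, 3, 4, and 10:

=AVERAGE(1, 2, 3, 4, 10)    returns 4 (the sum, 20, divided by 5)
=MEDIAN(1, 2, 3, 4, 10)     returns 3 (the middle value of the sorted list)

The single large value, 10, pulls the Mean above the Median, which is exactly the effect discussed below.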

When dealing with data that follows the bell curve, that is, data that produces a symmetric and unimodal (single-peaked) distribution, the Mean and the Median have the same value and we don't have any problems. The issue is that a lot of the data we're likely to encounter in our day-to-day work (such as call response times, time spent on patrol, or man-hours spent on an occurrence) is not accurately represented by a symmetric bell curve but instead follows a skewed distribution that is weighted more to one side than the other.

But what does this mean for averages? It means that for data that is positively skewed (the bulk of the data is on the left side of the chart, with a long tail to the right) the Mean is going to be larger than the Median, and vice versa for negatively skewed data. The problem arises when the difference between the Mean and the Median gets large enough to meaningfully change the stat that you are reporting.

For example, the number of man-hours spent on an occurrence is the kind of data that is going to be positively skewed. This is because of the nature of the data: 0 is the lowest value that can exist, but there is no upper bound. If you create a histogram of the data you're likely to find a whole lot of occurrences with a small to medium amount of time spent (the peak on the left) but also a few occurrences with a lot of time spent (the long tail on the right). I've created the following simplified data set to illustrate the scenario.
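
The occurrence numbers below are made up, and the hours are chosen to match the Mean and Median discussed next; any similarly skewed set of values would show the same effect:

    Occurrence    Man-Hours
    2024-00501    3
    2024-00502    4
    2024-00503    5
    2024-00504    8
    2024-00505    24

    =AVERAGE(B2:B6)    returns 8.8
    =MEDIAN(B2:B6)     returns 5
    =SKEW(B2:B6)       returns roughly 1.99 (strongly positive)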

You can see in the data set that the occurrence that took 24 hours is clearly an extreme example, but it has a significant impact on the Mean. If a Commander came to you and asked for the average number of man-hours spent on occurrences, do you feel that the Mean of 8.8 would be representative, considering that only one occurrence actually took longer than that? Clearly the Median value of 5 hours is more representative of the man-hours spent and is a more accurate average.

This result comes about because of how the two statistics are calculated: the Mean is sensitive to the inclusion of extreme values, while the Median, which concerns itself only with the middle value of the data set, is relatively indifferent to them and is therefore more likely to provide an appropriate number that is stable in the face of outliers.

The example above also conveniently illustrates the function calls used to calculate both the Mean and the Median in Excel. Frustratingly, Microsoft has opted to use the function name AVERAGE for the Mean, further erroneously cementing the two terms as synonyms. Thankfully, the Median is a straightforward call to the MEDIAN function.

You'll notice a third function in the list: SKEW. I discussed the positive and negative skewness of data above, and the SKEW function is useful for coming to grips with the direction and magnitude of a data set's skewness. In the occurrence man-hours example I stated that the data is positively skewed, and that's reflected in the positive (greater than 0) value returned by the SKEW function, which indicates that the Mean is likely larger than the Median. Conversely, if SKEW returns a value less than 0, you'll know that the data is negatively skewed and that the Mean is likely smaller than the Median. Finally, if the value is close to 0, your data set likely follows the bell curve, or normal distribution, and the Mean and Median will be nearly the same.

The key takeaway from this post is that it is important not to just blindly take the average (by which I mean the Mean) of a stat without first considering how the data is distributed. So much law enforcement data is skewed one way or the other that it makes sense to take a minute and run the MEDIAN and SKEW functions in Excel to make sure you're reporting the most appropriate number to your superiors.