Every day analysts export data from myriad sources in an effort to consolidate raw data into usable intelligence. To make good decisions, the data must be as accurate as possible, and one aspect of accuracy is ensuring that our information contains no duplicates. When working with only a couple dozen rows, sussing out duplicates is easy, but with thousands of rows visual inspection just doesn’t cut it and it becomes necessary to employ automated techniques. In this blog post I’m going to detail some of the Excel tools and strategies available for handling duplicate data.
The Basic Example
The simplest example is the straight removal of duplicate rows. Below is a stripped down data set with 3 columns and 5 rows of data.
Clearly the fourth and fifth rows are duplicates. To filter this data we’ll use Excel’s built in ‘Remove Duplicates’ tool that can be found on the ‘Data’ ribbon in the ‘Data Tools’ section (see screen grab below).
To use the tool, first highlight the data in the spreadsheet and then click the ‘Remove Duplicates’ button to call up the ‘Remove Duplicates’ window. There are some options here, but in the simple case the default values are sufficient. Click the ‘OK’ button and Excel will delete the duplicate rows, leaving only unique rows.
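For analysts who also script parts of their workflow, the behaviour of ‘Remove Duplicates’ in this simple case can be sketched in plain Python. The rows below are made-up stand-ins for the five-row data set, with the last two rows matching:

```python
# A minimal sketch of what 'Remove Duplicates' does in the simple case:
# keep the first occurrence of each row and drop exact repeats.

def remove_duplicates(rows):
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:       # rows are tuples, so they hash directly
            seen.add(row)
            unique.append(row)
    return unique

# Hypothetical three-column, five-row data set; the last two rows match.
rows = [
    ("2024-01-01", "Theft",   "123 Main St"),
    ("2024-01-02", "Assault", "456 Oak Ave"),
    ("2024-01-03", "Fraud",   "789 Elm Rd"),
    ("2024-01-04", "Theft",   "456 Oak Ave"),
    ("2024-01-04", "Theft",   "456 Oak Ave"),
]

print(remove_duplicates(rows))  # four rows: the duplicate fifth row is gone
```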
Filtering on Columns
In the screen shot of the ‘Remove Duplicates’ window you can see that each of the data columns has a checkbox beside it. These checkboxes allow you to search for duplicates using only the checked columns as the matching criteria. This may seem a bit esoteric so an example will help demonstrate how useful filtering on certain columns is.
In the screen shot below is a data set that contains some occurrence numbers and the addresses connected with the occurrences.
Imagine that from this data set we want only the unique occurrence numbers and we don’t care about the street addresses. To get just the numbers, again highlight all the columns and click the ‘Remove Duplicates’ button. This time, when the ‘Remove Duplicates’ window comes up, uncheck the ‘Address’ and ‘Type’ checkboxes and click ‘OK’.
The result is that Excel will filter out three rows and leave two, one for each of the occurrence numbers in the data set.
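The same column-filtering idea can be sketched in Python by matching on a key built from only the chosen columns. The occurrence numbers and addresses here are hypothetical:

```python
# Matching on a subset of columns: keep the first row seen for each
# combination of the key columns, ignoring the rest.

def remove_duplicates_by(rows, key_indices):
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[i] for i in key_indices)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    ("GO-2024-001", "123 Main St", "Theft"),
    ("GO-2024-001", "125 Main St", "Theft"),
    ("GO-2024-002", "456 Oak Ave", "Assault"),
    ("GO-2024-002", "458 Oak Ave", "Assault"),
    ("GO-2024-002", "460 Oak Ave", "Assault"),
]

# Keying on column 0 alone is the scripted equivalent of unchecking the
# 'Address' and 'Type' boxes: three rows are dropped and two remain.
print(remove_duplicates_by(rows, key_indices=(0,)))
```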
Filtering Sequenced Data
One question you may have about filtering by specific columns is how Excel decides which rows to keep and which to delete. It’s actually pretty straightforward: Excel keeps the first row it finds and dumps the rest, and we can exploit this behaviour to achieve a helpful result.
Have a look at the data set below that lists three occurrences and a series of disposition shifts for each occurrence.
You’ll notice that over time the disposition of an occurrence shifts as new information comes in. For example, many occurrences start as a ‘Default’ but shift to Report To Follow (RTF) or Unfounded (UNF) as the incident evolves. This is important because when calculating crime stats it’s crucial to know the final disposition of an occurrence so that it can be properly reported (especially unfoundeds).
To get the last recorded disposition we can first sort the data by ‘Occurrence’ ascending and then by ‘DateTime’ descending. If we then highlight the data and use ‘Remove Duplicates’ (with only the ‘Occurrence’ column checked) we can exploit Excel’s programming to remove all rows except the first one for each occurrence number. This, very conveniently, leaves us with only the final dispositions for each occurrence number thanks to our clever sorting of the data.
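The sort-then-dedupe trick can be sketched in Python with two stable sorts followed by a keep-first pass. The occurrence history below is invented for illustration:

```python
# Hypothetical disposition history: (occurrence, datetime, disposition).
records = [
    ("GO-2024-001", "2024-01-01 09:00", "Default"),
    ("GO-2024-001", "2024-01-03 14:00", "RTF"),
    ("GO-2024-002", "2024-01-02 10:00", "Default"),
    ("GO-2024-002", "2024-01-05 16:00", "UNF"),
]

# Two stable sorts mirror the two-level Excel sort: DateTime descending
# within Occurrence ascending.  ISO timestamps sort correctly as strings.
records.sort(key=lambda r: r[1], reverse=True)
records.sort(key=lambda r: r[0])

# Keeping the first row per occurrence now yields the latest disposition.
seen = set()
final = []
for rec in records:
    if rec[0] not in seen:
        seen.add(rec[0])
        final.append(rec)

print(final)  # one row per occurrence, each carrying its final disposition
```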
Rolling Your Own
All of the above examples use Excel’s built-in ‘Remove Duplicates’ tool, but it is possible to manually identify duplicates. Why would you want to do this? Primarily because Excel deletes duplicate rows and doesn’t give you the opportunity to review them. Let’s look at an example using the data from the Filtering Sequenced Data example.
The first thing to do is sort the data based on the duplicate column. While it is possible to use multiple columns (which requires multiple sorts), this example will just filter on the ‘Occurrence’ column. Look at the screen shot below and you’ll notice that I added a new column and named it ‘Dupe’. Also notice that in the first cell under ‘Dupe’ (cell B2 in this example) I put a 0, while in the subsequent cells I used Excel’s IF function to check the current row’s occurrence number against the previous one. As can be seen in the function bar of the screen shot, the formula in cell B3 is =IF(A3=A2, 1, 0).
What this function is doing is examining the occurrence number in cell A3 and determining whether it is the same as the occurrence number in cell A2. Recall that the IF function evaluates statements (in this case whether A3=A2) and if the statement is TRUE it returns the second input (1 in this case) and if it is FALSE it returns the third input (0 in this case). With the statement in place in B3 I then filled the function down to B6 and Excel intelligently populated each cell of ‘Dupe’ as in the screen shot above.
You’ll notice that the IF functions have accurately identified which rows are duplicates by putting a 1 in the ‘Dupe’ column. Just like with Excel’s ‘Remove Duplicates’ tool, only subsequent records are flagged as duplicates. This means that if you sort the data by both ‘Occurrence’ and ‘DateTime’ you can obtain the same results as in the Filtering Sequenced Data section.
One thing to note, though: if you re-sort your data, the ‘Dupe’ column will break. This is because the IF function always checks the current row against the previous one, and if the occurrence numbers are not in the proper order the results won’t make any sense. To lock in the results, copy the ‘Dupe’ column and paste it back as values. This replaces the function calls with their respective 1s and 0s, and now you can sort the data just as you otherwise would. I occasionally use this method to first review the duplicates and then sort on ‘Dupe’ so that I can delete all the rows with a 1. The end result is a list of unique rows.
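The whole flag-and-review approach can be sketched in a few lines of Python. The occurrence numbers are hypothetical and, as with the spreadsheet version, the data is assumed to be pre-sorted:

```python
# Flag-and-review by hand: mark each row whose occurrence number matches
# the previous row's, mirroring the =IF(A3=A2, 1, 0) formula.
occurrences = ["GO-001", "GO-001", "GO-002", "GO-002", "GO-002"]  # pre-sorted

dupe = [0]  # the first row can never duplicate a previous one
for i in range(1, len(occurrences)):
    dupe.append(1 if occurrences[i] == occurrences[i - 1] else 0)

print(dupe)  # [0, 1, 0, 1, 1] -- review these flags before deleting

# After review, keep only the rows flagged 0, i.e. the unique rows.
unique = [occ for occ, flag in zip(occurrences, dupe) if flag == 0]
print(unique)  # ['GO-001', 'GO-002']
```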
Filtering duplicates is a fact of analytical life and it’s important to be proficient at culling excess data so that stats aren’t double counted. I hope this post was helpful in identifying the various ways to get a handle on dupes.