Performing a Chain Analysis in Excel

Chain Analysis is the name I’ve given to the act of finding sequential, potentially related phone calls in phone records. The job is to identify if a particular number called a second number and if that second number called a third number within a specific time frame, say 10 minutes. I call these sequences of calls a chain and it can be used to identify whether someone might be issuing orders or instructions via third parties. The analysis can be done in Excel and while it involves some moderately-complicated Excel functions, once the spreadsheet is set up, subsequent analyses can be done easily.

Getting Started

To start, the first thing you need to do is arrange your data properly. In the screen shot below I have arranged some sample data where column A lists the calling number, column B lists the called number and column C lists the date and time of the call. In order for this technique to work properly it is important that the data set be sorted on the date time column.

For simplicity sake (and ease of reading) I’ve replaced the phone numbers with letters but really any string will work in the analysis. Likewise, I only have 7 rows of data but you can have a whole bunch more and the analysis will work just fine.

While we’ll need a few extra columns for functions there is no reason you can’t keep additional data in the rest of the columns. As long as you sort everything properly the rest of your information will tag along as you perform the analysis.

chain_1

Notice in the list that there are some chains that are easily identified without resorting to a function. Notice how, on row 3, AA calls CC and then CC calls DD a minute later. Similarly, on row 6, GG calls AA and then a minute later GG calls BB. This is also a chain to be identified. The trick now is to write a function that is able to flag these rows so that we don’t have to rely on picking them out manually.

The Look-Ahead Column

To flag the row we’re first going to create a new column that, for each row, “looks ahead” at the rows coming up. What we want the function to do is look ahead to see how many subsequent rows fall within the time frame window we’re interested in. In the discussion above I mentioned 10 minutes so, for each row, we want to count how many subsequent rows occur within 10 minutes of the date and time of the current row. The screen shot below captures the function that I wrote for cell D2.

chain_2

The very first thing to notice is that the function is wrapped in curly braces like this: {<function>}. This means that the function is what’s known as an array formula rather than a traditional scalar formula. To enter an array formula you click in the cell like you normally would but instead of hitting enter when you’re done you need to hit ctrl-shift-enter (at the same time) to tell Excel that it should be considered an array formula.

This particular function needs to be used as an array formula because we are asking the function to compare each subsequent cell to the current one in the IF statement, so we’re comparing an array of values rather than just a single one. Also, when you’re typing the function into the cell, don’t include the curly braces, Excel will add those after when you correctly use ctrl-shift-enter to complete the array function.

Let’s examine the IF statement next. The inner most part of the function is a request to check and see if the difference between cell C3 and cell C2 is less than 600 (our 5 minute cut-off). Since the date and time values in the C column are in Excel time we need to convert them to seconds before we do the comparison, which is why I multiply them by 24 (hours) and then 3600 (seconds) to get the difference into seconds. In the IF statement, if the difference between C3 and C2 is less than 600, then we return a 1 and if not we return a 0. Now, the tricky part is to wrap your head around the idea that, because this is an array function, we can use the expression C3:$C$8-C2 to do all of the comparisons at the same time. This means that we are doing C3-C2, C4-C2, C5-C2, etc, all the way to C8-C2. For each one of them we do the x24 x3600 conversion and comparison against 600 seconds. For each one of them we are getting either a 1 or a 0. The power of the array function method is that we can do all of those calculations at the same time, in the same cell.

Have a look at the SUM function that wraps the IF statement. Knowing that this IF statement is multivalued makes the SUM function make more sense. The role of the SUM is now to sum up all of the 1s that were returned by each of the IF statements. Have a look at the value of D2, it’s 3. It’s 3 because the date/time in C3, C4 and C5, are both within 10 minutes (600 seconds) of the value in C2.

Array formulas can be filled down just like regular functions so I have filled down the rest of the rows with this function. The key thing to remember is that you only need to compare the current row to the all the subsequent rows, and so the important bit of the function, the C3:$C$8-C2 bit, will always have the current row in the C2 position and the next row in the C3 position. As an example, in cell D4 the function is =SUM(IF((C5:$C$8-C4)*24*3600<600,1,0)) . Note that for row 4, C4 is used to subtract and the range of cells to check starts at C5.

Figuring Out the Look Ahead Row Number

The next step in the analysis is create a new column to hold the end row value for the look ahead. If the values in column D tell us how many rows to look ahead then the values in column E tell us what row that actually is. Have a look at the function in cell E2 in the screen grab below.

chain_3

This function uses the ROW() function to first tell us what row we are currently in. Then the value of D2 is added to that to arrive at the final value. For cell E2 the value is 5 since the ROW() function returns 2 and the value of D2 is 3. What this means is that we only need to look ahead to row 5, from row 2, to find all the phone calls that occurred within the 10 minute window. Anything beyond row 5 is more than 10 minutes after the call in row 2. The end row is an important component of the next part of the analysis as we need to be able to specify exactly what cell range we should be looking at for determining chains.

Flagging the Chained Calls

Have a look at the screen grab for the final step of the analysis, the step where each row is flagged as part of a chain or not.

chain_4

The heart of the function is the use of the MATCH function to compare the value in column B, the Callee column, with the values in the Caller column for all the rows up to the end row identified in column E. In order to do that we need to use the INDIRECT function, which is used here to assemble a cell range using the ROW() function and the value from column E. Inside the INDIRECT is (“A” & (ROW()+1) & “:A” & E3). If you evaluate this we can see that ROW()+1 becomes 4 (since ROW() is 3 and 3+1 is 4) and the value in E3 is 5, which leaves (“A” & 4 & “:A” & 5). If you slam all that together (which is what the &s are for) you end up with A4:A5, which is the range of cells we want to test our MATCH against.

Something else you need to know: when MATCH can’t find a match it returns NA, which is why the entire MATCH call is wrapped in an ISNA function call. If MATCH returns an NA value, then ISNA will return true, otherwise it returns false. I then feed that true or false into an IF statement (which is only interested in evaluating things to true or false anyway). If the value in the IF is true, that is, MATCH returned NA, then IF prints nothing, “”, to the cell. However, if MATCH returns a value and therefore ISNA returns false, then the IF statement will print “CHAINED” in the cell. For good measure, I wrap the whole thing in another IF statement to check and see if the look ahead rows in cell D3 are greater than 0. If not, there is no way there can be a chain, so the function just prints the empty string “” and we skip the whole MATCH, ISNA, IF process.

So let’s have a look at the results the function spit out. We can see that two chains have been identified. The first is on row 3 where AA called CC and then CC called DD within 10 minutes. This is what the CHAINED in cell F3 is identifying. The second CHAINED is for the chain that begins when CC calls DD and then DD calls FF (on row 5) within 10 minutes. The take away here is that chains are identified on the row in which they begin so it’s up to you to follow the chains from the starting point where they are identified.

Bonus Chains

The above example looks at Caller-Callee chains but almost the same function can be used to flag Caller-Caller chains, the ones where a caller calls a number and then within the time interval the same caller calls another number. In the screen shot below Caller-Caller chains have been identified for the second row and the sixth row and the only difference between the functions in column F and column G is that the MATCH in the column G version is being done on the values in column A (e.g. A2) rather than column B (e.g. B2). All the same logic applies but the change in the matching column changes the chaining subject.

chain_5

As can be seen in the chart, row 2 is a Caller-Caller chain because AA calls BB and then within 10 minutes AA calls CC. Similarly, on row 6, GG calls AA then GG calls BB with 10 minutes. Just like with the Caller-Callee chains above, the CHAINED flag is listed on the row where the chain begins and it’s up to the analyst to follow the chain through.

Conclusion

While the functions in this post can seem a little hairy at the start I think, with a little exploration, that you can become comfortable enough with them to trust the results. While it’s true that a custom VBA function could be written to perform this analysis I think it’s important to find solutions that first attempt to use the built-in Excel functions before resorting to custom code. I hope this post has demonstrated how a fairly sophisticated kind of analysis can be performed with just the tools that come in standard Excel. Good luck!

The Art of Filtering Duplicates in Excel

Every day analysts export data from myriad sources in an effort to consolidate raw data into useable intelligence. In order to make good decisions it is important that the data be as accurate as possible and one aspect of accuracy is ensuring that our information contains no duplicate data. When working with only a couple dozen rows of data sussing out duplicates is easy but when you have thousands of rows of data visual inspection just doesn’t cut it and it becomes necessary to employ some automated techniques. In this blog post I’m going to detail some of the Excel tools and strategies available for handling duplicate data.

The Basic Example
The simplest example is the straight removal of duplicate rows. Below is a stripped down data set with 3 columns and 5 rows of data.

duplicates_0

Clearly the fourth and fifth rows are duplicates. To filter this data we’ll use Excel’s built in ‘Remove Duplicates’ tool that can be found on the ‘Data’ ribbon in the ‘Data Tools’ section (see screen grab below).

duplicates_1

To use the tool first highlight the data in spreadsheet and then click the ‘Remove Duplicates’ button and this will call up the ‘Remove Duplicates’ window. There are some options here but in the simple case, the default values are sufficient. Click the ‘OK’ button and Excel will delete the duplicate rows leaving only unique rows.

duplicates_2

Filtering on Columns
In the screen shot of the ‘Remove Duplicates’ window you can see that each of the data columns has a checkbox beside it. These checkboxes allow you to search for duplicates using only the checked columns as the matching criteria. This may seem a bit esoteric so an example will help demonstrate how useful filtering on certain columns is.

In the screen shot below is a date set that contains some occurrence numbers and addresses connected with the occurrences.

duplicates_3

Imagine that from this data set we want only the unique occurrence numbers and we don’t care about the street addresses. To get just the numbers again highlight all the columns and click the ‘Remove Duplicates’ button. But this time, when the ‘Remove Duplicates’ window comes up, unclick the ‘Address’ and ‘Type’ checkboxes and click ‘OK’.

duplicates_4

The result is that Excel will filter out three rows and leave two, one for each of the occurrence numbers in the data set.

duplicates_5

Filtering Sequenced Data
One question you may have about filtering by specific columns is how does Excel decide what rows to keep and what rows to delete? It’s actually pretty straightforward: Excel keeps the first row it finds and dumps the rest and we can exploit this behaviour to achieve a helpful result.

Have a look at the data set below that lists three occurrences and a series of disposition shifts for each occurrence.

duplicates_6

You’ll notice that over time the disposition of an occurrence shifts as new information comes in. For example, many occurrences start as a ‘Default’ but shift to Report To Follow (RTF) or Unfounded (UNF) as the incident evolves. This is important because when calculating crime stats it’s crucial to know the final disposition for an occurrence so that it can properly reported (especially unfoundeds).

To get the last recorded disposition we can first sort the data by ‘Occurrence’ ascending and then by ‘DateTime’ descending. If we then highlight the data and use ‘Remove Duplicates’ (with only the ‘Occurrence’ column checked) we can exploit Excel’s programming to remove all rows except the first one for each occurrence number. This, very conveniently, leaves us with only the final dispositions for each occurrence number thanks to our clever sorting of the data.

duplicates_7

Rolling Your Own
All of the above examples use Excel’s built-in ‘Remove Duplicates’ tool but it is possible to manually identify duplicates. Why would you want to do this? Primarily because Excel deletes duplicate rows and doesn’t given you the opportunity to review them. Let’s look at an example using the data from the Filtering Sequenced Data example.

The first thing to do is sort the data based on the duplicate column. While it is possible to use multiple columns (which require multiple sorts) this example will just filter on the ‘Occurrence’ column. Look in the screen shot below and you’ll notice that I added a new column and named it ‘Dupe’. Also notice that in the first cell under ‘Dupe’ (cell B2 in this example) I put a 0 while in the subsequent cells I used Excel’s IF function to check the current row’s occurrence number against the previous one. As can be seen in the function bar of the screen shot the formula in cell B3 is =IF(A3=A2, 1,0).

duplicates_8

What this function is doing is examining the occurrence number in cell A3 and determining whether it is the same as the occurrence number in cell A2. Recall that the IF function evaluates statements (in this case whether A3=A2) and if the statement is TRUE it returns the second input (1 in this case) and if it is FALSE it returns the third input (0 in this case). With the statement in place in B3 I then filled the function down to B6 and Excel intelligently populated each cell of ‘Dupe’ as in the screen shot above.

You’ll notice that the IF functions have accurately identified which rows are duplicates by putting a 1 in the ‘Dupe’ column. Just like with Excel’s ‘Remove Duplicate’ tool only subsequent records are flagged as duplicate. This means that if you sort the data by both ‘Occurrence’ and ‘DateTime’ you can obtain the same results as in the Filtering Sequenced Data section.

One thing to note though, if you re-sort your data the ‘Dupe’ column will break. This is because the IF function always checks the current row against the previous one and if the occurrence numbers are not in the proper order the results won’t make any sense. To lock the results in you should copy and paste as values the ‘Dupe’ column. This will replace the function calls with their respective 1s or 0s and now you can sort the data just like you otherwise would. I occasionally use this method to first review the duplicates and then sort ‘Dupe’ so that I can delete all the rows with 1. The end result is a list of unique rows.

Wrap Up
Filtering duplicates is a fact of analytical life and it’s important to be proficient at culling excess data so that stats aren’t double counted. I hope this post was helpful in identifying the various ways to get a handle on dupes.