Performing a Statistical T-Test in Excel

The value of the t-Test is that it can tell you whether a difference you are seeing between two sets of measurements is legitimate or whether it is likely just the product of randomness. In this blog post I am going to use the t-Test to determine if the number of daily motor vehicle collisions during the winter is meaningfully different than the number during the summer. While this is kind of a no-brainer example it will help illustrate the value of the t-Test and give you some ideas on how you might apply it to add some statistical rigour to your analytical findings.

Before we get into the test let me provide some background to help explain what it is we’re trying to do. In statistics we are always comparing two hypotheses. The first is named the null hypothesis and the second is named the alternate hypothesis. We almost always want to reject the null hypothesis and accept the alternate one because that situation means something interesting has occurred. For our motor vehicle collision example we want to determine if there is a meaningful difference in the number of daily motor vehicle collisions during the first quarter of the year (i.e. the winter) versus the third quarter of the year (i.e. the summer). Now, you may be thinking “obviously there’s a difference, the weather” but let’s, for the sake of education, assume that we need to demonstrate statistically that there is a difference. In this study the null hypothesis would be “there is no difference between the quarters, the number of collisions is the same” while the alternate hypothesis would be “yes, there is a statistically significant difference between the two quarters”.

The term “statistically significant” needs some explanation. When we want something to be statistically significant we want to be able to say, with a particular level of confidence, that the results we are seeing are not just due to chance. First we pick our “confidence level”, or how sure we want to be, and then, through the magic of statistics, we are provided a number that our results have to beat so that we can be that sure. In this example we’re going with the 95% confidence level, which means that if we find a difference between the average number of MVCs during the winter and summer we want to be 95% sure it’s not due to chance. Or, in other words, we are willing to wrongly reject the null hypothesis only 5% of the time. We can up the confidence level to 99% if we want but, as we’ll see, we’ll need to meet even stricter conditions. Whatever we choose, just remember that the “significant” part in statistically significant doesn’t mean “important” (as significant typically means in everyday conversation), it just means “not due to randomness”.

How do we go about demonstrating that the two quarters have real, non-random differences in the number of daily motor vehicle collisions? The first thing we do is randomly select 30 different days’ worth of MVC counts from each of the first and third quarters. The screen grab below shows my randomly selected MVC data.

 ttest_0

Notice that there are two columns of data, one marked Winter and one marked Summer, and each has 30 entries. Also notice that at the bottom of each column is a number labeled “Mean”. In those cells I have used Excel’s AVERAGE function to find the mean of all the numbers (add them all up, divide by 30). And behold, the means are different. That proves the quarters are different, right? Not quite. We still have to deal with the peskiness that is statistical significance. It’s possible, after all, that those means are different just through chance. As discussed above we want to be 95% sure that they aren’t.
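If you want to reproduce those “Mean” cells yourself, the calls are simply the following (the Winter data sits in B2:B31 and the Summer data in C2:C31 on my sheet):

=AVERAGE(B2:B31)
=AVERAGE(C2:C31)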

To prolong your agony I’m actually going to show you two ways to perform the t-Test: the (relatively) quick way using Microsoft’s Data Analysis ToolPak Add-In and the longer, manual way that uses built-in Excel functions.

First, using the Add-In. 

The Analysis ToolPak is a free Add-In provided by Microsoft as part of the default installation of Excel. Add-Ins are a kind of Excel file, typically with a .xlam extension, that package together a bunch of related functionality. To get started with the Analysis ToolPak Add-In we need to check whether it is already installed: click the “Data” tab in Excel and look at the far right. If you see “Data Analysis” as an option, it’s already active (see screen shot).

advanced_1

If you don’t see it that means we need to turn it on. First, click the green “File” tab (or the Excel jewel if you’re in 2007) to call up the Save/Open/Close menu. Look for the option named “Options” and click it. This will open up a window with a menu on the left. Click on the menu option named “Add-Ins” on the left hand side. The screen will change and at the bottom, beside the word “Manage”, will be a select box and a button labeled “Go”. Click the “Go” button to open yet another window and you will see a series of checkboxes. Fingers crossed that one of them says “Analysis ToolPak” and is unchecked. Click the checkbox and click OK. Now go back to the Data tab, look at the far right and you’ll see “Data Analysis”. You’re set.

Now that the ToolPak is installed click the “Data Analysis” button and a modal window will open up. Click “t-Test: Two-Sample Assuming Unequal Variances” in the list and click the “OK” button and this will open up a new window (check the screen shot below) with some options. For the “Variable 1 Range” click the little arrow button and select all of the “Winter” data, which is in B2:B31. For “Variable 2 Range” select all the “Summer” data from C2:C31. For the “Hypothesized Mean Difference” enter 0 (the null hypothesis is that the means are the same, so the hypothesized difference between them is 0) and everything else can stay the same. However, notice that there is a field named “Alpha” that has the value 0.05. It’s not a coincidence that 0.05 = 1 – 0.95. Alpha is another way of asking how sure we want to be.

ttest_1

Click the “OK” button and Excel will open up a new worksheet populated with a bunch of labels and values just like the one in the screen shot below.

t-test-new-1

So what are we looking at? Well, there are the means we saw. Next come the variances, which are measures of how spread out the values are, and they are quite different (which is why we needed to use the test that assumes unequal variances). A little further down we see our “t Stat” and it has a value of about 4.35. Skip down two lines to the value labeled “t Critical one-tail” and note that it is about 1.68. That’s good news for us because the rule is that if your “t Stat” is larger than your “t Critical” value then the null hypothesis can be rejected and our results are therefore statistically significant. This means that we can now state, with the power of math backing us up, that we are 95% sure that the daily average number of MVCs in the winter is greater than the daily average number of MVCs in the summer.
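As an aside, if you ever want a quick cross-check without walking through the ToolPak dialog, Excel’s TTEST worksheet function (T.TEST in newer versions) returns the one-tailed p-value for the same unequal-variance test. This isn’t part of the ToolPak output, just a shortcut I sometimes use:

=TTEST(B2:B31, C2:C31, 1, 3)

The third argument asks for a one-tailed test and the fourth asks for the two-sample unequal variance version. If the result comes back below 0.05 you have reached the same conclusion: reject the null hypothesis.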

Second, the manual way. 

So we used a built-in tool to do a bunch of heavy lifting for us (as tools should) and we read an answer off a table. It works, but how did it work? How can we calculate these numbers if the ToolPak isn’t around but we still, for some reason, have access to Excel?

First, have a look at this screen shot because it will help to explain each of the steps I used to perform a manual unequal variance t-Test.

t-test-new-2

On the left are the two original columns of data, one for winter and one for summer. Notice that there are 30 entries for each season but that I used “freeze panes” so that I can show the bottom of the data (starting in row 32). For both seasons I calculated three additional variables: the mean (using Excel’s AVERAGE function), the standard deviation (using Excel’s STDEV function) and the count of the number of data points (using Excel’s COUNT function). These will all be needed in the coming calculations.
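For reference, the three formulas for the winter column look like this (the summer column just swaps in C2:C31):

=AVERAGE(B2:B31)
=STDEV(B2:B31)
=COUNT(B2:B31)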

On the right hand side of the screen shot I produced my own little table and I’m going to walk through each of these variables and explain the thinking behind them.

The first needed variable is the Sum of the Squared Deviations, for which I used Excel’s DEVSQ function as a short cut. Think of the variable like this: you have a mean value for a season and you have 30 different data points that are either going to be a little bit more or a little bit less than that mean (as data points usually are). For each data point, take the difference between it and the mean and square it. Now sum all of those squared differences up and that’s the sum of the squared deviations (or differences). Notice that I added the sum of the squared deviations for the winter and summer seasons together into a single value.

Why did we need the sum of the squared deviations? Because we want to calculate the pooled sample variance. Why pooled? Because we put both the winter and the summer values together. But what’s variance? The average of the squared differences from the mean. This is easy to calculate: just divide the sum of the squared deviations by the number of points we have in our two seasons, minus 2 (1 for each season) to account for Bessel’s correction, which corrects for the bias in the estimation of the population variance (just trust me, subtract 2, Excel does).

The next variable is the Standard Error of Difference in Means and to calculate it we take the square root of the pooled sample variance times the sum of one over the counts for the winter and summer data points (i.e. 1/30 + 1/30). But what is the standard error of difference in means? It’s the average difference we would expect to see between the means of two samples of this size. We’re trying to determine, after all, if two means are significantly different from one another and, given the size of the two sample groups, we should expect that there would be some difference. Now, imagine that we ran this little study over and over again with other, randomly selected groups of MVCs; each pair would have its own difference between the means. That sounds like a lot of work, so instead of running the study over and over again we can use the formula for the standard error of difference to answer the question: “what’s the expected average of that difference?”.

But what do we need that for? To calculate our t-statistic. And to perform that calculation we just divide the actual difference in the means (winter mean minus summer mean) by the standard error of difference in means to end up with approximately 4.35, just like the ToolPak did.
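If you want to follow along in your own sheet, the whole chain can be written as four formulas. I’m parking them in column F purely for illustration, so treat the cell addresses as my assumption rather than a description of the screen shot:

F2 (sum of squared deviations): =DEVSQ(B2:B31)+DEVSQ(C2:C31)
F3 (pooled sample variance): =F2/(COUNT(B2:B31)+COUNT(C2:C31)-2)
F4 (standard error of difference in means): =SQRT(F3*(1/COUNT(B2:B31)+1/COUNT(C2:C31)))
F5 (t-statistic): =(AVERAGE(B2:B31)-AVERAGE(C2:C31))/F4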

But we’re not done, we still need to determine our critical t-score at the 95% confidence level. To do that we cheat a bit and use Excel’s built-in TINV function. The first input is the probability that we want to test, in this case 0.05 (i.e. 1-0.95), or the alpha from the ToolPak. The eagle-eyed will note that in the calculation in the screen shot 0.05 is actually multiplied by 2 and this is because we’re interested in replicating the “one-tailed” result from the ToolPak; since TINV is “two-tailed” we need to double the probability to account for that.

TINV also takes a second parameter, degrees of freedom, and usually degrees of freedom is pretty straightforward (it’s just the sum of the two counts, minus 2, like for the pooled variance) except when we’re dealing with an unequal variance scenario. In order to handle this scenario we need to break out the more elaborate calculation defined by Moser and Stevens 1992 and spelled out in the ugly Excel formula:

=(1/B34 + F8/C34)^2/(1/(B34^2*(B34-1)) + F8^2/(C34^2*(C34-1)))

Where the value in cell F8 is calculated as =C32^2/B32^2, or the ratio of the squared standard deviations for the two seasons. God, it’s a mess, I know. But dissect the formula and you’ll see that it’s really just the counts for the two seasons, used in multiple places. One thing you may notice is that the resulting degrees of freedom is not an integer: we get the approximate value 48.58 while the ToolPak results above show 49 degrees of freedom. This is because Microsoft rounds. You can choose to round or not, the results are pretty much the same.

With degrees of freedom in hand we can finally run the TINV function and when we do we get a critical t-stat of 1.68, just like the ToolPak (although the ToolPak number is slightly different further into the decimals because I didn’t round my degrees of freedom). And just like with the ToolPak, our t-stat is way bigger than that critical t-value so we can reject the null hypothesis and conclude that we have statistically significant results at the 95% confidence level. Yes!
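For completeness, the critical value call is just the following (I’ve typed the degrees of freedom in directly rather than pointing at a cell; in Excel 2010 and later =T.INV.2T(2*0.05, 48.58) gives the same answer):

=TINV(2*0.05, 48.58)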

And with that we are finally done with our odyssey of manually calculating the unequal variance t-Test. We’ve calculated the same results as the ToolPak and we got right into the nitty-gritty of it. Well done.

Conclusion

If you made it this far (and didn’t just skip the manual section) then bravo, that was a real grind. Either way, I hope you learned something about the t-statistic and how it can be used to add some rigour to pronouncements about means. I do, somewhat unfortunately, find that a lot of the time the t-stat plays spoiler in that decision makers see a difference in averages and conclude that something is working when the reality is that, statistically speaking, there is no significant difference. Nevertheless, it’s important to be able to back up your analysis with some math and t-stats are a great way to provide that oomph. Good luck.


A Quick and Dirty Phone Toll Technique Using Excel Pivot Tables

Analysts are often asked to perform phone toll analysis, which is the act of analyzing call records to see who called whom, when and how frequently. There are commercial tools for performing such an analysis but I’ve found Excel can easily get the smaller jobs done. In a typical phone toll analysis you have some numbers of interest, suspect numbers or the numbers of known associates, that you want to study in depth. The technique I am going to demonstrate below is not applicable to that kind of analysis. Rather, I am going to detail how to use Excel pivot tables to perform a different kind of phone toll analysis useful for when you are interested in number discovery.

The scenario is as follows: there have been three incidents with the same MO and it is likely that the same suspects are committing the crimes. You don’t have any suspect descriptions but you do have the exact location and time of each of the incidents. Working on a hunch that the suspects are using their phones during or shortly after the crimes your detectives secure a warrant for all cell activity in each of the three incident locations for each of the three occurrence intervals from all the major cell carriers. They then email you this data and ask you to do an analysis.

The first step is to sort out the data. Every carrier organizes their data differently but the common elements between them all are called number, calling number and the date and time of the call (sometimes combined as date time). To normalize the data first create a new spreadsheet with columns for date, called number, calling number, carrier and location. These last two will need to be manually added to the spreadsheet based on data from the carrier and are used to differentiate the data between carriers. The intention of this step is to create a single ‘super sheet’ that combines all the calls made across all the carriers into one big list.

The stripped down super sheet in the screen grab below lists only 12 calls across three carriers but it’s quite possible that in a real life example you could be looking at thousands upon thousands of calls (depending on the size of your time interval and the size of your cell phone market).

toll_1

After normalizing the data the second step is to make two copies of the super sheet (leaving the original as a source file). In the first copy, delete the ‘Called’ column and in the second copy delete the ‘Calling’ column. Finally copy all the data from the second sheet and paste it at the bottom of the data in the first copy and rename the column with the phone numbers in it as ‘Number’. The point of this step is to wipe out the difference between called and calling number since, when we’re fishing for common numbers, we don’t want to just rely on one or the other in case during one incident the suspect was the caller but in another the suspect was the called number.

In the screen shot below I pasted the content from the ‘Calling’ sheet at the bottom of the data in the ‘Called’ sheet and, even though you can’t see the column name in the shot, I renamed column 2 to ‘Number’.

toll_2

The third step is to create a pivot table from the combined work sheet with the single ‘Number’ column. Highlight the whole thing and insert a pivot table using the PivotTable button under the ‘Insert’ tab (see screen shot below).

toll_3

In the new pivot table work sheet drag the ‘Number’ column to the ‘Row Labels’ box and the ‘Location’ field to the ‘Column Labels’ box. Finally, drag the ‘Date’ field to the ‘Values’ box to get something similar to the screen shot below.

toll_4

What you’ll likely see (depending on how many source numbers you have) is a very long pivot table with a lot of blank spaces. What you need to do is copy the entire table and paste its values into a new worksheet so that it can be manipulated. After pasting, highlight the data and apply a data filter so that each column has the drop down arrows for filtering out values.

toll_5

Now it’s time to search for common numbers. If, as we hope, the same crew committed all the crimes and used their phones during the different incidents then there should be numbers that have non-zero values in more than one column. In order to more easily identify those numbers use the filter drop down on the first location column (123 Main Street in the example) and deselect the (Blank) option. This should shrink the list considerably. Now, do the same thing on the second column by deselecting the (Blank) option. Any phone numbers that remain were used in both locations during the incidents.

toll_6

By re-selecting (Blank) for the second column and de-selecting (Blank) in the third column you can check for common numbers between the first and third locations as well. And finally, by re-selecting (Blank) in all the columns and then de-selecting it for columns two and three you can find common numbers for those two addresses. You only need to do those three combinations as all other combinations (e.g. three then two, three then one, etc.) are just reversals of what’s already been done and you’ll get the same results.
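If you would rather use a formula than cycle through the filters, one option (a sketch that assumes the pasted pivot values have the phone numbers in column A and the three location columns in B through D, starting in row 2) is to add a helper column that counts how many locations each number appears at:

=COUNT(B2:D2)

Fill that down and then filter the helper column for values of 2 or more to see every number that shows up at more than one scene.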

By noting down what numbers are common and then applying a filter to the original super sheet that maintained the called/calling split it is possible to find the carrier. You can then examine the original carrier documents to get more details and provide your detectives with a list of numbers they can follow up on.

toll_7

One note: try googling the common numbers that you find before you send them off to your detectives. Often common numbers are things like the carrier’s voice mail number or a public service phone number for the municipality. The good thing is that if the number is one that the general public might use it will likely be heavily promoted on websites and show up near the top of search results. By pre-identifying common numbers you can save your detectives the potential embarrassment of asking a carrier for details on their voice mail service.

The Art of Filtering Duplicates in Excel

Every day analysts export data from myriad sources in an effort to consolidate raw data into usable intelligence. In order to make good decisions it is important that the data be as accurate as possible and one aspect of accuracy is ensuring that our information contains no duplicate data. When working with only a couple dozen rows of data sussing out duplicates is easy, but when you have thousands of rows visual inspection just doesn’t cut it and it becomes necessary to employ some automated techniques. In this blog post I’m going to detail some of the Excel tools and strategies available for handling duplicate data.

The Basic Example
The simplest example is the straight removal of duplicate rows. Below is a stripped down data set with 3 columns and 5 rows of data.

duplicates_0

Clearly the fourth and fifth rows are duplicates. To filter this data we’ll use Excel’s built in ‘Remove Duplicates’ tool that can be found on the ‘Data’ ribbon in the ‘Data Tools’ section (see screen grab below).

duplicates_1

To use the tool first highlight the data in the spreadsheet and then click the ‘Remove Duplicates’ button and this will call up the ‘Remove Duplicates’ window. There are some options here but in the simple case the default values are sufficient. Click the ‘OK’ button and Excel will delete the duplicate rows, leaving only unique rows.

duplicates_2

Filtering on Columns
In the screen shot of the ‘Remove Duplicates’ window you can see that each of the data columns has a checkbox beside it. These checkboxes allow you to search for duplicates using only the checked columns as the matching criteria. This may seem a bit esoteric so an example will help demonstrate how useful filtering on certain columns is.

In the screen shot below is a data set that contains some occurrence numbers and addresses connected with the occurrences.

duplicates_3

Imagine that from this data set we want only the unique occurrence numbers and we don’t care about the street addresses. To get just the numbers again highlight all the columns and click the ‘Remove Duplicates’ button. But this time, when the ‘Remove Duplicates’ window comes up, uncheck the ‘Address’ and ‘Type’ checkboxes and click ‘OK’.

duplicates_4

The result is that Excel will filter out three rows and leave two, one for each of the occurrence numbers in the data set.

duplicates_5

Filtering Sequenced Data
One question you may have about filtering by specific columns is how does Excel decide what rows to keep and what rows to delete? It’s actually pretty straightforward: Excel keeps the first row it finds and dumps the rest and we can exploit this behaviour to achieve a helpful result.

Have a look at the data set below that lists three occurrences and a series of disposition shifts for each occurrence.

duplicates_6

You’ll notice that over time the disposition of an occurrence shifts as new information comes in. For example, many occurrences start as a ‘Default’ but shift to Report To Follow (RTF) or Unfounded (UNF) as the incident evolves. This is important because when calculating crime stats it’s crucial to know the final disposition for an occurrence so that it can be properly reported (especially unfoundeds).

To get the last recorded disposition we can first sort the data by ‘Occurrence’ ascending and then by ‘DateTime’ descending. If we then highlight the data and use ‘Remove Duplicates’ (with only the ‘Occurrence’ column checked) we can exploit Excel’s programming to remove all rows except the first one for each occurrence number. This, very conveniently, leaves us with only the final dispositions for each occurrence number thanks to our clever sorting of the data.

duplicates_7

Rolling Your Own
All of the above examples use Excel’s built-in ‘Remove Duplicates’ tool but it is possible to manually identify duplicates. Why would you want to do this? Primarily because Excel deletes duplicate rows without giving you the opportunity to review them. Let’s look at an example using the data from the Filtering Sequenced Data example.

The first thing to do is sort the data based on the duplicate column. While it is possible to use multiple columns (which require multiple sorts) this example will just filter on the ‘Occurrence’ column. Look in the screen shot below and you’ll notice that I added a new column and named it ‘Dupe’. Also notice that in the first cell under ‘Dupe’ (cell B2 in this example) I put a 0 while in the subsequent cells I used Excel’s IF function to check the current row’s occurrence number against the previous one. As can be seen in the function bar of the screen shot the formula in cell B3 is =IF(A3=A2, 1,0).

duplicates_8

What this function is doing is examining the occurrence number in cell A3 and determining whether it is the same as the occurrence number in cell A2. Recall that the IF function evaluates statements (in this case whether A3=A2) and if the statement is TRUE it returns the second input (1 in this case) and if it is FALSE it returns the third input (0 in this case). With the statement in place in B3 I then filled the function down to B6 and Excel intelligently populated each cell of ‘Dupe’ as in the screen shot above.

You’ll notice that the IF functions have accurately identified which rows are duplicates by putting a 1 in the ‘Dupe’ column. Just like with Excel’s ‘Remove Duplicate’ tool only subsequent records are flagged as duplicate. This means that if you sort the data by both ‘Occurrence’ and ‘DateTime’ you can obtain the same results as in the Filtering Sequenced Data section.

One thing to note though: if you re-sort your data the ‘Dupe’ column will break. This is because the IF function always checks the current row against the previous one and if the occurrence numbers are not in the proper order the results won’t make any sense. To lock the results in, copy the ‘Dupe’ column and paste it back over itself as values. This will replace the function calls with their respective 1s or 0s and now you can sort the data just like you otherwise would. I occasionally use this method to first review the duplicates and then sort on ‘Dupe’ so that I can delete all the rows with a 1. The end result is a list of unique rows.
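As an alternative sketch (not something shown in the screen shots above), a COUNTIF with an expanding range flags second and later copies even when the duplicates aren’t sitting next to each other, which takes some of the pressure off getting the sort exactly right. Put this in B2 and fill it down:

=IF(COUNTIF($A$2:A2, A2)>1, 1, 0)

It still pays to paste the column as values before you start deleting rows.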

Wrap Up
Filtering duplicates is a fact of analytical life and it’s important to be proficient at culling excess data so that stats aren’t double counted. I hope this post was helpful in identifying the various ways to get a handle on dupes.

Excel’s VLOOKUP Function for Crime and Intelligence Analysts

VLOOKUP is one of those functions that causes a lot of grief for analysts. I think it’s because the function is kind of complex to implement and most people don’t use it often enough to commit to memory exactly how it works. In this blog post I am going to explain exactly how to use VLOOKUP with the aid of two analyst-orientated examples.

An Introductory Example

At its core VLOOKUP (the V stands for vertical, by the way) is used for looking up one value based on another … vertically. To demonstrate what I mean have a look at the screen shot below in which a list of Uniform Crime Reporting (UCR) codes (column A) is listed side-by-side with the description of the crime (column B).

vlookup_1

Together column A and column B constitute a “look up table” because you can use a known value in column A to find the corresponding value in column B (and vice versa). But why do we need look up tables? We need them because it’s likely that when you export a list of crime stats from your database you’ll see the codes rather than the descriptions. And while the UCR code is shorter and easier to manage there will eventually come a time when you want to replace the code “1110” with the friendlier “Murder 1st Degree” (maybe when you are preparing a final report to send around) and that’s when you use VLOOKUP to look up the description based on the code.

The key thing to remember about VLOOKUP is that when you’re using it you always have two different sets of data: the first is the data you are working on and the second is the lookup table. The screen shot below shows a sheet with both sets of data; the working data is on the left and the lookup table is on the right in grey. Notice that the data being worked on has an occurrence number and a UCR code but no Crime.

vlookup_2

But how to use it? Have a look at the above screen shot again, specifically cell C2, and note that the function in that cell is =VLOOKUP(B2, $D$1:$E$16,2,FALSE). Let’s break that function down into pieces.

  • For the first input to VLOOKUP, we’re instructing Excel to use the value in cell B2 (the known UCR code) to find the description to put into cell C2.
  • For the second input we’re telling VLOOKUP to look for the value of B2 in the cell range $D$1:$E$16, which is the entire range occupied by the lookup table. (As an aside, you’ll notice that the range values all have dollar signs in front of each letter and number. This is because when we fill this function down we don’t want Excel to act ‘intelligently’ and start shifting the references. Using the dollar sign locks the reference so that no matter how we fill it, it will always point to the same set of cells.)
  • For the third input we use the value 2. Notice that our look up table has only two columns (D and E) in this example, one that is the code and one that is the crime. Since we already know the code (from column B) what we want is the description and since the description is in the second of the two columns in the lookup table we specify that we want the result to come from column 2.
  • The final input to the function is ‘FALSE’. By entering FALSE we are telling Excel that we want an exact match. That means that if Excel can’t find UCR code 1110 it’s going to return #N/A. There are situations where you might not want exact matches but they are less common and sticking with FALSE will get you what you want 90% of the time.

With the function input it’s just a matter of filling the function down to populate all of the cells in column C with the crime description based on the UCR code. If you click on any of the other cells in column C, as I did in the example below, you’ll see that the lookup table, locked as it is by the dollar sign $ notation, is always referencing the same range of cells. I bring this up again because the VLOOKUP mistake I see most often is that people don’t lock the range and get weird results when they fill the formula down. Make sure to use those dollar signs!

vlookup_3
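To make the locking concrete, here is what the first few cells of column C contain after the fill (only the B reference shifts; the lookup range never moves):

C2: =VLOOKUP(B2, $D$1:$E$16, 2, FALSE)
C3: =VLOOKUP(B3, $D$1:$E$16, 2, FALSE)
C4: =VLOOKUP(B4, $D$1:$E$16, 2, FALSE)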

A More Advanced Example

While the example above is the most common usage of VLOOKUP I frequently use it for another purpose: to compare two lists of things (e.g. occurrence numbers, suspect names) to determine if the lists are the same and highlight those elements that are different. The following screen shot demonstrates how I lay out my Excel file to perform this work.

vlookup_4

The two columns of occurrence numbers in columns A and C are almost the same and in fact they only differ by 3 values. To identify those three values we can use VLOOKUP to look up one set of occurrence numbers against the other and see what happens. The above screen shot shows the function in cell B5 to be =VLOOKUP(A5, $C$2:$C$20, 1, FALSE). Just like before we are checking each value against a table of values but this time it just so happens that the lookup table only has one column (instead of the two above) and that’s why we have a “1” as the third input.

If you fill the function down the occurrence numbers get filled in and, where VLOOKUP can’t find a match in column C for the cell in column A, Excel outputs #N/A. That last sentence may seem overly precise but it is meant to illustrate an important point: the VLOOKUP detailed above is only checking the values in A against the look up table and therefore will only catch those values in A that aren’t in C. To get the full picture you also need to perform the reverse lookup, that is, catch the values that are in C but not in A. To do that you can put the function call =VLOOKUP(C2, $A$2:$A$20, 1, FALSE) in cell D2 and fill down. Now, just like before, we’ll see matching occurrence numbers if the values are in both columns and #N/As for those values that are in C but not in A. Taken together these two VLOOKUPs will help identify how the two lists differ.

Getting Even More Advanced

One more thing, I really don’t like Excel’s blaring #N/A error value. It’s ugly. Also, I don’t know about you, but I don’t really care about when the occurrence numbers match, I’m more interested when they don’t match and I’d prefer not to see anything for the positive matches. Let’s see what we can do about fixing this up.

To accomplish this, we’ll need to use the VLOOKUP function in conjunction with another Excel function, ISERROR, that we haven’t talked about. I’m not going to pretend this isn’t getting complicated, because it is, but I’m doing it anyway. In cell B2 I replace the old VLOOKUP function call with this new one:

=IF(ISERROR(VLOOKUP(A2, $C$2:$C$20, 1, FALSE)), "DIFF", "")

I know that looks intimidating but let’s break it down. The part about VLOOKUP we already know, it’s the same as above, but what we’re doing now is feeding the output of VLOOKUP to the function ISERROR. ISERROR’s only purpose is to return a TRUE or FALSE based on whether an error occurs within the VLOOKUP and, as we know, 3 errors occur, one for each of those #N/As. For each of those three #N/As the ISERROR function will return TRUE and for all the rest it will return FALSE. This behaviour is very handy because the IF function checks for TRUE or FALSE when trying to figure out what to do. In this case, if it gets a TRUE from ISERROR (that is, it’s true that there was an error) then it prints the word “DIFF” but if it gets a FALSE, it doesn’t print anything because there was no error.

I know that’s a brainful but have a look at the screen shot below and you can see that each of the cells that was previously an #N/A is now a “DIFF”, because those are the ones that caused a VLOOKUP error that in turn caused ISERROR to be TRUE which caused the IF function to print the value for TRUE. All the rest, the ones without VLOOKUP errors, display nothing, just as I wanted.

vlookup_5
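A close cousin worth knowing about (not shown in the screen shots, just an alternative I sometimes reach for) is ISNA, which behaves like ISERROR but only reacts to #N/A, so the same logic can also be written as:

=IF(ISNA(VLOOKUP(A2, $C$2:$C$20, 1, FALSE)), "DIFF", "")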

Phew, that’s a lot of Excel-fu for a post that started out being just about VLOOKUP. I hope the post (or at least the first two parts) gives you the confidence to use VLOOKUP more often in your daily work. And I hope that the third part, as intimidating as it looks (it’s really not, just follow the logic of the TRUEs and FALSEs), will give you ideas about how you can use the function to accomplish more complex tasks. Good luck!

Understanding the Difference Between Mean and Median

As analysts we are often interested in averages. But an average is a tricky thing because the word average actually covers several related concepts. In this blog post I am going to discuss two different concepts of average – the Mean and the Median – and why it’s crucial to understand their differences.

The Mean is what we typically mean when we use the term average. The Mean is calculated by adding everything up and dividing by the number of items. The resulting value is supposed to represent the centre of all the values. People intuitively understand the Mean because a lifetime of being exposed to the concept of the bell curve (or the normal distribution as it is called in math circles) has taught them that the big peak in the middle of the curve is the average.

The Median is less well known but just as easy to conceptualize. Consider all of the items in your collection and sort them from smallest to largest. The Median is then the value of the item that sits in the middle of the sorted list. In other words, it’s the value of the item that splits the collection in half with an equal number of items above and below it.

When dealing with data that follows the bell curve, that is data that produces a symmetric and unimodal (single peak) distribution, the Mean and the Median have the same value and we don’t have any problems. The issue is that a lot of the data we’re likely to encounter in our day-to-day work (call response times, time spent on patrol, man-hours spent on an occurrence) is not accurately represented by a symmetric bell curve but instead follows a skewed distribution that is weighted more to one side than the other.

But what does this mean for averages? It means that for data that is positively skewed (the bulk of the data is on the left side of the chart) the Mean is going to be larger than the Median, and vice versa for negatively skewed data. The problem arises when the difference between the Mean and Median gets large enough to meaningfully change the stat that you are reporting.

For example, the number of man-hours spent on an occurrence is the kind of data that is going to be positively skewed. This is because of the nature of the data: 0 is the lowest value that can exist but the upper limit is not bounded. If you create a histogram of the data you’re likely to find a whole lot of occurrences with a small to medium amount of time being spent (the peak on the left) but also a few occurrences with a lot of time being spent (the long tail on the right). I’ve created the following simplified data set below to illustrate the scenario.

You can see in the screen shot that the occurrence that took 24 hours is clearly an extreme example but it has a significant impact on the Mean. If a Commander came to you and asked for the average number of man-hours spent on occurrences do you feel that the Mean of 8.8 would be representative, considering that only one occurrence actually took longer than that? Clearly the Median value of 5 hours is more representative of man-hours and is a more accurate average.

This result comes about because the way the Mean is calculated makes it sensitive to the inclusion of extreme values, while the Median, which only concerns itself with the middle value of the data set, is relatively indifferent to them and is therefore more likely to provide an appropriate number that is stable in the face of outliers.

The screen shot also conveniently illustrates the function calls used to calculate both the Mean and Median in Excel. Frustratingly, Microsoft has opted to use the function name AVERAGE for the Mean, further cementing the two terms as synonyms. Thankfully the Median is a straightforward call to the MEDIAN function.

You’ll notice a third function in the list: SKEW. I discussed positive and negative skewness above and the SKEW function is useful for coming to grips with the direction and magnitude of a data set’s skewness. In the occurrence man-hours example I state that the data is positively skewed and that’s reflected by the positive (greater than 0) value returned by the SKEW function, which indicates that the Mean will likely be larger than the Median. Conversely, if SKEW returns a value that is less than 0 you’ll know that the data is negatively skewed and that the Mean will likely be smaller than the Median. Finally, if the value is close to 0 your data set likely follows the bell curve or normal distribution and the Mean and Median will be nearly the same.
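In formula terms, and assuming the man-hours sit in B2:B11 (the exact range will depend on your own sheet), the three calls are simply:

=AVERAGE(B2:B11)
=MEDIAN(B2:B11)
=SKEW(B2:B11)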

The key takeaway from this post is that it is important to not just blindly take the average (by which I mean the Mean) of a stat without first considering how the data is distributed. So much law enforcement data is skewed one way or the other that it makes sense to take a minute and run the MEDIAN and SKEW functions in Excel to make sure you’re reporting the most appropriate number to your superiors.

IALEIA SWOC November Excel Workshop

I just got back from the November Excel Workshop put on by the South Western Ontario Chapter of the International Association of Law Enforcement Intelligence Analysts. I attended both to present and to learn and I had a great time doing both. I want to give a heartfelt thank you to Mick, Manny and Hazel for inviting me to present on Excel topics for law enforcement analysis and to all the analysts who took time out from their schedules to come and make the profession stronger. I hope we can do it again in the future.

At the workshop I delivered two presentations and I produced a PDF document of the material for each. The first, 30 Excel Functions All Analysts Should Know, is an introductory level document detailing useful Excel functions with examples and screen shots. The second PDF is titled Advanced_Functions_Add_Ins_and_VBA and is a mix of intermediate to advanced level Excel topics that specifically details how to use Excel’s FREQUENCY function, Microsoft’s Analysis ToolPak for t-Tests and how to create a simple User Defined Function in Excel. In the coming weeks I will expand on the exercises in the PDFs as blog posts and also blog about those sections that didn’t make the cut for the workshop.

Both lessons also had some supporting Excel files for performing the exercises: 30_Functions_Exercises and Advanced_Functions. Please note, all this data is fictional and is in no way sensitive. In the Advanced Functions workshop I make use of an Add-In named PatternUI.xlam and it can be downloaded from here: http://blogs.office.com/b/microsoft-excel/archive/2007/11/16/chart-pattern-fills.aspx

Once again, thanks to everyone that came out today for making the workshop a success.

Year-over-year Crime Stat Reporting with ISO Week Date

Does your organization report a weekly crime stat? Is part of that report a measure of how the week compares to the same week the year before? If so, it’s important that you understand how ISO week dates work so that your report offers an accurate comparison between this year and prior years. In this post I am going to first discuss what an ISO week date is and then I am going to explain how it helps create better crime stat reports.

First, what is an ISO week date? Hopefully you’re familiar with the International Organization for Standardization (ISO). They publish a lot of standards and ISO 8601 is the one that deals with a calendar system that gives each week of the year a number. For example, this article was written on October 21, 2012, which, in ISO week date notation, is written 2012 W42 7. Breaking the date down: the first number is the year, the second number is the week of the year (42 in this case) and the third number is the day of the week (the 7th day is a Sunday because the system states that Monday is the first day of the week). Most of the time a year has 52 weeks but sometimes it has 53, since 52 seven-day weeks only account for 364 days. According to Wikipedia the system is most often used in government and business for keeping fiscal years.

This is pretty straightforward to understand but the tricky part comes when determining the first week of the year, as the first week introduces slight discrepancies between the ISO system and the traditional Gregorian calendar that people are used to. To wit: ISO 8601 defines the first week as the week with the year’s first Thursday in it. Using 2012 as an example, the first Thursday of 2012 was January 5th and, recalling that the ISO week begins on Monday, this means that the first day of the ISO year, that is 2012 W01 1, was January 2nd, 2012. But what about January 1st, 2012? It was actually the last part of week 52 of 2011 or 2011 W52 7. Yes, I know, it’s weird, but I think its utility outweighs its weirdness.
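As a practical aside, recent versions of Excel will do the week arithmetic for you. In Excel 2010 and later the first formula below returns the ISO week number for the date in A2, and Excel 2013 added a dedicated ISOWEEKNUM function that does the same thing; on older versions you’ll have to build the calculation yourself.

=WEEKNUM(A2, 21)
=ISOWEEKNUM(A2)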

If you’ve made it this far in the post you’re probably thinking: “This seems overly complicated, why should this concern someone reporting crime stats?” A valid question. Consider a weekly crime report that covers October 8th through October 14th from 0000 hours to midnight. That’s a whole week and conveniently it is also week 41 of 2012. Now let’s say that for comparison our theoretical crime report also tabulates the stats for October 8th through October 14th for 2011. Is there a problem with this? Does it matter that the 2011 numbers run from October 8th (a Saturday in 2011) to October 14th (a Friday in 2011) instead of Monday to Sunday like it does in 2012? Is it enough that we capture one of each weekday in a weekly crime report? Is it relevant that the Saturday from 2011 is part of ISO week 40 while the Saturday from 2012 is from week 41?

I think this last point is key. Calls for service have a definite seasonal trend which means that for any particular day of the week, Saturdays for example, the calls for service will vary according to the week of the year. This means that, historically speaking, the number of calls for service on the Saturday in week 41 is likely going to be consistently different than the number of calls for service on the Saturday in week 40, and the same goes for every other day of the week. Basically, days are not created equal and that means that if you want to compare apples to apples for crime reports you should really compare equivalent time periods. The easiest way to do this is, you guessed it, by using ISO week dates and comparing week 41 from 2012 to week 41 from 2011. In our example, compare October 8th through October 14th 2012 to October 10th through October 16th 2011.

But is this reasonable? Why is the week the unit of analysis here? In non-leap years October 8th through 14th is always the 281st through 287th day of the year, isn’t there consistency in that that gets thrown out if we adopt the weekly approach? Perhaps, but I don’t think day-of-year consistency is more useful than the weekly kind. Most analysts recognize that calls for service have a distinct weekly pattern, with more calls for service on Friday and Saturday than on other days of the week, that is imposed upon the larger seasonal trend. By adopting the ISO week date system we can align reporting with this natural frequency found in the crime data and by locking into the frequency we can exploit it to make comparisons easier.

So what’s the bottom line? Because calls for service show a seasonal trend you should endeavor to compare the same weeks when performing a year over year analysis. And because calls for service show a weekly trend it makes sense to lock your reporting to a standard week that makes comparisons easy and straightforward.  The best way to accomplish both of these goals is to adopt the established ISO week date system for weekly crime reports.

Cumulative Distribution Function Charts in Excel

Most analysts know what a histogram is: it’s a chart where each piece of data is put into one of several bins and then the bin counts are plotted in order to gain insight into how the data is distributed. Histograms have an intuitive appeal because it’s easy to see how the data is distributed just by looking at the chart. Of course, they aren’t perfect. The whole binning process is pretty subjective and depending on how the bins are selected the results can look quite different, which can be deceptive. If only there was an alternative that could address these shortcomings…

Well, there is: histograms have an underutilized sibling known as the Cumulative Distribution Function (CDF) chart, which complements and, I think, improves on the histogram by doing away with some of the histogram’s problems and offering some useful innovations of its own. Namely, the ability to read off the chart what percentage of the data falls between two points, something that is not possible with a histogram. In this post I will discuss why I think you should make the CDF part of your analytical toolbox and also demonstrate how easy it is to create a CDF in Excel.

Convince me

First, two example charts that describe my dataset: a completely made up list of 30,000 calls-for-service and how long they last from initial call to the closing off of the incident. The first chart is a standard histogram and the second is the corresponding CDF. For both charts the horizontal axis is the same: it’s hours. For the histogram the vertical axis is a count of how many calls are in each bin while the vertical axis of the CDF is a measure of percentage (why percentage will become clear in a bit).

cdf_1

So the histogram is the same old boring story. We have a roughly bell-shaped distribution that is positively skewed (fatter on the left, with a long tail off to the right). Looks like most calls take about five hours and only a few go above 12 hours. Yawn.

cdf_2

Ok, the CDF looks great. But what’s it good for? What can it tell us that the histogram can’t? Remember that I mentioned you can use a CDF to determine what percentage of the data falls between two points? Let’s explore that now.

Suppose I want to know how many calls take less than 5 hours (that is, between 0 and 5 hours). I can read that right off the graph by going across the horizontal axis until I reach 5 and then going up until I hit the trend line. I then read across to the vertical axis and see that the value is about 60%. What this means is that 60% of all calls take 5 hours or less. That’s a much firmer number than ‘most calls are around 5 hours’. Similarly, what if the question was how many calls take between 5 and 10 hours? Well, first read off the 10 hour value, which is about 99%, and we know the 5 hour value is 60%, so just take the difference, 99 minus 60, to find that about 39% of all calls take between 5 and 10 hours. Let’s see you do that with a histogram (hint: you can’t do it).
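If you want to double-check a reading against the raw data rather than eyeballing the chart, a quick COUNTIF does the same job (this assumes the call durations sit in column A):

=COUNTIF(A:A, "<=5")/COUNT(A:A)

That returns the exact fraction of calls at five hours or under, which should line up with where the curve sits at the 5-hour mark.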

I’m convinced. How do I make one in Excel?

Creating a CDF is very straightforward and the following five steps describe the entire process. I’ve included a quick Excel screen grab to illustrate what the first few rows of the data will look like when you’re done (it’ll make more sense when you’re done the steps).

cdf_3

Step One: Start with your data in Column A in Excel sorted from smallest value to largest. My data had a lot of zero values, so that is why the column starts off with a series of zeros.

Step Two: In Column C (yes I intentionally skipped B) put the number 1 in C1, 2 in C2, 3 in C3, etc.  Here’s a useful hint to speed this up: use =C1+1 for C2 and fill down.

Step Three: In B1 enter how many data points you have. I had 30,000 so I dropped that number in there.

Step Four: For B2 use the formula =C2/$B$1 and fill down. The idea is that you want to divide the value in Column C by the total count. When you get to the bottom of your data the last value should be 1.

Step Five: Now that the data has been prepped highlight all of columns A and B and insert a new Scatter chart. What you’ll get is the monotonically increasing (i.e. always going up) trend line. You can edit the styling so that the graph is a line rather than points and make whatever other changes you need to make it look pretty. I recommend fixing the vertical axis maximum at 1 (since the data can never go above 100%) and the horizontal axis minimum at 0, if that’s appropriate. Voila, now you have a CDF chart.
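Put together, the helper cells from steps two through four look like this (I’ve swapped in a COUNT formula for the hand-typed 30,000 in B1, which is purely a convenience, not a requirement):

C1: 1
C2: =C1+1 (filled down to the last row of data)
B1: =COUNT(A:A)
B2: =C2/$B$1 (filled down; the final value should be 1)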

But how do CDFs address the histogram’s shortcomings? Did you notice any bins during the calculation? Nope. CDFs use the value of every data point and therefore do away with binning and the loss of information binning introduces. And because there are no bins the shape of the CDF can’t change like it can in a histogram so that source of subjectivity is neatly side-stepped. And you want to know what else is great? It’s easy to compare two data sets with CDFs by simply plotting them on the same chart. Comparing two histograms on the same plot is not particularly enlightening besides noting how the peaks compare.

In conclusion, CDFs are like grilled cheese: great and easy. And like grilled cheese everyone should try one at lunch (or whenever you next do a frequency analysis). The CDF is a great complement to traditional histogram analysis and will hopefully become a valuable tool in your graphic analysis toolbox.

An olive branch to histograms

Making a histogram is actually more work than creating a CDF. But if you really want to create one you can either use Microsoft’s Analysis ToolPak (what happened to the ‘c’ MS?) or perform some more elaborate Excel JuJu. Good luck.

Being Careful with Averages

In crime analysis we are used to calculating averages and when I say average I mean the “mean”—e.g. add up ten numbers and divide by ten—that kind of average. But in crime analysis we are also frequently interested in examining rates or ratios. For example: how many hours of patrol time are spent responding to calls. When we combine the two there is the potential to be tripped up since, if we are talking ratios, we need to be very careful about how we calculate an average. Depending on the approach we take we’ll get different numbers.

An example will help illustrate the problem.

Let’s say that your CAD system keeps track of how many hours an officer spends on patrol and, of that time, how much of it is spent responding to calls. The following table illustrates a small sample of this kind of data.

mean_1

From the data in the table I calculated a ‘busy-ness’ ratio for each officer; this is Call Time divided by Patrol Time. From those values I in turn calculated the average Busy-ness, which is 0.51, or 51% busy. But wait, let’s add up all the Patrol Time (78.6 hours) and all the Call Time (40.6 hours) and find the ratio. (40.6/78.6)=0.52 or 52% busy. These numbers are different because of the mathematical principle that says that the ratio of averages (0.52) does not necessarily equal the average of ratios (0.51). But which one is ‘correct’? Well, they both are, and the one you go with all depends on what question you are trying to answer.
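To make the two calculations explicit, here is how I would set them up (this assumes Patrol Time in B2:B9, Call Time in C2:C9 and the per-shift ratio in D2:D9; your columns may well differ):

D2: =C2/B2 (filled down, one busy-ness ratio per shift)
Average of ratios: =AVERAGE(D2:D9)
Ratio of the totals: =SUM(C2:C9)/SUM(B2:B9)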

For the ratio of averages (0.52) we are asking a question about the entire police service: “how much patrol time did we spend on calls?” For this question 52% is the correct answer because it takes into account the total time spent on patrol and the total time spent on calls. If an officer spends 3 hours on patrol or 12 hours, all of that is taken into account in the final number.

The average of ratios (0.51) is trickier as it answers a question about officers going on patrol: “how much of their patrol can an officer expect to spend on calls?” The difference here arises because the average of ratios treats all values contributing to the mean as having equal weight and this is a fine assumption when the question is focused on how the averages behave. In other words, the length of the shift doesn’t matter, only how much of the shift was spent on calls.

But why am I telling you this since the difference seems to be pretty small? It’s because there are situations where the difference can be significant. For example, look at the modified table below where by adjusting the patrol times I was able to increase the difference between the averages to 4%. Making a 4% mistake on something like officer busy-ness can be the difference between hiring more officers and not. That’s not a mistake I would want hanging around my neck.

mean_2

The take away from this post is to be mindful when taking averages because so many of the numbers crime analysts look at are ratios. Whether we’re looking at officer busy-ness or per capita car thefts or tickets issued per traffic blitz it is important to understand the question that the average is supposed to answer.

Bonus commentary for pedants
But wait, you might be saying, how is the 0.52 number a ratio of averages? No average was even calculated; we just summed the numbers and divided the Call Time by the Patrol Time. That’s a ratio of sums. Yes, that’s true, but if you had wanted to be a real stickler you could have found the average Patrol Time (78.6/8 = 9.825) and the average Call Time (40.6/8 = 5.075) and divided those numbers and arrived at the same answer (5.075/9.825) = 0.52. It works this way because the divisor in both the averages is the same (it’s 8, representing the 8 shifts included in the calculation) and therefore it cancels out when the ratio is taken.

Testing the Significance of Day-of-Week Analysis with the Chi-Square Statistical Test

A common weapon in the crime analyst’s arsenal is the day-of-week analysis. One problem with the technique is that many people accept the results uncritically and assume that if, for example, there have been several more incidents on Friday than any other day then Friday is the best day for a focused action. The problem is that day-of-week analysis is seldom accompanied by a confidence interval or evaluated to find out if the results are statistically significant. Without performing the statistical follow-up it’s impossible to know if the results actually represent anything other than randomness.

In order to evaluate statistical significance for something like a day-of-week analysis we need to use the chi-square statistical test. The chi-square test is what’s known as a non-parametric test, which is a fancy way of saying that it works with nominal data (that is, categories) and that the hypothesis being tested does not require a normal distribution or any variance assumptions (a bonus for testing things like a day-of-week breakdown). The test works by performing a goodness-of-fit analysis between the observed data and an expected distribution. The point is to determine if the differences between the observed and expected are caused by something other than chance.

To make things more concrete let me introduce the actual data I will be evaluating. The following image lists some (fictional) motor vehicle collision (MVC) counts over the course of several weeks. Looking at the data with an uncritical eye might lead some to assume that Fridays are the worst day for collisions and that we should do a traffic blitz on that day. The purpose of the exercise is to see if that’s a valid assumption.

screen-shot-2012-06-22-at-11-17-08-pm

So this is our “observed data” but what about the “expected distribution”? In order to answer that we need to identify another property of the statistical significance test: the null hypothesis. In any statistical test we need two hypotheses: the research hypothesis and the null hypothesis. The research hypothesis is what we’re hoping to show, in this case that the day of the week means something for MVCs. The null hypothesis is the opposite and in our example it means that the day of the week doesn’t factor into when collisions take place. Since the null hypothesis states that the day doesn’t matter we should expect the same number of collisions each day. If the total number of collisions is 140 then the expected value for each day is 140/7 = 20. With this new info have a look at the following image that lists both the observed and expected distributions.

chi_1

The chi-square calculation is pretty straightforward: subtract each observed value from its expected value, square it, and then divide by the expected value. So for Sunday the value would be (14-20)^2/20 = 1.8; for Monday it’s (16-20)^2/20 = 0.8 and so on for each of the 7 days. Then you sum up each of the calculated values to arrive at a total score (8.45 in this case), as in the image below.

chi_4
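In the sheet that calculation is nothing more than a fill-down and a sum. As a sketch, assuming the observed counts are in B2:B8 and the expected value of 20 for each day is in C2:C8:

D2: =(B2-C2)^2/C2 (filled down for each day of the week)
Total: =SUM(D2:D8)

Or, if you prefer, the whole thing in one cell: =SUMPRODUCT((B2:B8-C2:C8)^2/C2:C8)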

The next step is to check the score against what’s known as a chi-square critical value table, which is a pre-computed table of critical values for different confidence levels and degrees of freedom. The tables are all over the Internet but I used this one for this example. I wrote above about confidence levels, and I’m going to pick the 95% level just because, which means that my alpha value (look at the table) is 0.05 (i.e. 1 – 0.95). The second value is degrees of freedom, which is a measure of how many elements are being tested, minus one. For this example, 7 days in the week, minus one, means we have 6 degrees of freedom (df). Knowing these two pieces of information we can look up the chi-square critical value. Go down to the row for df=6 and across to the column for alpha=0.05 and the number is 12.592.
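You can also skip the printed table and let Excel produce the numbers itself (these functions aren’t part of the example above, just built-ins worth knowing). Sticking with the hypothetical B2:B8/C2:C8 layout, the first formula returns the 12.592 critical value and the second returns the p-value for the comparison directly; in Excel 2010 and later the same functions also go by CHISQ.INV.RT and CHISQ.TEST.

=CHIINV(0.05, 6)
=CHITEST(B2:B8, C2:C8)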

12.592 is greater than 8.45. That means we cannot reject the null hypothesis. That means we can’t say that the observed MVCs across the days of the week is due to anything other than chance. It means that we can’t say that any particular day is better than another to do a traffic blitz because, statistically, the observed data could have happened by chance. This is a troubling truth if we’ve been telling officers to focus on Fridays.

So this post is kind of a bummer if you’re used to directing officers based on day-of-week analysis. A lot of officers don’t want to hear about statistical significance (well, actually, none, I don’t know any who would want to hear about it) but the hard truth is that math is telling us that what we think is a pattern (e.g. more MVCs on Fridays) is often not distinguishable from chance. On the bright side, when we do get to reject the null hypothesis we are able to objectively demonstrate that, in fact, yes, something is going on that warrants attention. And of course the chi-square test is applicable to all sorts of categorical data sets, many of which demonstrate statistical significance, so hopefully you can add it to your toolbox for use in the future.

Important Bonus Math:
There are a few caveats when working with chi-square. First, it’s a data hungry method: the more values you have, the better your test will be. Second, it is skewed towards rejecting the null hypothesis. I don’t know exactly why, that’s what a smart stats guy wrote, but it means that if you are forced to accept the null hypothesis, you’re really in bad shape. Third, don’t run the test if you have an expected distribution with fewer than 5 in each of the cells. This means don’t let the expected value be less than 5 for each of the days of the week; that’s too little data (see the first caveat) and your results won’t be good.