Using SUBTOTAL for Dynamic Sums

This is a quick trick, but a real timesaver. Every day I encounter the situation where I have a filtered list of data and I want to sum up one of the columns. A simple example can be seen in the screen grab below, where I have some fictional officers across four platoons with their monthly ticket totals.

Your first instinct might be to use =SUM(D4:D23) as I did in cell D2 and that will calculate the sum for all the rows of data. The problems arise when you want to calculate the sum for a subset of the records, for example, the sum of tickets for just Platoon 1.

Whoops, even though I filtered the Unit column down to just Platoon 1, the sum remains the same.

To the rescue is the function SUBTOTAL(). Instead of SUM, in the following screen grab I’ve used the formula =SUBTOTAL(9,D4:D23).

As you can see, the value is the same as SUM. The magic happens when I start using the filters.

The SUBTOTAL updates itself depending on what records are visible in the range. Perfect.

You may be asking “Why the 9?” and that’s because SUBTOTAL can handle a variety of functions, not just summing. The function to be used is specified by the function_num parameter in the first slot, and 9 corresponds with SUM. You could just as well use 1 for AVERAGE or 2 for COUNT, etc. The following screen grab shows the options that are available.
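
If that screen grab is hard to read, here is the standard function_num mapping from Excel's documentation; the 100-series equivalents shown in brackets do the same job but also ignore rows you have hidden manually, not just rows hidden by a filter:

1 = AVERAGE (101), 2 = COUNT (102), 3 = COUNTA (103), 4 = MAX (104), 5 = MIN (105), 6 = PRODUCT (106), 7 = STDEV (107), 8 = STDEVP (108), 9 = SUM (109), 10 = VAR (110), 11 = VARP (111).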

My advice? Get used to always using SUBTOTAL. If you filter data sets as often as I do, it’ll change your life.

Download the sample spreadsheet for this post.

Comparing Event Counts with the Poisson Means Test

An exceedingly common task for crime analysts is to count the number of events that occur in one time period (e.g. day, week, month, YTD) and compare that count to a prior time period. The change between these two numbers is often the source of much strategy and project-making in law enforcement. What does it mean that we had more motor vehicle collisions this week compared to last? What should we do about the increased number of mental health apprehensions YTD? What are we doing right to make robberies go down this month? These types of questions, and the actions that are taken from the discussions they generate, are the daily bread of law enforcement. Unfortunately, one of the things that is seldom done is to check whether the change in the event counts is statistically significant. That is, is there really a difference between these numbers or can we just chalk it up to randomness?

This is where the Poisson Means Test comes into play. It is a statistical test that can be used to determine whether the difference between two counts is statistically significant or not. In prior posts I have written about statistical significance being about deciding between two hypotheses. The first (or null) hypothesis says that there is no difference between the two sets of events. The second hypothesis (the exciting one) says that, yes, these numbers are different enough that we would expect to see a difference this large by chance only, say, 1 time out of 20 (when dealing with a 95% confidence interval). In this case, what we want to be able to do is pick a confidence interval (I'm sticking with 95%) and look at two different event counts (let's say robberies YTD this year and last) and say that 'yes, something is meaningfully different about these two numbers', according to the statistical test. And we want to be able to do this in Excel.

But before we get into Excel a little background on Poisson. A Poisson process is a process that generates independent events over a period of time. Because the events are independent the time between consecutive events is considered random. It’s a perfect process for modelling things like criminal occurrences because notionally they are unrelated and we can’t predict with specificity when the phone is next going to ring. I’m not going to get into any math in this post as it’s not super important for understanding how to use the test so just know that Poisson is about counting the number of times something happens in a time period.

In order to compare two sets of events we need to define a couple of terms. The first set of events has a total count C1 that occurred over time intervals T1. The count C1 is the total number of events, but there is some flexibility in defining the time intervals, and that flexibility has to do with the other event count you want to compare. In our example I have the number of robberies in Jan-Feb-Mar of 2015 and 2014. I could say that T1=1 quarter, or I could say T1=3 months, or I could say that T1=90 days; it really doesn't matter, because T2, the time for the events counted in C2, is exactly the same, since we're doing a YTD comparison between two years. However, not all use cases are like that; it is very possible to compare Jan-Feb-Mar 2015 to April 2015, in which case you'd want T1 to be 3 (months) and T2 to be 1 (month). Basically, you pick the largest common time period you have between the two data sets and use that as the basis for T.

Enough chit-chat, an example will make things clearer.

poisson_1

I am sure this looks familiar to everyone: two years' worth of crime data, the difference in the events and the percent increase. Can you just hear the operations inspector demanding to know why robberies are up 21% and the detectives scrambling for answers? The problem is that we don't know if that change is because of anything real or is just due to the randomness of the data. But we can use the Poisson Means Test to help us figure it out and in this example, using the nomenclature above, we have C1=44, T1=1 (because it's one quarter), C2=56 and T2=1. What we need to do next is send those numbers through a function to tell us whether the difference is statistically significant or not. I sure hope Excel has a function for that … but it does not. However, thanks to some talented programmers I found a solution.

I first learned about the Poisson Means Test from the blog of Evan Miller, a programmer and statistician who has developed a great piece of statistical software for Mac named Wizard. Evan has a blog post on the subject of event counting where he explicitly tells readers to never make decisions on plus/minus differences without first checking the Poisson Means Test. To help facilitate this he has created a great online tool for checking two event counts to see if there is a meaningful difference between the counts.

The online tool works great but I need to be able to run these kinds of tests in bulk in Excel (where all crime data resides) and I wanted to see if I could make Excel do the test. Following Evan's notes in his blog post and peeking under the hood at his tool's JavaScript, I discovered that most of the heavy lifting is performed by a JavaScript library known as jstat, which provides numerous special functions including the, ahem, 'Inverse of the Lower Regularized Incomplete Gamma Function'. I took the time to figure out what that means so you don't have to, and what I learned is that Excel doesn't offer a function to do it. But Excel does offer VBA, and the jstat source code is available on GitHub, so I ported all the JavaScript functions I needed into VBA user defined functions and made them available through the following GIST.

Now, I don’t expect you to understand this code because I don’t understand a lot of it. I know what it is doing but there is a lot of code in here that I suspect comes from some flavor of a Numerical Recipes book. The point is, we don’t need to know how it works, the geniuses at jstat already figured it out, I just need to make sure that it works in VBA and I didn’t mess it up during the translation. I’m pretty confident in my coding after a lot of testing and you can test it if you want by trying Evan’s tool and my Excel macro together and seeing if you get the same results (you will). You can also, if you are adventurous, try out the R library zipfR, which has a function named Rgamma.inv, which does the same thing.

So how do we use the VBA to do something useful? Check the screen shot below where I have added a fifth column to my table named P-value. This is your usual statistical p-value that, for a 95% confidence interval, needs to be less than 0.05 in order to indicate statistical significance. As you can see in the image, I have called the custom function PoissonPValue from the VBA code and passed in the four values I identified above: C1, T1, C2, T2. The output of the function call is 0.271, which means that, statistically speaking, there is no detectable difference between 44 and 56, so that 21% increase is nothing to get worked up about (certainly not something to change strategy over).

poisson_2
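
For reference, the worksheet call in my sheet looks like =PoissonPValue(44, 1, 56, 1), with the arguments in the C1, T1, C2, T2 order described above. If you want a sanity check that needs no VBA at all, the same comparison can be run with the exact conditional (binomial) form of the test, which conditions on the combined total of 100 robberies and asks how surprising a 44/56 split is when, under the null hypothesis, each robbery is equally likely to fall in either year. That is my own cross-check, not the code from the GIST, but with the example numbers it returns the same answer:

=MIN(1, 2*MIN(BINOM.DIST(44, 100, 0.5, TRUE), 1 - BINOM.DIST(43, 100, 0.5, TRUE)))

which evaluates to roughly 0.271. The 0.5 is T1/(T1+T2) and would change if the two time periods were different lengths.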

I know that this is a hard truth to swallow because it runs counter to the thinking of most law enforcement personnel. They want to believe that if a number goes up then something is to blame and that something can be handled to make the number go back down. However, from personal experience I can attest, much of the week-to-week and month-to-month differences are not statistically significant. The reality is, randomness happens, and unless we want to be chasing our tails running after statistical ghosts, we would be well advised to concentrate our efforts on differences that the stats indicate are real.

To wrap up, the Poisson Means Test is an incredibly valuable tool that you should put in your toolbox. It’s not just for your weekly/monthly/annual crime report either. You can use it to test for meaningful differences between any two sets of events. So, for example, say someone ran a project over the last two months and wants to know if you can see an impact in the numbers. If you test the events from a period before against the 2-month project period you can see if there is a stat-sig difference between the counts. Now, that’s not to say that the project is responsible but at least you can demonstrate statistically there is a difference. Anecdotally, I have found new applications for the test every week and I like it because it adds a further layer of rigour to our analysis and I think that’s important. I hope you do too.

Prioritizing Using Exponential Decay

I had an officer ask the other day if I could identify the top 10 subjects in our records management system that were involved in mental health occurrences. On the surface this is a straightforward request: identify the people with the most activity. But it gets trickier when you factor in the passage of time. For example, is someone who had 3 incidents over 6 months ago more of a concern than someone who had 2 incidents in the last 8 weeks? Clearly recency should factor into how a person is prioritized but history should also weigh in. To address questions like this we can use an exponential decay equation in Excel to preferentially weight recent incidents while still giving some weight to the past. In this post I will outline how to apply the method.

The first step is to obtain some data. Depending on the nature of the incident I will set an arbitrary historical cut-off; in this example we'll set it at one year. This then means that we need to extract every mental health occurrence over the last year along with the date of the occurrence and a unique person identifier.

exponential_1

For each occurrence we then calculate the age of the incident in days by subtracting the date of the occurrence from today. This is done using Excel's NOW() function to get the current date and the INT() function to trim off the time, since we are working strictly with days. The following screen shot shows the formula in the function bar.

exponential_2
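
In case the formula in the shot is hard to make out, it is along these lines (assuming, for illustration, that the occurrence dates sit in column B with the first record in row 2):

=INT(NOW() - B2)

The INT() strips the fractional time-of-day part so the age comes out as a whole number of days.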

Next we want to create a column for the exponential decay formula. Most people are familiar with the idea of exponential decay: it is a mathematical function in which a quantity decreases at a rate proportional to its current value. In practice that means the value drops off quickly at first and then more slowly as the value shrinks. The chart below shows a graph of exponential decay for a number of different constants. As can be seen in the chart, we can tune how fast the values drop off by adjusting the constant.

640px-plot-exponential-decay-svg

In Excel, the formula for exponential decay is =EXP(-constant*time) where we get to specify the constant, and time, in our case, is the age of the incident. Since we're dealing with days and we only have a year of data we can decide how much we want past incidents to influence a person's priority. For example, if we say that a mental health incident that happened today is worth 1 point, how much is an incident that happened 6 months ago worth? If we say that it's worth 0.5 points then we can calculate our constant as: -1*LN(0.5)/180, where LN() is Excel's natural logarithm function, 0.5 is the point value we want an incident to be worth at 6 months, and 180 is the (approximate) number of days in 6 months. This calculation results in a value of 0.003851 which we can use as our constant.

exponential_3

Back in our data worksheet we can finally create our new column that calculates the decay score for each incident based on its age. We use the exponential decay formula =EXP(-1*0.003851*age) to calculate the decay score for each incident as in the following screen shot. Note that the decay score is always greater than 0 (it approaches but never reaches 0 as an incident ages) and is at most 1.

exponential_4

Now that we have a decay score for each occurrence we need to create a pivot table of all the data. Highlight everything and select ‘PivotTable’ from the ‘Insert’ tab. This will open a modal window and you can just hit the ‘OK’ button to create your pivot worksheet.

exponential_5

On the pivot table worksheet drag the Identifier field to the Row Labels box and the Decay Score field to the Values box. Click on ‘Count of Decay Score’ in the Values box and select the ‘Value Field Settings’ option at the bottom of the pop-up window. Change the selected option from Count to Sum and click ‘OK’.

exponential_6

Almost done. Back in your pivot table right-click the Sum of Decay Score column and sort ‘Largest to Smallest’. This will sort the persons by their aggregated decay scores and the people at the top are the highest priority individuals.

exponential_7

This method works because the exponential function embodies the principles we were originally trying to capture: it gives more weight (a higher score) to recent incidents but it also gives some score to historic incidents as well. For every occurrence a person is involved with they have the potential to add up to 1 point to their total score (if the incident happened just today) but as the incidents age they influence the total score less. In our example above the person with the 3 incidents over six months ago has, at most, a score of 1.5 (since 0.5 is the 6 month mark) while our 2-incidents-in-the-last-8-weeks person is likely to have a score a little less than 2. So the system ends up ranking the recent person slightly ahead of the historical one, balancing the influence of recency and history. The best part is that we can tune how much weight we give to the two factors by adjusting the exponential decay constant. If we decide that an incident from 6 months ago is only worth 0.1 points rather than 0.5, which greatly privileges recency over history, we can adjust the constant and rescore everyone. On the other hand, if we wanted to make a 6-month-old incident worth 0.9 points, we could recalculate the scores and privilege history over recency.
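
To make that tuning concrete, the constant works out as follows for the three 6-month weightings mentioned above (same -1*LN(weight)/180 calculation, rounded):

=-1*LN(0.5)/180 gives 0.003851 (the middle-ground constant used in this post)
=-1*LN(0.1)/180 gives roughly 0.0128 (strongly favours recency)
=-1*LN(0.9)/180 gives roughly 0.00059 (lets history count for much more)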

I like this method of prioritizing because it’s simple to implement in Excel (no plug-ins!), it’s logically justifiable and it allows the impact of recency and history to vary according to the expertise of the analyst. I hope you find it useful and good luck prioritizing!

Performing a Chain Analysis in Excel

Chain Analysis is the name I’ve given to the act of finding sequential, potentially related phone calls in phone records. The job is to identify if a particular number called a second number and if that second number called a third number within a specific time frame, say 10 minutes. I call these sequences of calls a chain and it can be used to identify whether someone might be issuing orders or instructions via third parties. The analysis can be done in Excel and while it involves some moderately-complicated Excel functions, once the spreadsheet is set up, subsequent analyses can be done easily.

Getting Started

To start, the first thing you need to do is arrange your data properly. In the screen shot below I have arranged some sample data where column A lists the calling number, column B lists the called number and column C lists the date and time of the call. In order for this technique to work properly it is important that the data set be sorted on the date time column.

For simplicity's sake (and ease of reading) I've replaced the phone numbers with letters but really any string will work in the analysis. Likewise, I only have 7 rows of data but you can have a whole bunch more and the analysis will work just fine.

While we’ll need a few extra columns for functions there is no reason you can’t keep additional data in the rest of the columns. As long as you sort everything properly the rest of your information will tag along as you perform the analysis.

chain_1

Notice in the list that there are some chains that are easily identified without resorting to a function. Notice how, on row 3, AA calls CC and then CC calls DD a minute later. Similarly, on row 6, GG calls AA and then a minute later GG calls BB. This is also a chain to be identified. The trick now is to write a function that is able to flag these rows so that we don’t have to rely on picking them out manually.

The Look-Ahead Column

To flag the row we’re first going to create a new column that, for each row, “looks ahead” at the rows coming up. What we want the function to do is look ahead to see how many subsequent rows fall within the time frame window we’re interested in. In the discussion above I mentioned 10 minutes so, for each row, we want to count how many subsequent rows occur within 10 minutes of the date and time of the current row. The screen shot below captures the function that I wrote for cell D2.

chain_2

The very first thing to notice is that the function is wrapped in curly braces like this: {<function>}. This means that the function is what’s known as an array formula rather than a traditional scalar formula. To enter an array formula you click in the cell like you normally would but instead of hitting enter when you’re done you need to hit ctrl-shift-enter (at the same time) to tell Excel that it should be considered an array formula.

This particular function needs to be used as an array formula because we are asking the function to compare each subsequent cell to the current one in the IF statement, so we’re comparing an array of values rather than just a single one. Also, when you’re typing the function into the cell, don’t include the curly braces, Excel will add those after when you correctly use ctrl-shift-enter to complete the array function.

Let's examine the IF statement next. The inner most part of the function is a request to check and see if the difference between cell C3 and cell C2 is less than 600 (our 10 minute cut-off, expressed in seconds). Since the date and time values in the C column are in Excel time we need to convert them to seconds before we do the comparison, which is why I multiply the difference by 24 (hours) and then 3600 (seconds) to get it into seconds. In the IF statement, if the difference between C3 and C2 is less than 600, then we return a 1 and if not we return a 0. Now, the tricky part is to wrap your head around the idea that, because this is an array formula, we can use the expression C3:$C$8-C2 to do all of the comparisons at the same time. This means that we are doing C3-C2, C4-C2, C5-C2, etc, all the way to C8-C2. For each one of them we do the x24 x3600 conversion and comparison against 600 seconds. For each one of them we are getting either a 1 or a 0. The power of the array formula method is that we can do all of those calculations at the same time, in the same cell.

Have a look at the SUM function that wraps the IF statement. Knowing that this IF statement is multivalued makes the SUM function make more sense. The role of the SUM is now to sum up all of the 1s that were returned by each of the IF comparisons. Have a look at the value of D2: it's 3. It's 3 because the date/times in C3, C4 and C5 are all within 10 minutes (600 seconds) of the value in C2.
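
Putting those pieces together, the look-ahead formula in D2 reads as follows, typed without the braces and confirmed with ctrl-shift-enter:

=SUM(IF((C3:$C$8-C2)*24*3600<600,1,0))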

Array formulas can be filled down just like regular functions so I have filled down the rest of the rows with this function. The key thing to remember is that you only need to compare the current row to all the subsequent rows, and so the important bit of the function, the C3:$C$8-C2 bit, will always have the current row in the C2 position and the next row in the C3 position. As an example, in cell D4 the function is =SUM(IF((C5:$C$8-C4)*24*3600<600,1,0)). Note that for row 4, C4 is used to subtract and the range of cells to check starts at C5.

Figuring Out the Look Ahead Row Number

The next step in the analysis is to create a new column to hold the end row value for the look ahead. If the values in column D tell us how many rows to look ahead then the values in column E tell us what row that actually is. Have a look at the function in cell E2 in the screen grab below.

chain_3

This function uses the ROW() function to first tell us what row we are currently in. Then the value of D2 is added to that to arrive at the final value. For cell E2 the value is 5 since the ROW() function returns 2 and the value of D2 is 3. What this means is that we only need to look ahead to row 5, from row 2, to find all the phone calls that occurred within the 10 minute window. Anything beyond row 5 is more than 10 minutes after the call in row 2. The end row is an important component of the next part of the analysis as we need to be able to specify exactly what cell range we should be looking at for determining chains.
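
For the record, the formula in E2 is simply the following, filled down the column:

=ROW()+D2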

Flagging the Chained Calls

Have a look at the screen grab for the final step of the analysis, the step where each row is flagged as part of a chain or not.

chain_4

The heart of the function is the use of the MATCH function to compare the value in column B, the Callee column, with the values in the Caller column for all the rows up to the end row identified in column E. In order to do that we need to use the INDIRECT function, which is used here to assemble a cell range using the ROW() function and the value from column E. Inside the INDIRECT is (“A” & (ROW()+1) & “:A” & E3). If you evaluate this we can see that ROW()+1 becomes 4 (since ROW() is 3 and 3+1 is 4) and the value in E3 is 5, which leaves (“A” & 4 & “:A” & 5). If you slam all that together (which is what the &s are for) you end up with A4:A5, which is the range of cells we want to test our MATCH against.

Something else you need to know: when MATCH can't find a match it returns the #N/A error, which is why the entire MATCH call is wrapped in an ISNA function call. If MATCH returns #N/A, then ISNA will return true, otherwise it returns false. I then feed that true or false into an IF statement (which is only interested in evaluating things to true or false anyway). If the value in the IF is true, that is, MATCH returned #N/A, then IF prints nothing, "", to the cell. However, if MATCH returns a value and therefore ISNA returns false, then the IF statement will print "CHAINED" in the cell. For good measure, I wrap the whole thing in another IF statement to check and see if the look ahead count in cell D3 is greater than 0. If not, there is no way there can be a chain, so the function just prints the empty string "" and we skip the whole MATCH, ISNA, IF process.
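
Pulling all of that together, the formula in cell F3 looks something like the line below. I have assumed an exact-match MATCH (a third argument of 0), which is what you want when matching phone numbers; adjust the cell references if your data starts on a different row:

=IF(D3>0, IF(ISNA(MATCH(B3, INDIRECT("A" & (ROW()+1) & ":A" & E3), 0)), "", "CHAINED"), "")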

So let’s have a look at the results the function spit out. We can see that two chains have been identified. The first is on row 3 where AA called CC and then CC called DD within 10 minutes. This is what the CHAINED in cell F3 is identifying. The second CHAINED is for the chain that begins when CC calls DD and then DD calls FF (on row 5) within 10 minutes. The take away here is that chains are identified on the row in which they begin so it’s up to you to follow the chains from the starting point where they are identified.

Bonus Chains

The above example looks at Caller-Callee chains but almost the same function can be used to flag Caller-Caller chains, the ones where a caller calls a number and then within the time interval the same caller calls another number. In the screen shot below Caller-Caller chains have been identified for the second row and the sixth row and the only difference between the functions in column F and column G is that the MATCH in the column G version is being done on the values in column A (e.g. A2) rather than column B (e.g. B2). All the same logic applies but the change in the matching column changes the chaining subject.

chain_5

As can be seen in the chart, row 2 is a Caller-Caller chain because AA calls BB and then within 10 minutes AA calls CC. Similarly, on row 6, GG calls AA and then GG calls BB within 10 minutes. Just like with the Caller-Callee chains above, the CHAINED flag is listed on the row where the chain begins and it's up to the analyst to follow the chain through.
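
In other words, the column G version of the formula is the same sketch as before with the MATCH lookup value switched from column B to column A, something like:

=IF(D3>0, IF(ISNA(MATCH(A3, INDIRECT("A" & (ROW()+1) & ":A" & E3), 0)), "", "CHAINED"), "")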

Conclusion

While the functions in this post can seem a little hairy at the start I think, with a little exploration, that you can become comfortable enough with them to trust the results. While it’s true that a custom VBA function could be written to perform this analysis I think it’s important to find solutions that first attempt to use the built-in Excel functions before resorting to custom code. I hope this post has demonstrated how a fairly sophisticated kind of analysis can be performed with just the tools that come in standard Excel. Good luck!

A Quick and Dirty Phone Toll Technique Using Excel Pivot Tables

Analysts are often asked to perform phone toll analysis, which is the act of analyzing call records to see who called whom, when and how frequently. There are commercial tools for performing such an analysis but I’ve found Excel can easily get the smaller jobs done. In a typical phone toll analysis you have some numbers of interest, suspect numbers or the numbers of known associates, that you want to study in depth. The technique I am going to demonstrate below is not applicable to that kind of analysis. Rather, I am going to detail how to use Excel pivot tables to perform a different kind of phone toll analysis useful for when you are interested in number discovery.

The scenario is as follows: there have been three incidents with the same MO and it is likely that the same suspects are committing the crimes. You don’t have any suspect descriptions but you do have the exact location and time of each of the incidents. Working on a hunch that the suspects are using their phones during or shortly after the crimes your detectives secure a warrant for all cell activity in each of the three incident locations for each of the three occurrence intervals from all the major cell carriers. They then email you this data and ask you to do an analysis.

The first step is to sort out the data. Every carrier organizes their data differently but the common elements between them all are called number, calling number and the date and time of the call (sometimes combined as date time). To normalize the data first create a new spreadsheet with columns for date, called number, calling number, carrier and location. These last two will need to be manually added to the spreadsheet based on data from the carrier and are used to differentiate the data between carriers. The intention of this step is to create a single 'super sheet' that combines all the calls made across all the carriers into one big list.

The stripped down super sheet in the screen grab below lists only 12 calls across three carriers but it’s quite possible that in a real life example you could be looking at thousands upon thousands of calls (depending on the size of your time interval and the size of your cell phone market).

toll_1

After normalizing the data the second step is to make two copies of the super sheet (leaving the original as a source file). In the first copy, delete the ‘Called’ column and in the second copy delete the ‘Calling’ column. Finally copy all the data from the second sheet and paste it at the bottom of the data in the first copy and rename the column with the phone numbers in it as ‘Number’. The point of this step is to wipe out the difference between called and calling number since, when we’re fishing for common numbers, we don’t want to just rely on one or the other in case during one incident the suspect was the caller but in another the suspect was the called number.

In the screen shot below I pasted in the content from the 'Calling' sheet to the bottom of the data in the 'Called' sheet and, even though you can't see the column name in the shot, I renamed column 2 to 'Number'.

toll_2

The third step is to create a pivot table from the combined work sheet with the single ‘Number’ column. Highlight the whole thing and insert a pivot table using the PivotTable button under the ‘Insert’ tab (see screen shot below).

toll_3

In the new pivot table work sheet drag the 'Number' column to the 'Row Labels' box and the 'Location' field to the 'Column Labels' box. Finally, drag the 'Date' field to the 'Values' box to get something similar to the screen shot below.

toll_4

What you'll likely see (depending on how many source numbers you have) is a very long pivot table with a lot of blank spaces. What you need to do is copy the entire table and paste its values into a new worksheet so that it can be manipulated. After pasting, highlight the data and apply a data filter so that each column has the drop down arrows for filtering out values.

toll_5

Now it's time to search for common numbers. If, as we hope, the same crew committed all the crimes and they used their phones during the different crimes then there should be numbers that have non-zero values in more than one column. In order to more easily identify the numbers use the filter drop down on the first location column (123 Main Street in the example) and deselect the (Blank) option. This should shrink the list considerably. Now, do the same thing on the second column by deselecting the (Blank) option. Any phone numbers that remain were used in both locations during the incidents.
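
If the filter clicking gets tedious, one optional shortcut (my own addition, not part of the original method) is to add a helper column to the pasted-values sheet that counts how many location columns contain a number for each row. Assuming, for illustration, that the three location columns landed in B, C and D and the first number is on row 2, the helper formula would be:

=COUNT(B2:D2)

Fill it down, then filter the helper column for values of 2 or more to surface every number that appears at more than one location in a single pass.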

toll_6

By re-selecting (Blank) for the second column and de-selecting (Blank) in the third column you can check for common numbers between the first and third locations as well. And finally, by re-selecting (Blank) in all the columns and then de-selecting it for two and three you can find common numbers for those two addresses. You only need to do those three combinations as all other combinations (e.g. three then two, three then one, etc) are just reversals of what’s already been done and you’ll get the same results.

By noting down what numbers are common and then applying a filter to the original super sheet that maintained the called/calling split it is possible to find the carrier. You can then examine the original carrier documents to get more details and provide your detectives with a list of numbers they can follow up on.

toll_7

One note: try googling the common numbers that you find before you send them off to your detectives. Often common numbers are things like the carrier's voice mail number or a public service phone number for the municipality. The good thing is that if the number is one that the general public might use it will likely be heavily promoted on websites and show up near the top of search results. By pre-identifying common numbers you can save your detectives the potential embarrassment of asking a carrier for details on their voice mail service.

Testing the Significance of Day-of-Week Analysis with the Chi-Square Statistical Test

A common weapon in the crime analyst arsenal is the day-of-week analysis. One problem with the technique is that many people accept the results uncritically and assume that if, for example, there have been several more incidents on Friday than any other day then Friday is the best day for a focused action. The problem is that day-of-week analysis is seldom accompanied by a confidence interval or evaluated to find out if the results are statistically significant. Without performing the statistical follow-up it’s impossible to know if the results actually represent anything other than randomness.

In order to evaluate statistical significance for something like a day-of-week analysis we need to use the Chi-Square Statistical Test. The chi-square test is what’s known as a non-parametric test which is a fancy way to say that it will work with nominal data (that is categories) and that the hypothesis being measured does not require normal distribution or any variance assumptions (a bonus for testing things like a day-of-week breakdown). The test works by performing a goodness-of-fit analysis between the observed data and an expected distribution. The point is to determine if the differences between the observed and expected are caused by something other than chance.

To make things more concrete let me introduce the actual data I will be evaluating. The following image lists some (fictional) motor vehicle collision (MVC) counts over the course of several weeks. Looking at the data with an uncritical eye might lead some to assume that Fridays are the worst day for collisions and that we should do a traffic blitz on that day. The purpose of the exercise is to see if that’s a valid assumption.

screen-shot-2012-06-22-at-11-17-08-pm

So this is our "observed data" but what about the "expected distribution"? In order to answer that we need to identify another property of the statistical significance test: the null hypothesis. In any statistical test we need two hypotheses: the research hypothesis and the null hypothesis. The research hypothesis is what we're hoping to show, in this case, that the day of week means something for MVCs. The null hypothesis is the opposite and in our example it means that the day of the week doesn't factor into when collisions take place. Since the null hypothesis states that the day doesn't matter, we should expect the same number of collisions each day. If the total number of collisions is 140 then the expected value for each day is 140/7 = 20. With this new info have a look at the following image that lists both the observed and theoretical distributions.

chi_1

The chi-square calculation is pretty straightforward: take the difference between each observed value and its expected value, square it, and then divide by the expected value. So for Sunday the value would be (14-20)^2/20 = 1.8; for Monday it's (16-20)^2/20 = 0.8, and so on for each of the 7 days. Then you sum up each of the calculated values to arrive at a total score (8.45 in this case), as in the image below.

chi_4
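
In worksheet terms, assuming for illustration that the observed counts sit in column B, the expected value of 20 in column C and the per-day contributions in column D (all starting at row 2), the calculation is just:

=(B2-C2)^2/C2 filled down for each day, and
=SUM(D2:D8) for the total score of 8.45.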

The next step is to check the score against what's known as a chi-square critical value table, which is a pre-computed table of critical values for different confidence intervals and degrees of freedom. The tables are all over the Internet but I used this one for this example. I wrote above about confidence intervals, and I'm going to pick the 95% interval just because, which means that my alpha value (look at the table) is 0.05 (i.e. 1 – 0.95). The second value is degrees of freedom, which is the number of categories being tested minus one. For this example, 7 days in the week, minus one, means we have 6 degrees of freedom (df). Knowing these two pieces of information we can look up the chi-square critical value. Go down to the row for df=6 and across to the column for alpha=0.05 and the number is 12.592.

12.592 is greater than 8.45. That means we cannot reject the null hypothesis. That means we can't say that the observed distribution of MVCs across the days of the week is due to anything other than chance. It means that we can't say that any particular day is better than another to do a traffic blitz because, statistically, the observed data could have happened by chance. This is a troubling truth if we've been telling officers to focus on Fridays.
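
If you would rather skip the printed table, recent versions of Excel (2010 and later) can do both the lookup and the p-value directly with built-in functions:

=CHISQ.INV.RT(0.05, 6) returns 12.592, the critical value we just looked up
=CHISQ.DIST.RT(8.45, 6) returns roughly 0.21, the p-value for our score (comfortably above 0.05)

There is also CHISQ.TEST(observed_range, expected_range), which skips the intermediate score and returns the p-value in one shot.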

So this post is kind of a bummer if you’re used to directing officers based on day-of-week analysis. A lot of officers don’t want to hear about statistical significance (well, actually, none, I don’t know any who would want to hear about it) but the hard truth is that math is telling us that what we think is a pattern (e.g. more MVCs on Fridays) is often not distinguishable from chance. On the bright side, when we do get to reject the null hypothesis we are able to objectively demonstrate that, in fact, yes, something is going on that warrants attention. And of course the chi-square test is applicable to all sorts of categorical data sets, many of which demonstrate statistical significance, so hopefully you can add it to your toolbox for use in the future.

Important Bonus Math:
There are a few caveats when working with chi-square. First, it's a data hungry method: the more values you have, the better your test will be. Second, it is skewed towards rejecting the null hypothesis. I don't know exactly why, that's what a smart stat guy wrote, but it means that if you are forced to accept the null hypothesis, you're really in bad shape. Third, don't run the test if you have an expected distribution with fewer than 5 in each of the cells. This means don't let the expected value be less than 5 for each of the days of the week; that's too little data (see the first caveat) and your results won't be good.