## Testing the Significance of Day-of-Week Analysis with the Chi-Square Statistical Test

A common weapon in the crime analyst arsenal is the day-of-week analysis. One problem with the technique is that many people accept the results uncritically and assume that if, for example, there have been several more incidents on Friday than any other day then Friday is the best day for a focused action. The problem is that day-of-week analysis is seldom accompanied by a confidence interval or evaluated to find out if the results are statistically significant. Without performing the statistical follow-up it’s impossible to know if the results actually represent anything other than randomness.

In order to evaluate statistical significance for something like a day-of-week analysis we need to use the Chi-Square Statistical Test. The chi-square test is what’s known as a non-parametric test which is a fancy way to say that it will work with nominal data (that is categories) and that the hypothesis being measured does not require normal distribution or any variance assumptions (a bonus for testing things like a day-of-week breakdown). The test works by performing a goodness-of-fit analysis between the observed data and an expected distribution. The point is to determine if the differences between the observed and expected are caused by something other than chance.

To make things more concrete let me introduce the actual data I will be evaluating. The following image lists some (fictional) motor vehicle collision (MVC) counts over the course of several weeks. Looking at the data with an uncritical eye might lead some to assume that Fridays are the worst day for collisions and that we should do a traffic blitz on that day. The purpose of the exercise is to see if that’s a valid assumption.

So this is our “observed data” but what about the “expected distribution”? In order to answer we need to identify another property of the statistical significance test: the null hypothesis. In any statistical test we need two hypotheses: the research hypothesis and the null hypothesis. The research hypothesis is what we’re hoping to show, in this case, that the day of week means something for MVCs. The null hypothesis is the opposite and in our example it means that the day of the week doesn’t factor into when collisions take place. Since the null hypothesis states that the day doesn’t matter then we should expect the same number of collisions each day. If the total number of collisions is 140 then the expected value for each day is 140/7 = 20. With this new info have a look at the following image that lists both the observed and theoretical distributions.

The chi-square calculation is pretty straight forward: subtract each observed value from its expected value, square it, and then divide by the expected value. So for Sunday the value would be (14-20)^2/20 = 1.8., for Monday it’s: (16-20)^2/20 = 0.8 and so on for each of the 7 days. Then you sum up each of the calculated values to arrive at a total score (8.45 in this case), as in the image below.

The next step is to check the score against what’s known as a chi-square critical value table, which is pre-computed tables of critical values for different confidence intervals and degrees of freedom. The tables are all over the Internet but I used this one for this example. I wrote above about confidence intervals, and I’m going to pick the 95% interval just because, which means that my alpha value (look at the table) is 0.05 (i.e. 1 – 0.95). The second value is degrees of freedom which is a measure of how many elements are being tested, minus one. For this example, 7 days in the week, minus one, means we have 6 degrees of freedom (df). Knowing these two pieces of information we can look up the chi-square critical value. Go down to the row for df=6 and across to the column for alpha=0.05 and the number is 12.592.

12.592 is greater than 8.45. That means we cannot reject the null hypothesis. That means we can’t say that the observed MVCs across the days of the week is due to anything other than chance. It means that we can’t say that any particular day is better than another to do a traffic blitz because, statistically, the observed data could have happened by chance. This is a troubling truth if we’ve been telling officers to focus on Fridays.

So this post is kind of a bummer if you’re used to directing officers based on day-of-week analysis. A lot of officers don’t want to hear about statistical significance (well, actually, none, I don’t know any who would want to hear about it) but the hard truth is that math is telling us that what we think is a pattern (e.g. more MVCs on Fridays) is often not distinguishable from chance. On the bright side, when we do get to reject the null hypothesis we are able to objectively demonstrate that, in fact, yes, something is going on that warrants attention. And of course the chi-square test is applicable to all sorts of categorical data sets, many of which demonstrate statistical significance, so hopefully you can add it to your toolbox for use in the future.

Important Bonus Math:
There are a few caveats when working with chi-square. First, it’s a data hungry method, the more values you have, the better your test will be. Second, it is skewed to rejecting the null hypothesis. I don’t know exactly why, that’s what a smart stat guy wrote, but it means that if you are forced to accept the null hypothesis, you’re really in bad shape. Third, don’t run the test if you have an expected distribution with less than 5 in each of the cells. This means, don’t let the expected value be less than 5 for each of days of the week, that’s too little data (see the first caveat) and your results won’t be good.

## The Poor Man’s Proximity Tool: The Haversine Formula

The haversine formula is a method for finding the great-circle distance between a pair of latitude-longitude coordinates. In crime analysis I use the formula for finding incidents that have taken place within a particular distance, for example within 500 meters, of a location.

My approach uses Excel to calculate the haversine distance between a reference location and the latitude-longitude of every other location in my spreadsheet. I then sort the results to find the locations that are within the distance that I’m interested in. I’ve worked the following example in Excel to illustrate the method.

Imagine that you have a list of 1000 occurrences with the latitudes and longitudes of each of the occurrence locations. Now also imagine that you have a reference location and that you want to determine how many of the 1000 incidents occurred within 500 meters of the reference location. I set my spreadsheet up like this:

Note that in column B I have entered all of the latitudes and in column C I have entered all the longitudes. I have also entered my reference latitude and longitude in cells E2 and E3 so that they can be referenced by the haversine formula. Also note that both latitudes and longitudes are decimal degrees as opposed to degree minutes seconds.

In column D I enter the Excel-ified version of the haversine formula which I have reproduced below. It gets a bit complex because Excel likes to deal with radians as opposed to degrees when using the COS and SIN functions so it is necessary to use Excel’s RADIANS function to convert the latitudes and longitudes. The final multiplication by 6371 represents the Earth’s radius in kilometres.

A note for pedants: I realize that the Earth is not a perfect sphere and that means that the calculated distances will be somewhat off if you are dealing with large distances. But, for the distances we are concerned with–distances of 10s of kilometres–the impact is negligible and can be ignored.

By filling column D with the expression I calculate the separation distance between the reference point and each incident. Now, if the distance column is sorted from smallest to largest I can easily see the incidents that occurred within 500 meters (or 0.5 kilometres) of the reference location.

I find this method useful for determining the proximity of all sorts of things. For example, if the reference coordinates are for an intersection the method can be used to find all motor vehicle collisions that occurred within 100 meters. Or say there was a break and enter, a proximity search can be done against field contacts that occurred within 1 kilometre of the occurrence address. Really anything that has a recorded latitude and longitude can be used. Best of all, it doesn’t require a GIS, just Excel and the formula.

## Creating Heat Maps from Excel Pivot Tables with Conditional Formatting

I frequently use Excel’s Pivot Table functionality to examine the relationships between variables. While it is a common tool in the analyst’s toolbox I don’t like the default visualization options. This blog post will discuss an approach for visualizing pivot table data using Excel’s built-in conditional formatting functionality to create heat maps.

A Heat Map is a chart that uses colour to visualize a two-dimensional matrix of values. Since pivot tables make the creation of two-dimensional output so easy the heat map is the perfect tool for visualizing a pivot table. The process is straightforward and I demonstrate it with an example.

I start with some fictitious data that has day-of-week and hour-of-day properties. Using the standard approach I create a pivot table that has day-of-week as the columns and the hour-of-day as the rows.

Next, I highlight the pivot table data and copy it to a new worksheet (this is for future formatting). I then highlight the entire data range and select the Conditional Formatting button in the ‘Home’ ribbon. Under Conditional Formatting I select Color Scales and one of the colour ramps that are available.

This immediately colour codes the entire range and creates an attractive heat map that will dynamically adjust its formatting if the cell values change. With a small amount of additional formatting (some bolding, a border, centre-aligned text) a very nice chart can be produced that can be added into a report.

Contrast the heat map with the following chart that plots the same data using multiple line charts. I find the heat map much more compact, readable and intuitive.

Conditional Formatting offers a quick way to produce heat maps for the kind of two-dimensional datasets analysts produce everyday. I have found that heat maps are an effective way to visualize a large volume of data in a manner that is easily understood by non-analysts. Best of all, if you have Excel 2007 or higher the tools are already installed on your computer.

Bonus Tip
You may come across the situation where you want to keep the underlying heat map cell colours but delete the numbers in the cells. Unfortunately, deleting the numbers also deletes the colour since this is how conditional formatting works. Frustratingly, copying and pasting in Excel also doesn’t work, not even with ‘Paste Formatting’, as it’s not possible to maintain the format without the numbers. Despite these headaches, I did find a trick for keeping the colours without the numbers.

First, highlight the whole heat map and copy it. Now open Microsoft Word and copy the heat map into a Word document.

Second, select the heat map in Word and copy it. Now open up a new Excel document and paste into the first cell. The heat map should be pasted in its entirety back into Excel and if you delete the numbers the colours should still be there.

Update: Reader Frank Fery writes that it is possible to remove the numbers from a heat map without resorting to the copy-to-Word trick. He provided a link to superuser.com that states that if you change the custom format of the cell to ;;; it will hide the numbers.  I still prefer the Word way though because it allows me to actually lay out different numbers on top of the heat map. Why would you want this? As an example, I once created a heat map where the surface represented calls-for-service volume but the numbers on the chart were numbers of officers working. Thanks for the feedback Frank.