Revisiting Aoristic Analysis

Attention users of aoristic.policeanalyst.com. I have retired the web tool and replaced it with a downloadable Excel spreadsheet. Discussion and reasoning below.

Back in 2012 I wrote a post about Aoristic Analysis where I introduced a web-based tool for conducting aoristic analysis. The reason I built the tool as a web page was that the code allowed me to split events down to the second so that they could be properly apportioned across time units. The drawback of a web-based tool is that law enforcement analysts never liked the idea of submitting their data (which, by the way, is only dates and is completely context-less) to a web server. Fair enough, we’re a paranoid bunch.

I recently had an email discussion with someone who was using my tool but wished they had software to do the work locally. I resolved to create an Excel version of the tool that would work down to the second, just like the web site. I have completed that work and the result is now available for download here: Aoristic_v3.xlsm.

The spreadsheet is a macro-enabled XLSM because I use VBA to perform the aoristic analysis. There are three custom functions – AoristicHourOfDay, AoristicDayOfWeek and AoristicMonthOfYear – and all the code in the spreadsheet can be seen in the following GitHub Gist:

Understanding the Code

If you open the XLSM file you will notice that there are worksheets with example usage of each of the functions. Since all three functions follow a similar approach, I will discuss the logic behind AoristicHourOfDay as used on the ‘Hour of Day’ sheet.

[Screenshot aoristic_v1: the ‘Hour of Day’ worksheet with start times, end times and the 24 hourly columns]

The ‘Hour of Day’ sheet has the start time for an event in column B and the end time in column C. Following that there are 24 columns, one for each hour of the day, where the AoristicHourOfDay function is called. Looking at the embedded code above, you can see on line 23 the function declaration for AoristicHourOfDay and that it accepts a start time and an end time as well as the particular hour of interest for the calculation.

I will now go through the function line by line to explain the reasoning behind it; a simplified sketch of the whole function follows the walkthrough:

Lines 24-25: The function begins by determining how many hours and how many seconds have elapsed between the start and end times.

Lines 27-30: A check to make sure the end time is not before the start time. If it is, an error is raised and the function exits.

Lines 32-35: Next, the size of one ‘unit’ is determined by dividing 1 by the number of seconds. Because this code apportions an event down to the second, it needs to know what share of the whole event a single second represents. For example, if an event is 2 hours long (7200 seconds), then every second is worth 1/7200 of the total event.

Line 37: A variable captures the starting hour based on the event start time.

Lines 39-41: The code checks whether the total number of hours for the event is 1 and the hour passed into the function is the start hour. If both conditions are met, the hour is assigned the entire event (i.e. a value of 1) and the function can safely exit. This is the trivial case.

Lines 43-68: If those conditions are not met, the more elaborate assignment method is used. The aoristic variable for this hour is initialized to 0 and a variable for the current hour is initialized to the event’s starting hour. Next, a for loop runs from 1 to the total number of hours calculated for the event.

There is a lot going on in this for loop so I will break it down section by section.

Line 45: Within the loop, the code first checks whether the current hour equals the hour passed into the function. If they match, it means some portion of the event needs to be assigned to this hour.

Lines 47-51: Next, the code checks whether the current hour is the start hour. If it is, the code determines the remaining seconds in that hour and assigns them to the share variable. For the start hour we only want the balance of the hour assigned, and that balance is found by subtracting the start time’s minutes and seconds (expressed in seconds) from 3600, the number of seconds in an hour.

Lines 53-55: Next, the code checks whether the loop counter is greater than 1 and less than the total number of hours. If so, the hour is given 3600 seconds: the event started in an earlier hour and ends in a later one, so it spans this entire hour and should be assigned the full 3600 seconds.

Lines 57-61: The final if statement checks whether the loop counter equals the total number of hours, which is the same as asking whether this is the last hour. If it is, we include only the seconds in the part of the hour before the event ended. This is essentially the reverse of the reasoning for the first if statement.

Line 62: After the if statements are complete, the variable tracking the aoristic total for the hour is incremented by the share multiplied by the unit calculated above.

Lines 64-67: Outside the if statement, the current-hour variable is incremented and reset to 0 if it passes the 24-hour mark.

One thing people may be curious about is why I did this in a loop instead of just checking with the if statements. The reason is that an event can span days, so the same hour can be hit again and again; the loop allows for that accumulation in a way the if statements on their own could not.
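To make the walkthrough easier to follow without flipping back and forth to the gist, here is a simplified sketch of the logic in VBA. It is not the exact code from the spreadsheet (the line numbers above refer to the gist, and the function name, variable names and the way the hour count is derived here are my own shorthand), but it follows the same apportioning approach:

    ' A simplified sketch of the AoristicHourOfDay logic described above.
    ' This is NOT the exact code from the gist or the spreadsheet; the names
    ' and some details here are assumptions made for illustration only.
    Public Function AoristicHourOfDaySketch(startTime As Date, endTime As Date, _
                                            hourOfInterest As Integer) As Double
        Dim totalSeconds As Long, totalHours As Long, i As Long
        Dim currentHour As Integer
        Dim unit As Double, share As Double, aoristic As Double

        ' The end time must not be before the start time
        If endTime < startTime Then
            Err.Raise vbObjectError + 513, , "End time is before start time"
        End If

        ' Seconds elapsed, and the number of clock hours the event touches
        totalSeconds = DateDiff("s", startTime, endTime)
        totalHours = (Int(CDbl(endTime)) - Int(CDbl(startTime))) * 24 _
                     + Hour(endTime) - Hour(startTime) + 1

        currentHour = Hour(startTime)

        ' Trivial case: the whole event sits inside a single hour
        If totalHours = 1 Then
            If currentHour = hourOfInterest Then AoristicHourOfDaySketch = 1
            Exit Function
        End If

        ' Each second is worth 1/totalSeconds of the event (e.g. 1/7200 for 2 hours)
        unit = 1 / totalSeconds

        aoristic = 0
        For i = 1 To totalHours
            If currentHour = hourOfInterest Then
                If i = 1 Then
                    ' Start hour: only the balance of the hour after the start time
                    share = 3600 - (Minute(startTime) * 60 + Second(startTime))
                ElseIf i < totalHours Then
                    ' Interior hour: the event spans the whole hour
                    share = 3600
                Else
                    ' Final hour: only the seconds before the end time
                    share = Minute(endTime) * 60 + Second(endTime)
                End If
                aoristic = aoristic + share * unit
            End If
            ' Move to the next hour, wrapping past midnight
            currentHour = currentHour + 1
            If currentHour = 24 Then currentHour = 0
        Next i

        AoristicHourOfDaySketch = aoristic
    End Function

Calling the sketch for an event running from 10:30:00 to 12:15:00 (6300 seconds), for example, gives hour 10 a value of 1800/6300, hour 11 a value of 3600/6300 and hour 12 a value of 900/6300, and the three values sum to 1.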

Wrapping Up

The sample spreadsheet has a lot of example scenarios that illustrate different edge cases for the code. The key check on each of them is that the values for all 24 hours should sum to 1. Excel also makes it easy to calculate percentages by hand and compare them to what the code has spit out. Having said that, while I have tested it to my satisfaction there is always the possibility of error. If you come across any odd behaviour or incorrect results please let me know and I will fix it and put a new version up.
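A quick way to verify this yourself: assuming the 24 hourly columns run from D through AA and the first event sits in row 2 (adjust for your own layout), a check column containing

    =SUM(D2:AA2)

filled down should return 1 for every event row.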

I think this Excel method is superior to the web-based tool: it offers the same level of detail as the web page and it lets analysts look directly under the hood to see how the work is being done. Best of all, you can download the spreadsheet, run everything locally and never worry about submitting sensitive data over the web.

One final tip: if, rather than each hour’s share of the event, you want to know the total time spent in each hour, create a new column that contains the event duration in seconds and multiply it by the portion in each of the hourly columns. The resulting table shows how much time each event spent in each hour.
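For example, assuming the same layout as the ‘Hour of Day’ sheet (start times in column B, end times in column C, the 24 hourly portions in columns D through AA) and that the new duration column goes in AB, the formulas would look something like this:

    =(C2-B2)*86400
    =$AB2*D2

The first, filled down column AB, gives the event duration in seconds (Excel stores dates as day serials, so the difference times 86400 is seconds). The second, entered in the first column of the new table and filled across 24 columns and down, multiplies that duration by each hourly portion to give the seconds spent in each hour.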

Medians and Excel Pivot Tables

One of the issues with pivot tables in Excel 2010 (and earlier versions) is that it isn’t possible to calculate a median using the Value Field Settings. In this post I am going to first extol the virtues of the median (again) and then demonstrate a formula for calculating the median in a cross tabulation.

Medians Again?

Why do we care about the median and why can’t we just use the mean? The mean is the proper measure of central tendency when what we want to average is normally distributed. In those situations, adding everything up and dividing by the number of elements will give the value right at the peak of the distribution. However, when our distribution of values is heavily skewed the mean is less representative than the median. In those cases it makes more sense to take the middle value (i.e. the median) of all the values when they are ordered from smallest to largest.

The classic example of using the median is household income. There are a lot more small income households than large income households and those big households can really distort the mean. For example, let’s say that we have 7 households with incomes (in $10,000s) of 1, 1.4, 1.8, 2, 2.1, 2.4 and 13.2. The mean of those numbers is roughly 3.4 while the median is 2. Clearly 2 is a better representation of the centre of the data than the mean, which has been dragged higher by the one large income.
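If you want to check that quickly in Excel, both of these are ordinary worksheet formulas:

    =AVERAGE(1, 1.4, 1.8, 2, 2.1, 2.4, 13.2)
    =MEDIAN(1, 1.4, 1.8, 2, 2.1, 2.4, 13.2)

The first returns roughly 3.4 and the second returns 2.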

A similar issue occurs with any data set that is bounded at zero. For example, say you want to determine how long offenders wait between their arrest and when they are fingerprinted. Many services schedule identification appointments days or even weeks after the arrest, and it might be interesting to see what the average time is between arrest and appointment. The shortest possible wait is 0 days (the fingerprinting is done on the day of the arrest), and in general one would expect shorter waits to be more common than longer ones. While there may be some 28-day waits, we don’t want those outliers to unduly skew our average, so we choose the median rather than the mean.

An Example

Let’s take this example further by considering a spreadsheet where we have 1000 rows of arrest times and fingerprint times and the duration in days between them.

[Screenshot median_1: the Data worksheet of arrest times, fingerprint times and wait durations]
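For reference, the Data worksheet keeps the day of the week of the arrest in column A and the wait in days in column D (those two positions matter for the formula later in the post); I am assuming the arrest date/time sits in column B and the fingerprint date/time in column C. The two derived columns can be filled with ordinary formulas such as:

    =TEXT(B2,"dddd")
    =C2-B2

The first returns the day name of the arrest and the second the wait in days (Excel stores dates as day serials, so a plain subtraction works); both are filled down the 1000 rows.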

The easiest way to determine the median would be to use the formula =MEDIAN(D2:D1000), but that would only give us the median for the entire data set. When analysing our data with pivot tables we are often interested in cross tabulating based on a particular element of the data, and in this example we want to see whether the wait time is affected by the day of the week of the arrest.

To begin the analysis we first insert a new pivot table and use it to determine average (by which Excel means mean) wait time for each day of the week.

[Screenshot median_2: pivot table of mean wait time by day of week of the arrest]

Have a look at these mean values, particularly Saturday. What is up with that? I’ll tell you: there are errors in the data set. Someone has keyed the dates in incorrectly and there are outliers that are horribly skewing the results. This is the other benefit of working with the median: medians resist the impact of outrageous outliers in a way that the mean doesn’t. We could go back in and do some data cleansing but let’s leave it for now so we can see how the median will handle it.

If you look at the list of options in the Value Field Settings you’ll see that Median is not among them. Since pivot tables don’t provide an option for determining the median, we need to do some copying and pasting and use a formula to make it happen. Whenever I am investigating the median I first copy-and-paste-values the results of the mean pivot table to a new worksheet so that I can use the labels in my formulas. The new worksheet looks like this.

[Screenshot median_3: the pivot table results pasted as values on a new worksheet]

To calculate the median for each of the days of the week it is necessary to pull only those rows of data in the Data worksheet that match the particular day. Once I have those rows I use the MEDIAN function to calculate the specific median for that day of the week. To do this I use the formula in cell D4 below:

[Screenshot median_4: the array formula in cell D4]

It is important to notice that the formula in the screenshot is surrounded by curly braces. You don’t type the braces yourself; Excel adds them when you create what’s known as an array formula. Array formulas are different from regular Excel formulas because they work with a whole series of values rather than a single value. To tell Excel that you intend a formula to be an array formula you must press CTRL SHIFT ENTER (all three keys at once, like CTRL ALT DEL) at the end, rather than just hitting ENTER. You’ll know you did it right because Excel adds the curly braces.

In this particular array formula, we use an IF function to check every one of the Day of Week values in column A of the Data worksheet to see whether it equals the value in B4, which here is Sunday. If a value does equal Sunday, we return the wait time from column D for that row. The IF check runs over every row and returns a range of values (all of the ones for Sunday) to the MEDIAN function, which in turn calculates the median wait time for Sunday. The formula can then be filled down for each of the days just as you would with a regular formula.
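Pieced together from that description, the array formula in cell D4 looks something like this (the exact ranges are an assumption based on the 1000-row example, and remember that the braces are added by Excel, not typed):

    {=MEDIAN(IF(Data!$A$2:$A$1000=$B4,Data!$D$2:$D$1000))}

Enter everything inside the braces, press CTRL SHIFT ENTER, and then fill the formula down for the rest of the days.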

[Screenshot median_5: median wait time by day of week]

You can see in the screen grab that we now have median values for each day of the week. The outlier that skewed Saturday has clearly been handled by using the median, but note also that the other days (except Monday) now have significantly smaller values. This is further evidence that the median is the right choice for this data: using the mean, even once the data is clean, would lead to overestimates.

Conclusion

The bad news is that Excel can’t calculate medians in a pivot table. The good news is that, with a simple formula and the labels produced by the pivot table, it is possible to calculate the median for different cross tabulations. It does require a little experimentation with array formulas, but this particular example is pretty tame and serves as a decent introduction to a very powerful technique.