## Forecasting with Simple Regression and Prediction Intervals

The police produce new data points on a weekly, monthly and annual basis and every time a new number comes out there is an instinctive response to compare the new number to the last number, figure out if it’s gone up or down and attach the dreaded up and down arrows. Those up and down arrows (usually red and green respectively) can spur a lot of discussion and a lot of work as people change practice or get chewed out by superiors in an effort to make a change. In this post I am going to demonstrate how to use simple regression and prediction intervals to move beyond the up and down arrows by using historic data and some math to capture the inherent variability in a data series.

One important note before we begin: the approach detailed below is for data without a seasonal trend. This means that monthly data is likely out since most police data has a lull in the winter and a peak in the summer. I primarily use this approach for forecasting annual, running 12-month or year-to-date data. There are more elaborate methods for forecasting data with a seasonal trend and I hope to write a post about that someday.

For this post I am going to use the annual Crime Severity Index that is produced by StatsCanada for every police service in the country. StatsCan recently published the 2016 data for a local police service and in the screen grab below I demonstrate the up-and-down arrow method of trend analysis (which, not coincidentally, is how StatsCan communicates it out to police services).

As you can see, between 2015 and 2016 the Crime Severity Index went up by 6.44% (going up is bad, hence the ominous red X). Not a catastrophic increase but an executive might still demand some answers or possible some action to reverse the trend.

But let’s look at the last 16 years’ worth of Crime Severity Index values for the same police service.

The line chart tells a different story, one where CSI has been dropping for over a decade and where this year there has been a subtle uptick. The chart helps people understand visually what is going on but we can go further than the graph and build a simple model to help us understand the numbers even better.

For our model we’re going to be using simple linear regression. Recall from your days of high school mathematics that a linear regression model has the form Y = mX + b where m is the slope of the line, b is the intercept of the line, X is the independent variable (i.e. the one we can choose) and Y is the dependent or response variable (i.e. what comes out of our model when we plug in our values).

Excel makes it very easy to build a linear regression model using a line chart. Simply right click on the chart line and select “Add Trendline…”, you can even have the equation displayed on the screen like so:

In this model, m has the value -2.0433 and B has the value of 57.375. X and Y remain variables because we can sub in an X and get a Y (or the CSI value) in return.

While right-clicking on the chart to add the trendline is quick you can also use Excel functions to derive each of the pieces. The following screen grab details the function used to determine both the slope m and the intercept B.

The first takeaway is that Excel’s SLOPE and INTERCEPT formulas can be used to calculate the m and B values that are found on the line chart.

The second thing to notice is that in the table of data points the Year goes from 1 to 17 rather than 2000 to 2016. This was done because that’s how the linear model on the chart works but you can also put in the years and get nearly the same result: the slope will be identical but the intercept will be larger (on the order of a year). In the following screen grab I’ve changed over to using years to make things easier, like so:

A linear model is useful because it can be extended into the future to produce a forecast. For example, with the model parameters defined above it is possible to forecast a CSI value for next year by setting X equal to 2017 and calculating Y:

The point forecast for the 2017 Crime Severity Index is therefore 27.27.

While forecasting the future has obvious benefits it doesn’t address the up-and-down arrow issue. To address that issue we first want to produce a forecast for this year, 2016, and compare it to 2016 actual. And if we’re going to produce a forecast for this year, we shouldn’t include this year’s data in the creation of the linear model because we need to pretend like we don’t know this year’s value yet. In the following screen grab I’ve simply dropped the 2016 data from the calculation of the slope and the intercept:

By dropping the 2016 data point and recalculating the slope and intercept with the other 16 years’ worth of data, the slope and intercept change slightly. Using these news values and plugging in 2016 for X produces the following forecast for 2016.

Looking at the predicted value of 22.05, it makes sense if you don’t know anything about 2016. Every year before now the values have been trending down and all the historical data points are pointed in that direction. From the math’s point of view, there is no reason to think that the CSI should go up in 2016.

But, the forecast is wrong and that’s to be expected because point forecasts are nearly always wrong and if you give a senior decision maker only a point forecast, you’re going to end up explaining why the true value, when it comes around, isn’t the same as the forecast and then you’re going to find that people don’t trust your forecasts anymore.

So why bother with this exercise? Because linear models and point forecasts are the first step towards prediction intervals and those really do have value.

A prediction interval is simply an interval that is placed around the Y forecast that says “expect the value to be in this range, X% of the time”. I use prediction intervals as “I’m not surprised” ranges because if I forecast a number and calculate its prediction intervals and then the real number shows up and it’s within the interval range, I’m not surprised. The math said it would be so. If the real number happens to be outside the interval, well that’s something to investigate.

The following screen grab shows the mathematical formula that is used to calculate prediction intervals. I know it looks intimidating but I’ll break all the pieces down.

Breaking down the formula, here’s what each of the variables mean:

• y-hat (as it’s called) is the point forecast from above
• t*n-2 is the t-score with degrees of freedom n-2
• s is the standard error of the residuals
• n is the number of data points we have in our model
• x* is the value we plugged into the linear model to get y-hat
• x-bar is the mean value of all the x data points
• sx is the standard deviation of all the x data points

You may be familiar with some of these concepts and unfamiliar with others, and that’s ok. The good news is that we can calculate each of these with Excel and arrive at the prediction intervals by combining them all into the formula above.

In the following screen grab I’ve expanded on the initial linear forecast from above and added in several new calculations that build up each of the variables needed for the prediction interval.

Each of the pieces of the prediction interval equation are calculated in column F and column G shows the formulas that are being used. Pay particular attention to cell F11 as it contains the point forecast discussed above. Cell F13 has the margin of error value calculated using the equation above and in our example it has the value of 6.197. It is a simple matter to add and subtract that margin of error from the point forecast to create the 95% prediction interval and arrive at the upper and lower bounds on the forecast of 28.242 and 15.847.

Recall earlier that the point forecast of 22.045 was a poor estimate since the actual value for 2016 is 24.80. Also recall, from way back at the start of this post, that CSI in 2016 was up 6.4% and the question was raised “is this an issue”? Looking at the prediction interval of [28.242, 15.847] it’s clear that the actual value for 2016 is well within the range that the history of values tells us to expect and therefore the actual value is neither surprising or an issue.

The beauty of prediction intervals is that they allow us to do away with the up-and-down arrow approach that could have us chasing our tails and instead gives us an indicator we can use to figure out if something is outside the norm. While the original calculation is somewhat complex it is a straightforward exercise to set up an Excel template where all you need to do is drop in the values and have the intervals spit out the other end. I use intervals at least once a day to get a handle of how concerned I should be over changes in data series and I hope you find value in adding this technique to your analyst toolbox.

## Using SUBTOTAL for Dynamic Sums

This is a quick trick, but a real timesaver. Every day I encounter the situation where I have a filtered list of data and I want to sum up one of the columns. A simple example can be seen in the screen grab below, where I have some fictional officers across four platoons with their monthly ticket totals.

Your first instinct might be to use =SUM(D4:D23) as I did in cell D2 and that will calculate the sum for all the rows of data. The problems arise when you want to calculate the sum for a subset of the records, for example, the sum of tickets for just Platoon 1.

Whoops, even though I filtered the Unit column down to just Platoon 1, the sum remains the same.

To the rescue is the function SUBTOTAL(). Instead of SUM, in the following screen grab I’ve used the formula =SUBTOTAL(9,D4:D23).

As you can see, the value is the same as SUM. The magic happens when I start using the filters.

The SUBTOTAL updates itself depending on what records are visible in the range. Perfect.

You may be asking “Why the 9?” and that’s because SUBTOTAL can handle a variety of functions, not just summing. The function to be used is specified by the function_num parameter in the first slot, and 9 corresponds with SUM. You could just as well use 1 for AVERAGE or 2 for COUNT, etc. The following screen grab shows the options that are available.

My advice? Get used to always using SUBTOTAL. If you filter data sets as often as I do, it’ll change your life.

## Quick and Easy Year-To-Date Visualization with Stacked Column Charts

A common request from customers is to ask about historic crime trends. They usually want to know something like “How many break and enters have we had in District Y since 2010?” and what they likely want to do is get a sense of how things are moving; are things trending up and should they start a project or are they trending down and they can calm the nerves of a local council person.

Getting the data is usually straightforward and the most helpful presentation of the numbers is a chart but there’s always the wrinkle of how to show this year’s data compared to previous years. Unless the customer is asking for the data on the day after New Year’s the count for the current year is going to be Year-to-Date (YTD) only and comparing this YTD numbers to previous years is a bit complicated, especially once the chart leaves your hands and they start sharing it around. Having run into this request many times I’ve developed the following visualization method that uses a stacked column chart to present YTD data in an intuitive, readable format.

In the example spreadsheet I’ve created for this post I’ve generated some random reported dates dating back to 2007. In columns B and C I’ve broken out the reported year and month for the incident using Excel’s built in YEAR() and MONTH() functions. Real life data will likely have a lot more columns but these are the only ones you really need to make this technique work.

Next I create a pivot table with the data and I drag the “Reported” field from column A into the Values area and the “Reported Year” field into the Rows area. This creates a standard count-by-year table that I can copy over to another worksheet to prepare my chart.

This data covers the full year but because the data I am working with was generated in August of 2016 I want to be able to pull out the January through August data from the previous years for the YTD comparison. To do that I drag the “Reported Month” field into the Filters area and select only months 1 through 8 in the Reported Month filter.

The resulting list of values represents the YTD values for each year and I copy those over to my working sheet. I organize the data in a table with columns for the year, the total for the year and the January through August values. I then create one additional column labelled “Sep – Dec” and subtract the “Jan – Aug” count from the annual total to fill this column. The final table looks like this:

With the data in place we can now chart. Select the last two columns of data and then insert a stacked column chart. You may need to fiddle a bit with the settings (you’ll want to add the year labels in for each column) but what you’ll see is a chart that simultaneously visualizes both the YTD values and the overall annual values.

In my sample chart I’ve selected strongly contrasting colours to aid the YTD comparison. By looking at just the blue portions of the columns it’s possible to see that 2016 has been a good year so far for this kind of incident. We’re down significantly from last year at this time and actually at the lowest YTD levels we’ve seen since 2009. If we consider the orange column we can see that besides 2013 and 2015 the number of incidents have been fairly steady and that 2016 is likely going to come in as a little below average.

Charting historic crime volume is an everyday request for most analysts. By leveraging the convenience of stacked column charts you can create a single visual that helps understand both annual and YTD trends.

## Halton Regional Police Service wins 2016 ComputerWorld Data+ Editors’ Choice Award

In the September issue of ComputerWorld the Halton Regional Police Service was awarded a 2016 Data+ Editors’ Choice Award for their innovative use of data and analytics to improve the effectiveness and efficiency of the police service.

For all the talk of big data and technology, I think this is the most important insight for organizations looking to adopt data analytics:

“Unless people can believe in the product, the solving of the technical challenges doesn’t mean much.”

## Comparing Event Counts with the Poisson Means Test

An exceedingly common task for crime analysts is to count the number of events that occur in one time period (e.g. day, week, month, YTD) and compare them to a prior time period. The change between these two numbers if often the source of much strategy and project-making in law enforcement. What does it mean that we had more motor vehicle collisions this week compared to last? What should we do about the increased number of mental health apprehensions YTD? What are we doing right to make robberies go down this month? These types of questions, and the actions that are taken from the discussions they generate, are the daily bread of law enforcement. Unfortunately, one of the things that is seldom done is to check if the change in the event counts is statistically significant. That is, is there really a difference between these numbers or can we just chalk it up to randomness?

This is where the Poisson Means Test comes in to play. It is a statistical test that can be used to determine if the difference between two counts is statistically significant or not. In prior posts I have written about statistical significance being about deciding between two hypotheses. The first (or null) hypothesis says that there is no difference between the two sets of events. The second hypothesis (the exciting one) says that, yes, these numbers are different enough that we would only expect to see this level of difference by chance every, say, 1 out of 20 times (when dealing with a 95% confidence interval). In this case, what we want to be able to do is pick a confidence interval (I’m sticking with 95%) and look at two different event counts (let’s say robberies YTD this year and last) and say that ‘yes, something is meaningfully different about these two numbers’, according to the statistical test. And we want to be able to do this in Excel.

But before we get into Excel a little background on Poisson. A Poisson process is a process that generates independent events over a period of time. Because the events are independent the time between consecutive events is considered random. It’s a perfect process for modelling things like criminal occurrences because notionally they are unrelated and we can’t predict with specificity when the phone is next going to ring. I’m not going to get into any math in this post as it’s not super important for understanding how to use the test so just know that Poisson is about counting the number of times something happens in a time period.

In order to compare two sets of events we need to define a couple of terms. The first set of events has a total count C1 that occurred over time intervals T1. The count C1 is the total number of events but there is some flexibility in defining the time intervals and that flexibility has to do with the other event count you want to compare. In our example I have the number of robberies in Jan-Feb-Mar of 2015 and 2014. I could say that T1=1 quarter, or I could say T1=3 months or I could say that T1=90 days, it really doesn’t matter because T2, the time for the events counted in C2, is exactly the same, since we’re doing a YTD comparison between two years. However, not all the use cases are like that, it is very possible to compare Jan-Feb-Mar 2015 to April 2015 in which case you’d want T1 to be 3 (months) and T2 to be 1 (month). Basically, you pick the largest common time period you have between the two data sets and used that as the basis for T.

Enough chit-chat, an example will make it things clearer.

I am sure this looks common to everyone: two year’s worth of crime data, the difference in the events and the percent increase. Can you just hear the operations inspector demanding  to know why robberies are up 21% and the detectives scrambling for answers? The problem is that we don’t know if that change is because of anything real or is just due to the randomness of the data. But we can use the Poisson Means Test to help us figure it out and in this example, using the nomenclature above, we have C1=44, T1=1 (because it’s one quarter), C2=56 and T2=1. What we need to do next is send those numbers through a function to tell us whether the difference is statistically significant or not. I sure hope Excel has a function for that … but it does not. However, thanks to some talented programmers I found a solution.

I first learned about the Poisson Means Test from the blog of Evan Miller, a programmer and statistician who has developed a great piece of statistical software for Mac named Wizard. Evan has a blog post on the subject of event counting where he explicitly tells readers to never make decisions on plus/minus differences without first checking the Poisson Means Test. To help facilitate this he has created a great online tool for checking two event counts to see if there is a meaningful difference between the counts.

The online tool works great but I need to be able to run these kinds of tests in bulk in Excel (where all crime data resides) and I wanted to see if I could make Excel do the test. Following Evan’s notes in his blog post and peeking under the hood at his tool’s javascript I discovered that most of the heavy lifting for the tool is performed by a javascript library known as jstat. Jstat provides numerous special functions including the, ahem, ‘Inverse of the Lower Regularized Incomplete Gamma Function’.  I took the time to figure out what that means so you don’t have to and what I learned is that Excel doesn’t offer a function to do it. But, Excel does offer VBA, and the jstat source code is available on GitHub, so I ported all the javascript functions I needed into VBA user defined functions and made them available through the following GIST.

Now, I don’t expect you to understand this code because I don’t understand a lot of it. I know what it is doing but there is a lot of code in here that I suspect comes from some flavor of a Numerical Recipes book. The point is, we don’t need to know how it works, the geniuses at jstat already figured it out, I just need to make sure that it works in VBA and I didn’t mess it up during the translation. I’m pretty confident in my coding after a lot of testing and you can test it if you want by trying Evan’s tool and my Excel macro together and seeing if you get the same results (you will). You can also, if you are adventurous, try out the R library zipfR, which has a function named Rgamma.inv, which does the same thing.

So how do we use the VBA to do something useful? Check the screen shot below where I have added a fifth column to my table named P-value. This is your usual statistical p-value that, for a 95% confidence interval, needs to have a value less than 0.05 in order to indicate statistical significance. As you can see in the image, I have called the custom function PoissonPValue from the VBA code and passed in the four values I identified above: C1, T1, C2, T2. The output of the function call is the value is 0.271, which means that, statistically speaking, there is no difference between 44 and 56, so that 21% increase is nothing to get worked up about (certainly not something to change strategy over).

I know that this is a hard truth to swallow because it runs counter to the thinking of most law enforcement personnel. They want to believe that if a number goes up then something is to blame and that something can be handled to make the number go back down. However, from personal experience I can attest, much of the week-to-week and month-to-month differences are not statistically significant. The reality is, randomness happens, and unless we want to be chasing our tails running after statistical ghosts, we would be well advised to concentrate our efforts on differences that the stats indicate are real.

To wrap up, the Poisson Means Test is an incredibly valuable tool that you should put in your toolbox. It’s not just for your weekly/monthly/annual crime report either. You can use it to test for meaningful differences between any two sets of events. So, for example, say someone ran a project over the last two months and wants to know if you can see an impact in the numbers. If you test the events from a period before against the 2-month project period you can see if there is a stat-sig difference between the counts. Now, that’s not to say that the project is responsible but at least you can demonstrate statistically there is a difference. Anecdotally, I have found new applications for the test every week and I like it because it adds a further layer of rigour to our analysis and I think that’s important. I hope you do too.

## Revisiting Aoristic Analysis

Attention users of aoristic.policeanalyst.com. I have retired the web tool and replaced it with a downloadable Excel spreadsheet. Discussion and reasoning below.

Back in 2012 I wrote a post about Aoristic Analysis where I introduced a web-based tool for conducting aoristic analysis. The reason I built the tool as a web page was that the code allowed me to split events down to the second so that they could be properly apportioned across time units. The drawback of making a web based tool is that law enforcement analysts, typically a suspicious bunch, never liked the idea of submitting their data (which is only dates, by the way, it is completely context-less) to a web server. Fair enough, we’re a paranoid bunch.

I recently had an email discussion with someone who was using my tool but wished they could have some software to do the work locally. I resolved to create an Excel version of the tool that would work in seconds like the web site. I have completed that work and the result is now available for download here: Aoristic_v3.xlsm.

The spreadsheet is a macro-enabled XLSM and that’s because I use VBA to perform the aoristic analysis. There are three custom functions –  AoristicHourOfDay, AoristicDayOfWeek and AoristicMonthOfYear  – and all the code in the spreadsheet can be seen in the following GitHub GIST:

Understanding the Code

If you open the XSLM file you will notice that there are worksheets with example usage of each of the functions. Since all three functions follow a similar approach I will discuss the logic behind AoristicHourOfDay used on the ‘Hour of Day’ sheet.

The ‘Hour of Day’ sheet has in column B the start time for an event and in column C the end time for the event. Following that there are 24 columns, one for each hour of the day, where the AoristicHourOfDay function is called. Looking at the embedded code above you can see on line 23 the function declaration for AorisiticHourOfDay and that it accepts a start and end time as well as the particular hour of interest for calculation.

I will now go through the function line by line to explain the reasoning behind it:

Line 24, 25: The function begins by first determining how many hours and how many seconds have elapsed between the start and end time.

Line 27-30: A check to make sure that the end time is not before the start time. If it is, an error is raised and the function exits.

Line 32-35: Next, the amount of one ‘unit’ is determined by dividing 1 by the number of seconds. Because this code is interested in apportioning an event down to the second, it is necessary to be able to assign units down to the second. For example, say an event was 2 hours long (7200 seconds) then every second is worth 1/7200 of the total event.

Line 37: A variable captures the starting hour based on the event start time.

Line 39-41: The code checks to see if the total number of hours for the event is 1 and the hour passed into the function is the start hour. If both of these conditions are met then the hour should be assigned the entire event (i.e. a value of 1) and the function can safely exit. This is the trivial case.

Line 43-68: If these conditions are not met then the more elaborate assignment method is used. The aoristic variable for this hour is initialized to 0 and a variable for the current hour is initialized to the starting hour for the event. Next, a for loop is initiated that loops from 1 to the total number of hours calculated for the event.

There is a lot going in this for loop so I will break it down, section by section.

Line 45: Within the loop the code first checks to see if the current hour equals the hour passed into the function. If the hours match it means that we need to assign some portion of the event to the hour.

Line 47-51: Next, the code checks to see if the current hour is equal to the start hour and if it matches the code determines the remaining seconds in the hour and assigns that to the share variable. The reason this is done is because, for the start hour, we only want the balance of the hour assigned and to determine the balance of the hour we need to subtract the current number of minutes and seconds from 3600, the number of seconds in an hour.

Line 53-55: Next the code checks to see if the loop counter is greater than 1 and less than the total number of hours. If this is the case the hour is given 3600 seconds. The reasoning here is that if the hour matches and the event started at an earlier hour and ends at a later hour then the event existed across this particular hour and should therefore be assigned a full 3600 seconds.

Line 57-61: The final if statement checks to see if the loop counter is equal to the total number of hours. This is the same as asking if it matches the last hour and if it does we include only those seconds in the part of the hour before the event ended. This is kind of like the reverse of the reasoning for the first if statement.

Line 62: After each of the if statements are complete the variable for tracking the aoristic total for the hour is incremented by adding to it the value of the share multiplied by the unit calculated above.

Line 64-67: Outside the if statement the current hour variable is incremented and reset to 0 if it passes the 24 hour mark.

One thing people may be curious about is why I did this in loop instead of just checking with the if statements and that has to do with the fact that an event can span days and therefore the same hour could be hit again and again. The loop allows for this calculation in a way that the if statements on their own could not.

Wrapping Up

The sample spreadsheet has a lot of example scenarios that illustrate different edge cases for the code. The key check on each of them is that the sum total of the values for all 24 hours should sum to 1. Excel also makes it easy to calculate percentages by hand and compare them to what the code has spit out. Having said that, while I have tested it to my satisfaction there is always the possibility for error. If you come across any odd behaviour or incorrect results please let me know and I will fix it and put a new version up.

I think this Excel method is superior to the web based tool as it allows the same level of detail as the web page plus it allows analysts to look directly under the hood and see how the work is being done. Best of all, you can download the spreadsheet and run everything locally and not have to worry about submitting sensitive data over the web.

One final tip: if instead of the share of event you want to instead know total time spent in each hour, create a new column that contains the event duration in seconds and multiply that by each of the portions in each of the hours. This new table will then show how much time was spent in each hour for each event.

## Medians and Excel Pivot Tables

One of the issues with pivot tables in Excel 2010 (and earlier versions) is that it isn’t possible to calculate median using the Value Field Settings. In this post I am going to first extol the virtues of the median (again) and then demonstrate a formula for calculating the median in a cross tabulation.

Medians Again?

Why do we care about the median and why can’t we just use the mean? The mean is the proper measure of central tendency when what we want to average is normally distributed. In those situations, adding everything up and dividing by the number of elements will give the value right at the peak of the distribution. However, when our distribution of values is heavily skewed the mean is less representative than the median. In those cases it makes more sense to take the middle value (i.e. the median) of all the values when they are ordered from smallest to largest.

The classic example of using the median is household income. There are a lot more small income households than large income households and those big households can really distort the mean. For example, let’s say that we have 7 households with incomes (in \$10,000s) of 1, 1.4, 1.8, 2, 2.1, 2.4 and 13.2. The mean of those numbers is 3.5 while the median is 2. Clearly 2 is a better representation of the centre of the data than the mean, which has been dragged higher by the one large income.

A similar issue occurs with any data set that is zero based.  For example, say you want to do an analysis to determine how long offenders wait between their arrest and when they are fingerprinted. Many services schedule identification appointments days or even weeks after the arrest and it might be interesting to see what the average time is between arrest and appointment. The shortest possible wait is 0 days, the fingerprinting is done on the day of the arrest, and in general one would expect a shorter time rather than a longer one. While there may be some 28-day waits we don’t want these outliers to unduly skew our average so we choose to take the median rather than the mean.

An Example

Let’s take this example further by considering a spreadsheet were we have 1000 rows of arrest times and fingerprint times and the duration in days between them.

The easiest way to determine the median would be to use the formula =MEDIAN(D2:D1000) but that would only give us the median for the entire data set. When analysing our data with pivot tables we are often interested in cross tabulating based on a particular element of the data and in this example we want to see if average wait time is impacted by the day of the week of the arrest.

To begin the analysis we first insert a new pivot table and use it to determine average (by which Excel means mean) wait time for each day of the week.

Have a look at these mean values, particularly Saturday. What is up with that? I’ll tell you: there are errors in the data set. Someone has keyed the dates in incorrectly and there are outliers that are horribly skewing the results. This is the other benefit of working with the median: medians resist the impact of outrageous outliers in a way that the mean doesn’t. We could go back in and do some data cleansing but let’s leave it for now so we can see how the median will handle it.

If you look at the list of options in the Value Field Settings you’ll see that Median is not an option. Since pivot tables don’t provide the option for determining the median we need to do some copying and pasting and use a formula to make it happen. Whenever I am investigating the median I first copy-and-paste-values the results of the mean pivot table to a new work sheet so that I can use the labels in my function calls. The new worksheet looks like this.

To calculate the median for each of the days of the week it is necessary to pull only those rows of data in the Data worksheet for that match the particular day. Once I have those rows I use the MEDIAN function to calculate the specific median for that day of week. To do this I use the formula in cell D4 below:

It is important to notice that the formula in the screen shot is surrounded by curly braces. You don’t type the braces in because Excel adds them when you create what’s known as an Array Formula. Array formulas are different than regular Excel formulas because they work with a whole series of data values rather than a single data value. To tell Excel that you intend a formula to be an array formula you must press CTRL SHIFT ENTER (all three keys at once, like CTRL ALT DEL) at the end, rather than just hitting ENTER. You’ll know that you did it right because Excel will add the curly braces.

In this particular array formula what we’re doing is using an IF formula to check every one of the Day of Week values in column A of the Data worksheet to see if it equals Sunday, or the value in B4. If the value does equal Sunday we return the wait time from column D for that row. The IF statement will check every row and return a range of values (all of the ones for Sunday) to the MEDIAN function which in turn calculates the median wait time for Sunday. The formula can then be filled down for each of the days just like you would with a regular formula.

You can see in the screen grab that we have median values for each of the days of the week. Clearly the outlier that impacted Saturday has been handled by using the median but also note the other days (except Monday) now have significantly smaller values. This is further evidence that the median is the correct choice for this data as using the mean, even when the data is clean, would lead to erroneous overestimates.

Conclusion

The bad news is that Excel can’t calculate medians in a pivot table. The good news is that with a simple formula and the labels produced by the pivot table it is possible to calculate the median for different cross tabulations. It does require a little experimentation with array functions but this particular example is pretty tame and serves as a decent introduction to this very powerful technique.

## Police Analytics in HQ Magazine

I wrote a Police Analytics article for the most recent issue of the OACP’s H.Q. Magazine. I tried to communicate the value of data to police services and how my Service has approached the challenge.

This article was originally published in the WINTER 2014-15 issue of H.Q. Magazine, the official publication of the Ontario Association of Chiefs of Police.

At the beginning of 2014, Halton Regional Police Service (HRPS) created a new unit with a mandate to exploit police data assets to support evidence-based decision making. Combining scientists, programmers and police experts the new unit serves as a clearinghouse for quantitative analysis and report development for customers throughout the Service. From frontline platoons and investigative bureaus to administrative units and the executive branch this unit, named Police Analytics, has changed the way analytic insights are generated and delivered to in-house customers.

Historically, police services have spent considerable resources gathering data; first in the form of paper documents and subsequently in the form of records management systems. The primary function of this data has been to document activity and support investigations but as these data stores have grown in size and complexity the possibility of conducting large-scale analysis to produce efficiencies and insights has been recognized. The challenge is that the data is not in a format conducive to performing analysis but if this challenge could be overcome then tremendous value would be realized.

Recognizing this opportunity, HRPS chose to invest in a Business Intelligence (BI) software system to better exploit these data assets. Business Intelligence is a combination of hardware and software designed to transform existing data resources into a form that is better suited to reporting and analysis. The work is highly technical but the end result is a single database that contains the entirety of a Service’s data. The HRPS IT department, working closely with our BI vendor, spent over 18 months creating a BI database tuned to the Service’s needs and the final result was unparalleled access to our information. But data alone is not enough, you also need skilled analysts who can intelligently and imaginatively utilize that data to produce results, and those skilled analysts work in our new Police Analytics unit.

Police Analytics was envisioned as a different kind of law enforcement analysis; it isn’t crime analysis or intelligence analysis but is instead a data-centric role that provides quantitative analytical products to all levels of the Service. We realized early on that we wanted individuals with a background in math, engineering or the sciences so that they would be capable of performing complex statistical work. Further, the new analysts needed to be highly technical so that they would be comfortable working with databases and writing software to perform their analysis. This combination of skill sets echoes many of the talents of programmers and developers in the world of tech companies and so that was the model we adopted. To get the best people we decided to hire expertise from outside law enforcement and in so doing we created a tech start up inside the Police.

Challenges for a New Unit

Like any start up, there have been growing pains. Our initial conversations focused on where in the organization to position this new unit and while, from a technical perspective, the unit could logically fit in IT, from an analytical perspective it was decided that the unit should fall under the Chief’s Staff. The access provided by working directly for the executive allows the analysts to have a more direct line to senior command—better to communicate analytical findings and field requests—and working alongside other executive units such as planning, audits and policy meant that the analysts could develop a holistic understanding of how data flows throughout the organization. The placement of a highly technical unit outside of the traditional IT infrastructure was a novel undertaking and providing the needed access for police analysts to do their work meant that policies and practices had to be adapted. Consensus was reached through negotiation and collaboration between departments and we were able to ensure data integrity and successfully address security concerns.

The next challenge was one of branding. We had constructed a high-functioning unit that produced useful analysis but we needed the Service to know about it. To address that issue we started an internal campaign advertising ourselves throughout the Service as the source for statistics and analytical products. We positioned ourselves as a resource for Service members of any rank to get data and advice to support their projects and we emphasized the value of having a solid quantitative foundation to produce successful project outcomes.

Evidence-based Decision Making

Our outreach efforts focused on promoting a culture of data-driven, evidence-based decision making and we encouraged Service members to think about data collection and how subtly adjusting business practices could lead to better data which in turn would lead to better analysis. As an example, our computer systems allow officers to push a button every time they change activity but some officers had gotten in the habit of not pushing the button and this lead to data gaps. To address this issue we communicated to officers how consistently updating their status led to much more accurate activity reporting that better captured all of the work they performed throughout the day. When officers discovered the benefits of pushing the buttons, namely credit where credit is due, they modified their behaviour and adopted a data-driven mentality.

We’ve worked hard to change behaviours and clean up our data sets and we’ve started to see rewards for those efforts. One of our successes has been a project that studied call volume and peak period staffing. Calls for service fluctuate throughout the day but our officers work 12-hour shifts and that leads to peaks and valleys in officer busy-ness. By accurately capturing officer status changes we obtain an extremely detailed view of what patrol officers are doing throughout their shifts and with this level of detail it is possible to broadly categorize officer time as either ‘busy’ or ‘available’. For our analysis, we aggregated hundreds of thousands of time-stamped officer activities to construct daily and hourly profiles of when our officers have the heaviest work load and using that data senior commanders are able to construct additional shifts to move officers to busy times. The end result is a reduction in officer busy-ness during peak periods and a levelling of the overall work load across day and night shifts and because the system is data driven we are able to measure the impact of the shift changes and quantitatively demonstrate our success.

Increased Data Awareness

Beyond analytical work the police analytics unit also studies organizational business practices to identify “pain points”. If a business process is cumbersome or time consuming we brainstorm as a unit how to rectify the problem. This practice has led to the development of a number of reporting tools that repackage existing data assets into more useable formats. A few examples include summary reports of field contacts for front line officers, absence management reports for human resources, and occurrence mapping tools for crime analysis. The point of these reports is not that they contain analysis but that they take existing data that is stored in isolation and synthesize it into an easily read report and where before an individual tasked with reviewing these entities may have spent hours clicking through the RMS they can now see everything in one report.

Perhaps our biggest success in this vein is the introduction of our officer activity report that allows Service members to review their activity. Our RMS captures officer activities such as arrests, charges, and tickets and our new reporting tools allow supervisors to review monthly summaries of this information for their units in a web-based, on-demand format. This reporting tool offers many improvements over the old, self-reported monthlies including accuracy, standardization and time savings. This tool has eliminated the need for members to waste time collecting and collating data that has already been captured in our databases and has resulted in a greater awareness, both for members and for command, of officer activity.

Lessons Learned and the Future

With our success establishing the police analytics unit HRPS has learned a number of lessons that may be instructive to other Services looking to create a data-focused unit:

• You need buy-in from the highest levels of the organization. When attempting to create a data-driven culture of analysis that message needs to be embraced and communicated through management.
• Success is built around people and not technology. Simply buying a piece of software does not solve your problems; you need a small core of dedicated experts who believe in the mission to find success.
• Success takes time and progress will be slow. It is not possible to successfully influence the culture of a large organization quickly and that’s doubly true when the changes are related to a complex matter such as data.
• Change is an iterative process. Only when you start looking will you see how your processes need improvement and once those improvements are in place you’ll need to gather more data and you’ll likely see more necessary changes.
• The unit needs access to senior management. The police analytics unit needs the latitude to communicate proposed changes to senior officers and receive approvals quickly.

HRSP is proud of what our police analytics unit has accomplished in a short time. We have made meaningful contributions to improving the efficiency and effectiveness of the Service and have successfully promoted the value of data-driven decision making in the organization. We also spent 2014 promoting our vision for police analytics to other Services in Canada as well as at conferences in North America and Europe where our ideas have been enthusiastically received. In 2015 we plan to ramp up our software development and update our web-based tools so that officers in the field can access information on their tablets and phones. The volume of data captured by the police is going to keep growing and specialized systems and analysts are needed to extract value from those assets. We believe Police Analytics is the future of law enforcement analysis and we encourage everyone to consider how it might benefit their organization.

## Prioritizing Using Exponential Decay

I had an officer ask the other day if I could identify the top 10 subjects in our records management system that were involved in mental health occurrences. On the surface this is a straightforward request: identify the people with the most activity but it gets trickier when you factor in the passage of time. For example, is someone who had 3 incidents over 6 months ago more of a concern than someone who had 2 incidents in the last 8 weeks? Clearly recency should factor into how a person is prioritized but history should also weigh in. To address questions like this we can use an exponential decay equation in Excel to preferentially weight recent incidents while still giving some weight to the past. In this post I will outline how to apply the method.

The first step is to obtain some data. Depending on the nature of the incident I will set an arbitary historical cut-off; in this example we’ll set it a year. This then means that we need to extract every mental health occurrence over the last year along with the date of the occurrence and a unique person identifier.

For each occurrence we then calculate the age of the incident in days by subtracting the date of the occurrence from today. This is done using Excel’s NOW() function to get the current date and the INT() function to trim off the time since we are working strictly with days. the following screen shot shows the formula in the function bar.

Next we want to create a column for the exponential decay formula. Most people are familiar with the idea of exponential decay, it is a mathematical formula where a quantity decreases at a rate proportional to its current value. In practice that means that the value drops off quickly and first and then more slowly as the value shrinks. The chart below shows a graph of exponential decay for a number of different constants. As can be seen in the chart we can tune how fast the values drop off by adjusting the constant.

In Excel, the formula for exponential decay is =EXP(-constant*time) where we get to specify the constant and time, in our case, is the age of the incident. Since we’re dealing with days and we only have a year of data we can decide how much we want past incidents to influence a person’s priority. For example, if we say that a mental health incident that happened today is worth 1 point, how much is an incident that happened 6 months ago worth? If we say that it’s work 0.5 points then we can calculate our constant as: -1*LN(0.5)/180. Where LN() is Excel’s natural logarithm function, 0.5 is point value we want to calculate and 180 is the (approximate) number of days in 6 months. This calculation results in a value of 0.003851 which we can use as our constant.

Back in our data worksheet we can finally create our new column that calculates the decay score for each incident based on its age. We use the exponential decay formula =EXP(-1*0.003851*age) to calculate the decay score for each incident as in the following screen shot. Note that the decay value will always be less than 0 but never reach 0.

Now that we have a decay score for each occurrence we need to create a pivot table of all the data. Highlight everything and select ‘PivotTable’ from the ‘Insert’ tab. This will open a modal window and you can just hit the ‘OK’ button to create your pivot worksheet.

On the pivot table worksheet drag the Identifier field to the Row Labels box and the Decay Score field to the Values box. Click on ‘Count of Decay Score’ in the Values box and select the ‘Value Field Settings’ option at the bottom of the pop-up window. Change the selected option from Count to Sum and click ‘OK’.

Almost done. Back in your pivot table right-click the Sum of Decay Score column and sort ‘Largest to Smallest’. This will sort the persons by their aggregated decay scores and the people at the top are the highest priority individuals.

This method works because the exponential function embodies the principles we were originally trying to capture: it gives more weight (a higher score) to recent incidents but it also gives some score to historic incidents as well. For every occurrence a person is involved with they have the potential to add up to 1 point to their total score (if the incident happened just today) but as the incidents age they influence the total score less. In our example above the person with the 3 incidents over six months ago has, at most, a score of 1.5 (since 0.5 is the 6 month mark) while our 2 incidents in the last 8 weeks person is likely to have a score a little less than 2. So this system weights these people the same by balancing the influence of recent and history. The best part is that we can tune how much weight we give to the two factors by adjusting the exponential decay constant. If we decide that an incident 6 months is only worth 0.1 point rather than 0.5, which greatly privileges recency over history, we can adjust the constant and rescore everyone. On the other hand, if we wanted to make a 6 month incident worth 0.9 points, we could recalculate the scores and privilege history over recency.

I like this method of prioritizing because it’s simple to implement in Excel (no plug-ins!), it’s logically justifiable and it allows the impact of recency and history to vary according to the expertise of the analyst. I hope you find it useful and good luck prioritizing!

## Performing a Chain Analysis in Excel

Chain Analysis is the name I’ve given to the act of finding sequential, potentially related phone calls in phone records. The job is to identify if a particular number called a second number and if that second number called a third number within a specific time frame, say 10 minutes. I call these sequences of calls a chain and it can be used to identify whether someone might be issuing orders or instructions via third parties. The analysis can be done in Excel and while it involves some moderately-complicated Excel functions, once the spreadsheet is set up, subsequent analyses can be done easily.

Getting Started

To start, the first thing you need to do is arrange your data properly. In the screen shot below I have arranged some sample data where column A lists the calling number, column B lists the called number and column C lists the date and time of the call. In order for this technique to work properly it is important that the data set be sorted on the date time column.

For simplicity sake (and ease of reading) I’ve replaced the phone numbers with letters but really any string will work in the analysis. Likewise, I only have 7 rows of data but you can have a whole bunch more and the analysis will work just fine.

While we’ll need a few extra columns for functions there is no reason you can’t keep additional data in the rest of the columns. As long as you sort everything properly the rest of your information will tag along as you perform the analysis.

Notice in the list that there are some chains that are easily identified without resorting to a function. Notice how, on row 3, AA calls CC and then CC calls DD a minute later. Similarly, on row 6, GG calls AA and then a minute later GG calls BB. This is also a chain to be identified. The trick now is to write a function that is able to flag these rows so that we don’t have to rely on picking them out manually.

To flag the row we’re first going to create a new column that, for each row, “looks ahead” at the rows coming up. What we want the function to do is look ahead to see how many subsequent rows fall within the time frame window we’re interested in. In the discussion above I mentioned 10 minutes so, for each row, we want to count how many subsequent rows occur within 10 minutes of the date and time of the current row. The screen shot below captures the function that I wrote for cell D2.

The very first thing to notice is that the function is wrapped in curly braces like this: {<function>}. This means that the function is what’s known as an array formula rather than a traditional scalar formula. To enter an array formula you click in the cell like you normally would but instead of hitting enter when you’re done you need to hit ctrl-shift-enter (at the same time) to tell Excel that it should be considered an array formula.

This particular function needs to be used as an array formula because we are asking the function to compare each subsequent cell to the current one in the IF statement, so we’re comparing an array of values rather than just a single one. Also, when you’re typing the function into the cell, don’t include the curly braces, Excel will add those after when you correctly use ctrl-shift-enter to complete the array function.

Let’s examine the IF statement next. The inner most part of the function is a request to check and see if the difference between cell C3 and cell C2 is less than 600 (our 5 minute cut-off). Since the date and time values in the C column are in Excel time we need to convert them to seconds before we do the comparison, which is why I multiply them by 24 (hours) and then 3600 (seconds) to get the difference into seconds. In the IF statement, if the difference between C3 and C2 is less than 600, then we return a 1 and if not we return a 0. Now, the tricky part is to wrap your head around the idea that, because this is an array function, we can use the expression C3:\$C\$8-C2 to do all of the comparisons at the same time. This means that we are doing C3-C2, C4-C2, C5-C2, etc, all the way to C8-C2. For each one of them we do the x24 x3600 conversion and comparison against 600 seconds. For each one of them we are getting either a 1 or a 0. The power of the array function method is that we can do all of those calculations at the same time, in the same cell.

Have a look at the SUM function that wraps the IF statement. Knowing that this IF statement is multivalued makes the SUM function make more sense. The role of the SUM is now to sum up all of the 1s that were returned by each of the IF statements. Have a look at the value of D2, it’s 3. It’s 3 because the date/time in C3, C4 and C5, are both within 10 minutes (600 seconds) of the value in C2.

Array formulas can be filled down just like regular functions so I have filled down the rest of the rows with this function. The key thing to remember is that you only need to compare the current row to the all the subsequent rows, and so the important bit of the function, the C3:\$C\$8-C2 bit, will always have the current row in the C2 position and the next row in the C3 position. As an example, in cell D4 the function is =SUM(IF((C5:\$C\$8-C4)*24*3600<600,1,0)) . Note that for row 4, C4 is used to subtract and the range of cells to check starts at C5.

Figuring Out the Look Ahead Row Number

The next step in the analysis is create a new column to hold the end row value for the look ahead. If the values in column D tell us how many rows to look ahead then the values in column E tell us what row that actually is. Have a look at the function in cell E2 in the screen grab below.

This function uses the ROW() function to first tell us what row we are currently in. Then the value of D2 is added to that to arrive at the final value. For cell E2 the value is 5 since the ROW() function returns 2 and the value of D2 is 3. What this means is that we only need to look ahead to row 5, from row 2, to find all the phone calls that occurred within the 10 minute window. Anything beyond row 5 is more than 10 minutes after the call in row 2. The end row is an important component of the next part of the analysis as we need to be able to specify exactly what cell range we should be looking at for determining chains.

Flagging the Chained Calls

Have a look at the screen grab for the final step of the analysis, the step where each row is flagged as part of a chain or not.

The heart of the function is the use of the MATCH function to compare the value in column B, the Callee column, with the values in the Caller column for all the rows up to the end row identified in column E. In order to do that we need to use the INDIRECT function, which is used here to assemble a cell range using the ROW() function and the value from column E. Inside the INDIRECT is (“A” & (ROW()+1) & “:A” & E3). If you evaluate this we can see that ROW()+1 becomes 4 (since ROW() is 3 and 3+1 is 4) and the value in E3 is 5, which leaves (“A” & 4 & “:A” & 5). If you slam all that together (which is what the &s are for) you end up with A4:A5, which is the range of cells we want to test our MATCH against.

Something else you need to know: when MATCH can’t find a match it returns NA, which is why the entire MATCH call is wrapped in an ISNA function call. If MATCH returns an NA value, then ISNA will return true, otherwise it returns false. I then feed that true or false into an IF statement (which is only interested in evaluating things to true or false anyway). If the value in the IF is true, that is, MATCH returned NA, then IF prints nothing, “”, to the cell. However, if MATCH returns a value and therefore ISNA returns false, then the IF statement will print “CHAINED” in the cell. For good measure, I wrap the whole thing in another IF statement to check and see if the look ahead rows in cell D3 are greater than 0. If not, there is no way there can be a chain, so the function just prints the empty string “” and we skip the whole MATCH, ISNA, IF process.

So let’s have a look at the results the function spit out. We can see that two chains have been identified. The first is on row 3 where AA called CC and then CC called DD within 10 minutes. This is what the CHAINED in cell F3 is identifying. The second CHAINED is for the chain that begins when CC calls DD and then DD calls FF (on row 5) within 10 minutes. The take away here is that chains are identified on the row in which they begin so it’s up to you to follow the chains from the starting point where they are identified.

Bonus Chains

The above example looks at Caller-Callee chains but almost the same function can be used to flag Caller-Caller chains, the ones where a caller calls a number and then within the time interval the same caller calls another number. In the screen shot below Caller-Caller chains have been identified for the second row and the sixth row and the only difference between the functions in column F and column G is that the MATCH in the column G version is being done on the values in column A (e.g. A2) rather than column B (e.g. B2). All the same logic applies but the change in the matching column changes the chaining subject.

As can be seen in the chart, row 2 is a Caller-Caller chain because AA calls BB and then within 10 minutes AA calls CC. Similarly, on row 6, GG calls AA then GG calls BB with 10 minutes. Just like with the Caller-Callee chains above, the CHAINED flag is listed on the row where the chain begins and it’s up to the analyst to follow the chain through.

Conclusion

While the functions in this post can seem a little hairy at the start I think, with a little exploration, that you can become comfortable enough with them to trust the results. While it’s true that a custom VBA function could be written to perform this analysis I think it’s important to find solutions that first attempt to use the built-in Excel functions before resorting to custom code. I hope this post has demonstrated how a fairly sophisticated kind of analysis can be performed with just the tools that come in standard Excel. Good luck!