The police produce new data points on a weekly, monthly and annual basis and every time a new number comes out there is an instinctive response to compare the new number to the last number, figure out if it’s gone up or down and attach the dreaded up and down arrows. Those up and down arrows (usually red and green respectively) can spur a lot of discussion and a lot of work as people change practice or get chewed out by superiors in an effort to make a change. In this post I am going to demonstrate how to use simple regression and prediction intervals to move beyond the up and down arrows by using historic data and some math to capture the inherent variability in a data series.
One important note before we begin: the approach detailed below is for data without a seasonal trend. This means that monthly data is likely out since most police data has a lull in the winter and a peak in the summer. I primarily use this approach for forecasting annual, running 12-month or year-to-date data. There are more elaborate methods for forecasting data with a seasonal trend and I hope to write a post about that someday.
For this post I am going to use the annual Crime Severity Index that is produced by StatsCanada for every police service in the country. StatsCan recently published the 2016 data for a local police service and in the screen grab below I demonstrate the up-and-down arrow method of trend analysis (which, not coincidentally, is how StatsCan communicates it out to police services).
As you can see, between 2015 and 2016 the Crime Severity Index went up by 6.44% (going up is bad, hence the ominous red X). Not a catastrophic increase but an executive might still demand some answers or possible some action to reverse the trend.
But let’s look at the last 16 years’ worth of Crime Severity Index values for the same police service.
The line chart tells a different story, one where CSI has been dropping for over a decade and where this year there has been a subtle uptick. The chart helps people understand visually what is going on but we can go further than the graph and build a simple model to help us understand the numbers even better.
For our model we’re going to be using simple linear regression. Recall from your days of high school mathematics that a linear regression model has the form Y = mX + b where m is the slope of the line, b is the intercept of the line, X is the independent variable (i.e. the one we can choose) and Y is the dependent or response variable (i.e. what comes out of our model when we plug in our values).
Excel makes it very easy to build a linear regression model using a line chart. Simply right click on the chart line and select “Add Trendline…”, you can even have the equation displayed on the screen like so:
In this model, m has the value -2.0433 and B has the value of 57.375. X and Y remain variables because we can sub in an X and get a Y (or the CSI value) in return.
While right-clicking on the chart to add the trendline is quick you can also use Excel functions to derive each of the pieces. The following screen grab details the function used to determine both the slope m and the intercept B.
The first takeaway is that Excel’s SLOPE and INTERCEPT formulas can be used to calculate the m and B values that are found on the line chart.
The second thing to notice is that in the table of data points the Year goes from 1 to 17 rather than 2000 to 2016. This was done because that’s how the linear model on the chart works but you can also put in the years and get nearly the same result: the slope will be identical but the intercept will be larger (on the order of a year). In the following screen grab I’ve changed over to using years to make things easier, like so:
A linear model is useful because it can be extended into the future to produce a forecast. For example, with the model parameters defined above it is possible to forecast a CSI value for next year by setting X equal to 2017 and calculating Y:
The point forecast for the 2017 Crime Severity Index is therefore 27.27.
While forecasting the future has obvious benefits it doesn’t address the up-and-down arrow issue. To address that issue we first want to produce a forecast for this year, 2016, and compare it to 2016 actual. And if we’re going to produce a forecast for this year, we shouldn’t include this year’s data in the creation of the linear model because we need to pretend like we don’t know this year’s value yet. In the following screen grab I’ve simply dropped the 2016 data from the calculation of the slope and the intercept:
By dropping the 2016 data point and recalculating the slope and intercept with the other 16 years’ worth of data, the slope and intercept change slightly. Using these news values and plugging in 2016 for X produces the following forecast for 2016.
Looking at the predicted value of 22.05, it makes sense if you don’t know anything about 2016. Every year before now the values have been trending down and all the historical data points are pointed in that direction. From the math’s point of view, there is no reason to think that the CSI should go up in 2016.
But, the forecast is wrong and that’s to be expected because point forecasts are nearly always wrong and if you give a senior decision maker only a point forecast, you’re going to end up explaining why the true value, when it comes around, isn’t the same as the forecast and then you’re going to find that people don’t trust your forecasts anymore.
So why bother with this exercise? Because linear models and point forecasts are the first step towards prediction intervals and those really do have value.
A prediction interval is simply an interval that is placed around the Y forecast that says “expect the value to be in this range, X% of the time”. I use prediction intervals as “I’m not surprised” ranges because if I forecast a number and calculate its prediction intervals and then the real number shows up and it’s within the interval range, I’m not surprised. The math said it would be so. If the real number happens to be outside the interval, well that’s something to investigate.
The following screen grab shows the mathematical formula that is used to calculate prediction intervals. I know it looks intimidating but I’ll break all the pieces down.
Breaking down the formula, here’s what each of the variables mean:
- y-hat (as it’s called) is the point forecast from above
- t*n-2 is the t-score with degrees of freedom n-2
- s is the standard error of the residuals
- n is the number of data points we have in our model
- x* is the value we plugged into the linear model to get y-hat
- x-bar is the mean value of all the x data points
- sx is the standard deviation of all the x data points
You may be familiar with some of these concepts and unfamiliar with others, and that’s ok. The good news is that we can calculate each of these with Excel and arrive at the prediction intervals by combining them all into the formula above.
In the following screen grab I’ve expanded on the initial linear forecast from above and added in several new calculations that build up each of the variables needed for the prediction interval.
Each of the pieces of the prediction interval equation are calculated in column F and column G shows the formulas that are being used. Pay particular attention to cell F11 as it contains the point forecast discussed above. Cell F13 has the margin of error value calculated using the equation above and in our example it has the value of 6.197. It is a simple matter to add and subtract that margin of error from the point forecast to create the 95% prediction interval and arrive at the upper and lower bounds on the forecast of 28.242 and 15.847.
Recall earlier that the point forecast of 22.045 was a poor estimate since the actual value for 2016 is 24.80. Also recall, from way back at the start of this post, that CSI in 2016 was up 6.4% and the question was raised “is this an issue”? Looking at the prediction interval of [28.242, 15.847] it’s clear that the actual value for 2016 is well within the range that the history of values tells us to expect and therefore the actual value is neither surprising or an issue.
The beauty of prediction intervals is that they allow us to do away with the up-and-down arrow approach that could have us chasing our tails and instead gives us an indicator we can use to figure out if something is outside the norm. While the original calculation is somewhat complex it is a straightforward exercise to set up an Excel template where all you need to do is drop in the values and have the intervals spit out the other end. I use intervals at least once a day to get a handle of how concerned I should be over changes in data series and I hope you find value in adding this technique to your analyst toolbox.