Forecasting with Simple Regression and Prediction Intervals

The police produce new data points on a weekly, monthly and annual basis, and every time a new number comes out there is an instinctive response to compare it to the last number, figure out whether it has gone up or down, and attach the dreaded up and down arrows. Those arrows (usually red for up and green for down) can spur a lot of discussion and a lot of work as people change practice, or get chewed out by superiors, in an effort to reverse the trend. In this post I am going to demonstrate how to use simple regression and prediction intervals to move beyond the up and down arrows, using historic data and some math to capture the inherent variability in a data series.

One important note before we begin: the approach detailed below is for data without a seasonal trend. This means that monthly data is likely out since most police data has a lull in the winter and a peak in the summer. I primarily use this approach for forecasting annual, running 12-month or year-to-date data. There are more elaborate methods for forecasting data with a seasonal trend and I hope to write a post about that someday.

For this post I am going to use the annual Crime Severity Index that is produced by Statistics Canada for every police service in the country. StatsCan recently published the 2016 data for a local police service and in the screen grab below I demonstrate the up-and-down arrow method of trend analysis (which, not coincidentally, is how StatsCan communicates it out to police services).

As you can see, between 2015 and 2016 the Crime Severity Index went up by 6.44% (going up is bad, hence the ominous red X). Not a catastrophic increase, but an executive might still demand some answers or possibly some action to reverse the trend.

But let’s look at the last 17 years’ worth of Crime Severity Index values (2000 through 2016) for the same police service.

The line chart tells a different story, one where CSI has been dropping for over a decade and where this year there has been a subtle uptick. The chart helps people understand visually what is going on but we can go further than the graph and build a simple model to help us understand the numbers even better.

For our model we’re going to be using simple linear regression. Recall from your days of high school mathematics that a linear regression model has the form Y = mX + b where m is the slope of the line, b is the intercept of the line, X is the independent variable (i.e. the one we can choose) and Y is the dependent or response variable (i.e. what comes out of our model when we plug in our values).

Excel makes it very easy to build a linear regression model using a line chart. Simply right-click on the chart line and select “Add Trendline…”; you can even have the equation displayed on the chart, like so:

In this model, m has the value -2.0433 and b has the value 57.375. X and Y remain variables because we can sub in an X and get a Y (the CSI value) in return.

While right-clicking on the chart to add the trendline is quick, you can also use Excel functions to derive each of the pieces. The following screen grab details the functions used to determine both the slope m and the intercept b.

The first takeaway is that Excel’s SLOPE and INTERCEPT formulas can be used to calculate the m and b values that are found on the line chart.

The second thing to notice is that in the table of data points the Year goes from 1 to 17 rather than 2000 to 2016. This was done because that’s how the linear model on the chart works, but you can also put in the actual years and get an equivalent result: the slope will be identical but the intercept will be much larger, since the model now has to stretch all the way back to year zero. In the following screen grab I’ve changed over to using years to make things easier, like so:

A linear model is useful because it can be extended into the future to produce a forecast. For example, with the model parameters defined above it is possible to forecast a CSI value for next year by setting X equal to 2017 and calculating Y:

The point forecast for the 2017 Crime Severity Index is therefore 27.27.
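
If you ever want to reproduce the slope, intercept and point forecast outside Excel, here is a minimal Python sketch. The CSI values in it are hypothetical stand-ins for the series in the screen grabs, so the output will not match the numbers above; it is the mechanics, which mirror SLOPE and INTERCEPT, that matter.

```python
# A minimal sketch of the same steps outside Excel. The CSI values below are
# made up purely for illustration (the real series lives in the screen grabs),
# so the printed forecast will not be the one quoted in the post.
from scipy.stats import linregress

years = list(range(2000, 2017))                      # 2000 through 2016
csi = [55.2, 53.9, 51.0, 48.7, 46.1, 44.8, 42.0, 39.5, 37.2,
       35.8, 33.1, 30.9, 28.4, 26.5, 24.9, 23.3, 24.8]   # hypothetical values

fit = linregress(years, csi)                         # Excel's SLOPE and INTERCEPT in one call
forecast_2017 = fit.slope * 2017 + fit.intercept     # Y = mX + b with X = 2017
print(f"m = {fit.slope:.4f}, b = {fit.intercept:.3f}, 2017 forecast = {forecast_2017:.2f}")
```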

While forecasting the future has obvious benefits it doesn’t address the up-and-down arrow issue. To address that issue we first want to produce a forecast for this year, 2016, and compare it to the 2016 actual. And if we’re going to produce a forecast for this year, we shouldn’t include this year’s data in the creation of the linear model, because we need to pretend we don’t know this year’s value yet. In the following screen grab I’ve simply dropped the 2016 data from the calculation of the slope and the intercept:

Dropping the 2016 data point and recalculating with the other 16 years’ worth of data changes the slope and intercept slightly. Using these new values and plugging in 2016 for X produces the following forecast for 2016.

Looking at the predicted value of 22.05, it makes sense if you don’t know anything about 2016. Every year before now the values have been trending down and all the historical data points are pointed in that direction. From the math’s point of view, there is no reason to think that the CSI should go up in 2016.

But the forecast is wrong, and that’s to be expected because point forecasts are nearly always wrong. If you give a senior decision maker only a point forecast, you’re going to end up explaining why the true value, when it comes around, isn’t the same as the forecast, and then you’re going to find that people don’t trust your forecasts anymore.

So why bother with this exercise? Because linear models and point forecasts are the first step towards prediction intervals and those really do have value.

A prediction interval is simply an interval that is placed around the Y forecast that says “expect the value to be in this range, X% of the time”. I use prediction intervals as “I’m not surprised” ranges because if I forecast a number and calculate its prediction intervals and then the real number shows up and it’s within the interval range, I’m not surprised. The math said it would be so. If the real number happens to be outside the interval, well that’s something to investigate.

The following screen grab shows the mathematical formula that is used to calculate prediction intervals. I know it looks intimidating but I’ll break all the pieces down.

Breaking down the formula, here’s what each of the variables means:

  • y-hat (as it’s called) is the point forecast from above
  • t*n-2 is the t-score with degrees of freedom n-2
  • s is the standard error of the residuals
  • n is the number of data points we have in our model
  • x* is the value we plugged into the linear model to get y-hat
  • x-bar is the mean value of all the x data points
  • sx is the standard deviation of all the x data points
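
Since the formula itself only appears in the screen grab, here it is written out in the notation of the list above (this is the standard prediction-interval formula for simple linear regression, which, as far as I can tell, is exactly what the screen grab shows):

prediction interval = y-hat ± t*(n-2) × s × sqrt( 1 + 1/n + (x* − x-bar)² / ((n−1) × sx²) )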

You may be familiar with some of these concepts and unfamiliar with others, and that’s ok. The good news is that we can calculate each of these with Excel and arrive at the prediction intervals by combining them all into the formula above.
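
If you would like to sanity-check the spreadsheet arithmetic in a script, here is a hedged Python sketch of the same build-up: fit on 2000 to 2015, forecast 2016, then widen the point forecast into a 95% prediction interval using the formula above. As before, the CSI values are invented stand-ins, so the printed interval will not match the post.

```python
# Prediction interval for a simple linear regression forecast; the csi values
# are hypothetical placeholders for the data shown in the screen grabs.
import numpy as np
from scipy import stats

years = np.arange(2000, 2016)                        # training data: 2000-2015 only
csi = np.array([55.2, 53.9, 51.0, 48.7, 46.1, 44.8, 42.0, 39.5,
                37.2, 35.8, 33.1, 30.9, 28.4, 26.5, 24.9, 23.3])

n = len(years)
fit = stats.linregress(years, csi)

x_star = 2016
y_hat = fit.slope * x_star + fit.intercept           # the point forecast

residuals = csi - (fit.slope * years + fit.intercept)
s = np.sqrt(np.sum(residuals**2) / (n - 2))          # standard error of the residuals
t_star = stats.t.ppf(0.975, df=n - 2)                # t-score for a 95% interval
x_bar = years.mean()
s_x = years.std(ddof=1)                              # standard deviation of the x values

margin = t_star * s * np.sqrt(1 + 1/n + (x_star - x_bar)**2 / ((n - 1) * s_x**2))
print(f"forecast {y_hat:.2f}, 95% interval [{y_hat - margin:.2f}, {y_hat + margin:.2f}]")
```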

In the following screen grab I’ve expanded on the initial linear forecast from above and added in several new calculations that build up each of the variables needed for the prediction interval.

Each of the pieces of the prediction interval equation is calculated in column F, and column G shows the formulas that are being used. Pay particular attention to cell F11 as it contains the point forecast discussed above. Cell F13 has the margin of error calculated using the equation above, and in our example it has the value of 6.197. It is a simple matter to add and subtract that margin of error from the point forecast to create the 95% prediction interval and arrive at upper and lower bounds on the forecast of 28.242 and 15.847.

Recall earlier that the point forecast of 22.045 was a poor estimate since the actual value for 2016 is 24.80. Also recall, from way back at the start of this post, that CSI in 2016 was up 6.4% and the question was raised “is this an issue?” Looking at the prediction interval of [15.847, 28.242] it’s clear that the actual value for 2016 is well within the range that the history of values tells us to expect, and therefore the actual value is neither surprising nor an issue.

The beauty of prediction intervals is that they allow us to do away with the up-and-down arrow approach that could have us chasing our tails and instead give us an indicator we can use to figure out if something is outside the norm. While the original calculation is somewhat complex, it is a straightforward exercise to set up an Excel template where all you need to do is drop in the values and have the intervals spit out the other end. I use intervals at least once a day to get a handle on how concerned I should be over changes in data series and I hope you find value in adding this technique to your analyst toolbox.

Download the sample spreadsheet for this post.

Prioritizing Using Exponential Decay

I had an officer ask the other day if I could identify the top 10 subjects in our records management system who were involved in mental health occurrences. On the surface this is a straightforward request: identify the people with the most activity. But it gets trickier when you factor in the passage of time. For example, is someone who had 3 incidents over 6 months ago more of a concern than someone who had 2 incidents in the last 8 weeks? Clearly recency should factor into how a person is prioritized, but history should also weigh in. To address questions like this we can use an exponential decay equation in Excel to preferentially weight recent incidents while still giving some weight to the past. In this post I will outline how to apply the method.

The first step is to obtain some data. Depending on the nature of the incident I will set an arbitrary historical cut-off; in this example we’ll set it at one year. This then means that we need to extract every mental health occurrence over the last year along with the date of the occurrence and a unique person identifier.

exponential_1

For each occurrence we then calculate the age of the incident in days by subtracting the date of the occurrence from today. This is done using Excel’s NOW() function to get the current date and the INT() function to trim off the time, since we are working strictly with days. The following screen shot shows the formula in the function bar.

exponential_2

Next we want to create a column for the exponential decay formula. Most people are familiar with the idea of exponential decay: it is a mathematical relationship where a quantity decreases at a rate proportional to its current value. In practice that means the value drops off quickly at first and then more slowly as the value shrinks. The chart below shows a graph of exponential decay for a number of different constants. As can be seen in the chart, we can tune how fast the values drop off by adjusting the constant.

640px-plot-exponential-decay-svg

In Excel, the formula for exponential decay is =EXP(-constant*time) where we get to specify the constant and where time, in our case, is the age of the incident. Since we’re dealing with days and we only have a year of data, we can decide how much we want past incidents to influence a person’s priority. For example, if we say that a mental health incident that happened today is worth 1 point, how much is an incident that happened 6 months ago worth? If we say that it’s worth 0.5 points then we can calculate our constant as -1*LN(0.5)/180, where LN() is Excel’s natural logarithm function, 0.5 is the point value we want a 6-month-old incident to have and 180 is the (approximate) number of days in 6 months. This calculation results in a value of 0.003851 which we can use as our constant.

exponential_3
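
As a quick sanity check on that arithmetic, here is the same calculation in Python (math.log, like Excel's LN, is the natural logarithm):

```python
import math

half_value, half_age = 0.5, 180            # worth 0.5 points at roughly 6 months old
k = -math.log(half_value) / half_age       # same as Excel's -1*LN(0.5)/180
print(round(k, 6))                         # 0.003851
print(math.exp(-k * half_age))             # 0.5 -- the 6-month-old incident
print(math.exp(-k * 0))                    # 1.0 -- an incident from today
```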

Back in our data worksheet we can finally create our new column that calculates the decay score for each incident based on its age. We use the exponential decay formula =EXP(-1*0.003851*age) to calculate the decay score for each incident as in the following screen shot. Note that the decay score is always greater than 0 but never more than 1: an incident that happened today scores exactly 1, and the score shrinks toward 0 as the incident ages.

exponential_4

Now that we have a decay score for each occurrence we need to create a pivot table of all the data. Highlight everything and select ‘PivotTable’ from the ‘Insert’ tab. This will open a modal window and you can just hit the ‘OK’ button to create your pivot worksheet.

exponential_5

On the pivot table worksheet drag the Identifier field to the Row Labels box and the Decay Score field to the Values box. Click on ‘Count of Decay Score’ in the Values box and select the ‘Value Field Settings’ option at the bottom of the pop-up window. Change the selected option from Count to Sum and click ‘OK’.

exponential_6

Almost done. Back in your pivot table right-click the Sum of Decay Score column and sort ‘Largest to Smallest’. This will sort the persons by their aggregated decay scores and the people at the top are the highest priority individuals.

exponential_7
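
For anyone who would rather script the whole pipeline than build a pivot table, the pandas sketch below does the same job end to end. The file name and the column names (identifier, occurrence_date) are hypothetical placeholders for whatever your RMS export actually contains.

```python
# Decay-weighted prioritization: score each occurrence by its age, then sum
# the scores per person. File and column names are hypothetical placeholders.
import numpy as np
import pandas as pd

df = pd.read_csv("mental_health_occurrences.csv", parse_dates=["occurrence_date"])

k = -np.log(0.5) / 180                               # 0.5 points at the 6-month mark
age_days = (pd.Timestamp.now().normalize() - df["occurrence_date"]).dt.days
df["decay_score"] = np.exp(-k * age_days)

# The pivot-table step: sum decay scores per person and sort largest to smallest.
top10 = (df.groupby("identifier")["decay_score"]
           .sum()
           .sort_values(ascending=False)
           .head(10))
print(top10)
```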

This method works because the exponential function embodies the principles we were originally trying to capture: it gives more weight (a higher score) to recent incidents but it also gives some score to historic incidents as well. For every occurrence a person is involved with they have the potential to add up to 1 point to their total score (if the incident happened just today) but as the incidents age they influence the total score less. In our example above, the person with the 3 incidents over six months ago has, at most, a score of 1.5 (since 0.5 is the 6-month mark) while our 2-incidents-in-the-last-8-weeks person is likely to have a score a little less than 2. So this system weights these two people comparably by balancing the influence of recency and history. The best part is that we can tune how much weight we give to the two factors by adjusting the exponential decay constant. If we decide that an incident from 6 months ago is only worth 0.1 points rather than 0.5, which greatly privileges recency over history, we can adjust the constant and rescore everyone. On the other hand, if we wanted to make a 6-month-old incident worth 0.9 points, we could recalculate the scores and privilege history over recency.

I like this method of prioritizing because it’s simple to implement in Excel (no plug-ins!), it’s logically justifiable and it allows the impact of recency and history to vary according to the expertise of the analyst. I hope you find it useful and good luck prioritizing!

Performing a Statistical T-Test in Excel

The value of the t-Test is that it can tell you whether a difference you are seeing between two measurements is legitimate or whether it is likely just the product of randomness. In this blog post I am going to use the t-Test to determine if the number of daily motor vehicle collisions during the winter is meaningfully different than the number during the summer. While this is kind of a no-brainer example it will help illustrate the value of the t-Test and give you some ideas on how you might apply it to give some statistical rigour to your analytical findings.

Before we get into the test let me provide some background to help understand what it is we’re trying to do. In statistics we are always comparing two hypotheses. The first is named the null hypothesis and the second is named the alternate hypothesis. We almost always want to reject the null hypothesis and accept the alternate one because in that situation it means something interesting has occurred. For our motor vehicle collision example we want to determine if there is a meaningful difference in the number of daily motor vehicle collisions during the first quarter of the year (i.e. the winter) versus the third quarter of the year (i.e. the summer). Now, you may be thinking “obviously there’s a difference, the weather”, but let’s, for the sake of education, assume that we need to demonstrate statistically that there is a difference. In this study the null hypothesis would be that “there is no difference between the quarters, the number of collisions is the same” while the alternate hypothesis would be “yes, there is a statistically significant difference between the two quarters”.

The term “statistically significant” needs some explanation. When we want something to be statistically significant we want to be able to say, with a particular level of confidence, that the results we are seeing are not just due to chance. First we pick our “confidence level”, or how sure we want to be, and then, through the magic of statistics, we are given a number that our results have to beat in order for us to be that sure. In this example we’re going with the 95% confidence level, which means that if we find a difference between the average number of MVCs during the winter and summer we want to be 95% sure it’s not due to chance. Or, in other words, we are willing to wrongly reject the null hypothesis only 5% of the time. We can up the confidence level to 99% if we want but, as we’ll see, we’ll need to meet even stricter conditions. Whatever we choose, just remember that the “significant” part in statistically significant doesn’t mean “important” (as significant typically means in everyday conversation); it just means “not due to randomness”.

How do we go about demonstrating that the two quarters have real, non-random differences in the number of daily motor vehicle collisions? The first thing we do is randomly select 30 different days’ worth of MVC counts from each of the first and third quarters. The screen grab below shows my randomly selected MVC data.

 ttest_0

Notice that there are two columns of data, one marked Winter and one marked Summer, and each has 30 entries. Also notice that at the bottom of each column is a number labeled “Mean”. In those cells I have used Excel’s AVERAGE function to find the mean of all the numbers (add them all up, divide by 30). And behold, the means are different. That proves the quarters are different, right? Not quite. We still have to deal with the peskiness that is statistical significance. It’s possible, after all, that it’s just through chance those means are different. As discussed above we want to be 95% sure that they aren’t.

To prolong your agony I’m actually going to show you two ways to perform the t-Test: the (relatively) quick way using Microsoft’s Data Analysis ToolPak Add-in and the longer, manual way that uses built in Excel functions.

First, using the Add-In. 

The Analysis ToolPak is a free Add-In provided by Microsoft as part of the default installation of Excel. Add-Ins are a kind of Excel file, typically with a .xlam extension, that package together a bunch of related functionality. To get started with the Analysis ToolPak Add-In we need to check if it is installed. You may or may not have it already installed but to check if you do you click the “Data” tab in Excel and look at the far right. If you see “Data Analysis” as an option, it’s already active (see screen shot).

advanced_1

If you don’t see it that means we need to turn it on. First, click the green “File” tab (or the Excel jewel if you’re in 2007) to call up the Save/Open/Close menu. Look for the option named “Options” and click it. This will open up a window with a menu on the left. Click on the menu option named “Add-Ins” on the left-hand side. The screen will change and at the bottom, beside the word “Manage”, will be a select box and a button labeled “Go”. Click the “Go” button to open yet another window and you will see a series of checkboxes. Fingers crossed that one of them says “Analysis ToolPak” and is unchecked. Click the checkbox and click OK. Now go back to the Data tab, look at the far right and you’ll see “Data Analysis”. You’re set.

Now that the ToolPak is installed, click the “Data Analysis” button and a modal window will open up. Click “t-Test: Two-Sample Assuming Unequal Variances” in the list and click the “OK” button; this will open up a new window (check the screen shot below) with some options. For the “Variable 1 Range” click the little arrow button and select all of the “Winter” data, which is in B2:B31. For “Variable 2 Range” select all the “Summer” data from C2:C31. For the “Hypothesized Mean Difference” enter 0 (since the null hypothesis is that the means are the same, the hypothesized difference is 0) and everything else can stay the same. However, notice that there is a field named “Alpha” that has the value 0.05. It’s not a coincidence that 0.05 = 1 – 0.95. Alpha is another way of asking how sure we want to be.

ttest_1

Click the “OK” button and Excel will open up a new worksheet populated with a bunch of labels and values just like the one in the screen shot below.

t-test-new-1

So what are we looking at? Well, there are the means we saw. Next come the variances, which are measures of how spread out the values are, and they are quite different (which is why we needed the test that assumes unequal variances). A little further down we see our “t Stat”, which has a value of about 4.35. Skip down two lines to the value labeled “t Critical one-tail” and note that it is about 1.68. That’s good news for us, because the rule is that if your “t Stat” is larger than your “t Critical” value then the null hypothesis can be rejected and the results are statistically significant. This means that we can now state, with the power of math backing us up, that we are 95% sure that the daily average number of MVCs in the winter is greater than the daily average number of MVCs in the summer.
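
For what it is worth, the same test is a couple of lines outside Excel. The sketch below uses synthetic samples in place of the 30 winter and 30 summer counts from the spreadsheet, so its numbers will differ from the screen shots; equal_var=False is the piece that corresponds to “Assuming Unequal Variances”.

```python
# Welch's (unequal-variance) t-test with scipy; the samples are synthetic
# stand-ins for the randomly selected daily MVC counts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
winter = rng.poisson(lam=13, size=30)        # stand-in winter daily counts
summer = rng.poisson(lam=8, size=30)         # stand-in summer daily counts

t_stat, p_two_tailed = stats.ttest_ind(winter, summer, equal_var=False)
# Halving the two-tailed p-value gives the one-tailed result when the
# difference points in the expected direction (winter mean > summer mean).
print(f"t = {t_stat:.2f}, one-tailed p = {p_two_tailed / 2:.4f}")
```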

Second, the manual way. 

So we used a built-in tool to do a bunch of heavy lifting for us (as tools should) and we read an answer off a table. It works, but how did it work? How can we calculate these numbers if the ToolPak isn’t around but we still, for some reason, have access to Excel?

First, have a look at this screen shot because it will help to explain each of the steps I used to perform a manual unequal variance t-Test.

t-test-new-2

On the left are the two original columns of data, one for winter and one for summer. Notice that there are 30 entries for each season but that I used “freeze panes” so that I can show the bottom of the data (starting in row 32). For both seasons I calculated three additional variables: the mean (using Excel’s AVERAGE function), the standard deviation (using Excel’s STDEV function) and the count of the number of data points (using Excel’s COUNT function). These will all be needed in the coming calculations.

On the right hand side of the screen shot I produced my own little table and I’m going to walk through each of these variables and explain the thinking behind them.

The first needed variable is the Sum of the Squared Deviations, for which I used Excel’s DEVSQ function as a shortcut. Think of the variable like this: you have a mean value for a season and you have 30 different data points that are either going to be a little bit more or a little bit less than that mean (as data points usually are). For each data point, take the difference between its value and the mean and square it. Now sum all those squared differences up and that’s the sum of the squared deviations (or differences). Notice that I added the sums of squared deviations for the winter and summer seasons together into a single value.

Why did we need the sum of the squared deviations? Because we want to calculate the pooled sample variance. Why pooled? Because we put both the winter and the summer values together. But what’s variance? The average of the squared differences from the mean. This is easy to calculate: just divide the sum of the squared deviations by the number of points we have in our two seasons, minus 2 (1 for each season), to account for Bessel’s correction, which corrects for the bias in the estimation of the population variance (just trust me, subtract 2, Excel does).

The next variable is the Standard Error of the Difference in Means and to calculate it we take the square root of the pooled sample variance times the sum of one over the counts for the winter and summer data points (i.e. 1/30 + 1/30). But what is the standard error of the difference in means? It’s the size of difference we would expect to see between the means of two samples of this size just by chance. We’re trying to determine, after all, if two means are significantly different from one another and, given the size of the two sample groups, we should expect some difference. Now, imagine that we run this little study over and over again with other, randomly selected groups of MVCs; each pair would have its own difference between the means. That sounds like a lot of work, so instead of running the study over and over again we can use the formula for the standard error of the difference to answer the question: “what’s the expected size of that difference?”

But what do we need that for? To calculate our t-statistic. And to perform that calculation we just divide the actual difference in the means (winter mean minus summer mean) by the standard error of difference in means to end up with approximately 4.35, just like the ToolPak did.

But we’re not done, we still need to determine our critical t-score at the 95% confidence interval. To do that we cheat a bit and use Excel’s built in TINV function. The first input is the probability that we want to test, in this case it’s 0.05 (i.e. 1-0.95) or the alpha from the ToolPak. The eagle-eyed will note that in the calculation in the screen shot 0.05 is actually multiplied by 2 and this is because we’re interested in replicating the “one-tailed” result from the ToolPak and since TINV is “two-tailed” we need to double up the probability to account for that.

TINV also takes a second parameter, degrees of freedom, and usually degrees of freedom is pretty straightforward (it’s just the sum of the two counts, minus 2, like for the pooled variance) except when we’re dealing with an unequal variance scenario. In order to handle this scenario we need to break out the more elaborate calculation defined by Moser and Stevens 1992 and spelled out in the ugly Excel formula:

=(1/B34 + F8/C34)^2/(1/(B34^2*(B34-1)) + F8^2/(C34^2*(C34-1)))

Where the value in cell F8 is calculated as =C32^2/B32^2 or the ratio of the squared standard deviations for each season. God, it’s a mess, I know. But just dissect the formula and you’ll see that it’s just a complicated formula using the counts for the two seasons in multiple places. One thing you may notice is that the resulting degrees of freedom is not an integer and that we have the approximate value 48.58. The ToolPak results above have a degrees of freedom of 49. This is because Microsoft rounds. You can choose to round or not, the results are pretty much the same.

With degrees of freedom we can finally run the TINV function and when we do we get a critical t-stat of 1.68, just like the ToolPak (although the ToolPak number is slightly different further into the decimals because I didn’t round my degrees of freedom). And just like with the ToolPak, our t-stat is way bigger than that critical t-value so we can reject the null hypothesis and assume that we have statistically significant results at the 95% confidence level. Yes!
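
If you want the manual arithmetic laid out in script form rather than spread across cells, here is a sketch of it with the same synthetic stand-in samples. One quiet reason the two approaches agree, by the way, is that with two equal-sized samples the pooled-variance t statistic and the unequal-variance (Welch) t statistic work out to exactly the same number.

```python
# The manual t-test pieces: summed squared deviations, pooled variance,
# standard error of the difference, t statistic, Welch-Satterthwaite degrees
# of freedom and the one-tailed critical value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
winter = rng.poisson(lam=13, size=30)            # synthetic stand-in data
summer = rng.poisson(lam=8, size=30)
n1, n2 = len(winter), len(summer)

devsq = np.sum((winter - winter.mean())**2) + np.sum((summer - summer.mean())**2)
pooled_var = devsq / (n1 + n2 - 2)               # pooled sample variance
se_diff = np.sqrt(pooled_var * (1/n1 + 1/n2))    # standard error of the difference in means

t_stat = (winter.mean() - summer.mean()) / se_diff

# Degrees of freedom for the unequal-variance case (Welch-Satterthwaite).
v1, v2 = winter.var(ddof=1) / n1, summer.var(ddof=1) / n2
df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

t_critical = stats.t.ppf(0.95, df)               # one-tailed critical value at 95% confidence
print(f"t = {t_stat:.2f}, critical t = {t_critical:.2f}, df = {df:.2f}")
```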

And with that we are finally done with our odyssey of manually calculating the unequal-variance t-test. We’ve calculated the same results as the ToolPak and we got right into the nitty-gritty of it. Well done.

Conclusion

If you made it this far (and didn’t just skip the manual section) then bravo, that was a real grind. Either way, I hope you learned something about the t-statistic and how it can be used to add some rigour to pronouncements about means. I do, somewhat unfortunately, find that a lot of the time the t-stat plays spoiler in that decision makers see a difference in averages and conclude that something is working when the reality is that, statistically speaking, there is no significant difference. Nevertheless, it’s important to be able to back up your analysis with some math and t-stats are a great way to provide that oomph. Good luck.

 

Understanding the Difference Between Mean and Median

As analysts we are often interested in averages. But an average is a tricky thing because the word average actually covers several related concepts. In this blog post I am going to discuss two different concepts of average – the Mean and the Median – and why it’s crucial to understand their differences.

The Mean is what we typically mean when we use the term average. The Mean is calculated by adding everything up and dividing by the number of items. The resulting value is supposed to represent the centre of all the values. People intuitively understand the Mean because a lifetime of being exposed to the concept of the bell curve (or the normal distribution as it is called in math circles) has taught them that the big peak in the middle of the curve is the average.

The Median is less well known but just as easy to conceptualize. Consider all of the items in your collection and sort them from smallest to largest. The Median is then the value of the item that sits in the middle of the sorted list. In other words, it’s the value of the item that splits the collection in half with an equal number of items above and below it.

When dealing with data that follows the bell curve, that is, data that produces a symmetric and unimodal (single peak) distribution, the Mean and the Median have the same value and we don’t have any problems. The issue is that a lot of the data we’re likely to encounter in our day-to-day work, such as call response times, time spent on patrol, or man-hours spent on an occurrence, is not accurately represented by a symmetric bell curve but instead follows a skewed distribution that is weighted more to one side than the other.

But what does this mean for averages? It means that for data that is positively skewed (the bulk of the data is on the left side of the chart) the Mean is going to be larger than the Median, and vice versa for negatively skewed data. The problem arises when the difference between the Mean and Median gets large enough to meaningfully change the stat that you are reporting.

For example, the number of man-hours spent on an occurrence is the kind of data that is going to be positively skewed. This is because of the nature of the data: 0 is the lowest value that can exist but the upper limit is not bounded. If you create a histogram of the data you’re likely to find a whole lot of occurrences with a small to medium amount of time being spent (the peak on the left) but also a few occurrences with a lot of time being spent (the long tail on the right). I’ve created the following simplified data set below to illustrate the scenario.

You can see in the screen shot that the occurrence that took 24 hours is clearly an extreme example but it has a significant impact on the Mean. If a Commander came to you and asked for the average number of man-hours spent on occurrences do you feel that the Mean of 8.8 would be representative, considering that only one occurrence actually took longer than that? Clearly the Median value of 5 hours is more representative of man-hours and is a more accurate average.

This result comes about because of the way the two statistics are calculated: the Mean is sensitive to the inclusion of extreme values, while the Median, which only concerns itself with the middle value of the data set, is relatively indifferent to them and is therefore more likely to provide an appropriate number that is stable in the face of outliers.

The screen shot also conveniently illustrates the function calls used to calculate both the Mean and Median in Excel. Frustratingly, Microsoft has opted to use the function name AVERAGE for the Mean, further cementing the erroneous idea that the two terms are synonyms. Thankfully the Median is a straightforward call to the MEDIAN function.

You’ll notice a third function in the list: SKEW. I discussed positive and negative skewness above, and the SKEW function is useful for coming to grips with the direction and magnitude of a data set’s skewness. In the occurrence man-hours example I state that the data is positively skewed, and that’s reflected in the positive (greater than 0) value returned by the SKEW function, which indicates that the Mean is likely larger than the Median. Conversely, if SKEW returns a value that is less than 0 you’ll know that the data is negatively skewed and that the Mean is likely smaller than the Median. Finally, if the value is close to 0 your data set likely follows the bell curve, or normal distribution, and the Mean and Median will be nearly the same.
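
Here is a tiny illustration you can run yourself. The man-hour values are hypothetical, chosen only to echo the numbers quoted above (one 24-hour outlier, a Mean of 8.8 and a Median of 5); pandas' mean, median and skew stand in for AVERAGE, MEDIAN and SKEW.

```python
# Hypothetical man-hours per occurrence, with one extreme 24-hour case.
import pandas as pd

hours = pd.Series([4, 5, 5, 6, 24])

print(hours.mean())     # 8.8  -- pulled upward by the outlier
print(hours.median())   # 5.0  -- the middle value, indifferent to the outlier
print(hours.skew())     # positive, so the Mean sits above the Median
```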

The key takeaway from this post is that it is important to not just blindly take the average (by which I mean the Mean) of a stat without first considering how the data is distributed. So much law enforcement data is skewed one way or the other that it makes sense to take a minute and run the MEDIAN and SKEW functions in Excel to make sure you’re reporting the most appropriate number to your superiors.

Cumulative Distribution Function Charts in Excel

Most analysts know what a histogram is: it’s a chart where each piece of data is put into one of several bins and then the bin counts are plotted in order to gain insight into how the data is distributed. Histograms have an intuitive appeal because it’s easy to see how the data is distributed just by looking at the chart. Of course, they aren’t perfect. The whole binning process is pretty subjective, and depending on how the bins are selected the results can look quite different, which can be deceptive. If only there were an alternative that could address these shortcomings…

Well, there is: histograms have an underutilized sibling known as the Cumulative Distribution Function (CDF) chart, which complements and, I think, improves on the histogram by doing away with some of the histogram’s problems and offering some useful innovations of its own. Namely, the ability to read off the chart what percentage of the data falls between two points, something that is not possible with a histogram. In this post I will discuss why I think you should make the CDF part of your analytical toolbox and also demonstrate how easy it is to create a CDF in Excel.

Convince me

First, two example charts that describe my dataset: a completely made up list of 30,000 calls-for-service and how long they last from initial call to the closing off of the incident. The first chart is a standard histogram and the second is the corresponding CDF. For both charts the horizontal axis is the same: it’s hours. For the histogram the vertical axis is a count of how many calls are in each bin while the vertical axis of the CDF is a measure of percentage (why percentage will become clear in a bit).

cdf_1

So the histogram is the same old boring story. We have a roughly bell-shaped distribution that is positively skewed (the bulk of the calls sit on the left, with a tail stretching off to the right). Looks like most calls take about five hours and only a few go above 12 hours. Yawn.

cdf_2

Ok, the CDF looks great. But what’s it good for? What can it tell us that the histogram can’t? Remember that I mentioned you can use a CDF to determine what percentage of the data falls between two points? Let’s explore that now.

Suppose I want to know how many calls take less than 5 hours (that is, between 0 and 5 hours). I can read that right off the graph by going across the horizontal axis until I reach 5 and then going up until I hit the trend line. I then read across to the vertical axis and see that the value is about 60%. What this means is that 60% of all calls take 5 hours or less. That’s a much firmer number than “most calls take about 5 hours”. Similarly, what if the question was how many calls take between 5 and 10 hours? First read off the 10-hour value, which is about 99%, and since we know the 5-hour value is 60%, just take the difference, 99 - 60, to arrive at roughly 39% of all calls taking between 5 and 10 hours. Let’s see you do that with a histogram (hint: you can’t do it).

I’m convinced. How do I make one in Excel?

Creating a CDF is very straightforward and the following five steps describe the entire process. I’ve included a quick Excel screen grab to illustrate what the first few rows of the data will look like when you’re done (it’ll make more sense once you’ve worked through the steps).

cdf_3

Step One: Start with your data in Column A in Excel sorted from smallest value to largest. My data had a lot of zero values, so that is why the column starts off with a series of zeros.

Step Two: In Column C (yes I intentionally skipped B) put the number 1 in C1, 2 in C2, 3 in C3, etc.  Here’s a useful hint to speed this up: use =C1+1 for C2 and fill down.

Step Three: In B1 enter how many data points you have. I had 30,000 so I dropped that number in there.

Step Four: For B2 use the formula =C2/$B$1 and fill down. The idea is that you want to divide the value in Column C by the total count. When you get to the bottom of your data the last value should be 1.

Step Five: Now that the data has been prepped, highlight all of columns A and B and insert a new Scatter chart. What you’ll get is the monotonically increasing (i.e. always going up) trend line. You can edit the styling so that the graph is a line rather than points and make whatever other changes it needs to look pretty. I recommend fixing the vertical axis maximum at 1 (since the data can never go above 100%) and the horizontal axis minimum at 0, if that’s appropriate. Voila, now you have a CDF chart.
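
If you ever want the same chart outside Excel, the five steps collapse into a few lines of Python. The call durations below are randomly generated stand-ins for the 30,000 made-up calls.

```python
# Build a CDF: sort the values, pair each with its cumulative fraction, plot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
call_hours = rng.gamma(shape=4, scale=1.25, size=30_000)   # synthetic durations

x = np.sort(call_hours)                     # Step One: sort ascending
y = np.arange(1, len(x) + 1) / len(x)       # Steps Two to Four: rank / total count

plt.plot(x, y)                              # Step Five: the CDF line
plt.xlabel("Hours")
plt.ylabel("Fraction of calls")
plt.ylim(0, 1)
plt.xlim(left=0)
plt.show()

# Reading off the chart programmatically: the share of calls at 5 hours or less.
print((call_hours <= 5).mean())
```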

But how do CDFs address the histogram’s shortcomings? Did you notice any bins during the calculation? Nope. CDFs use the value of every data point and therefore do away with binning and the loss of information binning introduces. And because there are no bins the shape of the CDF can’t change like it can in a histogram so that source of subjectivity is neatly side-stepped. And you want to know what else is great? It’s easy to compare two data sets with CDFs by simply plotting them on the same chart. Comparing two histograms on the same plot is not particularly enlightening besides noting how the peaks compare.

In conclusion, CDFs are like grilled cheese: great and easy. And like grilled cheese, everyone should try one at lunch (or whenever you next do a frequency analysis). The CDF is a great complement to traditional histogram analysis and will hopefully become a valuable tool in your graphic analysis toolbox.

An olive branch to histograms

Making a histogram is actually more work than creating a CDF. But if you really want to create one you can either use Microsoft’s Analysis ToolPak (what happened to the ‘c’ MS?) or perform some more elaborate Excel JuJu. Good luck.

Being Careful with Averages

In crime analysis we are used to calculating averages and when I say average I mean the “mean”—e.g. add up ten numbers and divide by ten—that kind of average. But in crime analysis we are also frequently interested in examining rates or ratios. For example: how many hours of patrol time are spent responding to calls. When we combine the two there is the potential to be tripped up since, if we are talking ratios, we need to be very careful about how we calculate an average. Depending on the approach we take we’ll get different numbers.

An example will help illustrate the problem.

Let’s say that your CAD system keeps track of how many hours an officer spends on patrol and, of that time, how much of it is spent responding to calls. The following table illustrates a small sample of this kind of data.

mean_1

From the data in the table I calculated a ‘busy-ness’ ratio for each officer; this is Call Time divided by Patrol Time. From those values I in turn calculated the average Busy-ness, which is 0.51, or 51% busy. But wait, let’s add up all the Patrol Time (78.6 hours) and all the Call Time (40.6 hours) and find the ratio. (40.6/78.6)=0.52 or 52% busy. These numbers are different because of the mathematical principle that says that the ratio of averages (0.52) does not necessarily equal the average of ratios (0.51). But which one is ‘correct’? Well, they both are, and the one you go with all depends on what question you are trying to answer.

For the ratio of averages (0.52) we are asking a question about the entire police service: “how much patrol time did we spend on calls?” For this question 52% is the correct answer because it takes into account the total time spent on patrol and the total time spent on calls. If an officer spends 3 hours on patrol or 12 hours, all of that is taken into account in the final number.

The average of ratios (0.51) is trickier, as it answers a question about individual officers going on patrol: “how much of their patrol can an officer expect to spend on calls?” The difference arises because the average of ratios treats every shift as having equal weight, which is the right assumption when the question is about what a typical officer can expect. In other words, the length of the shift doesn’t matter, only how much of the shift was spent on calls.
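
A tiny script makes the distinction concrete. The shift data below is made up; the two print-outs answer the two different questions.

```python
# Hypothetical shifts: patrol hours and, of those, hours spent on calls.
patrol_hours = [12.0, 10.0, 3.0, 11.0]
call_hours = [7.0, 6.0, 1.0, 6.5]

# Ratio of averages: the service-wide share of patrol time spent on calls.
ratio_of_averages = sum(call_hours) / sum(patrol_hours)

# Average of ratios: what a single shift can expect, every shift weighted equally.
average_of_ratios = sum(c / p for c, p in zip(call_hours, patrol_hours)) / len(patrol_hours)

print(round(ratio_of_averages, 3), round(average_of_ratios, 3))   # two different answers
```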

But why am I telling you this since the difference seems to be pretty small? It’s because there are situations where the difference can be significant. For example, look at the modified table below where by adjusting the patrol times I was able to increase the difference between the averages to 4%. Making a 4% mistake on something like officer busy-ness can be the difference between hiring more officers and not. That’s not a mistake I would want hanging around my neck.

mean_2

The takeaway from this post is to be mindful when taking averages, because so many of the numbers crime analysts look at are ratios. Whether we’re looking at officer busy-ness or per capita car thefts or tickets issued per traffic blitz, it is important to understand the question that the average is supposed to answer.

Bonus commentary for pedants

But wait, you might be saying, how is the 0.52 number a ratio of averages? No average was even calculated; we just summed the numbers and divided the Call Time by the Patrol Time. That’s a ratio of sums. Yes, that’s true, but if you had wanted to be a real stickler you could have found the average Patrol Time (78.6/8 = 9.825) and the average Call Time (40.6/8 = 5.075), divided those numbers and arrived at the same answer (5.075/9.825 = 0.52). It works this way because the divisor in both averages is the same (it’s 8, representing the 8 shifts included in the calculation) and therefore it cancels out when the ratio is taken.