Quick and Easy Year-To-Date Visualization with Stacked Column Charts

A common request from customers is for historic crime trends. They usually want to know something like “How many break and enters have we had in District Y since 2010?” and what they’re really after is a sense of how things are moving: are incidents trending up, meaning they should start a project, or trending down, meaning they can calm the nerves of a local council person.

Getting the data is usually straightforward and the most helpful presentation of the numbers is a chart, but there’s always the wrinkle of how to show this year’s data compared to previous years. Unless the customer is asking for the data on the day after New Year’s, the count for the current year is going to be Year-to-Date (YTD) only, and comparing these YTD numbers to previous full years is a bit complicated, especially once the chart leaves your hands and they start sharing it around. Having run into this request many times I’ve developed the following visualization method that uses a stacked column chart to present YTD data in an intuitive, readable format.

In the example spreadsheet I’ve created for this post I’ve generated some random reported dates dating back to 2007. In columns B and C I’ve broken out the reported year and month for each incident using Excel’s built-in YEAR() and MONTH() functions. Real-life data will likely have a lot more columns but these are the only ones you really need to make this technique work.

ytd_1
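In the screenshot the helper formulas in B2 and C2 are just the date functions applied to the reported date in column A, filled down the length of the data:

    =YEAR(A2)
    =MONTH(A2)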

Next I create a pivot table with the data and I drag the “Reported” field from column A into the Values area and the “Reported Year” field into the Rows area. This creates a standard count-by-year table that I can copy over to another worksheet to prepare my chart.

ytd_2

These counts cover full years, but because the data I am working with was generated in August of 2016 I want to pull out just the January through August portion of the previous years for the YTD comparison. To do that I drag the “Reported Month” field into the Filters area and select only months 1 through 8 in the Reported Month filter.

ytd_3

The resulting list of values represents the YTD values for each year and I copy those over to my working sheet. I organize the data in a table with columns for the year, the total for the year and the January through August values. I then create one additional column labelled “Sep – Dec” and subtract the “Jan – Aug” count from the annual total to fill this column. The final table looks like this:

ytd_4
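As a sketch, if the copied table has the year in column A, the annual total in column B and the “Jan – Aug” count in column C starting in row 2 (an assumption, your layout may differ), the “Sep – Dec” column is just a subtraction filled down:

    =B2-C2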

With the data in place we can now chart. Select the last two columns of data and then insert a stacked column chart. You may need to fiddle a bit with the settings (you’ll want to add the year labels in for each column) but what you’ll see is a chart that simultaneously visualizes both the YTD values and the overall annual values.

ytd_5

In my sample chart I’ve selected strongly contrasting colours to aid the YTD comparison. By looking at just the blue portions of the columns it’s possible to see that 2016 has been a good year so far for this kind of incident. We’re down significantly from last year at this time and actually at the lowest YTD level we’ve seen since 2009. If we consider the orange portions we can see that, aside from 2013 and 2015, the number of incidents has been fairly steady and that 2016 is likely going to come in a little below average.

Charting historic crime volume is an everyday request for most analysts. By leveraging the convenience of stacked column charts you can create a single visual that communicates both annual and YTD trends.

Download the sample spreadsheet for this post.

Halton Regional Police Service wins 2016 ComputerWorld Data+ Editors’ Choice Award

In the September issue of ComputerWorld the Halton Regional Police Service was awarded a 2016 Data+ Editors’ Choice Award for their innovative use of data and analytics to improve the effectiveness and efficiency of the police service.

For all the talk of big data and technology, I think this is the most important insight for organizations looking to adopt data analytics:

“Unless people can believe in the product, the solving of the technical challenges doesn’t mean much.”

Read the article: http://www.computerworld.com/article/3117301/data-analytics/halton-regional-police-service.html

Comparing Event Counts with the Poisson Means Test

An exceedingly common task for crime analysts is to count the number of events that occur in one time period (e.g. day, week, month, YTD) and compare them to a prior time period. The change between these two numbers is often the source of much strategy and project-making in law enforcement. What does it mean that we had more motor vehicle collisions this week compared to last? What should we do about the increased number of mental health apprehensions YTD? What are we doing right to make robberies go down this month? These types of questions, and the actions that are taken from the discussions they generate, are the daily bread of law enforcement. Unfortunately, one of the things that is seldom done is to check if the change in the event counts is statistically significant. That is, is there really a difference between these numbers or can we just chalk it up to randomness?

This is where the Poisson Means Test comes into play. It is a statistical test that can be used to determine if the difference between two counts is statistically significant or not. In prior posts I have written about statistical significance being about deciding between two hypotheses. The first (or null) hypothesis says that there is no difference between the two sets of events. The second hypothesis (the exciting one) says that, yes, these numbers are different enough that we would only expect to see this level of difference by chance, say, 1 out of 20 times (when dealing with a 95% confidence interval). In this case, what we want to be able to do is pick a confidence interval (I’m sticking with 95%), look at two different event counts (let’s say robberies YTD this year and last) and say ‘yes, something is meaningfully different about these two numbers’, according to the statistical test. And we want to be able to do this in Excel.

But before we get into Excel a little background on Poisson. A Poisson process is a process that generates independent events over a period of time. Because the events are independent the time between consecutive events is considered random. It’s a perfect process for modelling things like criminal occurrences because notionally they are unrelated and we can’t predict with specificity when the phone is next going to ring. I’m not going to get into any math in this post as it’s not super important for understanding how to use the test so just know that Poisson is about counting the number of times something happens in a time period.

In order to compare two sets of events we need to define a couple of terms. The first set of events has a total count C1 that occurred over time interval T1. The count C1 is the total number of events but there is some flexibility in defining the time interval, and that flexibility has to do with the other event count you want to compare. In our example I have the number of robberies in Jan-Feb-Mar of 2015 and 2014. I could say that T1=1 quarter, or I could say T1=3 months, or I could say that T1=90 days; it really doesn’t matter because T2, the time for the events counted in C2, is exactly the same, since we’re doing a YTD comparison between two years. However, not all use cases are like that: it is very possible to compare Jan-Feb-Mar 2015 to April 2015, in which case you’d want T1 to be 3 (months) and T2 to be 1 (month). Basically, you pick the largest common time period you have between the two data sets and use that as the basis for T.

Enough chit-chat, an example will make things clearer.

poisson_1

I am sure this looks familiar to everyone: two years’ worth of crime data, the difference in the events and the percent increase. Can you just hear the operations inspector demanding to know why robberies are up 21% and the detectives scrambling for answers? The problem is that we don’t know if that change is because of anything real or is just due to the randomness of the data. But we can use the Poisson Means Test to help us figure it out and in this example, using the nomenclature above, we have C1=44, T1=1 (because it’s one quarter), C2=56 and T2=1. What we need to do next is send those numbers through a function to tell us whether the difference is statistically significant or not. I sure hope Excel has a function for that … but it does not. However, thanks to some talented programmers I found a solution.

I first learned about the Poisson Means Test from the blog of Evan Miller, a programmer and statistician who has developed a great piece of statistical software for Mac named Wizard. Evan has a blog post on the subject of event counting where he explicitly tells readers to never make decisions on plus/minus differences without first checking the Poisson Means Test. To help facilitate this he has created a great online tool for checking two event counts to see if there is a meaningful difference between the counts.

The online tool works great but I need to be able to run these kinds of tests in bulk in Excel (where all crime data resides) and I wanted to see if I could make Excel do the test. Following Evan’s notes in his blog post and peeking under the hood at his tool’s JavaScript I discovered that most of the heavy lifting for the tool is performed by a JavaScript library known as jstat. Jstat provides numerous special functions including the, ahem, ‘Inverse of the Lower Regularized Incomplete Gamma Function’. I took the time to figure out what that means so you don’t have to and what I learned is that Excel doesn’t offer a function to do it. But Excel does offer VBA, and the jstat source code is available on GitHub, so I ported all the JavaScript functions I needed into VBA user defined functions and made them available through the following GIST.

Now, I don’t expect you to understand this code because I don’t understand a lot of it. I know what it is doing but there is a lot of code in here that I suspect comes from some flavor of a Numerical Recipes book. The point is, we don’t need to know how it works, the geniuses at jstat already figured it out, I just need to make sure that it works in VBA and I didn’t mess it up during the translation. I’m pretty confident in my coding after a lot of testing and you can test it if you want by trying Evan’s tool and my Excel macro together and seeing if you get the same results (you will). You can also, if you are adventurous, try out the R library zipfR, which has a function named Rgamma.inv, which does the same thing.
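If you want one more sanity check that doesn’t depend on my port at all, the comparison can also be run through Excel’s own binomial function. To be clear, this is not the jstat code from the GIST above; it is a separate sketch that leans on the standard result that, under the null hypothesis, conditioning on the combined total C1+C2 turns the comparison into a binomial test with p = T1/(T1+T2), and it doubles the smaller tail for a two-sided p-value. It should agree closely with the ported function and with Evan’s online tool.

    ' Rough cross-check of the Poisson Means Test using the conditional
    ' binomial form. Not the ported jstat code; treat it as a sketch.
    ' Requires Excel 2010+ for WorksheetFunction.Binom_Dist.
    Public Function PoissonPValueCheck(C1 As Double, T1 As Double, _
                                       C2 As Double, T2 As Double) As Double
        Dim n As Double, p As Double, tail As Double
        n = C1 + C2                ' combined event count
        p = T1 / (T1 + T2)         ' expected share for C1 under the null
        With Application.WorksheetFunction
            If C1 <= n * p Then
                tail = .Binom_Dist(C1, n, p, True)           ' lower tail
            Else
                tail = 1 - .Binom_Dist(C1 - 1, n, p, True)   ' upper tail
            End If
            PoissonPValueCheck = .Min(1, 2 * tail)           ' two-sided p-value
        End With
    End Function

For the robbery numbers above, =PoissonPValueCheck(44, 1, 56, 1) comes out at roughly 0.27, in line with the ported function.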

So how do we use the VBA to do something useful? Check the screen shot below where I have added a fifth column to my table named P-value. This is your usual statistical p-value that, for a 95% confidence interval, needs to have a value less than 0.05 in order to indicate statistical significance. As you can see in the image, I have called the custom function PoissonPValue from the VBA code and passed in the four values I identified above: C1, T1, C2, T2. The output of the function call is 0.271, which means that, statistically speaking, there is no difference between 44 and 56, so that 21% increase is nothing to get worked up about (certainly not something to change strategy over).

poisson_2
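For reference, the call behind that P-value cell is just the custom function with the four numbers from the table (in practice you would point the arguments at the cells holding these values):

    =PoissonPValue(44, 1, 56, 1)

which evaluates to 0.271 in this example.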

I know that this is a hard truth to swallow because it runs counter to the thinking of most law enforcement personnel. They want to believe that if a number goes up then something is to blame and that something can be handled to make the number go back down. However, from personal experience I can attest that many of the week-to-week and month-to-month differences are not statistically significant. The reality is, randomness happens, and unless we want to be chasing our tails running after statistical ghosts, we would be well advised to concentrate our efforts on differences that the stats indicate are real.

To wrap up, the Poisson Means Test is an incredibly valuable tool that you should put in your toolbox. It’s not just for your weekly/monthly/annual crime report either. You can use it to test for meaningful differences between any two sets of events. So, for example, say someone ran a project over the last two months and wants to know if you can see an impact in the numbers. If you test the events from a period before against the 2-month project period you can see if there is a stat-sig difference between the counts. Now, that’s not to say that the project is responsible but at least you can demonstrate statistically there is a difference. Anecdotally, I have found new applications for the test every week and I like it because it adds a further layer of rigour to our analysis and I think that’s important. I hope you do too.

Revisiting Aoristic Analysis

Attention users of aoristic.policeanalyst.com. I have retired the web tool and replaced it with a downloadable Excel spreadsheet. Discussion and reasoning below.

Back in 2012 I wrote a post about Aoristic Analysis where I introduced a web-based tool for conducting aoristic analysis. The reason I built the tool as a web page was that the code allowed me to split events down to the second so that they could be properly apportioned across time units. The drawback of making a web-based tool is that law enforcement analysts, typically a suspicious bunch, never liked the idea of submitting their data (which is only dates, by the way, and completely context-less) to a web server. Fair enough, we’re a paranoid bunch.

I recently had an email discussion with someone who was using my tool but wished they could have some software to do the work locally. I resolved to create an Excel version of the tool that would apportion events down to the second, just like the web site. I have completed that work and the result is now available for download here: Aoristic_v3.xlsm.

The spreadsheet is a macro-enabled XLSM and that’s because I use VBA to perform the aoristic analysis. There are three custom functions – AoristicHourOfDay, AoristicDayOfWeek and AoristicMonthOfYear – and all the code in the spreadsheet can be seen in the following GitHub GIST:

Understanding the Code

If you open the XLSM file you will notice that there are worksheets with example usage of each of the functions. Since all three functions follow a similar approach I will discuss the logic behind AoristicHourOfDay used on the ‘Hour of Day’ sheet.

aoristic_v1

The ‘Hour of Day’ sheet has in column B the start time for an event and in column C the end time for the event. Following that there are 24 columns, one for each hour of the day, where the AoristicHourOfDay function is called. Looking at the embedded code above you can see on line 23 the function declaration for AoristicHourOfDay and that it accepts a start and end time as well as the particular hour of interest for the calculation.

I will now go through the function line by line to explain the reasoning behind it:

Line 24, 25: The function begins by first determining how many hours and how many seconds have elapsed between the start and end time.

Line 27-30: A check to make sure that the end time is not before the start time. If it is, an error is raised and the function exits.

Line 32-35: Next, the amount of one ‘unit’ is determined by dividing 1 by the number of seconds. Because this code is interested in apportioning an event down to the second, it is necessary to be able to assign units down to the second. For example, say an event was 2 hours long (7200 seconds) then every second is worth 1/7200 of the total event.

Line 37: A variable captures the starting hour based on the event start time.

Line 39-41: The code checks to see if the total number of hours for the event is 1 and the hour passed into the function is the start hour. If both of these conditions are met then the hour should be assigned the entire event (i.e. a value of 1) and the function can safely exit. This is the trivial case.

Line 43-68: If these conditions are not met then the more elaborate assignment method is used. The aoristic variable for this hour is initialized to 0 and a variable for the current hour is initialized to the starting hour for the event. Next, a for loop is initiated that loops from 1 to the total number of hours calculated for the event.

There is a lot going on in this for loop so I will break it down, section by section.

Line 45: Within the loop the code first checks to see if the current hour equals the hour passed into the function. If the hours match it means that we need to assign some portion of the event to the hour.

Line 47-51: Next, the code checks to see if the current hour is equal to the start hour and if it matches the code determines the remaining seconds in the hour and assigns that to the share variable. The reason this is done is because, for the start hour, we only want the balance of the hour assigned and to determine the balance of the hour we need to subtract the current number of minutes and seconds from 3600, the number of seconds in an hour.

Line 53-55: Next the code checks to see if the loop counter is greater than 1 and less than the total number of hours. If this is the case the hour is given 3600 seconds. The reasoning here is that if the hour matches and the event started at an earlier hour and ends at a later hour then the event existed across this particular hour and should therefore be assigned a full 3600 seconds.

Line 57-61: The final if statement checks to see if the loop counter is equal to the total number of hours. This is the same as asking if it matches the last hour and if it does we include only those seconds in the part of the hour before the event ended. This is kind of like the reverse of the reasoning for the first if statement.

Line 62: After each of the if statements are complete the variable for tracking the aoristic total for the hour is incremented by adding to it the value of the share multiplied by the unit calculated above.

Line 64-67: Outside the if statement the current hour variable is incremented and reset to 0 if it passes the 24 hour mark.

One thing people may be curious about is why I did this in a loop instead of just checking with the if statements, and that has to do with the fact that an event can span days and therefore the same hour could be hit again and again. The loop allows for this calculation in a way that the if statements on their own could not.
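To make the walkthrough easier to follow if you are reading this without the embedded GIST handy, here is a simplified sketch of the same logic written from the description above. It is not the code from the GIST (its line numbers will not match and details may differ), so treat it as illustrative only.

    ' Sketch of the hour-of-day apportioning logic described above.
    ' Returns the fraction of the event (0 to 1) that falls in hourOfInterest.
    Public Function AoristicHourOfDaySketch(startTime As Date, endTime As Date, _
                                            hourOfInterest As Integer) As Double
        Dim totalSeconds As Double, totalHours As Long
        Dim unit As Double, share As Double
        Dim startHour As Long, currentHour As Long, i As Long

        totalSeconds = (endTime - startTime) * 24# * 3600#
        If totalSeconds < 0 Then
            Err.Raise vbObjectError + 1, , "End time is before start time"
        End If

        ' number of clock-hour buckets the event touches
        totalHours = DateDiff("h", startTime, endTime) + 1
        startHour = Hour(startTime)

        ' trivial case: the whole event sits inside a single hour
        If totalHours = 1 Then
            If hourOfInterest = startHour Then AoristicHourOfDaySketch = 1#
            Exit Function
        End If

        unit = 1# / totalSeconds   ' each second is worth this share of the event
        currentHour = startHour
        AoristicHourOfDaySketch = 0#

        For i = 1 To totalHours
            If currentHour = hourOfInterest Then
                If i = 1 Then
                    ' starting hour: only the balance of the hour
                    share = 3600 - (Minute(startTime) * 60 + Second(startTime))
                ElseIf i = totalHours Then
                    ' final hour: only the seconds before the event ended
                    share = Minute(endTime) * 60 + Second(endTime)
                Else
                    ' the event spans this entire hour
                    share = 3600
                End If
                AoristicHourOfDaySketch = AoristicHourOfDaySketch + share * unit
            End If
            currentHour = currentHour + 1
            If currentHour > 23 Then currentHour = 0
        Next i
    End Function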

Wrapping Up

The sample spreadsheet has a lot of example scenarios that illustrate different edge cases for the code. The key check on each of them is that the values across all 24 hours should sum to 1. Excel also makes it easy to calculate percentages by hand and compare them to what the code has spit out. Having said that, while I have tested it to my satisfaction there is always the possibility for error. If you come across any odd behaviour or incorrect results please let me know and I will fix it and put a new version up.

I think this Excel method is superior to the web based tool as it allows the same level of detail as the web page plus it allows analysts to look directly under the hood and see how the work is being done. Best of all, you can download the spreadsheet and run everything locally and not have to worry about submitting sensitive data over the web.

One final tip: if, instead of the share of the event, you want to know the total time spent in each hour, create a new column that contains the event duration in seconds and multiply it by each of the portions in each of the hours. This new table will then show how much time was spent in each hour for each event.
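For example, with the start and end times in columns B and C as above, the duration column is just the difference converted to seconds, and each hourly portion then gets multiplied by it (adjust the references to wherever your hourly columns sit):

    =(C2-B2)*24*3600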

Medians and Excel Pivot Tables

One of the issues with pivot tables in Excel 2010 (and earlier versions) is that it isn’t possible to calculate a median using the Value Field Settings. In this post I am going to first extol the virtues of the median (again) and then demonstrate a formula for calculating the median in a cross tabulation.

Medians Again?

Why do we care about the median and why can’t we just use the mean? The mean is the proper measure of central tendency when what we want to average is normally distributed. In those situations, adding everything up and dividing by the number of elements will give the value right at the peak of the distribution. However, when our distribution of values is heavily skewed the mean is less representative than the median. In those cases it makes more sense to take the middle value (i.e. the median) of all the values when they are ordered from smallest to largest.

The classic example of using the median is household income. There are a lot more small income households than large income households and those big households can really distort the mean. For example, let’s say that we have 7 households with incomes (in $10,000s) of 1, 1.4, 1.8, 2, 2.1, 2.4 and 13.2. The mean of those numbers is about 3.4 while the median is 2. Clearly 2 is a better representation of the centre of the data than the mean, which has been dragged higher by the one large income.

A similar issue occurs with any data set that is bounded at zero. For example, say you want to do an analysis to determine how long offenders wait between their arrest and when they are fingerprinted. Many services schedule identification appointments days or even weeks after the arrest and it might be interesting to see what the average time is between arrest and appointment. The shortest possible wait is 0 days, where the fingerprinting is done on the day of the arrest, and in general one would expect a shorter time rather than a longer one. While there may be some 28-day waits we don’t want these outliers to unduly skew our average so we choose to take the median rather than the mean.

An Example

Let’s take this example further by considering a spreadsheet where we have 1000 rows of arrest times and fingerprint times and the duration in days between them.

median_1

The easiest way to determine the median would be to use the formula =MEDIAN(D2:D1000) but that would only give us the median for the entire data set. When analysing our data with pivot tables we are often interested in cross tabulating based on a particular element of the data and in this example we want to see if average wait time is impacted by the day of the week of the arrest.

To begin the analysis we first insert a new pivot table and use it to determine average (by which Excel means mean) wait time for each day of the week.

median_2

Have a look at these mean values, particularly Saturday. What is up with that? I’ll tell you: there are errors in the data set. Someone has keyed the dates in incorrectly and there are outliers that are horribly skewing the results. This is the other benefit of working with the median: medians resist the impact of outrageous outliers in a way that the mean doesn’t. We could go back in and do some data cleansing but let’s leave it for now so we can see how the median will handle it.

If you look at the list of options in the Value Field Settings you’ll see that Median is not an option. Since pivot tables don’t provide the option for determining the median we need to do some copying and pasting and use a formula to make it happen. Whenever I am investigating the median I first copy-and-paste-values the results of the mean pivot table to a new worksheet so that I can use the labels in my function calls. The new worksheet looks like this.

median_3

To calculate the median for each of the days of the week it is necessary to pull only those rows of data in the Data worksheet that match the particular day. Once I have those rows I use the MEDIAN function to calculate the specific median for that day of week. To do this I use the formula in cell D4 below:

median_4

It is important to notice that the formula in the screen shot is surrounded by curly braces. You don’t type the braces in because Excel adds them when you create what’s known as an Array Formula. Array formulas are different than regular Excel formulas because they work with a whole series of data values rather than a single data value. To tell Excel that you intend a formula to be an array formula you must press CTRL SHIFT ENTER (all three keys at once, like CTRL ALT DEL) at the end, rather than just hitting ENTER. You’ll know that you did it right because Excel will add the curly braces.

In this particular array formula what we’re doing is using an IF formula to check every one of the Day of Week values in column A of the Data worksheet to see if it equals the value in B4, which in this case is Sunday. If the value does equal Sunday we return the wait time from column D for that row. The IF statement will check every row and return an array of values (all of the ones for Sunday) to the MEDIAN function, which in turn calculates the median wait time for Sunday. The formula can then be filled down for each of the days just like you would with a regular formula.
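Written out, the formula is along these lines (a sketch; it assumes the source data sits on a sheet named Data with the day of week in column A and the wait time in column D, running down to row 1000, and that the day labels copied from the pivot table are in column B of the summary sheet):

    {=MEDIAN(IF(Data!$A$2:$A$1000=B4,Data!$D$2:$D$1000))}

Remember that you don’t type the braces; they appear when the formula is entered with CTRL SHIFT ENTER.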

median_5

You can see in the screen grab that we have median values for each of the days of the week. Clearly the outlier that impacted Saturday has been handled by using the median, but also note that the other days (except Monday) now have significantly smaller values. This is further evidence that the median is the correct choice for this data, as using the mean, even when the data is clean, would lead to erroneous overestimates.

Conclusion

The bad news is that Excel can’t calculate medians in a pivot table. The good news is that with a simple formula and the labels produced by the pivot table it is possible to calculate the median for different cross tabulations. It does require a little experimentation with array functions but this particular example is pretty tame and serves as a decent introduction to this very powerful technique.

 

Police Analytics in HQ Magazine

I wrote a Police Analytics article for the most recent issue of the OACP’s H.Q. Magazine. I tried to communicate the value of data to police services and how my Service has approached the challenge.

This article was originally published in the WINTER 2014-15 issue of H.Q. Magazine, the official publication of the Ontario Association of Chiefs of Police.

At the beginning of 2014, Halton Regional Police Service (HRPS) created a new unit with a mandate to exploit police data assets to support evidence-based decision making. Combining scientists, programmers and police experts the new unit serves as a clearinghouse for quantitative analysis and report development for customers throughout the Service. From frontline platoons and investigative bureaus to administrative units and the executive branch this unit, named Police Analytics, has changed the way analytic insights are generated and delivered to in-house customers.

Historically, police services have spent considerable resources gathering data; first in the form of paper documents and subsequently in the form of records management systems. The primary function of this data has been to document activity and support investigations but as these data stores have grown in size and complexity the possibility of conducting large-scale analysis to produce efficiencies and insights has been recognized. The challenge is that the data is not in a format conducive to performing analysis but if this challenge could be overcome then tremendous value would be realized.

Business Intelligence and Police Analytics

Recognizing this opportunity, HRPS chose to invest in a Business Intelligence (BI) software system to better exploit these data assets. Business Intelligence is a combination of hardware and software designed to transform existing data resources into a form that is better suited to reporting and analysis. The work is highly technical but the end result is a single database that contains the entirety of a Service’s data. The HRPS IT department, working closely with our BI vendor, spent over 18 months creating a BI database tuned to the Service’s needs and the final result was unparalleled access to our information. But data alone is not enough, you also need skilled analysts who can intelligently and imaginatively utilize that data to produce results, and those skilled analysts work in our new Police Analytics unit.

Police Analytics was envisioned as a different kind of law enforcement analysis; it isn’t crime analysis or intelligence analysis but is instead a data-centric role that provides quantitative analytical products to all levels of the Service. We realized early on that we wanted individuals with a background in math, engineering or the sciences so that they would be capable of performing complex statistical work. Further, the new analysts needed to be highly technical so that they would be comfortable working with databases and writing software to perform their analysis. This combination of skill sets echoes many of the talents of programmers and developers in the world of tech companies and so that was the model we adopted. To get the best people we decided to hire expertise from outside law enforcement and in so doing we created a tech start up inside the Police.

Challenges for a New Unit

Like any start up, the new unit experienced growing pains. Our initial conversations focused on where in the organization to position this new unit and while, from a technical perspective, the unit could logically fit in IT, from an analytical perspective it was decided that the unit should fall under the Chief’s Staff. The access provided by working directly for the executive allows the analysts to have a more direct line to senior command—better to communicate analytical findings and field requests—and working alongside other executive units such as planning, audits and policy meant that the analysts could develop a holistic understanding of how data flows throughout the organization. The placement of a highly technical unit outside of the traditional IT infrastructure was a novel undertaking and providing the needed access for police analysts to do their work meant that policies and practices had to be adapted. Consensus was reached through negotiation and collaboration between departments and we were able to ensure data integrity and successfully address security concerns.

The next challenge was one of branding. We had constructed a high-functioning unit that produced useful analysis but we needed the Service to know about it. To address that issue we started an internal campaign advertising ourselves throughout the Service as the source for statistics and analytical products. We positioned ourselves as a resource for Service members of any rank to get data and advice to support their projects and we emphasized the value of having a solid quantitative foundation to produce successful project outcomes.

Evidence-based Decision Making

Our outreach efforts focused on promoting a culture of data-driven, evidence-based decision making and we encouraged Service members to think about data collection and how subtly adjusting business practices could lead to better data which in turn would lead to better analysis. As an example, our computer systems allow officers to push a button every time they change activity, but some officers had gotten in the habit of not pushing the button and this led to data gaps. To address this issue we communicated to officers how consistently updating their status led to much more accurate activity reporting that better captured all of the work they performed throughout the day. When officers discovered the benefits of pushing the buttons, namely credit where credit is due, they modified their behaviour and adopted a data-driven mentality.

We’ve worked hard to change behaviours and clean up our data sets and we’ve started to see rewards for those efforts. One of our successes has been a project that studied call volume and peak period staffing. Calls for service fluctuate throughout the day but our officers work 12-hour shifts and that leads to peaks and valleys in officer busy-ness. By accurately capturing officer status changes we obtain an extremely detailed view of what patrol officers are doing throughout their shifts and with this level of detail it is possible to broadly categorize officer time as either ‘busy’ or ‘available’. For our analysis, we aggregated hundreds of thousands of time-stamped officer activities to construct daily and hourly profiles of when our officers have the heaviest work load and using that data senior commanders are able to construct additional shifts to move officers to busy times. The end result is a reduction in officer busy-ness during peak periods and a levelling of the overall work load across day and night shifts and because the system is data driven we are able to measure the impact of the shift changes and quantitatively demonstrate our success.

Increased Data Awareness

Beyond analytical work the police analytics unit also studies organizational business practices to identify “pain points”. If a business process is cumbersome or time consuming we brainstorm as a unit how to rectify the problem. This practice has led to the development of a number of reporting tools that repackage existing data assets into more usable formats. A few examples include summary reports of field contacts for front line officers, absence management reports for human resources, and occurrence mapping tools for crime analysis. The point of these reports is not that they contain analysis but that they take existing data that is stored in isolation and synthesize it into an easily read report. Where before an individual tasked with reviewing these entities may have spent hours clicking through the RMS, they can now see everything in one report.

Perhaps our biggest success in this vein is the introduction of our officer activity report that allows Service members to review their activity. Our RMS captures officer activities such as arrests, charges, and tickets and our new reporting tools allow supervisors to review monthly summaries of this information for their units in a web-based, on-demand format. This reporting tool offers many improvements over the old, self-reported monthlies including accuracy, standardization and time savings. This tool has eliminated the need for members to waste time collecting and collating data that has already been captured in our databases and has resulted in a greater awareness, both for members and for command, of officer activity.

Lessons Learned and the Future

With our success establishing the police analytics unit HRPS has learned a number of lessons that may be instructive to other Services looking to create a data-focused unit:

  • You need buy-in from the highest levels of the organization. When attempting to create a data-driven culture of analysis that message needs to be embraced and communicated through management.
  • Success is built around people and not technology. Simply buying a piece of software does not solve your problems; you need a small core of dedicated experts who believe in the mission to find success.
  • Success takes time and progress will be slow. It is not possible to successfully influence the culture of a large organization quickly and that’s doubly true when the changes are related to a complex matter such as data.
  • Change is an iterative process. Only when you start looking will you see how your processes need improvement and once those improvements are in place you’ll need to gather more data and you’ll likely see more necessary changes.
  • The unit needs access to senior management. The police analytics unit needs the latitude to communicate proposed changes to senior officers and receive approvals quickly.

HRPS is proud of what our police analytics unit has accomplished in a short time. We have made meaningful contributions to improving the efficiency and effectiveness of the Service and have successfully promoted the value of data-driven decision making in the organization. We also spent 2014 promoting our vision for police analytics to other Services in Canada as well as at conferences in North America and Europe where our ideas have been enthusiastically received. In 2015 we plan to ramp up our software development and update our web-based tools so that officers in the field can access information on their tablets and phones. The volume of data captured by the police is going to keep growing and specialized systems and analysts are needed to extract value from those assets. We believe Police Analytics is the future of law enforcement analysis and we encourage everyone to consider how it might benefit their organization.

 

Prioritizing Using Exponential Decay

I had an officer ask the other day if I could identify the top 10 subjects in our records management system that were involved in mental health occurrences. On the surface this is a straightforward request: identify the people with the most activity but it gets trickier when you factor in the passage of time. For example, is someone who had 3 incidents over 6 months ago more of a concern than someone who had 2 incidents in the last 8 weeks? Clearly recency should factor into how a person is prioritized but history should also weigh in. To address questions like this we can use an exponential decay equation in Excel to preferentially weight recent incidents while still giving some weight to the past. In this post I will outline how to apply the method.

The first step is to obtain some data. Depending on the nature of the incident I will set an arbitrary historical cut-off; in this example we’ll set it at a year. This then means that we need to extract every mental health occurrence over the last year along with the date of the occurrence and a unique person identifier.

exponential_1

For each occurrence we then calculate the age of the incident in days by subtracting the date of the occurrence from today. This is done using Excel’s NOW() function to get the current date and the INT() function to trim off the time since we are working strictly with days. The following screen shot shows the formula in the function bar.

exponential_2
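If the occurrence date sits in column B (an assumption; match it to your own layout), the age column is simply:

    =INT(NOW()-B2)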

Next we want to create a column for the exponential decay formula. Most people are familiar with the idea of exponential decay: it is a mathematical formula where a quantity decreases at a rate proportional to its current value. In practice that means the value drops off quickly at first and then more slowly as the value shrinks. The chart below shows a graph of exponential decay for a number of different constants. As can be seen in the chart, we can tune how fast the values drop off by adjusting the constant.

640px-plot-exponential-decay-svg

In Excel, the formula for exponential decay is =EXP(-constant*time), where we get to specify the constant and where time, in our case, is the age of the incident. Since we’re dealing with days and we only have a year of data we can decide how much we want past incidents to influence a person’s priority. For example, if we say that a mental health incident that happened today is worth 1 point, how much is an incident that happened 6 months ago worth? If we say that it’s worth 0.5 points then we can calculate our constant as: -1*LN(0.5)/180. Here LN() is Excel’s natural logarithm function, 0.5 is the point value we want an incident to be worth at 6 months and 180 is the (approximate) number of days in 6 months. This calculation results in a value of 0.003851 which we can use as our constant.

exponential_3

Back in our data worksheet we can finally create our new column that calculates the decay score for each incident based on its age. We use the exponential decay formula =EXP(-1*0.003851*age) to calculate the decay score for each incident as in the following screen shot. Note that the decay score never exceeds 1 (an incident reported today scores exactly 1) and it approaches, but never reaches, 0 as an incident ages.

exponential_4
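As an equivalent shortcut (a sketch, assuming the age in days is in column C), the constant and the decay can be collapsed into a single formula, since the constant is just -1*LN(0.5)/180:

    =EXP(LN(0.5)*C2/180)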

Now that we have a decay score for each occurrence we need to create a pivot table of all the data. Highlight everything and select ‘PivotTable’ from the ‘Insert’ tab. This will open a modal window and you can just hit the ‘OK’ button to create your pivot worksheet.

exponential_5

On the pivot table worksheet drag the Identifier field to the Row Labels box and the Decay Score field to the Values box. Click on ‘Count of Decay Score’ in the Values box and select the ‘Value Field Settings’ option at the bottom of the pop-up window. Change the selected option from Count to Sum and click ‘OK’.

exponential_6

Almost done. Back in your pivot table right-click the Sum of Decay Score column and sort ‘Largest to Smallest’. This will sort the persons by their aggregated decay scores and the people at the top are the highest priority individuals.

exponential_7

This method works because the exponential function embodies the principles we were originally trying to capture: it gives more weight (a higher score) to recent incidents while still giving some score to historic incidents. For every occurrence a person is involved with they have the potential to add up to 1 point to their total score (if the incident happened just today) but as the incidents age they influence the total score less. In our example above the person with the 3 incidents over six months ago has, at most, a score of 1.5 (since 0.5 is the 6 month mark) while our 2-incidents-in-the-last-8-weeks person is likely to have a score a little less than 2. So this system scores the two people comparably, balancing the influence of recency and history. The best part is that we can tune how much weight we give to the two factors by adjusting the exponential decay constant. If we decide that an incident 6 months old is only worth 0.1 points rather than 0.5, which greatly privileges recency over history, we can adjust the constant and rescore everyone. On the other hand, if we wanted to make a 6 month old incident worth 0.9 points, we could recalculate the scores and privilege history over recency.

I like this method of prioritizing because it’s simple to implement in Excel (no plug-ins!), it’s logically justifiable and it allows the impact of recency and history to vary according to the expertise of the analyst. I hope you find it useful and good luck prioritizing!

Performing a Chain Analysis in Excel

Chain Analysis is the name I’ve given to the act of finding sequential, potentially related phone calls in phone records. The job is to identify if a particular number called a second number and if that second number called a third number within a specific time frame, say 10 minutes. I call these sequences of calls a chain and it can be used to identify whether someone might be issuing orders or instructions via third parties. The analysis can be done in Excel and while it involves some moderately-complicated Excel functions, once the spreadsheet is set up, subsequent analyses can be done easily.

Getting Started

To start, the first thing you need to do is arrange your data properly. In the screen shot below I have arranged some sample data where column A lists the calling number, column B lists the called number and column C lists the date and time of the call. In order for this technique to work properly it is important that the data set be sorted on the date time column.

For simplicity’s sake (and ease of reading) I’ve replaced the phone numbers with letters but really any string will work in the analysis. Likewise, I only have 7 rows of data but you can have a whole bunch more and the analysis will work just fine.

While we’ll need a few extra columns for functions there is no reason you can’t keep additional data in the rest of the columns. As long as you sort everything properly the rest of your information will tag along as you perform the analysis.

chain_1

Notice in the list that there are some chains that are easily identified without resorting to a function. Notice how, on row 3, AA calls CC and then CC calls DD a minute later. Similarly, on row 6, GG calls AA and then a minute later GG calls BB. This is also a chain to be identified. The trick now is to write a function that is able to flag these rows so that we don’t have to rely on picking them out manually.

The Look-Ahead Column

To flag the row we’re first going to create a new column that, for each row, “looks ahead” at the rows coming up. What we want the function to do is look ahead to see how many subsequent rows fall within the time frame window we’re interested in. In the discussion above I mentioned 10 minutes so, for each row, we want to count how many subsequent rows occur within 10 minutes of the date and time of the current row. The screen shot below captures the function that I wrote for cell D2.

chain_2

The very first thing to notice is that the function is wrapped in curly braces like this: {<function>}. This means that the function is what’s known as an array formula rather than a traditional scalar formula. To enter an array formula you click in the cell like you normally would but instead of hitting enter when you’re done you need to hit ctrl-shift-enter (at the same time) to tell Excel that it should be considered an array formula.

This particular function needs to be used as an array formula because we are asking the function to compare each subsequent cell to the current one in the IF statement, so we’re comparing an array of values rather than just a single one. Also, when you’re typing the function into the cell, don’t include the curly braces, Excel will add those after when you correctly use ctrl-shift-enter to complete the array function.

Let’s examine the IF statement next. The innermost part of the function is a request to check and see if the difference between cell C3 and cell C2 is less than 600 seconds (our 10 minute cut-off). Since the date and time values in the C column are in Excel time we need to convert them to seconds before we do the comparison, which is why I multiply them by 24 (hours) and then 3600 (seconds) to get the difference into seconds. In the IF statement, if the difference between C3 and C2 is less than 600, then we return a 1 and if not we return a 0. Now, the tricky part is to wrap your head around the idea that, because this is an array function, we can use the expression C3:$C$8-C2 to do all of the comparisons at the same time. This means that we are doing C3-C2, C4-C2, C5-C2, etc, all the way to C8-C2. For each one of them we do the x24 x3600 conversion and comparison against 600 seconds. For each one of them we are getting either a 1 or a 0. The power of the array function method is that we can do all of those calculations at the same time, in the same cell.

Have a look at the SUM function that wraps the IF statement. Knowing that this IF statement is multivalued makes the SUM function make more sense. The role of the SUM is now to sum up all of the 1s that were returned by each of the IF statements. Have a look at the value of D2: it’s 3. It’s 3 because the date/times in C3, C4 and C5 are all within 10 minutes (600 seconds) of the value in C2.

Array formulas can be filled down just like regular functions so I have filled down the rest of the rows with this function. The key thing to remember is that you only need to compare the current row to all the subsequent rows, and so the important bit of the function, the C3:$C$8-C2 bit, will always have the current row in the C2 position and the next row in the C3 position. As an example, in cell D4 the function is =SUM(IF((C5:$C$8-C4)*24*3600<600,1,0)). Note that for row 4, C4 is used to subtract and the range of cells to check starts at C5.
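Putting those pieces together, the array formula in D2 reads as follows (entered with ctrl-shift-enter, which adds the braces):

    {=SUM(IF((C3:$C$8-C2)*24*3600<600,1,0))}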

Figuring Out the Look Ahead Row Number

The next step in the analysis is to create a new column to hold the end row value for the look ahead. If the values in column D tell us how many rows to look ahead then the values in column E tell us what row that actually is. Have a look at the function in cell E2 in the screen grab below.

chain_3

This function uses the ROW() function to first tell us what row we are currently in. Then the value of D2 is added to that to arrive at the final value. For cell E2 the value is 5 since the ROW() function returns 2 and the value of D2 is 3. What this means is that we only need to look ahead to row 5, from row 2, to find all the phone calls that occurred within the 10 minute window. Anything beyond row 5 is more than 10 minutes after the call in row 2. The end row is an important component of the next part of the analysis as we need to be able to specify exactly what cell range we should be looking at for determining chains.

Flagging the Chained Calls

Have a look at the screen grab for the final step of the analysis, the step where each row is flagged as part of a chain or not.

chain_4

The heart of the function is the use of the MATCH function to compare the value in column B, the Callee column, with the values in the Caller column for all the rows up to the end row identified in column E. In order to do that we need to use the INDIRECT function, which is used here to assemble a cell range using the ROW() function and the value from column E. Inside the INDIRECT is ("A" & (ROW()+1) & ":A" & E3). If you evaluate this we can see that ROW()+1 becomes 4 (since ROW() is 3 and 3+1 is 4) and the value in E3 is 5, which leaves ("A" & 4 & ":A" & 5). If you slam all that together (which is what the &s are for) you end up with A4:A5, which is the range of cells we want to test our MATCH against.

Something else you need to know: when MATCH can’t find a match it returns NA, which is why the entire MATCH call is wrapped in an ISNA function call. If MATCH returns an NA value, then ISNA will return true, otherwise it returns false. I then feed that true or false into an IF statement (which is only interested in evaluating things to true or false anyway). If the value in the IF is true, that is, MATCH returned NA, then IF prints nothing, “”, to the cell. However, if MATCH returns a value and therefore ISNA returns false, then the IF statement will print “CHAINED” in the cell. For good measure, I wrap the whole thing in another IF statement to check and see if the look-ahead count in cell D3 is greater than 0. If not, there is no way there can be a chain, so the function just prints the empty string “” and we skip the whole MATCH, ISNA, IF process.
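Assembled from that description, the formula in cell F3 looks something like this (a sketch; the final 0 in MATCH forces an exact match, and the screen shot may differ slightly):

    =IF(D3>0,IF(ISNA(MATCH(B3,INDIRECT("A"&(ROW()+1)&":A"&E3),0)),"","CHAINED"),"")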

So let’s have a look at the results the function spit out. We can see that two chains have been identified. The first is on row 3 where AA called CC and then CC called DD within 10 minutes. This is what the CHAINED in cell F3 is identifying. The second CHAINED is for the chain that begins when CC calls DD and then DD calls FF (on row 5) within 10 minutes. The take away here is that chains are identified on the row in which they begin so it’s up to you to follow the chains from the starting point where they are identified.

Bonus Chains

The above example looks at Caller-Callee chains but almost the same function can be used to flag Caller-Caller chains, the ones where a caller calls a number and then within the time interval the same caller calls another number. In the screen shot below Caller-Caller chains have been identified for the second row and the sixth row and the only difference between the functions in column F and column G is that the MATCH in the column G version is being done on the values in column A (e.g. A2) rather than column B (e.g. B2). All the same logic applies but the change in the matching column changes the chaining subject.

chain_5

As can be seen in the chart, row 2 is a Caller-Caller chain because AA calls BB and then within 10 minutes AA calls CC. Similarly, on row 6, GG calls AA then GG calls BB within 10 minutes. Just like with the Caller-Callee chains above, the CHAINED flag is listed on the row where the chain begins and it’s up to the analyst to follow the chain through.

Conclusion

While the functions in this post can seem a little hairy at the start I think, with a little exploration, that you can become comfortable enough with them to trust the results. While it’s true that a custom VBA function could be written to perform this analysis I think it’s important to find solutions that first attempt to use the built-in Excel functions before resorting to custom code. I hope this post has demonstrated how a fairly sophisticated kind of analysis can be performed with just the tools that come in standard Excel. Good luck!

Month-over-Month Crime Stats Aren’t Useful (and two alternatives)

Do you prepare a monthly crime report? Good. Do you break down the number of incidents by crime type, for example assaults, break and enters, robberies, etc? Good. Once you have this month’s numbers, do you prepare a plus/minus over last month’s numbers, maybe with a splash of red or green to indicate direction? Pointless. I’ve seen too many crime reports, comp stat presentations, power point slides, what have you, that compare a current month’s crime (or week’s crime) to the prior month (or week). This is not a helpful exercise because crime has, as any large scale analysis of your incidents will indicate, an annual cycle.

The number of crimes reported to the police has three distinctive cycles. The first is daily: you get a different number of crimes being reported depending on the hour of the day. The second is weekly: depending on the day of the week, you’re going to get more or fewer calls. The third and final cycle is annual: you get more calls in the summer and fewer in the winter (in the northern hemisphere anyway). The following diagrams illustrate these cycles.

hod

 

dow

moy

Now, I realize that the day of week cycle doesn’t look like much but keep in mind that the volume of incidents is high enough that the differences are statistically significant. I encourage you to perform an analysis on your own data to demonstrate to yourself that what I’m saying is true.

Because crime has an annual cycle, if your crimes bottom out in January and peak in July then every month from January to July, on average, your current month’s stats are going to be higher than the previous month’s. Similarly, after you reach the peak, your current month’s stats are going to be lower than the previous month’s. The problem, of course, is that when these stats are reviewed each month commanders are going to be chewing out underlings and developing strategies to counteract the perceived rise in incidents during the first half of the year and celebrating a job well done in the second half. Of course, this is an exaggeration; cops know as well as anyone that they get busier in the summer than they do in the winter, but to the untrained observer (the media, the public, etc) this kind of stat reporting is not helpful.

Fortunately, there are a couple of alternatives:

The first is to compare this month to the same month from the previous year. This is a solid indicator since, barring drastic changes in the long-term trend (which you should quantify!), the numbers should be viable for comparison. The comparison is viable because it uses two points that sit at the same position in the annual crime cycle.

The second alternative is to only compare a year-to-date (YTD) stat and forget the whole month-over-month nonsense. Like option one above, a YTD stat will provide you with a legitimate yardstick to compare one time period to another. The yardstick is legitimate because it takes into account the annual cycle by summing up everything since the beginning of the newest iteration.

So there it is: don’t perform month-over-month stat comparisons. There are two equally easily calculated stats that will fill the spot in the report and provide actual insight. The key is to remember the natural crime cycles.

Performing a Statistical T-Test in Excel

The value of the t-Test is that it can tell you if a difference you are seeing in two measurements is legitimate or if the difference is likely to be just from randomness. In this blog post I am going to use the t-Test to determine if the number of daily motor vehicle collisions during the winter is meaningfully different than the number during the summer. While this is kind of a no-brainer example it will help illustrate the value of the t-Test and give you some ideas on how you might apply it to give some statistical rigour to your analytical findings.

Before we get into the test let me provide some background to help understand what it is we’re trying to do. In statistics we are always comparing two hypotheses. The first is named the null hypothesis and the second is named the alternate hypothesis. We almost always want to reject the null hypothesis and accept the alternate one because in that situation it means something interesting has occurred. For our motor vehicle collision example we want to determine if there is a meaningful difference in the number of daily motor vehicle collisions during the first quarter of the year (i.e. the winter) versus the third quarter of the year (i.e. the summer). Now, you may be thinking “obviously there’s a difference, the weather” but let’s, for the sake of education, assume that we need to demonstrate statistically that there is a difference. In this study the null hypothesis would be that “there is no difference between the quarters, the number of collisions is the same” while the alternate hypothesis would be “yes, there is a statistically significant difference between the two quarters”.

The term “statistically significant” needs some explanation. When we want something to be statistically significant we want to be able to say, with a particular level of confidence, that the results we are seeing are not just due to chance. First we pick our “confidence level”, or how sure we want to be, and then, through the magic of statistics, we are provided a number that our results have to beat in order for us to be that sure. In this example we’re going with the 95% confidence level, which means that if we find a difference between the average number of MVCs during the winter and summer we want to be 95% sure it’s not due to chance. Or, in other words, we are willing to wrongly reject the null hypothesis only 5% of the time. We can raise the confidence level to 99% if we want but, as we’ll see, we’ll then have an even stricter threshold to beat. Whatever we choose, just remember that the “significant” part in statistically significant doesn’t mean “important” (as significant typically means in everyday conversation); it just means “not due to randomness”.

How do we go about demonstrating that the two quarters have real, non-random differences in the number of daily motor vehicle collisions? The first thing we do is randomly select 30 different days’ worth of MVC counts from each of the first and third quarters. The screen grab below shows my randomly selected MVC data.

ttest_0
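One low-tech way to draw a sample like this, assuming each quarter’s daily counts sit in their own column, is to add a helper column of random numbers beside the counts, sort the rows by that helper column and keep the top 30. The helper cell is nothing more than:

=RAND()

Copy it down the full length of the quarter, sort, and take the first 30 rows. This isn’t the only way to sample in Excel, but it’s quick and it won’t pick the same day twice.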

Notice that there are two columns of data, one marked Winter and one marked Summer, and each has 30 entries. Also notice that at the bottom of each column is a number labeled “Mean”. In those cells I have used Excel’s AVERAGE function to find the mean of all the numbers (add them all up, divide by 30). And behold, the means are different. That proves the quarters are different, right? Not quite. We still have to deal with the peskiness that is statistical significance. It’s possible, after all, that those means are different just through chance. As discussed above, we want to be 95% sure that they aren’t.
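For reference, the “Mean” cells are nothing more exotic than AVERAGE pointed at each column, using the same B2:B31 and C2:C31 ranges that come up again in the ToolPak step below:

=AVERAGE(B2:B31)
=AVERAGE(C2:C31)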

To prolong your agony I’m actually going to show you two ways to perform the t-Test: the (relatively) quick way using Microsoft’s Analysis ToolPak Add-In and the longer, manual way that uses built-in Excel functions.

First, using the Add-In. 

The Analysis ToolPak is a free Add-In provided by Microsoft as part of the default installation of Excel. Add-Ins are a kind of Excel file, typically with a .xlam extension, that package together a bunch of related functionality. To get started with the Analysis ToolPak Add-In we need to check whether it is installed: click the “Data” tab in Excel and look at the far right. If you see “Data Analysis” as an option, it’s already active (see screen shot).

advanced_1

If you don’t see it that means we need to turn it on. First, click the green “File” tab (or the Excel jewel if you’re in 2007) to call up the Save/Open/Close menu. Look for the option named “Options” and click it. This will open up a window with a menu on the left. Click the menu option named “Add-Ins” on the left hand side. The screen will change and at the bottom, beside the word “Manage”, will be a select box and a button labeled “Go”. Click the “Go” button to open yet another window and you will see a series of checkboxes. Fingers crossed that one of them says “Analysis ToolPak” and is unchecked. Click the checkbox and click OK. Now go back to the Data tab, look at the far right and you’ll see “Data Analysis”. You’re set.

Now that the ToolPak is installed click the “Data Analysis” button and a modal window will open up. Click “t-Test: Two-Sample Assuming Unequal Variances” in the list and click the “OK” button and this will open up a new window (check the screen shot below) with some options. For the “Variable 1 Range” click the little arrow button and select all of the “Winter” data, which is in B2:B31. For “Variable 2 Range” select all the “Summer” data from C2:C31. For the “Hypothesized Mean Difference” enter 0 (the null hypothesis is that the means are the same, so the hypothesized difference between them is 0) and everything else can stay the same. However, notice that there is a field named “Alpha” with the value 0.05. It’s not a coincidence that 0.05 = 1 – 0.95: alpha is another way of asking how sure we want to be.

ttest_1

Click the “OK” button and Excel will open up a new worksheet populated with a bunch of labels and values just like the one in the screen shot below.

t-test-new-1

So what are we looking at? Well, there are the means we saw. Next come the variances, which are measures of how spread out the values are, and they are quite different (which is why we needed the test that assumes unequal variances). A little further down we see our “t Stat”, which has a value of about 4.35. Skip down two lines to the value labeled “t Critical one-tail” and note that it is about 1.68. That’s good news for us because the rule is that if your “t Stat” is larger than your “t Critical” value then the null hypothesis can be rejected and the results are statistically significant. This means that we can now state, with the power of math backing us up, that we are 95% sure that the daily average number of MVCs in the winter is greater than the daily average number of MVCs in the summer.

Second, the manual way. 

So we used a built-in tool to do a bunch of heavy lifting for us (as tools should) and we read an answer off a table. It works, but how did it work? How can we calculate these numbers if the ToolPak isn’t around but we still, for some reason, have access to Excel?

First, have a look at this screen shot because it will help to explain each of the steps I used to perform a manual unequal variance t-Test.

t-test-new-2

On the left are the two original columns of data, one for winter and one for summer. Notice that there are 30 entries for each season but that I used “freeze panes” so that I can show the bottom of the data (starting in row 32). For both seasons I calculated three additional variables: the mean (using Excel’s AVERAGE function), the standard deviation (using Excel’s STDEV function) and the count of the number of data points (using Excel’s COUNT function). These will all be needed in the coming calculations.
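The mean uses AVERAGE exactly as shown earlier; the other two helpers for the winter column look roughly like the formulas below (where they land on your sheet is up to you), with the summer versions pointed at C2:C31 instead:

=STDEV(B2:B31)
=COUNT(B2:B31)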

On the right hand side of the screen shot I produced my own little table and I’m going to walk through each of these variables and explain the thinking behind them.

The first variable we need is the Sum of the Squared Deviations, for which I used Excel’s DEVSQ function as a shortcut. Think of the variable like this: you have a mean value for a season and you have 30 different data points that are each going to be a little bit more or a little bit less than that mean (as data points usually are). For each data point, take the difference between its value and the mean and square it. Now sum all of those squared differences up and that’s the sum of the squared deviations (or differences). Notice that I added the sums of the squared deviations for the winter and summer seasons together into a single value.
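In formula terms, assuming the winter and summer samples still occupy B2:B31 and C2:C31, that single combined value is just the two DEVSQ results added together:

=DEVSQ(B2:B31)+DEVSQ(C2:C31)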

Why did we need the sum of the squared deviations? Because we want to calculate the pooled sample variance. Why pooled? Because we put both the winter and the summer values together. But what’s variance? The average of the squared differences from the mean. This is easy to calculate: divide the sum of the squared deviations by the number of points in our two seasons, minus 2 (1 for each season). The minus 2 is Bessel’s correction, which accounts for the bias that creeps in when estimating a population variance from a sample (just trust me, subtract 2; Excel does).
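Written out as one formula with those same ranges, and with the two counts sitting in B34 and C34 as they do in the degrees-of-freedom formula further down, the pooled sample variance looks something like this (a sketch, so swap in your own cell references):

=(DEVSQ(B2:B31)+DEVSQ(C2:C31))/(B34+C34-2)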

The next variable is the Standard Error of the Difference in Means and to calculate it we take the square root of the pooled sample variance times the sum of one over the counts for the winter and summer data points (i.e. 1/30 + 1/30). But what is the standard error of the difference in means? It’s the size of the gap between two sample means that we would expect to see just by chance, given samples with this many data points. We’re trying to determine, after all, if two means are significantly different from one another and, given the size of the two sample groups, we should expect some difference even when nothing real is going on. Now, imagine that we ran this little study over and over again with other, randomly selected groups of MVCs; each pair of samples would have its own difference between the means. That sounds like a lot of work, so instead of running the study over and over again we use the formula for the standard error of the difference to answer the question: “what’s the typical size of that difference?”.
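As a sketch, if the pooled sample variance from the previous formula sits in a helper cell (I’ll pretend it’s F6, which is not necessarily where it lives in my screen shot), the standard error of the difference in means is:

=SQRT(F6*(1/B34+1/C34))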

But what do we need that for? To calculate our t-statistic. To perform that calculation we divide the actual difference in the means (winter mean minus summer mean) by the standard error of the difference in means and end up with approximately 4.35, just like the ToolPak did. (Worth noting: this pooled-variance arithmetic reproduces the ToolPak’s unequal-variance t-stat here only because both samples are the same size, 30 days each.)
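In Excel terms, again with a made-up helper cell, if the standard error from the sketch above is sitting in F7 then the t-statistic is simply:

=(AVERAGE(B2:B31)-AVERAGE(C2:C31))/F7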

But we’re not done; we still need to determine our critical t-score at the 95% confidence level. To do that we cheat a bit and use Excel’s built-in TINV function. The first input is the probability that we want to test, in this case 0.05 (i.e. 1 – 0.95), the same alpha as in the ToolPak. The eagle-eyed will note that in the calculation in the screen shot 0.05 is actually multiplied by 2; this is because we’re interested in replicating the “one-tailed” result from the ToolPak and, since TINV is “two-tailed”, we need to double the probability to account for that.

TINV also takes a second parameter, degrees of freedom, and usually degrees of freedom is pretty straightforward (it’s just the sum of the two counts, minus 2, like for the pooled variance) except when we’re dealing with an unequal variance scenario. In order to handle this scenario we need to break out the more elaborate calculation defined by Moser and Stevens 1992 and spelled out in the ugly Excel formula:

=(1/B34 + F8/C34)^2/(1/(B34^2*(B34-1)) + F8^2/(C34^2*(C34-1)))

Where the value in cell F8 is calculated as =C32^2/B32^2, the ratio of the squared standard deviations for the two seasons. God, it’s a mess, I know. But dissect it and you’ll see that it’s mostly just the counts for the two seasons used in multiple places. One thing you may notice is that the resulting degrees of freedom is not an integer: we get the approximate value 48.58 while the ToolPak results above show a degrees of freedom of 49. This is because Microsoft rounds. You can choose to round or not; the results are pretty much the same.

With the degrees of freedom in hand we can finally run the TINV function, and when we do we get a critical t-stat of 1.68, just like the ToolPak (although the ToolPak number differs slightly further into the decimals because I didn’t round my degrees of freedom). And just like with the ToolPak, our t-stat is way bigger than that critical t-value, so we can reject the null hypothesis and conclude that we have statistically significant results at the 95% confidence level. Yes!
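For completeness, that final TINV call is a one-liner; here I’m assuming the degrees of freedom from the ugly formula above landed in a hypothetical cell F9:

=TINV(0.05*2,F9)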

And with that we are finally done with our odyssey of manually calculating the unequal variance t-test. We’ve calculated the same results as the ToolPak and we got right into the nitty-gritty of it. Well done.

Conclusion

If you made it this far (and didn’t just skip the manual section) then bravo, that was a real grind. Either way, I hope you learned something about the t-statistic and how it can be used to add some rigour to pronouncements about means. I do, somewhat unfortunately, find that a lot of the time the t-stat plays spoiler in that decision makers see a difference in averages and conclude that something is working when the reality is that, statistically speaking, there is no significant difference. Nevertheless, it’s important to be able to back up your analysis with some math and t-stats are a great way to provide that oomph. Good luck.