Anyone who’s taken a sufficiently high-level statistics course, or tried to teach themselves statistics, knows that there are a bunch of different kinds of statistical procedures, and a bunch of different “statistics” and “tests” they have to do to figure out whether the results are “significant.” For example, I’ve seen the F Test introduced, explained mathematically, derived a few times – but I never quite figured out what it was actually doing. Not during my mathematical statistics course, not during my regression or econometrics courses, not at work, not in my own reading. Then last night a friend asked what it was and I explained it in about 30 seconds, then realized that I’d figured it out. I figured it would be nice if someone explained this on the internet, and I’m someone, so here goes:
Analysis of Variance
Statistics courses often shove a unit on Analysis of Variance somewhere into the middle or end, but the principle behind it is the one that motivates all of frequentist statistics. Here it is in a nutshell:
Variance is a measurement of how much variation there is in the data. For example, if you’re measuring the physical location of an object in latitude/longitude, the variance of your bed is low, the variance of your housecat, Roomba, or other pet is a little higher, and the variance of you or your car is even higher.
When you build a statistical model, you are trying to explain the variation in the data you’ve observed. You will usually be able to explain some, but not all, of the variance. Let’s say grandma and grandpa (or mom and dad, if you’re a bit older) have prospered financially. They spend spring and summer up north in Connecticut or New Jersey, and fall and winter in sunny Florida. Your model has only one predictor: what season is it? This explains the vast majority of the variance in their location. While they might move a little bit to get groceries, go out to dinner, visit a friend, or play golf, you can almost always get their location within a few miles by using that one simple predictor: is it warm season, or cold season?
But it’s not enough to just see that you explained some of the variance. Sometimes this will happen by coincidence. If you assign a group of people randomly into two groups, A and B, usually one will be a little taller than the other on average – it would be a big coincidence if the average heights were literally exactly the same. But that doesn’t have predictive value – just because group A is taller doesn’t mean you expect group A to be taller next time you divide people randomly into two groups. You can’t always use common sense – for example, maybe you’re not already sure whether something’s truly predictive or not – so you’d like some objective measure of how likely you are to get a prediction that works this well by chance, if there weren’t really a true relationship between the things you’re measuring.
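To see the coincidence described above in action, here is a minimal sketch that splits the same made-up group of people into random halves many times. The heights, the seed, and the helper name `random_split_gap` are all assumptions for illustration, not anything from a real dataset.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable
# Hypothetical heights in cm for 100 people; the numbers are made up.
heights = [rng.gauss(170, 10) for _ in range(100)]

def random_split_gap(values, rng):
    """Shuffle people into groups A and B; return mean(A) - mean(B)."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return mean(shuffled[:half]) - mean(shuffled[half:])

# Repeat the random split 1000 times and record the height gap each time.
gaps = [random_split_gap(heights, rng) for _ in range(1000)]
share_a_taller = sum(g > 0 for g in gaps) / len(gaps)
largest_gap = max(abs(g) for g in gaps)
print(f"A was taller in {share_a_taller:.0%} of splits; "
      f"largest gap was {largest_gap:.1f} cm")
```

Group A should come out taller in roughly half the splits, and some splits show a gap of several centimeters even though the grouping means nothing. That is the baseline any real predictor has to beat.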
It turns out that under certain simple conditions, you can neatly divide the observed variance into the part explained by your model, and the remaining noise. In the grandma and grandpa example, the explained part is whether they’re in their average Florida location or their average New York Area location, and the unexplained part is trips around town.
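Here is a rough sketch of that split for the snowbird example, using one coordinate (latitude) and season as the only predictor. The latitudes are invented and `split_variation` is a hypothetical helper; the point is only that the explained and unexplained pieces add back up to the total.

```python
from statistics import mean

def split_variation(groups):
    """Split the total sum of squares into an explained (between-group)
    part and an unexplained (within-group) part."""
    all_values = [x for g in groups.values() for x in g]
    grand_mean = mean(all_values)
    total = sum((x - grand_mean) ** 2 for x in all_values)
    # Explained: how far each group's average sits from the overall average.
    explained = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Unexplained: how far each observation sits from its own group's average.
    unexplained = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    return total, explained, unexplained

# Hypothetical latitudes in degrees north, grouped by season (made-up numbers).
latitudes = {
    "warm": [41.30, 41.32, 41.28, 41.31],
    "cold": [26.10, 26.14, 26.12, 26.09],
}
total, explained, unexplained = split_variation(latitudes)
print(f"explained share of variation: {explained / total:.4f}")
```

With these numbers nearly all of the variation is explained by season, and what is left over is the trips around town.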
Now here’s the core of statistics*:
1) Come up with a single measurement of how much your model explains.
2) Transform that measurement into a statistic that has well-defined statistical behavior.
3) Look at its value, and see whether the statistic is a lot bigger than you would expect if your model just worked by chance.
If your model didn’t “truly” explain anything, then both the apparently explained and apparently unexplained parts of the variance would each be distributed like a (suitably scaled) Chi-Squared random variable. If you divide the explained variance by the unexplained variance, you get another statistic, the ratio of the two. The more of the variance your model explains, the bigger this ratio gets. And since the ratio of two independent Chi-Squared random variables takes the F distribution, you have a statistic with a well-behaved distribution. This is called the F Statistic. Then you just look to see, how big is this F statistic? Is it bigger than 90% of the values it would take by chance if your predictors were meaningless? 95%? 99%? This tells you how unlikely it is that your model worked as well as it did by accident.
Now, I oversimplified this a bit. For one thing, the ratio of variances has to be multiplied by something based on how many records and how many predictors you’re using. For another, sometimes the F test is used to compare a model with another model, instead of comparing a model with no model. And it’s not literally true that mathematical statistics is nothing but significance tests over and over again – for example, there are fancy Bayesian techniques that explicitly estimate a prior and a posterior, there are ways to figure out whether a weird outlier is messing up your model, etc. But this is the part that keeps showing up, over and over again, and that no one really bothers to explain because it’s too obvious – or because they never figured it out.
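The three steps can be sketched for the simplest case, comparing group averages. This is an illustration under assumptions, not a reference implementation: the data are made up, the function names are hypothetical, and instead of looking up an F table it simulates the "by chance" distribution directly as a ratio of Chi-Squared draws, each divided by its degrees of freedom (the "multiplied by something" scaling mentioned above).

```python
import random
from statistics import mean

def f_statistic(groups):
    """Explained variance over unexplained variance, each scaled by its
    degrees of freedom (k - 1 for the groups, n - k for the leftovers)."""
    all_values = [x for g in groups for x in g]
    n, k = len(all_values), len(groups)
    grand_mean = mean(all_values)
    explained = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    unexplained = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (explained / (k - 1)) / (unexplained / (n - k))

def simulate_null_f(df_explained, df_unexplained, draws, rng):
    """Values the F statistic would take if the predictor were meaningless."""
    def chi_squared(df):
        # A Chi-Squared variable with df degrees of freedom is Gamma(df/2, scale 2).
        return rng.gammavariate(df / 2, 2)
    return [
        (chi_squared(df_explained) / df_explained)
        / (chi_squared(df_unexplained) / df_unexplained)
        for _ in range(draws)
    ]

def share_below(observed, null_values):
    """Fraction of by-chance values that the observed statistic exceeds."""
    return sum(v < observed for v in null_values) / len(null_values)

# Made-up measurements: one grouping that matters, one that does not.
groups_effect = [[10.1, 9.8, 10.3, 10.0, 9.9], [12.2, 11.9, 12.1, 12.4, 12.0]]
groups_noise = [[10.1, 12.0, 9.9, 11.0, 10.5], [10.3, 11.8, 10.0, 11.2, 10.4]]

rng = random.Random(0)
null_values = simulate_null_f(1, 8, 20000, rng)  # 2 groups, 10 records
f_effect = f_statistic(groups_effect)
f_noise = f_statistic(groups_noise)
print(f"real effect: F = {f_effect:.1f}, beats {share_below(f_effect, null_values):.1%} of chance values")
print(f"pure noise:  F = {f_noise:.4f}, beats {share_below(f_noise, null_values):.1%} of chance values")
```

The grouping with a real difference lands far out in the tail of the simulated values, while the meaningless grouping sits among the values you would expect by chance.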
Another Example of This Sort
My friend, Franklin, reciprocated by explaining the central principle of sorting algorithms – ways to sort a bunch of records in a dataset by some key (for example, you might sort a bunch of records of people by full name). Basically, all efficient sorting algorithms work by minimizing the number of extra comparisons they have to make; if you know that A>B, and B>C, you don’t need to compare A and C to know that A>C. The computer science courses he’d seen never covered this; they just told you about the various sorting algorithms and derived how efficient they are.
This central insight also explains both why the radix sort can be so much more efficient than comparison-based sorting algorithms, and why it took so long to discover such a simple thing. Basically, you might think of sorting algorithms as a way to approximate the theoretical minimum number of comparisons needed to know, for any given pair of objects in a set, which one is greater than the other. In the worst case this minimum is on the order of n*log(n), where n is the number of objects.
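To make the comparison-counting idea concrete, here is a small sketch that tallies the comparisons a merge sort makes on 1000 random keys and sets that against two yardsticks: comparing every pair, and the worst-case floor of log2(n!) comparisons, which is on the order of n*log(n). Merge sort is just one convenient comparison sort to instrument; the counter argument and variable names are my own choices for illustration.

```python
import math
import random

def merge_sort(items, counter):
    """Sort items, tallying every key comparison in counter[0]."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], counter)
    right = merge_sort(items[mid:], counter)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1  # one comparison between two records
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # Whatever is left over is already in order; no comparisons needed.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

n = 1000
rng = random.Random(0)
records = [rng.random() for _ in range(n)]
counter = [0]
sorted_records = merge_sort(records, counter)

all_pairs = n * (n - 1) // 2  # comparing every record with every other
# log2(n!) bits are needed to tell the n! possible orderings apart.
lower_bound = math.ceil(sum(math.log2(k) for k in range(2, n + 1)))
print(f"every pair: {all_pairs}, merge sort: {counter[0]}, worst-case floor: {lower_bound}")
```

The merge sort count should come out far below the every-pair count and in the same neighborhood as the floor, which is the "skip the comparisons you can already infer" principle at work.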
But what you actually want to do is order the data while making as few comparisons as possible. And the smallest number of comparisons you can make is not n*log(n), but zero. Essentially (oversimplifying only very slightly), instead of drawing any comparisons at all, you enumerate all possible values for your sort key in order, and then put each record directly in the corresponding slot. At no point do you compare any records with each other at all.
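A minimal sketch of that zero-comparison idea, assuming the sort key is a small non-negative integer. Strictly speaking this is the slot-per-key building block (a counting or pigeonhole style sort); a radix sort applies the same step one digit at a time so the number of slots stays small. The function name `slot_sort` and the sample records are hypothetical.

```python
def slot_sort(records, key, key_range):
    """Order records by an integer key in range(key_range) without ever
    comparing two records with each other."""
    # One slot per possible key value, laid out in key order.
    slots = [[] for _ in range(key_range)]
    for record in records:
        # The key is used as an index, not compared against another key.
        slots[key(record)].append(record)
    # Reading the slots front to back yields the records in sorted order.
    return [record for slot in slots for record in slot]

# Made-up (name, age) records, sorted by age.
people = [("Dana", 34), ("Eli", 7), ("Farah", 34), ("Gus", 19)]
by_age = slot_sort(people, key=lambda person: person[1], key_range=120)
print(by_age)
```

The cost is one pass over the records plus one pass over the slots, so it only pays off when the range of possible keys is not much larger than the number of records.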
Look for the Central Insight
In general, you’ll be able to retain a body of knowledge a lot better if you can remember its central insight – the rest can be reconstructed if you have that one intuition. Another example: the central insight of Wittgenstein’s Philosophical Investigations is thinking of language as a behavior with certain social results. The central principle of double-entry accounting is that the credit is where the money comes from, the debit is where the money goes, and you always record both sides of the movement. What are some other examples?