Thursday, March 24, 2011

What Wisdom Do Statisticians Have?

From a Nate Silver post (on a theory predicting the Republicans are almost sure to maintain control of the House in 2012):
The issue with this model, and some others like it, is what’s known in the statistical business as overfitting. This occurs when the number of variables is large relative to the sample size: in this case, the full version of Mr. Enten’s model contains six variables, but is used to explain only 15 cases (Congressional elections in presidential years since 1952).
A general rule of thumb is that you should have no more than one variable for every 10 or 15 cases in your data set. So a model to explain what happened in 15 elections should ideally contain no more than one or two inputs. By a strict interpretation, in fact, not only should a model like this one not contain more than one or two input variables, but the statistician should not even consider more than one or two variables as candidates for the model, since otherwise he can cherry-pick the ones that happen to fit the data the best (a related problem known as data dredging).
If you ignore these principles, you may wind up with a model that fits the noise in the data rather than the signal.
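To get a feel for the overfitting Silver describes, here is a rough sketch of my own (not anything from his post or from Mr. Enten's model). It regresses pure random noise on random predictors, using 15 cases as in the quoted example, and reports how much of the "variance" an ordinary least squares fit appears to explain. The function names, the Gaussian noise, and the number of trials are all assumptions chosen for illustration.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(xs, y):
    """In-sample R^2 of a least squares fit of y on the columns of xs plus an intercept."""
    rows = [[1.0] + list(r) for r in xs]
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def average_r2(n_cases, n_vars, trials=2000, seed=0):
    """Average in-sample R^2 when both the inputs and the outcome are pure noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_cases)]
        y = [rng.gauss(0, 1) for _ in range(n_cases)]  # unrelated to xs by construction
        total += r_squared(xs, y)
    return total / trials


if __name__ == "__main__":
    for k in (1, 2, 6):
        print(k, "variables:", round(average_r2(15, k), 2))
```

With 15 cases and an intercept, the average in-sample R² for pure noise comes out to about k/14 for k variables, so six meaningless inputs appear to "explain" roughly 40 percent of what happened, while one or two explain very little. That apparent fit is all noise, with no signal behind it.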
Seems to me there's a relationship with how we construct narratives. The more detail, the more variables, we can stick in and still have a cohesive story, the more satisfying it is. So what Silver is saying is that stories aren't scientific explanations; they're history.
