The issue with this model, and some others like it, is what’s known in the statistical business as overfitting. This occurs when the number of variables is large relative to the sample size: in this case, the full version of Mr. Enten’s model contains six variables, but is used to explain only 15 cases (Congressional elections in presidential years since 1952).
Seems to me there's a relationship with our construction of narratives. The more detail, the more variables, we can stick in and still have a cohesive story, the more satisfying it is. So what Silver says is that stories aren't scientific explanations; they're history.
A general rule of thumb is that you should have no more than one variable for every 10 or 15 cases in your data set. So a model to explain what happened in 15 elections should ideally contain no more than one or two inputs. By a strict interpretation, in fact, not only should a model like this one not contain more than one or two input variables, but the statistician should not even consider more than one or two variables as candidates for the model, since otherwise he can cherry-pick the ones that happen to fit the data the best (a related problem known as data dredging).
If you ignore these principles, you may wind up with a model that fits the noise in the data rather than the signal.
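To see what "fitting the noise" looks like, here's a rough sketch of my own (not Silver's or Mr. Enten's method). It assumes 15 cases where both the outcome and all six candidate variables are pure random noise, so there is no signal to find at all. The function names `solve`, `r_squared`, and `simulate` are made up for the illustration, and the fit is plain least squares using only the Python standard library.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(xs, y):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    rows = [[1.0] + list(r) for r in xs]  # prepend the intercept column
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def simulate(n_cases=15, n_vars=6, seed=0):
    """Fit pure-noise outcomes with one noise variable, then with all of them."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_cases)]  # outcome: noise only
    xs = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_cases)]
    r2_one = r_squared([row[:1] for row in xs], y)  # first variable only
    r2_all = r_squared(xs, y)                       # all six variables
    return r2_one, r2_all


if __name__ == "__main__":
    one, six = simulate()
    print(f"R^2 with 1 noise variable:  {one:.2f}")
    print(f"R^2 with 6 noise variables: {six:.2f}")
```

Because the one-variable model is nested inside the six-variable one, the six-variable fit can never look worse on the 15 cases it was fitted to, even though none of the variables has anything to do with the outcome. The apparent explanatory power comes entirely from having more knobs to turn.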
Blogging on bureaucracy, organizations, USDA, agriculture programs, American history, the food movement, and other interests. Often contrarian, usually optimistic, sometimes didactic, occasionally funny, rarely wrong, always a nitpicker.
Thursday, March 24, 2011
What Wisdom Do Statisticians Have
From a Nate Silver post (on a theory predicting the Republicans are almost sure to maintain control of the House in 2012):