[Cc J.]
D.,
This article on something called "colliders" and selection bias in statistics has fascinating implications. I think the following two paragraphs give the flavor (emphasis added).
Conditioning on a collider can occur any time that there is an underlying selection regime that involves either variables in the dataset or correlates of variables in the dataset. This is almost inevitable if you have built a composite dataset out of multiple constituent datasets. That is, a case appears in the sample if it meets one or more sampling criteria. This is actually a fairly common sample design, usually premised on the idea of not wanting to "miss anything" and/or wanting to increase the sample size.
Once you start looking for it you see it in a lot of studies. For instance, suppose a researcher were interested in which firms had donated to a particular PAC. The researcher might start with a basic sample like the Fortune 500 but then notice only 5 firms had donated to the PAC. Because statistical power in analysis of a binary variable is a function of both the number of cases (higher is better) and the proportion (close to .5 is better), the analysis would have minimal statistical power. The researcher might then add to the data all firms that donated to the PAC, regardless of whether or not they were in the 500. If the researcher were then to do a logistic regression of donating to the PAC as a function of annual revenues, the results would almost inevitably be a strong negative effect. The reason is that inclusion in the sample is defined by high revenues (which is the inclusion criterion for the Fortune 500) OR donating to the PAC. There are firms with low revenues that didn't donate to the PAC, lots of them in fact, but they don't appear in the dataset.
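To convince myself of the PAC example, here is a rough simulation sketch. It is my own illustration, not anything from the article. Everything in it is an assumption: the population size, the 1% donation rate, the normal distribution of log revenue, and the names `simulate` and `revenue_gap` are all made up. Donating is generated independently of revenue, and instead of a logistic regression it just compares mean log revenue between donors and non-donors, which is a cruder stand-in that should point the same direction.

```python
import random
import statistics


def simulate(n_firms=20000, top_n=500, donate_prob=0.01, seed=0):
    """Build a hypothetical population and the 'top revenue OR donated' sample."""
    rng = random.Random(seed)
    # Assumption: log revenue is standard normal, and donating is
    # independent of revenue, so there is no true relationship.
    firms = [(rng.gauss(0.0, 1.0), rng.random() < donate_prob)
             for _ in range(n_firms)]

    # Revenue cutoff defining the Fortune-500-style list (top_n largest firms).
    cutoff = sorted((rev for rev, _ in firms), reverse=True)[top_n - 1]

    # Selection rule from the quoted example: high revenue OR donated.
    sample = [(rev, donated) for rev, donated in firms
              if rev >= cutoff or donated]
    return firms, sample


def revenue_gap(rows):
    """Mean log revenue of donors minus mean log revenue of non-donors."""
    donors = [rev for rev, donated in rows if donated]
    others = [rev for rev, donated in rows if not donated]
    return statistics.mean(donors) - statistics.mean(others)


firms, sample = simulate()
print("gap in full population:", round(revenue_gap(firms), 3))
print("gap in selected sample:", round(revenue_gap(sample), 3))
```

In the full population the gap should sit near zero, since donating was generated independently of revenue. In the selected sample the only non-donors left are the high-revenue firms, so donors look much poorer by comparison. That is the spurious negative effect the article describes, produced entirely by the selection rule.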
This is definitely something I will keep in mind when reading analyses.
-M.
--
Hahahahaaaa!!! That is ME laughing at YOU, cruel world.
-Jordan Rixon
I could not love thee, dear, so much,
Loved I not Honour more.