Dear stats boffins,
As I understand it, statistical significance tests (and other forms of statistical inference, such as confidence intervals) assume, and thus require, random sampling or at least randomisation of respondents across experimental conditions. However, in my discipline, Marketing Management, we typically conduct large-sample (n > 200), non-experimental survey studies based on non-random, non-probability samples. The reason for this unfortunate state of affairs is simple: it is prohibitively expensive for academic researchers to obtain true probability samples (e.g., simple random or stratified samples) of consumers. Most often, our non-probability samples are not even vaguely representative of any specific larger population (e.g., we select a non-probability sample of undergraduate students at a single university and then pretend that we can generalise from this sample to all undergraduate students in South Africa). We then merrily go ahead and conduct statistical significance tests on our non-random data, ignoring the fact that such tests require random samples. This practice is not unique to the Marketing discipline, but is a reality in many other management disciplines. My questions are:
1. What, if anything, can one legitimately conclude from quantitative studies based on non-probability samples?
2. Can significance tests (e.g., t-tests, ANOVAs, regression analyses) legitimately be applied to data from non-random samples?
3. Aside from descriptive statistics, which other forms of statistical analyses (e.g., effect sizes, bootstrapping) can legitimately be applied to quantitative data collected from non-random samples?
I would appreciate your comments on these questions and perhaps also suggestions of sources that discuss these issues in more detail.
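To make the third question concrete, here is a minimal sketch of a percentile bootstrap confidence interval for an effect size (Cohen's d) in Python. The groups, sample sizes, and scores are simulated and purely hypothetical. One caveat worth stating up front: bootstrapping quantifies sampling variability within the data you already have; it cannot correct for selection bias in how that data was gathered.

```python
import numpy as np

rng = np.random.default_rng(42)

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

def bootstrap_ci(a, b, n_boot=5000, alpha=0.05):
    """Percentile bootstrap CI for Cohen's d, resampling within each group."""
    stats = np.empty(n_boot)
    for i in range(n_boot):
        stats[i] = cohens_d(rng.choice(a, size=len(a), replace=True),
                            rng.choice(b, size=len(b), replace=True))
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Hypothetical survey scores for two groups (simulated, not real data).
group_a = rng.normal(3.8, 1.0, 120)
group_b = rng.normal(3.4, 1.1, 110)
lo, hi = bootstrap_ci(group_a, group_b)
print(f"d = {cohens_d(group_a, group_b):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The interval describes uncertainty about d *for the process that generated this sample*; whether d generalises to any wider population is exactly the question at issue.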
Actually, in my opinion, it is a very important question for the social sciences in general. Very often our population (universe) cannot be reached, because of scarcity of budget or other research resources, or because of the very nature of the population (e.g., illegal immigrants, or summer holidaymakers in rented flats and houses...).
The key to your questions is "legitimately", as you underline. But I would like to add another dimension: what is the alternative for unknown populations (unknown in terms of quantity, i.e., how many people we have to consider, how they are distributed across the territory...), or for elusive populations? In my opinion, we cannot simply abandon the problem (the object of study); rather, we have to design a strategy to obtain as much accuracy as possible by other means.
Specifically, I can give you an example of this strategy, which I published some time ago.
It's a bit of a tricky question, this one, because a few things can be done to help the situation, but there is no panacea here.
The key is to recognise that non-probability samples induce another layer of uncertainty: selection bias. In anything statistics-related, if you have uncertainty associated with an estimate, you have to model it as best you can. To work with your example about sampling undergraduates from a single university in South Africa and then generalising to all South African universities: what kind of information do you know about that particular university, and about the other ones, that you can include in your model? You can try more modern techniques like propensity score matching. If you know something about the probability distribution of a certain statistic across all the universities in South Africa, and have data on its distribution at your one particular university, you can use importance sampling or some other Monte Carlo method to get an idea of how biased your statistics are, contingent on the selection that you made. If you have any extra covariates, you can always try adding some other controls. Rosenbaum & Rubin have developed a series of methods to handle selection bias, or you can always try the faithful Heckman method.
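To illustrate the reweighting idea, here is a minimal Python sketch with a simulated population and a selection mechanism that is (unrealistically) known exactly. Every number and variable name below is hypothetical; in a real study the selection probabilities would have to be estimated, e.g., via a propensity score model, which is the genuinely hard part.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "population": a covariate (age) and an outcome that depends on it.
N = 100_000
pop_age = rng.normal(35, 12, N)
pop_outcome = 2.0 + 0.05 * pop_age + rng.normal(0, 1, N)

# Biased convenience sample: people near age 21 (think undergraduates) are far
# more likely to respond, so the sample over-represents the young.
p_select = np.exp(-((pop_age - 21) ** 2) / (2 * 8.0 ** 2))
selected = rng.random(N) < p_select
sample_outcome = pop_outcome[selected]

# Importance weights: since the selection mechanism is known here, each
# respondent is weighted by 1 / P(selection | age). In practice this
# probability must itself be estimated from covariates.
weights = 1.0 / p_select[selected]

naive_mean = sample_outcome.mean()                       # biased downward
weighted_mean = np.average(sample_outcome, weights=weights)
true_mean = pop_outcome.mean()

print(f"true population mean: {true_mean:.3f}")
print(f"naive sample mean:    {naive_mean:.3f}")
print(f"importance-weighted:  {weighted_mean:.3f}")
```

The naive mean is pulled towards the over-sampled young respondents, while the weighted estimate recovers the population mean; the catch is that the correction is only as good as your model of who got into the sample.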
Now, if you come to me and say you simply sampled undergraduates and have nothing else, then there is very little I (or any statistician) can do to help, and you are quite correct: most of the analyses you make will be fairly meaningless. But this is an issue of design, not statistics per se... paraphrasing the immortal John Willet: "We cannot fix through analysis what we bungled through design".
Why is it that people prefer to turn their heads the other way and keep on doing their ANOVAs and t-tests? Simple: because it's easier than doing the right thing (plus some bizarre misconception about the Law of Large Numbers, but that's another story).