21st July 2011 at 12:46 pm #3333
We’re trying to use a number of variables to predict care need. We’ve got quite a long list of variables, and data for each of them, and are now trying to reduce the list of variables, as some seem quite similar. However we don’t have any outcome data (i.e. information on care need) to test our model on. I’d appreciate your comments on whether what we’re thinking of doing sounds reasonable.
We’ve examined the literature for factors that predict care need (including demographic, health, lifestyle, social, social care and socioeconomic factors). We’ve found local data for all of them – a lot of it is not available at an individual level, so we’ve aggregated it up to 43 groups. A lot of these variables seem quite similar (but also slightly different), e.g. two different variables for income, and the prevalence of a number of health conditions, so we want to reduce the number of variables. But we don’t have any data on actual care need, so we can’t straightforwardly use statistics to see which variables are the better predictors. We do have current service use, but we know there are people out there who aren’t using services they need, or are buying services privately, or have lower levels of care need so aren’t eligible for current services – so this is incomplete.
What we were thinking of doing was calculating correlations between all the variables, then looking at the patterns of correlations to help us decide which variables to eliminate. I.e. if two variables seem to cover the same topic (like the two versions of income) and correlate with the same other variables, then we say they are essentially the same and get rid of one of them. We were going to use expert knowledge to decide which to keep – whichever the experts thought was more important, or data quality considerations if one measure was better.
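A rough sketch of that screening step in Python (the data and variable names here are just toy stand-ins, not our actual measures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for aggregated variables: two near-duplicate income
# measures plus an unrelated health variable (all names hypothetical).
n_groups = 43
income_a = rng.normal(size=n_groups)
income_b = income_a + rng.normal(scale=0.1, size=n_groups)  # near-duplicate
health = rng.normal(size=n_groups)

X = np.column_stack([income_a, income_b, health])
names = ["income_a", "income_b", "health"]

# Pairwise correlations between variables (columns).
corr = np.corrcoef(X, rowvar=False)

# Flag pairs whose absolute correlation exceeds a (subjective) threshold;
# expert judgement then decides which member of each pair to keep.
threshold = 0.9
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > threshold:
            print(f"{names[i]} ~ {names[j]}: r = {corr[i, j]:.2f}")
```

The threshold is a judgement call, and a high correlation alone doesn’t prove two variables measure the same thing – that’s where the expert review would come in.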
Does this sound reasonable, or does anyone have any other suggestions or comments?
Thank you!
22nd July 2011 at 7:55 am #3339
great thanks
22nd July 2011 at 12:53 pm #3338 Jeremy Miles (Participant)
Principal components analysis is probably better than factor analysis for this.
22nd July 2011 at 12:56 pm #3337
Cheers – though I didn’t think you could do PCA or factor analysis without outcome data to predict? I was looking for a way to produce a model to use until we get the outcome data. If they don’t need an outcome, then I agree one of these sounds best really.
Thanks for your suggestions so far!
25th July 2011 at 8:12 am #3336
Thanks – I didn’t know they didn’t need an outcome variable; I’ll look into that. Cheers.
28th July 2011 at 4:43 pm #3335 Brian Perron (Participant)
I think both principal components analysis and factor analysis are strategies to consider, although I would lean more toward principal components analysis since you are looking for variable-reduction strategies. Factor analysis, in my opinion, is not a good option. While you can subject just about any set of variables to a factor analysis and obtain at least a one-factor solution, you can still have very serious conceptual problems with the factor(s) identified, particularly if the model uses formative indicators. This problem is clearly discussed in the classic paper by Bollen and Lennox (1991).
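As a rough illustration (toy data, not your measures), principal components can be computed directly from the eigendecomposition of the correlation matrix, and the proportion of variance explained by each component is what guides the reduction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 43 aggregated groups by 5 variables.
n_groups, n_vars = 43, 5
X = rng.normal(size=(n_groups, n_vars))

# np.corrcoef standardises internally, so variables measured on
# different scales are treated comparably.
R = np.corrcoef(X, rowvar=False)

# PCA via eigendecomposition; eigh returns eigenvalues in ascending
# order, so re-sort them largest first.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of total variance explained by each component.
explained = eigvals / eigvals.sum()
print("Variance explained:", np.round(explained, 3))

# Loadings: how strongly each original variable contributes to each
# component, which is what you inspect when deciding what to drop.
loadings = eigvecs * np.sqrt(eigvals)
```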
I strongly encourage you to move away from examining patterns of correlations for specifying your model. In the area in which you are working, everything is associated with everything else to some degree – what Paul Meehl called the “crud factor”. Also, with a large enough sample size, many of the associations will be statistically (but not practically) significant. Starbuck (2006) offers a good rationale, with an example, for avoiding the correlation-driven approach: “Finding significant correlations is absurdly easy in this population of variables, especially when researchers make two-tailed tests with a null hypothesis of no correlation. Choosing two variables utterly at random, a researcher has 2-to-1 odds of finding a significant correlation on the first try, and 24-to-1 odds of finding a significant correlation within three tries…” (p. 49).
While it seems that you have done a fairly exhaustive literature search to identify the full range of variables, I think you should focus more on using theory to identify which are the key variables / constructs to include in your model. For the constructs for which you have multiple measures, you can then decide which ones most appropriately match the theory and have the best measurement properties.
But I really think you need a theory-driven, as opposed to data-driven, approach in this situation. Since you are interested in service utilization, you might consider the widely (and perhaps overly) used behavioral model of service utilization by Andersen, or the network episode model by Pescosolido. Both can easily be found on PubMed or Google Scholar.
So, my suggestion in brief: use theory rather than analysis to reduce your list of variables…
Bollen, K. A., & Lennox, R. (1991). Conventional wisdom on measurement: A structural equation perspective. Psychological Bulletin, 110, 305-314.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66 (Monograph Supplement 1-Vol. 66), 195-244.
Starbuck, W. H. (2006). The production of knowledge: The challenge of social science research. New York: Oxford University Press.
29th July 2011 at 8:13 am #3334
Thanks. Well, we did want to be mostly theory-driven really, but using the data as a check or second opinion. The literature didn’t really identify the important variables in enough detail to be able to discard variables from that alone. At the moment we’ve managed to eliminate some based on data quality considerations (which we can be a bit more sure about). As you say, most things were correlated with most other things! I’ve run a PCA with various numbers of factors, which gave some interesting answers. Some factors were things we could identify as constructs (e.g. there was one that seemed to be about safety considerations), but the first two factors (and in different versions) seemed to mix things up a bit relative to the topics we had in mind, and we didn’t really want to go with that unless we could think of a good reason that might lie behind it. Because we didn’t have everything from the same source, we had to aggregate our data up into 42 groups, so that’s quite a low number of cases really, and we’re being a bit cautious about that.
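For what it’s worth, the quick check on how many components to look at goes roughly like this (toy data; Kaiser’s eigenvalue-greater-than-one rule is only one heuristic among several, and a rough one with this few cases):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 42 aggregated groups by 6 variables.
n_groups, n_vars = 42, 6
X = rng.normal(size=(n_groups, n_vars))

# Eigenvalues of the correlation matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

# Kaiser's rule: keep components with eigenvalue > 1, i.e. those that
# explain more than one original variable's worth of variance.
n_keep = int((eigvals > 1).sum())
print("Eigenvalues:", np.round(eigvals, 2))
print("Components kept under Kaiser's rule:", n_keep)
```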
Anyway, your suggestion sounds very sensible and confirms the position we were coming to. I’ll check out the references you suggested.
I’ve really had very useful comments from all of you – thanks!