13th March 2013 at 5:46 pm #1653 Kate Martin, Member
I want to run a stepwise multiple regression analysis for some anthropometric variables, but my data is not normally distributed. I have managed to use square root transformation of my positively skewed data, and also for my negatively skewed data by “reversing” it first i.e. for -ve skew: SQRT(Highest value – Height_cm + 1).
I think this is all ok so far (please say if I’m wrong). But now I have 3 problems:
1) Andy Field’s SPSS book (2nd ed.), pg 80, says I now need to “reverse the data back” before I run it in the multiple regression analysis. How do I do that? Is it just the highest value after transformation minus the transformed variable values, i.e. XHighest – Xi transformed? If this is the case, how do I do it? (Last time I did the reversal in the transformation box as part of the transformation.)
2) I have also been told that once I’ve run the transformed data as a multiple regression, I need to somehow “transform it back”. What does this mean? Is this right? Do I need to interpret the multiple regression output in a special way for transformed data?
3) I may have a problem with collinearity as the correlations between all my variables are very strong. How can I tell if it is a problem, and if it is, what does it mean for interpreting the multiple regression output? Is there some way of correcting/accounting for this?
Thanks very much.
6th April 2013 at 6:39 pm #1654 Jean Marion Russell, Member
Firstly which variables are not normally distributed?
The assumption in regression is only that the dependent variable has a random, normally distributed component. Your explanatory variables are held to have no random element at all. Therefore there is no reason to worry about them being normally distributed.
What you do have to worry about is whether certain points are unduly influential. If I recall correctly the best way to estimate a line between two points is to have the data weighted heavily towards the end points of the range and about equal in both. So sampling should ideally look like an inverted normal distribution with bounds (the reason for this is the end points are totally reliant on the points on one side of them while the middle relies on points on both sides). What you really do not want is a small number of points a long way from the rest of your data as these have disproportionate influence on the estimates.
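One way to see how much pull a distant point has is its leverage. The sketch below assumes a simple regression with a single explanatory variable, where the leverage of point i is 1/n + (x_i - mean)^2 / Sxx. The height values and the function name `leverages` are made up for illustration, and the 2p/n cut-off is only a common rule of thumb.

```python
from statistics import mean

def leverages(x):
    """Leverage of each point in a simple (one-predictor) linear regression."""
    n = len(x)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical heights in cm: nine similar values and one far from the rest.
heights = [160, 162, 163, 165, 166, 167, 168, 170, 171, 210]
h = leverages(heights)

# Rule of thumb (an assumption, not a hard limit): flag leverage above 2 * p / n,
# where p = 2 is the number of fitted parameters (intercept and slope).
threshold = 2 * 2 / len(heights)
flagged = [x for x, hi in zip(heights, h) if hi > threshold]
print("High-leverage values:", flagged)
```

Statistical packages, SPSS included, report leverage and related influence diagnostics for multiple regression, so a hand calculation like this is only needed to understand what the numbers mean.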
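However, you are not in any position to assess the normality of the data at present, because the assumption concerns the random component (the residuals from the fitted model), not the raw variables. If one of your explanatory variables has a few very high or low values then it could be causing the skewness in the dependent variable.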
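To illustrate the point, here is a small sketch with made-up numbers: the explanatory variable has a few very high values, so the raw dependent variable looks strongly skewed, yet the residuals after fitting a line are close to symmetric. The data and the helper names `fit_line` and `skewness` are assumptions for the example only, and a single predictor is used to keep it short.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for one explanatory variable."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def skewness(values):
    """Moment-based sample skewness (0 for a symmetric sample)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical data: x has a few very large values, the noise is symmetric.
x = [1, 1, 2, 2, 3, 3, 4, 10, 20, 40]
noise = [0.5, -0.5, 0.3, -0.3, 0.2, -0.2, 0.1, -0.1, 0.4, -0.4]
y = [2 + 3 * xi + e for xi, e in zip(x, noise)]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print("Skewness of raw y:     ", round(skewness(y), 2))
print("Skewness of residuals: ", round(skewness(residuals), 2))
```

In practice you would fit the full multiple regression first and then inspect a histogram or normal probability plot of its residuals.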
Secondly the aim at the end is to present the results in terms of the original data scales. So you write your equation
y = Highest value + 1 - L^2, where L is the linear predictive combination; this comes from solving SQRT(Highest value - y + 1) = L for y. It will not be a nice formula.
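As a sketch of the back-transformation, assume the dependent variable was transformed with the reflected square root from the original post, T = SQRT(Highest value - y + 1). Solving for y gives y = Highest value + 1 - T^2. The function names and the height values below are hypothetical.

```python
import math

def reflect_sqrt(value, highest):
    """Reflected square root used for negatively skewed data."""
    return math.sqrt(highest - value + 1)

def back_transform(transformed, highest):
    """Inverse of reflect_sqrt: returns a value on the original scale."""
    return highest + 1 - transformed ** 2

# Hypothetical heights in cm.
heights = [150.0, 172.5, 180.0, 185.0]
highest = max(heights)

transformed = [reflect_sqrt(h, highest) for h in heights]
recovered = [back_transform(t, highest) for t in transformed]

for h, t, r in zip(heights, transformed, recovered):
    print(f"original={h:.1f}  transformed={t:.3f}  back-transformed={r:.1f}")
```

Note that the reflection reverses the ordering: the tallest person gets the smallest transformed value, so the signs of the regression coefficients on the transformed scale are flipped relative to the original scale.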
3) There are a variety of ways of checking for collinearity. The simplest is to look for high correlations between explanatory variables, but as this does not allow for complex patterns it is better to look at the Variance Inflation Factor and other collinearity diagnostics. If Andy Field does not cover these then Tabachnick and Fidell do (it does not matter which edition, as this has been covered right back to the second edition; just take the one in the library). If you are using a selection technique for deciding on the model I would suggest that you also run a principal component analysis. If the first few components have more than one term loading on them there is a good chance that you have collinearity in your data set and are not able to distinguish between the effects of these terms.
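For illustration, the Variance Inflation Factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. It also equals the j-th diagonal element of the inverse of the predictors' correlation matrix, which is what this minimal sketch computes. The variable names and measurements are invented, and the helpers `correlation`, `invert` and `vifs` are written only for this example.

```python
from statistics import mean

def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination.
    Fails with ZeroDivisionError if the matrix is singular."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(predictors):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    names = list(predictors)
    corr = [[correlation(predictors[a], predictors[b]) for b in names]
            for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}

# Hypothetical measurements: arm span tracks height almost exactly.
data = {
    "height_cm": [160, 165, 170, 175, 180, 185, 172, 168],
    "arm_span_cm": [161, 164, 171, 176, 179, 186, 173, 167],
    "waist_cm": [80, 70, 95, 75, 85, 72, 90, 78],
}
result = vifs(data)
for name, value in result.items():
    print(f"{name}: VIF = {value:.1f}")
```

A VIF above about 10 is often quoted as a sign of trouble, though that cut-off is a convention and not a rule. SPSS prints VIF and tolerance when collinearity diagnostics are requested in the regression dialog.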