I think I understand how outliers can skew the data. Nonetheless, I don't think these qualify as outliers. By your description, outliers are "atypical" and "infrequent observations." The success of Mickelson, Singh, and Woods is not atypical or infrequent, but rather typical and frequent. They earned bigger shares of the success pie, and excluding them would only unjustifiably skew the results.
Here is some of what the Engineering Statistics Handbook says about outliers:
Outliers should be investigated carefully. Often they contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether it is likely similar values will continue to appear.
Is there a different slope in the 1980 line vs the 2005 line? And remember that the outliers have a large effect on the slope, but less on the r-squared.
I'll try to supply this later when I get a chance.
What I mean is that if you analyzed earnings vs. putting, for instance, you might get an r-squared of 20%, and that would suggest that putting accounts for more of the variation in earnings than does driving distance.
That may well be, but really what I am concerned with is how the correlation for driving has changed over the years.
Nonetheless, monetary success in golf depends on so many variables that I just don't think 8% is very low in the greater scheme of things: Putting, Chipping, Driving Accuracy, Sand Saves, Recovery Shot, GIR, Experience, Age, Equipment, Bounceback, Health, state of mind, confidence level, Reaching Par 5s, strength of field, starts, purses, etc. Many of these factors likely overlap. For example, longer drives might mean higher GIR, so the part of the GIR percentage attributable to driving would have to be subtracted out so that the total sum is equal to 1.
The 2% to 8% change you report is not quite a lot, as I tried to point out in the following. It's not statistically significant.
I still do not think you can link the magnitude of the correlation to its statistical significance. If I analyzed more years and the observation held up, I assume that we could achieve a statistically significant comparison even with a relatively low r-squared.
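To make that point concrete, here is a rough Python sketch of one common way to compare two independent correlations, the Fisher z-transformation. I am not claiming this is the method used in your analysis. The r-squared values (0.02 and 0.08) are the ones we have been discussing, and I am assuming both correlations are positive. The sample sizes are purely hypothetical, since neither of us has posted the number of players per year.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Two-sided Fisher z-test for the difference between two
    independent Pearson correlations. Returns (z, p_value)."""
    z1 = math.atanh(r1)  # Fisher transform of the first correlation
    z2 = math.atanh(r2)  # Fisher transform of the second correlation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z2 - z1
    z = (z2 - z1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value

# r-squared of 0.02 and 0.08, converted to r (assumed positive)
r_1980 = math.sqrt(0.02)
r_2005 = math.sqrt(0.08)

# HYPOTHETICAL sample sizes, for illustration only
for n in (200, 1000):
    z, p = compare_correlations(r_1980, n, r_2005, n)
    print(f"n = {n} per group: z = {z:.2f}, p = {p:.4f}")
```

With the same two correlations, the smaller hypothetical sample gives a p-value above 0.05 and the larger one gives a p-value below 0.05. The caveat is that pooling several seasons is not quite the same as having that many independent observations, since many of the same players show up year after year.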
Is the r-squared of 0.02 what you got by analyzing the 1980 data or is it hypothetical?
Both.
Did you mean 2005 in the last sentence? I don't have a particular r-squared in mind as being "important". 8%, or 6% if you remove the outliers, is not convincing.
I have trouble accepting the rejection of these numbers without any sort of methodology attached to that rejection. Finding out what would be convincing and why would surely help us define the issue, wouldn't it? How about 10%? 20%? 30%? 80%?
The analysis I provided you suggests there is no significance to the difference in the correlations between 1980 and 2005.
I don't think that is what your analysis suggests. Rather, I think your analysis suggests that the data we have studied thus far are inconclusive about whether the difference between the years is a fluke or whether it actually means something. This is a key distinction, I think.
I'll try to provide the least-squares lines later.
Also, send me your email via IM and I'll send you the data.