Thanks, Kevin, I hadn't seen that. Yes, it appears Jim C did his best to apply the method to a survey that wasn't head-to-head and correctly noted that the approach doesn't resolve transitivity violations. The survey instrument obviously matters a great deal; it influences and even determines outcomes, so an actual head-to-head would be best.
In the end, everything depends on what we're measuring. No one other than Ron Whitten should be under any illusions that we're doing anything other than measuring the subjective preferences of a defined population. As with the notion of the "ideal woman," tastes are subjective and vary, often greatly, over time.
I'd like to see the statistical experts like Anthony Fowler, Jim Colton, JC Cummings, et al. discuss the merits and problems of the following two exercises.
I. Scaled-comparison course ranking:
1. Head-to-head comparisons.
2. Dump the Doak Scale (I don't see how it is remotely possible to apply in a scaled-comparison exercise) and 1-to-10 types of scoring systems, and instead force the reviewer to allocate 10 rounds between the two courses, instructing the reviewer to neglect distance, money, and access as considerations. (We could even make it head-to-head-to-head or head-to-head-to-head-to-head.) The outcome of this step is a measure or weight for each course as given by the individual rater.
3. Collate raters' course scores into a weight for each course (a rough sketch of how that collation might work follows this list).
4. Do NOT just list the courses for a ranking! Instead, graphically show the relation of the weights to each other. I think this would show that only a handful of courses matter; the rest is just noise / a random walk.
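To make step 3 concrete, here is a minimal Python sketch of one way the 10-round allocations could be rolled up into per-course weights and displayed relative to one another. The data format, rater labels, and course names are invented for illustration; this is just one plausible collation, not the method itself.

from collections import defaultdict

# Each entry: (rater, course A, course B, rounds allocated to course A out of 10).
# Hypothetical data standing in for whatever the survey instrument captures.
allocations = [
    ("rater1", "Course A", "Course B", 7),
    ("rater1", "Course A", "Course C", 9),
    ("rater1", "Course B", "Course C", 6),
    ("rater2", "Course A", "Course B", 5),
    ("rater2", "Course A", "Course C", 8),
    ("rater2", "Course B", "Course C", 7),
]

won = defaultdict(float)     # rounds each course was allocated
at_stake = defaultdict(int)  # rounds at stake in each course's matchups

for rater, a, b, rounds_to_a in allocations:
    won[a] += rounds_to_a
    won[b] += 10 - rounds_to_a
    at_stake[a] += 10
    at_stake[b] += 10

# Weight = share of available rounds the course captured across all its matchups.
weights = {course: won[course] / at_stake[course] for course in won}

# Step 4: show how far apart the weights actually are, not just the order.
for course, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{course:10s} {w:.2f} {'#' * int(w * 40)}")

With real data the last loop would be a chart rather than text bars, but the point is the same: the gaps between the weights are the story, not the ordinal list.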
This approach is criteria-free, which is fine because in my heart of hearts I believe that people, unless given the narrowest, least-interesting, totally straitjacketing criteria, deep down just pick things they like for whatever reason. And if you ask them the reason, they may not know, or they may be lying to us and / or to themselves. They / we probably don't even realize it.
So...
II. Ranking criteria -- I'm not sure I have the order right, but here goes:
1. Prior to the scaled-comparison exercise above, have each reviewer assign 100 pennies total across a list of architectural and non-architectural attributes.
2. Time gap / break.
3. Reviewer completes the scaled-comparison course-ranking exercise.
4. For a random selection of courses, have the reviewer tick boxes next to the attributes he feels each course possesses.
5. Derive attribute weightings from the reviewer's scaled-comparison scores (one way to do this is sketched after this list).
6. Normalize attribute weightings across all reviewers.
7. Compare to the a priori 100-pennies exercise.
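Here is a minimal sketch of how steps 5 and 7 might look for a single reviewer, assuming a simple least-squares fit of course weights on the ticked attributes and then a rescaling to pennies; step 6's normalization across all reviewers is skipped for brevity. The attribute list, tick matrix, course weights, and penny allocations are all hypothetical.

import numpy as np

attributes = ["routing", "green complexes", "conditioning", "prestige"]

# Step 4 output: which attributes the reviewer ticked for each course (1 = ticked).
ticks = np.array([
    [1, 1, 0, 1],   # Course A
    [1, 0, 1, 0],   # Course B
    [0, 1, 1, 0],   # Course C
    [1, 1, 1, 1],   # Course D
])

# Step 3 output: the reviewer's scaled-comparison course weights.
course_weights = np.array([0.80, 0.45, 0.35, 0.90])

# Step 5: attribute weights that best explain the course weights (least squares).
attr_weights, *_ = np.linalg.lstsq(ticks, course_weights, rcond=None)
attr_weights = np.clip(attr_weights, 0, None)   # negative weights aren't meaningful here

# Step 7: rescale to 100 pennies so it lines up with the a priori allocation.
implied_pennies = 100 * attr_weights / attr_weights.sum()
stated_pennies = np.array([40, 30, 20, 10])   # the reviewer's up-front allocation

for name, implied, stated in zip(attributes, implied_pennies, stated_pennies):
    print(f"{name:18s} implied {implied:5.1f}   stated {stated:5.1f}")

In practice you'd want many more courses than attributes before the fit means anything, but the shape of the exercise is the same.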
This would allow us to infer (fancy guess) the criteria raters use. It would allow us to explicitly weed out non-architectural criteria and then go back and reweight reviewers' original course rankings. And it would allow us to see how closely reviewers' stated criteria (like a magazine's rating criteria) match what they actually value in course attributes.
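And a sketch of the reweighting idea: zero out whichever attributes we deem non-architectural and rebuild each course's weight from the architectural attributes only. Again, the attribute split and the numbers are made up for illustration and carry on from the sketch above.

import numpy as np

attributes = ["routing", "green complexes", "conditioning", "prestige"]
architectural = np.array([1, 1, 0, 0])   # 1 = architectural attribute (a made-up split)

attr_weights = np.array([0.35, 0.25, 0.10, 0.20])   # e.g., as inferred in the sketch above
ticks = np.array([                                   # same hypothetical tick matrix as above
    [1, 1, 0, 1],   # Course A
    [1, 0, 1, 0],   # Course B
    [0, 1, 1, 0],   # Course C
    [1, 1, 1, 1],   # Course D
])

# Rebuild each course's weight from the architectural attributes only.
reweighted = ticks @ (attr_weights * architectural)
for course, w in zip(["Course A", "Course B", "Course C", "Course D"], reweighted):
    print(f"{course}: {w:.2f}")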
Mark