In spite of the title of this post, which might seem frivolous or inappropriate for a blog on learning technology, I mean this to be a serious discussion on the broad issue of measuring learning, an issue that will be with us for some time to come. I’ve written about this before, but it occurs to me that I should take up this theme on a regular basis because it is obviously gaining prominence elsewhere and, given my economics background, I likely come at these issues somewhat differently than others and so can offer a critique with my own spin. Then too, I think there is a lot for us to learn from professional sports, baseball in particular but other sports too, regarding the measuring of performance: what is possible as well as what seems to be elusive if not impossible. These lessons on measurement that sports can teach us, if we choose to look, are the focus of this post.
Let me begin with what I take to be an obvious fact, but I want to make sure you agree. Professional baseball measures performance of players to a much greater extent than we ever have measured, or are ever likely to measure, student performance in college. Further, the measurement happens in a public setting where indeed the fandom (and the sports reporters and commentators who feed the interest of the fans) spend a good deal of time interpreting the implications of the various statistics that are collected. In other words, there is an abundance of data on player performance. It’s not abundant relative to measures on, say, subatomic particle movement, but it sure is abundant relative to the formal measures we have on student performance.
The data themselves are not sufficient
With that as a given, let me move to the next somewhat obvious point – the data themselves don’t give the last word; people disagree about what the data mean. On the McGwire-in-the-Hall issue, this ESPN poll shows that more baseball writers are likely to vote against him than for him at present, but there are a significant number who are for. I don’t believe there is any disagreement about the evidence itself. But there is disagreement about what the evidence implies for addressing the question. Some might say, “Yeah, but that’s because of the steroid business. If it weren’t for that this wouldn’t be such a big deal.” It’s true, the question at hand wouldn’t be so controversial were the steroid issue not a concern – I’ll say more on that below – but it’s not true that removing it would in and of itself create unanimity on how to answer the question. There are many other baseball issues that have generated substantial disagreement among the experts – for example, who should have been the MVP in the American League this year? – where steroids didn’t matter at all and yet there still is substantial disagreement about the answer to the question, though again no disagreement about the data. (So readers are not confused by what I mean, when I refer to disagreements about the data consider the case last year of the SAT exams that were scored incorrectly. Those errors in scoring created a sense that the data themselves were not to be trusted.)
Further, even when there is unanimity or near unanimity of view about the answer to a question, there is a general recognition that the answer is contingent, not definitive, and subject to change based on additional evidence that might be brought to bear. For example, before going into last night’s national championship game for college football, Ohio State was a near unanimous pick as the number 1 team in the country and was a clear favorite before the game was played. But, of course, after being pummeled by Florida, that view was revised and this morning Florida is number 1, to nobody’s surprise. In other words – that’s why we play the game. The contingent aspect of what we know given performance measures is rarely discussed. (In the area of health, I’m under the impression that we’ve flip-flopped both ways on the issue of whether eating eggs is good for you based on the latest set of clinical studies.) It points to the fact that the data themselves can never be sufficient and that even well informed views can be revised based on new relevant information.
What’s the model?
In the McGwire case there are at least two different ways of thinking about the data that would rationalize a no vote on admission to the Hall of Fame. The first is the “say it ain’t so, Joe” approach, which was applied to Pete Rose and is the reason he has not been admitted to the Hall of Fame – Pete bet on baseball, the number one taboo. Some might say Pete should get in anyway, having the most hits in a career of any player in history, but everyone understands why he is not in. Economists would call this a case of lexicographic preferences. The first requirement is not betting. If a player can’t get by that first requirement, nothing else matters. There are no tradeoffs in this case.
Some people might treat the steroids issue in this lexicographic way. But, in fairness to McGwire’s case, steroids weren’t banned at the time he played the game and, further, McGwire did acknowledge that he took androstenedione, which was available over the counter, though it was banned in other pro sports. So some baseball writers might take a view different from the lexicographic approach; to wit that while McGwire shouldn’t be banned outright from the Hall of Fame his performance needs to be handicapped on account of his taking steroids and when compared to historical norms set by players of earlier generations who are in the Hall, that handicapping needs to be put in place to make a fair comparison.
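To make the two ways of reading the same record concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Candidate` fields, the single performance score, the threshold of 80, and the discount rates are illustrative assumptions, not anything the voters actually use.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    performance: float  # hypothetical single-number career score
    flagged: bool       # hypothetical marker for the disputed conduct


def lexicographic_vote(c: Candidate, threshold: float) -> bool:
    """The conduct requirement comes first; performance cannot offset it."""
    if c.flagged:
        return False
    return c.performance >= threshold


def handicap_vote(c: Candidate, threshold: float, discount: float) -> bool:
    """Performance is discounted for flagged players, then compared as usual."""
    adjusted = c.performance * (1 - discount) if c.flagged else c.performance
    return adjusted >= threshold


slugger = Candidate(name="Hypothetical Slugger", performance=100.0, flagged=True)

print(lexicographic_vote(slugger, threshold=80))             # no, regardless of score
print(handicap_vote(slugger, threshold=80, discount=0.10))   # yes after a mild handicap
print(handicap_vote(slugger, threshold=80, discount=0.30))   # no after a heavy handicap
```

In this sketch a mild handicap admits the player while a heavy one does not, so a no vote by itself is consistent with either rule.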
I note that baseball writers are not obliged to say what the model is they have in mind when they vote. My point here is that we can’t infer very well from the voting which model it is. The data seem to show that some voters who will vote against admission this year, the first year that McGwire is eligible, nonetheless have left the door open that they might vote the other way in the future. One might conjecture that these voters couldn’t hold the lexicographic view. But even if that is right, it doesn’t say that those who indicated they’d vote against in the future as well do hold the lexicographic view. It could be, instead, that they just think McGwire was an ordinary player who did exceptional things because of the drug enhancement. There is some evidence to support that interpretation in that the poll on voting for Barry Bonds (he is still playing so not yet a candidate for the Hall) shows greater support for him than for McGwire. How does one explain that?
Lest one think this an exercise relevant only to baseball and not student learning, consider this incomplete list of factors that might affect student performance: prior preparation in the subject in high school or in taking prerequisites in college, coming from a family background where education was not emphasized, not being a native speaker of English, having a learning disability such as dyslexia, being found to have plagiarized earlier in the term, and coming from a foreign country where the norms regarding behavior during an assessment of performance are different than they are here. In my time as a faculty member, I’ve been in settings where many of these factors have been treated sometimes in a lexicographic manner and other times in a handicap way. Conceptually, the comparison with baseball is right on.
And as with baseball, we are mostly not explicit at all about our models of superior performance. Even on one simple aspect, which is far from a full model, we don’t articulate whether we are measuring value added or absolute performance. For fairness reasons we list the vehicles to be used for assessment purposes – there will be two midterms, a final, and a term paper – and, increasingly, we may provide rubrics for how written portions of assessments will be evaluated. Yet we still don’t provide relatively simple models that show how performance data get mapped into judgments about learning. This leads to the next point.
Which variables should we focus on?
When I was a kid learning about baseball there were three variables held in high esteem as metrics of offensive performance – Batting Average (BA), RBIs, and Home Runs, and as if by destiny to emphasize the point in two consecutive years (1966 and 1967) the American League had a *triple crown* winner, first Frank Robinson and then Carl Yastrzemski, and there haven’t been any since. But due to the work of Bill James and others, popularized even to a non-baseball audience in the book Moneyball, other variables now get attention, notably On Base Percentage (OBP) and Slugging Percentage. The conceptual difference in selecting between BA and OBP is how one views walks. If they are in essence pitcher errors, then BA is the correct measure. But if walks are earned by the hitter, either by having tough at bats with lots of pitches fouled off, or as a sign of respect by pitchers due to the strong performance of the batter on prior occasions, then OBP is the better measure. One can readily agree that in some instances it is the one and in other instances it is the other; James’ work can be seen as an argument based on the data that in most cases walks show offensive prowess.
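The choice between BA and OBP can change who looks like the better hitter. The sketch below uses the standard definitions (BA is hits per at bat; OBP also credits walks and hit-by-pitches and counts sacrifice flies in the denominator), but the two stat lines are invented for illustration and do not describe any real player.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Hits per official at bat; walks do not count either way."""
    return hits / at_bats


def on_base_percentage(hits: int, walks: int, at_bats: int,
                       hit_by_pitch: int = 0, sacrifice_flies: int = 0) -> float:
    """Times on base per plate appearance counted for OBP; walks are credited."""
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sacrifice_flies
    return reached / chances


# Hypothetical season lines: (hits, walks, at_bats)
patient_hitter = (150, 100, 500)
free_swinger = (160, 20, 500)

for label, (h, bb, ab) in [("patient", patient_hitter), ("free swinger", free_swinger)]:
    ba = batting_average(h, ab)
    obp = on_base_percentage(h, bb, ab)
    print(f"{label}: BA={ba:.3f} OBP={obp:.3f}")
```

With these made-up numbers the free swinger leads on BA while the patient hitter leads on OBP, which is exactly the disagreement that turns on how walks are viewed.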
We have a similar type of issue in measuring student performance. You can get many instructors to agree with the proposition that it is comparatively easy to measure the performance of outliers, both the exceptionally good and the really horrible performers, but it is much harder to rank those in between. Consider an exam with 20 short answer questions, each either right or wrong, with each question worth 4 points if answered correctly, and consider an alternative exam with exactly the same questions where 10 of those questions are worth 2 points apiece and the other 10 questions are worth 6 points apiece. It is not hard to come up with examples where student A outscores student B on the first exam, meaning student A got more questions right, but student B outscores student A on the second exam, because student B got more of the 6 point questions right. From an economics perspective, this is an example of the index number problem. What informs the instructor to choose one weighting scheme over the other? Enumerating the criteria for making that determination is precisely what I mean by asking us to specify the model.
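One such example can be worked out in a few lines of Python. The answer patterns for students A and B are made up for illustration, and I assume the first 10 questions are the 2 point ones under the uneven scheme; only the two weighting schemes come from the description above.

```python
def score(answers, weights):
    """Sum the weights of the questions answered correctly."""
    return sum(w for correct, w in zip(answers, weights) if correct)


equal_weights = [4] * 20               # every question worth 4 points
uneven_weights = [2] * 10 + [6] * 10   # first 10 worth 2, last 10 worth 6

# Hypothetical answer patterns (True means answered correctly).
student_a = [True] * 10 + [True] * 4 + [False] * 6    # 14 right, mostly low-weight
student_b = [True] * 2 + [False] * 8 + [True] * 10    # 12 right, mostly high-weight

for name, answers in [("A", student_a), ("B", student_b)]:
    print(name, score(answers, equal_weights), score(answers, uneven_weights))
```

Under equal weights A scores 56 to B's 48; under the uneven weights A scores 44 to B's 64. Both exams total 80 points, so the reversal comes entirely from the weighting.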
What is not being measured?
Baseball is a team sport in contrast with golf, for example, which in most circumstances is an individual sport. There are, of course, a lot of performance measures in golf, such as the number of greens hit in regulation or the number of putts per hole, measures that are subsidiary to the ultimate performance measure – what place did the golfer come in at the conclusion of the tournament? One can build rather straightforward models that map these subsidiary measures into the ultimate performance measure; they won’t predict perfectly but might not do too badly. At least in concept if not in practice, that is straightforward to do.
Measuring performance in team sport conceptually differs from measuring performance in individual sport in that the whole may very well not equal the sum of its parts and so one would like metrics of the contribution to the whole, as well as the metrics of the contribution of the individual. For example, when Roger Maris set the then record for Home Runs with 61 in a season, it was said that his performance benefited from the fact that Mickey Mantle batted behind Roger – the pitchers didn’t want to walk Roger unnecessarily for fear that Mickey would drive Roger in, so Roger saw a disproportionately high number of fastballs. It’s a lot easier to hit and especially to hit for power, if the batter can correctly anticipate that fastballs are coming.
There are anecdotes, such as the above, of this type of team contribution, but little statistical evidence of this sort is presented. (One such offensive statistic is batting average with runners in scoring position while one such defensive statistic, rarely discussed, is “chances” in that some chances take hits away from the opponents and hence a player with an exceptionally high number of chances contributes to the performance of other defensive players, and is indirectly measured in the pitcher’s performance.) Most of the statistics that are presented are individualistic in nature. My sense of this is that the teams themselves keep some statistics on contribution to team, but those statistics are not tracked more generally. And for selection into the Hall of Fame, in particular, it seems as if the membership choice is biased toward power hitters, who have impressive individual statistics, and away from superior defensive players who may very well “provide the glue” that makes the team stick together as a unit and function well.
We in higher education are increasingly being told that students need to function well in teams; this is a skill valued highly in the marketplace. I don’t doubt that. But I do doubt whether we can measure how well students function in teams and, since we do measure student performance by the projects students produce, we too likely confound individual and team performance when we identify their jointly produced good work. Even when we measure the team product as a whole and give all team members the same grade for that work, we don’t measure how one member of the team affects the performance of other team members. I know that some folks allow team members to rate each other on performance, a subjective approach to get around this problem, but I fear these subjective evaluations measure effort reasonably well yet leadership not well at all.
It is surprising that baseball doesn’t do more to measure the individual contribution to the team. Given that it doesn’t, it is important to note an added consequence: the measures that commonly do exist, with their emphasis on individual performance, encourage the players themselves to focus that way, to raise their own marketability and thereby enhance their own future compensation. In the corporate setting, the practice of giving out stock options to employees is an attempt to address this incentive issue. We should think about the incentive issue more in the higher education context.
It’s time to close, but before I do let me give my reason for choosing to look at professional sport, and baseball in particular, regarding performance measurement rather than looking directly at student learning in the context of higher education. The measurement problems are complex, but given the public nature of the discussion now, pointing out the measurement problems directly, especially to those critics of higher education who believe we’ve not been held to account, creates the risk of them feeling we are running away from the problem rather than addressing it. There are things we can do to help in terms of accountability, but they really won’t address the issues of measuring learning. That should be the takeaway message from this piece.
On accountability, for example, we could publish the final course grade distributions of all courses with enrollment of at least 20. The aim would be to let the world (on campus and off) better understand how to interpret what a letter grade signifies from a rank ordering of student performance perspective, and perhaps to provide incentive to reduce grade inflation and thereby make the grading more informative in the sense of Blackwell. The minimum enrollment of 20 is there to indicate that we are still sensitive to student privacy issues and don’t want to give out students’ individual performance information without the students themselves authorizing such distribution. There is every reason for us to make our processes more transparent to all, and taking this step would be a move toward transparency.
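As a rough sketch of what such a publication rule might look like in practice, the Python below tallies letter grades per course and withholds any course below the enrollment floor. The course names, the grade lists, and the `publishable_distributions` helper are all hypothetical; only the floor of 20 comes from the proposal.

```python
from collections import Counter

MIN_ENROLLMENT = 20  # floor from the proposal, meant to protect student privacy


def publishable_distributions(grades_by_course, min_enrollment=MIN_ENROLLMENT):
    """Return letter-grade counts only for courses that meet the enrollment floor."""
    published = {}
    for course, grades in grades_by_course.items():
        if len(grades) >= min_enrollment:
            published[course] = dict(Counter(grades))
    return published


# Hypothetical final grades for two made-up courses.
grades_by_course = {
    "Large Lecture": ["A"] * 8 + ["B"] * 10 + ["C"] * 5 + ["D"] * 2,  # 25 students
    "Small Seminar": ["A"] * 6 + ["B"] * 2,                           # 8 students
}

print(publishable_distributions(grades_by_course))
```

Here only the large lecture's distribution would be released; the seminar is withheld because a distribution over 8 students comes too close to revealing individual grades.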
But addressing transparency in this way doesn’t make it any easier to measure student learning. And if Major League Baseball hasn’t fully solved the performance measure issues, how can anyone reasonably expect that we can?