01 February 2012

Critical Fallacy #13: Inconsistent Standards


Old system (until 2005), detailing individual judges' marks (1997 European Figure Skating Championships, Paris):
Meaning of the marks: 0 = not skated; 1 = very poor; 2 = poor; 3 = mediocre; 4 = good; 5 = very good; 6 = perfect and faultless
Average mark (mean) for Technical Merit: 5.73 | Standard Deviation for Technical Merit: 0.1
Average mark (mean) for Presentation: 5.83 | Standard Deviation for Presentation: 0.07
New International Judging System for Figure Skating:
Judging in figure skating is inherently subjective. Although there may be general consensus that one skater "looks better" than another, it is difficult to get agreement on what it is that causes one skater to be marked as 5.5 and another to be 5.75 for a particular program component. As judges, coaches, and skaters get more experience with the new system, more consensus may emerge. However, for the 2006 Olympics there were cases of 1-to-1.5-point differences in component marks from different judges. This range of difference implies that "observer bias" determines about 20% of the mark given by a judge. Averaging over many judges reduces the effect of this bias in the final score, but there will remain about a 2% spread in the average artistic marks from the randomly selected subsets of judges. (Wikipedia)
A Standard Deviation of 0.1 is what Figure Skating calls "subjective" or "observer bias"!
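As a rough sanity check (a sketch using only the published mean and standard deviation; the individual judges' marks aren't reproduced here), the skating spread can be expressed relative to the 6.0 scale and to the mean mark:

```python
# Skating spread from the 1997 Europeans figures quoted above,
# expressed relative to the 0-6 scale and to the mean mark.
marks = {
    "Technical Merit": (5.73, 0.1),   # (mean, standard deviation)
    "Presentation":    (5.83, 0.07),
}
for component, (mean, sd) in marks.items():
    print(f"{component}: {sd / 6.0:.1%} of the scale, {sd / mean:.1%} of the mean")
```

Both components land in the 1-2% range that Wikipedia calls "observer bias".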

* * *


The year-end poll (LA Weekly-Village Voice) works quite similarly to the old figure skating system. Except not really...
  • unlike everyone else, 47 reviewers out of 95 don't believe The Tree of Life is Top10 material (or didn't watch it)
  • 54/95 don't believe that A Separation is Top10 material (or didn't watch it)
  • 61/95 don't believe that Melancholia and Certified Copy are Top10 material (or didn't watch them)
  • 64/95 don't believe that Uncle Boonmee is Top10 material (or didn't watch it)
  • 69/95 don't believe that Mysteries of Lisbon and Margaret are Top10 material (or didn't watch them)
  • 70/95 don't believe that Drive is Top10 material (or didn't watch it)
  • 71/95 don't believe that Take Shelter is Top10 material (or didn't watch it)
  • 74/95 don't believe that Meek's Cutoff is Top10 material (or didn't watch it)
Yet they all end up in the final "Top10".
Films don't need to qualify to enter the "competition" (unlike at Cannes): EVERY film distributed in the USA is eligible (about 560 titles), left to the voters' discretion (though if reviewers were to rank a predetermined set of 10 titles... the result would be just as ridiculous). There are not 10 certified judges, but 95 random people who happen to get paid for their whimsical opinions in the press (including the infamous Dan Kois! lol). Voters didn't even see ALL the nominated titles that others are ranking as the best of the year. (See Presumptuous Best Film in the World)

#1 The Tree of Life: 48 mentions / 307 points
Average score (mean) [48 ballots, including 5 unranked ballots]: 6.4 pts out of 10 | Standard Deviation: 2.6
Average score (mean) [95 ballots, including the 47 ballots ignoring that film]: 3.2 pts out of 10 | Standard Deviation: 3.7
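The two averages follow directly from the published totals (a sketch; a film absent from a ballot counts as 0 points):

```python
# The Tree of Life: published totals from the poll above.
total_points = 307   # sum of points across all ballots
mentions = 48        # ballots that ranked the film
ballots = 95         # ballots cast in total

# Mean over the ballots that ranked it vs. mean over every ballot
# (the 47 ballots that ignored the film contribute 0 points each).
print(round(total_points / mentions, 1))  # 6.4
print(round(total_points / ballots, 1))   # 3.2
```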

Which means that the top-ranking contestant of the year was scored by only 48 of the 95 judges; half of the judges didn't even deem it worthy of a 10th place on their top 10. Which is ludicrous if we're electing the cream of the crop of a given year.
The maximum score would be 950, for a film ranked 1st on every ballot. The winner gets less than a third of the best possible score! The 10th film (Take Shelter) gets less than 1/7th of the perfect score. That's how wide the discrepancy runs within just 10 ranks. Basically it means there is no consensus at all (not even a 51% majority!!!) on which film (or which 10 films) should be elevated above the rest. Because reviewers don't know what they are judging, and only defend their arbitrary, whimsical, random, uneducated taste. This is not judging. This is not evaluating. This has nothing to do with a "consensual" poll.
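Those fractions check out against the 950-point ceiling (a quick sketch with the totals quoted above):

```python
# A perfect score: ranked 1st (10 pts) on all 95 ballots.
perfect = 95 * 10                 # 950

winner_share = 307 / perfect      # The Tree of Life, 1st place
tenth_share = 132 / perfect       # Take Shelter, 10th place

print(winner_share < 1 / 3)       # True: under a third of the perfect score
print(tenth_share < 1 / 7)        # True: under a seventh
```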

The standard deviation is huge! In figure skating it was 1 or 2% of the average mark; here it is between 40 and 115% of the average score!!! Meaning that movie reviewers can't agree on the estimation of the one "best film of the year". To me, it spells a total absence of a common value standard, and therefore invalidates the very purpose of tabulating a CONSENSUAL poll in the first place... If voters don't use a common standard to pick contenders, adding up the votes will NEVER produce a significant result!!! Just deal with it, embrace your subjectivity, and stick to publishing individual ballots then.
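The "40 to 115%" range is simply the standard deviation taken relative to the corresponding average (a sketch using the figures quoted above):

```python
# Relative spread (SD / mean) for each set of figures quoted above.
spreads = {
    "skating, Technical Merit": 0.1 / 5.73,  # ~1.7%
    "poll, 48 ranking ballots": 2.6 / 6.4,   # ~41%
    "poll, all 95 ballots":     3.7 / 3.2,   # ~116%
}
for label, ratio in spreads.items():
    print(f"{label}: {ratio:.0%}")
```

Two orders of magnitude separate the skating judges' agreement from the film reviewers'.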
If half the voters pick their favourite apple and the other half pick their favourite orange... the winner will be either the popular apple or the popular orange, but either way half the voters will count for nothing in the final result. This is not a consensus. It looks more like a polarized presidential election than a convergent critical evaluation.

The cynics will say that rating figure skating is easier, that cinema is an art and not a sport, and thus even more sensitive to "subjective evaluation". Yet Olympic judging is harder than reading the photo finish in horse racing, precisely because all contestants are very close in technical skill (figure skating qualifications have very high standards, unlike populist movie competitions).

To be honest, a Top10 (scoring from 1 pt for 10th place to 10 pts for 1st place) is not equivalent to the figure skating system, which starts at zero for failure. A 10th spot on a movie Top10 is not failure, but it could get the film ranked as low as 154th on the final list, which is near the bottom of the top third of ALL FILMS OF THE YEAR (basically, one guy thought that film was Top10 material and it ends up barely better than 66% of the 560-some films of 2011). We're not really dealing with the cream of the crop here, at least not by the evaluation standards of this panel... Anyway, the difference with figure skating is that the final score fluctuates between 132/950 (10th rank) and 307/950 (1st rank), instead of between 5.6/6.0 and 5.9/6.0 for skaters (a single skater in the example above, but representative of the range of marks in the final stage of the competition).

J. Hoberman (who didn't vote for The Tree of Life) sees this poor result as a validation of his own taste, although all the other films (much less controversial) received an even weaker consensus! Comforted by the fact that he didn't want Malick to win, the article he wrote to comment on the result of a poll his own newspaper organized is an apologetic message to his readers for such a "high-brow" film topping their poll... Or to pander to his broadest readership, should I say.

My point is that competent sport judges roughly agree on which contestant is great and which one isn't podium-worthy, while in film culture, people who posture as "judges certified to grant year's-best rankings" may find the same film either a masterpiece or a total failure. This is not what should happen when evaluating art. Respected art critics agree more or less on the best artwork, and nitpick with as much subtlety as figure skating judges (minimal standard deviation). They might argue about the respective rankings within a top 10, about which top10-worthy film to rank above another.

I don't think anyone should agree 100% on which single film is THE best of the year, or THE best ever... because picking only one over other comparably worthy contenders is too difficult (be it The Tree of Life or Citizen Kane...). We don't ask critics for such exactitude. But when limited to the few dozen praiseworthy films distributed in a single year, electing the 10 best ones should produce a rough consensus around 15 or 20 contenders at least (not 160 possible candidates depending on the voter's point of view). Is cinema evaluation just as random as picking 10 titles from a hat? It would seem so when we look at these results...
Regularly we read editorials saying that there aren't ENOUGH "good movies" each year to make a satisfying line-up at Cannes or any other major festival... yet at the end of the year, not even 50% of the voters can agree on which ones those few award-worthy titles are! WTF?

And I'm not even factoring in the lackluster line-up of the American distribution schedule, or the titles elected to this final Top10 (which are not necessarily my own idea of a "Best of the Year")...


1 commentaire:

HarryTuttle said…

"In this episode, David Chen, Devindra Hardawar, and Adam Quigley chat with Joanna Robinson about what the difference is between their “favorite” and the “best” films of the year, analyze the hollow phenomenon of Chuck Norris, and discuss the relative quality of Superman Returns."
The Favorite and The Best (29 Jan 2012; The /Filmcast; After Dark; Ep. 173) [MP3] 58'