Predicting The Best Picture Winner

With the Oscars being less than 36 hours away, it seems that practically every fan of cinema is making their predictions for the winners of each category. No longer being an undergraduate, I was experiencing a little Oscar withdrawal last night after not being able to take part in the annual Oscar debate with The Tiger.

The one thing I never understood when reading Oscarwatch, and other sites, is how people place so much weight on pre-Oscar awards (I'm talking SAG, Directors Guild, Golden Globes, etc.) but no one has ever taken the time to do some basic data analysis to really see which awards are the best predictors of Oscar glory.

So, I decided to take a little time and see if I might uncover anything interesting. You'll be surprised, to say the least.

Using the past five years of Oscar data, I ran a regression analysis to look for a model that predicts which movie might take home the Best Picture Oscar this year. (Please, please let The Departed win!)

There are a number of predictors I thought would be especially valid for predicting Oscar success.

Looking at every Oscar Best Picture nominee of the past five years, I collected data on the following categories:

The movie's pre-nomination box-office gross
Is the film's director also nominated for Best Director?
How many Oscar nominations does the film have among the best acting and screenplay categories?
Did the film win a Golden Globe for Best Comedy or Drama?
Did the film win the Screen Actors Guild (SAG), Directors Guild (DG), Producers Guild (PGA) or American Cinema Editors (ACE) award?

I collected this data for Oscars from 2002 to 2006. I then ran a regression analysis to see how each variable predicts Oscar success and if the variable is even a good predictor.

Linear regression                               Number of obs =      25
                                                F(  9,    15) =   50.97
                                                Prob > F      =  0.0000
                                                R-squared     =  0.8259
                                                Root MSE      =  .21547

                |               Robust
wonbestpic~e    |      Coef.   Std. Err.      t    P>|t|
----------------+---------------------------------------
boxoffice       |   .0000422   .0004562     0.09   0.927
bestdirector    |  -.0955407   .0630659    -1.51   0.151
otherOscar noms |   .0055256   .0269864     0.20   0.841
wonGGdrama      |   .0379726   .0940788     0.40   0.692
wonGGcomedy     |   .0054738   .1385837     0.04   0.969
wonSAG          |   .2675211   .1854943     1.44   0.170
wonDG           |   .8988662   .1319837     6.81   0.000
wonPGA          |  -.5649189   .2538095    -2.23   0.042
won ACE         |   .4189429   .2361956     1.77   0.096
_cons           |   .0135011   .1108538     0.12   0.905

So what does this tell us? First off, this model explains about 83 percent of the variation in which nominees won Best Picture (that's the R-squared of 0.8259). Secondly and surprisingly, many of the predictors that we all analyze so relentlessly mean very little for the outcome of the Best Picture Oscar. (P>|t| tells us how much evidence there is that a variable matters; smaller values mean stronger evidence. Because I'm only using five years of data, I'm treating anything greater than .2 as not a strong predictor. A P>|t| value of .1, for instance, means that if the variable truly had no effect, we would still see a coefficient this large about 10 percent of the time.)
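For anyone wondering where the t column comes from, here is a minimal Python sketch (not part of the Stata run) showing that each t value is just the coefficient divided by its standard error. The numbers are copied from the output above; the function name `t_statistic` is mine. Turning t into P>|t| needs the t distribution with 15 degrees of freedom, which I leave out here.

```python
# Coefficient and robust standard error for each variable,
# copied from the regression output above.
ROWS = {
    "boxoffice":       (0.0000422, 0.0004562),
    "bestdirector":    (-0.0955407, 0.0630659),
    "otherOscar noms": (0.0055256, 0.0269864),
    "wonGGdrama":      (0.0379726, 0.0940788),
    "wonGGcomedy":     (0.0054738, 0.1385837),
    "wonSAG":          (0.2675211, 0.1854943),
    "wonDG":           (0.8988662, 0.1319837),
    "wonPGA":          (-0.5649189, 0.2538095),
    "won ACE":         (0.4189429, 0.2361956),
    "_cons":           (0.0135011, 0.1108538),
}


def t_statistic(coef, std_err):
    """The coefficient measured in units of its own standard error."""
    return coef / std_err


t_stats = {name: round(t_statistic(c, se), 2) for name, (c, se) in ROWS.items()}

for name, t in t_stats.items():
    print(f"{name:16s} t = {t:6.2f}")
```

A large t (in either direction) means the coefficient is far from zero relative to its uncertainty, which is why wonDG at 6.81 stands out and boxoffice at 0.09 does not.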

Box office success, other Oscar nominations, and the Golden Globes mean practically nothing when we control for other pre-Oscar awards.

Also, you should know that the Coef. column shows whether the variable is positively or negatively correlated with winning a Best Picture Oscar, and how strong that relationship is. (The "Robust" label in the header applies to the standard errors, not the coefficients.) Larger numbers mean the variable is a better predictor. (If all the coefficients for a film add to 1.525, it essentially means a 100% chance of winning the Oscar.)

Knowing this, notice anything else unusual? Notice that negative sign next to the variable representing a PGA win? That's right, this analysis says that winning the Producers Guild award is NEGATIVELY correlated with winning the Best Picture Oscar. But how can that be right?

Just to take another look at this data, let's drop these variables and only use the SAG, DG, PGA and ACE to predict a Best Picture Oscar.

Linear regression                               Number of obs =      25
                                                F(  4,    20) =   46.88
                                                Prob > F      =  0.0000
                                                R-squared     =  0.8186
                                                Root MSE      =  .19048

                 |               Robust
Won Best Picture |      Coef.   Std. Err.      t    P>|t|
-----------------+---------------------------------------
wonSAG           |   .2477876   .1471156     1.68   0.108
wonDG            |   .8849558   .1086208     8.15   0.000
wonPGA           |  -.5132743   .2193453    -2.34   0.030
won ACE          |   .3982301   .2138609     1.86   0.077
_cons            |  -.0353982   .0241972    -1.46   0.159


Even when dropping these variables, the model loses almost no predictive power. This regression also confirms the predictive power of the SAG, DG, PGA, and ACE.
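If you'd like to check the four-award regression without Stata, the sketch below refits it by ordinary least squares in plain Python. It is a rough reconstruction, not the original Stata run. The award flags are taken from the nominee table later in this post. The won-Best-Picture flag is filled in from the five actual winners (Crash, Million Dollar Baby, Return of the King, Chicago, A Beautiful Mind). The helper `solve` and the variable names are mine. It reproduces the coefficients and R-squared only, not the robust standard errors.

```python
from fractions import Fraction

# One tuple per nominee: (SAG, DG, PGA, ACE, won Best Picture).
# Rows follow the order of the nominee table in this post, 2006 back to 2002.
NOMINEES = [
    (1, 0, 0, 1, 1), (0, 1, 1, 0, 0), (0, 0, 0, 0, 0), (0, 0, 0, 0, 0), (0, 0, 0, 0, 0),
    (0, 1, 0, 0, 1), (0, 0, 1, 1, 0), (0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (1, 0, 0, 0, 0),
    (1, 1, 1, 1, 1), (0, 0, 0, 0, 0), (0, 0, 0, 0, 0), (0, 0, 0, 0, 0), (0, 0, 0, 0, 0),
    (1, 1, 1, 1, 1), (0, 0, 0, 1, 0), (0, 0, 0, 0, 0), (0, 0, 0, 0, 0), (0, 0, 0, 0, 0),
    (0, 1, 0, 0, 1), (1, 0, 0, 0, 0), (0, 0, 0, 0, 0), (0, 0, 0, 0, 0), (0, 0, 1, 1, 0),
]


def solve(a, b):
    """Solve the square linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]


# Design matrix: the four award dummies plus a constant column.
X = [[Fraction(v) for v in row[:4]] + [Fraction(1)] for row in NOMINEES]
y = [Fraction(row[4]) for row in NOMINEES]
k = 5

# Normal equations (X'X) b = X'y, solved exactly with fractions.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
coefs = solve(xtx, xty)

fitted = [sum(c * v for c, v in zip(coefs, r)) for r in X]
mean_y = sum(y) / len(y)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

for name, c in zip(["wonSAG", "wonDG", "wonPGA", "won ACE", "_cons"], coefs):
    print(f"{name:8s} {float(c): .7f}")
print(f"R-squared {float(r_squared):.4f}")
```

With those 25 rows the fit comes out to exactly 28/113, 100/113, -58/113, 45/113 and -4/113, which are the same coefficients Stata printed, with an R-squared of 0.8186.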

That's right, the PGA continues to look like a huge kiss of death for any hopeful Oscar winner. But wait, there's more. The DG variable has a positive sign and a coefficient of .88 - that's practically a guarantee that the film that wins the DG award will win the Best Picture Oscar.

So, our model to predict best picture success is now:

Best Picture Winner Odds = SAG x .247 + DG x .88 + PGA x -.51 + ACE x .398 + -.035.
When a film wins the SAG, DG, PGA or ACE, substitute 1 for that value. If the film does not win that award, substitute 0.

For example, when applying this model to Crash, a film that won the SAG and ACE, it looks like this.

Best Picture Winner Odds = 1 x .247 + 0 x .88 + 0 x -.51 + 1 x .398 + -.035. This leads to:

Best Picture Winner Odds = .61. This is the highest point total achieved by a film nominated for Best Picture that year, so the model predicted Crash as the favorite in 2006.
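The same arithmetic as a small Python sketch, in case you want to plug in other films. One assumption to flag: the point totals in this post appear to have been computed with slightly less rounded coefficients for DG and PGA (.8849 and -.513) than the formula shows (.88 and -.51), so the sketch uses those. The function name `points` is mine.

```python
# Coefficients from the four-award regression, rounded the way the
# point totals in this post appear to have been computed (an assumption).
COEF = {"sag": 0.247, "dg": 0.8849, "pga": -0.513, "ace": 0.398}
INTERCEPT = -0.035


def points(sag, dg, pga, ace):
    """Model score for a film; each argument is 1 for a win, 0 otherwise."""
    wins = {"sag": sag, "dg": dg, "pga": pga, "ace": ace}
    return INTERCEPT + sum(COEF[award] * wins[award] for award in COEF)


crash = points(sag=1, dg=0, pga=0, ace=1)
brokeback = points(sag=0, dg=1, pga=1, ace=0)

print(f"Crash:              {crash:.4f}")      # 0.6100
print(f"Brokeback Mountain: {brokeback:.4f}")  # 0.3369
```

A film with no guild wins at all scores just the intercept, -0.035, which is why so many nominees share that value.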

So how does this model stack up to results from the past five years? Let's take a look. (Again, a 1 represents a win, and a 0 represents not winning that award.)

Movie Won SAG Won DG Won PGA Ace Points
Crash 1 0 0 1 0.61
Brokeback Mountain 0 1 1 0 0.3369
Good Night and Good Luck 0 0 0 0 -0.035
Capote 0 0 0 0 -0.035
Munich 0 0 0 0 -0.035
Million Dollar Baby 0 1 0 0 0.8499
The Aviator 0 0 1 1 -0.15
Finding Neverland 0 0 0 0 -0.035
Ray 0 0 0 1 0.363
Sideways 1 0 0 0 0.212
Return of the King 1 1 1 1 0.9819
Lost in Translation 0 0 0 0 -0.035
Master and Commander 0 0 0 0 -0.035
Mystic River 0 0 0 0 -0.035
Seabiscuit 0 0 0 0 -0.035
Chicago 1 1 1 1 0.9819
Gangs of New York 0 0 0 1 0.363
The Hours 0 0 0 0 -0.035
The Two Towers 0 0 0 0 -0.035
The Pianist 0 0 0 0 -0.035
A Beautiful Mind 0 1 0 0 0.8499
Gosford Park 1 0 0 0 0.212
In the Bedroom 0 0 0 0 -0.035
Fellowship of the Ring 0 0 0 0 -0.035
Moulin Rouge 0 0 1 1 -0.15

Looking at the predictions for each film, the model correctly predicts the winner of the Best Picture Oscar for every year, 2002-2006. It even predicted Crash over Brokeback Mountain, and it had Million Dollar Baby as the easy winner over The Aviator. This certainly provides more evidence that the PGA is a negative indicator of Oscar success, and the DG is a strong indicator.
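Here is a short Python sketch of that year-by-year check: score every nominee, take the top scorer in each year, and compare it with the actual winner. The film names and award flags come from the table above. The year labels are my reading of the table's order (five nominees per ceremony, 2006 first). The coefficients are the rounded ones the point totals appear to use. The names `points`, `NOMINEES` and `WINNERS` are mine.

```python
COEF = (0.247, 0.8849, -0.513, 0.398)  # SAG, DG, PGA, ACE
INTERCEPT = -0.035
NONE = (0, 0, 0, 0)  # a film that won none of the four awards


def points(flags):
    """Model score for a film given its (SAG, DG, PGA, ACE) win flags."""
    return INTERCEPT + sum(c * f for c, f in zip(COEF, flags))


NOMINEES = {
    2006: {"Crash": (1, 0, 0, 1), "Brokeback Mountain": (0, 1, 1, 0),
           "Good Night and Good Luck": NONE, "Capote": NONE, "Munich": NONE},
    2005: {"Million Dollar Baby": (0, 1, 0, 0), "The Aviator": (0, 0, 1, 1),
           "Finding Neverland": NONE, "Ray": (0, 0, 0, 1),
           "Sideways": (1, 0, 0, 0)},
    2004: {"Return of the King": (1, 1, 1, 1), "Lost in Translation": NONE,
           "Master and Commander": NONE, "Mystic River": NONE,
           "Seabiscuit": NONE},
    2003: {"Chicago": (1, 1, 1, 1), "Gangs of New York": (0, 0, 0, 1),
           "The Hours": NONE, "The Two Towers": NONE, "The Pianist": NONE},
    2002: {"A Beautiful Mind": (0, 1, 0, 0), "Gosford Park": (1, 0, 0, 0),
           "In the Bedroom": NONE, "Fellowship of the Ring": NONE,
           "Moulin Rouge": (0, 0, 1, 1)},
}

# Actual Best Picture winners for those ceremonies.
WINNERS = {2006: "Crash", 2005: "Million Dollar Baby",
           2004: "Return of the King", 2003: "Chicago",
           2002: "A Beautiful Mind"}

# The model's favorite is the highest-scoring nominee in each year.
favorites = {
    year: max(films, key=lambda name: points(films[name]))
    for year, films in NOMINEES.items()
}

for year in sorted(favorites, reverse=True):
    verdict = "hit" if favorites[year] == WINNERS[year] else "miss"
    print(f"{year}: model picks {favorites[year]} ({verdict})")
```

Keep in mind this is an in-sample check: the same 25 films were used to fit the coefficients, so five hits out of five says the model fits these years, not that it will hold up on new ones.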

Now, let's apply this data to Sunday's nominees.

Movie Won SAG Won DG Won PGA Ace Points
Babel 0 0 0 1 0.363
Little Miss Sunshine 1 0 1 0 -0.301
Letters from Iwo Jima 0 0 0 0 -0.035
The Departed 0 1 0 1 1.2479
The Queen 0 0 0 0 -0.035

Little Miss Sunshine takes a huge hit from winning the PGA, and The Departed gets a huge boost from winning the DG. Babel is still in the mix, although trailing The Departed. The Queen and Letters from Iwo Jima have no chance. (The Departed's score is greater than 1 due to its ACE tie with Babel. It's an unusual situation.)

Does Little Miss Sunshine really have the least chance of winning out of all of the nominees? There are certainly some problems with this model. There's a 12 percent chance of heteroscedasticity, which can lead to problems. There's also the problem of only using data from the last five years, but I'm not terribly concerned about this as I was looking for the latest Oscar trends. But based on the past five years of Oscars, it certainly looks like it's a two-horse race between The Departed and Babel.

Hope you've enjoyed this alternate analysis of the Oscar race. If others ask for it, I might perform a similar analysis for the directing and acting categories before Sunday night. Please leave your comments and suggestions.

Hope everyone enjoys the awards, and here's to The Departed taking home the gold!

Addendum: changed the terminology from "percentage" to "points." Sorry for the confusion, and thanks to Tim for the pointer.

Addendum 2: I went back to the 1998 Oscar (Titanic) and ran regressions on that data. Things became a lot more convoluted, and the model's predictive power dropped down to about 50%. Still pretty good, but not great. With the new data, the PGA award basically became useless, as it doesn't explain Oscar success at all when using data from 1998 to 2006. Box office revenues did become significant and with positive explanatory power, although not that significant. Every $15 million a movie makes gives it a small, small boost in its chances. Nominations in other acting and screenwriting categories did become significant and useful. For every screenwriting or acting nomination a film gets, it's about 7 percent more likely to win. I can post the full results tomorrow, if anyone desires.

What I take away from these results is that in the long run, it's much more difficult to predict Oscar success. However, I still think we can spot trends in human behavior, and the original model spots the trends over the last five years pretty well, so I'm going to stick with it. We'll see what happens tomorrow night.

Anonymous Anonymous said...

Hi. Thanks for doing this. Can you put together a similar analysis for all oscar winners going back to 1990 - I think that was the first year of the PGA. Thanks.

3:50 PM  
Blogger aurix said...

I did some math, going back three years, and it turned out the formula was right only once. It didn't predict a win for Gladiator in 2001 and Shakespeare in Love in 1999. It was right with American Beauty, though. But with American Beauty which won SAG, DG and PGA, that year was quite predictable. The years of Gladiator and Shakespeare in Love had tight races. According to the formula, Crouching Tiger Hidden Dragon (DG award) would've won over Gladiator (PGA + ACE). And Saving Private Ryan (DG+PGA+ACE) would have won over Shakespeare in Love (just SAG)

5:34 PM  
Blogger OKonheim said...

But you only used 5 years of data (2002-2006)?

5:41 PM  
Anonymous Anonymous said...

I love this, but what about the fact that Little Miss Sunshine won the ACE for Comedy/Musical? (Granted, it's not up for the editing Oscar.)

5:44 PM  
Anonymous Anonymous said...

Oh, just kidding--I forgot that went to Dreamgirls!

5:55 PM  
Blogger NATHANIEL R said...

wow. i know NOTHING about math and didn't understand most of this... but i still don't understand how Crash could be predicted by any formula since it had lower box office than any other winner in many years and it didn't win the DGA which this formula seems to favor.

so, no, still don't get it.

8:41 PM  
Blogger Adam said...


I'm in the process of collecting more data to go back to 1990. We'll see what happens then.


I did that analysis as well, and the model certainly doesn't hold up as well going back past 2002. But the model also doesn't apply past 2002, because I didn't use that data. The question is basically, should we include data going back to the 1990s, and look at Oscar trends for the past 15 years? Because right now the data examines the trend over the past 5 years, and it does a pretty decent job.

I think I'm going to go back and look at the 1990s data, but I don't think those results will predict any better for Sunday's results than what I have now, but they might predict better over the next 15 years. Does that make sense?

8:51 PM  
Blogger Adam said...


In regards to the box office of Crash, I tested whether box office results matter in predicting the Best Picture, and it turns out from 2002-2006 that box office had no explanatory power at all in regards to winners. Basically, money didn't matter. Although, Lord of the Rings could be skewing this, but the results are pretty strong.

Also, Crash won the SAG and ACE, which have very strong predictive power for Oscar winners. Brokeback won the dreaded PGA, which basically killed it, even though it won the Directors Guild.

Why does winning the PGA predict that a film won't win Best Picture? Maybe producers have different tastes than actors, directors and editors, and producers' tastes simply are not in line with the Academy's tastes.

Anyone have any other explanations/theories?

8:59 PM  
Anonymous Anonymous said...

You might want to check this:

11:45 PM  
Anonymous Pirolmaster said...

Hey, first of all I was very pleased to read this, because I'm very into these statistics things due to my studies. So it was fun, too.
I see you took the STATA program doing a simple OLS regression?!

So one question: How did you implement this in STATA?? Are there any specialties doing this? Or just using 1's and 0's for winners/losers the last 5 years ago?!

At least, I don't expect a wide-spread answer about handling a dataset, just a very short hint ;-)

But very nice done! And let's see if the power of statistics (and of my heart, too) could bring the win for Departed ;-)

5:20 AM  
Blogger Adam said...


I did just run a simple linear regression with STATA. Nothing too fancy at all.

In regards to using the data - the 1s and 0s are known as "dummy variables." A 1 is used to represent a yes, and 0 to represent a no, or in this case 1 equals a win and 0 equals a loss.

There isn't really a technical answer as to why I only used five years worth of data to begin with. I basically just wanted to look at the latest trends.

As I add more years of data, the model begins to reflect historical Oscar trends and not recent Oscar trends. There's a trade-off with the data: using many years gives a more stable model, but it loses the ability to concentrate only on the latest years.

It's a tradeoff I think I'm willing to make.

7:47 AM  
Blogger Duncan said...

Why not try these for fun?

1) Do 20% resampling.
2) Do a leave one out analysis.
3) Try your own evaluation model as follows: Go back ten years and use every other year's data to build your equation. Then use the other sequence for an evaluation data set. So use 97,99,01,03,05 for building and 98,00,02,04,06 for evaluating, or the opposite.
4) Use more than 5 years. Do you think concept drift is that strong?

These are interesting results (thanks) and those at
are as well (Babel predicted). I'd like to see someone go back to 1928. (Or at least as long as some of the other indicators existed!)

8:53 AM  
Anonymous Anonymous said...

R-squared is not predictive power, that's just fit to a line. If you put in 100 meaningless variables you would get a higher R-squared than with 2 real ones.

What is the standard error? That would show true power of prediction.

10:13 AM  
Blogger Adam said...

The standard error for each of the variables is listed in the Std. Err. column; the last column, the P>|t| stat, gives each variable's significance level. The SAG variable comes in right at the borderline acceptable significance of .1, but I decided to keep it. All other variables are statistically acceptable.

The regression has a P>F stat of 0.0, which looks great on my end. I don't see any reason to doubt the model's validity. But again, I don't have a PhD in economics or statistics.

10:41 AM  
Anonymous Anonymous said...

I don't think you can use linear regression here; instead, you need a logit or probit model because the dependent variable (Best Picture Win) is dichotomous - i.e., 0 or 1. You won't get p-values or R-squares; you'll get Wald statistics and pseudo R-squares instead. It's a far better model to use.

10:44 AM  
Blogger Adam said...


Interesting take on what kind of model to use. I'm not terribly familiar with logit models. But I'll consult a friend that has experience in this area and see what we can do.

10:54 AM  
Anonymous Anonymous said...

You tried this last year, Adam_S, at the Home Theater Forum and all but guaranteed a win for The Aviator. You thought Million Dollar Baby had no chance.

I see you've tweaked the formula so that Million $ Baby now "easily" beats The Aviator, but just as we all tried to tell you then, human beings don't behave in statistically predictable modes -- if they did, you would never see states switching back and forth between Republican and Democratic politicians. Emotion plays as big a factor as logic. This is why your analysis is wrong, and your attempt to reduce the Academy awards to a mathematical formula a waste of your time.

1:13 PM  
Blogger Adam said...

I'm afraid you're mistaken, as I have no clue what the Home Theater Forum is, let alone have I ever appeared at it. I'm not sure who you think I am but I'm sorry to tell you that I didn't even know how to run a regression in February of 2005.

Also, I could easily refute every single proposition you make with so much evidence, but I won't take the time. Please educate yourself before making such claims as "human beings don't behave in statistically predictable modes."

I'm sorry you didn't enjoy this little experiment - I was merely trying to take another angle at predicting the awards. Who knows if the trends will hold for tonight, but I thought it would be fun to play with the data. Hope you enjoy the show.

1:41 PM  
Anonymous Anonymous said...

Why don't you put up or shut up and run your little formula going back as far as you can? Or maybe you're afraid to.

3:23 PM  
Blogger Adam said...


When I get some time I'll be glad to run the regression all the way back to 1990 (when the PGA started, as I'm told). Also, please read the second addendum and you'll see the results from including data back to 1998.

Like I've said before, the results will be different from what I have now, but it won't necessarily refute the present model. The new model will tell us about success in indicators from 1990 - 2006. The current model tells us about variables from 2002-2006. It's hard to tell which model will be more relevant for tonight - maybe Academy voters are the same as they've been since 1990, or maybe they've changed their voting behavior.


4:46 PM  
Anonymous John C. King said...

Limiting the data set to the last five or so years may give a more accurate prediction. "The thinking" of the academy changes over time.

What would be interesting is coming up with predictors as to why the model doesn't predict "Shakespeare in Love" and "American Beauty". Kinda ad hoc I know but could lead to something interesting.

3:44 PM  
