The Brownlow Downlow.
Predicting the Most Valuable Player of the Australian Football League.
Introduction
In the Australian Football League, the Brownlow Medal is awarded to the most valuable player for the year. At the end of each game, the field umpires award "votes" to the players they regard as the best performers for that game: 3 votes to the best player, 2 votes to the 2nd best player and 1 vote to the 3rd best player. At the end of the regular season, these votes are revealed and the Brownlow Medal is awarded to the player with the most votes.
Results in brief

We've established a rock-solid baseline using a balanced bagging ensemble with a random forest as our base model. Trained and validated on 2002-2014 data, the model correctly predicted 5 of the top 7 vote-getters in the 2015 test data. There are definitely some outliers in there, but it provides a good benchmark going forward as we add better validation methods, try new approaches and integrate with betting market outcomes.
Player | Team | Predicted votes | Actual votes | Predicted rank | Actual rank |
---|---|---|---|---|---|
Sam Mitchell | HW | 26.7 | 26 | 1 | 3 |
Nat Fyfe | FR | 26.2 | 31 | 2 | 1 |
Andrew Gaff | WC | 26.0 | 17 | 3 | 12 |
Dan Hannebery | SY | 24.6 | 24 | 4 | 5 |
Matt Priddis | WC | 23.9 | 28 | 5 | 2 |
Lachie Neale | FR | 23.6 | 10 | 6 | 32 |
Josh Kennedy | SY | 22.3 | 25 | 7 | 4 |
Scott Thompson | AD | 21.0 | 12 | 8 | 25 |
David Mundy | FR | 20.9 | 19 | 9 | 8 |
Trent Cotchin | RI | 20.7 | 17 | 10 | 12 |
Framing the problem
Predicting the winner of the Brownlow medal can best be viewed as a "learning to rank" problem. Given a game (with 2 teams of typically 22 players), we want to predict the 1st, 2nd and 3rd best players.
It combines elements of classification, in that we have discrete target variables of "1st (3 votes)", "2nd (2 votes)", "3rd (1 vote)" and "0 votes", with regression, in that these categories still have a value and a meaningful, ordered relationship with one another.
Furthermore, given the value of these models to betting (in Australia, sports wagering is legal and AFL/Brownlow markets are highly liquid), we don't simply want to assign hard ranks to players for each game. It would be ideal if we could model "soft" rankings with uncertainty, to establish implied betting odds and expected return on investment for various markets.
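To make that concrete, a predicted probability can be turned into fair odds and an expected return against a quoted market price. A minimal sketch, where the probability and market price are invented purely for illustration:

```python
def fair_decimal_odds(p: float) -> float:
    """Fair (break-even) decimal odds implied by a win probability."""
    return 1.0 / p

def expected_return(p: float, market_odds: float) -> float:
    """Expected profit per unit staked at the quoted decimal odds."""
    return p * market_odds - 1.0

# Hypothetical example: we estimate a 20% chance a player polls the most votes,
# and a bookmaker quotes decimal odds of 6.50.
p_win = 0.20
print(fair_decimal_odds(p_win))       # 5.00 -> any longer price is value
print(expected_return(p_win, 6.50))   # 0.30 -> +30% expected return per unit staked
```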
Approaches
Ordinal regression

While modelling this partial ordering can be done directly using ordinal regression models, there are 3 key drawbacks to these models for our problem:
- There is no constraint on multiple players in a given game being awarded a specific vote (e.g. 2 players could be assigned "3 votes" during inference).
- For our data, the "1 vote" category, and to a slightly lesser extent the "2 vote" category, are incredibly noisy. Intuitively, this makes sense, as the "best player" is usually much easier to identify or pick than the "third best player".
- Some approaches (SVMs) assign hard votes, while others (logistic regression) don't have easily interpretable soft votes; simply multiplying each vote value by the probability of a player obtaining that vote leads to similar issues to the first drawback above.
Binary classification on "3 votes"
Another option would be to train a model only on the "3 votes" category:
- Train some probabilistic classification model
- Infer probabilities of "3 votes" classification for each player in a given game
- Map the highest probability to "3 votes", the 2nd highest to "2 votes", the 3rd highest to "1 vote" and the rest to "0 votes" (see the sketch after this list)
- Select model and hyperparameters according to relevant metrics on the validation set (e.g. rank MSE, F1-score for each class and globally)
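A minimal sketch of that probability-to-vote mapping within a single game; the dataframe, column names and `assign_votes` helper are assumptions for illustration rather than the exact pipeline:

```python
import pandas as pd

def assign_votes(game_df: pd.DataFrame, prob_col: str = "p_three_votes") -> pd.DataFrame:
    """Map the three highest P(3 votes) in a game to 3/2/1 votes and everyone else to 0."""
    out = game_df.copy()
    order = out[prob_col].rank(method="first", ascending=False)  # 1 = highest probability
    out["predicted_votes"] = order.map({1: 3, 2: 2, 3: 1}).fillna(0).astype(int)
    return out

# Applied game by game, e.g.:
# predictions = players.groupby("game_id", group_keys=False).apply(assign_votes)
```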
There are two key drawbacks to this approach:
- Votes are hard assigned (i.e. no implied uncertainty)
- Class imbalance may impact performance of models - only 1 in ~44 players (~2%) is in the "3 vote" category for each game
While there are a number of ways to combat these drawbacks (some of which I believe are superior to the approach below and will explore at a later date), I propose a fairly simple solution below to establish a solid baseline.
A bagging-based approach
While there are many methods to combat class imbalance, a convenient solution for our problem is a balanced bagging approach (Wallace, Small, Brodley and Trikalinos).



This allows us to reduce the variance of individual bagged models by taking the average of predicted votes, while also using the entire set of predictions as a proxy for understanding the distribution of possible vote outcomes (a code sketch follows these steps):
- Use 2015 as test data, with 2002-2014 as training and validation data.
- Split train/val 0.75/0.25 by game (this is key to inferring ranks).
- Within the training data, train 100 models on 100 bootstrap samples on the binary classification of "3 vote", where the majority class (non-3 vote) is undersampled.
- On the validation data, infer votes based on the ranks of P(3 vote) within each game.
- Choose best model and hyperparameters based on validation set performance.
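A minimal sketch of the bagging loop above; the dataframe `train`, the binary label `is_three_votes` and `feature_cols` are assumed names, and the real pipeline also handles the game-level train/validation split described above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

N_MODELS = 100
rng = np.random.default_rng(42)

def balanced_bootstrap(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Bootstrap the minority ("3 vote") class and undersample the majority class to match it."""
    minority = df[df[label] == 1]
    majority = df[df[label] == 0]
    boot_minority = minority.sample(len(minority), replace=True, random_state=rng)
    boot_majority = majority.sample(len(minority), replace=True, random_state=rng)
    return pd.concat([boot_minority, boot_majority])

models = []
for _ in range(N_MODELS):
    bag = balanced_bootstrap(train, "is_three_votes")
    clf = RandomForestClassifier(n_estimators=30, max_depth=30, max_features="sqrt",
                                 min_samples_split=10, min_samples_leaf=1)
    clf.fit(bag[feature_cols], bag["is_three_votes"])
    models.append(clf)

# Each model then yields P(3 votes) per player on the validation games; ranking those
# probabilities within a game gives that model's 3/2/1/0 assignment (see assign_votes above),
# and averaging the assignments across the 100 models gives the predicted votes.
```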
Input variables
We used commonly available player statistics for each game - possessions (kicks/handballs, contested/uncontested), marks, goals, behinds, goal assists, one percenters etc. All individual stats were normalized within a game, since the umpires judge these stats in the context of that game. We also included whether a player's team won the game or not, again normalized within each game for consistency. Normalizing also let us fairly compare models that are sensitive to feature scaling against those that are not.
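A minimal sketch of the within-game normalization, assuming a dataframe `stats` with a `game_id` column and raw per-player stat columns (the column names are illustrative):

```python
import pandas as pd

stat_cols = ["kicks", "handballs", "contested_possessions", "marks", "goals", "goal_assists"]

def zscore(col: pd.Series) -> pd.Series:
    """Standardize a stat relative to all players in the same game."""
    # (a stat that is constant within a game would need a guard against division by zero)
    return (col - col.mean()) / col.std(ddof=0)

# transform applies the function to each stat column separately within each game
normalized = stats.groupby("game_id")[stat_cols].transform(zscore)
```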
Metrics
After inferring votes, we examined learning curves as well as confusion matrices to assess the performance of our models.

While the above learning curve was for our optimal random forest model, the same general trend was true for all models examined: better accuracy and lower variance than individual undersampled bootstrap models.

Similarly, the general trend in confusion matrices was the same: models performed well for "0 votes", reasonably well for "3 votes", but not well for "1 vote" and "2 votes". This is something we can definitely improve on in the future.
While this is useful for examining the performance of individual models (monitoring bias/variance, accuracy), it is challenging to compare many different models and hyperparameter settings at once. We instead chose summary statistics closely related to the above (sketched after the list):
- Vote MSE by vote category and on average (arithmetic, geometric and global)
- F1-score by vote category and on average (arithmetic, geometric and global)
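A minimal sketch of those summary statistics, assuming arrays of actual and hard-assigned predicted votes (values in {0, 1, 2, 3}) for every player in the validation games; for the averaged soft votes only the MSE terms apply:

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

VOTE_LEVELS = (0, 1, 2, 3)

def vote_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Vote MSE and F1-score by category, plus arithmetic, geometric and global averages."""
    mse_by_vote = {v: mean_squared_error(y_true[y_true == v], y_pred[y_true == v])
                   for v in VOTE_LEVELS}
    f1_by_vote = dict(zip(VOTE_LEVELS,
                          f1_score(y_true, y_pred, labels=list(VOTE_LEVELS), average=None)))
    mse_vals = np.array(list(mse_by_vote.values()))
    f1_vals = np.array(list(f1_by_vote.values()))
    return {
        "mse_by_vote": mse_by_vote,
        "mse_arithmetic": mse_vals.mean(),
        "mse_geometric": float(np.exp(np.log(mse_vals + 1e-12).mean())),
        "mse_global": mean_squared_error(y_true, y_pred),
        "f1_by_vote": f1_by_vote,
        "f1_arithmetic": f1_vals.mean(),
        "f1_geometric": float(np.exp(np.log(f1_vals + 1e-12).mean())),
        "f1_global": f1_score(y_true, y_pred, average="micro"),
    }
```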
Model validation
We performed a hyperparameter grid search across a number of classification models with a probabilistic output (required for game ranking and votes): Gaussian Naive Bayes, Logistic Regression, Decision Trees and Random Forests. In the future it would be great to incorporate SVMs by ranking players within a game by distance to the separating hyperplane, but given we are just setting a baseline right now, it can wait. Due to resource constraints (i.e. my MacBook Air), we used a reduced 30 bootstrap samples to compare models. The top two performing models on the validation set are listed below, with a sketch of the search grid after the list:
- Logistic Regression with L2 penalty and C = 1
- Random Forest with 30 trees, max depth of 30, sqrt max features, min samples per split of 10, min samples per leaf of 1, bootstrap on
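A minimal sketch of the kind of grid that was searched for the random forest; the candidate values are illustrative rather than the exact grid, and each candidate is scored with the validation-set vote metrics above rather than a built-in scorer:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid

rf_grid = {
    "n_estimators": [10, 30, 100],
    "max_depth": [10, 30, None],
    "max_features": ["sqrt"],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

candidates = [RandomForestClassifier(**params) for params in ParameterGrid(rf_grid)]
candidates.append(LogisticRegression(penalty="l2", C=1.0, max_iter=1000))

# Each candidate base model is plugged into the bagging loop (30 bootstrap samples here),
# votes are inferred on the validation games, and the combination with the best
# vote MSE / F1 summary statistics is kept.
```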


We ultimately chose the random forest model. While it generally underperformed in the F1 score (i.e. it had a higher misclassification rate), it outperformed in MSE. This implies that even though the RF model had more misclassifications in aggregate, the misclassifications were, on average, closer to the ground truth (especially on 1 and 2 votes).
Test set results


Player | Team | Predicted votes | Actual votes | Predicted rank | Actual rank |
---|---|---|---|---|---|
Sam Mitchell | HW | 26.7 | 26 | 1 | 3 |
Nat Fyfe | FR | 26.2 | 31 | 2 | 1 |
Andrew Gaff | WC | 26.0 | 17 | 3 | 12 |
Dan Hannebery | SY | 24.6 | 24 | 4 | 5 |
Matt Priddis | WC | 23.9 | 28 | 5 | 2 |
Lachie Neale | FR | 23.6 | 10 | 6 | 32 |
Josh Kennedy | SY | 22.3 | 25 | 7 | 4 |
Scott Thompson | AD | 21.0 | 12 | 8 | 25 |
David Mundy | FR | 20.9 | 19 | 9 | 8 |
Trent Cotchin | RI | 20.7 | 17 | 10 | 12 |
What's next?
- A more comprehensive validation process. Given that voting trends change over time, it is likely that some voting signals decay. We will compare different sliding lookback windows on the training/validation set.
- Roughly balanced bagging, a method more closely aligned with the original bagging technique, which samples the number of majority-class examples per bag from a negative binomial distribution (sketched after this list).
- Single-model approaches (gradient boosting, neural nets) and simulation-based approaches (e.g. a generalization of the Gumbel-max trick appears promising).
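A minimal sketch of the roughly balanced sampling step, assuming the same dataframe layout as the bagging loop above; in contrast to exact undersampling, the bags are only balanced in expectation:

```python
import numpy as np
import pandas as pd

def roughly_balanced_bag(df: pd.DataFrame, label: str, rng: np.random.Generator) -> pd.DataFrame:
    """Keep a bootstrap of the minority class; draw the majority-class count from a
    negative binomial so bags are balanced on average rather than exactly."""
    minority = df[df[label] == 1]
    majority = df[df[label] == 0]
    n_majority = rng.negative_binomial(len(minority), 0.5)  # mean equals the minority count
    return pd.concat([
        minority.sample(len(minority), replace=True, random_state=rng),
        majority.sample(n_majority, replace=True, random_state=rng),
    ])
```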