fantasy league optimization

published on 2019-04-16

The European cycling season is in full swing, which for many means competing in various cycling fantasy leagues, with the main event being, of course, the Tour de France.

A fantasy league is a competition, often held between coworkers or friends, where participants compose a team of players for a certain sports season or event. The goal is to acquire as many points as possible determined by the league's score system while adhering to its constraints, most often in the form as a budget cap.

This is the first in a series of posts that will guide you through the various facets of designing a good fantasy league team. They generally won't introduce any particular strategies or models ­— sports are much too complex and varied to be captured in a single model — but rather present a number of criteria of a good model. The focus will be on cycling to keep things somewhat concrete, but most of the guidelines will apply to other sports as well.

Designing good fantasy leagues is a fascinating endeavor. It involves combining disciplines ranging from sports knowledge, team dynamics and the latest news on fitness to optimization, statistics and modeling. It will be valuable to be familiar with at least some of these subjects. Since they are so diverse, we'll aim our attention at a single aspect at a time, starting with optimization.

playing with a crystal ball

At heart, fantasy leagues are a kind of optimization problem: the objective is to maximize some league score function by assigning a riders to a team under a number of constraints. We're usually required to choose a team before the event starts and no points have been awarded. But before we tackle that problem, let's see how to solve it if the results were known in advance. Even with such perfect information, we'll see it's not exactly trivial.

Consider for example a Tour de France fantasy league where all participating riders have known (and fixed!) salary demands. Riders are awarded points at the end of the event for their individual accomplishments such as winning stages or finishing in Paris wearing the polka dot jersey. The total score for a team is determined by the sum of points awarded to the riders in the team under the constraint that the sum of the riders' salaries stays within a fixed budget.

Selecting the highest scoring team under these constraints is equivalent to the knapsack problem, a well studied problem in combinatorial optimization. Given a set of items, each given a weight and a value, the objective of the knapsack problem is to choose a subset of items to put in a knapsack that maximizes the total value while staying under a given weight limit. By perceiving riders with a known score and salary as objects with values and weights respectively, we can translate the team selection problem to a knapsack problem. This helps bring things into the realm of mathematical optimization, which provides many tools for handling such well studied problems.

If we add more constraints, like a limit on the number of riders in a team, the problem no longer fits the mold of the knapsack problem. We can still model it using the more general method of integer programming. The idea is to represent every rider by a unique number, say 11 through nn, and then assign a 11 to that number if the rider is in our team and 00 otherwise. This can be seen as a list of length nn containing ones and zeros. Let's refer to this binary list using the symbol aa; so when rider ii is part of our team, we write ai=1a_i = 1. These are the variables that we are optimizing for. Further, every rider ii is assigned a value viv_i and a cost cic_i. The corresponding integer program is commonly written like this:

maximizei=1nviaisubject toi=1ncj,iaibj,j=1,...,mai{0,1}.i=1,...,n \begin{aligned} \text{maximize} ,, &\displaystyle\sum\limits_{i=1}^{n} v_{i} a_{i} & &
\text{subject to} ,, &\displaystyle\sum\limits_{i=1}^{n} c_{j,i} a_{i} \leq b_j, & &j=1 ,..., m
& a_{i} \in {0,1}. & &i=1 ,..., n \end{aligned}

It looks intimidating, but it is a direct translation of our team selection problem. On the bottom line is the requirement that every element in our list must be either 00 or 11. This seems sensible: a rider is either in our team or not. The top line specifies the objective function. It is sum over the products viaiv_i a_i, and because every aia_i is either zero or one, it is really summing up the values of the riders in our team! The middle line specifies the constraints. The budget constraint is for example given by aicib\sum a_i c_i \leq b, with bb being the team budget. Other limits, like a cap on the number of riders from the same team or the number of sprinters, can be encoded in a similar fashion.

Solving integer programs and computing optimal knapsacks is NP-hard, which means that there is no known algorithm that can solve the general case in reasonable time. In theory, it could take centuries to compute the optimal team from hundreds of riders, but usually the set of interesting riders is so small that a common laptop can do it in a few seconds.

Let's apply this approach for a fantasy league that is already underway. The velogames spring classics league is a good fit since it has typical cycling league rules. Scores are determined by the sum of the riders' points, and there is a limit on the team's cost. Additionally, there can be no more than nine riders to a team. We can encode this by adding the constraint ai19\sum a_i \cdot 1 \leq 9 to our integer program. Again, since every aia_i is either zero or one, this effectively counts the number of riders in our team, which is exactly what we need.

As with essentially all things that could conceivably be computed, there exists a python package that enables us to do this with only a few lines of code. Below is an annotated python script that uses the cvxpy library to find the optimal team using integer programming. The costs and scores for riders were taken from the velogames standings page on April 10, 2019.

import pandas
import cvxpy
import numpy as np

# Load data
table = pandas.read_csv('data.csv')
weights = table['Cost'].values
values = table['Points'].values
rider_count = len(weights)

# The variable we are solving for: a binary vector whose length
# is given by the total number of riders
selection = cvxpy.Variable(rider_count, boolean=True)

# The total cost should be no more than 100, and the team
# cannot contain more than 9 riders
cost_constraint = weights * selection <= 100
size_constraint = np.ones(rider_count) * selection <= 9

# Our total utility is the sum of the item utilities
total_utility = values * selection

# We tell cvxpy that we want to maximize total utility 
# subject to cost_constraint and size_constraint
knapsack_problem = cvxpy.Problem(
    cvxpy.Maximize(total_utility),
    [cost_constraint, size_constraint]
)

# Solve the problem
best_score = knapsack_problem.solve()

# Retrieve rider indices
indices = [i for i, val in enumerate(selection.value) if val >= 0.9] 

# Display the dream team and its score
print(table.ix[indices])
print(best_score)

It produces the following team:

Rider Team Points Cost
Alexander Kristoff UAE-Team Emirates 335 16
Wout Van Aert Team Jumbo-Visma 305 14
Oliver Naesen AG2R La Mondiale 435 14
Bob Jungels Deceuninck - Quick Step 280 14
Mathieu Van Der Poel Corendon - Circus 395 12
Nils Politt Team Katusha - Alpecin 195 8
Kasper Asgreen Deceuninck - Quick Step 250 8
Alberto Bettiol EF Education First Pro Cycling Team 310 6
Anthony Turgis Direct Energie 145 4
2650 96

This team consists of exactly 9 riders and has a cost of 96, which makes it a valid solution. Somewhat surprisingly, its total score of 2650 is only slightly better than that of the leader on the boards at the time of writing, who had 2530 points without any posterior information. Impressive!

what's next

We've seen how to optimally design a team when rider performance is known. This alone isn't enough to create a good fantasy league team for the upcoming Tour de France, but it can be used as a building block in models for the case without perfect information. One possible starting point would be to estimate riders' expected performance and to then use those expected values as weights in an integer program.

In future posts, we may look at how to maximize fantasy league win chances in the presence of uncertainty. It will involve not only predicting rider performance, but also dissecting the score function, computing correlations between riders, and analyzing the competition. That's where things get really interesting!