Machine Learning Model Updates: 12/19
As mentioned in previous posts, we have successfully created a model to predict the results of these matches. While we haven’t yet included match scores to check those odds, we are much more comfortable with our model’s ability to predict outcomes. That said, last week’s model definitely still needed improvement. The #1 issue I’m facing is predicting the draw, where our model only has around a 3% success rate. Logically, that makes sense: draws are very random, and usually happen when something goes very wrong for the favorite. Still, 27% of English Premier League matches end in a draw, so it would be incredibly valuable to properly predict those outcomes. If I wanted to drastically improve my accuracy I would simply collapse the result into win vs. loss/draw, but I’m interested in predicting the draw as well.
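For illustration, here is a minimal sketch of what that win vs. loss/draw collapse could look like in R. The `matches` data frame and the levels of its `result` column are assumptions for the example, not the actual coding used in the notebook.

```r
library(dplyr)

# Hypothetical match data; in the real model `result` comes from the CSV.
matches <- tibble(result = c("W", "D", "L", "W", "D"))

# Collapse the three-way outcome into win vs. loss/draw. A forest trained
# on this binary target only has to separate two classes, which boosts
# accuracy, but at the cost of giving up on the draw entirely.
matches <- matches %>%
  mutate(result_binary = factor(if_else(result == "W", "win", "loss_or_draw")))
```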
After making my picks for this weekend, I set to work improving the model. There were a few things I looked into, all of which I will detail below. This was a painstaking process that oftentimes led to me adjusting a lot of code only to make the model worse.
Historical Data: While it is important to realize that last season’s results can be very different from this season’s, the extra occurrences give our model more detail to take into account. I pulled the ’22 and ’23 seasons from FootyStats, appended them to the end of my CSV, coded a variable for the season, and I was good to go.
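A minimal sketch of that step in R; the file names and column layout are assumptions standing in for the actual FootyStats exports.

```r
library(readr)
library(dplyr)

# Hypothetical file names standing in for the FootyStats exports.
current <- read_csv("epl_current.csv") %>% mutate(season = "current")
s22     <- read_csv("epl_2022.csv")    %>% mutate(season = "2022")
s23     <- read_csv("epl_2023.csv")    %>% mutate(season = "2023")

# Stack the seasons into one training set, keeping `season` as a factor
# so the model can account for differences between seasons.
epl <- bind_rows(current, s22, s23) %>% mutate(season = factor(season))
```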
Engineered Variables: I had already created the recent-goals-per-match statistic for the original model, using a team’s average goals over its last 5 matches, but I needed more. I used the same code to pull recent opponent goals, recent net goals, and recent possession percentages. These variables provided no value at first; they even made my model slightly worse, though given our confidence interval that could just be chance. It was clear that we needed more.
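To make the idea concrete, here is a sketch of one of those rolling-window features in R. The data frame, column names, and numbers are assumptions for the example, not the actual code from the notebook.

```r
library(dplyr)
library(zoo)

# Hypothetical long-format match log: one row per team per match,
# ordered by date, with goals scored and conceded.
log <- tibble(
  team     = rep(c("Arsenal", "Spurs"), each = 6),
  goals    = c(2, 1, 3, 0, 2, 4, 1, 1, 0, 2, 3, 1),
  conceded = c(0, 1, 1, 2, 2, 0, 2, 1, 1, 0, 1, 3)
)

# For each team, average the previous 5 matches. lag() shifts the window
# back one match so the feature never leaks the current match's result.
log <- log %>%
  group_by(team) %>%
  mutate(
    recent_goals    = dplyr::lag(rollapplyr(goals, 5, mean, partial = TRUE)),
    recent_conceded = dplyr::lag(rollapplyr(conceded, 5, mean, partial = TRUE)),
    recent_net      = recent_goals - recent_conceded
  ) %>%
  ungroup()
```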
Data from Other Sources: I decided to add some data from https://www.transfermarkt.us/ to my dataset, pulling each team’s average age and player cost, as well as the total value of the squad. Some of these variables were useless enough to be removed from the model completely, but in combination with the additional engineered variables, this data provided great value to the accuracy of my model.
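The join itself is simple; here is a sketch, with the caveat that the Transfermarkt column names and the figures are placeholders. In practice, team-name spellings rarely match exactly between sources and need to be reconciled before joining.

```r
library(dplyr)

# Hypothetical team-level table based on Transfermarkt-style figures.
transfermarkt <- tibble(
  team        = c("Arsenal", "Spurs"),
  avg_age     = c(24.9, 26.1),
  squad_value = c(1100, 800)  # in millions; illustrative numbers only
)

# Hypothetical match rows keyed by team name.
matches <- tibble(team = c("Arsenal", "Spurs", "Arsenal"))

# Attach the team-level features to every match row for that team.
matches <- matches %>% left_join(transfermarkt, by = "team")
```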
Trees and Tries: The number of trees and the number of tries are incredibly important for the accuracy of your model. As far as I know, choosing them is more of an experimental process, which can be annoying. The number of trees (ntree=) is how many separate decision trees the forest grows and averages over; the number of tries (mtry=) is how many randomly chosen variables each split considers as candidates.
NOTE: As the data size and ntree increase, the random forest model can take much longer to train, and more trees do not always improve accuracy. In my experience with this particular model, going up to 1000 trees took longer to process and reduced accuracy by about 1%, so I decided to stay at 500.
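Here is a minimal sketch of fitting and comparing these parameters, assuming the randomForest package and a factor outcome called `result`; the training frame and feature names are placeholders, not the actual model formula.

```r
library(randomForest)

# Hypothetical training frame: a factor outcome plus numeric features.
set.seed(42)
train <- data.frame(
  result       = factor(sample(c("W", "D", "L"), 200, replace = TRUE)),
  recent_goals = runif(200, 0, 3),
  recent_net   = runif(200, -2, 2),
  squad_value  = runif(200, 100, 1200)
)

# ntree = number of trees grown; mtry = variables tried at each split.
# The out-of-bag (OOB) error estimate gives a quick read on accuracy
# without needing a separate validation set.
fit <- randomForest(result ~ ., data = train, ntree = 500, mtry = 2)
print(fit)  # the confusion matrix shows how badly draws are predicted

# Comparing tree counts: more trees cost time and, past a point, gain
# little or nothing.
for (n in c(100, 500, 1000)) {
  f <- randomForest(result ~ ., data = train, ntree = n, mtry = 2)
  cat("ntree =", n, " OOB error =", round(f$err.rate[n, "OOB"], 3), "\n")
}
```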
I’m currently much happier with my model, though testing will be the only way to properly check it. As soon as I figure out how, I will upload a table showing my pick history with a unit count. I am still considering whether to size the unit count with the Kelly Criterion or just do 1 unit per pick.
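For reference, here is the standard Kelly fraction sketched as a small R function; the win probability would come from the model and the decimal odds from the bookmaker, neither of which is shown in this post.

```r
# Kelly fraction: f* = (b*p - q) / b, where b is the net decimal payout
# (decimal odds - 1), p is the model's win probability, and q = 1 - p.
kelly_fraction <- function(p, decimal_odds) {
  b <- decimal_odds - 1
  f <- (b * p - (1 - p)) / b
  max(f, 0)  # never bet when the edge is negative
}

# Example: model gives a 55% win probability at decimal odds of 2.10.
kelly_fraction(0.55, 2.10)  # ~0.141, i.e. stake ~14% of the bankroll
```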
Check out my model on Kaggle at the link below and let me know any thoughts you have that could improve our accuracy.
https://www.kaggle.com/code/ulukki/english-premier-league-random-forest-model