I was reading about data analysis in my free time instead of studying for my midterms when I found out about data association. The goal of the chapter was to show how to measure how related elements are to each other. The author said it is useful when a grocery store wants to know which products to place together. We can also assume it is useful when YouTube or Twitter suggests content based on what we have previously consumed, or when Amazon suggests related products as we are about to buy something.
All that reading about linking and association made me think about… HOCKEY. Surprising! The book walked through a tutorial on finding how related a store's products are, based on what people have bought, using the Apriori algorithm. So I followed the same process with NHL play-by-play data. I won’t lie, I did not know where I was going with that… but I assumed it would give me the chemistry between players.
Before we go further, we need to understand what Apriori is. Here is what Wikipedia has to say:
Apriori is an algorithm for frequent item set mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database.
You get it? If you are not sure, you might get a better idea later when I explain how I implemented Apriori with hockey data.
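To make the definition a bit more concrete, here is a minimal pure-Python sketch of the level-by-level idea on a toy grocery example. The function name `apriori`, the baskets, and the 0.5 threshold are all made up for illustration; this is not the code from the book or from any particular library.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Share of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent individual items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Build size-k candidates by joining frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent


# Toy baskets, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
result = apriori(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With these baskets, eggs are dropped at the first level, so no pair containing eggs is ever counted. That pruning is what lets the algorithm grow to larger item sets without checking every possible combination.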
Basic Methodology
Get the play-by-play data for the 2021-22 NHL season
Filter the data to only keep the rows where the event is a “GOAL”
For every team, filter the PBP data and apply the Apriori algorithm to the skaters who were on the ice for the team that scored the goal
Merge all the results into one big dataframe and look at the mess I have made
I am filtering for goals because I assume that if players have good chemistry, they will be on the ice together when the team scores, and we can find associations there. Also, in all honesty, I did not know what else it could mean. I have also filtered out goalies because they are always on the ice.
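The data preparation steps above can be sketched roughly like this. Everything in the snippet is a placeholder: the rows are invented, and the field names (`event`, `team`, `on_ice`) are assumptions for the example, not the schema of the real play-by-play feed. Here `on_ice` is assumed to hold the scoring team's players only.

```python
from collections import defaultdict

# Made-up play-by-play rows. The field names and every skater/goalie
# label are placeholders, not the real data.
pbp_rows = [
    {"event": "SHOT", "team": "MTL", "on_ice": ["Skater A", "Skater B", "Goalie X"]},
    {"event": "GOAL", "team": "MTL", "on_ice": ["Skater A", "Skater B", "Skater C", "Goalie X"]},
    {"event": "GOAL", "team": "TOR", "on_ice": ["Skater D", "Skater E", "Goalie Y"]},
    {"event": "GOAL", "team": "MTL", "on_ice": ["Skater A", "Skater C", "Goalie X"]},
]
goalies = {"Goalie X", "Goalie Y"}


def goal_transactions(rows, goalies):
    """Group the skaters on the ice for each goal by the scoring team."""
    per_team = defaultdict(list)
    for row in rows:
        if row["event"] != "GOAL":
            continue  # keep goal events only
        skaters = frozenset(row["on_ice"]) - goalies  # drop the goalies
        per_team[row["team"]].append(skaters)
    return dict(per_team)


transactions = goal_transactions(pbp_rows, goalies)
print(transactions["MTL"])
```

Each team ends up with one "basket" of skaters per goal, which is the shape an Apriori implementation expects as input.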
So here is what I get when I sort the values by the strongest link, in descending order:
Because I am a Habs blogger and fan, here is the Canadiens view:
Oh! That’s cool! I did not know what to expect, but if my “chemistry tool” tells me that the Tatar-Danault-Gallagher line had the top score, I assume it is good! Roughly, what happens is that for each Habs goal, the algorithm looks at which Habs players were on the ice and gives each group a score based on how often they were on the ice together, relative to the sum of all the goals the Habs scored last season. In simpler terms, for every group of at least 2 players, the algorithm takes the number of goals the group was on the ice for and divides it by the total number of goals the Canadiens scored. Did I just say the same thing twice? Yes, I did, and I hope you understand it better now.
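Here is a small worked example of that score on four invented goals. Note that it simply counts every group of 2 or 3 skaters by brute force rather than running Apriori; with no minimum threshold the resulting numbers are the same kind of support value, it is just slower on real data. The function name `group_links` and the single-letter skaters are placeholders.

```python
from collections import Counter
from itertools import combinations


def group_links(goal_lineups, min_size=2, max_size=3):
    """Score each group: goals with the whole group on the ice / total team goals."""
    total_goals = len(goal_lineups)
    counts = Counter()
    for lineup in goal_lineups:
        for size in range(min_size, max_size + 1):
            # Sorting gives every group one canonical tuple key.
            for group in combinations(sorted(lineup), size):
                counts[group] += 1
    return {group: n / total_goals for group, n in counts.items()}


# Four made-up goals for one team, five skaters on the ice each time.
goal_lineups = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "F", "G"},
    {"A", "B", "H", "I", "J"},
    {"K", "L", "M", "N", "O"},
]
links = group_links(goal_lineups)
print(links[("A", "B")])       # on the ice together for 3 of 4 goals
print(links[("A", "B", "C")])  # on the ice together for 2 of 4 goals
```

Skaters A and B were together for 3 of the 4 goals, so their link is 0.75; adding C drops it to 0.5, because a bigger group can never be on the ice more often than any of its smaller sub-groups.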
Adding a few steps to the Methodology
Do the same procedure as before, but for goals against each team, to get the revulsion (I just needed a word that means the opposite of chemistry) of each group of players
Create a few metrics, like link or chemistry differential (positive_link - negative_link) and link or chemistry percentage (positive_link / sum of both links)
Merge everything together and look at the mess I have made
What the column values mean
link_x: Proportion of the team's goals for that were scored with the group of players on the ice
link_y: Proportion of the team's goals against that were allowed with the group of players on the ice
diff: link_x - link_y, or the +/- of the proportions of team goals for and against
t%: The group's proportion of team goals for, relative to the sum of its proportions of goals for and goals against. Formula: link_x / (link_x + link_y)
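The column definitions above can be sketched as a tiny merge step. This is only an illustration with invented numbers: `combine_links` is a hypothetical helper, and a group missing from one side gets `None` for the derived columns, which mimics what an outer merge would leave empty.

```python
def combine_links(links_for, links_against):
    """Build link_x, link_y, diff and t% for every group seen on either side."""
    rows = {}
    for group in set(links_for) | set(links_against):
        link_x = links_for.get(group)      # chemistry (goals for)
        link_y = links_against.get(group)  # "revulsion" (goals against)
        if link_x is None or link_y is None:
            # Only one link is available, so diff and t% cannot be computed.
            rows[group] = {"link_x": link_x, "link_y": link_y, "diff": None, "t%": None}
            continue
        rows[group] = {
            "link_x": link_x,
            "link_y": link_y,
            "diff": link_x - link_y,
            "t%": link_x / (link_x + link_y),
        }
    return rows


# Invented values, for illustration only.
links_for = {("A", "B"): 0.375, ("A", "C"): 0.25}
links_against = {("A", "B"): 0.125}

table = combine_links(links_for, links_against)
print(table[("A", "B")])  # diff 0.25, t% 0.75
print(table[("A", "C")])  # no link_y, so diff and t% stay empty
```

One thing the formula implies: because link_x and link_y are divided by different totals (team goals for and team goals against), t% equals the group's raw share of on-ice goals only when those two totals are the same.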
As I am writing this, I get the feeling that the Apriori algorithm is giving me on-ice GF% stats for any group of players in the league… and it is really fun. Before I leave, here is the data for the Habs groups that have both a link_x and a link_y (one thing I noticed is that some groups have only one of the links, probably due to a lack of data for the other).
That is it for this week. I hope you enjoyed the read and finding out with me how we can use the Apriori algorithm with hockey data. The next steps for this experiment will be to weight events and maybe try it with expected goals. This experiment made me curious about algorithms, and I would like to play with more of them using hockey data.
If you want more information about algorithms, I strongly suggest reading up on them. Have a good week, and see you next week for the next article of the newsletter.