You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hours after publishing my previous blogpost about the WorldCup Graph, I actually found a better, and more up to date dataset that contained all the data of the actual squads that are going to play in the actual World Cup in Qatar. I found it on this wikipedia page, which lists all the tables with the actual squads, some player details, coaches etc. as they were announced on 10th/11th of November.
So: I figured it would be nice to revisit the WorldcupGraph, and show a simpler and faster way to achieve the results of the previous exercise. So: I have actually put this data in this spreadsheet, and then downloaded a .csv version:
First, let's find the shortest paths between Lionel Messi and Kevin De Bruyne:
MATCHpath=ALLSHORTESTPATHS ((p1:Player{name:"Lionel Messi"})-[*..5]-(p2:Player{short_name:"Kevin De Bruyne"}))
RETURNpathLIMIT100;
Find matches with players from same club on two sides
The basic idea behind this query is simple: would there be any WorldCup matches, where there would be players playing for National teams, that are actually on the same team during the rest of the season. That could be interesting in many different ways, right?
So here's a simple query specifically for Belgium, essentially looking for a cyclic pattern (starting and ending with the Club node):
We can also make this more generic, but then we have to limit the results if we want to keep the results within the realms of what the Neo4j Browser can display:
I had not even heard of this, but it's SO AMAZING, that I really do have to mention it.
Instead of going through the above import procedure, why not just restore a backup to your server, over http???
Here's how you do that:
in your neo4j.conf, you have to enable the loading of a database backup from a URI: use this entry to allow that
dbms.databases.seed_from_uri_providers=URLConnectionSeedProvider
As with much of the sports data, it's not always that easy to find up to data interesting data. I think that's because much of that data is probably safely hidden behind copyright protection and paywalls - but there's always some stuff around. So after a bit of searching, I found a couple of interesting data sources:
Josh Fjelstul's World Cup database
So that's already quite a bit!
To make life a bit easier for myself, I selected a couple of the most interesting .csv files abd put them together in a Google Spreadsheet.
There's a couple of worksheets in this document, but I will only use some of them:
squad_stats with the definition of the different participating teams
players_22 with the different players' information
Qatar_group_stage with the definitions of the group stage matches
historical_win-loose-draw_ratios_qatar2022_teams with the historical data with regards to the different team's matches
So, now we can proceed to actually importing the data into Neo4j, and have some fun!
Constructing a model
The model will look like this:
Some interesting components to this:
on the left hand side of the model you find the player data, including the Clubs, Leagues, Positions and Traits of the Players.
in the center of the model is the NationalTeam that the Players play for, based on their nationality. Now, the player dataset actually includes a lot of players that will never get selected to their national team - so there's two potential relationships between the Player and the NationalTeam.
HAS_NATIONALITY means that the Player is a potential player for the NationalTeam, but has not necessarily debuted yet.
PLAYS_FOR means that the Player has already played for the NationalTeam
on the right hand side of the model we find the Matches and the Phases of the Worldcup2022 Tournament.
on the far right you will find the stats of the historic Results of the Matches that have been scheduled.
We can also make this more generic, but then we have to limit the results if we want to keep the results within the realms of what the Neo4j Browser can display:
Last week I was having a conversation with one of my dear Neo4j colleagues, and we were talking about the fact that Graphs are simply so much fun to play around with, and that there's nothing like a great interesting dataset to have people really experiment and acquaint themselves with the technology. I know that to be extremely true, and I think I have demonstrated this elaborately over the years on this runaway blog of mine.
Then the conversation turned to a topic that I know very little about: the FIFA World Cup in Qatar that is starting next week. Now, reading this blog you may know that I am a little addicated to my 2 wheeled #mentalhealthmachine, and that chasing a ball across a field seems like a little bit of a game to me - but hey, that's ok! And with this conversation it actually dawned on me that at Neo4j, we had done "Worldcup Graphs" both in 2014 and in 2018: our friend and former colleague Mark Needham was the driving force behind both of those efforts.
You can still see some of the work that Mark did at the time on Github and
Medium. It was truly another example of how a cool and timely dataset would get people to explore the wonderful world of graphs and get to know the technology in a fun and interesting way.
So: I decide that it would be nice to do that again. With all the new tech that is coming out of Neo4j with the release of Neo4j 5, that could not be very difficult, right? Let's take a look.
Looking for the world cup data
As with much of the sports data, it's not always that easy to find up to data interesting data. I think that's because much of that data is probably safely hidden behind copyright protection and paywalls - but there's always some stuff around. So after a bit of searching, I found a couple of interesting data sources:
Josh Fjelstul's World Cup database
So that's already quite a bit!
To make life a bit easier for myself, I selected a couple of the most interesting .csv files abd put them together in a Google Spreadsheet.
There's a couple of worksheets in this document, but I will only use some of them:
squad_stats with the definition of the different participating teams
players_22 with the different players' information
Qatar_group_stage with the definitions of the group stage matches
historical_win-loose-draw_ratios_qatar2022_teams with the historical data with regards to the different team's matches
So, now we can proceed to actually importing the data into Neo4j, and have some fun!
Constructing a model
Whenever we do any kind of data wrangling, it's always important to ask yourself the question what kind of a data model you want to import the data into. The nice thing about graphs is of course that this super flexible and adaptable, but still you want to take some care to do the import appropriately. After going throught the data, I decided on the following structure:
Some interesting components to this:
on the left hand side of the model you find the player data, including the Clubs, Leagues, Positions and Traits of the Players.
in the center of the model is the NationalTeam that the Players play for, based on their nationality. Now, the player dataset actually includes a lot of players that will never get selected to their national team - so there's two potential relationships between the Player and the NationalTeam.
HAS_NATIONALITY means that the Player is a potential player for the NationalTeam, but has not necessarily debuted yet.
PLAYS_FOR means that the Player has already played for the NationalTeam
on the right hand side of the model we find the Matches and the Phases of the Worldcup2022 Tournament.
on the far right you will find the stats of the historic Results of the Matches that have been scheduled.
So let's get cracking with the import now.
Importing the WorldcupGraph, 2022 edition
Setting up the indexes
Quite a trivial thing to do, but it's always a good idea to set up some indexes before making some imports, as it will make the import transactions more efficient, especially if you are planning to do a lot of MERGEs.
Importing all the FIFA players whose national team is in Qatar2022
Same process, except that this time we will filter on a specific part of the loaded .csv file, specifically the players of countries that are actually present at the 2022 WorldCup tournament:
As a formality, we also changed the date properties on these Player nodes, so that we could use them later on using the right datatypes.
MATCH (p:Player)
SETp.dob=datetime(p.dob)
SETp.club_joined=datetime(p.club_joined);
Connecting players to positions that they play for their teams
Next, we are creating the Position nodes and associating the players with these nodes. That involves splitting and unwiding these positions, as they are currently all put together ion the player_positions field of the Player nodes. Here's how that works:
MATCH (p:Player)
UNWINDsplit(p.player_positions,", ") aspositionsMERGE (po:Position{name:positions})
CREATE (p)-[:PLAYS_TEAM_POSITION]->(po);
Connecting players to Leagues and Clubs
We also want to make sure that we split off the League and Club nodes of the Players - as that will be interesting to use later on when we do some queries around the capabilities of players in the tournament.
Note that we use the COALESCE function quite a bit at this stage of our import: this allows us to efficiently deal with potential NULL values in our dataset. If we know the league or club of a player, we use that value - but if we don't we use the "UNKNOWN" value for that player.
A first scenario, where the player has a Position recorded for the NationalTeam. Then we create the PLAYS_FOR relationship between the Player and the NationalTeam.
MATCH (p:Player), (t:NationalTeam)
WHEREp.nationality_name=t.nameANDp.nation_positionISNOTNULLMERGE (p)-[:PLAYS_FOR]->(t);
A second scenarion where we don't have a Position recorded for the Player in the NationalTeam. In that case, we just record that the Player has the nationality of the team - with the HAS_NATIONALITY relationship.
MATCH (p:Player), (t:NationalTeam)
WHEREp.nationality_name=t.nameANDp.nation_positionISNULLMERGE (p)-[:HAS_NATIONALITY]->(t);
Connecting Players to traits
Because we have the info, we may as well connect the Player to the Traits in our dataset:
MATCH (p:Player)
UNWINDsplit(p.player_traits,", ") astraitsMERGE (tr:Trait{name:traits})
CREATE (p)-[:HAS_TRAIT]->(tr);
Adding Ranking information to Teams of Qatar2022
We can also add the ranking information to the NationalTeam nodes like this:
The last exercise is to add the actual Match information of the tournament to the database. The only thing special here is that every match will also have a IS_HOME_TEAM and an IS_AWAY_TEAM relationship to the teams.
As a very last step in the import process, I have also added historical results to every one of the matches - so that we can potentially better understand likely outcomes of these matches:
First, let's find the shortest paths between Lionel Messi and Kevin De Bruyne:
MATCHpath=ALLSHORTESTPATHS ((p1:Player{short_name:"L. Messi"})-[*..5]-(p2:Player{short_name:"K. De Bruyne"}))
RETURNpathLIMIT100;
The results are unsurprising, through some of the Trait nodes.
If we look at the paths that are not necessarily the shortest ones, we get very different subgraphs, of course.
MATCHpath= ((p1:Player{short_name:"L. Messi"})-[:PLAYS_FOR]-()-[*..4]-(p2:Player{short_name:"K. De Bruyne"}))
RETURNpathLIMIT100;
Finally, I also wanted to take a look at how the matches are going to be played, and how the team composition might have an impact on this.
Find matches with players from same club on two sides
The basic idea behind this query is simple: would there be any WorldCup matches, where there would be players playing for National teams, that are actually on the same team during the rest of the season. That could be interesting in many different ways, right?
So here's a simple query specifically for Belgium, essentially looking for a cyclic pattern (starting and ending with the Club node):
We can also make this more generic, but then we have to limit the results if we want to keep the results within the realms of what the Neo4j Browser can display:
That's all I have for now. I hope this was an interesting set of ideas and queries for you, and that you can easily experiment with this yourself as well.