LaLiga 2018/19 Season Player Statistics Dataset Creation via Web Scraping

Álvaro Bartolomé
7 min readJun 6, 2019

Kaggle is an online community of data scientists owned by Google. The main factors that made Kaggle the world’s largest community of data scientists are the competitions, which encourage users to solve complex data science and machine learning projects, and the dataset hosting it provides.

In this post I will be focusing on datasets and how they can be created via web scraping with Python, so that we can later contribute them to the Kaggle community for research purposes.

Introduction

Web Scraping involves fetching a web page and extracting data from it, so developing a Web Scraper implies finding a reliable source the data can be extracted from and then analysing its inner HTML.

As LaLiga offers neither an API where data can be easily retrieved nor a public dataset, here is my contribution to cover that gap. After some research I found some already-created public datasets of LaLiga statistics (mainly season rankings and season match statistics), but none of them contained individual player statistics for every team in LaLiga Santander (the Spanish first division), so I decided to create that dataset.

Contributing Open Source projects and datasets to the data science community is highly relevant, as it helps everyone involved in different ways and can also have a huge learning impact. Some other ideas are presented in “Why is Open Source Contribution Important?” by Vishal Dubey.

Developing a Web Scraper with Python

This section focuses on how the data is fetched and extracted, leaving aside the ins and outs of Web Scraping with Python, as they have been explained in detail in “Guidelines for Building an Efficient Web Scraper in Python” by Álvaro Bartolomé del Canto.

Before developing a Web Scraper you need a reliable source the data can be extracted from; in this use case, the selected website is LaLiga, specifically the LaLiga Player Stats pages.

First of all we need to check the URL where the team name is specified. Since we want to retrieve the statistics of every player of every team that plays in LaLiga Santander, we need to list the URLs of all those teams so that the scraper can loop over them, as described below.

Web Scraping Function to retrieve a list with dictionaries containing Team Name and Tag (URL Replacement) of every team from LaLiga Santander
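A minimal sketch of what retrieve_team_dict() could look like, assuming requests and BeautifulSoup are used; the listing URL and the club-link CSS class are assumptions for illustration, since the actual markup of the LaLiga website may differ and changes over time:

```python
import requests
from bs4 import BeautifulSoup

LISTING_URL = 'https://www.laliga.com/en-GB/clubs'  # assumed clubs listing page

def parse_team_dict(html):
    """Parse the clubs listing page into a list of {'name': ..., 'tag': ...} dicts."""
    teams = []
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.select('a.club-link'):  # assumed CSS class of each club anchor
        teams.append({
            'name': link.get_text(strip=True),
            # the last URL segment acts as the team tag, e.g. 'real-madrid'
            'tag': link['href'].rstrip('/').split('/')[-1],
        })
    return teams

def retrieve_team_dict():
    """Fetch the clubs listing page and return every team's name and tag."""
    response = requests.get(LISTING_URL, headers={'User-Agent': 'Mozilla/5.0'})
    response.raise_for_status()
    return parse_team_dict(response.text)
```

Splitting the fetch from the parse keeps the HTML-handling logic testable without hitting the network.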

So the function retrieve_team_dict() returns a list of dictionaries containing every team’s name and tag, where the tag is the URL replacement used to retrieve the player statistics of that team. Now we have to define a list of generic URLs with a placeholder keyword, which we will replace with each team tag to retrieve data for all the players of every listed team.

Generic URLs to replace with Team Name
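A sketch of that URL list; the concrete paths and tab names are assumptions mirroring the tabs of the LaLiga player-stats pages, with a `{tag}` placeholder standing in for the team tag:

```python
# Generic URLs, one per stats tab; {tag} is the placeholder to be replaced
# with each team tag returned by retrieve_team_dict(). The paths are
# illustrative assumptions, not the literal laliga.es routes.
GENERIC_URLS = [
    'https://www.laliga.com/en-GB/clubs/{tag}/stats-classic',
    'https://www.laliga.com/en-GB/clubs/{tag}/stats-attack',
    'https://www.laliga.com/en-GB/clubs/{tag}/stats-defence',
    'https://www.laliga.com/en-GB/clubs/{tag}/stats-discipline',
]

def urls_for_team(tag):
    """Fill the {tag} placeholder with a specific team tag."""
    return [url.format(tag=tag) for url in GENERIC_URLS]
```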

Once we have listed all the sources or URLs to extract the data from, we can go on and develop the Web Scraper to retrieve all the player statistics. Note that in this case all the pages are structured the same way because they correspond to different tabs or sections of the same website; but in case you need to extract data from different sources, you should develop one web scraper per source unless they share the same structure, which is rather unlikely.

Web Scraping Function to retrieve all the Player Statistics of every Team on LaLiga Santander
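A minimal sketch of retrieve_player_stats(team), assuming the stats tables can be read with pandas.read_html; the URL scheme, the tab names, and the assumption that the stats table is the first <table> on each page are all illustrative, not the exact original implementation:

```python
from io import StringIO

import pandas as pd
import requests

STATS_TABS = ['stats-classic', 'stats-attack', 'stats-defence']  # assumed tab names

def stats_frame(html, team_name):
    """Parse the first <table> of a stats page and tag every row with the team."""
    frame = pd.read_html(StringIO(html))[0]  # assume the stats table is the first one
    frame['Team'] = team_name  # keep track of which team the rows belong to
    return frame

def retrieve_player_stats(team):
    """Scrape every stats tab of a team and return one DataFrame per tab.

    `team` is a dict like {'name': 'Real Madrid', 'tag': 'real-madrid'}.
    """
    frames = []
    for tab in STATS_TABS:
        url = f"https://www.laliga.com/en-GB/clubs/{team['tag']}/{tab}"  # assumed URL scheme
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        response.raise_for_status()
        frames.append(stats_frame(response.text, team['name']))
    return frames
```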

retrieve_player_stats(team) is the main Web Scraping function that retrieves the player statistics. It receives a team dictionary as input and returns a list of pandas.DataFrames, one for every page listed and scraped. Those DataFrames then need to be concatenated into a single DataFrame containing all the information combined, without duplicated columns.

DataFrame concatenation as Data is stored in multiple DataFrames
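The concatenation step could be sketched as follows, assuming the per-tab DataFrames list the players in the same order and share some columns (such as the player name), so duplicate columns only need to be kept once:

```python
import pandas as pd

def combine_frames(frames):
    """Concatenate per-tab DataFrames side by side, dropping duplicated columns.

    Assumes every frame lists the same players in the same row order; if that
    assumption does not hold, merging on the player name column is safer.
    """
    combined = pd.concat(frames, axis=1)
    # tabs repeat columns such as 'Player' or 'Team'; keep only the first copy
    return combined.loc[:, ~combined.columns.duplicated()]
```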

As the function that retrieves the player statistics receives a team dictionary as a parameter, a main function needs to be defined combining all the functions proposed above, launching the scraper over every team to build the whole dataset and store it in a CSV file.

Launch function to start the Web Scraper and Dataset as CSV creation

Note that the internals of the Web Scraper have not been explained in detail because they are already covered in “Guidelines for Building an Efficient Web Scraper in Python” by Álvaro Bartolomé del Canto, as mentioned before. But if you want to know more about this specific scraper, feel free to contact me as detailed later in Additional Information.

Kaggle Dataset and Usage

Once the Web Scraper is developed and the resulting pandas.DataFrame is generated, we can dump the data into a CSV file so that we can upload the dataset to Kaggle, since Kaggle suggests uploading datasets in CSV format. The resulting dataset can be found at https://github.com/alvarob96/laliga-dataset/blob/master/dataset/laliga_player_stats.csv.

After logging in, to upload the dataset to Kaggle we need to go to Kaggle Datasets and press “New Dataset” as shown below. Then we just have to add the corresponding file and specify some basic information about it, such as the dataset name and URL, and we can create either a public or a private dataset (both have a maximum storage of 20 GB in total).

Once properly set up and configured, the uploaded dataset should look more or less like the figure below. It is highly recommended, though not mandatory, to add some additional information about the dataset you are uploading, such as the source of the data, the expected update frequency and the license, in order to let users know the basics about the dataset they are going to use.

Kaggle also provides a useful tool, Kaggle Kernels, a Dockerized Python kernel image that allows users to create their own Jupyter Notebooks and work with datasets directly on Kaggle. Additionally, when you first upload a dataset, Kaggle launches an automated bot that runs a simple exploratory analysis over the dataset, a preliminary study of the data that can be useful to get an overview of it.

Some plots automatically generated by Kaggle are shown below to give an overview of the LaLiga dataset features.

Plot per column distribution of LaLiga Player Statistics dataset
Correlation Matrix of LaLiga Player Statistics dataset
Scatter Matrix of LaLiga Player Statistics dataset

Additionally, feel free to create any new kernel over LaLiga Player Statistics dataset and share your ideas or studies with the community!

Conclusion and Future Work

Open Source contributions help the Data Science community grow and improve, and by developing Web Scrapers over public, available and reliable data in order to create datasets, we can make our contribution!

Kaggle offers a great platform to do so; in this case, the contribution is made by combining Web Scraping techniques for dataset creation with Kaggle Datasets and Kaggle Kernels.

Due to the lack of a detailed public dataset of LaLiga statistics, the main idea for future work is to retrieve all the available data on LaLiga, so that the dataset contains enough data for exploratory analysis and for training Machine Learning models, making it genuinely useful for Data Science. Additionally, both the scraping technique and the error handling should be improved in order to build an efficient and, most importantly, scalable Web Scraper, so that in the next LaLiga season (LaLiga Santander 2019/20) real-time or weekly data (as LaLiga matches are contested on weekends) can be extracted, keeping the dataset up to date.

Additional Information

If you are not familiar with Web Scraping with Python and you want to learn the basics and improve in order to become a better data scientist, I highly recommend either of these books:

  • Heydt, M., & Zeng, J. (2018). Python Web Scraping Cookbook. Birmingham: Packt Publishing.
  • Mitchell, R. (2018). Web Scraping with Python, 2nd Edition. [S.l.]: O’Reilly Media, Inc.

All the code used and the generated dataset are available on GitHub, and the dataset is also uploaded to Kaggle. For further information or any questions, feel free to contact me via email at alvarob96@usal.es or via LinkedIn at https://www.linkedin.com/in/alvarob96/


Álvaro Bartolomé

Machine Learning Engineer. Passionate about CV and NLP, and open-source developer. Also, baller and MMA enjoyer.