Guidelines for Building an Efficient Web Scraper in Python

Álvaro Bartolomé
6 min read · Mar 25, 2019

Introduction

As described in Wikipedia, data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration). The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.

Baumgartner [2009] defines a web data extraction system as “a software extracting, automatically and repeatedly, data from Web pages with changing contents, and that delivers extracted data to a database or some other application”.

In this post I will explain, step by step, how to develop a web scraper for a website; in this case, how to scrape the news section from Investing. You will also see multiple ways to retrieve the HTML DOM tree and multiple ways to scrape it, turning that information into useful data.

Proposed System

[Figure: Proposed System Structure]

The proposed system is a simple Python-based web scraping system, where three main parts can be easily distinguished:

  • The website, or collection of websites, you want to scrape the information from.
  • The web scraper, developed in Python (or whichever programming language you feel more comfortable with), that retrieves the HTML DOM, extracts the data from it, and transforms it into useful information.
  • Finally, the storage system of your preference, where the resulting data is dumped. A skeletal sketch of these three stages follows below.
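As a rough orientation, the three parts map onto three functions. This is only an illustrative skeleton under my own naming assumptions, not a fixed design; each stage is fleshed out in the sections below.

def retrieve(url: str) -> str:
    """Parts 1 and 2: request the target website and return its HTML DOM as text."""
    ...

def scrape(html: str) -> list:
    """Part 2: parse the HTML DOM and extract the relevant data."""
    ...

def store(records: list) -> None:
    """Part 3: dump the extracted data into the storage system of your preference."""
    ...

# overall data flow of the system:
# store(scrape(retrieve("https://es.investing.com/news/latest-news")))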

Requirements

To test the code presented in this post, you need the following dependencies installed in their latest versions; you can easily install them all via PyPI (Python Package Index), as shown right after the list. Each package is described in more detail in the section where it is used:

  • urllib3: a powerful, sanity-friendly HTTP client for Python.
  • requests: an HTTP library for sending HTTP/1.1 requests, built on top of urllib3.
  • lxml: a Pythonic, mature binding for the libxml2 and libxslt libraries.
  • beautifulsoup4: a library for scraping information from web pages, sitting on top of an HTML or XML parser.
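All four packages can be installed, or upgraded to their latest versions, with a single pip command:

python -m pip install --upgrade urllib3 requests lxml beautifulsoup4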

HTML DOM Tree Retrieval

In this section we start with the HTML DOM tree retrieval from the specified URL, in this case: https://es.investing.com/news/latest-news. As we are developing this web scraper in Python, we have some alternatives, such as urllib3 or requests. Before scraping a website we need to either send a GET request to retrieve its HTML or send a POST request to that URL and read its response; in this use case we are going to retrieve the whole HTML DOM from the specified URL.

urllib3

urllib3 is a powerful, sanity-friendly HTTP client for Python. Much of the Python ecosystem already uses urllib3 and you should too. urllib3 brings many critical features that are missing from the Python standard libraries.

requests

requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.

The implementation for both tools is as follows:

import requests
import urllib3

# custom headers so the request looks like it comes from a regular browser
head = {
    "User-Agent": "Mozilla/5.0 (X11; U; Linux amd64; rv:5.0) Gecko/20100101 Firefox/5.0 (Debian)",
    "X-Requested-With": "XMLHttpRequest"
}

url = "https://es.investing.com/news/latest-news"

# requests
req = requests.get(url, headers=head)
print(req.text)

# urllib3
http = urllib3.PoolManager()
r = http.request('GET', url, headers=head)
print(r.data.decode('utf-8'))  # r.data is bytes, so decode it into text
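Whichever client you choose, it is a good idea to verify that the request actually succeeded before parsing the response. Continuing from the requests example above, a minimal check (my addition, not part of the original snippet) could look like this:

# abort early if the server did not answer with 200 OK
if req.status_code != 200:
    raise ConnectionError(f"Request failed with HTTP status {req.status_code}")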

Once we know some HTML DOM retrieval tools built in Python, we are going to test them in order to use the most efficient one, where in this case efficient means the one that takes the least time overall.

urllib3 vs requests comparison

The test is run over a stable Internet connection, and the HTML DOM retrieval process is repeated 500 times in order to obtain an average time and determine which tool performs better for the base case. The results may vary depending on the origin URL; this is just a study of this particular use case, and you should evaluate all the available HTML DOM retrieval tools before committing to one.
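As a reference, a benchmark along these lines can be put together with the standard timeit module; the snippet below is a simplified sketch of the measurement, not the exact script behind the reported numbers:

import timeit

setup = """
import requests
import urllib3

head = {"User-Agent": "Mozilla/5.0 (X11; U; Linux amd64; rv:5.0) Gecko/20100101 Firefox/5.0 (Debian)"}
url = "https://es.investing.com/news/latest-news"
http = urllib3.PoolManager()
"""

# run each retrieval 500 times and report the average time per request
requests_total = timeit.timeit("requests.get(url, headers=head)", setup=setup, number=500)
urllib3_total = timeit.timeit("http.request('GET', url, headers=head)", setup=setup, number=500)

print(f"requests: {requests_total / 500:.4f} s per request")
print(f"urllib3:  {urllib3_total / 500:.4f} s per request")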

Data Scraping

Once we have the HTML code returned as the response to the previous request, we need to scrape the data out of it and store it. Hence, we are looking for a fast HTML parsing tool that can retrieve huge loads of data quickly, so the user does not wait too long for the data to be scraped. The main Python packages used for HTML parsing are bs4 and lxml, as explained below.

beautifulsoup4 (bs4)

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

lxml

lxml is a Pythonic, mature binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API. It extends the ElementTree API significantly to offer support for XPath, RelaxNG, XML Schema, XSLT, C14N and much more.

The implementation for both tools is as follows:

from bs4 import BeautifulSoup, SoupStrainer
from lxml.html import fromstring

# beautifulsoup4 (bs4)
# parse only the <div class="largeTitle"> that contains the news list
parse_only = SoupStrainer('div', {'class': 'largeTitle'})
html = BeautifulSoup(req.text, 'html.parser', parse_only=parse_only)
selection = html.select('article > div.textDiv > a')

news = list()
if selection:
    for element in selection:
        news.append(element.text)

# lxml
# address the news links directly via an absolute XPath expression
root_ = fromstring(req.text)
path_ = root_.xpath("/html/body/div[5]/section/div[5]/article/div/a")

news = list()
if path_:
    for elements_ in path_:
        news.append(elements_.text_content())

Once we know some HTML scraping tools built in Python, we are going to test them in order to use the most efficient one, where in this case efficient means the one that takes the least time overall.

bs4 vs lxml comparison

The test is run over a stable Internet connection, and the scraping process is repeated 500 times in order to obtain an average time and determine which tool performs better for the base case. The results may vary depending on the origin URL; this is just a study of this particular use case, and you should evaluate all the available HTML parsing tools before committing to one.
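The parsing comparison can be sketched in the same way; here the page is fetched once and only the parsing step is timed (again, a simplified illustration rather than the exact benchmark script):

import timeit
import requests
from bs4 import BeautifulSoup, SoupStrainer
from lxml.html import fromstring

head = {"User-Agent": "Mozilla/5.0 (X11; U; Linux amd64; rv:5.0) Gecko/20100101 Firefox/5.0 (Debian)"}
html_text = requests.get("https://es.investing.com/news/latest-news", headers=head).text

def parse_bs4():
    # restrict parsing to the news container, then select the headline links
    soup = BeautifulSoup(html_text, 'html.parser',
                         parse_only=SoupStrainer('div', {'class': 'largeTitle'}))
    return [a.text for a in soup.select('article > div.textDiv > a')]

def parse_lxml():
    # address the headline links directly with an absolute XPath
    root = fromstring(html_text)
    return [a.text_content() for a in root.xpath("/html/body/div[5]/section/div[5]/article/div/a")]

print(f"bs4:  {timeit.timeit(parse_bs4, number=500) / 500:.4f} s per parse")
print(f"lxml: {timeit.timeit(parse_lxml, number=500) / 500:.4f} s per parse")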

Conclusion

As we can see, when it comes to HTML DOM retrieval the most efficient tool is requests, which outperforms urllib3 by 0.13 seconds on average over the 500-test use case on each. As a scraping tool we should use lxml, since it also outperforms bs4, taking x seconds less on average.

To conclude, we are going to test every possible combination in order to determine the best way to retrieve and scrape a website, so that the retrieved data can be turned into useful information.

requests/urllib3 and bs4/lxml combined

So, the best combination overall is requests for the HTML DOM retrieval, combined with lxml for the web scraping. We also reach the conclusion that urllib3 is the main factor making the process take longer.
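Putting the winning combination together with the third part of the proposed system, a minimal end-to-end sketch could look like this (the SQLite storage, the database file name, and the table schema are illustrative assumptions on my part, not part of the original study):

import sqlite3
import requests
from lxml.html import fromstring

head = {"User-Agent": "Mozilla/5.0 (X11; U; Linux amd64; rv:5.0) Gecko/20100101 Firefox/5.0 (Debian)"}
url = "https://es.investing.com/news/latest-news"

# 1. retrieval: requests
req = requests.get(url, headers=head)

# 2. scraping: lxml
root = fromstring(req.text)
news = [a.text_content().strip() for a in root.xpath("/html/body/div[5]/section/div[5]/article/div/a")]

# 3. storage: dump the headlines into a SQLite database (illustrative schema)
conn = sqlite3.connect("news.db")
conn.execute("CREATE TABLE IF NOT EXISTS news (headline TEXT)")
conn.executemany("INSERT INTO news (headline) VALUES (?)", [(h,) for h in news])
conn.commit()
conn.close()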

Additional Information

For further information or any questions, feel free to contact me via email at alvarob96@usal.es or via LinkedIn at https://www.linkedin.com/in/abartt/

Thank you for your support! Stay tuned for more Data Science content!


Álvaro Bartolomé

Machine Learning Engineer. Passionate about CV and NLP, and open-source developer. Also, baller and MMA enjoyer.