The phrase “we have enough data” does not exist in data science parlance. I have never encountered anyone who willingly said no to collecting more data for their machine learning or deep learning project. And there are often situations when the data you have simply isn’t enough.
That’s when the power of web scraping comes to the fore. It is a powerful skill that any analyst or data scientist should possess, and it will stand you in good stead in the industry (and when you’re sitting for interviews!).
There are a whole host of Python libraries available to perform web scraping. But how do you decide which one to choose for your particular project? Which Python library holds the most flexibility? I will aim to answer these questions here, through the lens of five popular Python libraries for web scraping that I feel every enthusiast should know about.
Web scraping is the process of extracting structured and unstructured data from the web with the help of programs and exporting it into a useful format. If you want to learn more about web scraping, here are a couple of resources to get you started:
Alright – let’s see the web scraping libraries in Python!
Let’s start with the most basic Python library for web scraping. ‘Requests’ lets us make HTTP requests to a website’s server to retrieve the data on its pages. Getting the HTML content of a web page is the first step of web scraping.
Requests is a Python library used for making various types of HTTP requests like GET, POST, etc. Because of its simplicity and ease of use, it comes with the motto “HTTP for Humans”.
I would say this is the most basic yet essential library for web scraping. However, the Requests library does not parse the HTML it retrieves. To do that, we need libraries like lxml and Beautiful Soup (we’ll cover them further down in this article).
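Here is a minimal sketch of that first step: fetching a page’s HTML with a GET request. The function name `fetch_html`, the User-Agent string and the timeout value are my own illustrative choices, not anything prescribed by the library.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page with a GET request and return its HTML as text."""
    # Hypothetical User-Agent; identify your scraper however suits your project.
    headers = {"User-Agent": "example-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://example.com")` should hand back the raw HTML as one string. Notice that it is still just text at this point: Requests has fetched it but not parsed it.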
Let’s take a look at the advantages and disadvantages of the Requests Python library.
Advantages:
Disadvantages:
We know the Requests library cannot parse the HTML retrieved from a web page. For that we need a parser such as lxml, a fast, high-performance, production-quality Python library for parsing HTML and XML.
It combines the speed and power of the underlying C libraries (libxml2 and libxslt) with the simplicity of a Python API modelled on ElementTree. It works well when we’re aiming to scrape large datasets, and the combination of Requests and lxml is very common in web scraping. It also allows you to extract data from HTML using XPath and CSS selectors (CSS selector support relies on the separate cssselect package).
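To make the XPath part concrete, here is a small sketch that parses an inline HTML snippet with lxml. The markup, the field names and the `extract_books` helper are made up for illustration; in a real project the string would come from a Requests response.

```python
from lxml import html

# Made-up sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <ul id="books">
    <li class="book"><a href="/b/1">Dune</a> <span class="price">9.99</span></li>
    <li class="book"><a href="/b/2">Emma</a> <span class="price">4.50</span></li>
  </ul>
</body></html>
"""


def extract_books(page_source: str) -> list[dict]:
    """Pull title, link and price out of each book entry using XPath."""
    tree = html.fromstring(page_source)
    books = []
    for item in tree.xpath('//li[@class="book"]'):
        books.append({
            "title": item.xpath("./a/text()")[0],
            "url": item.xpath("./a/@href")[0],
            "price": float(item.xpath('./span[@class="price"]/text()')[0]),
        })
    return books


books = extract_books(SAMPLE_HTML)
print(books)
```

I have stuck to XPath here so that the example needs nothing beyond lxml itself.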
Let’s take a look at the advantages and disadvantages of the lxml Python library.
Advantages:
Disadvantages:
Beautiful Soup is perhaps the most widely used Python library for web scraping. It builds a parse tree from HTML and XML documents that you can navigate and search. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
One of the primary reasons the Beautiful Soup library is so popular is that it is easy to work with and well suited for beginners. It does not parse on its own but sits on top of a parser, either Python’s built-in html.parser or a third-party one like lxml. All this ease of use comes at a cost, though: it is slower than lxml. Even with lxml as its parser, it is slower than using lxml directly.
One major advantage of the Beautiful Soup library is that it copes very well with poorly formed HTML and offers a lot of functions for navigating and searching the parse tree. The combination of Beautiful Soup and Requests is quite common in the industry.
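As a quick illustration of how forgiving Beautiful Soup is, the sketch below feeds it markup with several tags left unclosed and still pulls out the links. The sample HTML and the `extract_links` helper are my own inventions for this example.

```python
from bs4 import BeautifulSoup

# Made-up markup: the last <li>, the <ul>, <body> and <html> are never closed.
MESSY_HTML = """
<html><body>
<h1>Reading list</h1>
<ul>
  <li><a href="/b/1">Dune</a></li>
  <li><a href="/b/2">Emma</a>
"""


def extract_links(page_source: str, parser: str = "html.parser") -> list[tuple[str, str]]:
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(page_source, parser)
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


links = extract_links(MESSY_HTML)
print(links)
```

Passing `parser="lxml"` instead of the default should give the same links here, provided lxml is installed.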
Advantages:
Disadvantages:
If you want to learn how to scrape web pages using Beautiful Soup, this tutorial is for you:
There is a limitation to all the Python libraries we have discussed so far: we cannot easily scrape data from dynamically populated websites. This happens because the data on such pages is often loaded through JavaScript after the initial HTML has been delivered. In simple words, if the page is not static, then the Python libraries mentioned earlier struggle to scrape the data from it.
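You can see the problem in a few lines. In the sketch below, the made-up page fills in its price with a script, so a parser that only reads the delivered HTML finds an empty element. The markup and the `price` id are purely illustrative.

```python
from bs4 import BeautifulSoup

# What the server sends: the <div> stays empty until a browser runs the script.
RAW_HTML = """
<html><body>
  <h1>Today's price</h1>
  <div id="price"></div>
  <script>
    document.getElementById("price").textContent = "42.00";
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")
# The parser never executes the script, so the element has no text.
price_text = soup.find(id="price").get_text(strip=True)
print(repr(price_text))  # prints ''
```

The value only appears once something actually executes the JavaScript.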
That’s where Selenium comes into play.
Selenium is a browser automation tool, with Python bindings, originally made for automated testing of web applications. Although it wasn’t made for web scraping originally, the data science community turned that around pretty quickly!
It drives a real web browser through a web driver, and that is what makes it so special. Where the other libraries are not capable of running JavaScript, Selenium excels. It can click on elements of a page, fill in forms, scroll the page and much more.
This ability to run JavaScript in a web page gives Selenium the power to scrape dynamically populated web pages. But there is a trade-off here. It loads and runs JavaScript for every page, which makes it slower and not suitable for large scale projects.
If time and speed are not a concern for you, then you can definitely use Selenium.
Advantages:
Disadvantages:
Here is a wonderful article to learn how Selenium works (including Python code):
Now it’s time to introduce you to the BOSS of Python web scraping libraries – Scrapy!
Scrapy is not just a library; it is an entire web scraping framework created by the co-founders of Scrapinghub – Pablo Hoffman and Shane Evans. It is a full-fledged web scraping solution that does all the heavy lifting for you.
Scrapy provides spider bots that can crawl multiple websites and extract the data. With Scrapy, you can create your own spider bots and host them on Scrapinghub or expose them as an API. It allows you to create fully functional spiders in a matter of a few minutes. You can also create item pipelines with Scrapy to process the scraped data.
The best thing about Scrapy is that it’s asynchronous. It can make multiple HTTP requests concurrently instead of waiting for each response before sending the next. This saves us a lot of time and increases our efficiency (and don’t we all strive for that?).
You can also add plugins to Scrapy to enhance its functionality. Although Scrapy is not able to handle JavaScript like Selenium, you can pair it with Splash, a lightweight JavaScript rendering service. With Splash, Scrapy can even extract data from dynamic websites.
Advantages:
Disadvantages:
If you want to learn Scrapy, which I highly recommend you do, you should read this tutorial:
I personally find these Python libraries extremely useful for my requirements. I would love to hear your thoughts on these libraries or if you use any other Python library – let me know in the comment section below.
If you liked the article, do share it along in your network and keep practicing these techniques!