Web scraping extracts large amounts of data from websites for a variety of uses, such as price monitoring, enriching machine learning models, financial data aggregation, monitoring consumer sentiment, and news tracking. A browser shows you a website's data one page at a time, but manually copying data from multiple sources into a central place is tedious and time-consuming. Web scraping tools automate this manual process.
This article intends to get you up to speed on image scraping using Python.
“Web scraping,” also called crawling or spidering, is the automated gathering of data from an online source, usually a website. While scraping is a great way to get massive amounts of data in a relatively short timeframe, it does add load to the server hosting the source.
This is the main reason many websites disallow or ban scraping altogether. However, as long as it does not disrupt the primary function of the online source, it is generally tolerated.
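If you are unsure whether a site tolerates scraping, its robots.txt file is a reasonable first check. Below is a minimal sketch using Python's built-in urllib.robotparser; the URL and path are only examples:

from urllib import robotparser

# fetch and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()

# '*' matches any user agent; the path here is just an example
print(rp.can_fetch("*", "/search?tbm=isch"))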
Our objective is to fetch image data for a deep learning model, whether that is cat-and-dog image classification or classifying superheroes from images. All of these problems require a lot of data. Sometimes we have it, and sometimes we do not; in the latter case, we can easily gather the data from almost any website using web scraping.
Selenium is an open-source, web-based automation tool. In industry, Selenium is primarily used for testing, but it can also be used for web scraping. We will use the Chrome browser here, but you can try any browser; the process is almost the same. Selenium's Python bindings provide a simple API for writing functional/acceptance tests with Selenium WebDriver, and through this API you can access all of WebDriver's functionality in an intuitive way. The bindings support WebDrivers such as Firefox, Internet Explorer, Chrome, and Remote. The currently supported Python versions are 3.5 and above.
For more on Selenium, check out its documentation here.
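To get a feel for the API before we use it for scraping, here is a minimal sketch that opens a page and prints its title; it assumes the Chrome driver is available on your PATH:

from selenium import webdriver

wd = webdriver.Chrome()            # launch Chrome via the driver on your PATH
wd.get("https://www.google.com")   # navigate to a page
print(wd.title)                    # the page title, e.g. "Google"
wd.quit()                          # close the browser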
Now that we have a basic understanding of web scraping, let's dive into the coding part.
First, I would recommend creating a separate environment for this project; an IDE such as PyCharm (or any other) would also be a plus.
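For example, you can create and activate an environment with Python's built-in venv module (a sketch; the environment name is arbitrary):

python -m venv scraping-env
source scraping-env/bin/activate   # on Windows: scraping-env\Scripts\activate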
After activating the environment, we first need to install the required libraries. For that, just type the following command in your terminal after downloading the file from here:
pip install -r requirements.txt
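The requirements file itself is not reproduced here; judging from the imports used below, a minimal version would presumably list at least:

selenium
requests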
Now open your IDE and follow my lead:
# importing libraries
import os
import time
import requests
from selenium import webdriver
The above lines of code will import the libraries.
def fetch_image_urls(query: str, max_links_to_fetch: int, wd: webdriver, sleep_between_interactions: int = 1):
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)
The fetch_image_urls function takes four arguments: the search query, the maximum number of links to fetch, the web driver instance, and the sleep time between two interactions. The nested scroll_to_end helper scrolls the page down so that more thumbnails are loaded.
    # build the google image-search query
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

    # load the page
    wd.get(search_url.format(q=query))
The ‘get’ method will load the webpage for the given URL.
    image_urls = set()
    image_count = 0
    results_start = 0

    while image_count < max_links_to_fetch:
        scroll_to_end(wd)

        # get all image thumbnail results
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd")
        number_results = len(thumbnail_results)

        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")
The while loop keeps scrolling through the results until the required number of images is found. thumbnail_results holds the image thumbnails, located via the CSS class name that Google assigns to them; the class name can be found with the browser's Inspect Element tool. number_results is simply the number of thumbnails matching that class.
        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail so we can get the real image behind it
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception:
                continue

            # extract the image urls
            actual_images = wd.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
                    image_urls.add(actual_image.get_attribute('src'))

            image_count = len(image_urls)

            if len(image_urls) >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
        else:
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(30)

            # click the "load more" button if it is present
            load_more_button = wd.find_element_by_css_selector(".mye4qd")
            if load_more_button:
                wd.execute_script("document.querySelector('.mye4qd').click();")

        # move the result start point further down
        results_start = len(thumbnail_results)

    return image_urls
The ‘persist_image’ function downloads the image content from a URL using ‘requests’ and stores it on our system. The ‘search_and_download’ function looks for the target folder, creates it if it does not exist, and then, with the help of the web driver, fetches the images into that folder.
def persist_image(folder_path: str, url: str, counter):
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        return

    try:
        # save the raw bytes as <folder>/jpg_<counter>.jpg
        with open(os.path.join(folder_path, 'jpg' + "_" + str(counter) + ".jpg"), 'wb') as f:
            f.write(image_content)
        print(f"SUCCESS - saved {url} - as {folder_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")
def search_and_download(search_term: str, driver_path: str, target_path='./images', number_images=10):
    target_folder = os.path.join(target_path, '_'.join(search_term.lower().split(' ')))

    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    with webdriver.Chrome(executable_path=driver_path) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=0.5)

    counter = 0
    for elem in res:
        persist_image(target_folder, elem, counter)
        counter += 1
DRIVER_PATH = './chromedriver_linux64 (1)/chromedriver'   # path of the driver you installed
search_term = 'iron_man'                                  # also used as the folder name
number_images = 5   # number of images; defaults to 10 if you do not pass it

# call the function to search and download
search_and_download(search_term=search_term, driver_path=DRIVER_PATH, number_images=number_images)
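Since these images are destined for a deep learning model, it is worth checking after the run that every downloaded file is actually a readable image. Here is a small sketch using Pillow (pip install Pillow); the folder path matches the defaults used above:

import os
from PIL import Image

folder = './images/iron_man'   # default target folder from the code above
for name in os.listdir(folder):
    path = os.path.join(folder, name)
    try:
        with Image.open(path) as img:
            img.verify()   # raises an exception if the file is not a valid image
    except Exception as e:
        print(f"Removing unreadable file {path}: {e}")
        os.remove(path)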
# How to execute this code
# Step 1: pip install -r requirements.txt
# Step 2: make sure you have Chrome (or Firefox) installed on your machine
# Step 3: check your Chrome version (three-dot menu -> Help -> About Google Chrome)
# Step 4: download the ChromeDriver matching that exact version from "https://chromedriver.storage.googleapis.com/index.html"
# Step 5: put it inside the same project folder as this code
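A note on newer Selenium releases: this article uses the Selenium 3 API. Selenium 4 deprecates (and recent releases remove) the find_elements_by_* helpers and the executable_path argument, so on a current installation the calls would be adapted roughly like this sketch:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: the driver path goes through a Service object,
# and elements are located with By constants instead of find_elements_by_*
wd = webdriver.Chrome(service=Service(DRIVER_PATH))
thumbnails = wd.find_elements(By.CSS_SELECTOR, "img.Q4LuWd")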
We have now extracted five Iron Man images into our system by web scraping with Selenium. You can download any images you want using this approach; just make sure to check the class name correctly, and the rest of the code stays the same. That is it for this simple article. If you have any doubts, feedback, or suggestions, feel free to comment or reach out to me here.
You can get the source code from here.