This article was published as a part of the Data Science Blogathon.
pip install selenium
pip install scrapy
You need to store the chromedriver.exe in the folder where you are running your code.
import numpy as np
import pandas as pd
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from tqdm import tqdm
import warnings

warnings.filterwarnings("ignore")
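# Start Chrome with the chromedriver.exe stored next to this script.
# Note: newer Selenium 4 releases (4.10 and later) no longer accept the driver
# path as a positional argument; there, pass it through a Service object from
# selenium.webdriver.chrome.service instead.
driver = webdriver.Chrome('chromedriver.exe')

# Open the user reviews page and print its title to confirm that it loaded.
url = 'https://www.imdb.com/title/tt0241527/reviews?ref_=tt_sa_3'
time.sleep(1)
driver.get(url)
time.sleep(1)
print(driver.title)
time.sleep(1)

# Scroll down the page three times, pausing between key presses.
body = driver.find_element(By.CSS_SELECTOR, 'body')
body.send_keys(Keys.PAGE_DOWN)
time.sleep(1)
body.send_keys(Keys.PAGE_DOWN)
time.sleep(1)
body.send_keys(Keys.PAGE_DOWN)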
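# Read the total review count from the page header and strip the thousands separator.
sel = Selector(text=driver.page_source)
review_counts = sel.css('.lister .header span::text').extract_first().replace(',', '').split(' ')[0]

# Each page shows 25 reviews.
more_review_pages = int(int(review_counts) / 25)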
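The page arithmetic can be checked on its own, without a browser. The sketch below assumes the header text looks like '1,937 Reviews' (1937 is the count used in this article) and that each page holds 25 reviews. `parse_review_count` and `extra_pages` are hypothetical helper names for illustration, not part of Selenium or Scrapy.

```python
import math

REVIEWS_PER_PAGE = 25  # page size stated in this article


def parse_review_count(header_text):
    """Turn header text such as '1,937 Reviews' into the integer 1937."""
    return int(header_text.replace(',', '').split(' ')[0])


def extra_pages(total_reviews, per_page=REVIEWS_PER_PAGE):
    """Clicks needed after the first page of reviews is already on screen."""
    remaining = max(total_reviews - per_page, 0)
    return math.ceil(remaining / per_page)


total = parse_review_count('1,937 Reviews')  # assumed header format
print(total, int(total / 25), extra_pages(total))
```

For 1937 reviews both calculations give 77. They can differ for other totals (with exactly 50 reviews, `int(50 / 25)` is 2 while only 1 extra page remains), but an extra click is harmless as long as the click loop ignores failed clicks.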
Let’s use Selenium to click the load more button until all the reviews are loaded. Since each page contains 25 reviews, we need to click the button about 1937/25 ≈ 77 times. We will use the variable more_review_pages calculated in step 4.
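for i in tqdm(range(more_review_pages)):
    try:
        # 'load-more-trigger' is the element ID of the load more button.
        button_id = 'load-more-trigger'
        driver.find_element(By.ID, button_id).click()
    except Exception:
        # Ignore failed clicks, e.g. when the button is no longer available.
        pass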
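The loop above fires its clicks back to back and silently ignores every failure. A slightly more defensive variant, sketched here under stated assumptions, pauses between clicks and stops early once clicking keeps failing. `click_until_done` and its parameters are hypothetical names, and the `click_once` callable stands in for the Selenium call so the sketch runs without a browser.

```python
import time


def click_until_done(click_once, max_clicks, pause=1.0, max_failures=3):
    """Call click_once() up to max_clicks times, pausing between calls.

    Stops early after max_failures consecutive exceptions, which may mean
    the button is gone and all reviews are loaded.
    Returns the number of successful clicks.
    """
    clicks = 0
    failures = 0
    for _ in range(max_clicks):
        try:
            click_once()
            clicks += 1
            failures = 0
        except Exception:
            failures += 1
            if failures >= max_failures:
                break
        time.sleep(pause)  # give the page time to append the next batch
    return clicks


# Stand-in for the real click: succeeds 3 times, then fails like a missing button.
state = {'left': 3}


def fake_click():
    if state['left'] == 0:
        raise RuntimeError('button not found')
    state['left'] -= 1


print(click_until_done(fake_click, max_clicks=10, pause=0))  # prints 3
```

With the driver from this article, the callable could be `lambda: driver.find_element(By.ID, 'load-more-trigger').click()` and `max_clicks` could be `more_review_pages`.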
Each review sits in its own div with the class review-container. Within that division, we can use inspect element to find the tag that holds each piece of information. Let’s consider the 1st review for illustration. Its rating is stored within the span tag with class rating-other-user-rating.
To extract this information, we need to use the Scrapy library. We will pass the HTML code for the 1st review to Scrapy Selector and extract the rating value.
# Collect every review container and parse the HTML of the first one.
reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')
first_review = reviews[0]
sel2 = Selector(text=first_review.get_attribute('innerHTML'))
rating = sel2.css('.rating-other-user-rating span::text').extract_first().strip()
Similarly, we can use the code below to extract the other fields of that review.
review = sel2.css('.text.show-more__control::text').extract_first().strip()
review_date = sel2.css('.review-date::text').extract_first().strip()
author = sel2.css('.display-name-link a::text').extract_first().strip()
review_title = sel2.css('a.title::text').extract_first().strip()
review_url = sel2.css('a.title::attr(href)').extract_first().strip()
helpfulness = sel2.css('.actions.text-muted::text').extract_first().strip()

print('\nRating:', rating)
print('\nreview_title:', review_title)
print('\nAuthor:', author)
print('\nreview_date:', review_date)
print('\nreview:', review)
print('\nhelpfulness:', helpfulness)
It seems our code is doing a good job of extracting the review information.
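One caveat when reusing these calls on other reviews: `extract_first()` returns `None` when a selector matches nothing, so chaining `.strip()` directly raises `AttributeError` for a review that lacks a field (for example, one posted without a rating). A minimal sketch of a guard follows; `clean_text` is a hypothetical helper, not a Scrapy function.

```python
def clean_text(value, default=None):
    """Strip whitespace from value, or return default when value is None."""
    if value is None:
        return default
    return value.strip()


print(repr(clean_text('  10  ')))
print(repr(clean_text(None)))
print(repr(clean_text(None, default='')))
```

It would wrap each extraction, for instance `rating = clean_text(sel2.css('.rating-other-user-rating span::text').extract_first())`.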
rating_list = []
review_date_list = []
review_title_list = []
author_list = []
review_list = []
review_url_list = []
error_url_list = []
error_msg_list = []

reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')

for d in tqdm(reviews):
    try:
        # Parse the HTML of one review container.
        sel2 = Selector(text=d.get_attribute('innerHTML'))
        try:
            rating = sel2.css('.rating-other-user-rating span::text').extract_first()
        except Exception:
            rating = np.nan
        try:
            review = sel2.css('.text.show-more__control::text').extract_first()
        except Exception:
            review = np.nan
        try:
            review_date = sel2.css('.review-date::text').extract_first()
        except Exception:
            review_date = np.nan
        try:
            author = sel2.css('.display-name-link a::text').extract_first()
        except Exception:
            author = np.nan
        try:
            review_title = sel2.css('a.title::text').extract_first()
        except Exception:
            review_title = np.nan
        try:
            review_url = sel2.css('a.title::attr(href)').extract_first()
        except Exception:
            review_url = np.nan

        # Append all fields together so the lists stay the same length.
        rating_list.append(rating)
        review_date_list.append(review_date)
        review_title_list.append(review_title)
        author_list.append(author)
        review_list.append(review)
        review_url_list.append(review_url)
    except Exception as e:
        # Record the page URL and the error for any review that could not be parsed.
        error_url_list.append(url)
        error_msg_list.append(e)

review_df = pd.DataFrame({
    'Review_Date': review_date_list,
    'Author': author_list,
    'Rating': rating_list,
    'Review_Title': review_title_list,
    'Review': review_list,
    'Review_Url': review_url_list
})
Voila, we have successfully scraped all the IMDB reviews for a particular movie.
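The scraped ratings arrive as text, and a review without a rating yields `None` from `extract_first()`. As a possible follow-up step, the sketch below converts such values to numbers. It assumes the rating text is a plain number such as '8'; `to_rating` is a hypothetical helper name.

```python
import math


def to_rating(value):
    """Convert a scraped rating such as ' 8 ' to a float; NaN if missing or not a number."""
    if value is None:
        return math.nan
    try:
        return float(value.strip())
    except ValueError:
        return math.nan


ratings = [to_rating(v) for v in ['10', ' 8 ', None, 'n/a']]
print(ratings)
```

Applied to the DataFrame, this could look like `review_df['Rating'] = review_df['Rating'].map(to_rating)`.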
In this article, we learned the importance of scraping IMDB reviews. We configured and used Selenium to visit the IMDB page for the Harry Potter movie and loaded all the reviews. Then we passed the entire HTML page to a Scrapy Selector and extracted the relevant information. Some of the key takeaways from the article are below:
I hope you liked my article on scraping IMDB reviews. Share your feedback with me in the comments section below.
Feel free to connect with me on LinkedIn if you want to discuss this with me.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.