The Talent500 Blog

Web Scraping using Selenium and Python


 

What is Web Scraping?

Use of Selenium and Python in Web Scraping

Implementation with Demo Code

Installation of Selenium & Packages

Open Specific Search Page

Write a function to take the cursor to the end of the page

Write a function to fetch the URL of each Image

Write a function to download Image

Function to save Image in the Destination directory

Conclusion

 

What is Web Scraping?

In simple words, web scraping is the automated gathering of content and data from websites or any other resource available on the internet. Most often, the data is fetched in an unstructured format and then needs to be converted into structured data that can be utilised further.

Most tech giants like Google and Facebook have APIs that expose their data in a structured format. But that is not the case for all sites; in fact, many large websites still won't let you fetch their data in a structured form. That's where scraping comes in. There are many ways to scrape a website for the required structured data; in this post we will discuss one of the most popular: UI automation.
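As a minimal illustration of what "unstructured in, structured out" means, here is a sketch using only Python's standard-library HTML parser. The HTML snippet is made up for the example, not taken from a real site:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects {"href", "text"} records from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []           # structured output accumulates here
        self._current_href = None
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
    def handle_data(self, data):
        if self._current_href is not None:
            self.links.append({"href": self._current_href, "text": data.strip()})
            self._current_href = None

# Unstructured markup in, a structured list of dicts out
html = '<p>Jobs: <a href="/jobs">Browse</a> or <a href="/signup">Sign up</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```

This works only for static markup, which is exactly why the rest of the post reaches for Selenium: pages that build their content with JavaScript need a real browser to render first.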

 

Use of Selenium and Python in Web Scraping

There are many automation tools and languages available in the market. Before we discuss the reasons for using Selenium with Python, let's take a quick overview of Selenium.

Selenium is an open-source project offering a range of tools and libraries aimed at supporting browser automation. To drive these browsers, Selenium provides an interface called WebDriver, which is useful for performing tasks such as automated testing, cookie retrieval, and taking screenshots. Some common Selenium use cases for web scraping are form submission, auto-login, data addition and deletion, and alert handling.

Now that we have a brief idea of what Selenium does, let's discuss why Python is a natural language to pair with it. Python has libraries for almost every purpose, including tasks such as web scraping, and its concise syntax keeps scraper code short and readable.

 

Implementation with Demo Code

Installation of Selenium & Packages

If you already have Selenium and Python installed, you can jump straight to the code; if not, follow the steps below:

pip install -U selenium

The demo code also relies on a few supporting packages:

pip install webdriver-manager pillow requests pandas

 

Now let's import the packages we need:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import os
import time
import io
import requests
from PIL import Image

 

To start with our scraper code, let's create a Selenium WebDriver object, launch a Chrome browser, and open talent500, just to make sure the installation and WebDriver initialization are working as expected:

# Install driver
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)
driver.get("https://talent500.co/")

 


Let's switch to the directory where we will save our images later.

os.chdir('C:/sishukla/Blog/WebScrapping')

 

Open Specific Search Page

In the step above, we installed a Chrome driver and launched a headless browser for scraping. Now let's replace the talent500 URL in driver.get with the specific search URL from which we will fetch the images:

search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
driver.get(search_url.format(q='Talent500'))

I’ve used this specific URL to scrape copyright-free images.
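The {q} placeholder is filled in with str.format. For a single-word term like Talent500 that is enough, but if you search for multi-word terms, the query should be URL-encoded first. A sketch using the standard library's urllib.parse (the shortened URL and the "remote jobs" query are just illustrative):

```python
from urllib.parse import quote_plus

# Abbreviated version of the Google Images search URL used above
search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en"

def build_search_url(query):
    # quote_plus encodes spaces as '+' and escapes characters like '&'
    # so they cannot break the query string
    return search_url.format(q=quote_plus(query))

print(build_search_url("Talent500"))
print(build_search_url("remote jobs"))  # spaces become '+'
```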

Write a function to take the cursor to the end of the page

def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # sleep between interactions

This snippet scrolls to the bottom of the page, prompting more image results to load.

Write a function to fetch the URL of each Image

def getImageUrls(name, total_images, driver):
    search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
    driver.get(search_url.format(q=name))
    urls = set()
    img_count = 0
    results_start = 0

    while img_count < total_images:  # extract actual images now
        scroll_to_end(driver)

        thumbnail_results = driver.find_elements_by_xpath("//img[contains(@class,'Q4LuWd')]")
        totalResults = len(thumbnail_results)
        print(f"Found: {totalResults} search results. Extracting links from {results_start}:{totalResults}")

        for img in thumbnail_results[results_start:totalResults]:
            img.click()
            time.sleep(2)
            actual_images = driver.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'https' in actual_image.get_attribute('src'):
                    urls.add(actual_image.get_attribute('src'))

            img_count = len(urls)
            if img_count >= total_images:
                print(f"Found: {img_count} image links")
                break
        else:
            # for-else: runs only if the loop above finished without break,
            # i.e. we still need more images, so click "Show more results"
            print("Found:", img_count, "looking for more image links ...")
            load_more_button = driver.find_element_by_css_selector(".mye4qd")
            driver.execute_script("document.querySelector('.mye4qd').click();")
            results_start = len(thumbnail_results)
    return urls

This function returns the set of image URLs collected for a given search term (e.g. cars, horses, etc.).

Write a function to download image

 

def downloadImages(folder_path, file_name, url):
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - COULD NOT DOWNLOAD {url} - {e}")
        return  # nothing to save if the download failed
    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path, file_name)
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SAVED - {url} - AT: {file_path}")
    except Exception as e:
        print(f"ERROR - COULD NOT SAVE {url} - {e}")

 

This snippet of code will download the image from each of the URLs.
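At its core the function is a bytes-to-file round trip: HTTP response body in, file on disk out. Stripped of the PIL re-encoding step, the pattern looks like this (a standalone sketch using dummy bytes and a temp directory, not a real download):

```python
import io
import os
import tempfile

def save_bytes(folder_path, file_name, content):
    """Write raw bytes (e.g. an HTTP response body) to folder_path/file_name."""
    buffer = io.BytesIO(content)   # in-memory file, same object PIL.Image.open accepts
    file_path = os.path.join(folder_path, file_name)
    with open(file_path, "wb") as f:
        f.write(buffer.read())
    return file_path

dest = tempfile.mkdtemp()
path = save_bytes(dest, "0.jpg", b"\xff\xd8\xff\xe0 fake jpeg bytes")
print(os.path.getsize(path))
```

In the real function, PIL sits between the two halves so that whatever format the server returned is normalised to RGB JPEG before writing.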

Function to save Image in the Destination directory

The function below saves the images at a specific location for further analysis.

 

def saveInDestFolder(searchNames, destDir, total_images, driver):
    for name in list(searchNames):
        path = os.path.join(destDir, name)
        if not os.path.isdir(path):
            os.makedirs(path)  # also creates destDir itself if it does not exist yet
        print('Current Path', path)
        totalLinks = getImageUrls(name, total_images, driver)
        print('totalLinks', totalLinks)
        if totalLinks is None:
            print('images not found for :', name)
            continue
        for i, link in enumerate(totalLinks):
            file_name = f"{i:03d}.jpg"  # zero-padded index keeps files sorted
            downloadImages(path, file_name, link)

searchNames = ['Talent500', 'Jobs']
destDir = './Dataset2/'
total_images = 5

saveInDestFolder(searchNames, destDir, total_images, driver)
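The folder layout this produces is one sub-directory per search term under the destination directory. The directory-handling part can be isolated and tested without a browser; a sketch using os.makedirs with exist_ok (writing under a temp directory so it is safe to run anywhere):

```python
import os
import tempfile

def prepare_folders(dest_dir, names):
    """Create one sub-folder per search term, mirroring saveInDestFolder's layout."""
    paths = {}
    for name in names:
        path = os.path.join(dest_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths[name] = path
    return paths

dest = os.path.join(tempfile.mkdtemp(), "Dataset2")
folders = prepare_folders(dest, ["Talent500", "Jobs"])
print(sorted(folders))  # ['Jobs', 'Talent500']
```

Using exist_ok=True makes the scraper re-runnable: a second run with the same terms reuses the existing folders instead of failing.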

 

Conclusion

In today's world, data is king, and scraped information can be used in areas ranging from product enhancement to market analysis. The most important applications of web scraping are research and analysis, monitoring, machine learning, and marketing.

The data fetched using Selenium can be fed straight into further analysis. Selenium also gives us the upper hand here: it can reach many resources easily, and since it is open source, there is no additional licensing cost to invest in.

Sidharth Shukla


Currently working as an SDET, Sidharth is an automation enabler who provides solutions that mitigate quality risk. He is passionate about technical writing and contributing to the QA community.
