Web Scraping using Selenium and Python
Use of Selenium and Python in Web Scraping
Installation of Selenium & Packages
Write a function to take the cursor to the end of the page
Write a function to fetch the URL of each Image
Write a function to download Image
Function to save Image in the Destination directory
What is Web Scraping?
In simple words, web scraping is the automated gathering of content and data from websites or any other resource available on the internet. Most often, the data is fetched in an unstructured format and then needs to be converted into structured data that can be put to further use.
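To make the unstructured-to-structured step concrete, here is a minimal sketch that uses only Python's standard library. The HTML fragment, the `job` class name, and the `JobTitleParser` and `extract_jobs` names are all invented for illustration; a real page would need its own selectors.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the markup and class names are made up for this example.
RAW_HTML = """
<ul>
  <li class="job"><a href="/jobs/1">Backend Engineer</a></li>
  <li class="job"><a href="/jobs/2">Data Analyst</a></li>
  <li class="ad"><a href="/promo">Sponsored link</a></li>
</ul>
"""


class JobTitleParser(HTMLParser):
    """Collects {'title', 'url'} records from <li class="job"> items."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._in_job = False      # True while inside an <li class="job">
        self._current_url = None  # href of the <a> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "job":
            self._in_job = True
        elif tag == "a" and self._in_job:
            self._current_url = attrs.get("href")

    def handle_data(self, data):
        # Only text inside a job link becomes a record.
        if self._current_url is not None and data.strip():
            self.records.append({"title": data.strip(), "url": self._current_url})

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_url = None
        elif tag == "li":
            self._in_job = False


def extract_jobs(html):
    """Turn raw HTML text into a list of structured records."""
    parser = JobTitleParser()
    parser.feed(html)
    return parser.records


if __name__ == "__main__":
    for record in extract_jobs(RAW_HTML):
        print(record)
```

The raw markup goes in as one string and comes out as a list of dictionaries, which is the shape that tools such as pandas can consume directly.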
Tech giants like Google and Facebook now have APIs that allow access to data in a structured format. But that is not the case for all sites; in fact, many large websites still do not let you fetch their data in a structured format. That’s where scraping comes in. There are many ways to scrape a website for the required data; in this post we will discuss one of the most popular: UI automation.
Use of Selenium and Python in Web Scraping
There are many automation tools and languages available in the market. Before we discuss the reason for using Selenium with Python, let’s take a quick look at Selenium.
Selenium is an open source project comprising a range of tools and libraries aimed at supporting browser automation. To drive the browser and extract data from it, Selenium provides an interface called WebDriver, which is useful for tasks such as automated testing, cookie retrieval and taking screenshots. Some common Selenium use cases for web scraping are form submission, auto-login, data addition and deletion, and alert handling.
Now we have a brief idea of what Selenium is used for, but why pair it with Python? Selenium provides official Python bindings, and Python has libraries for almost every purpose, including tasks such as web scraping.
Implementation with Demo Code
Installation of Selenium & Packages
If you already have Selenium and Python installed, you can jump straight to the code. If you don’t, please follow the steps below:
pip install -U selenium
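Now let’s import the supporting packages. webdriver_manager, Pillow (PIL), requests and pandas are third-party packages, so install them with pip first (pip install webdriver-manager pillow requests pandas) if they are missing: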
import io
import os
import time

import pandas as pd
import requests
import selenium
from PIL import Image
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
To start with our scraper code, let’s create a Selenium WebDriver object, launch a Chrome browser and open talent500, just to make sure the installation and WebDriver initialization work as expected:
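# Install the driver and start Chrome without a visible window
from selenium.webdriver.chrome.service import Service

opts = webdriver.ChromeOptions()
opts.add_argument("--headless")  # Selenium 4 replacement for opts.headless = True
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=opts)
driver.get("https://talent500.co/")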
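A browser started this way keeps running until it is told to quit, so it can help to wrap the driver in a context manager. The sketch below is one way to do that, not part of the original script: `managed_driver` and `FakeDriver` are hypothetical names, and `FakeDriver` only stands in for a real WebDriver so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Yield a driver built by `create_driver` and always call quit() afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        driver.quit()  # closes the browser even if the scraping code raised an error


class FakeDriver:
    """Stand-in for a Selenium WebDriver (assumption: only get() and quit() are needed)."""

    def __init__(self):
        self.visited = []
        self.closed = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


if __name__ == "__main__":
    with managed_driver(FakeDriver) as driver:
        driver.get("https://talent500.co/")
    print("browser closed:", driver.closed)
```

With a real browser, the factory passed in would be a small function or lambda that builds the Chrome driver with the options shown above.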
Let’s switch to the directory where we will save our images later.
os.chdir('C:/sishukla/Blog/WebScrapping')
Open Specific Search Page
In the previous step, we installed a Chrome driver and started a headless browser for web scraping. Now let’s point driver.get at a specific search URL instead of talent500; this is where we will fetch the images from:
search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
driver.get(search_url.format(q='Talent500'))
I’ve used this specific URL to scrape copyright-free images: its tbs=sur%3Afc parameter filters the results by usage rights.
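Formatting the search term straight into the URL works for a single word such as Talent500, but a term containing spaces or an ampersand needs percent-encoding. A minimal sketch with the standard library follows; `build_search_url` is a hypothetical helper, and it keeps only the q, tbm, tbs and hl parameters on the assumption that ved, biw and bih are not required for the search to work.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "https://www.google.com/search"


def build_search_url(query):
    """Return an image-search URL with the query safely percent-encoded."""
    params = {
        "q": query,       # search term, e.g. "Talent500"
        "tbm": "isch",    # image search
        "tbs": "sur:fc",  # usage-rights filter (encoded as sur%3Afc)
        "hl": "en",       # interface language
    }
    return f"{GOOGLE_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("Talent500"))
    print(build_search_url("software jobs"))
```

The result can be passed to driver.get in place of the formatted string.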
Write a function to take the cursor to the end of the page
def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # sleep between interactions
This snippet scrolls to the bottom of the page and then waits five seconds so that lazily loaded thumbnails have time to appear.
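A single scroll loads only one more batch of results. One possible extension, sketched here under the assumption that the page grows as content loads, is to keep scrolling until the page height stops changing. `scroll_until_stable` and `FakePage` are hypothetical names; `FakePage` merely imitates a driver's execute_script so the example runs without a browser.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds


class FakePage:
    """Stand-in driver whose page height grows once per scroll until it runs out."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[self._index]
        # A scroll "loads" the next chunk of the page, if there is one.
        self._index = min(self._index + 1, len(self._heights) - 1)
        return None


if __name__ == "__main__":
    page = FakePage([1000, 2000, 3000])
    print("scrolls needed:", scroll_until_stable(page, pause=0))
```

The max_rounds cap keeps the loop from running forever on pages that load content without end.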
Write a function to fetch the URL of each Image
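from selenium.webdriver.common.by import By

def getImageUrls(name, total_images, driver):
    search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
    driver.get(search_url.format(q=name))
    urls = set()
    img_count = 0
    results_start = 0
    while img_count < total_images:
        # Extract actual images now
        scroll_to_end(driver)
        thumbnail_results = driver.find_elements(By.XPATH, "//img[contains(@class,'Q4LuWd')]")
        totalResults = len(thumbnail_results)
        if totalResults == results_start:
            break  # no new thumbnails were loaded, so stop instead of looping forever
        for img in thumbnail_results[results_start:totalResults]:
            img.click()
            # ASSUMPTION: the published listing omits how the URL is read;
            # here we keep the clicked element's src when it is an http(s) link.
            src = img.get_attribute("src")
            if src and src.startswith("http"):
                urls.add(src)
            img_count = len(urls)
            if img_count >= total_images:
                break
        results_start = totalResults
    return list(urls)
This function returns a list of image URLs for one search term (e.g. cars, horses). The step that reads each URL is missing from the listing as published, so the version above fills it in with the thumbnail’s own src attribute as an assumption.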
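The src values collected from a results page can include empty values, inline data: URIs and duplicates, none of which can be downloaded as files. A small filtering step, sketched below with the standard library, keeps only usable links. `clean_image_urls` is a hypothetical helper and the sample URLs are placeholders.

```python
from urllib.parse import urlparse


def clean_image_urls(candidates, limit):
    """Keep unique http(s) URLs in first-seen order, up to `limit` items."""
    seen = set()
    cleaned = []
    for url in candidates:
        if len(cleaned) >= limit:
            break
        if not url:
            continue  # an element without a src attribute yields None
        if urlparse(url).scheme not in ("http", "https"):
            continue  # skip inline data: URIs and other non-downloadable sources
        if url in seen:
            continue  # skip duplicates
        seen.add(url)
        cleaned.append(url)
    return cleaned


if __name__ == "__main__":
    raw = [
        "https://example.com/a.jpg",
        None,
        "data:image/jpeg;base64,AAAA",
        "https://example.com/a.jpg",
        "https://example.com/b.jpg",
    ]
    print(clean_image_urls(raw, limit=5))
```

Unlike a plain set, this keeps the order in which the images appeared on the page.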
Write a function to download image
def downloadImages(folder_path, file_name, url):
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - COULD NOT DOWNLOAD {url} - {e}")
        return  # nothing to save if the download failed

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path, file_name)
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SAVED - {url} - AT: {file_path}")
    except Exception as e:
        print(f"ERROR - COULD NOT SAVE {url} - {e}")
This snippet downloads the image at the given URL, converts it to RGB and saves it as a JPEG file in the target folder. It is called once for each URL.
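Downloads fail now and then for temporary reasons, so a retry wrapper can be a useful addition. The sketch below is an illustration rather than part of the original script: `fetch_with_retries` is a hypothetical helper that takes the download function as an argument, which lets it run here with a fake fetcher instead of a real network call.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) up to `attempts` times; return its result, or None if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            print(f"attempt {attempt}/{attempts} failed for {url}: {error}")
            if attempt < attempts:
                time.sleep(delay * attempt)  # wait a little longer after each failure
    return None


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_fetch(url):
        # Fake fetcher: fails twice, then returns some bytes.
        calls["count"] += 1
        if calls["count"] < 3:
            raise IOError("temporary failure")
        return b"image-bytes"

    print(fetch_with_retries(flaky_fetch, "https://example.com/a.jpg", delay=0))
```

In the scraper, the fetch argument could be a lambda such as `lambda u: requests.get(u, timeout=10).content`, with the returned bytes handed on to Pillow only when they are not None.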
Function to save Image in the Destination directory
The function below saves the images to a specific location for further analysis.
def saveInDestFolder(searchNames, destDir, total_images, driver):
    for name in list(searchNames):
        path = os.path.join(destDir, name)
        # makedirs also creates destDir itself if it does not exist yet
        os.makedirs(path, exist_ok=True)
        print('Current Path', path)
        totalLinks = getImageUrls(name, total_images, driver)
        print('totalLinks', totalLinks)
        if not totalLinks:
            print('images not found for :', name)
            continue
        else:
            for i, link in enumerate(totalLinks):
                # zero-padded index: 000.jpg, 001.jpg, ...
                file_name = f"{i:03d}.jpg"
                downloadImages(path, file_name, link)


searchNames = ['Talent500', 'Jobs']
destDir = './Dataset2/'
total_images = 5
saveInDestFolder(searchNames, destDir, total_images, driver)
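Folder creation and file naming are easy to get subtly wrong. os.mkdir raises an error when the parent folder is missing, and a format width such as `{i:150}` pads the index with spaces to 150 characters, whereas `{i:03d}` gives 000, 001 and so on. The sketch below shows both points with pathlib; `target_paths` is a hypothetical helper, and it writes into a temporary directory so that it can be run safely.

```python
import tempfile
from pathlib import Path


def target_paths(dest_dir, search_name, count):
    """Create dest_dir/search_name if needed and return `count` zero-padded file paths."""
    folder = Path(dest_dir) / search_name
    folder.mkdir(parents=True, exist_ok=True)  # also creates missing parent folders
    return [folder / f"{i:03d}.jpg" for i in range(count)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # "Dataset2" does not exist yet inside the temporary directory.
        for path in target_paths(Path(tmp) / "Dataset2", "Talent500", 3):
            print(path.name)
```

Zero-padded names also sort correctly in a file browser, which helps when the images are inspected by hand later.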
Conclusion
In today’s world data is king, and the information gathered can be used in areas ranging from product enhancement to market analysis. The most important applications of web scraping are research and analysis, monitoring, machine learning and marketing.
The data fetched using Selenium can be used for further analysis. Selenium also has two practical advantages: plenty of learning resources are readily available, and because it is open source there is no licence cost.