
Selenium Webscraper (Python)




What is Selenium?

Selenium is an open-source browser automation framework, most often used to run automated tests against web applications in different browsers.


One drawback is that Selenium can only automate web applications, so native mobile and desktop apps are not supported. It does, however, run on Linux, Mac and Windows.


Browsers that are supported include ➡️ Safari, Chrome, Opera and Firefox, and tests can be coded in a plethora of languages, including Python, Java and Ruby through the official bindings, and PHP and Perl through community-maintained ones.

Downloading Chromedriver


First, we have to download either Chrome or Edge. Any Chromium-based browser should work, as long as you use the driver built for that browser.


Then download the webdriver for the version of browser that you have.


To check your browser version, enter chrome://version/ in the address bar.

Image showing Chrome version.
Chrome version page.
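As a quick sanity check before downloading, the browser and driver versions can be compared by their major version number. This is only a rule-of-thumb sketch: the helper names and the version strings below are made up for illustration, so substitute the values from your own chrome://version/ page and download page.

```python
def major_version(version: str) -> int:
    """Return the major version number from a dotted version string."""
    return int(version.strip().split(".")[0])


def driver_matches_browser(browser_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Hypothetical version strings, for illustration only.
print(driver_matches_browser("118.0.5993.89", "118.0.5993.70"))   # True
print(driver_matches_browser("118.0.5993.89", "117.0.5938.149"))  # False
```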

I will download the chromedriver for win64. Just copy and paste the URL into the address bar and it will download.

Image showing webdriver download page for Chrome.
Stable download versions for Chrome.

Unzip the folder.

Image showing files.
Unzipping files.

Importing webdriver


Run the following, with the path of your chromedriver executable.

from selenium import webdriver

chromeDriver = "C:/Users/shone/OneDrive/Documents/chromedriver-win64/chromedriver-win64/chromedriver.exe"

driver = webdriver.Chrome(executable_path=chromeDriver)

driver.get('https://www.google.com')
driver.quit()

If you get this error, it means the executable_path argument is no longer accepted. It was deprecated earlier in Selenium 4 and removed in Selenium 4.10.0.

Image showing code error.
Executable path argument deprecated error.

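To resolve this issue, we now pass an options object instead. If an explicit driver path is still needed, it goes in a Service object rather than directly in webdriver.Chrome().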
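Here is a minimal sketch of the two Selenium 4 ways of starting Chrome. The function name make_chrome_driver and its driver_path parameter are my own for illustration. The Selenium imports sit inside the function so that the file can be loaded even where Selenium is not installed.

```python
def make_chrome_driver(driver_path=None):
    """Start Chrome, optionally with an explicit chromedriver path."""
    # Imported here so this module loads even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    if driver_path is None:
        # No path given: let Selenium locate a suitable driver by itself.
        return webdriver.Chrome(options=options)
    # Explicit path: it is passed through Service, not as executable_path on Chrome().
    return webdriver.Chrome(service=Service(executable_path=driver_path), options=options)
```

Calling make_chrome_driver() with no argument leaves driver discovery to Selenium, while make_chrome_driver("path/to/chromedriver.exe") keeps the explicit path from the earlier script.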


Since the path I specified wasn’t working, I relied on Selenium Manager instead. It ships with recent Selenium releases (4.6 and later), so installing or upgrading Selenium in the terminal is enough.


Selenium Manager detects your browser version and obtains the matching driver executable automatically, so you can remove the executable_path argument from the code. Type ➡️

 pip install selenium 
Image showing code.
Installing Selenium webdriver manager.

Type the following ➡️

from selenium import webdriver

# Create the options object and pass it to Chrome; no driver path is needed.
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

driver.get('https://www.google.com')
driver.quit()

Running the script now will open google.com for a second and close the window by itself.

Image showing code.
Selenium opens Chrome.

Now that we have the webdriver itself working, we can proceed to the webscraping part which utilises the webdriver to retrieve information that we want.

Webscraping Amazon item price


If we want to find the price of an item on Amazon, we can locate it through the CSS classes on the page's HTML elements.


Open Amazon and find a product. I have found this Ring doorbell.

Image showing Amazon webpage.
Amazon webpage.

Highlight the price and click inspect ➡️

Image showing website source code.
Inspect element.

Under the driver.get command, you can type

price = driver.find_element_by_class_name()

But this is the old form. The find_element_by_* methods were deprecated in Selenium 4 and removed in Selenium 4.3.0.


The new version is ➡️

price = driver.find_element(By.CLASS_NAME, "a-price aok-align-center reinventPricePriceToPayMargin priceToPay")

This is the class I found in the Amazon inspect page.


Next, we need to change the URL passed to driver.get to the URL of the product page ➡️

driver.get(link)
Image showing the Driver.get command.
Driver.get command.

We might get an error with the ‘By’ part. Just import By ➡️

from selenium.webdriver.common.by import By
Image showing code.
'By' keyword throwing error.

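Running the script throws a NoSuchElementException. By.CLASS_NAME expects a single class name, so the space-separated list of classes I passed does not match the element.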
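When an element carries several classes, one option is to join them into a single CSS selector and search with By.CSS_SELECTOR instead. Below is a small sketch of that idea. The helper name classes_to_css is my own, and it is not what I ended up using in this blog.

```python
def classes_to_css(class_attr: str) -> str:
    """Turn a space-separated class attribute into a compound CSS selector."""
    return "".join("." + name for name in class_attr.split())


selector = classes_to_css(
    "a-price aok-align-center reinventPricePriceToPayMargin priceToPay"
)
print(selector)
# .a-price.aok-align-center.reinventPricePriceToPayMargin.priceToPay
```

The result could then be passed as driver.find_element(By.CSS_SELECTOR, selector). Even with a valid selector, the element may still be missing if Amazon serves different markup or a verification page.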


I tried a few other classes until I found one that I could use.

This class allowed me to get the price of the doorbell camera.

Image showing code.
New class name.
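The text that Selenium returns for the element is a string, so if we want a number later on (for example, to compare against a target price), it needs converting. Here is a rough sketch, assuming the text contains the price on one line with a decimal point. The helper name parse_price and the sample strings are invented for illustration.

```python
import re
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Extract the first number from a price string such as '£59.99'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    # Drop thousands separators before converting.
    return Decimal(match.group().replace(",", ""))


# Hypothetical price strings, not values taken from the Amazon page.
print(parse_price("£59.99"))     # 59.99
print(parse_price("$1,299.00"))  # 1299.00
```

Decimal is used rather than float so that prices compare exactly.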

Note >>


I have commented out the line that says ➡️

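options.page_load_strategy = 'none'

as, with this setting, driver.get() returns without waiting for the page to load, so the script moved on and closed the window before the elements had loaded.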


I also added

driver.implicitly_wait(10) 

as this makes find_element wait up to 10 seconds for the element to appear before giving up. I needed this long because I was asked to verify that I was human, and I only got the price after I verified myself.

Image showing Amazon page.
Amazon page opened automatically.
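An alternative to the implicit wait is an explicit wait, which waits for one specific element rather than changing the behaviour of every lookup. This is a sketch of that alternative, not the code I ran for this blog. The function name wait_for_element is my own and the CSS selector you pass in is whatever you found in the inspect page.

```python
def wait_for_element(driver, css_selector, timeout=10):
    """Wait up to `timeout` seconds for an element matching css_selector."""
    # Imported here so this module loads even without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Polls the page until the element is present, or raises TimeoutException.
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )
```

If the element never appears, for instance because the verification page is still showing, the call raises a TimeoutException after the timeout instead of returning.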

Automating the scan


Now say we want to check the price of this item every 5 minutes, to see if it has dropped below a certain value.


We can do this by putting our current code into a function, and simply calling it every 5 minutes.


In this example, I will set the sleep time to 5 seconds to demonstrate it.


Create a while loop that runs our function, called five_seconds().


Simply put the code that we wrote for the item price into the while loop, and set the loop condition to True so that the loop runs indefinitely.

Image showing code.
Creating a while loop.

Every five seconds, our script will run and print the value of the item. I let it run 3 times before interrupting it.
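The same idea can be sketched with the price lookup passed in as a function, so the loop can be tried out without opening a browser. This is not the code from the screenshot: watch_price, its parameters and the fake prices are all made up for illustration, and unlike the loop above it stops once the price drops below the threshold.

```python
import time


def watch_price(get_price, threshold, interval_seconds=300, max_checks=None,
                sleep=time.sleep):
    """Call get_price() repeatedly and return the first price below threshold.

    Returns None if max_checks is reached without the price dropping.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        price = get_price()
        print(f"Current price: {price}")
        if price < threshold:
            return price
        checks += 1
        # Only sleep if another check is going to happen.
        if max_checks is None or checks < max_checks:
            sleep(interval_seconds)
    return None


# Fake prices stand in for the Selenium lookup in this demonstration.
fake_prices = iter([60, 55, 45])
result = watch_price(lambda: next(fake_prices), threshold=50, interval_seconds=0)
print(result)  # 45
```

In the real script, get_price would be the function that opens the page and reads the price, and interval_seconds would be 300 for a check every 5 minutes.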

What's next?


To learn about how to hunt for passwords in GitHub repositories using Selenium, check out this blog.

