
GitHub secrets scraper (Python)

Updated: May 27, 2024


[Image: web scraper artwork, from https://realpython.com/]


Importing Selenium


First things first, we need to import the necessary functions and objects for our project.

import time #1
import selenium #2
from selenium import webdriver #3
from selenium.webdriver.chrome.options import Options #4
from selenium.webdriver.common.by import By #5

#1 We import time so that we can pause the script at certain points, both to give pages a moment to load and to watch what the driver is doing when we need to understand and remediate issues.


#2 The bare 'import selenium' line is not actually needed, and PyCharm will flag it as unused. Just click Optimize Imports and PyCharm will sort itself out and remove the import selenium line (the full script at the end omits it).

[Image: PyCharm warning about the unused 'import selenium' statement.]

#3 We need to import webdriver, the core Selenium module for web automation. This will be the machine that does all our work in the background.


#4 We need to import Options, the class used to configure the Chrome browser. In recent Selenium releases (4.6 and later), Selenium Manager automatically detects and downloads the web driver that is required for our system and configures it accordingly, so we no longer have to supply a driver path ourselves.


#5 And finally, we must import 'By'. We will need this for grabbing elements by their class names from the source code.
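To make the Options import concrete, here is a minimal sketch (separate from the tutorial script) of how it is typically used, for example to run Chrome headless so no window pops up. It assumes Selenium 4.6 or later, so that Selenium Manager fetches a matching ChromeDriver automatically.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a visible window

# Selenium Manager (bundled with Selenium 4.6+) locates or downloads a matching
# ChromeDriver automatically, so no executable path is needed here.
driver = webdriver.Chrome(options=options)
driver.get("https://github.com")
print(driver.title)
driver.quit()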

Setting variables


Type ➡️

scraper = input("which webpage would you like to scrape for credentials?") #1

option = webdriver.ChromeOptions() #2
driver = webdriver.Chrome(options=option) #3
options = Options() #4

driver.get(f"{scraper}") #5
repo = driver.find_elements(By.CLASS_NAME, "repo") #6
time.sleep(1) #7

links = [] #8
complete_link = [] #9

#1 Firstly, we start off with an input statement, which asks the user for the webpage they would like to scrape for credentials. I will name this variable 'scraper'.
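As an optional tweak that is not in the original script, you might strip whitespace and any trailing slash from the input so the links we build later do not end up with double slashes. A small sketch:

scraper = input("which webpage would you like to scrape for credentials? ")
scraper = scraper.strip().rstrip("/")  # remove stray spaces and a trailing slash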


#2/4 Lines 2 to 4 are from this stack overflow page as I was running into an error saying the executable_path was deprecated when I typed this code ➡️

cdp = "C:\Users\shone\OneDrive\Desktop\ChromeSetup.exe"
driver = webdriver.Chrome(executable_path=cdp)

driver.get("https://www.google.com")
driver.quit()

The replacement code simply configures the Chrome web driver with the specified options and initialises it. Note that only 'option' is actually passed to the driver; the separate 'options = Options()' object is never used afterwards.
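For reference, executable_path was deprecated in Selenium 4 and removed in Selenium 4.10. If you do want to point at a specific chromedriver binary, the current approach is a Service object; otherwise Selenium Manager handles it for you. A sketch under those assumptions (the chromedriver path below is just a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Option 1: let Selenium Manager find or download a matching ChromeDriver.
driver = webdriver.Chrome()

# Option 2: point at a specific chromedriver binary via a Service object.
# service = Service(r"C:\path\to\chromedriver.exe")  # placeholder path
# driver = webdriver.Chrome(service=service)

driver.get("https://www.google.com")
driver.quit()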

[Image: the full conversation on Stack Overflow about the deprecated executable_path argument.]

#5 We tell the driver to get the link provided by the user, passed in as a formatted string indicated by the 'f' prefix and curly brackets {}.


#6 Next, I will define a variable 'repo' which uses the driver to find all elements on the webpage with the class name 'repo'. This is the class name found on the GitHub page that we are scraping, so it should be the same for other GitHub pages.
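Since find_elements always returns a list (which may be empty), a quick debugging aid is to print what was matched, using the driver and By imports from above:

repo = driver.find_elements(By.CLASS_NAME, "repo")
print(f"Found {len(repo)} elements with class 'repo'")
for element in repo:
    print(element.text)  # the visible text of each match, e.g. a repository name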



#7 The sleep function allows you to see the web driver grabbing the class name from the website. All we will see, however, is Chrome opening for a split second and closing automatically. It also allows the page to load properly before proceeding.
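If a fixed one-second sleep ever proves too short, an explicit wait is a more robust alternative. A sketch using Selenium's WebDriverWait with the same driver as above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one element with class 'repo' to appear,
# instead of sleeping for a fixed amount of time.
repo = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "repo"))
)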


#8/9 Lines 8 and 9 simply define two empty lists, 'links' and 'complete_link', in which we will store links later in the code.

[Image: the GitHub page we are testing.]

The first function


Type ➡️

def link_looper(next_page): #1
    driver.get(next_page) #2
    page_after_repo = driver.find_elements(By.CLASS_NAME, "js-navigation-open") #3
    if not page_after_repo: #4
        print("Cannot resolve repository from this GitHub page.") #5
    file_names = [a.text for a in page_after_repo] #6
    for name in file_names: #7
        if name.endswith(".py"): #8
            link = f"{next_page}/blob/main/{name}" #9
            credentials(link) #10

#1/2 First we need to define our function 'link_looper' with a single URL argument, 'next_page'. As part of the function definition, we will tell the driver to get that next page. This function is called once for each repository URL (repo_link).


#3 We will assign a new variable 'page_after_repo', which tells the driver to find all elements with the class name 'js-navigation-open'. These are the links that take us to the pages after the repositories page.


#4/5 If the link for the page after the repository cannot be found, Python will print an error statement as shown in the code.


#6-10 We first copy the text of each element into a list of file names; reading the text up front avoids stale-element errors once the driver navigates away. The for loop then goes through these names, if there are any, and for each one that ends with the suffix '.py' we create a variable 'link' that holds the new link as a formatted string and pass it to the credentials function.
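To make the formatted string concrete, here is what the link looks like for a made-up repository and file name (both values are purely illustrative, and note that the script assumes the default branch is called 'main'):

next_page = "https://github.com/example-user/example-repo"  # hypothetical repository URL
name = "settings.py"                                        # hypothetical file name
link = f"{next_page}/blob/main/{name}"
print(link)  # https://github.com/example-user/example-repo/blob/main/settings.py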


[Image: the repository we are testing.]

The second function


Type ➡️

def credentials(page_link): #1
    driver.get(page_link) #2
    html = f"{driver.page_source}" #3
    if "password" in html: #4
        print(f"Credentials found at {page_link}") #5
    else: #6
        print("No credentials were found at this address.") #7

#1 A new function, 'credentials', defines the instructions that we want the script to follow when we reach the final link on the website. We have a single URL argument again, 'page_link'.


#2/3 We first tell the driver to get that final page, then a new variable 'html' stores its page source as a formatted string. If you want to see whether the source code has been copied, you can insert a print statement below it to verify.
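For example, one optional way to do that check, placed inside the function right after the html line:

    html = f"{driver.page_source}"
    print(html[:500])  # show only the first 500 characters so the console stays readable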


#4/5 If the source of the page has the term 'password' anywhere, we print 'Credentials found at {page_link}', where {page_link} will be the link to the Python file we are checking.


#6/7 Otherwise, we will print out that no credentials were found.
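The script only looks for the single word 'password'. A slightly broader, though still very naive, check is sketched below as a possible extension; it is not part of the original script and the keyword list is just an example.

def looks_sensitive(page_source):
    # Naive keyword scan; real secret scanners use far more precise patterns.
    keywords = ("password", "passwd", "api_key", "secret", "token")
    lowered = page_source.lower()
    return any(word in lowered for word in keywords)

# Inside credentials(), you could then write:
# if looks_sensitive(html):
#     print(f"Credentials found at {page_link}")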

[Image: source code for the repositories page.]

Appending to lists


Type ➡️

for i in repo: #1
    links.append(i.text) #2
for repos in links: #3
    repo_link = f"{scraper}/{repos}" #4
    complete_link.append(repo_link) #5
    link_looper(repo_link) #6

The final part of the code is simply appending repository names and links to the empty lists that we made at the start. We also call functions here.


#1/2 For each element found with the 'repo' class, we append its text (the repository name) to the 'links' list.


#3/4 This for loop will loop through each of the repository names that we stored inside the links list and generate a new variable 'repo_link', which combines the link that the user provided with the repository name as a formatted string.


#5 The complete_link list will have the new link that we made appended to it.


#6 We will now call the link_looper function from above for each repo_link, which starts the process of checking for credentials in the Python files of those repositories.
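Putting these steps together with made-up values, this is roughly what the lists end up holding (the user name and repository names below are hypothetical):

scraper = "https://github.com/example-user"   # hypothetical user input
links = ["demo-repo", "another-repo"]         # text of the elements with class 'repo'
complete_link = [f"{scraper}/{name}" for name in links]
print(complete_link)
# ['https://github.com/example-user/demo-repo', 'https://github.com/example-user/another-repo']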

Full script


import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

scraper = input("which webpage would you like to scrape for credentials?")

option = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=option)
options = Options()

driver.get(f"{scraper}")
repo = driver.find_elements(By.CLASS_NAME, "repo")
time.sleep(1)

links = []
complete_link = []

def link_looper(next_page):
    driver.get(next_page)
    page_after_repo = driver.find_elements(By.CLASS_NAME, "js-navigation-open")
    if not page_after_repo:
        print("Cannot resolve repository from this GitHub page.")
    file_names = [a.text for a in page_after_repo]
    for name in file_names:
        if name.endswith(".py"):
            link = f"{next_page}/blob/main/{name}"
            credentials(link)


def credentials(page_link):
    driver.get(page_link)
    html = f"{driver.page_source}"
    if "password" in html:
        print(f"Credentials found at {page_link}")
    else:
        print("No credentials were found at this address.")


for i in repo:
    links.append(i.text)
for repos in links:
    repo_link = f"{scraper}/{repos}"
    complete_link.append(repo_link)
    link_looper(repo_link)
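One small addition you may want at the very end, which is not in the original script, is closing the browser once every repository has been checked:

driver.quit()  # close Chrome and end the WebDriver session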

Thanks for reading!

