Creating Your First Crawler with Python

Ghile MAHLEB
3 min read · Jul 3, 2023


A crawler, also known as an indexing robot or spider, is a computer program designed to systematically explore the World Wide Web (WWW). Its primary objective is to collect data from web pages automatically.

The crawler typically starts with a starting URL, known as a “seed,” from which it retrieves the corresponding HTML content of the page. It then extracts the links present on that page and adds them to a list of links to be explored later. This process is repeated for each link, allowing for the gradual exploration of a large number of web pages.
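
As a rough illustration of that loop, independent of the full class developed below, the core idea looks something like this (extract_links is a hypothetical helper standing in for the HTML-fetching and parsing step):

from collections import deque

def crawl(seed, max_pages=50):
    # Breadth-first exploration starting from the seed URL
    to_visit = deque([seed])          # links waiting to be explored
    visited = []                      # pages already fetched, in crawl order

    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        if url in visited:
            continue
        visited.append(url)
        # extract_links is a hypothetical helper that fetches the page
        # and returns the href values of its <a> tags
        for link in extract_links(url):
            if link not in visited:
                to_visit.append(link)
    return visited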

The crawler plays a crucial role in web penetration testing, especially in the enumeration phase. When conducting a penetration test, it is vital to gather information about the target scope quickly and effectively, including the directory indexes and endpoints that are exposed. A crawler automates this process by systematically exploring the specified scope: by collecting data from accessible web pages, it can rapidly catalog URLs, forms, application entry points, and more. This automated approach saves the time and effort of tedious manual searching and helps identify potentially vulnerable areas that deserve specific attention in the later phases of the test. In short, the crawler is a valuable tool for security professionals during enumeration and preparation for web penetration testing.

Let’s Code

The code below is a Python implementation of a web crawler, or spider, designed for scraping web pages. The full listing comes first, followed by a breakdown of its functionality:

import requests
from bs4 import BeautifulSoup
from threading import Thread  # imported with multithreading in mind; not used in this simple version
from urllib.parse import urlparse, urljoin


class SpiderScrapper:
    def __init__(self, url, ssl=True, depth=10) -> None:
        self.ssl = ssl              # verify TLS certificates when True
        self.depth = depth          # maximum number of pages to fetch
        self.url = url              # seed URL the crawl starts from
        self.out = []               # in-scope links discovered so far
        self.outOfScoop = []        # links pointing outside the seed domain
        self.checked_url = []       # URLs already fetched, to avoid duplicate requests

    def same_domain(self, u1, u2):
        # Two URLs are considered in scope when their network locations (host:port) match
        return urlparse(u1).netloc == urlparse(u2).netloc

    def _get(self, u):
        # Fetch one page, parse it, and sort its links into out / outOfScoop
        if u in self.checked_url:
            return
        self.checked_url.append(u)

        try:
            body = requests.get(u, verify=self.ssl, timeout=10)
        except requests.RequestException as exc:
            print(f"Request to {u} failed: {exc}")
            return

        if body.status_code == 200:
            soup = BeautifulSoup(body.content, 'html.parser')
            for link in soup.find_all('a'):
                href = link.get('href')
                # Ignore empty, mailto:, fragment-only and javascript: links
                if href is None or href.startswith("mailto:") or href.startswith("#") or "javascript" in href.lower():
                    continue

                if not href.startswith(("http://", "https://")):
                    # Relative link: always in scope
                    if href not in self.out:
                        self.out.append(href)
                elif href not in self.outOfScoop and href not in self.out:
                    # Absolute link: keep it only if it stays on the same domain
                    if self.same_domain(u1=href, u2=u):
                        self.out.append(href)
                    else:
                        self.outOfScoop.append(href)
        else:
            print(f"Skipping {u}: HTTP {body.status_code}")

    def crawler(self):
        self._get(u=self.url)

        # self.out grows while we iterate, so links discovered along the way are
        # crawled too, until no new links appear or the depth limit is reached
        for u in self.out:
            if len(self.checked_url) >= self.depth:
                break
            if u.startswith(("http://", "https://")):
                self._get(u=u)
            else:
                # Resolve relative links against the seed URL
                self._get(u=urljoin(self.url, u))

    def start(self):
        self.crawler()
        return self.out, self.outOfScoop

The code begins by importing the necessary libraries: requests for making HTTP requests, BeautifulSoup for parsing HTML content, Thread (imported with multithreading in mind, though it is not used in this simple version), and urlparse/urljoin for parsing and joining URLs.

Next, a class called SpiderScrapper is defined, which serves as the main crawler implementation. Its initializer method (__init__) takes the starting URL, an ssl flag that controls certificate verification, and a depth limit on how many pages to crawl.

The class includes several methods:

  1. same_domain: This method checks whether two URLs belong to the same domain by comparing the netloc component returned by urlparse (a quick illustration follows this list).
  2. _get: This private method is responsible for making an HTTP GET request to a given URL (u) and extracting relevant information from the HTML content. It uses the requests library to retrieve the webpage and the BeautifulSoup library to parse the HTML content. It extracts all the <a> tags from the HTML and filters out unwanted links (e.g., mailto links, anchor tags, JavaScript links). The valid links are then appended to the out list or the outOfScoop list, based on whether they belong to the same domain as the starting URL.
  3. crawler: This method starts the crawling process. It calls the _get method with the starting URL, then iterates through the out list, retrieves content from the newly discovered links by calling _get again, and keeps appending new links to the out list until no new links appear or the depth limit is reached.
  4. start: This method initiates the crawling process by calling the crawler method. It returns two lists: out (containing links within the same domain as the starting URL) and outOfScoop (containing links outside the initial domain).
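
To make the same_domain check concrete, here is roughly how the urlparse comparison behaves (the example URLs below are arbitrary placeholders):

from urllib.parse import urlparse

# urlparse(...).netloc isolates the host (and port) of a URL
print(urlparse("https://example.com/login").netloc)        # example.com
print(urlparse("https://example.com/admin/panel").netloc)  # example.com
print(urlparse("https://cdn.example.org/app.js").netloc)   # cdn.example.org

# So same_domain("https://example.com/login", "https://example.com/admin/panel") is True,
# while same_domain("https://example.com/login", "https://cdn.example.org/app.js") is False.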

Overall, this code sets up a basic web crawler that explores web pages starting from a given URL, follows the links it discovers up to the specified depth, and collects them for further analysis or processing. The Thread import hints at multithreading as a way to speed up the crawl, but this simple version fetches pages sequentially.
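
Here is a minimal usage sketch for the class itself (the target URL is a placeholder; only crawl hosts you are authorized to test):

# Minimal usage sketch (the target URL is a placeholder)
spider = SpiderScrapper(url="https://example.com", ssl=True, depth=20)
in_scope, out_of_scope = spider.start()

print("In-scope links:")
for link in in_scope:
    print("  ", link)

print("Out-of-scope links:")
for link in out_of_scope:
    print("  ", link)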

For more, see here!
