Creating Your First Crawler with Python
A crawler, also known as an indexing robot or spider, is a computer program designed to explore the World Wide Web systematically. Its primary objective is to collect data from web pages automatically.
The crawler typically starts from a single URL, known as a “seed,” from which it retrieves the page’s HTML content. It then extracts the links present on that page and adds them to a list of links to explore later. This process is repeated for each link, allowing a large number of web pages to be explored gradually.
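In pseudocode-like Python, that core loop can be sketched as follows. This is only an illustration of the idea, not the code we will build below; fetch_links is a hypothetical helper that returns the links found on a page.

from collections import deque

def crawl(seed, max_pages=50):
    # Start from the seed, keep a queue of pages to visit and a set of
    # pages already seen, and stop after a fixed number of pages.
    queue = deque([seed])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):  # hypothetical helper, not defined here
            if link not in visited:
                queue.append(link)
    return visited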
The crawler plays a crucial role in web penetration testing, especially during the enumeration phase, when it is vital to gather information about the target scope quickly and effectively. By systematically exploring the pages within that scope, the crawler can rapidly catalog available indexes and endpoints such as URLs, forms, and other application entry points. This automated approach avoids tedious manual searching and helps identify potentially vulnerable areas that deserve specific attention in the later phases of a penetration test, making the crawler a valuable tool during enumeration and test preparation.
Let’s Code
The code below is a Python implementation of a web crawler, or spider, designed to scrape web pages. A breakdown of its functionality follows the listing.
import requests
from bs4 import BeautifulSoup
from threading import Thread
from urllib.parse import urlparse


class SpiderScrapper:
    def __init__(self, url, ssl=True, depth=10) -> None:
        self.ssl = ssl              # whether to verify TLS certificates
        self.depth = depth          # maximum crawl depth (not enforced in this version)
        self.url = url              # seed URL
        self.out = []               # in-scope links discovered
        self.outOfScoop = []        # links outside the seed's domain
        self.checked_url = []       # reserved for visited-URL tracking (unused here)

    def same_domain(self, u1, u2):
        # Two URLs are considered in scope if they share the same host (netloc).
        return urlparse(u1).netloc == urlparse(u2).netloc

    def _get(self, u):
        # Fetch a page and collect every usable link it contains.
        body = requests.get(u, verify=self.ssl)
        if body.status_code == 200:
            html_content = body.content
            soup = BeautifulSoup(html_content, 'html.parser')
            links = soup.find_all('a')
            for link in links:
                href = link.get('href')
                # Skip empty, mailto:, fragment-only and JavaScript links.
                if href is not None and not href.startswith("mailto:") and not href.startswith("#") and "javascript" not in href.lower():
                    if href not in self.out and (not href.startswith("http://") and not href.startswith("https://")):
                        # Relative link: always treated as in scope.
                        self.out.append(href)
                    elif (href.startswith("http://") or href.startswith("https://")) and href not in self.outOfScoop and href not in self.out:
                        if self.same_domain(u1=href, u2=u):
                            self.out.append(href)
                        else:
                            self.outOfScoop.append(href)
        else:
            print(body.status_code)

    def crawler(self):
        # Fetch the seed, then walk the out list as it grows with new links.
        self._get(u=self.url)
        for u in self.out:
            if u.startswith("https://") or u.startswith("http://"):
                self._get(u=u)
            elif u.startswith("/"):
                self._get(u=self.url.rstrip("/") + u)
            else:
                self._get(u=self.url + u)

    def start(self):
        self.crawler()
        return self.out, self.outOfScoop
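As a quick usage sketch (the target URL below is only a placeholder; point the crawler at a host you are authorized to test):

if __name__ == "__main__":
    spider = SpiderScrapper(url="https://example.com/", ssl=True, depth=10)
    in_scope, out_of_scope = spider.start()
    print(f"{len(in_scope)} in-scope links, {len(out_of_scope)} out-of-scope links")
    for link in in_scope:
        print(link)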
The code begins by importing the necessary libraries: requests for making HTTP requests, BeautifulSoup for parsing HTML content, Thread for multithreading (imported here, although this basic version does not use it yet), and urlparse for parsing URLs.
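For context, urlparse is what lets the crawler compare hostnames: its netloc attribute holds the host portion of a URL, which is exactly what same_domain compares. The URLs below are just examples:

from urllib.parse import urlparse

print(urlparse("https://example.com/login?next=/home").netloc)   # example.com
print(urlparse("https://cdn.example.org/assets/app.js").netloc)  # cdn.example.org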
Next, a class called SpiderScrapper is defined, which serves as the main crawler implementation. Its initializer (__init__) takes the starting URL, an SSL-verification flag, and a crawl depth as parameters.
The class includes several methods:
- same_domain: checks whether two URLs belong to the same domain by comparing their netloc components.
- _get: this private method makes an HTTP GET request to a given URL (u) and extracts the relevant information from the HTML content. It uses the requests library to retrieve the web page and BeautifulSoup to parse it, collects all the <a> tags, and filters out unwanted links (mailto links, fragment-only anchors, JavaScript links). Valid links are appended either to the out list or to the outOfScoop list, depending on whether they belong to the same domain as the starting URL (see the short illustration after this list).
- crawler: starts the crawling process. It first calls _get on the starting URL, then iterates through the out list, resolving relative links against the seed URL and calling _get on each entry, which in turn appends newly discovered links to out.
- start: kicks off the crawl by calling crawler and returns two lists: out (links within the same domain as the starting URL) and outOfScoop (links outside that domain).
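To make the link-handling rules concrete, here is a small standalone illustration (with placeholder URLs) of how the three kinds of href values are treated:

from urllib.parse import urlparse

base = "https://example.com/"  # the seed URL (placeholder)
hrefs = ["/admin", "https://example.com/login", "https://other-site.com/page"]

for href in hrefs:
    if not href.startswith(("http://", "https://")):
        # Relative link: resolved against the seed URL before it is fetched.
        print("in scope (relative):", base.rstrip("/") + href)
    elif urlparse(href).netloc == urlparse(base).netloc:
        # Absolute URL on the same host: kept in the out list and crawled.
        print("in scope (absolute):", href)
    else:
        # Different host: recorded in outOfScoop but never fetched.
        print("out of scope:", href)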
Overall, this code sets up a basic web crawler that starts from a given URL, explores the pages it can reach, and collects the discovered links for further analysis or processing. Note that although Thread is imported and a depth parameter is accepted, this version crawls sequentially and does not yet enforce a depth limit; both are natural extensions.
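For example, one possible extension (a rough sketch, not part of the original class) is to use Thread to fetch a batch of discovered links in parallel:

from threading import Thread

def fetch_batch(spider, urls):
    # Spawn one thread per URL and wait for all of them to finish.
    # Note: the list appends inside SpiderScrapper are not synchronized,
    # so a production version should also protect self.out with a lock.
    threads = [Thread(target=spider._get, kwargs={"u": u}) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()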
For more, see here!