simple python crawler

🕷️ Web Crawler

A simple web crawler that recursively follows all links on a specified domain and prints them hierarchically, together with the heading tags (h1–h6) found on each page. The crawler only follows links that use HTTP or HTTPS, stay within the same domain, and have not been crawled before.

🚀 Usage

Running the script

To run the script, use the following command:

python crawler.py <domain>

Replace <domain> with the domain you want to crawl. For example:

python crawler.py example.com
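
Each page is printed as a "- <url>" bullet, its headings as lines with one "#" per heading level, and pages reached from a link are indented two extra spaces. A run against a hypothetical site might print something like this (the URLs and headings below are made up for illustration):

- https://example.com
 # Example Domain
- https://example.com/about
   # About
   ## The team
  - https://example.com/contact
     # Contact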

Building the Docker image

To build the Docker image, run the following command:

docker build -t web-crawler .

Running the Docker container

To run the Docker container, use the following command:

docker run web-crawler <domain>

Replace <domain> with the domain you want to crawl. For example:

docker run web-crawler example.com
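
The container writes the listing to standard output, so it can be redirected to a file like any other command (the output file name below is arbitrary; --rm removes the container once it exits):

docker run --rm web-crawler example.com > example.com-links.txt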

📦 Dependencies

This script requires the following dependencies to be installed:

  • requests
  • beautifulsoup4

To install the dependencies, run the following command:

pip install -r requirements.txt
crawler.py

import sys
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

# URLs that have already been queued for crawling
ALL_LINK = set()


def get_links(url):
    # Make a GET request to the specified URL; a failed request should not
    # abort the whole crawl, so network errors are treated as an empty page
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return [], [(None, None)]

    # Only attempt to parse HTML responses
    content_type = response.headers.get('Content-Type')
    if not content_type or 'html' not in content_type:
        return [], [(None, None)]

    try:
        # Parse the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')
    except Exception:
        return [], [(None, None)]

    # Find all anchor tags in the page and extract their href attributes
    links = []
    for a_tag in soup.find_all('a'):
        link = a_tag.get('href')
        # Ignore anchors that don't have an href attribute
        if link is None:
            continue
        # Ignore links that are not HTTP or HTTPS
        parsed_link = urlparse(link)
        if parsed_link.scheme not in ('http', 'https'):
            continue
        # Ignore links that are not within the same domain
        if parsed_link.netloc != urlparse(url).netloc:
            continue
        # Ignore links that have already been seen
        if link in ALL_LINK:
            continue
        links.append(link)
        ALL_LINK.add(link)

    # Collect the text of all h1-h6 tags together with their level
    headers = []
    for header_level in range(1, 7):
        for header_tag in soup.find_all('h{}'.format(header_level)):
            headers.append((header_level, header_tag.text.strip()))

    return links, headers


def print_links(url, indent=0):
    # Get all the links and headings on the page
    links, headers = get_links(url)
    if indent == 0:
        print('- ' + url)
    # Print the headings, one '#' per heading level
    for header_level, header_text in headers:
        if header_level is None:
            continue
        print(' ' * indent + ' ' + '#' * header_level + ' ' + header_text)
    # Print the links with an indent
    for link in links:
        print(' ' * indent + '- ' + link)
        # Recursively print the links on the linked pages
        print_links(link, indent + 2)


if len(sys.argv) != 2:
    print('Usage: crawler.py <domain>')
    exit()

# Get the domain from the command line arguments
domain = sys.argv[1]

# Start crawling from the HTTPS root of the domain
print_links('https://' + domain)
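
A note on the link filter in get_links(): only absolute http(s) URLs whose host matches the page being crawled are followed, so relative hrefs (whose scheme parses as empty) and schemes such as mailto: are skipped. The standalone sketch below, using made-up hrefs, shows which candidates those same checks would keep:

from urllib.parse import urlparse

base = 'https://example.com/'
candidates = [
    'https://example.com/about',   # same host -> crawled
    'https://other.org/page',      # different host -> skipped
    '/contact',                    # relative href, scheme is '' -> skipped
    'mailto:someone@example.com',  # not http/https -> skipped
]

for href in candidates:
    parsed = urlparse(href)
    keep = (parsed.scheme in ('http', 'https')
            and parsed.netloc == urlparse(base).netloc)
    print('{:30} {}'.format(href, 'crawl' if keep else 'skip'))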

Dockerfile

# Use an official Python runtime as a parent image
FROM python:3.9-slim-buster

# Set the working directory to /app
WORKDIR /app

# Copy the requirements file into the container
COPY requirements.txt .

# Install the required Python packages
RUN pip install --no-cache-dir -r requirements.txt

# Copy the Python script into the container
COPY crawler.py .

# Run the Python script when the container starts
ENTRYPOINT ["python", "crawler.py"]

requirements.txt

beautifulsoup4==4.12.2
bs4==0.0.1
certifi==2022.12.7
charset-normalizer==3.1.0
idna==3.4
requests==2.29.0
soupsieve==2.4.1
urllib3==1.26.15

Setup script

#!/bin/bash
# Create a virtual environment, install the dependencies and run the crawler.
# The domain to crawl is passed through as the first argument.
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
./venv/bin/python3 crawler.py "$1"
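
Assuming the setup script above is saved as setup.sh (the gist does not name the file) and made executable, a complete local run looks like:

chmod +x setup.sh
./setup.sh example.com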