Skip to content

Instantly share code, notes, and snippets.

@vhsu
Created August 13, 2024 13:15
Show Gist options
  • Save vhsu/68b29765175298ff71a81c5fa9f7bf61 to your computer and use it in GitHub Desktop.
Save vhsu/68b29765175298ff71a81c5fa9f7bf61 to your computer and use it in GitHub Desktop.
Script to Remove Duplicate Images from a Folder This script identifies and removes duplicate images within a specified folder. It uses perceptual hashing to generate a unique fingerprint for each image, allowing it to detect duplicates even if the files have different names or formats.
"""
Script to Remove Duplicate Images from a Folder
This script identifies and removes duplicate images within a specified folder.
It uses perceptual hashing to generate a unique fingerprint for each image,
allowing it to detect duplicates even if the files have different names or formats.
What the Code Does:
-------------------
1. Scans the specified folder for image files (supports common formats such as
.jpg, .jpeg, .png, .bmp, .gif, .tiff, .webp).
2. Computes a perceptual hash for each image to identify duplicates.
3. If two images share the same hash, one of the duplicates is removed.
4. Prints the names of the duplicate files that were found and removed.
How to Use:
-----------
1. Save this script as a Python file, e.g., `remove_duplicates.py`.
2. Run the script using Python.
3. You will be prompted to enter the path of the folder containing the images
you want to check for duplicates.
4. The script will process the images in the folder, removing any duplicates
it finds and printing the results to the console.
Example:
--------
$ python remove_duplicates.py
Enter the path to the folder: /path/to/your/image/folder
Dependencies:
-------------
- Python 3.x
- PIL (Pillow)
- imagehash
Make sure to install the required packages using pip if they are not already installed:
$ pip install pillow imagehash
"""
import os
from PIL import Image
import imagehash
def image_hash(filepath):
"""Returns the perceptual hash of the image file.
Args:
filepath (str): The path to the image file.
Returns:
str: The perceptual hash of the image.
"""
with Image.open(filepath) as img:
return str(imagehash.phash(img))
def main(folder_path):
"""Removes duplicate images in the specified folder.
This function computes the perceptual hash of each image in the given folder
and deletes any images that have the same hash (i.e., duplicate images).
Args:
folder_path (str): The path to the folder containing images.
"""
if not os.path.exists(folder_path):
print("Folder not found!")
return
duplicates = []
hash_keys = dict()
# Define a set of common image file extensions
image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff', '.webp'}
# Iterate through each image in the folder
for filename in os.listdir(folder_path):
file_path = os.path.join(folder_path, filename)
# Check if the file is an image by its extension
if os.path.isfile(file_path) and os.path.splitext(filename)[1].lower() in image_extensions:
try:
img_hash = image_hash(file_path)
if img_hash in hash_keys:
print(f"Duplicate found: {filename}")
duplicates.append(file_path)
else:
hash_keys[img_hash] = filename
except IOError:
# Handle the exception if the file is not a valid image
print(f"Cannot open {filename} as an image.")
# Remove duplicates
for duplicate_file in duplicates:
os.remove(duplicate_file)
print(f"Removed: {duplicate_file}")
print("Duplicate removal complete.")
if __name__ == '__main__':
folder_path = input("Enter the path to the folder: ") # Prompt the user for the folder path
main(folder_path)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment