Detecting duplicate files based on binary content

So here's my current situation: I have a folder with textures I extracted from a game, but they weren't exactly stored efficiently, so now I have a huge number of textures, some of which are unique but most of which are just duplicates. But then there are also a few that SEEM like duplicates to the human eye but have minuscule differences.

Currently I'm just dragging two similar-looking images into HxD and running its file compare function to check whether they're different or not.

But I wonder if there is a way to automate this through some Windows command, WSL Linux command, or even a simple Python script. Most of the results I get when searching for this are just commands that count the files with the same extension and then say that THOSE are duplicates, meaning they only look at the extension. What I'm looking for does NOT look at the file name or extension, but at the actual binary content of each file, and removes any duplicate it comes across.

asked Nov 21, 2024 at 14:50 by eman_not_ava
  • How many files, how large in total? – no comment Commented Nov 21, 2024 at 15:29
  • Do all of the file have the same extension? – SIGHUP Commented Nov 21, 2024 at 15:31
  • Why not just use a tool like DoubleKiller (I think that's the one I used a while back)? – no comment Commented Nov 21, 2024 at 15:32
  • Use a tool like fc /b to compare two files based on their content, or shasum for a whole directory and sort the output by hash, or use one of the other duplicate finder tools. Anyway, this question is not on-topic here unless you want to develop your own file comparison/duplicate checker tool. – Robert Commented Nov 21, 2024 at 21:52
  • Try jdupes. Hopefully you can find it in your Linux package repo. – Ian Abbott Commented Nov 22, 2024 at 12:19

3 Answers

Python's standard library has the filecmp.cmp function; for your use case I suggest passing shallow=False. You would need a loop if you want to compare each possible pair of files; you might use the following as a starting point:

import filecmp
from itertools import combinations

files = ['path/to/file1.png', 'path/to/file2.png', 'path/to/file3.png']

# combinations() yields each unordered pair exactly once, so no pair is
# compared twice and no file is compared with itself
for file1, file2 in combinations(files, 2):
    if filecmp.cmp(file1, file2, shallow=False):  # shallow=False compares file contents
        print(file1, 'seems identical to', file2)
    else:
        print(file1, 'is different from', file2)

After you replace the files list with actual paths, it should report which files appear identical and which do not.
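If you don't want to list the paths by hand, a minimal sketch (assuming the textures sit in a single folder, here called textures/, which is just a placeholder name) could build the list with pathlib and run the same pairwise comparison:

import filecmp
from itertools import combinations
from pathlib import Path

# 'textures' is a placeholder; point it at your extracted-texture folder
files = sorted(p for p in Path('textures').iterdir() if p.is_file())

for file1, file2 in combinations(files, 2):
    if filecmp.cmp(file1, file2, shallow=False):
        print(file1, 'seems identical to', file2)

Keep in mind that pairwise comparison is O(n²) in the number of files, so for a very large texture dump the hash-based approaches below will scale much better.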

Assuming all the files have the same extension, you can glob for the files, calculate an MD5 hash of each file's contents, and maintain a dictionary keyed on the hex digest whose value is a list of files that share that hash.

Something like this:

from pathlib import Path
from hashlib import md5

EXTENSION = "*.bmp" # or whatever
DIRECTORY = Path(".") # where to find the files

results = dict()

for file in DIRECTORY.glob(EXTENSION):
    try:
        with file.open("rb") as data:
            content = data.read()
            hd = md5(content).hexdigest()
            results.setdefault(hd, []).append(file.name)
    except Exception:
        # ignore any Path that cannot be opened
        pass

for v in filter(lambda v: len(v) > 1, results.values()):
    # this will print a list of files whose md5 hash is identical
    print(v) 
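An MD5 collision between two different textures is extremely unlikely, but if you want to be completely sure before deleting anything, a short follow-up sketch (reusing the results dict and DIRECTORY from the snippet above) could confirm each candidate group byte for byte with filecmp:

import filecmp

for names in results.values():
    if len(names) < 2:
        continue
    reference = DIRECTORY / names[0]
    for other in names[1:]:
        # shallow=False forces a full content comparison, not just an os.stat() check
        if filecmp.cmp(reference, DIRECTORY / other, shallow=False):
            print(other, 'is a confirmed duplicate of', names[0])

This works because the glob above is non-recursive, so joining DIRECTORY with each stored file name reconstructs the original path.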

To handle your situation, you can use Python to compare the binary content of files in a folder and identify duplicates, even if the file names or extensions are different.

Python Script to Identify and Remove Duplicate Files

The script will:

  • Use a hashing algorithm (e.g., MD5 or SHA-256) to calculate a hash of each file's binary content.
  • Keep track of these hashes and remove any file whose hash has already been seen.

Here’s the script:

import os
import hashlib

def calculate_hash(file_path):
    # Hash the file in 4 KiB chunks so large files never have to fit in memory
    hash_md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def find_and_remove_duplicates(folder_path):
    hashes = {}      # hash -> first file seen with that content
    duplicates = []  # files whose content has already been seen
    
    for root, _, files in os.walk(folder_path):
        for file in files:
            file_path = os.path.join(root, file)
            file_hash = calculate_hash(file_path)
            
            if file_hash in hashes:
                duplicates.append(file_path)
                print(f"Duplicate found: {file_path} (Duplicate of {hashes[file_hash]})")
            else:
                hashes[file_hash] = file_path
    
    if duplicates:
        print("\nRemoving duplicate files...")
        for duplicate in duplicates:
            os.remove(duplicate)
            print(f"Removed: {duplicate}")
    else:
        print("No duplicate file found.")

if __name__ == "__main__":
    folder_to_scan = input("Enter the path to the folder you want to scan for duplicates: ").strip()
    if os.path.isdir(folder_to_scan):
        find_and_remove_duplicates(folder_to_scan)
    else:
        print("Invalid folder path. Please check and try again.")

After that, run the script as usual. Hope this helps.
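Deleting straight away is a bit risky for assets you may still want; if you prefer, a hedged variation of the removal step (assuming a duplicates/ subfolder, which is just an illustrative name) could move the duplicates into a quarantine folder instead of removing them:

import shutil
from pathlib import Path

def quarantine_duplicates(duplicates, folder_path):
    # Move each duplicate into <folder>/duplicates/ instead of deleting it
    quarantine = Path(folder_path) / "duplicates"
    quarantine.mkdir(exist_ok=True)
    for duplicate in duplicates:
        # Note: same-named files from different subfolders would collide here
        target = quarantine / Path(duplicate).name
        shutil.move(duplicate, str(target))
        print(f"Moved: {duplicate} -> {target}")

You could call quarantine_duplicates(duplicates, folder_to_scan) in place of the os.remove loop, check the results by eye, and only then empty the folder.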
