admin管理员组文章数量:1122832
So here's my current situation I have a folder with textures I extracted from a game, but they weren't exactly stored efficiently so now I have a giant amount of textures of which some are unique but most are just duplicates. But then there's also a few that SEEM like duplicates to the human eye but have miniscule differences.
Currently I'm just dragging 2 similar looking images into HxD and running the file compare function to check if they're different or not.
But I wonder if there is a way to automate this through some windows command or WSL Linux command, or even a simple python script. Most of the results I get when searching for this are just commands that look for the amount of files with the same extension and then saying that THOSE are duplicates, meaning it only looks at the extension and what I'm looking for does NOT look at the file name nor extension but at the actual binary content of each file and removes any duplicate it comes across
So here's my current situation I have a folder with textures I extracted from a game, but they weren't exactly stored efficiently so now I have a giant amount of textures of which some are unique but most are just duplicates. But then there's also a few that SEEM like duplicates to the human eye but have miniscule differences.
Currently I'm just dragging 2 similar looking images into HxD and running the file compare function to check if they're different or not.
But I wonder if there is a way to automate this through some windows command or WSL Linux command, or even a simple python script. Most of the results I get when searching for this are just commands that look for the amount of files with the same extension and then saying that THOSE are duplicates, meaning it only looks at the extension and what I'm looking for does NOT look at the file name nor extension but at the actual binary content of each file and removes any duplicate it comes across
Share Improve this question asked Nov 21, 2024 at 14:50 eman_not_avaeman_not_ava 92 bronze badges 5 |3 Answers
Reset to default 1Standard library of python does have filecmp.cmp
function, for your use case I suggest using shallow=False
. You would need to use loop if you want to compare each possible pair of files, you might use following as starting point
import filecmp
files = ['path/to/file1.png', 'path/to/file2.png', 'path/to/file3.png']
for file1 in files:
for file2 in files:
if file1 != file2: # avoid comparing same paths
if filecmp.cmp(file1, file2, shallow=False):
print(file1, 'seems identical to', file2)
else:
print(file1, 'is different than', file2)
after you replace files
list with actual paths to file it should report which files seems to be identical and which do not.
Assuming all the files have the same extension then you can glob for the files, calculate an MD5 hash for the file contents and maintain a dictionary keyed on the hexdigest with a value that is a list of files that all have the same hash value.
Something like this:
from pathlib import Path
from hashlib import md5
EXTENSION = "*.bmp" # or whatever
DIRECTORY = Path(".") # where to find the files
results = dict()
for file in DIRECTORY.glob(EXTENSION):
try:
with file.open("rb") as data:
content = data.read()
hd = md5(content).hexdigest()
results.setdefault(hd, []).append(file.name)
except Exception:
# ignore any Path that cannot be opened
pass
for v in filter(lambda v: len(v) > 1, results.values()):
# this will print a list of files whose md5 hash is identical
print(v)
To handle your situation, you can use Python to compare the binary content of files in a folder and identify duplicates, even if the file names or extensions are different.
Python Script to Identify and Remove Duplicate Files
The script will:
Use a hashing algorithm (e.g., MD5 or SHA256) to calculate a unique hash for each file based on its binary content. Keep track of these hashes and remove any file with a duplicate hash.
Here’s the script:
import os
import hashlib
def calculate_hash(file_path):
hash_md5 = hashlib.md5()
with open(file_path, "rb") as f:
for chunk in iter(lambda: f.read(4096), b""):
hash_md5.update(chunk)
return hash_md5.hexdigest()
def find_and_remove_duplicates(folder_path):
hashes = {}
duplicates = []
for root, _, files in os.walk(folder_path):
for file in files:
file_path = os.path.join(root, file)
file_hash = calculate_hash(file_path)
if file_hash in hashes:
duplicates.append(file_path)
print(f"Duplicate found: {file_path} (Duplicate of {hashes[file_hash]})")
else:
hashes[file_hash] = file_path
if duplicates:
print("\nRemoving duplicate files...")
for duplicate in duplicates:
os.remove(duplicate)
print(f"Removed: {duplicate}")
else:
print("No duplicate file found.")
if __name__ == "__main__":
folder_to_scan = input("Enter the path to the folder you want to scan for duplicates: ").strip()
if os.path.isdir(folder_to_scan):
find_and_remove_duplicates(folder_to_scan)
else:
print("Invalid folder path. Please check and try again.")
After this run the script as usual. Hope this will help.
本文标签: pythonDetecting duplicate files based on binary contentStack Overflow
版权声明:本文标题:python - Detecting duplicate files based on binary content - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1736309746a1934129.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
fc /b
to compare two files based on the content orshasum
for a while directory and sort the output by hash, or use one of the other duplicate finder tools. Anyway that question is not on-topic here unless you want to develop your own file comparioson/duplicate checked tool. – Robert Commented Nov 21, 2024 at 21:52jdupes
. Hopefully you can find it in your Linux package repo. – Ian Abbott Commented Nov 22, 2024 at 12:19