Open In App

Finding Duplicate Files with Python

Last Updated : 24 Oct, 2021
Improve
Improve
Like Article
Like
Save
Share
Report

In this article, we will code a python script to find duplicate files in the file system or inside a particular folder. 

Method 1: Using Filecmp

The python module filecmp offers functions to compare directories and files. The cmp function compares the files and returns True if they appear identical otherwise False.

Syntax: filecmp.cmp(f1, f2, shallow)

Parameters:

  • f1: Name of one file
  • f2: Name of another file to be compared
  • shallow: With this, we set if we want to compare content or not.

Note: The default value is True which ensures that only the signature of files is compared not content.

Return Type: Boolean value (True if the files are same otherwise False)

Example:

We’re assuming here for example purposes that “text_1.txt”, “text_3.txt”, “text_4.txt” are files having the same content, and “text_2.txt”, “text_5.txt” are files having the same content.

Python3




# Importing Libraries
import os
from pathlib import Path
from filecmp import cmp
  
  
# list of all documents
DATA_DIR = Path('/path/to/directory')
files = sorted(os.listdir(DATA_DIR))
  
# List having the classes of documents
# with the same content
duplicateFiles = []
  
# comparison of the documents
for file_x in files:
  
    if_dupl = False
  
    for class_ in duplicateFiles:
        # Comparing files having same content using cmp()
        # class_[0] represents a class having same content
        if_dupl = cmp(
            DATA_DIR / file_x,
            DATA_DIR / class_[0],
            shallow=False
        )
        if if_dupl:
            class_.append(file_x)
            break
  
    if not if_dupl:
        duplicateFiles.append([file_x])
  
# Print results
print(duplicateFiles)


Output:

Method 2: Using Hashing and Dictionary

To start, this script will get a single folder or a list of folders, then through traversing the folder it will find duplicate files. Next, this script will compute a hash for every file present in the folder regardless of their name and are stored in a dictionary manner with hash being the key and path to the file as value. 

  • We have to import os, sys, hashlib libraries.
  • Then script iterates over the files and calls FindDuplicate() function to find duplicates. 
Syntax: FindDuplicate(Path)
Parameter: 
Path: Path to folder having files
Return Type: Dictionary
  • The function FindDuplicate() takes path to file and calls Hash_File() function
  • Then Hash_File() function is used to return HEXdigest of that file. For more info on HEXdigest read here
Syntax: Hash_File(path)
Parameters: 
path: Path of file
Return Type: HEXdigest of file
  • This MD5 Hash is then appended to a dictionary as key with file path as its value. After this,the FindDuplicate() function returns a dictionary in which keys has multiple values .i.e. duplicate files.
  • Now Join_Dictionary() function is called which joins the dictionary returned by FindDuplicate() and an empty dictionary. 
Syntax: Join_Dictionary(dict1,dict2)
Parameters: 
dict1, dict2: Two different dictionaries
Return Type: Dictionary
  • After this, we print the list of files having the same content using results.

Example:

We’re assuming here for example purposes that “text_1.txt”, “text_3.txt”, “text_4.txt” are files having the same content, and “text_2.txt”, “text_5.txt” are files having the same content.

Python3




# Importing Libraries
import os
import sys
from pathlib import Path
import hashlib
  
  
def FindDuplicate(SupFolder):
    
    # Duplic is in format {hash:[names]}
    Duplic = {}
    for file_name in files:
        
        # Path to the file
        path = os.path.join(folders, file_name)
          
        # Calculate hash
        file_hash = Hash_File(path)
          
        # Add or append the file path to Duplic
        if file_hash in Duplic:
            Duplic[file_hash].append(file_name)
        else:
            Duplic[file_hash] = [file_name]
    return Duplic
  
# Joins dictionaries
def Join_Dictionary(dict_1, dict_2):
    for key in dict_2.keys():
        
        # Checks for existing key
        if key in dict_1:
            
            # If present Append
            dict_1[key] = dict_1[key] + dict_2[key]
        else:
            
            # Otherwise Stores
            dict_1[key] = dict_2[key]
  
# Calculates MD5 hash of file
# Returns HEX digest of file
def Hash_File(path):
    
    # Opening file in afile
    afile = open(path, 'rb')
    hasher = hashlib.md5()
    blocksize=65536
    buf = afile.read(blocksize)
      
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(blocksize)
    afile.close()
    return hasher.hexdigest()
  
Duplic = {}
folders = Path('path/to/directory')
files = sorted(os.listdir(folders))
for i in files:
    
    # Iterate over the files
    # Find the duplicated files
    # Append them to the Duplic
    Join_Dictionary(Duplic, FindDuplicate(i))
      
# Results store a list of Duplic values
results = list(filter(lambda x: len(x) > 1, Duplic.values()))
if len(results) > 0:
    for result in results:
        for sub_result in result:
            print('\t\t%s' % sub_result)
else:
    print('No duplicates found.')


Output:



Previous Article
Next Article

Similar Reads

Finding Md5 of Files Recursively in Directory in Python
MD5 stands for Message Digest Algorithm 5, it is a cryptographic hash function that takes input(or message) of any length and produces its 128-bit(16-byte) hash value which is represented as a 32-character hexadecimal number. The MD5 of a file is the MD5 hash value computed from the content of that file. It's a unique representation of the file's c
3 min read
Deleting Duplicate Files Using Python
In this article, we are going to use a concept called hashing to identify unique files and delete duplicate files using Python. Modules required:tkinter: We need to make a way for us to select the folder in which we want to do this cleaning process so every time we run the code we should get a file dialog to select a folder and we are going to use
6 min read
How to merge multiple excel files into a single files with Python ?
Normally, we're working with Excel files, and we surely have come across a scenario where we need to merge multiple Excel files into one. The traditional method has always been using a VBA code inside excel which does the job but is a multi-step process and is not so easy to understand. Another method is manually copying long Excel files into one w
4 min read
Python | Finding strings with given substring in list
Method #1: Using list comprehension List comprehension is an elegant way to perform any particular task as it increases readability in the long run. This task can be performed using a naive method and hence can be reduced to list comprehension as well. C/C++ Code # Python code to demonstrate # to find strings with substrings # using list comprehens
8 min read
Python | Finding frequency in list of tuples
In python we need to handle various forms of data and one among them is list of tuples in which we may have to perform any kind of operation. This particular article discusses the ways of finding the frequency of the 1st element in list of tuple which can be extended to any index. Let's discuss certain ways in which this can be performed. Method #1
6 min read
Python | Finding relative order of elements in list
Sometimes we have an unsorted list and we wish to find the actual position the elements could be when they would be sorted, i.e we wish to construct the list which could give the position to each element destined if the list was sorted. This has a good application in web development and competitive programming domain. Let's discuss certain ways in
3 min read
Python | Finding Solutions of a Polynomial Equation
Given a quadratic equation, the task is to find the possible solutions to it. Examples: Input : enter the coef of x2 : 1 enter the coef of x : 2 enter the constant : 1 Output : the value for x is -1.0 Input : enter the coef of x2 : 2 enter the coef of x : 3 enter the constant : 2 Output : x1 = -3+5.656854249492381i/4 and x2 = -3-5.656854249492381i/
2 min read
Python | Finding 'n' Character Words in a Text File
This article aims to find words with a certain number of characters. In the code mentioned below, a Python program is given to find the words containing three characters in the text file. Example Input: Hello, how are you ? , n=3 Output: how are you Explanation: Output contains every character of the of length 3 from the input file.Input File: File
3 min read
Finding the largest file in a directory using Python
In this article, we will find the file having the largest size in a given directory using Python. We will check all files in the main directory and each of its subdirectories. Modules required: os: The os module in Python provides a way of using operating system dependent functionality. OS module is available with Python's Standard Library and does
2 min read
Building an undirected graph and finding shortest path using Dictionaries in Python
Prerequisites: BFS for a GraphDictionaries in Python In this article, we will be looking at how to build an undirected graph and then find the shortest path between two nodes/vertex of that graph easily using dictionaries in Python Language. Building a Graph using Dictionaries Approach: The idea is to store the adjacency list into the dictionaries,
3 min read
three90RightbarBannerImg