Accueil > Système > Find Duplicate Files (based on size first, then MD5 hash)

Find Duplicate Files (based on size first, then MD5 hash)

19/06/2015 Categories: Système Tags: ,
Terminal – Find Duplicate Files (based on size first, then MD5 hash)
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Find Duplicate Files (based on size first, then MD5 hash)

This dup finder saves time by comparing size first, then md5sum, it doesn’t delete anything, just lists them.

Terminal – Alternatives
find -type f -exec md5sum '{}' ';' | sort | uniq --all-repeated=separate -w 33 | cut -c 35-

Calculates md5 sum of files. sort (required for uniq to work). uniq based on only the hash. use cut ro remove the hash from the result.

find -type d -name ".svn" -prune -o -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type d -name ".svn" -prune -o -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

Improvement of the command « Find Duplicate Files (based on size first, then MD5 hash) » when searching for duplicate files in a directory containing a subversion working copy. This way the (multiple dupicates) in the meta-information directories are ignored.

Can easily be adopted for other VCS as well. For CVS i.e. change « .svn » into « .csv »:

find -type d -name ".csv" -prune -o -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type d -name ".csv" -prune -o -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

find -not -empty -type f -printf "%-30s'\t\"%h/%f\"\n" | sort -rn -t$'\t' | uniq -w30 -D | cut -f 2 -d $'\t' | xargs md5sum | sort | uniq -w32 --all-repeated=separate

Finds duplicates based on MD5 sum. Compares only files with the same size. Performance improvements on:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separateThe new version takes around 3 seconds where the old version took around 17 minutes. The bottle neck in the old command was the second find. It searches for the files with the specified file size. The new version keeps the file path and size from the beginning.

find . -type f -exec md5 '{}' ';' | sort | uniq -f 3 -d | sed -e "s/.*(\(.*\)).*/\1/"

This works on Mac OS X using the `md5` command instead of `md5sum`, which works similarly, but has a different output format. Note that this only prints the name of the duplicates, not the original file. This is handy because you can add `| xargs rm` to the end of the command to delete all the duplicates while leaving the original.

Print Friendly, PDF & Email

Related Post

Categories: Système Tags: ,
Les commentaires sont fermés.