Databases, Systems & Networks

Accueil > Système > Find Duplicate Files (based on size first, then MD5 hash)

Find Duplicate Files (based on size first, then MD5 hash)

14/03/2024 Categories: Système Tags: commands, shell

Terminal – Find Duplicate Files (based on size first, then MD5 hash)

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

Find Duplicate Files (based on size first, then MD5 hash)

This dup finder saves time by comparing size first, then md5sum, it doesn’t delete anything, just lists them.

Terminal – Alternatives

find -type f -exec md5sum '{}' ';' | sort | uniq --all-repeated=separate -w 33 | cut -c 35-

Find Duplicate Files (based on MD5 hash)

Calculates md5 sum of files. sort (required for uniq to work). uniq based on only the hash. use cut ro remove the hash from the result.

find -type d -name ".svn" -prune -o -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type d -name ".svn" -prune -o -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

Find Duplicate Files, excluding .svn-directories (based on size first, then MD5 hash)

Improvement of the command « Find Duplicate Files (based on size first, then MD5 hash) » when searching for duplicate files in a directory containing a subversion working copy. This way the (multiple dupicates) in the meta-information directories are ignored.

Can easily be adopted for other VCS as well. For CVS i.e. change « .svn » into « .csv »:

find -type d -name ".csv" -prune -o -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type d -name ".csv" -prune -o -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

find -not -empty -type f -printf "%-30s'\t\"%h/%f\"\n" | sort -rn -t$'\t' | uniq -w30 -D | cut -f 2 -d $'\t' | xargs md5sum | sort | uniq -w32 --all-repeated=separate

Find Duplicate Files (based on size first, then MD5 hash)

Finds duplicates based on MD5 sum. Compares only files with the same size. Performance improvements on:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separateThe new version takes around 3 seconds where the old version took around 17 minutes. The bottle neck in the old command was the second find. It searches for the files with the specified file size. The new version keeps the file path and size from the beginning.

find . -type f -exec md5 '{}' ';' | sort | uniq -f 3 -d | sed -e "s/.*(\(.*\)).*/\1/"

Find Duplicate Files (based on MD5 hash) — For Mac OS X

This works on Mac OS X using the `md5` command instead of `md5sum`, which works similarly, but has a different output format. Note that this only prints the name of the duplicates, not the original file. This is handy because you can add `| xargs rm` to the end of the command to delete all the duplicates while leaving the original.

Categories: Système Tags: commands, shell

Les commentaires sont fermés.

OpenVPN Documentation Split OpenVPN configuration files

Find Duplicate Files (based on size first, then MD5 hash)

Tags

Archives

Articles récents

Catégories

Categories

Blogroll

Archives

Meta