How can I compare different files without opening them?

I have two directories, A and B. Each one contains many sub-directories:

geom001, geom002, etc.

Each sub-directory contains a file named results. I want to compare, without opening any of them, every file in A with every file in B, and find out whether any file in A is the same as one or more files in B. How can I use a command like the following in a loop to search over all the files?

cmp --silent  file1 file2  || echo "file1 and file2 are different"

Answer

If two files are exactly the same, their md5sums will be exactly the same, so you can use:

find A/ B/ -type f -exec md5sum {} + | sort | uniq -w32 -D

An md5sum is always exactly 128 bits (16 bytes, or 32 hex digits) long, and the md5sum program prints it in hex. So we use the -w32 option of uniq (a GNU extension) to compare only the first 32 characters of each line, and -D to print every line that belongs to a group of duplicates.

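This will print every file whose md5sum is not unique, i.e. the duplicates. Matching md5sums do not strictly prove identical content, but an accidental collision is vanishingly unlikely, and cmp can confirm any match.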

NOTE: this will detect duplicate files no matter where they are in A/ or B/, so if A/subdir1/file and A/subdir2/otherfile are the same, they will still be printed even though both are in A/. If a file has several duplicates, all of them will be printed.
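Since the question asks only about matches between A and B, a pairwise loop around cmp is another option. The following is a minimal sketch, not a tuned solution: compare_results is a hypothetical function name, and it assumes the layout from the question (one file named results in each immediate sub-directory). It uses cmp -s, the short form of cmp --silent.

```shell
#!/bin/sh
# compare_results (hypothetical helper): byte-for-byte comparison of every
# DIR1/*/results file against every DIR2/*/results file.
compare_results() {
    for a in "$1"/*/results; do
        [ -f "$a" ] || continue        # skip if the glob matched nothing
        for b in "$2"/*/results; do
            [ -f "$b" ] || continue
            # cmp -s prints nothing and exits 0 only when the files are identical
            if cmp -s "$a" "$b"; then
                echo "$a and $b are identical"
            fi
        done
    done
}
```

Call it as compare_results A B. It runs cmp once per pair of files, so for large trees the md5sum pipeline above will usually do less work.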

You can remove the md5sums from the output by piping into, e.g., awk '{print $2}' (fine as long as the file names contain no whitespace) or cut -c35-, or with sed etc. I’ve left them in the output because they’re a useful key for an associative array (aka a ‘hash’) in awk or perl etc. for further processing.
