This question was published by the Tutorial Guruji team.
I have two directories, A and B. Each one contains many sub-directories (geom001, geom002, etc.), and each sub-directory contains a file named results. I want to compare, without opening any of them, each file in A with each file in B and find out whether one or more files in A are identical to one or more files in B. How can I use a command like the following in a loop to check all the files?
cmp --silent file1 file2 || echo "file1 and file2 are different"
Answer
If files are exactly the same, then their md5sums will be exactly the same, so you can use:
find A/ B/ -type f -exec md5sum {} + | sort | uniq -w32 -D
An md5sum is always exactly 128 bits (16 bytes, or 32 hex digits) long, and the md5sum program prints it in hex. So we pass the -w32 option to uniq to compare only the first 32 characters of each line, and -D to print every line that belongs to a group of repeats (both are GNU uniq options).
This prints every file whose md5sum is not unique, i.e. every file that has at least one duplicate.
NOTE: this will detect duplicate files no matter where they are in A/ or B/, so if A/subdir1/file and A/subdir2/otherfile are the same, they will still be printed. If there are multiple duplicates, they will all be printed.
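If only matches between A and B matter, a nested loop around the cmp command from the question is one alternative. The following is a minimal sketch, not a tuned solution: the function name compare_results is made up for illustration, it assumes the layout described in the question (one file named results per sub-directory), and it uses cmp -s, the portable spelling of --silent. It compares every pair, so it gets slow when there are many files.

```shell
#!/bin/sh
# compare_results DIR_A DIR_B   (hypothetical helper name)
# Print every pair of byte-identical "results" files, one from each directory.
# Assumes the layout DIR/<sub-directory>/results described in the question.
compare_results() {
    for a in "$1"/*/results; do
        [ -f "$a" ] || continue          # skip if the glob matched nothing
        for b in "$2"/*/results; do
            [ -f "$b" ] || continue
            # cmp -s prints nothing and exits 0 only when the files are identical
            if cmp -s "$a" "$b"; then
                printf '%s and %s are identical\n' "$a" "$b"
            fi
        done
    done
}

# Usage: compare_results A B
```

Because the outer loop only walks A and the inner loop only walks B, duplicates that live entirely inside A or entirely inside B are never reported.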
You can remove the md5sums from the output of the md5sum pipeline by piping it into, e.g., awk '{print $2}' (which truncates file names that contain spaces), or with cut or sed, etc. I've left them in the output because they're a useful key for an associative array (aka a 'hash') in awk or perl, etc., for further processing.
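As an illustration of that kind of further processing, here is a rough sketch that uses the md5sum as the key of an awk associative array and keeps only the groups that contain files from both directories. The function name cross_dups and the awk variable names are invented for this example. It assumes GNU md5sum output (32 hex digits, two separator characters, then the path), directory arguments given without a trailing slash, and file names without newlines or backslashes.

```shell
#!/bin/sh
# cross_dups DIR_A DIR_B   (hypothetical helper name)
# Group files by md5sum and print only groups that span both directories.
cross_dups() {
    find "$1" "$2" -type f -exec md5sum {} + |
    awk -v a="$1/" -v b="$2/" '
        {
            hash = $1
            file = substr($0, 35)        # path starts after 32 hex digits + 2 chars
            group[hash] = group[hash] "  " file "\n"
            if (index(file, a) == 1) in_a[hash] = 1   # path begins with DIR_A/
            if (index(file, b) == 1) in_b[hash] = 1   # path begins with DIR_B/
        }
        END {
            for (hash in group)
                if (in_a[hash] && in_b[hash])
                    printf "%s\n%s", hash, group[hash]
        }'
}

# Usage: cross_dups A B
```

Each reported group is the md5sum on one line followed by the indented paths that share it. The order of the groups is whatever awk's for-in loop produces, so pipe the result through sort if a stable order matters.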