I am currently writing a web crawler bot. It generates a list of URLs, and I need it to remove duplicates and sort the lines alphabetically. My code looks like this:
#! /bin/bash
URL="google.com"
while [ 1 ]; do
    wget --output-document=dl.html $URL
    links=($(grep -Po '(?<=href=")[^"]*' dl.html))
    printf "%s\n" ${links[@]} >> results.db
    sort results.db | uniq -u
    URL=$(shuf -n 1 results.db)
    echo $URL
done
Specifically, the line:
sort results.db | uniq -u
Answer
POSIX says of uniq -u:
Suppress the writing of lines that are repeated in the input.
which means that any line that is repeated (even its first occurrence) will be filtered out. What you probably meant was (also POSIX):
sort -u results.db
For sort -u, POSIX says:
Unique: suppress all but one in each set of lines having equal keys. If used with the -c option, check that there are no lines with duplicate keys, in addition to checking that the input file is sorted.
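To see the difference concretely, here is a small demonstration with made-up input (the data is only illustrative):
$ printf '%s\n' b a b | sort | uniq -u
a
$ printf '%s\n' b a b | sort -u
a
b
With uniq -u the repeated line b disappears entirely, while sort -u keeps exactly one copy of each line.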
In either case, the following line
URL=$(shuf -n 1 results.db)
probably assumes that the purpose of sort/uniq is to update results.db (it won’t). You would have to modify the script a little more for that:
sort -u results.db >results.db2 && mv results.db2 results.db
or (as suggested by @drewbenn) combine it with the previous line. However, since that line appends to the file (combining the commands as shown in his answer won’t eliminate duplicates between the latest printf output and the file’s existing contents), a separate sort/mv command stays closer to the original script.
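For example, the sort/mv step could replace the broken line inside the loop, right after the printf that appends the new links (a sketch only; the rest of the script is assumed unchanged):
printf "%s\n" ${links[@]} >> results.db
# replaces "sort results.db | uniq -u", which only wrote to stdout
sort -u results.db >results.db2 && mv results.db2 results.db
URL=$(shuf -n 1 results.db)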
If you want to ensure that $URL is not empty, that is really a separate question, but it can be done with the [ test, e.g.,
[ -n "$URL" ] && wget --output-document=dl.html $URL
though it would be simpler to just exit the loop:
[ -z "$URL" ] && break
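Putting the pieces together, the revised loop might look roughly like this (a sketch under the assumptions above, keeping the original file names dl.html and results.db):
#! /bin/bash
URL="google.com"
while true; do
    wget --output-document=dl.html "$URL"
    links=($(grep -Po '(?<=href=")[^"]*' dl.html))
    printf "%s\n" "${links[@]}" >> results.db
    # keep results.db sorted and free of duplicates
    sort -u results.db >results.db2 && mv results.db2 results.db
    URL=$(shuf -n 1 results.db)
    # stop if no URL could be picked (e.g. results.db is empty)
    [ -z "$URL" ] && break
    echo "$URL"
done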