Sorting not seeming to work [closed]

I am currently writing a web-crawling bot. It generates a list of URLs, and I need it to remove duplicates and sort the lines alphabetically. My code looks like this:

#! /bin/bash
URL="google.com"
while [ 1 ]; do
  wget --output-document=dl.html $URL
  links=($(grep -Po '(?<=href=")[^"]*' dl.html))
  printf "%sn" ${links[@]} >> results.db

  sort results.db | uniq -u

  URL=$(shuf -n 1 results.db)
  echo $URL
done

Specifically, the line that does not seem to work is:

sort results.db | uniq -u

Answer

POSIX says of uniq -u:

Suppress the writing of lines that are repeated in the input.

which means that any line which is repeated (even its first occurrence) will be filtered out. What you probably meant (also POSIX) was:

sort -u results.db

For sort -u, POSIX says

Unique: suppress all but one in each set of lines having equal keys. If used with the -c option, check that there are no lines with duplicate keys, in addition to checking that the input file is sorted.
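
To see the difference concretely, here is a small demonstration with made-up sample data (the file name is just for illustration):

printf '%s\n' b a b c > sample.txt
sort sample.txt | uniq -u    # prints only "a" and "c"; every copy of "b" is dropped
sort -u sample.txt           # prints "a", "b", "c"; one copy of each line is kept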

In either case, the following line

URL=$(shuf -n 1 results.db)

probably assumes that the purpose of the sort/uniq step is to update results.db (it won't; it only writes to standard output). You would have to modify the script a little more for that:

sort -u results.db >results.db2 && mv results.db2 results.db

or (as suggested by @drewbenn) combine it with the previous printf line. However, since the printf appends to the file, combining the commands as shown in his answer won't eliminate duplicates between the latest printf output and the file's existing contents, so the separate sort/mv command stays closer to the intent of the original script.
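
For reference, the combined form would presumably look something like this (variable and file names taken from the question); note that it removes duplicates only among the links just scraped, not against what is already in results.db:

printf '%s\n' "${links[@]}" | sort -u >> results.db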

If you want to ensure that $URL is not empty, that is really a separate question, but it can be done with the [ test, e.g.,

  [ -n "$URL" ] && wget --output-document=dl.html $URL

though simply exiting from the loop would be simpler:

[ -z "$URL" ] && break
