Is there a way to check for duplicate lines within a file using Java?

I’m attempting to read each line of an .inp file and write every line that is not a duplicate to a new file. The issue with the code I have so far is that all lines are written to the output file, regardless of whether they duplicate an earlier line. I’m using a Scanner to read the file and a BufferedWriter/FileWriter to write the output file.

How do I avoid writing the duplicates?

String book = reader.nextLine();
boolean duplicate = false;

while (reader.hasNext() == true) {
    try {
        duplicate = reader.hasNext(book);

        if (duplicate == true) {
            book = reader.nextLine();
        } else {
            writer.write(book + "\n");
            book = reader.nextLine();
        }
    } catch (NoSuchElementException ex) {
        break;
    }
}

Answer

Depending on the situation:

  • If the duplicate lines are sequential, maintain a variable to store the previous line and compare against it.
  • If the duplicate lines are not sequential, and there are relatively (*) few short lines, store the lines you’ve already processed in a HashSet and, for each new line, check whether the set already contains() it.
  • If the duplicate lines are not sequential, and there are relatively (*) few but long lines, store a hash (e.g. SHA-1) of each line in the HashSet instead of the complete line, and compare against that.
  • If the duplicate lines are not sequential, and there are a lot of long lines, combine the techniques described above with some form of persistent database or data store.

(*) Relative to available memory
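
A minimal sketch of the first three options, in case it helps. The class name DedupeLines and the method names are made up for illustration, and the sketch assumes you read with a BufferedReader (readLine() returns null at end of input) rather than the Scanner from the question. The digest variant assumes UTF-8 text and treats two lines with the same SHA-1 value as equal, so a hash collision would drop a line that is not really a duplicate.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; names are illustrative, not from the question.
class DedupeLines {

    // Option 1: duplicates are adjacent, so remembering the previous line is enough.
    static void dropAdjacentDuplicates(BufferedReader in, Writer out) throws IOException {
        String previous = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.equals(previous)) {
                out.write(line);
                out.write('\n');
            }
            previous = line;
        }
    }

    // Option 2: duplicates may be anywhere; keep every line seen so far in a HashSet.
    // Set.add returns false when the element is already present.
    static void dropDuplicates(BufferedReader in, Writer out) throws IOException {
        Set<String> seen = new HashSet<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (seen.add(line)) {
                out.write(line);
                out.write('\n');
            }
        }
    }

    // Option 3: long lines; keep only a fixed-size SHA-1 digest of each line.
    // The digest is Base64-encoded so it can be stored as a String key.
    static void dropDuplicatesByDigest(BufferedReader in, Writer out)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        Set<String> seenDigests = new HashSet<>();
        String line;
        while ((line = in.readLine()) != null) {
            // digest(byte[]) hashes the input and resets the MessageDigest for reuse.
            byte[] digest = sha1.digest(line.getBytes(StandardCharsets.UTF_8));
            if (seenDigests.add(Base64.getEncoder().encodeToString(digest))) {
                out.write(line);
                out.write('\n');
            }
        }
    }
}
```

To use it with files, wrap the input in new BufferedReader(new FileReader(...)) and the output in new BufferedWriter(new FileWriter(...)), and close both when done. As for why the code in the question writes every line: Scanner.hasNext(String) treats its argument as a pattern and only tests whether the next token matches it; it does not look back at lines that were already read.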