This question was published by the Tutorial Guruji team.
I have a file with multiple columns and, using a bash script, have identified lines where specific column values (cols 3-6) are duplicated.
Example input:
A B C D E F G
1 2 T TACA A 3 2 Q
3 4 I R 8 2 Q
9 3 A C 9 3 P
8 3 I R 8 2 Q
I can display both instances of the repeated values. The other column values (cols 1, 2 and 7+) can be different between the 2 lines hence the need for me to view both instances.
I want to save the unique records and the first instance of each duplicated record, after the duplicates have been sorted on col 5 (any order will do) and then on col 1 (descending order, largest value first).
Desired output:
A B C D E F G
1 2 T TACA A 3 2 Q
9 3 A C 9 3 P
8 3 I R 8 2 Q
NB: The ordering on final output is not important as it will be resorted later. Making sure the desired rows are present is what matters.
My code so far is:
tot=$(awk 'n=x[$3,$6]{print n"\n"$0;} {x[$3,$6]=$0;}' oldfilename | wc -l) #counts duplicated records and saves overall count as $tot
if [ $tot == "0" ]
then
    awk '{print}' oldfilename >> newfilename #if no dups found, all lines saved in new file
else
    if awk '(!(n=x[$3,$6]{print n"\n"$0;} {x[$3,$6]=$0;})' oldfilename >> newfilename #if dups found, unique lines in old file saved in new file
    else
        awk 'n=x[$3,$6]{print n"\n"$0;} {x[$3,$6]=$0;}' oldfilename > tempfile #save dups in tempfile
        sort -k1,1, -k5,5 tempfile #sort tempfile on cols 1 then 5 (want descending order)
fi
What I am unable to do is take the first instance of each duplicate and save it in newfile and I still have errors in the above code.
Please help.
Answer
sort itself should suffice. First, sort such that rows are “grouped” by field range 3-6, with records within each group further ordered by fields 5 and 1. Pipe this to sort -u on 3-6; this disables last-resort comparison and returns the first record from each 3-6 group. Finally, pipe this to sort, this time by fields 5 and 1:
sort -k3,6 -k5,5r -k1,1r file | sort -k3,6 -u | sort -k5,5r -k1,1r
A B C D E F G
1 2 T TACA A 3 2 Q
9 3 A C 9 3 P
8 3 I R 8 2 Q
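To see why the second sort keeps the right row, here is a minimal sketch of the first two stages wrapped in a function and run on the rows from the question. The function name dedupe_on_cols_3_to_6 is hypothetical, and two things are assumptions rather than part of the answer above: the "n" modifier on field 1 (so that values compare numerically rather than as text) and LC_ALL=C (to make the ordering predictable). It also relies on GNU sort keeping the first line of each run of equal keys under -u; other sort implementations may not promise which line survives.

```shell
#!/bin/sh
# Hypothetical helper name; not part of the original answer.
# Prints one row per distinct combination of fields 3-6, preferring the
# row whose field 1 is numerically largest.
dedupe_on_cols_3_to_6() {
    # Stage 1: group by fields 3-6, then order each group by field 5
    # (reverse) and field 1 (numeric, reverse). The "n" is an assumption
    # added here so that 10 sorts above 9.
    # Stage 2: -u on the same key keeps one row per group; GNU sort keeps
    # the first row of each run of equal keys.
    LC_ALL=C sort -k3,6 -k5,5r -k1,1nr "$1" | LC_ALL=C sort -k3,6 -u
}

# Demo on the rows from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
A B C D E F G
1 2 T TACA A 3 2 Q
3 4 I R 8 2 Q
9 3 A C 9 3 P
8 3 I R 8 2 Q
EOF
dedupe_on_cols_3_to_6 "$sample"
rm -f "$sample"
```

Of the two rows sharing "I R 8 2" in fields 3-6, only the one starting with 8 should remain. Since the final ordering does not matter to the asker, the third sort is left out here; it can be appended to the pipeline if a particular order is wanted.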