Distinguishing files on the basis of first column’s content

I have multiple tab-separated files with .cluster extension. I want to classify these files on the basis of their first column content using the following criteria: (2 and 3 are actual digits/content inside the files)

  • class_1: Only 2 AND 3 present on successive lines
  • class_2: Only 2 present
  • class_3: Only 3 present

I want to write their file names in .txt files with the respective class name. How do I do it with shell scripting?

Answer

for filename in *.cluster
do
    class=$(cut -d$'\t' -f1 "$filename")          # Part 1
    if [ "$(wc -l < "$filename")" -eq 2 ]         # Part 2, start
    then
        class=1
    fi                                            # Part 2, end
    printf '%s\n' "$filename" >> class_"$class".txt # Part 3
done

This has three parts:

  1. By default, the file is classified by the first field of its only line: the class variable is set to whatever precedes the first tab character on each line. For class 2 and class 3 files this is 2 or 3 respectively, since those files have only one line.

    cut chops files up by delimiters, $'\t' is a Bash way of writing a tab character, and -f1 asks cut to output only the first delimited field.

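  2. If the file has two lines ([ "$(wc -l < "$filename")" -eq 2 ]), it must be class 1, so the class variable is forcibly set to 1, replacing its value from step 1. The file is redirected into wc so that wc prints only the line count rather than the count followed by the filename. The if ... fi block deals with this.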
  3. Finally, the filename is appended to the appropriate class file: printf '%s\n' "$filename" >> class_"$class".txt
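
To see the three parts working together, here is a small sketch that runs the loop on made-up sample data. The file names a.cluster, b.cluster and c.cluster, their contents, and the scratch directory are assumptions for illustration only. The tab is stored in a variable via printf so the sketch also runs under a plain sh; in Bash it is equivalent to writing $'\t'.

```bash
#!/usr/bin/env bash
# Hypothetical sample data in a scratch directory (all names are made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '2\tfoo\n3\tbar\n' > a.cluster   # two lines, 2 then 3: intended class 1
printf '2\tfoo\n'         > b.cluster   # one line starting with 2: intended class 2
printf '3\tbar\n'         > c.cluster   # one line starting with 3: intended class 3

tab=$(printf '\t')                      # portable spelling of a tab character

# The two building blocks on their own:
first_field=$(cut -d"$tab" -f1 b.cluster)   # text before the first tab
line_count=$(wc -l < a.cluster)             # redirected, so only the number is printed

# The classification loop:
for filename in *.cluster
do
    class=$(cut -d"$tab" -f1 "$filename")       # Part 1: first field
    if [ "$(wc -l < "$filename")" -eq 2 ]       # Part 2: two lines means class 1
    then
        class=1
    fi
    printf '%s\n' "$filename" >> class_"$class".txt   # Part 3: record the name
done
```

Afterwards, the scratch directory should contain class_1.txt, class_2.txt and class_3.txt, each listing one of the sample files.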

At the end you will have up to three files, class_N.txt for each N in 1, 2, 3, with one filename per line. If any file has contents other than what you outlined in the question, such as a different first field or a different number of lines, extra class files will be created.
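
If stray class files are a concern, one possible variation is to classify by the set of distinct first-column values instead of by line count. This is a different technique from the loop above, sketched under assumptions: it treats any file containing both 2 and 3 as class 1 without checking that they sit on successive lines, it relies on tab being cut's default delimiter, and the class_other.txt catch-all and the sample files are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sample data in a scratch directory (all names are made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '2\tx\n3\ty\n' > a.cluster   # both 2 and 3
printf '2\tx\n2\ty\n' > b.cluster   # two lines, but only 2s
printf '7\tx\n'       > d.cluster   # unexpected first field

for filename in *.cluster
do
    # Distinct first-column values joined by commas, e.g. "2", "3" or "2,3".
    values=$(cut -f1 "$filename" | sort -u | paste -sd, -)
    case $values in
        2,3) class=1 ;;
        2)   class=2 ;;
        3)   class=3 ;;
        *)   class=other ;;   # assumed catch-all for anything unexpected
    esac
    printf '%s\n' "$filename" >> class_"$class".txt
done
```

With this variant a file that repeats 2 on several lines still lands in class 2, and anything outside the three expected patterns is collected in a single file.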

In the perverse case where a filename itself contains a newline character, this will fall apart (and give you an opportunity to reconsider your filename choices), but otherwise it should be fine.
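
If such filenames cannot be ruled out, one option is to separate the recorded names with NUL bytes instead of newlines, since a NUL cannot occur in a filename. The following is only a sketch: the sample files are made up, and the output is no longer a plain one-name-per-line text file.

```bash
#!/usr/bin/env bash
# Hypothetical sample data in a scratch directory (all names are made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

weird=$(printf 'odd\nname.cluster')   # a filename containing a newline
printf '2\tx\n' > "$weird"
printf '2\tx\n' > plain.cluster

for filename in *.cluster
do
    class=$(cut -f1 "$filename")      # tab is cut's default delimiter
    if [ "$(wc -l < "$filename")" -eq 2 ]
    then
        class=1
    fi
    printf '%s\0' "$filename" >> class_"$class".txt   # NUL-terminated entry
done

# Count the recorded entries by counting NUL bytes.
entries=$(tr -cd '\000' < class_2.txt | wc -c)
```

The resulting list can be consumed by tools that accept NUL-delimited input, for example xargs -0 in the GNU and BSD implementations.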
