I have multiple tab-separated files with the .cluster extension. I want to classify these files on the basis of their first column content using the following criteria (2 and 3 are actual digits/content inside the files):
- class_1: Only 2 AND 3 present on successive lines
- class_2: Only 2 present
- class_3: Only 3 present
I want to write their file names in .txt files with the respective class name.
How do I do it with shell scripting?
Answer
    for filename in *.cluster
    do
        class=$(cut -d$'\t' -f1 "$filename")                # Part 1
        if [ $(wc -l < "$filename") -eq 2 ]                 # Part 2, start
        then
            class=1
        fi                                                  # Part 2, end
        printf '%s\n' "$filename" >> class_"$class".txt     # Part 3
    done
This has three parts:
1. By default, it classifies the file based on the first field of its only line: the class variable is set to whatever is in the file, up to the first tab character on each line. This will be either 2 or 3 for classes 2 and 3, since those files have only one line. cut chops files up by delimiters, $'\t' is a Bash way of writing a tab character, and -f1 asks cut to output only the first delimited field.
2. If the file has two lines ($(wc -l < "$filename") -eq 2), it must be class 1, so the class variable is forcibly set to 1, replacing its value from step 1. The if … fi deals with this. Feeding the file to wc through < makes it print only the line count, without the filename, so the numeric comparison works.
3. Finally, the filename is appended to the appropriate class file: printf '%s\n' "$filename" >> class_"$class".txt
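
To see what each building block produces on its own, here is a minimal sketch run against a made-up two-line file. The scratch directory, the file name sample.cluster, and the variable names are assumptions for illustration only. The tab is built with printf rather than $'\t' so that the snippet also runs under a plain POSIX sh.

```shell
# Hypothetical sample: two tab-separated lines whose first fields are 2 and 3.
workdir=$(mktemp -d)
sample="$workdir/sample.cluster"
printf '2\tfoo\n3\tbar\n' > "$sample"

# Portable way to get a literal tab (equivalent to Bash's $'\t').
tab=$(printf '\t')

# Part 1: first tab-delimited field of every line, one per line.
first_fields=$(cut -d "$tab" -f1 "$sample")

# Part 2: line count only; reading from stdin keeps the filename out of the output.
line_count=$(wc -l < "$sample")

printf 'first fields:\n%s\n' "$first_fields"
printf 'line count: %s\n' "$line_count"
```

For a two-line file the first-field value spans two lines, which is why the loop overwrites it with 1 before using it in a file name.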
At the end you will have three files, class_N.txt for each N in 1, 2, 3, with one filename per line. If any file has contents other than what you outlined in the question, like a different first field or length, you will get extra class files created.
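
The behaviour described above, including the extra class file, can be tried out end to end with a sketch like the following. The four sample files and their contents are invented for the demonstration, and the loop uses a printf-built tab in place of $'\t' so it is not tied to Bash.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
tab=$(printf '\t')

# Hypothetical inputs
printf '2\tx\n3\ty\n' > a.cluster   # 2 and 3 on successive lines: class 1
printf '2\tx\n'       > b.cluster   # only 2: class 2
printf '3\ty\n'       > c.cluster   # only 3: class 3
printf '7\tz\n'       > d.cluster   # outside the stated criteria

for filename in *.cluster
do
    class=$(cut -d "$tab" -f1 "$filename")
    if [ $(wc -l < "$filename") -eq 2 ]
    then
        class=1
    fi
    printf '%s\n' "$filename" >> class_"$class".txt
done

# Show which class files were created.
ls class_*.txt
```

With these inputs, d.cluster ends up in an extra class_7.txt. Because the loop appends with >>, running it a second time in the same directory would list every filename twice, so it may be worth removing old class_*.txt files first.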
In the perverse case where a filename itself contains a newline character, this will fall apart (and give you an opportunity to reconsider your filename choices), but otherwise it should be fine.
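
If such filenames are a real possibility, one option (not part of the answer above, just a sketch) is to skip them before writing anything. The file names, the names.txt output file, and the skipped counter here are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A variable holding a single newline character.
nl='
'

# Hypothetical inputs: one ordinary name and one containing a newline.
printf '2\tx\n' > plain.cluster
printf '2\tx\n' > "odd${nl}name.cluster"

skipped=0
for filename in *.cluster
do
    case $filename in
        *"$nl"*)
            # A newline in the name would corrupt a one-name-per-line list.
            skipped=$((skipped + 1))
            continue
            ;;
    esac
    printf '%s\n' "$filename" >> names.txt
done

printf 'skipped %s file(s)\n' "$skipped"
```

The same case test could be placed at the top of the classification loop, leaving the rest of it unchanged.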