I have a bunch of paired files with unneeded barcode tags in the middle of the file name, for example:
LIB008983_TRA00020080_TAAGGCGA-TATCCTCT_L001_R1.fastq.gz
LIB008983_TRA00020080_TAAGGCGA-TATCCTCT_L001_R2.fastq.gz
LIB008983_TRA00020081_TAAGGCGA-AGAGTAGA_L001_R1.fastq.gz
LIB008983_TRA00020081_TAAGGCGA-AGAGTAGA_L001_R2.fastq.gz
LIB008983_TRA00020082_TAAGGCGA-GTAAGGAG_L001_R1.fastq.gz
LIB008983_TRA00020082_TAAGGCGA-GTAAGGAG_L001_R2.fastq.gz
LIB008983_TRA00020083_TAAGGCGA-ACTGCATA_L001_R1.fastq.gz
LIB008983_TRA00020083_TAAGGCGA-ACTGCATA_L001_R2.fastq.gz
I need to get rid of the barcode (which varies from file to file) without modifying the identifiers at the beginning or end of the file name.
I have tried writing a script myself from what I’ve read online, but it appears to be a relatively poor attempt:
for f in LIB008983_TRA000{19916..20167}_*_L001_R*.fastq.gz; do
    newName=${f/_*_ _L001_R*.fastq.gz}
    mv -i "$f" "$newName";
done
Here’s the error message I get:
mv: cannot stat ‘LIB008983_TRA00019917_*_L001_R*.fastq.gz’: No such file or directory
Ideally, my final file name would be, for example:
LIB008983_TRA00020136_L001_R1.fastq.gz
LIB008983_TRA00020136_L001_R2.fastq.gz
LIB008983_TRA00020137_L001_R1.fastq.gz
LIB008983_TRA00020137_L001_R2.fastq.gz
..
..
and so on
Answer
The problem you’re seeing here is that the for loop is expanding in ways you’re not expecting. The {...} range operator (brace expansion) generates every name in the range, not just the ones that exist, and a glob pattern that matches no file is left exactly as written.
For example, file 19917 doesn’t exist, which causes that mv error message.
You can see this by putting an echo in the loop:
for f in LIB008983_TRA000{19916..20167}_*_L001_R*.fastq.gz
do
  echo "$f"
done
This gives output like:
LIB008983_TRA00019916_*_L001_R*.fastq.gz
LIB008983_TRA00019917_*_L001_R*.fastq.gz
LIB008983_TRA00019918_*_L001_R*.fastq.gz
...
LIB008983_TRA00020078_*_L001_R*.fastq.gz
LIB008983_TRA00020079_*_L001_R*.fastq.gz
LIB008983_TRA00020080_TAAGGCGA-TATCCTCT_L001_R1.fastq.gz
LIB008983_TRA00020080_TAAGGCGA-TATCCTCT_L001_R2.fastq.gz
...
LIB008983_TRA00020084_*_L001_R*.fastq.gz
LIB008983_TRA00020085_*_L001_R*.fastq.gz
LIB008983_TRA00020086_*_L001_R*.fastq.gz
All those lines with * in them represent files that don’t exist.
There are two ways to solve this. Firstly, if you want to keep the range, put a test around the mv:
if [ -f "$f" ]
then
  mv -i "$f" "$newName"
fi
Now the mv command is only run if the file exists.
The second way, if you don’t care about the range, is to drop it and just let the glob pattern match:
for f in LIB008983_TRA000*_*_L001_R*.fastq.gz
do
  newName=${f/_*_ _L001_R*.fastq.gz}
  mv -i "$f" "$newName"
done
In both cases you’ll no longer try to mv files that don’t exist.
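One caveat with the glob-only loop: if the pattern matches nothing at all (an empty or wrong directory, say), the shell by default leaves the pattern unexpanded, so the loop body still runs once with the literal pattern. Keeping the -f test inside the loop covers that case too; in bash, shopt -s nullglob is another option. Below is a minimal sketch of the guard. The count_existing helper and the scratch directory are made up for the demonstration and are not part of the original answer.

```shell
#!/usr/bin/env bash
# count_existing is a hypothetical helper: it counts the files the glob loop
# would actually act on inside the directory given as $1.
count_existing() {
    local dir="$1" n=0 f
    for f in "$dir"/LIB008983_TRA000*_*_L001_R*.fastq.gz; do
        # With no matches, $f is the literal pattern; -f rejects it.
        [ -f "$f" ] || continue
        n=$((n + 1))
    done
    echo "$n"
}

scratch=$(mktemp -d)      # empty scratch directory, for demonstration only
count_existing "$scratch" # nothing matches yet
touch "$scratch/LIB008983_TRA00020080_TAAGGCGA-TATCCTCT_L001_R1.fastq.gz"
count_existing "$scratch" # one real file now matches
rm -rf "$scratch"
```

Without the guard, the first call would count one iteration even though the directory is empty.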
As a side note, you don’t need some of the ; characters, so I removed them from my answer.
You have a second problem: your $newName isn’t what you want. I’m an old-school ksh coder and there may be better bash expressions, but I’d do something like:
tail=L${f##*_L}
head=${f%_*_$tail}_
newName="$head$tail"
mv -i "$f" "$newName"
So now given your input file list, we have
LIB008983_TRA00020080_L001_R1.fastq.gz
LIB008983_TRA00020080_L001_R2.fastq.gz
LIB008983_TRA00020081_L001_R1.fastq.gz
LIB008983_TRA00020081_L001_R2.fastq.gz
LIB008983_TRA00020082_L001_R1.fastq.gz
LIB008983_TRA00020082_L001_R2.fastq.gz
LIB008983_TRA00020083_L001_R1.fastq.gz
LIB008983_TRA00020083_L001_R2.fastq.gz
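Putting the pieces together, here is one way the whole thing might look as a dry run that prints the mv commands instead of executing them. This is only a sketch: the function names strip_barcode and preview_renames are invented for illustration, and it assumes every file name has the form shown in the question, with the last "_L" in the name marking the start of the lane field.

```shell
#!/usr/bin/env bash
# strip_barcode is a hypothetical helper name wrapping the expansions above.
strip_barcode() {
    local f="$1" tail head
    tail=L${f##*_L}           # from the last "_L" onwards, e.g. L001_R1.fastq.gz
    head=${f%_*_"$tail"}_     # drop "_<barcode>_<tail>", put back the trailing "_"
    printf '%s\n' "$head$tail"
}

# Print the mv commands rather than running them (a dry run).
# Run it from the directory that holds the files.
preview_renames() {
    local f new
    for f in LIB008983_TRA000*_*_L001_R*.fastq.gz; do
        [ -f "$f" ] || continue          # skip an unmatched pattern
        new=$(strip_barcode "$f")
        printf 'mv -i %s %s\n' "$f" "$new"
    done
}

# Single-name check, using a file name from the question:
strip_barcode LIB008983_TRA00020080_TAAGGCGA-TATCCTCT_L001_R1.fastq.gz
```

Once the output of preview_renames looks right, the printf line can be swapped for mv -i "$f" "$new" to do the actual rename.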