Renaming multiple paired files: how to delete a varying barcode string from the middle of the name?

I have a bunch of paired files with unneeded barcode tags in the middle of the file name, for example:

LIB008983_TRA00020080_TAAGGCGA-TATCCTCT_L001_R1.fastq.gz
LIB008983_TRA00020080_TAAGGCGA-TATCCTCT_L001_R2.fastq.gz
LIB008983_TRA00020081_TAAGGCGA-AGAGTAGA_L001_R1.fastq.gz
LIB008983_TRA00020081_TAAGGCGA-AGAGTAGA_L001_R2.fastq.gz
LIB008983_TRA00020082_TAAGGCGA-GTAAGGAG_L001_R1.fastq.gz
LIB008983_TRA00020082_TAAGGCGA-GTAAGGAG_L001_R2.fastq.gz
LIB008983_TRA00020083_TAAGGCGA-ACTGCATA_L001_R1.fastq.gz
LIB008983_TRA00020083_TAAGGCGA-ACTGCATA_L001_R2.fastq.gz

I need to get rid of the barcode (which varies from file to file) without modifying the identifiers at the beginning or end of the file name.

I have tried writing a script myself from what I’ve read online, but it appears to be a relatively poor attempt:

for f in LIB008983_TRA000{19916..20167}_*_L001_R*.fastq.gz;
do
  newName=${f/_*_ _L001_R*.fastq.gz}
  mv -i "$f" "$newName";
done

Here’s the error message I get:

mv: cannot stat ‘LIB008983_TRA00019917_*_L001_R*.fastq.gz’: No such file or directory

Ideally, my final file name would be, for example:

LIB008983_TRA00020136_L001_R1.fastq.gz
LIB008983_TRA00020136_L001_R2.fastq.gz
LIB008983_TRA00020137_L001_R1.fastq.gz
LIB008983_TRA00020137_L001_R2.fastq.gz
..
..

and so on

Answer

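The problem is that the for loop expands in a way you are not expecting. Brace expansion ({19916..20167}) happens before globbing and generates one pattern for every number in the range, whether or not a matching file exists. With bash’s default settings, a pattern that matches nothing is left as the literal string, * included, and that string is what gets passed to mv.

For example, no file matches the pattern for 19917, which causes that mv error message.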

You can see this by putting an echo in the loop:

for f in LIB008983_TRA000{19916..20167}_*_L001_R*.fastq.gz
do
  echo "$f"
done

This gives output like:

LIB008983_TRA00019916_*_L001_R*.fastq.gz
LIB008983_TRA00019917_*_L001_R*.fastq.gz
LIB008983_TRA00019918_*_L001_R*.fastq.gz
...
LIB008983_TRA00020078_*_L001_R*.fastq.gz
LIB008983_TRA00020079_*_L001_R*.fastq.gz
LIB008983_TRA00020080_TAAGGCGA-TATCCTCT_L001_R1.fastq.gz
LIB008983_TRA00020080_TAAGGCGA-TATCCTCT_L001_R2.fastq.gz
...
LIB008983_TRA00020084_*_L001_R*.fastq.gz
LIB008983_TRA00020085_*_L001_R*.fastq.gz
LIB008983_TRA00020086_*_L001_R*.fastq.gz

All those lines with * in them represent files that don’t exist.
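
The same behaviour can be reproduced on a small scale. The following is only a sketch: it works in a scratch directory created with mktemp, and the function name show_expansion and the file name sample_7_x.txt are made up for the demonstration. It assumes bash with default options (nullglob and failglob off).

```bash
#!/usr/bin/env bash
# Hypothetical demo: one real file, a brace range covering three numbers.
show_expansion() (
  # Runs in a subshell, so the caller's working directory is untouched.
  cd "$(mktemp -d)" || exit 1
  touch sample_7_x.txt

  # Brace expansion first yields: sample_6_*.txt sample_7_*.txt sample_8_*.txt
  # Each word is then globbed separately; unmatched patterns stay literal.
  for f in sample_{6..8}_*.txt; do
    if [ -e "$f" ]; then
      echo "exists:  $f"
    else
      echo "missing: $f"
    fi
  done
)

show_expansion
# Prints:
# missing: sample_6_*.txt
# exists:  sample_7_x.txt
# missing: sample_8_*.txt
```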

There are two ways to solve this. First, if you want to keep the range, put a test around the mv:

  if [ -f "$f" ]
  then
    mv -i "$f" "$newName"
  fi

Now the mv command is only run if the file exists.

The second way, if you don’t care about the range, is to drop it and let the glob pattern match only the files that exist:

for f in LIB008983_TRA000*_*_L001_R*.fastq.gz
do
  newName=${f/_*_ _L001_R*.fastq.gz}
  mv -i "$f" "$newName"
done

In both cases you’ll no longer try to mv files that don’t exist.

As a side note, some of the ; in your script are unnecessary, so I removed them from my answer.

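You have a second problem: your "$newName" isn’t what you want. The pattern in ${f/_*_ _L001_R*.fastq.gz} contains a space, which never occurs in your file names, so nothing matches and newName ends up identical to $f. I’m an old-school ksh coder and there may be better bash expressions, but I’d do something like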

  tail=L${f##*_L}
  head=${f%_*_$tail}_
  newName="$head$tail"
  mv -i "$f" "$newName"

So now, given your input file list, we have

LIB008983_TRA00020080_L001_R1.fastq.gz
LIB008983_TRA00020080_L001_R2.fastq.gz
LIB008983_TRA00020081_L001_R1.fastq.gz
LIB008983_TRA00020081_L001_R2.fastq.gz
LIB008983_TRA00020082_L001_R1.fastq.gz
LIB008983_TRA00020082_L001_R2.fastq.gz
LIB008983_TRA00020083_L001_R1.fastq.gz
LIB008983_TRA00020083_L001_R2.fastq.gz
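
Putting the pieces together, a complete script might look like the sketch below. It combines the plain glob with the head/tail expansions from this answer; the function names strip_barcode and rename_all are hypothetical, and it assumes the barcode field itself never contains the sequence _L.

```bash
#!/usr/bin/env bash
# Sketch only: strip_barcode and rename_all are made-up helper names.

# Print the barcode-free name for a single file name.
strip_barcode() {
  local f=$1 tail head
  tail=L${f##*_L}            # lane/read suffix, e.g. L001_R1.fastq.gz
  head=${f%_*_"$tail"}_      # everything before the barcode field, plus "_"
  printf '%s\n' "$head$tail"
}

# Rename every matching file in the current directory.
rename_all() {
  local f newName
  for f in LIB008983_TRA000*_*_L001_R*.fastq.gz; do
    [ -f "$f" ] || continue  # skips the literal pattern if nothing matched
    newName=$(strip_barcode "$f")
    mv -i -- "$f" "$newName"
  done
}

# Usage: cd into the directory holding the files, then run
#   rename_all
```

To preview the result without touching anything, replace the mv line with echo mv -i -- "$f" "$newName" for a first pass. Files that have already been renamed no longer match the glob, so running rename_all a second time should leave them alone.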
