How do I change part of a file name when it is a variable in python?

I currently have a python script which takes a file as a command-line argument, does what it needs to do, and then outputs that file with _all_ORF.fsa_aa appended. I’d like to actually edit the file name rather than appending, but I am getting confused with variables. I’m not sure how I can actually do it when the file is a variable.

Here’s an example of the command-line argument:

gL=genomeList.txt   #Text file containing a list of genomes to loop through.             

for i in $(cat ${gL}); do
    #some other stuff ; 
    python ./find_all_ORF_from_getorf.py ${i}_getorf.fsa_aa ; 
    done

Here is some of the python script (find_all_ORF_from_getorf.py):

import re, sys

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

infile = sys.argv[1]

with open(f'{infile}_all_ORF.fsa_aa'.format(), "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
       #do some stuff
       print(f'{sequence.description}_ORF_from_position_{h.start()}\n{sequence.seq[h_start:]}', 
       file=file_object)

Currently, the output file is called Genome_file_getorf.fsa_aa_all_ORF.fsa_aa. I’d like to remove the first fsa_aa so that the output looks like this: Genome_file_getorf_all_ORF.fsa_aa. How do I do this? I can’t work out how to edit it.

I have had a look at the os.rename function, but that doesn’t seem to be able to edit the variable name, just append to it.

Thanks,

J

Answer

Regarding your bash code, you might find the following snippet useful. I find it a little more readable, and I use it a lot when iterating over lines.

# -r stops read from treating backslashes as escapes; IFS= keeps leading/trailing spaces
while IFS= read -r line; do
    # some other stuff
    python ./find_all_ORF_from_getorf.py "${line}_getorf.fsa_aa"
done < genomeList.txt

Now, regarding your question and your Python code:

import re, sys 

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

infile = sys.argv[1]

At this point infile will look like 'Genome_file_getorf.fsa_aa'. One option is to split this string on '.' and take the first item:

name = infile.split('.')[0]

If the file name might contain several '.' characters, like 'Myfile.out.old', and you only want to remove the last extension:

name = infile.rsplit('.',1)[0]

A third option: if you know that all your files end with '.fsa_aa', you can slice the string using negative indices. As '.fsa_aa' has 7 characters:

name = infile[:-7]

These three options all rely on Python's built-in string methods and slicing; see the official Python docs for more.
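To see how the three string options differ, here is a small sketch run on two file names. The names and the helper function names are made up for illustration; they are not part of your script.

```python
# Compare the string-based options on illustrative file names.
names = ["Genome_file_getorf.fsa_aa", "Myfile.out.old.fsa_aa"]

def strip_by_split(filename):
    # Keeps only the text before the FIRST dot.
    return filename.split(".")[0]

def strip_by_rsplit(filename):
    # Drops only the text after the LAST dot.
    return filename.rsplit(".", 1)[0]

def strip_by_slice(filename, suffix=".fsa_aa"):
    # Drops a fixed number of characters from the end;
    # only safe if every file name really ends with the suffix.
    return filename[:-len(suffix)]

for n in names:
    print(n, strip_by_split(n), strip_by_rsplit(n), strip_by_slice(n), sep="\t")
```

For the second name, split gives 'Myfile' while rsplit and slicing give 'Myfile.out.old'. Also be aware that split('.')[0] returns an empty string for a relative path such as './Genome_file_getorf.fsa_aa', because the string starts with a dot. On Python 3.9 or later, str.removesuffix('.fsa_aa') is another way to do the third option without counting characters.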

outfile = f'{name}_all_ORF.fsa_aa'
# An f-string such as f'{variable}' is already formatted, so you don't need ".format()"
# Alternatively you can write '{}'.format(variable)
# or even '{variable}'.format(variable=SomeOtherVariable)

with open(outfile, "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
        # do some stuff
        # write() does not add a newline the way print() does, so the string ends with \n
        file_object.write(f'{sequence.description}_ORF_from_position_{h.start()}\n{sequence.seq[h_start:]}\n')

Another option is to use Path from the pathlib library, which I suggest you play with a bit. In this case you would need some other minor changes to the code:

import re, sys
from pathlib import Path # <- Here

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

infile = Path(sys.argv[1]) # <- Here
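outfile = infile.stem + '_all_ORF.fsa_aa' # <- Here
# infile.stem is the file name without its directory or its last extension.
# To keep the output next to the input file as a Path, I would suggest instead
# outfile = infile.with_name(infile.stem + '_all_ORF.fsa_aa')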

with open(outfile, "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
        # do some stuff
        # write() does not add a newline the way print() does, so the string ends with \n
        file_object.write(f'{sequence.description}_ORF_from_position_{h.start()}\n{sequence.seq[h_start:]}\n')
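To make the pathlib attributes concrete, here is a minimal sketch of what they return for a file name like yours. The helper output_path and the data/ directory are hypothetical, added only for illustration; stem, suffix, parent and with_name are standard pathlib attributes and methods.

```python
from pathlib import Path

def output_path(infile, new_ending="_all_ORF.fsa_aa"):
    # Hypothetical helper: replace the last extension with new_ending
    # while keeping the file in its original directory.
    p = Path(infile)
    return p.with_name(p.stem + new_ending)

example = Path("data/Genome_file_getorf.fsa_aa")
print(example.stem)          # file name without the last extension
print(example.suffix)        # the last extension, including the dot
print(example.parent)        # the directory part
print(output_path(example))  # same directory, new file name
```

Because with_name only swaps the final path component, the output lands next to the input even when the script is called with a directory in the argument.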

Finally, as you have seen, in both cases I replaced the print call with the file_object.write method. Either one works; I find write the more explicit way to send text to a file.
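One practical difference between the two calls: print appends a newline to whatever it outputs, while write emits exactly the string it is given. A minimal sketch, using in-memory buffers (io.StringIO) in place of the real output file and a made-up record line:

```python
import io

record_line = "seq1_ORF_from_position_12"  # illustrative text, not real data

printed = io.StringIO()
print(record_line, file=printed)      # print appends "\n" automatically

written = io.StringIO()
written.write(record_line)            # write adds nothing extra

written_nl = io.StringIO()
written_nl.write(record_line + "\n")  # add the newline yourself to match print

print(repr(printed.getvalue()))
print(repr(written.getvalue()))
print(repr(written_nl.getvalue()))
```

So when switching from print to write inside a loop, end each string with "\n", otherwise consecutive records run together on one line.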