How to sort a FASTA file based on date?

I have a FASTA file that looks like this:

>Spike|hCoV-19/Wuhan/WIV04/2019|2019-12-30|EPI_ISL_402124|Original|hCoV-19^^Hubei|Human|Wuhan Jinyintan Hospital|Wuhan Institute of Virology|Shi|China
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
>Spike|hCoV-19/Philippines/PH-PGC-03696/2020|2020-12-23|EPI_ISL_2155626|Original|hCoV-19^^Central Luzon|Human|Research Institute for Tropical Medicine|Philippine Genome Center|Tablizo|Philippines
MFVFLVLLPLVFSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYYPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
>Spike|hCoV-19/Belgium/UZA-UA-8350/2021|2021-01-22|EPI_ISL_940774|Original|hCoV-19^^Berchem|Human|Platform BIS UZA/UAntwerpen|UAntwerp|Xavier|Belgium
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNTVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAQHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCT*

I need to sort these sequences by the date field in the header. I found this code on Stack Overflow, but it doesn't do the job:

from Bio.SeqIO.FastaIO import SimpleFastaParser
import pandas as pd

with open('F:/newone.fasta') as fasta_file: 
    identifiers = []
    lengths = []
    seq = []
    for title, sequence in SimpleFastaParser(fasta_file):
        identifiers.append(title.split(None, 3)[0])  
        lengths.append(len(sequence))
        seq.append(sequence)
#converting lists to pandas Series    
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')
s3 = pd.Series(seq, name='seq')


Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])

With a second piece of code I tried (the sort_fasta function visible in the traceback), I get the following error, and I don't know why it happens:

IndexError                                Traceback (most recent call last)
in <module>
     12     SeqIO.write(records, output_file, "fasta")
     13 
---> 14 sort_fasta(input_file, output_file)

 in sort_fasta(input_file, output_file)
      8     def get_data(id_name):
      9         return (id_name.split("|")[2], seguid(id_name))
---> 10     dict_fasta = SeqIO.index(input_file, "fasta", key_function=get_data)
     11     records = (dict_fasta[i] for i in sorted(list(dict_fasta), reverse=True, key = lambda d: list(map(int, d[0].split('-')))))
     12     SeqIO.write(records, output_file, "fasta")

~\anaconda3\envs\deeplearning\lib\site-packages\Bio\SeqIO\__init__.py in index(filename, format, alphabet, key_function)
    873         key_function,
    874     )
--> 875     return _IndexedSeqFileDict(
    876         proxy_class(filename, format), key_function, repr, "SeqRecord"
    877     )

~\anaconda3\envs\deeplearning\lib\site-packages\Bio\File.py in __init__(self, random_access_proxy, key_function, repr, obj_repr)
    185             offset_iter = random_access_proxy
    186         offsets = {}
--> 187         for key, offset, length in offset_iter:
    188             # Note - we don't store the length because I want to minimise the
    189             # memory requirements. With the SQLite backend the length is kept

~\anaconda3\envs\deeplearning\lib\site-packages\Bio\File.py in <genexpr>(.0)
    181         self._obj_repr = obj_repr
    182         if key_function:
--> 183             offset_iter = ((key_function(k), o, l) for (k, o, l) in random_access_proxy)
    184         else:
    185             offset_iter = random_access_proxy

in get_data(id_name)
      7 def sort_fasta(input_file, output_file):
      8     def get_data(id_name):
----> 9         return (id_name.split("|")[2], seguid(id_name))
     10     dict_fasta = SeqIO.index(input_file, "fasta", key_function=get_data)
     11     records = (dict_fasta[i] for i in sorted(list(dict_fasta), reverse=True, key = lambda d: list(map(int, d[0].split('-')))))

What should I do about this?

Answer

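The following code sorts the FASTA entries in the input file by date and writes them to the output file. It uses SeqIO.index, which keeps only the file offset of each record in memory rather than the sequences themselves, so it should also work for files that are too large to fit in memory. The IndexError in the question is raised by id_name.split("|")[2], which fails whenever a record ID has fewer than three "|"-separated fields; searching for the date with a regular expression avoids this. Records whose ID contains no date are given the placeholder date 0001-01-01 and are left out of the output. Entries are written newest first; remove reverse=True to get the oldest first.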

import re
from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid

input_file = "fasta.fasta"
output_file = "out.fasta"

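def sort_fasta(input_file: str, output_file: str) -> None:
    def get_index_key(id_name: str) -> tuple:
        # Key each record by (date, checksum of the ID) so that records
        # sharing a date still get distinct keys.
        try:
            key = (re.search(r'\d{4}-\d{2}-\d{2}', id_name).group(), seguid(id_name))
        except AttributeError:
            # re.search returned None: no date in this ID, use a placeholder date.
            key = ('0001-01-01', seguid(id_name))
        return key
    # SeqIO.index stores file offsets only, not the sequences themselves.
    dict_fasta = SeqIO.index(input_file, "fasta", key_function=get_index_key)
    # Sort the keys newest first; remove reverse=True for oldest first.
    sorted_keys_by_date = sorted(list(dict_fasta), reverse=True, key = lambda d: list(map(int, d[0].split('-'))))
    # Skip records that received the placeholder date.
    records = (dict_fasta[i] for i in sorted_keys_by_date if i[0] != '0001-01-01')
    SeqIO.write(records, output_file, "fasta")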

sort_fasta(input_file, output_file)
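
If the file is small enough to fit in memory, the same idea can be sketched with the standard library alone. This is only an illustration, not part of the code above: the helper names (read_fasta, header_date, sort_records_by_date) and the toy headers are made up for the example, and it assumes the date appears somewhere in the header as YYYY-MM-DD. Unlike the indexed version, it sorts oldest first by default and returns the records without a usable date separately instead of dropping them, which makes it easier to see which headers would have triggered the IndexError.

```python
import re
from datetime import date
from typing import Iterable, Iterator, List, Optional, Tuple

# Assumption: dates are written as YYYY-MM-DD somewhere in the header.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def header_date(header: str) -> Optional[date]:
    """Return the first valid YYYY-MM-DD date in the header, or None."""
    match = DATE_PATTERN.search(header)
    if match is None:
        return None
    try:
        return date.fromisoformat(match.group())
    except ValueError:
        return None  # e.g. 2020-00-00 matches the pattern but is not a date


def sort_records_by_date(
    lines: Iterable[str], newest_first: bool = False
) -> Tuple[List[Record], List[Record]]:
    """Split records into (sorted dated records, undated records)."""
    dated: List[Tuple[date, str, str]] = []
    undated: List[Record] = []
    for header, seq in read_fasta(lines):
        found = header_date(header)
        if found is None:
            undated.append((header, seq))
        else:
            dated.append((found, header, seq))
    dated.sort(key=lambda item: item[0], reverse=newest_first)
    return [(h, s) for _, h, s in dated], undated


if __name__ == "__main__":
    # Toy input; real headers would look like the ones in the question.
    example = [
        ">Spike|B|2021-01-22|EPI_2",
        "MFVF",
        ">Spike|A|2019-12-30|EPI_1",
        "MFVFLV",
        ">Spike|no_date_here",
        "MF",
    ]
    dated, undated = sort_records_by_date(example)
    for header, seq in dated:
        print(f">{header}\n{seq}")
    print(f"{len(undated)} record(s) without a usable date")
```

To use it on a real file, pass an open file handle (for example, with open("fasta.fasta") as handle: sort_records_by_date(handle)) and write the returned records back out in the same ">header" / sequence layout.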