How to merge two files by line names using Python

I think this should be easy, but I have not been able to solve it yet. I have two files, shown below, and I want to merge them so that the lines starting with > in file1 become the headers for the lines in file2.

file1:

>seq12
ACGCTCGCA
>seq34
GCATCGCGT
>seq56
GCGATGCGC

file2:

ATCGCGCATGATCTCAG
AGCGCGCATGCGCATCG
AGCAAATTTAGCAACTC

so the desired output should be:

>seq12
ATCGCGCATGATCTCAG
>seq34
AGCGCGCATGCGCATCG
>seq56
AGCAAATTTAGCAACTC

I have tried this code so far, but in the output all the lines coming from file2 are the same:

from Bio import SeqIO

with open(file1) as fw:
    with open(file2,'r') as rv:
        for line in rv:
            items = line
        for record in SeqIO.parse(fw, 'fasta'):
            print('>' + record.id)
            print(line)

Answer

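In your code, the inner for loop runs through all of file2 before any record is printed, so line always holds the last line of file2 by the time the headers are written. If you cannot hold your files in memory, you need a solution that reads one line at a time from each file and writes to the output file as it goes. The following program does that; the comments explain each step.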

with open("file1.txt") as first, open("file2.txt") as second, open("output.txt", "w") as output:
    while True:
        line_first = first.readline()       # line from file1 (header)
        line_second = second.readline()     # line from file2 (body)
        if not (line_first and line_second):
            # stop as soon as either file has ended
            break

        # write the header followed by its new body line
        output.writelines([line_first, line_second])
        # skip the sequence line that follows the header in file1
        first.readline()

Note that this works only if file1.txt has exactly the format you showed: headers on the odd-numbered lines, each followed by a single sequence line that is discarded. To allow a bit more flexibility, you can wrap it in a function:

def merge_files(header_file_path, body_file_path, output_file="output.txt", every_n_lines=2):
    with open(header_file_path) as first, open(body_file_path) as second, open(output_file, "w") as output:
        while True:
            line_first = first.readline()       # line from the header file
            line_second = second.readline()     # line from the body file
            if not (line_first and line_second):
                # stop as soon as either file has ended
                break

            # write the header followed by its new body line
            output.writelines([line_first, line_second])
            # skip the remaining lines of this record in the header file
            for _ in range(every_n_lines - 1):
                first.readline()

Calling merge_files("file1.txt", "file2.txt") should then do the trick.
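If the header file might not follow a fixed line pattern (for example, sequences wrapped over several lines), one possible variation is to pick headers by their leading > rather than by position. The sketch below is only an illustration: the function name merge_by_prefix and its prefix parameter are made up for this example, and it assumes every header line starts with > and that file2 holds exactly one body line per header. It still reads both files lazily, one line at a time.

```python
def merge_by_prefix(header_path, body_path, output_path="output.txt", prefix=">"):
    """Pair each header line (one starting with `prefix`) with the next body line.

    Hypothetical helper for illustration; assumes one body line per header.
    """
    with open(header_path) as headers, open(body_path) as bodies, \
            open(output_path, "w") as output:
        # Generator expression: keeps only header lines, without loading the file.
        header_lines = (line for line in headers if line.startswith(prefix))
        # zip stops as soon as either the headers or the body lines run out.
        for header, body in zip(header_lines, bodies):
            # Normalize line endings so a missing final newline does not
            # glue two lines together in the output.
            output.write(header.rstrip("\n") + "\n")
            output.write(body.rstrip("\n") + "\n")
```

With the files from the question, merge_by_prefix("file1.txt", "file2.txt") would write the three > headers to output.txt, each followed by the corresponding line from file2. Non-header lines in file1 are simply ignored, however many there are per record.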
