Concatenate multiple zipped files, skipping header lines in all but the first file

I have a collection of gzipped files that I want to combine into a single file. They all have the same format. I want to keep the header information from only the first file and skip it in the subsequent files.

As a simple example, I have four identical files with the following content:

$ gzcat file1.gz
# header
1
2

I want to end up with

# header
1
2
1
2
1
2
1
2

In reality, I can have a varying number of files, so I would like to be able to do this programmatically. Here is the non-programmatic solution I have so far…

cat <(gzcat file1.gz) <(tail -q -n +2 <(gzcat file2.gz) <(gzcat file3.gz) <(gzcat file4.gz))

This command works, but it is “hard coded” to handle four files, and I need to generalize it for any number of files.  I am using bash as the shell if that helps. My preference is for performance (in reality the files can be millions of lines long), so I am OK with a less-than-elegant solution if it is speedy.

Answer

If the command that you show in your question basically works (for a hard-coded number of files), then

first=1
for f in file*.gz
do
    if [ "$first" ]
    then
        gzcat "$f"
        first=
    else
        gzcat "$f" | tail -n +2
    fi
done > collection_single_file

should work for you.  I hope the logic is fairly clear.  Look at all the files (change the wildcard as appropriate for your file names).  If it’s the first one in the list, gzcat it, so you get the entire file (including the header).  Otherwise, use tail to strip the header.  Once the first file has been handled, first is set to the empty string, so every later file takes the else branch.
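
The same logic can also be wrapped in a function that takes the files as arguments, which removes the need for the flag variable. This is a minimal sketch, assuming bash; the name concat_gz is made up for illustration, and gzip -dc is used in place of gzcat only because gzcat is not installed under that name on every system.

```shell
# Hypothetical helper: print the first file in full, then every
# remaining file without its first (header) line.
concat_gz() {
    local f
    # Nothing to do if no files were given (e.g. the glob matched nothing
    # and nullglob is set).
    [ "$#" -gt 0 ] || return 0
    gzip -dc -- "$1"          # first file: keep the header
    shift
    for f in "$@"; do
        gzip -dc -- "$f" | tail -n +2   # later files: skip line 1
    done
}

# Usage (file names as in the question):
#   concat_gz file*.gz > collection_single_file
```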

This invokes tail N−1 times, instead of just once (as the command in your question does).  Aside from that, my answer should perform the same as your command.
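
If the extra tail processes ever matter, one possible variation is to decompress all the files with a single gzip invocation and drop the repeated headers with a single awk process. This is only a sketch: the function name concat_gz_awk is made up, and it assumes that the header is exactly one line and that no data line is identical to it (any such line would be dropped as well). Whether it is actually faster than the loop is something to measure on your own data.

```shell
# Hypothetical helper: one gzip process and one awk process in total.
# Assumption: a one-line header, and no data line equal to the header.
concat_gz_awk() {
    # gzip -dc writes the decompressed files to stdout back to back.
    # awk remembers line 1 as the header, prints it once, and then
    # prints only the lines that differ from it.
    gzip -dc -- "$@" | awk 'NR == 1 { header = $0; print; next } $0 != header'
}

# Usage (file names as in the question):
#   concat_gz_awk file*.gz > collection_single_file
```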
