How to get the content of a PDF file page by page, given the base64 of the whole file?

I have the content of a PDF file in base64, like JVBERi0xLjIgDSXi48/T....

How can I parse it to get the base64 of each page?

Assume the PDF file has 5 pages. How can I get the content of each page in base64? I already googled it but could not find anything. Any help is appreciated.

Answer

In general, you cannot separate the contents of a native PDF file page by page simply by cutting the file into pieces (and it is certainly impossible while the file is still base64 encoded, as you will see).

The most general structure of a PDF file is, in this order:

  1. PDF header
  2. PDF objects (file body)
  3. PDF xref table (table of contents, giving file offset location for each PDF object)
  4. PDF trailer
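
As a rough illustration of this layout, the sketch below decodes a base64 string and reports where a few structural markers sit in the decoded bytes. The function name `describe_pdf_layout` is made up for this example, and the code assumes a classic PDF whose file ends with a `startxref` keyword, an offset and `%%EOF`; it only looks at markers and does not parse the file.

```python
import base64


def describe_pdf_layout(pdf_b64: str) -> dict:
    """Decode a base64 PDF and report a few of its structural markers."""
    data = base64.b64decode(pdf_b64)

    # The number after the last "startxref" keyword is the byte offset
    # of the xref section inside the decoded file.
    xref_offset = None
    startxref_at = data.rfind(b"startxref")
    if startxref_at != -1:
        tail = data[startxref_at + len(b"startxref"):].split()
        if tail and tail[0].isdigit():
            xref_offset = int(tail[0])

    return {
        "header": data[:8].decode("latin-1"),   # e.g. "%PDF-1.4"
        "xref_offset": xref_offset,
        "has_eof_marker": data.rstrip().endswith(b"%%EOF"),
    }
```

Note that the xref offset is a position in the decoded bytes, which is one reason the markers only make sense after decoding.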

You cannot assume that the PDF objects appear in the same order inside the file as the pages appear in a PDF viewer.

If you extract a single page, that page must itself be a valid PDF document containing, in this order, header, objects, xref and trailer. The xref and trailer must be rebuilt to match the new document; they cannot simply be copied from the original.

For this reason you need to decode the base64-encoded file completely before you can even think of accessing a single page of the resulting PDF.
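
A small experiment can make this more concrete. The sketch below first decodes the prefix quoted in the question, which turns out to be the PDF header, and then uses a toy byte string (not a PDF) to show an additional obstacle: base64 maps every 3 bytes to 4 characters, so a byte range that does not start on a 3-byte boundary has no matching substring in the encoded text.

```python
import base64

# The prefix quoted in the question decodes to the start of a PDF header.
prefix = base64.b64decode("JVBERi0xLjIgDSXi48/T")
print(prefix[:8])  # b'%PDF-1.2'

# Toy data: encode the whole string and a slice taken from its middle.
whole = base64.b64encode(b"abcdef").decode()  # "YWJjZGVm"
part = base64.b64encode(b"bcd").decode()      # "YmNk"

# The encoded slice does not appear inside the encoded whole.
print(part in whole)  # False
```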

To get all individual pages as base64 from a 5-page PDF document that has been encoded with base64, you have to follow these steps:

  1. Decode the complete base64 file into a valid 5-page PDF document.
  2. Split the 5-page PDF document into 5 separate 1-page PDF documents.
    (you need to know the “rules of the PDF game” for this, or make use of a PDF library that does know)
  3. Encode each 1-page PDF document with base64.
