I have content of a PDF file in base64 like
How can I parse it to get base64 of each page of it?
Assume the PDF file has 5 pages. How can I get the content of each page in base64? I already googled it but could not find anything. Any help is appreciated.
In general, it is not possible to split a native PDF file into pages by simply cutting the file into pieces, and it is even less feasible while the file is still base64-encoded, as you will see.
The most general structure of a PDF file is, in this order:
- PDF header
- PDF objects (file body)
- PDF xref table (table of contents, giving file offset location for each PDF object)
- PDF trailer
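To make this layout concrete, here is a minimal sketch that assembles a tiny one-page PDF by hand and then inspects a base64 copy of it. The helper names `build_minimal_pdf` and `describe_layout` are made up for this illustration, and the code assumes a classic xref table with no incremental updates; many real-world files use compressed xref streams instead, where this naive scan would not work.

```python
import base64
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF by hand: header, objects, xref, trailer."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 72 72] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")  # PDF header
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # absolute byte offset of this object
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    # xref table: one 20-byte entry per object, holding its byte offset
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    # trailer: points at the catalog object and at the xref table
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def describe_layout(pdf_b64: str) -> dict:
    """Decode a base64 PDF and report where its structural parts sit."""
    data = base64.b64decode(pdf_b64)
    # The last startxref entry gives the byte offset of the xref table.
    xref_pos = int(re.findall(rb"startxref\s+(\d+)", data)[-1])
    return {
        "header": data[:8].decode("ascii"),
        "xref_offset": xref_pos,
        "xref_found_at_offset": data[xref_pos:xref_pos + 4] == b"xref",
        # Naive scan, good enough for this toy file only.
        "object_offsets": [m.start() for m in re.finditer(rb"\d+ 0 obj", data)],
    }


if __name__ == "__main__":
    encoded = base64.b64encode(build_minimal_pdf()).decode("ascii")
    print(describe_layout(encoded))
```

Note that every offset stored in the xref table is an absolute position in the decoded file, which is why these values stop being valid as soon as objects are moved into a different file.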
You cannot assume that the PDF objects appear in the file in the same order as the pages appear in a PDF viewer.
If you extract a single page, that page must itself be a valid PDF document containing, in this order, header, objects, xref and trailer. The xref and trailer have to be rebuilt so that they match the new document; they cannot simply be copied from the original.
For this reason you need to decode the base64-encoded file completely before you can even think of accessing a single page of the resulting PDF.
To get all individual pages as base64 from a 5-page PDF document that has been encoded with base64, you have to follow these steps:
- Decode the complete base64 file into a valid 5-page PDF document.
- Split the 5-page PDF document into 5 separate 1-page PDF documents (you need to know the “rules of the PDF game” for this, or make use of a PDF library that does know).
- Encode each 1-page PDF document with base64.