How do I get the font file or PDFont of each word in a PDF file?

Is there a way to get the font of each word of a PDF file using PDFBox? I tried the code below, but it only lists all the fonts used on each page.

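PDDocument pdfDocument = PDDocument.load(new File("xxofd.pdf"));

PDPageTree pages = pdfDocument.getDocumentCatalog().getPages();
for (PDPage page : pages) {
    PDResources res = page.getResources();

    // Lists every font declared in the page resources, not the font of each word
    for (COSName fontName : res.getFontNames()) {
        PDFont font = null;
        try {
            font = res.getFont(fontName);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(font == null ? fontName.getName() : font.getName());
    }
}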

There are many different characters in the PDF file, and different characters may use different fonts. I want to extract a subset of these fonts that contains only the fonts of the words that actually appear in the PDF file, which would make the font file smaller. So I want to get the font file or PDFont structure of each word of a PDF file. Is there any way to do this? Thanks.


Given a PDF file that mixes fonts within its text, this code:


PDDocument pdfDocument = PDDocument.load(new File("/home/josejuan/tmp/fonts.pdf"));

PDFTextStripper pdfStripper = new PDFTextStripper() {
    // Called for each piece of text found while the document is parsed
    @Override
    protected void processTextPosition(TextPosition text) {
        System.out.println("Text `" + text.getUnicode() + "` with font `" + text.getFont().getName() + "`");
    }
};

// force parse
pdfStripper.getText(pdfDocument);

produces the expected output:

Text `E` with font `BAAAAA+LiberationSerif`
Text `x` with font `BAAAAA+LiberationSerif`
Text `a` with font `CAAAAA+CantarellRegular`
Text `m` with font `CAAAAA+CantarellRegular`
Text `p` with font `BAAAAA+LiberationSerif`

(You can of course group these results, for example by word or by font.)
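One possible way to do that grouping is sketched below. It is only an illustration: WordFontGrouper and WordFonts are hypothetical names, not part of PDFBox, and the sketch assumes that words are separated by whitespace characters arriving in the same stream as the other characters, which may not hold for every PDF. It takes the Unicode text and font name of each character and collects, per word, the distinct font names in order of first use.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not a PDFBox class): groups per-character font info into words.
class WordFontGrouper {

    // One word plus the distinct font names seen in it, in order of first use.
    static final class WordFonts {
        final String word;
        final Set<String> fonts;

        WordFonts(String word, Set<String> fonts) {
            this.word = word;
            this.fonts = fonts;
        }

        @Override
        public String toString() {
            return word + " -> " + fonts;
        }
    }

    private final List<WordFonts> words = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private Set<String> currentFonts = new LinkedHashSet<>();

    // Feed one character: its Unicode text and the name of its font.
    // Assumption: whitespace-only text marks the end of a word.
    void add(String unicode, String fontName) {
        if (unicode.trim().isEmpty()) {
            flush();
            return;
        }
        current.append(unicode);
        currentFonts.add(fontName);
    }

    // Close the word in progress, if there is one.
    private void flush() {
        if (current.length() == 0) {
            return;
        }
        words.add(new WordFonts(current.toString(), currentFonts));
        current.setLength(0);
        currentFonts = new LinkedHashSet<>();
    }

    // Close the last word and return everything collected so far.
    List<WordFonts> finish() {
        flush();
        return words;
    }
}
```

To use it with the stripper above, you could call grouper.add(text.getUnicode(), text.getFont().getName()) inside processTextPosition instead of printing, then read grouper.finish() after getText returns. If your PDF does not store spaces as characters, the word boundaries would have to come from somewhere else.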

From the processTextPosition override above you can inspect every character of the text. For example, if you need the font file:


but depending on what exactly you are looking for, it may be better to work with PDFont, PDFontDescriptor, PDStream, …