Channel: Aspose.Pdf Product Family

Problem with HOCR

Hi

I'm having an issue trying to embed HOCR data into a PDF.

My code looks like this:

    Document doc = new Document("c:/bad/1704-01-2012-017-C003-029.pdf");
    doc.convert(new Document.CallBackGetHocr() {
        @Override
        public String invoke(BufferedImage bi) {
            try {
                Tesseract instance = Tesseract.getInstance();  // JNA Interface Mapping
                instance.setHocr(true);
                instance.setLanguage("spa");
                String result = instance.doOCR(bi);

                return result;
            } catch (TesseractException ex) {
                ex.printStackTrace();
            }

            return null;
        }
    });

    doc.save("c:/bad/1704-01-2012-017-C003-025_ASPOSE.pdf");

Executing this code (with the Tess4J dependencies) doesn't produce a searchable PDF. Document.CallBackGetHocr receives the image and Tesseract generates the HOCR, but when I save the document, the output is not searchable.

I'm attaching the input, the output and the HOCR generated by Tesseract.
You can avoid using Tesseract by making invoke return the contents of the file "hocr.txt".
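
To illustrate that shortcut, here is a minimal sketch of reading the attached HOCR from disk. The class name HocrFromFile and the method readHocr are hypothetical names used only for this example, and it assumes "hocr.txt" is UTF-8 text in the working directory. Returning null on failure mirrors the fallback in the Tesseract-based callback above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Aspose.Pdf or Tess4J.
public class HocrFromFile {

    // Reads the whole HOCR file as UTF-8 text.
    // Returns null if the file cannot be read, like the callback above does on error.
    static String readHocr(Path path) {
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException ex) {
            ex.printStackTrace();
            return null;
        }
    }

    public static void main(String[] args) {
        // Assumption: hocr.txt sits in the current working directory.
        String hocr = readHocr(Paths.get("hocr.txt"));
        System.out.println(hocr == null
                ? "hocr.txt could not be read"
                : hocr.length() + " characters read");
    }
}
```

With this in place, the body of invoke could be reduced to `return HocrFromFile.readHocr(Paths.get("hocr.txt"));`, which takes Tesseract out of the picture while reproducing the problem.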

I hope you can help me with this problem.

Thanks for your attention.
