Hi,
I'm having an issue trying to embed hOCR data into a PDF.
My code looks like this:
Document doc = new Document("c:/bad/1704-01-2012-017-C003-029.pdf");
// The callback receives each page image and must return its hOCR markup
doc.convert(new Document.CallBackGetHocr() {
    @Override
    public String invoke(BufferedImage bi) {
        try {
            Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
            instance.setHocr(true);       // ask for hOCR output instead of plain text
            instance.setLanguage("spa");  // Spanish language data
            String result = instance.doOCR(bi);
            return result;
        } catch (TesseractException ex) {
            ex.printStackTrace();
        }
        // OCR failed, so there is no hOCR to return for this image
        return null;
    }
});
doc.save("c:/bad/1704-01-2012-017-C003-025_ASPOSE.pdf");
Running this code (with the Tess4J dependencies) doesn't produce a searchable PDF. Document.CallBackGetHocr receives the image and Tesseract generates the hOCR, but when I save the document, the output is not searchable.
I'm attaching the input, the output and the HOCR generated by Tesseract.
You can avoid using Tesseract by making invoke return the contents of the file "hocr.txt".
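In case it helps to reproduce this without Tess4J, here is a rough sketch of what I mean. The class name HocrFileSource and the method readHocr are just placeholders of mine, and I'm assuming the attached hocr.txt is UTF-8 text. It only uses the JDK, and it returns null on failure like the Tesseract version above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Aspose or Tess4J): loads saved hOCR markup from disk.
public class HocrFileSource {

    // Reads the whole file as UTF-8 text (encoding is an assumption).
    // Returns null if the file cannot be read, mirroring the null
    // return in the Tesseract-based callback.
    public static String readHocr(Path hocrFile) {
        try {
            return new String(Files.readAllBytes(hocrFile), StandardCharsets.UTF_8);
        } catch (IOException ex) {
            ex.printStackTrace();
            return null;
        }
    }
}
```

With that in place, the body of invoke can be reduced to a single line such as return HocrFileSource.readHocr(Paths.get("c:/bad/hocr.txt")); where the location of hocr.txt is only an example and should point to wherever you save the attachment.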
I hope you can help me with this problem.
Thanks for your attention.