I'm trying to convert an image to a searchable PDF.
To do this, I use Tesseract to perform the OCR.
The problem is that the PDF produced by the Convert method is not searchable, even though the CallbackGetHocr callback is called.
Here is the code:
private string convertToSearchablePDF(string imagePath)
{
    // Wrap the image in a (non-searchable) PDF first.
    var input = this.imageToPdf(imagePath);
    var job = new OcrJob();
    using (var doc = new Document(input))
    {
        // The callback receives each page image and must return its hOCR markup.
        doc.Convert(img =>
        {
            var hocr = job.RunHOCR(new Bitmap(img));
            // Dump the hOCR next to the image so it can be inspected.
            File.WriteAllText(imagePath + ".hocr.html", hocr);
            return hocr;
        });
        var output = imagePath + ".output.pdf";
        doc.Save(output);
        return output;
    }
}
The Convert method correctly calls the callback, and the hOCR is returned. I write the hOCR to a file so that I can check its content.
However, the generated PDF is not searchable, and no error is raised.
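To rule out an empty OCR result, the hOCR file can be sanity-checked on its own. The following is only a minimal sketch: HocrCheck and ExtractWords are hypothetical names, not part of any OCR or PDF library, and it assumes Tesseract's usual word markup (a span with class ocrx_word whose title attribute starts with "bbox x0 y0 x1 y1") with no nested spans inside a word.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for inspecting hOCR output; not a library API.
public static class HocrCheck
{
    // Assumed word markup:
    // <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>
    private static readonly Regex WordPattern = new Regex(
        @"<span[^>]*class=['""]ocrx_word['""][^>]*title=['""]bbox (\d+) (\d+) (\d+) (\d+)[^'""]*['""][^>]*>(.*?)</span>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns every non-empty recognized word with its bounding box in image pixels.
    public static List<(string Text, int X0, int Y0, int X1, int Y1)> ExtractWords(string hocr)
    {
        var words = new List<(string Text, int X0, int Y0, int X1, int Y1)>();
        foreach (Match m in WordPattern.Matches(hocr))
        {
            // Strip inline tags such as <strong> or <em>, then decode HTML entities.
            var text = WebUtility.HtmlDecode(
                Regex.Replace(m.Groups[5].Value, "<[^>]+>", "")).Trim();
            if (text.Length == 0)
            {
                continue;
            }
            words.Add((text,
                int.Parse(m.Groups[1].Value),
                int.Parse(m.Groups[2].Value),
                int.Parse(m.Groups[3].Value),
                int.Parse(m.Groups[4].Value)));
        }
        return words;
    }
}
```

Calling HocrCheck.ExtractWords(File.ReadAllText(imagePath + ".hocr.html")) and looking at the Count of the result shows whether any words were recognized. If the list is not empty, the OCR step is probably fine and the problem is more likely in how the hOCR is applied to the PDF.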
I've attached some files:
- invoice13.jpg: the original image
- invoice13.jpg.pdf: the non-searchable PDF created from the image by the imageToPdf method
- invoice13.jpg.hocr.html: the hOCR output of Tesseract
- invoice13.jpg.output.pdf: what should be a searchable PDF