Channel: Aspose.Pdf Product Family

Image to Searchable PDF

I'm trying to convert an image to a searchable PDF.
To do this I use Tesseract to run the OCR.

The problem is that the Convert method never calls the CallbackGetHocr method.

Here is the code:

private string convertToSearchablePDF(string imagePath)
{
    // Wrap the image in a (non-searchable) PDF first.
    var input = this.imageToPdf(imagePath);
    var job = new OcrJob();

    using (var doc = new Document(input))
    {
        // The callback receives the page image and returns its hOCR markup.
        doc.Convert(img => 
        {
            var hocr = job.RunHOCR(new Bitmap(img));
            // Keep a copy of the hOCR so its content can be checked.
            File.WriteAllText(imagePath + ".hocr.html", hocr);
            return hocr;
        });

        var output = imagePath + ".output.pdf";
        doc.Save(output);

        return output;
    }
}

The Convert method correctly calls the callback and the hOCR is returned. I write the hOCR to a file so that I can check its content.
But the generated PDF is still not searchable, and no error is raised.
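
One way to check the hOCR content programmatically is to count the word elements it contains. This is only a minimal sketch, assuming the OCR output marks words with the standard hOCR class ocrx_word; HocrCheck and CountWords are hypothetical names and are not part of Aspose.Pdf or Tesseract.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for inspecting hOCR output; not part of any library.
public static class HocrCheck
{
    // Matches the opening tag of an hOCR word element, for example:
    // <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116'>
    private static readonly Regex WordTag = new Regex(
        @"<span[^>]*class=['""]ocrx_word['""][^>]*>",
        RegexOptions.IgnoreCase);

    // Returns the number of word elements found in the hOCR markup.
    public static int CountWords(string hocr)
    {
        return WordTag.Matches(hocr ?? string.Empty).Count;
    }
}
```

Calling HocrCheck.CountWords(hocr) inside the callback, just before returning, would show whether the OCR step recognized any words at all. A count of 0 points at the OCR step; a positive count suggests the problem is in how the hOCR is consumed afterwards.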

I've attached some files:
- invoice13.jpg: the original image
- invoice13.jpg.pdf: the PDF created from the image (non-searchable) by the imageToPdf method
- invoice13.jpg.hocr.html: the output of the Tesseract OCR
- invoice13.jpg.output.pdf: what should be a searchable PDF
