
Issue reading pages from a large PDF

I have been having issues parsing large PDFs (200-400 MB). In this case they are a series of textbooks with a lot of embedded images.

This issue is difficult to reproduce. According to the stack trace I have (attached), execution is stuck in FileStream, called by Aspose.

I used the simple Aspose.Pdf.Document(pdfPath) constructor to create pdfDocument, then extract the text of one page like this:

TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.ExtractionOptions.FormattingMode = Aspose.Pdf.Text.TextOptions.TextExtractionOptions.TextFormattingMode.Raw;
this.pdfDocument.Pages[pageOffsetPlusOne].Accept(textAbsorber);
return textAbsorber.Text;

While this is running, memory usage sometimes gets very high. See memory_leak*.png attached.

As of right now I do not have permission to host this file. If that changes I will see what I can do. I have also attached a transcript of a chat I had with Tilal Ahmad about this issue.

Does anyone else have these issues? I first noticed them in 8.7.0. As a test I moved to 9.6.0 and have not hit the issue since. However, testing is still in its early phases and, as I said above, the problem is difficult to reproduce. What I would really like is a way to set a timeout on Accept(TextAbsorber), if that is possible, with an exception or some other indication that a timeout occurred.
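To illustrate the kind of timeout I mean, here is a minimal sketch of a wrapper I could put around the call myself. RunWithTimeout is a hypothetical helper of my own, not an Aspose API, and it assumes .NET 4.5 or later for Task.Run. It only stops waiting: the extraction itself is not cancelled and keeps running on its background thread, so it would not free the memory.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutHelper
{
    // Hypothetical helper (not part of Aspose.Pdf): runs `work` on a
    // thread-pool thread and waits at most `timeout` for its result.
    // On timeout the caller gets a TimeoutException, but the work is
    // NOT aborted; it continues in the background.
    public static T RunWithTimeout<T>(Func<T> work, TimeSpan timeout)
    {
        Task<T> task = Task.Run(work);

        // Wait returns true if the task finished within the timeout.
        // If the work itself threw, Wait rethrows it wrapped in an
        // AggregateException.
        if (task.Wait(timeout))
        {
            return task.Result;
        }

        throw new TimeoutException(
            "Operation did not complete within " + timeout);
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for a fast page extraction.
        string text = TimeoutHelper.RunWithTimeout(
            () => "page text", TimeSpan.FromSeconds(5));
        Console.WriteLine(text);

        // Stand-in for an extraction that hangs.
        try
        {
            TimeoutHelper.RunWithTimeout(
                () => { Thread.Sleep(2000); return "late"; },
                TimeSpan.FromMilliseconds(100));
        }
        catch (TimeoutException ex)
        {
            Console.WriteLine("Timed out: " + ex.Message);
        }
    }
}
```

In my code the lambda would wrap the Accept(textAbsorber) call and return textAbsorber.Text. Whether it is safe to keep using the same Document after abandoning a call this way is something I do not know, which is why a built-in timeout would be preferable.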

I do still have some older Ghostscript code in the mix that reads from the same file. Does that cause any known issues?

Thanks


