Channel: Aspose.Pdf Product Family

PDF column formatting after extracting text

Hi, we've been using the TextAbsorber class to extract text from PDF documents, and mostly this has been working well. However, we have hit an issue when the document contains columns.

The behavior of Aspose does not appear to be consistent from one document to the next, and in some cases the extracted text is unreadable because you can't easily tell where the gap between the two columns lies. It seems that if the lines of text in the columns are all of equal length, we get 3 whitespace characters between columns; otherwise we get 1 and the text is not aligned, so the resulting output looks much like one block of text.

We're using version 8.8 of the PDF library.

Some sample code...

// Extract the text of a single page
Page p = doc.Pages[PageNumber];
TextAbsorber textAbsorber = new TextAbsorber();
p.Accept(textAbsorber);

The text in 'textAbsorber.Text' is then used. I've tried constructing the TextAbsorber with both the Pure and Raw formatting options, but this doesn't seem to make a difference.

Is there anything we can do in code to improve this situation? Or is this a bug? Is it possible to have the code return all the text from the first column and then the second, instead of returning adjacent lines from each column in turn?
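
For the equal-spacing case, the kind of reordering we are after could be approximated by post-processing the extracted string ourselves. The following is only a rough sketch: ColumnReorder and LeftThenRight are hypothetical names of our own (not part of Aspose.Pdf), and it assumes the column gap always comes out as a run of three or more spaces, which is exactly the assumption that fails for the unequal-spacing document.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.Pdf.
static class ColumnReorder
{
    // Assumption: the gap between columns is a run of 3 or more spaces.
    private static readonly Regex Gap = new Regex(@" {3,}");

    // Returns all left-column text followed by all right-column text.
    public static string LeftThenRight(string extracted)
    {
        var left = new StringBuilder();
        var right = new StringBuilder();

        foreach (string rawLine in extracted.Split('\n'))
        {
            // Trim so that leading indentation is not mistaken for the gap.
            string line = rawLine.Trim();
            if (line.Length == 0)
            {
                continue;
            }

            // Split at the first gap only; anything after it is treated
            // as belonging to the right-hand column.
            string[] parts = Gap.Split(line, 2);
            left.AppendLine(parts[0]);
            if (parts.Length > 1)
            {
                right.AppendLine(parts[1]);
            }
        }

        return left.ToString() + right.ToString();
    }
}
```

It would be called as ColumnReorder.LeftThenRight(textAbsorber.Text). A line that has text only in the right-hand column would be wrongly assigned to the left, and with a single space between columns there is nothing to split on, so this does not remove the need for a proper fix in the library.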

To show the issue I have attached two documents: equalSpacing.pdf ends up with 3 whitespace characters between columns, whereas notEqualSpacing.pdf has 1 character and the text is not aligned. Both documents were created in Microsoft Word.

Your own demo at the link below returns output with the issues described above if these documents are uploaded.

http://www.aspose.com/demos/.net-components/aspose.pdf/vb.net/PdfDemos/Text/ExtractText.aspx
