PDF column formatting after extracting text

Hi, we've been using the TextAbsorber class to extract text from PDF documents, and mostly this has been working well. However, we have hit an issue when the document contains columns.

The behavior of Aspose does not appear to be consistent from one document to the next, and in some cases the text stream is unreadable because you can't easily tell where the gap between the two columns lies. It seems that if the text in the columns is all of equal length, we get 3 whitespace characters between columns; otherwise we get 1 and the text is not aligned, so the resulting output looks much like one block of text.

We're using version 8.8 of the PDF library.

Some sample code...

// Extract the text of a single page
Page p = doc.Pages[PageNumber];
TextAbsorber textAbsorber = new TextAbsorber();
p.Accept(textAbsorber);

The text in 'textAbsorber.Text' is then used. I've tried constructing the TextAbsorber with both the Pure and Raw formatting options, but this doesn't seem to make a difference.

Is there anything in code we can do to improve this situation? Or is this a bug? Is it possible to have the code return us all text from the first column then the second instead of returning adjacent lines from each column in turn?
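
To illustrate the column order we are after, here is a minimal sketch of post-processing on the extracted string. It would only help in the equal-spacing case, since it assumes the gap between columns is always a run of three or more spaces and that there are exactly two columns. `ColumnText` and `Reorder` are hypothetical names for illustration, not part of Aspose.Pdf.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Aspose.Pdf.
public static class ColumnText
{
    // Assumption: a run of three or more spaces marks the gap between two columns.
    private static readonly Regex Gap = new Regex(@" {3,}");

    // Returns all left-column lines first, followed by all right-column lines.
    public static string Reorder(string pageText)
    {
        var left = new List<string>();
        var right = new List<string>();

        foreach (string rawLine in pageText.Split('\n'))
        {
            string line = rawLine.TrimEnd('\r', ' ');
            // Split at the first gap only: the text before it is the left column.
            string[] parts = Gap.Split(line, 2);
            left.Add(parts[0]);
            if (parts.Length > 1)
            {
                right.Add(parts[1]);
            }
        }

        left.AddRange(right);
        return string.Join("\n", left);
    }
}
```

It would be called as `ColumnText.Reorder(textAbsorber.Text)`. With only 1 whitespace character between columns, as in the second document, there is nothing reliable to split on, which is why we are asking whether the library can do this itself.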

To show the issue I have attached two documents: equalSpacing.pdf ends up with 3 whitespace characters between columns, whereas notEqualSpacing.pdf ends up with 1 and the text is not aligned. Both documents were created in Microsoft Word.

Your own demo at the link below returns output with the issues described when these documents are uploaded.

http://www.aspose.com/demos/.net-components/aspose.pdf/vb.net/PdfDemos/Text/ExtractText.aspx
