Channel: Aspose.Pdf Product Family

Extracting table TextFragments works inconsistently

Hi,

I'm extracting table data by reading TextFragment / TextSegment objects. I ran into a certain type of table where all cell entries in a row are reported as a single fragment, although they render as separate cells in the PDF.

In the attached sample, the first table concatenates all cell entries into a single TextFragment (the fragment's Segments collection also contains only one segment, with the concatenated text). The second table, on the other hand, reports a separate TextFragment for each cell.

The fragment texts in question contain no special characters, only double spaces (two consecutive \u0020 characters).

Could you shed some light on this issue and advise on any workaround available?

Thanks,


Here's the code snippet used:

// Assumes: using System; using System.Linq; using Aspose.Pdf; using Aspose.Pdf.Text; using NUnit.Framework;
[TestCase( @"..\..\..\_asposeCases\20140729_191959_1206173.pdf.deid.pdf" )]
public void ConcatTableHeaderIssue( string pdfFile )
{
    Console.WriteLine( pdfFile );
    var pdfDoc = new Document( pdfFile );
    var absorber = new TextFragmentAbsorber();
    pdfDoc.Pages.Accept( absorber );
    var frags = absorber.TextFragments.Cast<TextFragment>().ToList();

    // table in pdf shows as columns
    //   Home Medication | Dose | Frequency | Last Dose
    // but reported TextFragment and TextSegment appear as concatenated text
    //   "Home Medication Dose Frequency Last Dose "
    var tableHeaderFrags = frags
        .GroupBy( f => f.Rectangle.ToRect().Top ) // form lines
        .First( g => g.Any( f => f.Text.Contains( "Dose" ) ) );
    Console.WriteLine( "Table headers : {0}", string.Join( "|", tableHeaderFrags.Select( f => f.Text ) ) );

    // same grouping, but on the individual segments of every fragment
    var tableHeaderSegments = frags
        .SelectMany( f => f.Segments.Cast<TextSegment>() )
        .GroupBy( s => s.Rectangle.ToRect().Top )
        .First( g => g.Any( s => s.Text.Contains( "Dose" ) ) );
    Console.WriteLine( "Table segments: {0}", string.Join( "|", tableHeaderSegments.Select( s => s.Text ) ) );

    Assert.AreEqual( 4, tableHeaderFrags.Count(), "there should be 4 table header fragments." );
    Assert.AreEqual( 4, tableHeaderSegments.Count(), "there should be 4 table header segments." );
}
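
One stopgap I can think of, assuming the double spaces really do mark the cell boundaries (I have only seen this in the one sample, so it may not hold in general), is to split the concatenated fragment text on runs of two or more spaces. This is only a rough sketch: CellSplitter and SplitCells are hypothetical names of my own, not part of Aspose.Pdf, and the code uses nothing beyond the standard .NET Regex and LINQ classes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CellSplitter
{
    // Hypothetical helper (not an Aspose.Pdf API).
    // Assumption: cells are separated by two or more U+0020 characters,
    // and no cell text contains a double space itself.
    public static string[] SplitCells( string fragmentText )
    {
        return Regex.Split( fragmentText ?? string.Empty, " {2,}" )
            .Select( s => s.Trim() )       // drop stray leading/trailing spaces
            .Where( s => s.Length > 0 )    // drop the empty entry left by a trailing separator
            .ToArray();
    }
}
```

For the header row above it would be called as CellSplitter.SplitCells( fragment.Text ). It recovers the cell texts only, not the cell rectangles, and it breaks as soon as a cell contains a double space or two cells are separated by a single space, so it is no substitute for the absorber returning separate fragments.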
