Channel: Aspose.Pdf Product Family

PDF to HTML --> Break each letter to a separate div (JAVA)

Hi.

I used Aspose.Pdf for Java (version 9.7.1) to convert a PDF to HTML.

My first problem is that some of the text is converted to HTML with each character in its own <div>. I think it happens where there is a style change, such as an italic font. This is a critical issue for me because it affects the search results.

The second issue is that the rest of the text is converted with each line in its own <div>, but every 2-3 words are in a separate <span>.

 

(Note that I got both issues with Aspose.Pdf for Java. With Aspose.Pdf for .NET the first issue did not occur, only the second one.)

 

I have attached a test PDF (copied from Wikipedia).

My Java code for converting the attached PDF file to HTML is:

--------------------------------------------------------------------------------

File mainHtmlFile = createNewHtmlFile();
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(inStream);
HtmlSaveOptions options = new HtmlSaveOptions(SaveFormat.Html);
options.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
options.LettersPositioningMethod = LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
options.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
pdfDocument.save(mainHtmlFile.getAbsolutePath(), options);

---------------------------------------------------------------------------------

Can you help me fix these two issues, or at least the first one?

Thanks

Tami

 

