Channel: Aspose.Pdf Product Family

Converting PDF to DOC with 4.5.0 Java: Issues and Concerns

Hello.  I'm posting about PDF-to-DOC conversion in the latest 4.5.0 version of Aspose PDF.

In another thread, I mentioned that .docx conversion does not seem to be supported (out of range error on the SaveFormat), so for these examples I used conversion to .doc.

I started with two PDF files.  They are attached.  One is a shorter example, and one is longer.  They are named accordingly.

For reference and comparison, I converted the shorter file using several desktop utilities.  They all performed equivalently, so I chose one of those converted files and included it as well.

We found the following issues with the Aspose conversion:

  • Basic paragraph handling: paragraphs in the .pdf are not converted into proper paragraphs.  Text is converted into floating text boxes or frames in the .doc version.  Every line has a hard line break at the end.  This makes it impossible to reformat text.  For example, take a paragraph in the converted doc and choose a different font size; the line breaks stay fixed and do not allow the text to wrap to fit the new size.  Nor do paragraphs below flow up or down to accommodate the new font size.
    • Contrast this behavior with the version converted with desktop software.  The paragraphs are in proper format and do not have hard-coded line breaks.  This allows the user to apply styles and rearrange the text.  Resizing text causes the remainder of the document to flow as expected.
  • Basic list handling: bulleted lists use strange characters and overlapping frames to simulate the 'look' of a bulleted list, but the lists cannot be manipulated.  For example, if I try to insert a bullet between existing bullets in MS Word, it cannot be done.  The text moves down but the bullets stay fixed in a separate, horizontally overlapping frame.
    • Contrast this behavior with the version converted with desktop software.  Bulleted lists convert to actual bulleted lists that can be manipulated and edited.
  • More advanced list handling: In the longer document (testLongerTemplateConvertedViaAspose.doc), sometimes lists are broken across multiple frames/text boxes, causing more misformatting.  For example, look at how the list is broken into multiple frames between list items 4.8 and 4.9.
  • Simple graphics: A simple graphic did come over, but it is sized incorrectly and is floating in an absolutely positioned frame.  This makes wrapping text or other functions affecting the flow of the document impossible to perform.
    • Contrast this with the desktop software conversion which inserts the image inline with the text as one would expect; adding lines above this image moves it down the page.  The image is a proper element of the document flow.
  • Random characters: Turn on paragraph markings and look in the blue headers to see strange inserted characters in the converted version; they are indicated in MS Word with a (?).
  • Simple tables: The simple table with the figures did not translate at all; it is entirely misformatted in the resulting document.
    • Contrast this with the desktop software conversion which preserves all aspects of the table correctly, including font, merged cells, and simple borders.
  • Underline: In the longer document, notice how the headers (for example, 4.0) are only partially underlined in the converted version.
    • Contrast this with the desktop version which preserves underline as expected.
  • Performance: While the shorter document took only a few seconds, the longer document took over 90 seconds to convert.  This was running under ColdFusion 10.  CPU usage remained high for the duration of the operation.
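
For anyone who wants to reproduce the timing figures above outside ColdFusion, a minimal wall-clock harness along these lines may help.  This is only a sketch: the class name `ConversionTimer` and the method `timeMillis` are made up for illustration, and the `Thread.sleep` call is a stand-in for the actual Aspose conversion call, which is deliberately not shown here.

```java
import java.util.concurrent.TimeUnit;

public class ConversionTimer {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption): replace the body of this lambda
        // with the real PDF-to-DOC conversion call being measured.
        Runnable conversion = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        long elapsed = timeMillis(conversion);
        System.out.println("Conversion took " + elapsed + " ms");
    }
}
```

Running the shorter and longer documents through the same harness would give comparable numbers; a single run also includes JVM warm-up, so repeating the measurement a few times is probably worthwhile.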

In short, most users presumably need PDF to DOCX conversion in order to change a document into an editable format.  The Aspose Java PDF conversion attempts to make the resulting document "look" like it was converted, but the result is not readily editable, nor is it particularly accurate to the original version.

Does Aspose PDF for .NET perform a better conversion than this?  If so, would it be possible to share the conversion of testLongerTemplate.pdf so that we can view the differences?

Obviously there are technical challenges in converting PDF to DOCX; however, desktop software exists to do so, and it does the job very competently.  I would expect Aspose PDF-to-Word conversion to offer similar fidelity, with performance optimized for server-based conversion.

This feature has been under development for several years.  While we are glad that it has reached the point of being shared with the public, we are disappointed in the results because it does not accomplish the task of creating an editable DOCX document from the PDF version.  I ask that the team consider giving improvements to this feature high priority, as they would make Aspose.PDF a much stronger product.

Thank you.


