Hi,
I am trying to extract text from a PDF document that contains some Chinese characters. When I use TextAbsorber, the extracted text has unnecessary spaces between the characters. This is the method I am using:
import com.aspose.pdf.Document;
import com.aspose.pdf.TextAbsorber;
import java.io.ByteArrayInputStream;
import java.io.IOException;

private String extractText(byte[] inputContent) throws IOException {
    // Load the PDF from the in-memory bytes
    ByteArrayInputStream stream = new ByteArrayInputStream(inputContent);
    Document pdfDocument = new Document(stream);
    // Create TextAbsorber object to extract text
    TextAbsorber textAbsorber = new TextAbsorber();
    // Accept the absorber for all the pages
    pdfDocument.getPages().accept(textAbsorber);
    // Get the extracted text
    String extractedText = textAbsorber.getText();
    return extractedText;
}
I am also attaching the Java code and the PDF file.
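As a possible stopgap until the extraction itself is sorted out, the extracted string could be post-processed to drop whitespace that sits directly between two Chinese characters. The sketch below is only an illustration under assumptions: it assumes the extra spaces are ordinary whitespace and that no legitimate space ever occurs between two Han characters. The class and method names (CjkSpaceCleaner, removeSpacesBetweenHan) are hypothetical and are not part of Aspose.PDF; only the standard java.util.regex API is used.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspose.PDF.
final class CjkSpaceCleaner {
    // Matches a run of whitespace that has a Han character immediately
    // before it and a Han character immediately after it.
    private static final Pattern GAP_BETWEEN_HAN =
            Pattern.compile("(?<=\\p{IsHan})\\s+(?=\\p{IsHan})");

    private CjkSpaceCleaner() {
    }

    // Removes whitespace between adjacent Han characters and leaves
    // all other spacing (for example between Latin words) untouched.
    static String removeSpacesBetweenHan(String text) {
        if (text == null) {
            return null;
        }
        return GAP_BETWEEN_HAN.matcher(text).replaceAll("");
    }
}
```

It would be applied to the absorber output, for example CjkSpaceCleaner.removeSpacesBetweenHan(textAbsorber.getText()). This only hides the symptom; it does not explain why the spaces are introduced in the first place.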