I am doing the same (albeit with the Java version). It mostly works, but some words come out backwards, and others are joined together even though they are far apart on the page (they only share the same vertical coordinate).
I'm considering getting individual characters (regex "[\\S]" without the plus) and then sorting on x and y coordinates and assembling the words myself. But before embarking on this exercise (and also since the resulting code could be quite inefficient), I'd like to know if there is a way to coax the library into doing a better job finding the words.
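For reference, here is a rough sketch of the manual approach I have in mind. It is not tied to the library: `Glyph` is a hypothetical container I made up for one extracted character and its position, and the `lineTolerance` and `maxGap` thresholds are guesses that would need tuning per document. It also assumes smaller y means higher on the page, so the sort would have to be flipped if the coordinate system runs the other way.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class WordAssembler {

    /** Hypothetical holder for one extracted character (not a library type). */
    public static final class Glyph {
        final String text;
        final double x;      // left edge of the character
        final double y;      // vertical position of the character
        final double width;  // horizontal extent of the character

        public Glyph(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Groups glyphs into lines by y, orders each line by x, and splits
     * a line into words wherever the horizontal gap exceeds maxGap.
     */
    public static List<String> assemble(List<Glyph> glyphs,
                                        double lineTolerance,
                                        double maxGap) {
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        // Bucket glyphs into lines: a glyph joins the current line if its y
        // is within lineTolerance of that line's first glyph.
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : byY) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last == null || Math.abs(g.y - last.get(0).y) > lineTolerance) {
                last = new ArrayList<>();
                lines.add(last);
            }
            last.add(g);
        }

        List<String> words = new ArrayList<>();
        for (List<Glyph> line : lines) {
            // Left-to-right order within the line fixes reversed words.
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder word = new StringBuilder();
            Glyph prev = null;
            for (Glyph g : line) {
                // A large gap after the previous glyph starts a new word,
                // so distant text on the same line is not joined.
                if (prev != null && g.x - (prev.x + prev.width) > maxGap) {
                    words.add(word.toString());
                    word.setLength(0);
                }
                word.append(g.text);
                prev = g;
            }
            if (word.length() > 0) {
                words.add(word.toString());
            }
        }
        return words;
    }
}
```

The cost is dominated by the two sorts, so it should be roughly O(n log n) in the number of characters, which may not be as inefficient as I feared. It only handles horizontal left-to-right text, though, which is part of why I would prefer a library-side fix.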
All help is appreciated!