PDF extraction using Java

December 5, 2012

Command-line tools such as “pdftotext” come in handy whenever we want to extract text from a PDF file. Once the text is extracted, we can apply parsers and filters to get whatever information we need.

When I am working in a Java environment, I find that this method does not work as well as I would like. I could start a new process, run the text extraction with the tool above, read the resulting file and come back to Java. That requires a lot of context switching and is inefficient. The biggest problem is deciding when to return from the external process, since it runs separately from our Java code. There are techniques for coming back from the process with the extracted text, but relying on an external process that has to return something synchronously is inefficient. External tools are most useful when your programming language of choice has no equally efficient method, or when the external process is independent of the current task. One case where I would recommend them is image conversion using ImageMagick.
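To make the external-process approach concrete, here is a minimal sketch of how it could look, assuming Java 9 or later and that “pdftotext” is installed and on the PATH. The class name ExternalCommand and its two methods are hypothetical names chosen for illustration. The sketch answers the "when do we return" question in the simplest way: it blocks until the child process exits, which is exactly the synchronous waiting described above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper class, for illustration only.
class ExternalCommand {

    // Runs a command, blocks until it exits, and returns everything it printed.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true) // merge stderr into stdout so one read drains both
                .start();

        // Drain the output before waitFor(); a child that fills the pipe
        // buffer would otherwise block forever waiting for a reader.
        String output = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);

        int exitCode = process.waitFor(); // the calling thread waits here
        if (exitCode != 0) {
            throw new IOException(
                    "Command exited with code " + exitCode + ": " + output);
        }
        return output;
    }

    // Assumes pdftotext is installed; passing "-" as the output file
    // makes it write the extracted text to stdout instead of a file.
    static String extractText(String pdfPath) throws IOException, InterruptedException {
        return run(List.of("pdftotext", pdfPath, "-"));
    }
}
```

Calling ExternalCommand.extractText("report.pdf") (the file name is just an example) would return the text as a String, but the calling thread sits idle for the whole run of the child process, and every call pays the cost of starting a new process.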