PDF extraction using Java

There are command-line tools such as “pdftotext” that come in handy whenever we want to extract text from a PDF file. Once the text is extracted, we can apply parsers and filters to get whatever information we need.

When I am working in a Java environment, I find this method does not work as well as I would like. I could start a new process, run the text extraction with the tool above, read the output and come back to Java. That requires a lot of context switching and is inefficient. The biggest problem is deciding when to return from the external process, since it runs separately from our own code. There are techniques for getting the extracted text back from the process, but relying on an external process that has to return something synchronously is inefficient. External tools are most useful when your programming language of choice has no equally efficient method, or when the external process is independent of the current task. One case where I would recommend them is image conversion using ImageMagick.
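For comparison, the external-process route described above can be sketched roughly as follows. This is only an illustration, not the approach used in the rest of this post: the class name ExternalExtractor and the helper runAndCapture are made up for the example, and it assumes pdftotext is installed and on the PATH (passing "-" as the output file asks it to write to standard output).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, named only for this sketch.
public class ExternalExtractor {

    // Runs a command, blocks until it exits, and returns everything it printed.
    static String runAndCapture(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Merge stderr into stdout so the child cannot stall on an unread pipe.
        pb.redirectErrorStream(true);
        Process process = pb.start();

        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }

        // This is the synchronous wait the paragraph above complains about.
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode + ": " + out);
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ExternalExtractor /path/to/pdffile.pdf");
            return;
        }
        // Assumption: pdftotext is on the PATH; "-" sends the text to stdout.
        System.out.println(runAndCapture(Arrays.asList("pdftotext", args[0], "-")));
    }
}
```

Even in this small form, the Java side has to start a process, drain its output and block until it exits for every single file, which is the overhead I wanted to avoid.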

For this purpose I chose iText's PdfReader. Not every PDF is the same. For some, extraction is just as intimidating as extracting text from an image. There are different ways a program can look at a PDF, and these extraction strategies control how the extracted text is formatted. The most natural strategy for me was SimpleTextExtractionStrategy, which extracts text in a form close to what pdftotext produces. However, I was unable to find a Maven repository for the version that included this class, so I had to add itext-pdfa-5.3.3.jar to the classpath manually.

Here is how you extract text from a PDF:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;

String pdfFile = "/path/to/pdffile.pdf";
try {
    PdfReader reader = new PdfReader(pdfFile);
    // This extracts page 1 only; loop from 1 to reader.getNumberOfPages()
    // to cover the whole document
    String extractedText = PdfTextExtractor.getTextFromPage(reader, 1,
            new SimpleTextExtractionStrategy());
    reader.close();
    // Log it
    // extractedText has all the text that could be parsed from that page
    System.out.println(extractedText);
    // If you want to read it line by line, wrap the string in a BufferedReader
    BufferedReader br = new BufferedReader(new StringReader(extractedText));
    String line;
    while ((line = br.readLine()) != null) {
        // Do all your parsing here for each line
        // For simple tasks you don't have to initialize regex
        // A simple check like line.startsWith, line.replaceAll,
        // StringUtils.trim, StringUtils.substring to remove unnecessary
        // text would be sufficient
        // To keep on adding text between some lines you could use
        // boolean markers or integer counters
    }
} catch (IOException e) {
    // Don't swallow the error silently; at least report it
    e.printStackTrace();
}
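The "boolean marker" idea from the comments above could look something like the sketch below. It uses only the standard library and works on any extracted string. The class name SectionCollector, the method linesBetween and the sample text are all invented for illustration, so adjust the markers to whatever your own PDFs contain.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named only for this sketch.
public class SectionCollector {

    // Collects the trimmed, non-empty lines that appear after the first line
    // starting with startMarker and before the next line starting with endMarker.
    static List<String> linesBetween(String text, String startMarker, String endMarker)
            throws IOException {
        List<String> collected = new ArrayList<>();
        boolean inside = false; // the boolean marker
        BufferedReader br = new BufferedReader(new StringReader(text));
        String line;
        while ((line = br.readLine()) != null) {
            String trimmed = line.trim();
            if (!inside) {
                // Still looking for the start of the interesting block
                if (trimmed.startsWith(startMarker)) {
                    inside = true;
                }
                continue;
            }
            if (trimmed.startsWith(endMarker)) {
                break; // end of the block, stop reading
            }
            if (!trimmed.isEmpty()) {
                collected.add(trimmed);
            }
        }
        return collected;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample standing in for text extracted from a PDF
        String sample = "Invoice 42\nItems:\n  Widget 2\n\n  Gadget 1\nTotal: 3\n";
        System.out.println(linesBetween(sample, "Items:", "Total:"));
        // prints [Widget 2, Gadget 1]
    }
}
```

An integer counter works the same way when the block has a fixed number of lines: set it when the start line is seen and decrement it for every line you keep.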

Cheers !!!
