Technology Tips: Apache PDFBox

Friday, October 9, 2009

Apache PDFBox - PDF version 1.5

Anyone working with Lucene may use the Apache PDFBox library to extract text from PDF files for indexing. However, the official build was unable to index PDF files of version 1.5 and 1.6. Here is a fix for the issue.

1. Download the latest PDFBox release, 0.8.0, from http://incubator.apache.org/pdfbox/download.html#pdfbox. Choose the source archive, pdfbox-0.8.0-incubating-src.jar.
2. Extract the jar and open the Java file src/main/java/org/apache/pdfbox/pdfparser/PDFXrefStreamParser.java.
3. Modify line 100 so that it reads:
   while(pdfSource.available() > 0 && objIter.hasNext())
4. Run ant to build the project, then take pdfbox-0.8.0-incubating.jar from the target folder.
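
To illustrate why the extra objIter.hasNext() check matters, here is a minimal, self-contained sketch. It is not PDFBox code: the class XrefLoopSketch, the method readEntries, and the fixed-width entry layout are hypothetical and exist only for illustration. It assumes the unpatched loop tested only pdfSource.available() > 0, so that leftover bytes after the last expected object could trigger a call to next() on an exhausted iterator.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; this is not the PDFBox parser.
public class XrefLoopSketch {

    // Pairs each expected object number with one fixed-width entry read
    // from the stream. The loop stops as soon as EITHER the stream runs
    // out of bytes OR there are no more object numbers to assign.
    static List<String> readEntries(InputStream pdfSource,
                                    List<Integer> objectNumbers,
                                    int entryWidth) throws IOException {
        List<String> result = new ArrayList<String>();
        Iterator<Integer> objIter = objectNumbers.iterator();

        // Same shape as the patched condition from step 3.
        while (pdfSource.available() > 0 && objIter.hasNext()) {
            byte[] entry = new byte[entryWidth];
            int read = pdfSource.read(entry);
            if (read < entryWidth) {
                break; // truncated entry, nothing more to parse
            }
            result.add(objIter.next() + ":" + entry[0]);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Three one-byte entries but only two expected object numbers.
        // Without the hasNext() check, the third pass would call next()
        // on an exhausted iterator and throw NoSuchElementException.
        byte[] data = {1, 2, 3};
        List<String> entries = readEntries(
                new ByteArrayInputStream(data), Arrays.asList(10, 11), 1);
        System.out.println(entries); // prints [10:1, 11:2]
    }
}
```

With the guard in place, the trailing byte is simply ignored and the loop ends cleanly once every expected object has been read.
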
Reference:
http://issues.apache.org/jira/browse/PDFBOX-533
