Anyone working with Lucene may use the Apache PDFBox library to extract text from PDF files for indexing. However, the official build was unable to index PDF files of version 1.5 and 1.6. Here is the fix for the issue.
1. Download the latest PDFBox release, 0.8.0, from http://incubator.apache.org/pdfbox/download.html#pdfbox. Download the source archive, pdfbox-0.8.0-incubating-src.jar.
2. Extract the jar and open the Java file src/main/java/org/apache/pdfbox/pdfparser/PDFXrefStreamParser.java.
3. Modify line 100 so that it reads:
while(pdfSource.available() > 0 && objIter.hasNext())
4. Run ant to build the project, then pick up pdfbox-0.8.0-incubating.jar from the target folder.
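To see what the extra objIter.hasNext() check in step 3 does, here is a minimal, self-contained sketch. It is not PDFBox code: the class name XrefLoopGuardSketch, the method readEntries, and the one-byte-per-entry layout are made up for illustration, and it assumes the loop consumes one object number from the iterator for each entry it reads from the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of the loop guard; not taken from PDFBox.
public class XrefLoopGuardSketch {

    // Reads one single-byte entry per object number and stops as soon as
    // either the stream or the iterator runs out.
    public static List<String> readEntries(InputStream source, Iterator<Integer> objIter)
            throws IOException {
        List<String> entries = new ArrayList<String>();
        while (source.available() > 0 && objIter.hasNext()) {
            int objNumber = objIter.next(); // safe: hasNext() was checked above
            int value = source.read();
            entries.add(objNumber + "=" + value);
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Three bytes of data but only two object numbers.
        InputStream source = new ByteArrayInputStream(new byte[] {10, 20, 30});
        Iterator<Integer> objIter = Arrays.asList(1, 2).iterator();
        System.out.println(readEntries(source, objIter)); // prints [1=10, 2=20]
    }
}
```

In this sketch, a loop that only checked source.available() > 0 would call objIter.next() a third time and fail with a NoSuchElementException. With both conditions the loop simply ends when the object numbers run out, even if bytes remain in the stream.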
Reference:
http://issues.apache.org/jira/browse/PDFBOX-533
Friday, October 9, 2009