I have a project that uses Apache Tika 2.6.0, and I want to upgrade to 3.0.0 for the performance improvements.

The upgrade itself was simple: I did not have to change or refactor any code, and everything still runs. However, content extraction behaves differently in 3.0.0 than in 2.6.0: the output now includes the names of the document's internal package parts (for example word/document.xml) alongside the text. I have tried several approaches to parsing the document content, and each one produces the same result. For context, I am testing with a very simple Word document.

2.6.0 Parse Result

This is a word document with some nonsensical text that makes no sense. Simple Table

Text Here

Why Not

· A very important point

· Another important point

· No one cares about this point

3.0.0 Parse Result

[Content_Types].xml

_rels/.rels

word/document.xml This is a word document with some nonsensical text that makes no sense. Simple Table Text Here Why Not A very important point Another important point No one cares about this point

word/_rels/document.xml.rels

word/theme/theme1.xml

word/settings.xml

word/numbering.xml

word/styles.xml

word/webSettings.xml

word/fontTable.xml

docProps/core.xml

docProps/app.xml
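For reference, a .docx file is a ZIP archive (an OOXML package), and the names in the 3.0.0 output look like the entry names of that archive. The following is a minimal JDK-only sketch of how such entry names can be listed. It does not use Tika, and the class name DocxEntryLister and the stand-in archive it builds are hypothetical, made up for illustration rather than taken from my project.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxEntryLister {

    /** Returns the entry names of a ZIP stream, in archive order. */
    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Builds a tiny stand-in archive (hypothetical; not a valid Word document). */
    public static byte[] buildSampleArchive() throws IOException {
        String[] parts = {"[Content_Types].xml", "_rels/.rels", "word/document.xml"};
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : parts) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, list the parts of a real .docx;
        // without one, fall back to the in-memory stand-in archive.
        try (InputStream in = args.length > 0
                ? Files.newInputStream(Paths.get(args[0]))
                : new ByteArrayInputStream(buildSampleArchive())) {
            for (String name : listEntries(in)) {
                System.out.println(name);
            }
        }
    }
}
```

Running this against the test document should make it easy to compare the archive's entry names with the names that appear in the 3.0.0 parse result.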

Implementation

Here is the code I am using to run this; it has not changed between versions.

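String content;
// parser, inputStream and bodyContentHandler are created elsewhere
try (InputStream in = inputStream)
{
   parser.parse(in, bodyContentHandler, new Metadata(), new ParseContext());
   content = bodyContentHandler.toString();
}
catch (IOException | SAXException | TikaException e)
{
   throw new IllegalStateException("Tika parse failed", e);
}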

I have also tried other parsing options, such as new Tika().parseToString(inputStream, new Metadata()), but, as mentioned, I get the same result.

Has something changed between these versions, or is this a known issue with a workaround? Any help or tips are appreciated.

Packages and Versions Being Used

tika-core: 3.0.0

tika-parsers-standard-package: 3.0.0
