I have a project that uses Apache Tika 2.6.0, and I want to upgrade to 3.0.0 for the performance improvements.
The upgrade itself is simple: I have not had to change or refactor any code, and everything works as is. However, content extraction behaves differently in 3.0.0 than in 2.6.0: the output now includes the names of the files inside the document package, such as [Content_Types].xml and word/document.xml. I have tried several approaches to parsing the document content, and each one produces the same result. For context, I am testing with a very simple Word document.
2.6.0 Parse Result
This is a word document with some nonsensical text that makes no sense. Simple Table
Text Here
Why Not
· A very important point
· Another important point
· No one cares about this point
3.0.0 Parse Result
[Content_Types].xml
_rels/.rels
word/document.xml This is a word document with some nonsensical text that makes no sense. Simple Table Text Here Why Not A very important point Another important point No one cares about this point
word/_rels/document.xml.rels
word/theme/theme1.xml
word/settings.xml
word/numbering.xml
word/styles.xml
word/webSettings.xml
word/fontTable.xml
docProps/core.xml
docProps/app.xml
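For reference, a .docx file is a ZIP package, and the names in the 3.0.0 output match the entries such a package normally contains. The sketch below uses only the JDK (no Tika) to build a tiny in-memory ZIP with a few of those entry names and list them. The class name DocxAsZip and the placeholder XML bodies are made up for illustration; this is not a valid Word document. It only shows why the output looks like a walk over the archive entries, which suggests (but does not prove) that 3.0.0 is handling the file as a generic archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxAsZip {

    // Builds a small in-memory ZIP whose entry names mimic a .docx package.
    // The entry bodies are placeholders, not real Word XML.
    static byte[] buildSampleDocx() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : List.of("[Content_Types].xml", "word/document.xml", "docProps/core.xml")) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<placeholder/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Returns the entry names in the order they are stored in the archive.
    static List<String> listEntries(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        listEntries(buildSampleDocx()).forEach(System.out::println);
    }
}
```

Running listEntries against the real test document instead of the sample bytes would show whether its entry list matches the names in the 3.0.0 output.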
Implementation
Here is the code I am using; it is unchanged between the two versions.
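String content;
// try-with-resources closes the stream even if parsing fails
try (InputStream in = inputStream)
{
    // Stream the document through the parser; extracted text accumulates in the handler
    parser.parse(in, bodyContentHandler, new Metadata(), new ParseContext());
    content = bodyContentHandler.toString();
}
catch (IOException | SAXException | TikaException e)
{
    throw new IllegalStateException("Tika failed to parse the document", e);
}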
I have also tried other parsing options, such as new Tika().parseToString(inputStream, new Metadata()), but, as mentioned, I get the same result.
Has something changed between these versions, or is this a known issue with a workaround? Any help or tips are appreciated.
Packages and Versions Being Used
tika-core: 3.0.0
tika-parsers-standard-package: 3.0.0
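To narrow this down, one diagnostic that could be run under both versions is to print the media type Tika detected and the parser classes it recorded for the file. The sketch below is a minimal example, assuming tika-core and tika-parsers-standard-package are on the classpath; the class name TikaParseDiagnostics and the describe helper are made up for illustration. If the Parsed-By list names an archive parser rather than an OOXML parser, that would point toward detection or a missing or conflicting dependency rather than the calling code, though this is only a hint and not a confirmed cause.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TikaParseDiagnostics {

    // Parses the stream and reports what Tika recorded in the metadata:
    // the resolved Content-Type and the chain of parsers that handled the input.
    static String describe(InputStream input) throws IOException, SAXException, TikaException {
        Metadata metadata = new Metadata();
        // -1 disables the handler's default write limit
        new AutoDetectParser().parse(input, new BodyContentHandler(-1), metadata, new ParseContext());
        return "Content-Type: " + metadata.get(Metadata.CONTENT_TYPE)
                + "\nParsed-By: " + Arrays.toString(metadata.getValues(TikaCoreProperties.TIKA_PARSED_BY));
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the test document
        try (InputStream input = Files.newInputStream(Path.of(args[0]))) {
            System.out.println(describe(input));
        }
    }
}
```

Comparing the two lines of output between 2.6.0 and 3.0.0 for the same Word document should show whether the same parser is being selected in both versions.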