java - Apache Tika upgrade from 2.6.0 to 3.0.0 content extraction includes document information - Stack Overflow

Updated: 2025-03-15

I have a project that uses Apache Tika 2.6.0, and I want to upgrade to 3.0.0 for the performance improvements.

The upgrade itself was simple: I did not have to change or refactor any code, and everything still runs. However, content extraction behaves differently in 3.0.0 than in 2.6.0: the extracted text now includes information about the document itself, as the results below show. I have tried several approaches to parsing the document content, and each one produces the same result. For context, I am testing with a very simple Word document.

2.6.0 Parse Result

This is a word document with some nonsensical text that makes no sense. Simple Table

Text Here

Why Not

· A very important point

· Another important point

· No one cares about this point

3.0.0 Parse Result

[Content_Types].xml

_rels/.rels

word/document.xml This is a word document with some nonsensical text that makes no sense. Simple Table Text Here Why Not A very important point Another important point No one cares about this point

word/_rels/document.xml.rels

word/theme/theme1.xml

word/settings.xml

word/numbering.xml

word/styles.xml

word/webSettings.xml

word/fontTable.xml

docProps/core.xml

docProps/app.xml
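
For comparison, the extra lines in the 3.0.0 result look like the part names stored inside the .docx file, since a .docx is a ZIP container. The following is a minimal sketch, using only java.util.zip, that lists the entry names of such a container so they can be compared with the output above. The class name DocxEntryLister and the in-memory sample archive are illustrative assumptions and are not part of my project; a real check would open the actual test document instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxEntryLister {

    /** Returns the entry names of a ZIP stream (a .docx is a ZIP container) in stored order. */
    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Builds a tiny stand-in archive in memory (hypothetical sample, not a valid Word file). */
    public static byte[] sampleArchive(String... entryNames) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = sampleArchive("[Content_Types].xml", "_rels/.rels", "word/document.xml");
        // Print one entry name per line, mirroring the layout of the 3.0.0 result.
        for (String name : listEntries(new ByteArrayInputStream(archive))) {
            System.out.println(name);
        }
    }
}
```

If the names printed for the real document match the extra lines in the 3.0.0 result, that would suggest the file is being walked as a generic archive rather than parsed as a Word document.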

Implementation

Here is the code I use to run the extraction; it has not changed between versions.

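String content;
try
{
   // parser, inputStream and bodyContentHandler are created elsewhere (not shown)
   parser.parse(inputStream, bodyContentHandler, new Metadata(), new ParseContext());
   content = bodyContentHandler.toString();
}
finally
{
   // close the stream even if parsing throws
   inputStream.close();
}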

I have also tried other ways of parsing, such as new Tika().parseToString(inputStream, new Metadata()), but, as mentioned, I get the same result.

Has something changed between these versions, or is this a known issue with a workaround? Any help or tips are appreciated.

Packages and Versions Being Used

tika-core: 3.0.0

tika-parsers-standard-package: 3.0.0
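
One thing that may be worth ruling out, as an assumption and not a confirmed cause, is whether the Word parser classes are actually on the runtime classpath after the upgrade. The sketch below checks for a class by name without initializing it. ClasspathCheck is a hypothetical helper, and org.apache.tika.parser.microsoft.ooxml.OOXMLParser is the name under which I would expect to find Tika's Office Open XML parser; it should be verified against the Tika version in use.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be loaded by this class's loader, without initializing it. */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but not loadable (for example a missing transitive dependency).
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name of Tika's OOXML parser; adjust if it differs in your version.
        String ooxmlParser = "org.apache.tika.parser.microsoft.ooxml.OOXMLParser";
        System.out.println(ooxmlParser + " present: " + isOnClasspath(ooxmlParser));
    }
}
```

Running this inside the same application that performs the extraction shows what the deployed classpath contains, which can differ from what the build file declares.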
