nlp - OpenNLP POSTaggerME and ChunkerME synergy - Stack Overflow

Updated: 2025-03-07

I'm trying to use the OpenNLP chunking API to chunk a Portuguese sentence. First I tokenized the sentence using TokenizerME, then I tagged it with POSTaggerME. For both I used the ready-made models provided by the OpenNLP project.

For the sentence “Ivo viu a uva”, POSTaggerME returns the tags [PROPN, VERB, DET, NOUN]. The model seems to be using the UD POS Tags.

As there is no ready-made model for ChunkerME in Portuguese, I followed the instructions and trained one myself: first I used the ChunkerConverter tool (to convert from "arvore deitada" to CoNLL2000), then I generated the model with the ChunkerTrainerME tool. Everything worked well. For the sentence above, the chunker produced the correct tags ([B-NP, B-VP, B-NP, I-NP]).

But, for more complex sentences, it hasn't produced such good results.

While trying to identify what I could improve in the chunker training, I noticed that the two steps use different tag sets. The Portuguese corpus (Bosque 8.0) seems to use Portuguese-specific tags. For example, instead of PROPN the corpus uses prop, and instead of DET it uses art.

It seems to me that this could lead to problems, especially since one of the parameters the chunker receives is an array with UD tags, but it has been trained with another type of tag...
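
One way to confirm the suspected mismatch is to compare the POS tags seen in the chunker training data with the tags the POS tagger emits at runtime. The following is a minimal sketch, assuming the converted training file has the usual CoNLL-2000 layout of three whitespace-separated columns (token, POS tag, chunk tag). The class name, the sample lines, and the tags `v-fin` and `n` are illustrative assumptions; only `prop` and `art` are taken from the question.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class TagVocabularyCheck {

    /** Collects the POS column (second field) of CoNLL-2000 style lines: "token POS chunk". */
    static Set<String> trainingTags(List<String> conllLines) {
        Set<String> tags = new TreeSet<>();
        for (String line : conllLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length == 3) { // blank separator lines do not have three fields
                tags.add(fields[1]);
            }
        }
        return tags;
    }

    /** Returns the runtime tags that never occurred in the training data, in first-seen order. */
    static Set<String> unseenTags(Set<String> trainingTags, String[] runtimeTags) {
        Set<String> unseen = new LinkedHashSet<>(Arrays.asList(runtimeTags));
        unseen.removeAll(trainingTags);
        return unseen;
    }

    public static void main(String[] args) {
        // Hypothetical training excerpt with corpus-style tags (assumed, not copied from Bosque).
        List<String> training = List.of(
                "Ivo prop B-NP",
                "viu v-fin B-VP",
                "a art B-NP",
                "uva n I-NP",
                "");
        // Tags as returned by the UD-based POS model in the question.
        String[] runtime = {"PROPN", "VERB", "DET", "NOUN"};
        System.out.println("Tags unseen in training: " + unseenTags(trainingTags(training), runtime));
    }
}
```

If every runtime tag shows up as unseen, the chunker has never observed the POS features it is given at prediction time, which would be consistent with weaker results on longer sentences.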

Before writing a routine to convert from the Portuguese notation to UD (or Penn), I wanted to ask whether:

1. this does indeed have an impact,
2. there is a tool that already does this translation, and
3. there are any other suggestions for improving the chunker's precision/recall.
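
For reference, precision and recall for a chunker are normally counted over whole chunks rather than single tags: a predicted chunk is correct only if its boundaries and type both match a gold chunk. The sketch below is a hand-rolled illustration of that definition, not OpenNLP's own evaluator; the class and method names are made up for the example.

```java
import java.util.HashSet;
import java.util.Set;

class ChunkScore {

    /** Extracts chunks from BIO tags as "start:end:type" strings (end exclusive). */
    static Set<String> chunks(String[] bio) {
        Set<String> spans = new HashSet<>();
        int start = -1;
        String type = null;
        for (int i = 0; i <= bio.length; i++) {
            String tag = i < bio.length ? bio[i] : "O"; // sentinel closes the last chunk
            boolean continues = type != null && tag.equals("I-" + type);
            if (!continues) {
                if (type != null) {
                    spans.add(start + ":" + i + ":" + type);
                    type = null;
                }
                // An I- tag without a matching open chunk is treated as a chunk start.
                if (tag.startsWith("B-") || tag.startsWith("I-")) {
                    start = i;
                    type = tag.substring(2);
                }
            }
        }
        return spans;
    }

    /** Returns {precision, recall} computed over exact chunk matches. */
    static double[] precisionRecall(String[] gold, String[] predicted) {
        Set<String> goldChunks = chunks(gold);
        Set<String> predictedChunks = chunks(predicted);
        Set<String> correct = new HashSet<>(predictedChunks);
        correct.retainAll(goldChunks);
        double precision = predictedChunks.isEmpty() ? 0.0 : (double) correct.size() / predictedChunks.size();
        double recall = goldChunks.isEmpty() ? 0.0 : (double) correct.size() / goldChunks.size();
        return new double[] {precision, recall};
    }

    public static void main(String[] args) {
        String[] gold = {"B-NP", "B-VP", "B-NP", "I-NP"};      // tags from the question
        String[] predicted = {"B-NP", "B-VP", "B-NP", "B-NP"}; // hypothetical wrong output
        double[] pr = precisionRecall(gold, predicted);
        System.out.printf("precision=%.3f recall=%.3f%n", pr[0], pr[1]);
    }
}
```

In the hypothetical prediction, splitting "a uva" into two NP chunks costs both a gold chunk (recall 2/3) and two wrong predictions (precision 2/4).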

Asked Feb 22 at 16:06 by Bob Rivers; edited Feb 26 at 16:12 by MWiesner

Once a Jira Issue is opened, you can drop an MRE demonstrating your use case, linking it here for others to follow. – MWiesner Commented Feb 26 at 16:06
@MWiesner Put your comment as an answer, so I can select it as the answer to my questions. – Bob Rivers Commented Feb 26 at 23:17

1 Answer

Q1

Yes, the chosen tag set (UD, Penn, custom) has an impact. Conversion does not work equally well in both directions:

Penn -> UD should work well.
UD -> Penn is not a good idea, as it is a lossy conversion: the UD tag set is less detailed than the "classic" Penn tag set.
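
To illustrate why the reverse direction is lossy, the sketch below inverts a small Penn -> UD table. The table is an illustrative subset chosen for this example (a full mapping would be larger, and some verb tags may map to AUX depending on context); the class name is hypothetical.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class PennToUdDemo {

    // Illustrative subset only: several fine-grained Penn tags collapse into one UD tag.
    static final Map<String, String> PENN_TO_UD = Map.of(
            "NN", "NOUN",
            "NNS", "NOUN",
            "NNP", "PROPN",
            "NNPS", "PROPN",
            "VB", "VERB",
            "VBD", "VERB",
            "VBZ", "VERB",
            "DT", "DET");

    /** Groups the source tags by their target tag, so ambiguous reverse lookups become visible. */
    static SortedMap<String, SortedSet<String>> invert(Map<String, String> mapping) {
        SortedMap<String, SortedSet<String>> inverse = new TreeMap<>();
        mapping.forEach((penn, ud) -> inverse.computeIfAbsent(ud, k -> new TreeSet<>()).add(penn));
        return inverse;
    }

    public static void main(String[] args) {
        // Every UD tag with more than one Penn candidate cannot be converted back reliably.
        invert(PENN_TO_UD).forEach((ud, pennTags) ->
                System.out.println(ud + " <- " + pennTags));
    }
}
```

Going from Penn to UD is a plain lookup, while going from NOUN back to Penn would require deciding between NN and NNS, information the UD tag alone no longer carries.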

Using a custom, language-specific tag set can work, but it is a matter of mapping from/to UD correctly. This might work for some tag sets and languages; for others it might be too complicated or too lossy.
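
A minimal sketch of such a mapping, applied to the POS column of CoNLL-2000 style training lines before the chunker is trained, could look like the following. Only `prop` -> PROPN and `art` -> DET come from the question; the entries for `n` and `v-fin`, the class name, and the method names are assumptions for illustration, and a real table would have to cover the whole corpus tag set.

```java
import java.util.List;
import java.util.Map;

class CorpusTagConverter {

    // prop and art are taken from the question; n and v-fin are assumed for illustration.
    static final Map<String, String> TO_UD = Map.of(
            "prop", "PROPN",
            "art", "DET",
            "n", "NOUN",
            "v-fin", "VERB");

    /** Maps one corpus tag to UD, failing loudly instead of guessing for unknown tags. */
    static String toUd(String corpusTag) {
        String ud = TO_UD.get(corpusTag);
        if (ud == null) {
            throw new IllegalArgumentException("No UD mapping for tag: " + corpusTag);
        }
        return ud;
    }

    /** Rewrites the POS column of a "token POS chunk" line; blank separator lines pass through. */
    static String convertLine(String line) {
        if (line.isBlank()) {
            return line;
        }
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 columns but got: " + line);
        }
        return fields[0] + " " + toUd(fields[1]) + " " + fields[2];
    }

    public static void main(String[] args) {
        // Hypothetical training excerpt; the chunk tags are the ones from the question.
        List<String> lines = List.of("Ivo prop B-NP", "viu v-fin B-VP", "a art B-NP", "uva n I-NP", "");
        lines.forEach(line -> System.out.println(convertLine(line)));
    }
}
```

Throwing on an unmapped tag is a deliberate choice in this sketch: a silent fallback would reintroduce exactly the kind of train/runtime mismatch the conversion is meant to remove.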

Q2

No, there isn't. The OpenNLP project accepts code donations for upcoming releases, if you want to contribute such a mapping/translation for Portuguese.

Q3

This needs more details and discussion on the Apache OpenNLP user and/or dev mailing lists. Alternatively, feel free to open a Jira issue if you can narrow the topic down to a clear idea or a proposed code addition.
