I'm trying to use the OpenNLP chunking API to chunk a Portuguese sentence. First I tokenized the sentence with TokenizerME, then I tagged it with POSTaggerME. For both I used the ready-made models provided by the project here.
For the sentence “Ivo viu a uva”, POSTaggerME returns the tags [PROPN, VERB, DET, NOUN]. The model seems to be using the UD POS tags.
As there is no ready-made ChunkerME model for Portuguese, I followed the instructions and trained one myself: first I used the ChunkerConverter tool to convert the corpus from "arvore deitada" to CoNLL2000, then I generated the model with the ChunkerTrainerME tool. Everything worked well. For the sentence above, the chunker produced the correct tags ([B-NP, B-VP, B-NP, I-NP]).
But, for more complex sentences, it hasn't produced such good results.
I was trying to identify what I could improve in the chunker training, and one thing I noticed is that the tag sets differ. The Portuguese corpus (Bosque 8.0) seems to use its own Portuguese tags. For example, instead of PROPN the corpus uses prop, and instead of DET it uses art.
It seems to me that this could lead to problems, especially since one of the parameters the chunker receives is an array with UD tags, but it has been trained with another type of tag...
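One way to check this suspicion is to compare the tags the tagger emits with the tags that occur in the chunker training data. The following is a minimal sketch, assuming the training file uses the usual CoNLL-2000 layout of one `word POS chunk` triple per line. The class name `TagSetCheck`, the sample lines, and the tags `v-fin` and `n` are illustrative assumptions, not taken from the actual corpus files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class TagSetCheck {

    // Collects the POS column (second field) of CoNLL-2000 style lines: "word POS chunk".
    static Set<String> trainingTags(List<String> conllLines) {
        Set<String> tags = new TreeSet<>();
        for (String line : conllLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length == 3) { // blank lines separate sentences and are skipped
                tags.add(fields[1]);
            }
        }
        return tags;
    }

    // Returns the tagger output tags that never occur in the training data.
    static List<String> unseenTags(String[] taggerOutput, Set<String> seen) {
        List<String> unseen = new ArrayList<>();
        for (String tag : taggerOutput) {
            if (!seen.contains(tag)) {
                unseen.add(tag);
            }
        }
        return unseen;
    }

    public static void main(String[] args) {
        // Hypothetical training lines; the exact corpus tags are assumptions.
        List<String> train = List.of(
            "Ivo prop B-NP",
            "viu v-fin B-VP",
            "a art B-NP",
            "uva n I-NP",
            "");
        String[] fromTagger = {"PROPN", "VERB", "DET", "NOUN"};
        System.out.println(unseenTags(fromTagger, trainingTags(train)));
    }
}
```

If every tag passed to the chunker at runtime shows up as unseen, the POS features learned during training cannot fire, which would be consistent with weak results on longer sentences.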
But before writing a routine to convert from the Portuguese notation to UD (or Penn), I wanted to ask whether
- this does indeed have an impact,
- there is a tool that already does this translation and
- there are any other suggestions for improving the chunker precision/recall.
- Once a Jira Issue is opened, you can drop an MRE demonstrating your use case, linking it here for others to follow. – MWiesner Commented Feb 26 at 16:06
- @MWiesner Put your comment as an answer, so I can select it as the answer to my questions. – Bob Rivers Commented Feb 26 at 23:17
1 Answer
Q1
Yes, the chosen tag set (UD, Penn, custom) has an impact. Conversion does not work equally well in both directions:
- Penn -> UD should work well.
- UD -> Penn is not a good idea: the UD tag set is less detailed than the "classic" Penn tag set, so the finer Penn distinctions cannot be recovered.
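The asymmetry can be illustrated with a small sketch. The table below is a deliberately incomplete, hand-picked excerpt of a Penn -> UD mapping (an assumption for illustration, not an official conversion table), and `LossyTagDemo` is a hypothetical class name. Inverting the table shows that one UD tag can correspond to several Penn tags.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LossyTagDemo {

    // A small, incomplete excerpt of a Penn -> UD mapping (illustration only).
    static final Map<String, String> PENN_TO_UD = new LinkedHashMap<>();
    static {
        PENN_TO_UD.put("NN", "NOUN");
        PENN_TO_UD.put("NNS", "NOUN");
        PENN_TO_UD.put("NNP", "PROPN");
        PENN_TO_UD.put("VBD", "VERB");
        PENN_TO_UD.put("VBZ", "VERB");
        PENN_TO_UD.put("DT", "DET");
    }

    // Inverts the mapping: each UD tag collects every Penn tag that maps to it.
    static Map<String, List<String>> invert(Map<String, String> forward) {
        Map<String, List<String>> inverse = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : forward.entrySet()) {
            inverse.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return inverse;
    }

    public static void main(String[] args) {
        // Prints {NOUN=[NN, NNS], PROPN=[NNP], VERB=[VBD, VBZ], DET=[DT]}
        System.out.println(invert(PENN_TO_UD));
    }
}
```

Going from VERB back to Penn would require choosing between VBD and VBZ (among others), and that information is not in the UD tag itself.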
Using a custom, language-specific tag set can work, but it is a matter of "mapping" from/to UD correctly. This might work for some tag sets and languages; for others it might be too complicated or too lossy.
Q2
No, there isn't. The OpenNLP project accepts code donations for upcoming releases, if you want to provide such a mapping/translation for Portuguese.
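A hand-written mapping could start from something like the following sketch. It rewrites the POS column of CoNLL-2000 style training lines before the chunker is trained, so that training and runtime use the same tag set. Only `prop` and `art` come from the question; the class name `BosqueToUd` and all other table entries are assumptions that would need to be checked against the corpus documentation.

```java
import java.util.Map;

class BosqueToUd {

    // Hypothetical, incomplete table: "prop" and "art" come from the question,
    // the remaining entries are assumptions to be verified against the corpus.
    static final Map<String, String> TABLE = Map.of(
        "prop", "PROPN",
        "art", "DET",
        "n", "NOUN",
        "v-fin", "VERB",
        "adj", "ADJ",
        "adv", "ADV",
        "prp", "ADP");

    // Maps one corpus tag to UD; unknown tags fall back to "X" (UD's tag for "other").
    static String toUd(String corpusTag) {
        return TABLE.getOrDefault(corpusTag, "X");
    }

    // Rewrites the POS column of one CoNLL-2000 line ("word POS chunk").
    static String convertLine(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 3) {
            return line; // keep blank sentence separators untouched
        }
        return f[0] + " " + toUd(f[1]) + " " + f[2];
    }

    public static void main(String[] args) {
        System.out.println(convertLine("Ivo prop B-NP")); // Ivo PROPN B-NP
        System.out.println(convertLine("a art B-NP"));    // a DET B-NP
    }
}
```

A context-free table like this is only an approximation. For instance, UD separates AUX from VERB, which a lookup on the corpus tag alone may not be able to decide, so counting how often the `X` fallback is hit on the real corpus is a useful first sanity check.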
Q3
This needs details/discussion on the Apache OpenNLP user and/or dev mailing lists. Alternatively, feel free to open a Jira issue if you can drill the topic down to a clear idea or proposed code addition.