I'm trying to use the OpenNLP chunking API to chunk a Portuguese sentence. First I tokenized the sentence using TokenizerME, then I tagged it with POSTaggerME. For both I used the ready-made models provided by the project.

For the sentence “Ivo viu a uva”, POSTaggerME returns the tags [PROPN, VERB, DET, NOUN]. The model seems to use the UD (Universal Dependencies) POS tags.

As there is no ready-made model for ChunkerME in Portuguese, I followed the instructions and trained one myself: first I used the ChunkerConverter tool (to convert from "arvore deitada" to CoNLL2000), then I generated the model with the ChunkerTrainerME tool. Everything worked well. For the sentence above, the chunker produced the correct tags ([B-NP, B-VP, B-NP, I-NP]).

But for more complex sentences, the results have not been as good.

While trying to identify what I could improve in the chunker training, I noticed a difference between the tag sets. The Portuguese corpus (Bosque 8.0) seems to use its own Portuguese tags. For example, instead of PROPN the corpus uses prop, and instead of DET it uses art.

It seems to me that this could lead to problems, especially since one of the parameters the chunker receives is an array of UD tags, while the model was trained with a different tag set.
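One way to confirm the suspected mismatch is to compare the tags the tagger emits with the POS column of the chunker's training data. The following is only a minimal sketch: the class and method names are hypothetical, it uses plain Java rather than the OpenNLP API, and it assumes the converted training file has one "token POS chunk" triple per line with blank lines between sentences. The sample tags v-fin and n are assumptions about the corpus notation; only prop and art are confirmed above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of OpenNLP.
final class TagVocabularyCheck {

    /** Collects the POS column (second field) from CoNLL-2000 style lines: "token POS chunk". */
    static Set<String> trainingTags(List<String> conllLines) {
        Set<String> tags = new TreeSet<>();
        for (String line : conllLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length == 3) { // blank separator lines have fewer fields and are skipped
                tags.add(fields[1]);
            }
        }
        return tags;
    }

    /** Returns the tagger tags that never occur in the chunker's training data. */
    static Set<String> unseenTags(String[] taggerOutput, Set<String> trainingTags) {
        Set<String> unseen = new TreeSet<>(Arrays.asList(taggerOutput));
        unseen.removeAll(trainingTags);
        return unseen;
    }

    public static void main(String[] args) {
        // Sample training lines; "v-fin" and "n" are assumed corpus tags.
        List<String> training = Arrays.asList(
                "Ivo prop B-NP", "viu v-fin B-VP", "a art B-NP", "uva n I-NP", "");
        String[] tagged = {"PROPN", "VERB", "DET", "NOUN"}; // tags reported by POSTaggerME above
        System.out.println(unseenTags(tagged, trainingTags(training)));
    }
}
```

If every tag produced by the tagger ends up in the "unseen" set, the chunker has never observed any of its POS inputs during training, which would be consistent with weak results on longer sentences.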

But before writing a routine to convert from the Portuguese notation to UD (or Penn), I wanted to ask whether

  1. this does indeed have an impact,
  2. there is a tool that already does this translation and
  3. there are any other suggestions for improving the chunker precision/recall.
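For illustration, the conversion routine mentioned above could be as small as rewriting the POS column of the converted training file before training the chunker. This is a sketch under stated assumptions: the class name is hypothetical, the input is assumed to be "token POS chunk" lines, and the mapping table is deliberately incomplete. Only prop -> PROPN and art -> DET come from the observations above; the remaining entries are assumptions that would need to be checked against the corpus documentation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP.
final class BosqueToUdSketch {

    // Incomplete, illustrative mapping from corpus tags to UD tags.
    static final Map<String, String> TAG_MAP = new HashMap<>();
    static {
        TAG_MAP.put("prop", "PROPN"); // observed in the question
        TAG_MAP.put("art", "DET");    // observed in the question
        TAG_MAP.put("n", "NOUN");     // assumption
        TAG_MAP.put("v-fin", "VERB"); // assumption; auxiliaries may need AUX instead
        TAG_MAP.put("adj", "ADJ");    // assumption
    }

    /** Rewrites the POS column of one "token POS chunk" line; blank lines pass through. */
    static String convertLine(String line) {
        if (line.trim().isEmpty()) {
            return line; // sentence separator
        }
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 'token POS chunk': " + line);
        }
        String ud = TAG_MAP.get(fields[1]);
        if (ud == null) {
            // Fail loudly so gaps in the mapping are noticed instead of silently kept.
            throw new IllegalArgumentException("No UD mapping for tag: " + fields[1]);
        }
        return fields[0] + " " + ud + " " + fields[2];
    }

    public static void main(String[] args) {
        System.out.println(convertLine("Ivo prop B-NP")); // prints "Ivo PROPN B-NP"
    }
}
```

Throwing on unmapped tags is a design choice for this sketch: it surfaces every corpus tag that still lacks a UD counterpart, rather than leaving a mixed tag set in the training data.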

asked Feb 22 at 16:06 by Bob Rivers; edited Feb 26 at 16:12 by MWiesner
  • Once a Jira Issue is opened, you can drop an MRE demonstrating your use case, linking it here for others to follow. – MWiesner Commented Feb 26 at 16:06
  • @MWiesner Put your comment as an answer, so I can select it as the answer to my question. – Bob Rivers Commented Feb 26 at 23:17

1 Answer


Q1

Yes, the chosen tag set (UD, Penn, custom) has an impact. Conversion does not work equally well in both directions:

  • Penn -> UD should work well.
  • UD -> Penn is not a good idea, as it is a lossy conversion. The UD tag set is less detailed than the "classic" Penn tag set.

Using a custom, language-specific tag set can work, but it depends on mapping it from/to UD correctly. This might work for some tag sets and languages; for others it might be too complicated or lossy.
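To illustrate why the direction matters, here is a small sketch with a handful of Penn Treebank tags and their UD counterparts. The class and method names are hypothetical and the table covers only a few noun and adjective tags. Several Penn tags collapse onto one UD tag, so the forward direction is a plain lookup, while the reverse direction yields more than one candidate.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not part of OpenNLP.
final class LossyMappingSketch {

    // A small excerpt of a Penn -> UD mapping (many-to-one).
    static final Map<String, String> PENN_TO_UD = new LinkedHashMap<>();
    static {
        PENN_TO_UD.put("NN", "NOUN");    // singular noun
        PENN_TO_UD.put("NNS", "NOUN");   // plural noun
        PENN_TO_UD.put("NNP", "PROPN");  // singular proper noun
        PENN_TO_UD.put("NNPS", "PROPN"); // plural proper noun
        PENN_TO_UD.put("JJ", "ADJ");     // adjective
        PENN_TO_UD.put("JJR", "ADJ");    // comparative adjective
        PENN_TO_UD.put("JJS", "ADJ");    // superlative adjective
    }

    /** Returns every Penn tag in the excerpt that maps to the given UD tag. */
    static Set<String> pennCandidates(String udTag) {
        Set<String> candidates = new TreeSet<>();
        for (Map.Entry<String, String> entry : PENN_TO_UD.entrySet()) {
            if (entry.getValue().equals(udTag)) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(PENN_TO_UD.get("NNS"));  // prints "NOUN"
        System.out.println(pennCandidates("NOUN")); // prints "[NN, NNS]"
    }
}
```

Going from UD back to Penn would require recovering distinctions such as number or degree from the token itself, which the tag alone no longer carries.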

Q2

No, there isn't. The OpenNLP project accepts code donations for upcoming releases, if you want to contribute such a mapping/translation for Portuguese.

Q3

This needs more details and a discussion on the Apache OpenNLP user and/or dev mailing lists. Alternatively, feel free to open a Jira issue if you can narrow the topic down to a clear idea or a proposed code addition.

Tags: nlp. Source: "OpenNLP POSTaggerME and ChunkerME synergy", Stack Overflow.