I am trying to extract the marked content of a page, but PDFBox does not return the correctly mapped marked content. Here is the sample file. The formula content is marked as shown below.
MC0 is not a COSDictionary, and for this reason PDFBox cannot read the formula-related content when extracting the marked content of the page. Here is the relevant method:
public void process(Operator operator, List<COSBase> arguments) throws IOException {
    COSName tag = null;
    COSDictionary properties = null;
    Iterator var5 = arguments.iterator();
    while(var5.hasNext()) {
        COSBase argument = (COSBase)var5.next();
        if (argument instanceof COSName) {
            tag = (COSName)argument;
        } else if (argument instanceof COSDictionary) {
            properties = (COSDictionary)argument;
        }
    }
    this.context.beginMarkedContentSequence(tag, properties);
}
But I found that MC0 is an indirect reference to a property list that holds nothing but MCID 34, as shown in the figure below.
How can I get the figure-related marked content? This is the code I ran:
PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
extractor.processPage(page);
Map<Integer, PDMarkedContent> theseMarkedContents = new HashMap<>();
markedContents.put(page, theseMarkedContents);
for (PDMarkedContent markedContent : extractor.getMarkedContents()) {
    addToMap(theseMarkedContents, markedContent);
    num++;
}
Asked Apr 2 at 4:16 by fascinating coder
Comment: What exactly is the "wrong result" you claim that the stream engine gives you? – mkl, Apr 2 at 8:00
1 Answer
It took some time to understand what the actual issue is here and what the given pieces of information in the question refer to. But indeed, there is a bug in the PDFBox BeginMarkedContentSequenceWithProperties operator processor.
The process method the OP quoted in the question turns out to be the process method of BeginMarkedContentSequenceWithProperties:
public void process(Operator operator, List<COSBase> arguments) throws IOException
{
    COSName tag = null;
    COSDictionary properties = null;
    for (COSBase argument : arguments)
    {
        if (argument instanceof COSName)
        {
            tag = (COSName) argument;
        }
        else if (argument instanceof COSDictionary)
        {
            properties = (COSDictionary) argument;
        }
    }
    getContext().beginMarkedContentSequence(tag, properties);
}
The issue is that this method implicitly assumes that the BDC operation has at most one name parameter and at most one dictionary parameter of interest. This is wrong! The operation is actually specified as "tag properties BDC", with the following meaning:
Begin a marked-content sequence with an associated property list, terminated by a balancing EMC operator. tag shall be a name object indicating the role or significance of the sequence. properties shall be either an inline dictionary representing the property list or a name object associated with it in the Properties subdictionary of the current resource dictionary (see 14.6.2, "Property lists").
(ISO 32000-2, Table 352 — Marked-content operators)
Thus, the property dictionary argument can also be given by a name. In that case the BDC operation has two name parameters of interest!
For example in the case of the OP's file:
/Formula /MC0 BDC
In this case the process implementation above drops the Formula name and instead puts MC0 into the tag variable. Correctly, though, it should have put Formula into tag and looked up MC0 in the Properties resources to put the dictionary from there into the properties variable.
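The positional resolution described above can be sketched independently of PDFBox. The following is only a simplified model under stated assumptions, not the PDFBox implementation or its eventual fix: the class BdcOperandSketch, its nested types Name and MarkedContentStart, and the method resolve are all hypothetical stand-ins, and the Properties resources are modelled as a plain map. The MCID value 34 is the one reported in the question. In PDFBox itself the lookup would presumably go through the current PDResources (for example its getProperties method), but that wiring is not shown or verified here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// PDFBox-independent model of resolving the operands of "tag properties BDC".
// Every type and method name here is a hypothetical stand-in, not PDFBox API.
final class BdcOperandSketch {

    /** Stand-in for a PDF name object such as /Formula or /MC0. */
    static final class Name {
        final String value;
        Name(String value) { this.value = value; }
    }

    /** The resolved operands: the tag name and the property dictionary (may be null). */
    static final class MarkedContentStart {
        final String tag;
        final Map<String, Object> properties;
        MarkedContentStart(String tag, Map<String, Object> properties) {
            this.tag = tag;
            this.properties = properties;
        }
    }

    /**
     * Resolves the operands by position: the first must be the tag name; the second
     * is either an inline dictionary or a name to look up in the Properties resources.
     * Returns null if the operand list is malformed.
     */
    static MarkedContentStart resolve(List<?> operands,
                                      Map<String, Map<String, Object>> propertiesResources) {
        if (operands.size() != 2 || !(operands.get(0) instanceof Name)) {
            return null;
        }
        String tag = ((Name) operands.get(0)).value;
        Object second = operands.get(1);
        Map<String, Object> properties = null;
        if (second instanceof Name) {
            // Named property list: resolve it through the resource dictionary.
            properties = propertiesResources.get(((Name) second).value);
        } else if (second instanceof Map) {
            // Inline dictionary: use it directly.
            @SuppressWarnings("unchecked")
            Map<String, Object> inline = (Map<String, Object>) second;
            properties = inline;
        }
        return new MarkedContentStart(tag, properties);
    }

    public static void main(String[] args) {
        // Models "/Formula /MC0 BDC" with MC0 mapped to a dictionary holding MCID 34.
        Map<String, Map<String, Object>> resources = new HashMap<>();
        resources.put("MC0", Map.of("MCID", 34));
        MarkedContentStart start =
                resolve(List.of(new Name("Formula"), new Name("MC0")), resources);
        System.out.println(start.tag + " " + start.properties);
    }
}
```

The point of the sketch is that the operands are distinguished by position rather than by type, so a second name operand is never mistaken for the tag. A name that is missing from the resources leaves the properties null instead of overwriting the tag.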