How to capture all data from part of the input with ANTLR4 visitor - Stack Overflow

I'd like to replicate the 'here' document from the bash scripting language so I need to capture all data between a start and end point.

This is my parser grammar:

parser grammar HereDoc;

options {   tokenVocab = HereLexer; }

doc : part* EOF ;

part: hereDocument | ID;

hereDocument: HERE_START   DATA_BLOCK  HERE_END ;

This is the lexer (provided in the previous question, "How to capture all data from part of the input"):

lexer grammar HereLexer;
@members {
  String hereStart = null;

  boolean hereEndAhead() {
    for (int i = 1; i <= hereStart.length(); i++) {
      if (hereStart.charAt(i - 1) != _input.LA(i)) {
        return false;
      }
    }
    return true;
  }
}

ID
 : [a-zA-Z_] [a-zA-Z_0-9]*
 ;

HERE_START
 : '<<' ID {hereStart = getText().substring(2);} -> pushMode(HereMode)
 ;

SPACE
 : [ \t\r\n] -> skip
 ;


mode HereMode;

HERE_END
 : {hereEndAhead()}? [a-zA-Z_] [a-zA-Z_0-9]* -> popMode
 ;

DATA_BLOCK
 : ({!hereEndAhead()}? . )+
 ;

The lexer tokenizes the input properly, but the parser/visitor still fails to parse it.

When I run this code:

        String source =
                "there\n" +
                "<<here\n" +
                "    sdfsdf\n" +
                "    !@#$%^&*()_\n" +
                "    1234567890-\n" +
                "    a111\n" +
                "here\n" +
                "done";

        HereLexer lexer = new HereLexer(CharStreams.fromString(source));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();

        for (Token t : tokens.getTokens()) {
            System.out.printf("%-20s '%s'\n",
                    HereLexer.VOCABULARY.getSymbolicName(t.getType()),
                    t.getText().replace("\n", "\\n"));
        }

        HereDocVisitor<?>   visitor = new HereDocBaseVisitor<>();
        HereDoc parser = new HereDoc(new CommonTokenStream(lexer));
        visitor.visit(parser.hereDocument());

I get this result:

ID                   'there'
HERE_START           '<<here'
DATA_BLOCK           '\n    sdfsdf\n    !@#$%^&*()_\n    1234567890-\n    a111\n'
HERE_END             'here'
ID                   'done'
EOF                  '<EOF>'
line 9:4 mismatched input '<EOF>' expecting HERE_START

The ANTLR4 tool shows this tree: [image of the parse tree, not reproduced]

Asked Mar 27 at 9:34 by user1818726

Why are you doing new CommonTokenStream(lexer) twice? Your lexer has to be reset(), which rewinds the input back to the first char, but you don't do that. And you don't override it to reset your hereStart field. You need to learn how to use a debugger. – kaby76 Commented Mar 27 at 12:26
I probably should have mentioned that I'm using ANTLR 4.13.2 and tokens.reset is deprecated. When I created the parser I wanted to use a new token stream to make sure there were no side effects. Just to make sure, I changed the code to use tokens.reset and got the same result. – user1818726 Commented Mar 27 at 17:51
The picture of the parse tree shows you called parser.doc() because the root of the tree is doc. The code you give says you called parser.hereDocument(). hereDocument() is the wrong entry point for the parse because the input starts with an ID "there". You call tokens.reset(), which is wrong, not only because it's deprecated, but it's on a token stream that you ignore because you create a new one: new CommonTokenStream(lexer) a second time. Call lexer.reset() after printing the tokens. Do not create a second token stream. Call the correct parser entry point parser.doc(). – kaby76 Commented Mar 27 at 19:03
If you insist on creating a second CommonTokenStream, create a new lexer object and new character stream as well. lexer.reset() works fine. If the grammar writer writes a lexer base class, he must implement a reset() method. Otherwise, the state of the base class will not be reset. – kaby76 Commented Mar 27 at 19:12
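
The point made in these comments can be seen without ANTLR. Below is a minimal sketch, assuming a hypothetical `ToyLexer` that stands in for the generated `HereLexer` (the class, its whitespace splitting, and the `drain` helper are invented for illustration and are not ANTLR API). Once the first pass has pulled every token, a second reader over the same lexer object gets only `<EOF>`, which would be consistent with the `mismatched input '<EOF>' expecting HERE_START` error in the question. A `reset()` that rewinds the input and also clears custom fields such as `hereStart` restores the original behavior.

```java
import java.util.ArrayList;
import java.util.List;

public class ResetSketch {
    /** Hypothetical stand-in for a generated lexer: a cursor plus one custom field. */
    static class ToyLexer {
        private final String[] words;
        private int pos = 0;      // plays the role of the input position
        String hereStart = null;  // custom state, like hereStart in the question

        ToyLexer(String input) {
            this.words = input.split("\\s+");
        }

        /** Returns the next word, or "<EOF>" forever once the input is used up. */
        String nextToken() {
            if (pos >= words.length) {
                return "<EOF>";
            }
            String w = words[pos++];
            if (w.startsWith("<<")) {
                hereStart = w.substring(2);
            }
            return w;
        }

        /** Rewinds the cursor AND clears the custom state. */
        void reset() {
            pos = 0;
            hereStart = null;
        }
    }

    /** Pulls tokens up to and including "<EOF>", similar in spirit to filling a token stream. */
    static List<String> drain(ToyLexer lexer) {
        List<String> out = new ArrayList<>();
        String t;
        do {
            t = lexer.nextToken();
            out.add(t);
        } while (!t.equals("<EOF>"));
        return out;
    }

    public static void main(String[] args) {
        ToyLexer lexer = new ToyLexer("there <<here done");
        System.out.println(drain(lexer)); // first pass sees every token
        System.out.println(drain(lexer)); // second pass over the same lexer: only <EOF>
        lexer.reset();
        System.out.println(drain(lexer)); // after reset: every token again
    }
}
```

In the real program the simpler fix is the one suggested above: build one `CommonTokenStream`, hand that same stream to the parser, and start from `parser.doc()`.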

1 Answer

As mentioned in the comments: if you invoke hereDocument(), the input should consist only of a here-document, not the entire input you provided in your question.

I don't know what ANTLR tool/plugin you're using, but many of these tools do not run embedded code, which might be the cause of the error you're getting.
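
To see what the embedded code contributes, here is a plain-Java sketch of the same idea outside ANTLR. It is a simplified model, not the generated lexer: `HereDocSketch`, `captureData`, and `isIdChar` are hypothetical names, and the loop only imitates what the `{!hereEndAhead()}? .` predicate does one character at a time.

```java
public class HereDocSketch {
    /** Mirrors hereEndAhead(): does the terminator text start at position pos? */
    static boolean hereEndAhead(String input, int pos, String hereStart) {
        return input.startsWith(hereStart, pos);
    }

    /** Same character classes as the ID rule: [a-zA-Z_] first, then [a-zA-Z_0-9]. */
    static boolean isIdChar(char c, boolean first) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
        return letter || (!first && c >= '0' && c <= '9');
    }

    /**
     * Returns the text between "<<NAME" and the next occurrence of NAME,
     * or null if there is no start marker or no terminator.
     */
    static String captureData(String input) {
        int start = input.indexOf("<<");
        if (start < 0) {
            return null;
        }
        int i = start + 2;
        int idStart = i;
        // Read the identifier that follows "<<"; it becomes the terminator.
        while (i < input.length() && isIdChar(input.charAt(i), i == idStart)) {
            i++;
        }
        String hereStart = input.substring(idStart, i);
        if (hereStart.isEmpty()) {
            return null;
        }
        int dataStart = i;
        // Consume one character at a time while the terminator is NOT ahead.
        while (i < input.length() && !hereEndAhead(input, i, hereStart)) {
            i++;
        }
        if (i >= input.length()) {
            return null; // terminator never found
        }
        return input.substring(dataStart, i);
    }

    public static void main(String[] args) {
        String source = "there\n<<here\n    sdfsdf\n    a111\nhere\ndone";
        System.out.println(captureData(source).replace("\n", "\\n"));
    }
}
```

Like the predicate in the lexer, this check matches the terminator wherever it appears, not only at the start of a line, so a data line containing a word such as "there" would end the block early. A tool that skips the predicates has no such check at all, so its tokenization can differ from what the generated Java lexer produces.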

When I run this Java code:

String source = "here\n" +
        "there\n" +
        "<<here\n" +
        "    sdfsdf\n" +
        "    !@#$%^&*()_\n" +
        "    1234567890-\n" +
        "    a111\n" +
        "here\n" +
        "done";

HereLexer lexer = new HereLexer(CharStreams.fromString(source));
HereDoc parser = new HereDoc(new CommonTokenStream(lexer));

System.out.println(parser.doc().toStringTree(parser));

the following output is printed (without errors):

(doc 
  (part here) 
  (part there) 
  (part 
    (hereDocument <<here \n    sdfsdf\n    !@#$%^&*()_\n    1234567890-\n    a111\n here)) 
  (part done) 
  <EOF>)

(I added the indentation afterwards)
