I'd like to replicate the 'here' document from the bash scripting language so I need to capture all data between a start and end point.

Here's my grammar:

grammar Here;

hereDocument: '<<' IDENTIFIER '\n'  hereContent IDENTIFIER ;

hereContent:
        .*
        ;

IDENTIFIER
    : [a-zA-Z_][a-zA-Z_0-9]* ;

WS
    : [ \t\r\n]+ -> skip ;

Here's the data block:

<<here
    sdfsdf
    !@#$%^&*()_
    1234567890-
    a111
here

I want to capture all data between '<<here' and 'here' as hereContent, but ANTLR4 falls back to the definition of IDENTIFIER, and anything that does not match that definition is reported as extraneous input.

Asked Mar 23 at 5:15 by user1818726; edited Mar 23 at 5:26 by Ken White.

1 Answer

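You can't reasonably do that inside parser rules: by the time the parser runs, the lexer has already tokenized the input, so the body of the here document has been split into IDENTIFIER tokens and unrecognized characters. You'll need to add logic to your lexer that detects a start <<... token and then treats everything up to the matching end identifier as a single token. Lexical modes can do this. Here's a quick demo: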

lexer grammar HereLexer;

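@members {
  // Delimiter taken from the most recent '<<ID' token, e.g. "here".
  String hereStart = null;

  // Returns true if the upcoming characters spell out the delimiter.
  // Note: this is checked at every position, not only at the start of a line.
  boolean hereEndAhead() {
    for (int i = 1; i <= hereStart.length(); i++) {
      if (hereStart.charAt(i - 1) != _input.LA(i)) {
        return false;
      }
    }
    return true;
  }
}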

ID
 : [a-zA-Z_] [a-zA-Z_0-9]*
 ;

HERE_START
 : '<<' ID {hereStart = getText().substring(2);} -> pushMode(HereMode)
 ;

SPACE
 : [ \t\r\n] -> skip
 ;

OTHER
 : .
 ;

mode HereMode;

HERE_END
 : {hereEndAhead()}? [a-zA-Z_] [a-zA-Z_0-9]* -> popMode
 ;

DATA_BLOCK
 : ({!hereEndAhead()}? . )+
 ;

If you now run the Java code:

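import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

public class Main {
    public static void main(String[] args) {
        String source = "here\n" +
                "there\n" +
                "<<here\n" +
                "    sdfsdf\n" +
                "    !@#$%^&*()_\n" +
                "    1234567890-\n" +
                "    a111\n" +
                "here\n" +
                "done";

        // Tokenize the whole input up front.
        HereLexer lexer = new HereLexer(CharStreams.fromString(source));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();

        // Print each token's type name and its text, with newlines escaped.
        for (Token t : tokens.getTokens()) {
            System.out.printf("%-20s '%s'\n",
                    HereLexer.VOCABULARY.getSymbolicName(t.getType()),
                    t.getText().replace("\n", "\\n"));
        }
    }
}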

you'll see the following output being printed:

ID                   'here'
ID                   'there'
HERE_START           '<<here'
DATA_BLOCK           '\n    sdfsdf\n    !@#$%^&*()_\n    1234567890-\n    a111\n'
HERE_END             'here'
ID                   'done'
EOF                  '<EOF>'
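
One thing to be aware of with the demo: the hereEndAhead() check succeeds wherever the delimiter text appears, so a data line containing a word such as "nowhere" would end the block early. The plain-Java sketch below illustrates that behaviour without ANTLR and contrasts it with a stricter check that only accepts the delimiter on a line of its own. The class HereScan and its methods are hypothetical names made up for this illustration; they are not part of ANTLR or of the grammar above.

```java
/** Plain-Java illustration of the lookahead check; not part of ANTLR. */
public class HereScan {

    /** True if delim starts at index pos of input (same idea as hereEndAhead()). */
    static boolean endAhead(String input, int pos, String delim) {
        return input.startsWith(delim, pos);
    }

    /** Stricter variant: the delimiter must occupy a whole line by itself. */
    static boolean endAheadOnOwnLine(String input, int pos, String delim) {
        boolean atLineStart = pos == 0 || input.charAt(pos - 1) == '\n';
        int end = pos + delim.length();
        boolean atLineEnd = end == input.length()
                || (end < input.length() && input.charAt(end) == '\n');
        return atLineStart && atLineEnd && input.startsWith(delim, pos);
    }

    /** Text from index 'from' up to the first position where the chosen check succeeds. */
    static String body(String input, int from, String delim, boolean ownLine) {
        for (int i = from; i < input.length(); i++) {
            boolean hit = ownLine
                    ? endAheadOnOwnLine(input, i, delim)
                    : endAhead(input, i, delim);
            if (hit) {
                return input.substring(from, i);
            }
        }
        return input.substring(from); // no terminator found
    }

    public static void main(String[] args) {
        String data = "\n  nowhere to go\nhere\n";
        // Loose check stops inside "nowhere"; prints: \n  now
        System.out.println(body(data, 0, "here", false).replace("\n", "\\n"));
        // Line-anchored check keeps the full line; prints: \n  nowhere to go\n
        System.out.println(body(data, 0, "here", true).replace("\n", "\\n"));
    }
}
```

If the stricter behaviour is wanted in the lexer itself, one possibility (untested here) would be to have the predicate also inspect the previous character via _input.LA(-1) and the character following the delimiter.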
