admin管理员组

文章数量:1336281

I was tasked to write a top-down recursive descent parser for the Puck-24.3 programming language.

Task:

Language Token Types:

  • Integers are non-empty sequences of digits optionally preceded with either a ‘+’ or ‘-’ sign.

  • Decimal numbers are Integers followed by a ‘.’, followed by a non-empty sequence of digits.

  • Strings are any non-space sequences of characters enclosed in “”. e.g. “hello” “abc123”.

  • Keywords are the following strings: ‘PRINT’, ‘.’, ‘[’, ‘]’, ‘(’, ‘)’ and ‘^’.

    Notice: keywords are uppercase

  • Operators are the special strings. In this homework we will use ‘:-’, ‘~’, ‘<’, ‘>’, '=', ‘#’, ‘+’, ‘-’, ‘&’, ‘OR’, ‘*’, ‘/’ and ‘AND’.

  • Identifiers are sequences of digits or letters. The first character must be a letter, and an identifier cannot be a Keyword.

Reminder: In Puck-24.3 tokens are always separated by white-spaces.

NOTE: For this assignment, your parser does not need to handle line comments.

Examples

Sample 1

Input:

x :- 2 + 2 .
PRINT ( x * 100 ) .

Expected output:

VALID

Sample 2

Input:

x :- 1 .
a [ i ] :- 2 .
w [ 3 ] ^ ch :- "a" .
t ^ key :- s .
p ^ next ^ data :- alpha .
x :- x + y .
y :- x - y .
c :- c + 1 .
PRINT ( c ) . 

Expected output:

VALID

Sample 3

Input:

PRINT ( a # b ) .
flag :- a > b .
a :- a - b .  b :- b - a .
PRINT ( "Hello" & "World!" ) .

Expected output:

VALID

Sample 4

Input:

foo :- ( PRINT ) ;
PRINT ( * )

Expected output:

INVALID!
Error: factor expected, got "PRINT" 

Note: you may output a different error message.

Sample 4

Input:

void getToken ( ) {
  cin >> token ;
}

Expected output:

INVALID!
Error: ":-" expected, got "getToken" 

Note: you may output a different error message.

My approach

Given a grammar, in order to build a top-down recursive descent parser, I follow the following technique:

For each non-terminal, I designed a corresponding function that parses sentences that can be generated by that non-terminal.

The function getToken() retrieves the next token from the input and stores it in a global variable token.

A non-terminal that has more than one production requires a check of the token to determine which production to apply. The token is compared with the first symbol in each production until a match is found. If no match is found, the program outputs an error and exits.

My code

import sys

# Initialize token and jetton
token = None
jetton = 0
tokens = []

def getToken():
    """Fetches the next token from the input and increments jetton."""
    global token, jetton
    if tokens:
        token = tokens.pop(0)
        jetton += 2  # Increment jetton for each token
    else:
        token = None

def error(expected):
    """Raise an error with an expected message."""
    print("INVALID!")
    print(f"Error: {expected}, got '{token}'")
    sys.exit(1)

def parse_Factor():
    """Parses the Factor rule."""
    global token
    if token.isdigit() or (token[0] in "+-" and token[1:].isdigit()):  # Integer
        getToken()
    elif token.startswith('"') and token.endswith('"'):  # String
        getToken()
    elif token.isidentifier():  # Identifier as a variable or designator
        getToken()
    elif token == "(":
        getToken()
        parse_Expression()
        if token == ")":
            getToken()
        else:
            error("')' expected")
    elif token == "~":
        getToken()
        parse_Factor()
    else:
        error("Factor expected")

def parse_Term():
    """Parses the Term rule."""
    parse_Factor()
    while token in ["*", "/", "AND"]:
        getToken()
        parse_Factor()

def parse_SimpleExpression():
    """Parses the SimpleExpression rule."""
    parse_Term()
    while token in ["+", "-", "OR", "&"]:
        getToken()
        parse_Term()

def parse_Expression():
    """Parses the Expression rule."""
    parse_SimpleExpression()
    if token in ["<", ">", "=", "#"]:  # Relation
        getToken()
        parse_SimpleExpression()

def parse_Designator():
    """Parses the Designator rule."""
    if token.isidentifier():
        getToken()
        while token in ["^", "["]:
            if token == "^":
                getToken()
                if token.isidentifier():
                    getToken()
                else:
                    error("Identifier expected after '^'")
            elif token == "[":
                getToken()
                parse_Expression()
                if token == "]":
                    getToken()
                else:
                    error("']' expected")
    else:
        error("Identifier expected")

def parse_Assignment():
    """Parses the Assignment rule."""
    parse_Designator()
    if token == ":-":
        getToken()
        parse_Expression()
        if token == ".":
            getToken()
        else:
            error("'.' expected")
    else:
        error("':-' expected")

def parse_PrintStatement():
    """Parses the PrintStatement rule."""
    if token == "PRINT":
        getToken()
        if token == "(":
            getToken()
            parse_Expression()
            if token == ")":
                getToken()
                if token == ".":
                    getToken()
                else:
                    error("'.' expected")
            else:
                error("')' expected")
        else:
            error("'(' expected")
    else:
        error("'PRINT' expected")

def parse_Statement():
    """Parses a single Statement."""
    if token.isidentifier():
        parse_Assignment()
    elif token == "PRINT":
        parse_PrintStatement()
    else:
        error("Statement expected")

def parse_StatementSequence():
    """Parses the StatementSequence rule."""
    while token:
        parse_Statement()

def main():
    """Main function to handle input and parsing."""
    global tokens
    print("Enter your program (end input with Ctrl+D):")
    try:
        # Read program from standard input
        program = sys.stdin.read()
        # Tokenize by whitespace
        tokens = program.split()
        getToken()  # Initialize first token
        parse_StatementSequence()  # Start parsing
        print("VALID")
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)

if __name__ == "__main__":
    main()

Problem

For the first 3 input samples, the program prints "INVALID" with an error message when it should print "VALID".

Error messages:

(sample 1)
INVALID!
Error: ':-' expected, got '('

(sample 2)
INVALID!
Error: Factor expected, got 'i'

(sample 3)
INVALID!
Error: Factor expected, got 'i'

What is the problem in my code?

I was tasked to write a top-down recursive descent parser for the Puck-24.3 programming language.

Task:

Language Token Types:

  • Integers are non-empty sequences of digits optionally preceded with either a ‘+’ or ‘-’ sign.

  • Decimal numbers are Integers followed by a ‘.’, followed by a non-empty sequence of digits.

  • Strings are any non-space sequences of characters enclosed in “”. e.g. “hello” “abc123”.

  • Keywords are the following strings: ‘PRINT’, ‘.’, ‘[’, ‘]’, ‘(’, ‘)’ and ‘^’.

    Notice: keywords are uppercase

  • Operators are the special strings. In this homework we will use ‘:-’, ‘~’, ‘<’, ‘>’, '=', ‘#’, ‘+’, ‘-’, ‘&’, ‘OR’, ‘*’, ‘/’ and ‘AND’.

  • Identifiers are sequences of digits or letters. The first character must be a letter, and an identifier cannot be a Keyword.

Reminder: In Puck-24.3 tokens are always separated by white-spaces.

NOTE: For this assignment, your parser does not need to handle line comments.

Examples

Sample 1

Input:

x :- 2 + 2 .
PRINT ( x * 100 ) .

Expected output:

VALID

Sample 2

Input:

x :- 1 .
a [ i ] :- 2 .
w [ 3 ] ^ ch :- "a" .
t ^ key :- s .
p ^ next ^ data :- alpha .
x :- x + y .
y :- x - y .
c :- c + 1 .
PRINT ( c ) . 

Expected output:

VALID

Sample 3

Input:

PRINT ( a # b ) .
flag :- a > b .
a :- a - b .  b :- b - a .
PRINT ( "Hello" & "World!" ) .

Expected output:

VALID

Sample 4

Input:

foo :- ( PRINT ) ;
PRINT ( * )

Expected output:

INVALID!
Error: factor expected, got "PRINT" 

Note: you may output a different error message.

Sample 4

Input:

void getToken ( ) {
  cin >> token ;
}

Expected output:

INVALID!
Error: ":-" expected, got "getToken" 

Note: you may output a different error message.

My approach

Given a grammar, in order to build a top-down recursive descent parser, I follow the following technique:

For each non-terminal, I designed a corresponding function that parses sentences that can be generated by that non-terminal.

The function getToken() retrieves the next token from the input and stores it in a global variable token.

A non-terminal that has more than one production requires a check of the token to determine which production to apply. The token is compared with the first symbol in each production until a match is found. If no match is found, the program outputs an error and exits.

My code

import sys

# Initialize token and jetton
token = None
jetton = 0
tokens = []

def getToken():
    """Fetches the next token from the input and increments jetton."""
    global token, jetton
    if tokens:
        token = tokens.pop(0)
        jetton += 2  # Increment jetton for each token
    else:
        token = None

def error(expected):
    """Raise an error with an expected message."""
    print("INVALID!")
    print(f"Error: {expected}, got '{token}'")
    sys.exit(1)

def parse_Factor():
    """Parses the Factor rule."""
    global token
    if token.isdigit() or (token[0] in "+-" and token[1:].isdigit()):  # Integer
        getToken()
    elif token.startswith('"') and token.endswith('"'):  # String
        getToken()
    elif token.isidentifier():  # Identifier as a variable or designator
        getToken()
    elif token == "(":
        getToken()
        parse_Expression()
        if token == ")":
            getToken()
        else:
            error("')' expected")
    elif token == "~":
        getToken()
        parse_Factor()
    else:
        error("Factor expected")

def parse_Term():
    """Parses the Term rule."""
    parse_Factor()
    while token in ["*", "/", "AND"]:
        getToken()
        parse_Factor()

def parse_SimpleExpression():
    """Parses the SimpleExpression rule."""
    parse_Term()
    while token in ["+", "-", "OR", "&"]:
        getToken()
        parse_Term()

def parse_Expression():
    """Parses the Expression rule."""
    parse_SimpleExpression()
    if token in ["<", ">", "=", "#"]:  # Relation
        getToken()
        parse_SimpleExpression()

def parse_Designator():
    """Parses the Designator rule."""
    if token.isidentifier():
        getToken()
        while token in ["^", "["]:
            if token == "^":
                getToken()
                if token.isidentifier():
                    getToken()
                else:
                    error("Identifier expected after '^'")
            elif token == "[":
                getToken()
                parse_Expression()
                if token == "]":
                    getToken()
                else:
                    error("']' expected")
    else:
        error("Identifier expected")

def parse_Assignment():
    """Parses the Assignment rule."""
    parse_Designator()
    if token == ":-":
        getToken()
        parse_Expression()
        if token == ".":
            getToken()
        else:
            error("'.' expected")
    else:
        error("':-' expected")

def parse_PrintStatement():
    """Parses the PrintStatement rule."""
    if token == "PRINT":
        getToken()
        if token == "(":
            getToken()
            parse_Expression()
            if token == ")":
                getToken()
                if token == ".":
                    getToken()
                else:
                    error("'.' expected")
            else:
                error("')' expected")
        else:
            error("'(' expected")
    else:
        error("'PRINT' expected")

def parse_Statement():
    """Parses a single Statement."""
    if token.isidentifier():
        parse_Assignment()
    elif token == "PRINT":
        parse_PrintStatement()
    else:
        error("Statement expected")

def parse_StatementSequence():
    """Parses the StatementSequence rule."""
    while token:
        parse_Statement()

def main():
    """Main function to handle input and parsing."""
    global tokens
    print("Enter your program (end input with Ctrl+D):")
    try:
        # Read program from standard input
        program = sys.stdin.read()
        # Tokenize by whitespace
        tokens = program.split()
        getToken()  # Initialize first token
        parse_StatementSequence()  # Start parsing
        print("VALID")
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)

if __name__ == "__main__":
    main()

Problem

For the first 3 input samples, the program prints "INVALID" with an error message when it should print "VALID".

Error messages:

(sample 1)
INVALID!
Error: ':-' expected, got '('

(sample 2)
INVALID!
Error: Factor expected, got 'i'

(sample 3)
INVALID!
Error: Factor expected, got 'i'

What is the problem in my code?

Share edited Nov 20, 2024 at 11:34 trincot 352k36 gold badges272 silver badges325 bronze badges asked Nov 19, 2024 at 18:25 DaneShulerDaneShuler 435 bronze badges
Add a comment  | 

1 Answer 1

Reset to default 0

The cause of the problem you list is that your code will never call parse_PrintStatement. This is because isidentifier evaluates to True when token is "PRINT". Whereever you call isindentifier you should exclude the case where the token is "PRINT" (and also exclude "OR" and "AND").

Note how this bug also affects the result on sample 4: although your program outputs INVALID for that case, it didn't spot the problem with the "PRINT" token, but only complained about the semi-colon. If you replace the semi-colon with a dot, the input should still be considered invalid, but your program will then find no more issues in that input. The reason is the same as described above.

For example, you could create this helper function:

def is_identifier():
    return token.isidentifier() and token not in ("AND", "OR", "PRINT")

And then use that function instead of calling token.isidentifier().

Also, be aware that the rules for identifiers applied by the native isidentifier method are slightly different than what is described in your task. For instance, underscores are allowed as identifier by Python, but maybe not in this Puck language. Similar concerns may arise when using isdigit, isalpha...

Other remarks

Without knowing the grammar rules for this language, there might be more problems in your parser, but:

  • A quoted string literal cannot be just " (one character). Yet your code would consider this a valid string literal. You should make sure there are at least two characters in the token.

  • Your code does not recognise decimal literals, like 3.14.

本文标签: validationRecursive Descent Parser in Python not validating codeStack Overflow