
For reference, I am learning from the book "Compilers: Principles, Techniques, and Tools" (also known as "the dragon book").

I am making a language where you can add two natural numbers. I have finished writing my Lexical Analyser. The tokens for the Lexical Analyser consist of the following.

data Token = Digit String | Plus | Minus | Multiply | Divide
    deriving (Show, Eq)

The tokens do contain other operations but I am only concerned with adding two natural numbers. The way I have built my Lexical Analyser is as follows.

  1. I first read the input string and extract every word from it. A word is any contiguous run of characters terminated by a space. The following function does that.
getAllWords :: String -> String -> [String]
getAllWords string word = case string of
    []     -> [word]
    ' ':xs -> word : (getAllWords xs "")
    x:xs   -> getAllWords xs (word ++ [x])
  2. Then I remove all the empty strings from this list (consecutive spaces leave a "" behind).
removeWhiteSpace :: [String] -> [String]
removeWhiteSpace list = case list of
    []    -> []
    "":xs -> removeWhiteSpace xs
    x:xs  -> x : (removeWhiteSpace xs)
  3. I have a DFA for every token in the language. I run a function tokeniser that passes every word from the list produced by (removeWhiteSpace (getAllWords list "")) to every DFA. If a DFA recognises the word, it outputs the relevant token. This function returns a list of all the tokens, reading the original string from left to right.
tokeniserHelper :: [String] -> [Token]
tokeniserHelper list = case list of
    [] -> []
    x:xs
        | digitDFA x S0    -> Digit x  : tokeniserHelper xs
        | plusDFA x S0     -> Plus     : tokeniserHelper xs
        | minusDFA x S0    -> Minus    : tokeniserHelper xs
        | multiplyDFA x S0 -> Multiply : tokeniserHelper xs
        | divideDFA x S0   -> Divide   : tokeniserHelper xs
        | otherwise        -> error ("tokeniser: unrecognised word " ++ show x)
 
tokeniser :: String -> [Token]
tokeniser string = tokeniserHelper (removeWhiteSpace (getAllWords string ""))
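
As a side note, this split-then-filter pipeline does the same job as the Prelude's `words` for space-separated input (`words` additionally splits on tabs and newlines), so it can be cross-checked against it. A minimal standalone sketch of the splitting stage:

```haskell
-- Standalone sketch of the splitting stage, for cross-checking against
-- the Prelude's `words`. The two agree on purely space-separated input.
getAllWords :: String -> String -> [String]
getAllWords string word = case string of
    []     -> [word]
    ' ':xs -> word : getAllWords xs ""
    x:xs   -> getAllWords xs (word ++ [x])

-- Dropping the empty strings left behind by consecutive spaces.
removeWhiteSpace :: [String] -> [String]
removeWhiteSpace = filter (not . null)

splitInput :: String -> [String]
splitInput s = removeWhiteSpace (getAllWords s "")
```

For example, `splitInput "12 +  34"` and `words "12 +  34"` both give `["12","+","34"]`.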

I realise that this is a bit different from the general idea of combining all the DFAs into one big DFA, but I found this approach simpler for a first attempt. It was also suggested in the book, so I decided to go with it.


Now I have moved on to the parser part and I am trying to implement a recursive descent parser to construct a parse tree. For the parse tree, I am using a rose tree as my base data structure.

data ParseTree a = Node a [ParseTree a]
    deriving (Show, Eq)
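
For example (reusing the Token type from the lexer), the expression 1 + 2 would be the rose tree with the operator at the root and the operands as leaves. A small self-contained sketch:

```haskell
data Token = Digit String | Plus | Minus | Multiply | Divide
    deriving (Show, Eq)

data ParseTree a = Node a [ParseTree a]
    deriving (Show, Eq)

-- The tree for "1 + 2": Plus at the root, the two digits as leaves.
onePlusTwo :: ParseTree Token
onePlusTwo = Node Plus [Node (Digit "1") [], Node (Digit "2") []]
```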

The grammar that I want to use is as follows.

E -> E + E | 0E | . . . | 9E | \epsilon

But I am not sure how to implement this grammar for the recursive descent parser. Could someone guide me on how to do it?

Also, am I heading in the right direction with building a compiler? Is there anything that I am doing fundamentally wrong?


Asked Nov 22, 2024 by Seeker (edited Nov 22, 2024).
  • "":xs -> removeWhiteSpace xs looks suspicious. Should it be ' '? – talex Commented Nov 22, 2024 at 7:26
  • @talexmovedtoCodidact When the “getAllWords” function sees a space, it adds “” to the list. – Seeker Commented Nov 22, 2024 at 8:28
  • 3 If you are following the Dragon Book you should also be aware of the Happy tool (hackage.haskell./package/happy) which is the Haskell version of YACC, and Alex (hackage.haskell./package/alex) which is the Haskell version of Lex. – Paul Johnson Commented Nov 22, 2024 at 10:53
  • @PaulJohnson I thought Lex was a pretty cool tool when I read about it. Thanks for sharing those! – Seeker Commented Nov 22, 2024 at 11:09

1 Answer


The usual approach in Haskell is to use a parser monad. This isn't something covered in the Dragon Book. I'm going to assume you understand monads in general. If you don't then I suggest going and looking at how things like the State, Reader and Random monads work first.

A simple parser monad for your tokens will look like this:

newtype Parser a = Parser { runParser :: [Token] -> [(a, [Token])] }

That means that a Parser takes a list of tokens and outputs a list of possible parses, along with the residual list of tokens that follow each item parsed.

In this case I've made the return type a list. This is the most general solution: it means that every possible parse is returned, allowing you to have an ambiguous grammar. However, it can only signal an error by returning an empty list, with no clue as to what went wrong.

So now you can write a parser rule like:

digit :: Parser (ParseTree Token)
digit = Parser $ \tokens -> case tokens of
  (Digit str : rest) -> [(Node (Digit str) [], rest)]
  _ -> []

This says that a digit parser will succeed if the first token is a Digit. In this case it returns the digit as a single leaf in a parse tree, along with the rest of the list. If it's anything else then the parser fails, returning an empty list of results. The Parser $ \tokens -> bit is just wrapping an anonymous function in the Parser type.

On its own this isn't much use, but you can combine these primitive parsers into more complex ones. The two main combinators you use are concatenation (i.e. parse this, then that) and alternation (i.e. parse either this or that). This is where the monad instances come in.

First, here is the monad instance for your parser.

instance Monad Parser where
  return a = Parser $ \tokens -> [(a, tokens)]
  v >>= f = Parser $ \tokens -> concatMap f2 $ runParser v tokens
    where f2 (a, rest) = runParser (f a) rest
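
Note that on modern GHC, Monad has Applicative (and hence Functor) as a superclass, so you will also need instances along these lines. A sketch for the list-of-results parser, with the types repeated so the snippet stands alone:

```haskell
newtype Parser a = Parser { runParser :: [Token] -> [(a, [Token])] }

data Token = Digit String | Plus | Minus | Multiply | Divide
    deriving (Show, Eq)

instance Functor Parser where
  -- Apply f to each possible parse, leaving the residual tokens alone.
  fmap f p = Parser $ \tokens ->
    [ (f a, rest) | (a, rest) <- runParser p tokens ]

instance Applicative Parser where
  -- The null parser: succeed with a value, consuming nothing.
  pure a = Parser $ \tokens -> [(a, tokens)]
  -- Run pf, then run pa on whatever tokens each parse left over.
  pf <*> pa = Parser $ \tokens ->
    [ (f a, rest') | (f, rest)  <- runParser pf tokens
                   , (a, rest') <- runParser pa rest ]
```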

The return function is a null parser: it returns a value without consuming anything.

The >>= (aka "bind") operator takes an existing parser v of type Parser a and a second function f of type a -> Parser b. It converts this into a single parser which does the v parser and passes the result to f, giving a second parser. This parser is then run on the rest of the tokens (i.e. what was left by the first parser) and returns the result.

I'm going to assume you know how do notation desugars into >>= calls.

So you can write something like this:

expression :: Parser (ParseTree Token)
expression = do
   arg1 <- digit
   op <- operator  -- Exercise: define operator :: Parser Token
   arg2 <- digit
   return $ Node op [arg1, arg2]

That will parse a simple digit + digit expression.

You will also need an operator for alternatives, and then you can construct your general recursive descent parser. See also the Alternative typeclass, which gives you the general typeclass for this concept.
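
For reference, here is a sketch of what an Alternative instance can look like for the list-of-results parser, again with the types and the obvious Functor/Applicative instances repeated so the snippet stands alone:

```haskell
import Control.Applicative (Alternative (..))

data Token = Digit String | Plus | Minus | Multiply | Divide
    deriving (Show, Eq)

newtype Parser a = Parser { runParser :: [Token] -> [(a, [Token])] }

instance Functor Parser where
  fmap f p = Parser $ \ts -> [ (f a, rest) | (a, rest) <- runParser p ts ]

instance Applicative Parser where
  pure a = Parser $ \ts -> [(a, ts)]
  pf <*> pa = Parser $ \ts ->
    [ (f a, rest') | (f, rest)  <- runParser pf ts
                   , (a, rest') <- runParser pa rest ]

instance Alternative Parser where
  -- empty always fails: there are no possible parses.
  empty = Parser $ \_ -> []
  -- p <|> q returns every parse of p and every parse of q,
  -- which is what lets an ambiguous grammar yield several trees.
  p <|> q = Parser $ \ts -> runParser p ts ++ runParser q ts

-- A tiny primitive to demonstrate alternation with.
plusToken :: Parser Token
plusToken = Parser $ \ts -> case ts of
  (Plus : rest) -> [(Plus, rest)]
  _             -> []
```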

I said at the top that the parser type I gave isn't very good. It can't do error messages, and it can be very inefficient if there are multiple ways of parsing some tokens which then turn out not to work. So a more usual approach is like this:

data Result a = Result a | Error String

newtype Parser a = Parser { runParser :: [Token] -> (Result a, [Token]) }

This either returns a definite result or an error message. It's still not very sophisticated: there is no way of tracking line numbers, for instance. But it's an improvement, and will do for a simple expression parser exercise. I'll leave the Monad and other instances for you to figure out. They aren't complicated, except that you will need to decide what to do with multiple error messages in the alternative case. (Clue: make the error case data more sophisticated.)

In response to your queries: your lexer isn't very sophisticated, but it will do for now. It's actually easier to just make your lexer a recursive-descent parser. The only reason to separate lexer and parser in this case is to make error handling easier.

Your result shouldn't be a rose tree. You should define a custom data type for your expressions, something like:

data Expr = Value String | Plus Expr Expr | Times Expr Expr ....

Then you can mirror this structure in your parser.
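
A sketch of how that mirroring can look for the addition case, using hypothetical primitive names (valueP, plusToken) and repeating the list-of-results parser so the snippet stands alone:

```haskell
data Token = Digit String | Plus | Minus | Multiply | Divide
    deriving (Show, Eq)

newtype Parser a = Parser { runParser :: [Token] -> [(a, [Token])] }

instance Functor Parser where
  fmap f p = Parser $ \ts -> [ (f a, rest) | (a, rest) <- runParser p ts ]

instance Applicative Parser where
  pure a = Parser $ \ts -> [(a, ts)]
  pf <*> pa = Parser $ \ts ->
    [ (f a, rest') | (f, rest)  <- runParser pf ts
                   , (a, rest') <- runParser pa rest ]

instance Monad Parser where
  v >>= f = Parser $ \ts ->
    concatMap (\(a, rest) -> runParser (f a) rest) (runParser v ts)

-- The expression type mirrors the grammar directly.
data Expr = Value String | Add Expr Expr
    deriving (Show, Eq)

-- Hypothetical primitives: one for digits, one for '+'.
valueP :: Parser Expr
valueP = Parser $ \ts -> case ts of
  (Digit s : rest) -> [(Value s, rest)]
  _                -> []

plusToken :: Parser Token
plusToken = Parser $ \ts -> case ts of
  (Plus : rest) -> [(Plus, rest)]
  _             -> []

-- digit '+' digit, built straight into an Expr rather than a rose tree.
addP :: Parser Expr
addP = do
  a <- valueP
  _ <- plusToken
  b <- valueP
  return (Add a b)
```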
