nyala

This is a parser generator to ease writing traditional LALR(1) parsers with Common Lisp. It provides a surface syntax to formulate both the phrase grammar and token grammar in one definition. A very simple parser looks like

The LALR(1) table is then generated by a modified version of Mark Johnson's lalr.cl. The lexical scanner is generated by CLEX.

2 Dictionary

The overall syntax of a grammar is given below. We use the Modified BNF Syntax as defined by ANSI-CL.

Within an action, symbols named $i ($1, $2, …) refer to the semantic value of the ith item of the right-hand side. The package of these symbols is not important, but the symbols must appear lexically within the action.

As with loop the symbols ->, =>, ?, *, +, **, ++ are compared with string=, thus the package is not important here.
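The effect of comparing by name can be tried out in plain Common Lisp. The following is a minimal sketch of the idea, not the library's own code; marker-p is a hypothetical helper introduced only for illustration.

```lisp
;; Hypothetical helper, not part of the library: true if OBJECT is a
;; symbol whose name equals NAME, regardless of its home package.
(defun marker-p (object name)
  (and (symbolp object)
       (string= (symbol-name object) name)))

(marker-p '-> "->")            ; symbol read in the current package
(marker-p :-> "->")            ; keyword with the same name
(marker-p 'cl-user::=> "=>")   ; explicitly package-qualified symbol
(marker-p "->" "->")           ; a string is not a symbol
```

All calls but the last return true, which is why a grammar can be written in any package without importing these marker symbols.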

2.1 Rules

A rule is a non-terminal name followed by a list of production right-hand sides, each preceded by ->. Non-terminals are named by symbols, which may not be keywords. Each production may include a semantic action, preceded by =>, which is evaluated and whose value becomes the semantic value of the production.

The RHS items are patterns. In the simplest case such a pattern is a symbol naming another non-terminal or a keyword naming a lexical category. As a convenience strings are accepted as lexical categories as well.

2.1.1 Examples
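A rule for sums, written in the syntax of section 2.1, might look like the following sketch. The non-terminal names sum and product are taken from the description below; the fragment belongs inside a grammar definition and is not runnable on its own.

```lisp
(sum -> sum "+" product
     -> product)
```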

This rule has two productions. The first says that a sum may be a sum followed by the token "+" and a product; the second says that it may also be just a product.

We may want to add a semantic action, (+ $1 $3), to the first production so that the semantic value of the sum non-terminal, when matched, is indeed the sum of the arguments.

The Lisp form (+ $1 $3) is evaluated with $1 lexically bound to the semantic value of the sum mentioned in the RHS and $3 bound to the semantic value of the third RHS item, namely the product non-terminal. We do not need an explicit semantic action for the second production, -> product, because an action returning just $1 is the default.
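With the action in place, the rule discussed above would presumably read as follows. This is a sketch in the syntax of section 2.1; like any rule, it belongs inside a grammar definition and is not runnable on its own.

```lisp
(sum -> sum "+" product => (+ $1 $3)   ; explicit action: add both operands
     -> product)                       ; no action given: defaults to $1
```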

2.2 Patterns

2.2.1 Tokens

Token categories (terminals) are named by keywords. Any rule that defines such a terminal is taken to be a scanner rule and is passed to the scanner generator CLEX. For example a simple rule to define a token category :integer can look like
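A scanner rule along these lines might look like the sketch below. It is modeled on the :identifier rule shown under Practical Considerations: the use of $$ for the matched text is taken from that example, and that CLEX accepts this exact regular expression is an assumption. parse-integer is the standard Common Lisp function.

```lisp
;; Sketch: one or more decimal digits, converted to a Lisp integer.
(:integer -> "[0-9]+" => (parse-integer $$))
```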

As a convenience, any string appearing in a pattern implicitly defines a token category, together with a scanner rule that matches exactly that string.

Writing the string in a pattern is therefore equivalent to using a keyword-named token and giving an explicit scanner rule for it. The string is not taken as a regular expression. Thus, for example, "." names the token consisting of exactly one dot character and not the regular expression ".", which would match any character but newline.
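To illustrate the equivalence, the two fragments below should describe the same thing. This is a sketch; the non-terminal expr and the token :and are borrowed from the example under Practical Considerations.

```lisp
;; String notation: the scanner rule for "and" is implied.
(expr -> expr "and" expr)

;; Keyword notation with the scanner rule spelled out.
(expr -> expr :and expr)
(:and -> "and")
```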

The tokens are named by the string-upcase version of the token string in the keyword package. There currently is no way to configure that.
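The naming scheme can be reproduced in plain Common Lisp. This is a minimal sketch of the described behaviour, not the library's code; token-keyword is a hypothetical name.

```lisp
;; Hypothetical helper: the keyword that names the token for STRING.
(defun token-keyword (string)
  (intern (string-upcase string) :keyword))

(token-keyword "and")   ; the keyword :AND
(token-keyword "+")     ; the keyword :+
(token-keyword "(")     ; the keyword :|(|
```

This is why the string "and" in a pattern and the keyword :and name the same token category.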

2.2.2 Iteration

2.2.3 Combinations

2.3 Precedence

3 Practical Considerations

:no-auto-token-pattern

regular-expression

Parser Option

Specifies a regular expression. Keywords given in string notation that match this expression are not turned into automatic scanner token definitions.

The idea is that when you have a symbol table it is usually faster to not have specific scanner rules for those keywords, but to enter those keywords into the symbol table.

Example

Suppose this tiny grammar:

(defun parse-bool-expr (input)
  (parse (input)
    (:precedence (:left "or") (:left "and"))
    (expr -> expr "and" expr => `(and ,$1 ,$3)
          -> expr "or" expr => `(or ,$1 ,$3)
          -> :identifier
          -> "(" expr ")")
    (:identifier -> "[a-zA-Z]+" => (intern $$))))

Under the hood, this grammar has two implicit scanner rules:

(:and -> "and")
(:or -> "or")

However, since we already call intern on identifiers, we might want to enter those keywords there as well. For illustrative purposes, we use a second hash table, *keyword-hash*, for this:

(defparameter *keyword-hash*
  (let ((ht (make-hash-table :test #'equal)))
    (setf (gethash "and" ht) :and
          (gethash "or" ht) :or)
    ht))

(defun parse-bool-expr (input)
  (parse (input)
    ;; Say that "and" and "or" don't generate auto scanner rules.
    (:no-auto-token-pattern "[a-z]+")
    ︙
    ;; Now match "[a-zA-Z]+" and look up what we got in *keyword-hash*
    ;; first; otherwise report an identifier.
    (-> "[a-zA-Z]+"
     => (return (or (gethash $$ *keyword-hash*)
                    (values :identifier (intern $$)))))))

This technique will significantly reduce the size of the scanner when there are a lot of keywords. It would also speed things up, especially when interning putative identifiers with a single hashtable.
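The lookup performed by the scanner action can be tried in isolation. The sketch below is plain Common Lisp without the parser; *demo-keywords* and classify-word are hypothetical names, and the argument text stands in for the matched string $$.

```lisp
;; Hypothetical stand-in for *keyword-hash*, so the sketch is self-contained.
(defparameter *demo-keywords*
  (let ((ht (make-hash-table :test #'equal)))
    (setf (gethash "and" ht) :and
          (gethash "or" ht) :or)
    ht))

;; Returns the keyword token category if TEXT is a known keyword;
;; otherwise returns two values, :IDENTIFIER and the interned symbol.
(defun classify-word (text)
  (or (gethash text *demo-keywords*)
      (values :identifier (intern text))))

(classify-word "and")   ; one value: :AND
(classify-word "foo")   ; two values: :IDENTIFIER and the symbol named "foo"
```

The or form passes through all values of its last subform, so the identifier case yields both the token category and its semantic value, while a keyword hit yields only the category.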

grammar	:=	{↓rule}*
rule	:=	`(`↓non-terminal {`->` ↓rhs}⁺`)`
	|	↓option
rhs	:=	{↓pattern}* [`=>` ↓action]
pattern	:=	↓non-terminal
	|	↓token
	|	`(*` ↓rhs`)`
	|	`(+` ↓rhs`)`
	|	`(?` ↓rhs`)`
	|	`(++` ↓pattern ↓rhs`)`
	|	`(**` ↓pattern ↓rhs`)`
	|	`(or` {↓pattern}*`)`
	|	`(and` ↓rhs`)`
non-terminal	:=	a symbol but not a keyword
token	:=	a keyword or a string
action	:=	a Lisp form
option	:=	`(:precedence` {↓precedence}*`)`
precedence	:=	`(`{`:left` | `:right` | `:nonassoc`} {↓token}*`)`