nyala

1 Introduction

This is a parser generator to ease writing traditional LALR(1) parsers with Common Lisp. It provides a surface syntax to formulate both the phrase grammar and token grammar in one definition. A very simple parser looks like

(defun eval-expr (input)
  (parse (input)
    (    sum -> sum "+" product       => (+ $1 $3)
             -> product)
    (product -> product "*" factor    => (* $1 $3)
             -> factor)
    (factor  -> "(" sum ")"           => $2
             -> :number)
    (:number -> "[0-9]+"              => (parse-integer $$))
    (        -> "\\s")))              ;ignore whitespace
   EVAL-EXPR
   (eval-expr "1 + 2 + 3")
   6
   (eval-expr "1+2*3")
   7

The LALR(1) table is then genarated by a modified version of Mark Johnson's lalr.cl. The lexical scanner is generated by CLEX.

2 Dictionary

The overall syntax of a grammar is as follows. We use the the Modified BNF Syntax as defined by ANSI-CL.

grammar    :=    {↓rule}*
rule    :=    (non-terminal {->rhs}+)
      option
rhs    :=    {↓pattern}* [=>action]
pattern    :=    non-terminal
      token
      (*rhs)
      (+rhs)
      (?rhs)
      (++patternrhs)
      (**patternrhs)
      (or {↓pattern}*)
      (andrhs)
non-terminal    :=    a symbol but not a keyword
token    :=    a keyword or a string
action    :=    a Lisp form
option    :=    (:precedence {↓precedence}*)
precedence    :=    ({:left | :right | :nonassoc} {↓token}*)

Within an action symbols named $i ($1, $2, …) refer to the ith semantic action. The package of these symbols is not important but the symbols must appear lexically within the action.

As with loop the symbols ->, =>, ?, *, +, **, ++ are compared with string=, thus the package is not important here.

Other rules or options are passed to CLEX2.

2.1 Rules

Each rule looks like

    (non-terminal {->rhs}+)

That is a non-terminal name followed by a list of productions right-hand-sides each preceeded by ->. Non-terminals are named by symbols, which may not be keywords. Each such production may include a semantic action preceeded by => which is evaluated and becomes the semantic value of this production.

The RHS items are patterns. In the simplest case such a pattern is a symbol naming another non-terminal or a keyword naming a lexical category. As a convenience strings are accepted as lexical categories as well.

2.1.1 Examples

(sum -> sum "+" product
     -> product)

This rule has two productions. The first says that a sum is either a sum followed by the token "+" and a product, the second says it may also be just a product.

We may want to add a semantic action to the first production so that the semantic value of the sum non-terminal when matched is indeed the sum of the arguments. Like

(sum -> sum "+" product     => (+ $1 $3)
     -> product)

The Lisp form (+ $1 $3) is evaluated with $1 lexically defined to be the semantic value of the sum mentioned in the RHS and $3 being the semantic value of the third item in the RHS namely the product non-terminal. We do not need an explicit semantic action for the second production -> product because a semantic action returning just $1 is the default.

2.2 Patterns

2.2.1 Tokens

Token categories (terminals) are named by keywords. Any rule that defines such a terminal is taken to be a scanner rule and is passed to the scanner generator CLEX. For example a simple rule to define a token category :integer can look like

(:integer -> "[0-9]+" => (parse-integer $$))

As a convience any string appearing in patterns implicitly defines a token category with a scanner rule matching the exact match for that string. That is

(expression -> expression "or" expression)

and

(expression -> expression :or expression)
(:or -> "or")

are equivalent. The string is not taken as a regular expression. Thus e.g. "." would name the token being exactly one dot character and not the regular expression "." which would be any character but newline.

The tokens are named by the string-upcase version of the token string in the keyword package. There currently is no way to configure that.

2.2.2 Iteration

(*rhs)

Matches zero or more of rhs. The semantic value is a list of all the semantic value of rhs.

E.g. (* foo) would be a list of foos.

However the rhs can also be multiple patterns matched as concatenation with an optional semantic action. Like

(bar -> … (* foo => `(:foo ,$1)) …)
(bar -> … (* "," foo => $2) …)

The numbering used for $n is within this (*) subpattern.

(+rhs)

Same as (* rhs), but matches one or more of rhs.

(?rhs)

Option. Matches zero or one rhs. The semantic value of this pattern is either nil when rhs did not appear, or the semantic value of rhs. As with other iterations => can be used to specify a semantic action, like

… (? "to" destination => `(:to ,$2)) …
(** separator rhs)

A list of zero or more occurrences of rhs separated by the pattern separator. The semantic value of this pattern is a list of the rhs appeared. The following specifies a possibly empty sequence of comma-separated foos.

… (** "," foo) …
(++ separator rhs)

Same as ** but at least one rhs must be present.

2.2.3 Combinations

(and rhs)

Matches the given rhs. Basically a subpattern also allowing for specifying a semantic action.

(or {pattern}*)

An alternative.

2.3 Precedence

2.4 Source Location Information

3 Practical Considerations


:no-auto-token-pattern regular-expression Parser Option

    

Specifies a pattern as a regular expression that when applied to keywords given with string notation would not turn into automatic scanner token definitions.

The idea is that when you have a symbol table it is usually faster to not have specific scanner rules for those keywords, but to enter those keywords into the symbol table.

Example

Suppose this tiny grammar:

(defun parse-bool-expr (input)
  (parse (input)
    (:precedence (:left "or") (:left "and"))
    (expr -> expr "and" expr => `(and ,$1 ,$3)
          -> expr "or" expr => `(or ,$1 ,$3)
          -> :identifier
          -> "(" expr ")")
    (:identifier -> "[a-zA-Z]+" => (intern $$))))

This under the hood has two scanner rules like:

(:and -> "and")
(:or => "or")

However, since we call intern on identifiers already, we might want to enter those keywords there as well. For illustrative purposes, we use a second hash *keyword-hash* for this purpose

(defparameter *keyword-hash*
  (let ((ht (make-hash-table :test #'equal)))
    (setf (gethash "and" ht) :and
          (gethash "or" ht) :or)
    ht))

(defun parse-bool-expr (input)
  (parse (input)
    ;; Say that "and" and "or" don't generate auto scanner rules.
    (:no-auto-token-pattern "[a-z]+")
    ︙
    ;; Now match "[a-zA-Z]+" and look up what we got in *keyword-hash*
    ;; first, report an identifier else.
    (-> "[a-zA-Z]+"
     => (return (or (gethash $$ *keyword-hash*)
                    (values :identifier (intern $$)))))))

This technique will significantly reduce the size of the scanner when there are a lot of keywords. It would also speed things up, especially when interning putative identifiers with a single hashtable.