This is a parser generator to ease writing traditional LALR(1) parsers with Common Lisp. It provides a surface syntax to formulate both the phrase grammar and token grammar in one definition. A very simple parser looks like
(defun eval-expr (input) (parse (input) ( sum -> sum "+" product => (+ $1 $3) -> product) (product -> product "*" factor => (* $1 $3) -> factor) (factor -> "(" sum ")" => $2 -> :number) (:number -> "[0-9]+" => (parse-integer $$)) ( -> "\\s"))) ;ignore whitespace → EVAL-EXPR (eval-expr "1 + 2 + 3") → 6 (eval-expr "1+2*3") → 7
The LALR(1) table is then genarated by a modified version of Mark
Johnson's lalr.cl
. The lexical scanner is generated by
CLEX.
The overall syntax of a grammar is as follows. We use the the Modified BNF Syntax as defined by ANSI-CL.
grammar | := | {↓rule}* | ||
rule | := |
( ↓non-terminal {->
↓rhs}+)
|
||
↓option | ||||
rhs | := |
{↓pattern}* [=>
↓action]
|
||
pattern | := | ↓non-terminal | ||
↓token | ||||
(* ↓rhs)
|
||||
(+ ↓rhs)
|
||||
(? ↓rhs)
|
||||
(++ ↓pattern
↓rhs)
|
||||
(** ↓pattern
↓rhs)
|
||||
(or
{↓pattern}*)
|
||||
(and ↓rhs)
|
||||
non-terminal | := | a symbol but not a keyword | ||
token | := | a keyword or a string | ||
action | := | a Lisp form | ||
option | := |
(:precedence
{↓precedence}*)
|
||
precedence | := |
( {:left | :right |
:nonassoc }
{↓token}*)
|
Within an action symbols named $
i
($1
, $2
, …) refer to the ith
semantic action. The package of these symbols is not important but the
symbols must appear lexically within the action.
As with loop
the symbols ->
, =>
,
?
, *
, +
, **
,
++
are compared with string=
, thus the package
is not important here.
Other rules or options are passed to CLEX2.
Each rule looks like
(
↓non-terminal
{->
↓rhs}+)
That is a non-terminal name followed by a list of productions
right-hand-sides each preceeded by ->
. Non-terminals are
named by symbols, which may not be keywords. Each such production may
include a semantic action preceeded by =>
which is
evaluated and becomes the semantic value of this production.
The RHS items are patterns. In the simplest case such a pattern is a symbol naming another non-terminal or a keyword naming a lexical category. As a convenience strings are accepted as lexical categories as well.
(sum -> sum "+" product -> product)
This rule has two productions. The first says that a sum
is
either a sum
followed by the token "+"
and a
product
, the second says it may also be just a
product
.
We may want to add a semantic action to the first production so that the
semantic value of the sum
non-terminal when matched is
indeed the sum of the arguments. Like
(sum -> sum "+" product => (+ $1 $3) -> product)
The Lisp form (+ $1 $3)
is evaluated with $1
lexically defined to be the semantic value of the sum
mentioned in the RHS and $3
being the semantic value of the
third item in the RHS namely the product
non-terminal. We
do not need an explicit semantic action for the second production
-> product
because a semantic action returning just
$1
is the default.
Token categories (terminals) are named by keywords. Any rule that
defines such a terminal is taken to be a scanner rule and is passed to
the scanner generator CLEX. For example a simple rule to define a token
category :integer
can look like
(:integer -> "[0-9]+" => (parse-integer $$))
As a convience any string appearing in patterns implicitly defines a token category with a scanner rule matching the exact match for that string. That is
(expression -> expression "or" expression)
and
(expression -> expression :or expression) (:or -> "or")
are equivalent. The string is not taken as a regular expression. Thus
e.g. "."
would name the token being exactly one dot
character and not the regular expression "."
which would be
any character but newline.
The tokens are named by the string-upcase
version of the
token string in the keyword
package. There currently is no
way to configure that.
(*
↓rhs)
Matches zero or more of rhs. The semantic value is a list of all the semantic value of rhs.
E.g. (* foo)
would be a list of foo
s.
However the rhs can also be multiple patterns matched as concatenation with an optional semantic action. Like
(bar -> … (* foo => `(:foo ,$1)) …) (bar -> … (* "," foo => $2) …)
The numbering used for $
n is within this
(*
…)
subpattern.
(+
↓rhs)
Same as (*
rhs)
, but matches
one or more of rhs.
(?
↓rhs)
Option. Matches zero or one rhs. The semantic value of this
pattern is either nil
when rhs did not appear,
or the semantic value of rhs. As with other iterations
=>
can be used to specify a semantic action, like
… (? "to" destination => `(:to ,$2)) …
(**
separator rhs)
A list of zero or more occurrences of rhs separated by the
pattern separator. The semantic value of this pattern is a
list of the rhs appeared. The following specifies a
possibly empty sequence of comma-separated foo
s.
… (** "," foo) …
(++
separator rhs)
Same as **
but at least one rhs must be
present.
(and
rhs)
Matches the given rhs. Basically a subpattern also allowing for specifying a semantic action.
(or
{pattern}*)
An alternative.
:no-auto-token-pattern | regular-expression | Parser Option |
Specifies a pattern as a regular expression that when applied to keywords given with string notation would not turn into automatic scanner token definitions. The idea is that when you have a symbol table it is usually faster to not have specific scanner rules for those keywords, but to enter those keywords into the symbol table. ExampleSuppose this tiny grammar: (defun parse-bool-expr (input) (parse (input) (:precedence (:left "or") (:left "and")) (expr -> expr "and" expr => `(and ,$1 ,$3) -> expr "or" expr => `(or ,$1 ,$3) -> :identifier -> "(" expr ")") (:identifier -> "[a-zA-Z]+" => (intern $$)))) This under the hood has two scanner rules like: (:and -> "and") (:or => "or")
However, since we call (defparameter *keyword-hash* (let ((ht (make-hash-table :test #'equal))) (setf (gethash "and" ht) :and (gethash "or" ht) :or) ht)) (defun parse-bool-expr (input) (parse (input) ;; Say that "and" and "or" don't generate auto scanner rules. (:no-auto-token-pattern "[a-z]+") ︙ ;; Now match "[a-zA-Z]+" and look up what we got in *keyword-hash* ;; first, report an identifier else. (-> "[a-zA-Z]+" => (return (or (gethash $$ *keyword-hash*) (values :identifier (intern $$))))))) This technique will significantly reduce the size of the scanner when there are a lot of keywords. It would also speed things up, especially when interning putative identifiers with a single hashtable. |