META Parser - Classical Parsing System

580 words | 2016-5-22

To build a document tree, we need the HTML parser. Plain. In January 1991, an article by Henry G. Baker appeared “Pragmatic parsing on Common Lisp”, in which he describes META - a classic simple yet effective technique for constructing recursive descending parsers.

META Language

The META compiler is a set of macros that fit into a fifty lines. It is this simplicity that determined the choice of the HTML parser for a toy web-engine.

For real problems never use regular expressions for parsing. If the temptation is not lost, then read the article again.

META expressions consist of characters, strings, the sequence [], the alternative {}, the Kline star $, the symbol test by the condition @, and the evaluation of the expression !.

This is how the parsing of an integer looks like, with the simultaneous calculation of its actual value:

(deftype digit () '(member #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9))

(defun ctoi (d) (- (char-code d) #.(char-code #\0)))

(defun parse-int (&aux (s +1) d (n 0))
   [{#\+ [#\- !(setq s -1)] []}
    @(digit d) !(setq n (ctoi d))
    $[@(digit d) !(setq n (+ (* n 10) (ctoi d)))]])
  (* s n)))

! Is a powerful META construct that allows you to do interesting things, such as modifying grammar on the flight.


META expressions are converted by reader-macros into an internal representation - the meta structure (the print function is only for debugging):

(defstruct (meta
               (lambda (m s d &aux (char (meta-char m)) (form (meta-form m)))
                 (ecase char
                   ((#\@ #\! #\$) (format s "~A~A" char form))
                   (#\[ (format s "[~{~A~^ ~}]" form))
                   (#\{ (format s "{~{~A~^ ~}}" form))))))
(defun meta-reader (s c) (make-meta :char c :form (read s)))

Recognize META statements, nothing but the creation of meta structures:

(mapc #'(lambda (c) (set-macro-character c #'meta-reader)) '(#\@ #\$ #\!))

Recognize the sequence and alternatives:

(set-macro-character #\[
  #'(lambda (s c) (make-meta :char c :form (read-delimited-list #\] s t))))

(set-macro-character #\{
  #'(lambda (s c) (make-meta :char c :form (read-delimited-list #\} s t))))

(mapc #'(lambda (c) (set-macro-character c (get-macro-character #\) nil)))
      '(#\] #\}))

And finally, the META compiler:

(defun compileit (x)
  (typecase x
      (ecase (meta-char x)
        (#\! (meta-form x))
        (#\[ `(and ,@(mapcar #'compileit (meta-form x))))
        (#\{ `(or ,@(mapcar #'compileit (meta-form x))))
        (#\$ `(not (do () ((not ,(compileit (meta-form x)))))))
        (#\@ (let ((f (meta-form x))) `(match-type ,(car f) ,(cadr f))))))
    (t `(match ,x))))

(defmacro matchit (x) (compileit x))

It remains to determine how to feed the parser input data. Article offers options such as reading from the stream, from the string and from the list. Our toy engine will read from the line at least because it’s so much easier to “roll back” a few characters back.

(defmacro match (x)
  (etypecase x
      `(when (and (< index end) (eql (char str index) ',x))
         (incf index)))
      `(let ((old-index index)) ; 'old-index is a lexical variable.
         (or (and ,@(map 'list #'(lambda (c) `(match ,c)) x))
             (progn (setq index old-index) nil))))))

(defmacro match-type (x v)
  `(when (and (< index end) (typep (char str index) ',x))
     (setq ,v (char str index)) (incf index)))

In order for these macros to work, the string should be described as lexical variables str, index and end.

And one note of farewell: the kernel uses reader-macros, so you need to make sure that all the functions that are called when the META compiler is running are available at compile time. You should use (eval-when (: compile-toplevel) ...) in such cases.