Motivated by the C-string use case, and by doing it in Python -- not character by character. In Python, it's also natural to evaluate (or "unescape") the string in the lexer, since tokens can be of arbitrary type.
Python's built-in tokenizer has a nice regex, but it doesn't evaluate the string.
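To illustrate the idea (this is a sketch, not annex's actual code): a lexer rule can match a string literal with one regex and then evaluate it on the spot, so the token carries the actual string value rather than the raw lexeme.

```python
import ast
import re

# Match a double-quoted string literal, allowing backslash escapes.
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')

def lex_string(source, pos):
    """Return (value, end) for the string literal starting at pos."""
    m = STRING_RE.match(source, pos)
    if m is None:
        raise ValueError('no string literal at position %d' % pos)
    # ast.literal_eval does the unescaping; since tokens can be of
    # arbitrary type, we return the evaluated str, not the source text.
    return ast.literal_eval(m.group(0)), m.end()

value, end = lex_string(r'"a\tb" rest', 0)  # value == 'a\tb' (real tab)
```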
I saw the lexer-state idea in Pygments. But Pygments is for syntax highlighting -- it also doesn't evaluate the tokenized string.
So I implemented a very simple lexing class with states.
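A minimal sketch of what such a class can look like (hypothetical names and rule format, not annex's actual API): each state maps to a list of (pattern, token type, next state) rules, and matching a rule can switch states -- e.g. a quote switches into a "string" state with its own rules.

```python
import re

class StateLexer:
    """Toy lexer with states: rules are {state: [(pattern, tok, next_state)]}.
    tok of None means skip the match; next_state of None means stay put."""

    def __init__(self, rules, start='root'):
        self.rules = {state: [(re.compile(p), tok, nxt)
                              for (p, tok, nxt) in pairs]
                      for state, pairs in rules.items()}
        self.start = start

    def tokens(self, s):
        state, pos = self.start, 0
        while pos < len(s):
            for regex, tok, nxt in self.rules[state]:
                m = regex.match(s, pos)
                if m:
                    if tok is not None:
                        yield (tok, m.group(0))
                    pos = m.end()
                    if nxt is not None:
                        state = nxt
                    break
            else:
                raise ValueError('no rule at %d in state %r' % (pos, state))

# The string sublanguage gets its own state, as in Pygments.
rules = {
    'root':   [(r'"', 'QUOTE', 'string'),
               (r'\w+', 'WORD', None),
               (r'\s+', None, None)],
    'string': [(r'"', 'QUOTE', 'root'),
               (r'[^"\\]+', 'CHARS', None),
               (r'\\.', 'ESCAPE', None)],
}

toks = list(StateLexer(rules).tokens('hi "a\\tb"'))
```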
Later I saw that PLY supports it -- it's the last thing in the manual, under "Conditional Lexing". Somehow I didn't read that part. It deserves more prominence!
PLY says its implementation is based on a similar feature in Flex.
However, I don't like how PLY names things: the t_state_TOKEN convention is not intuitive. In general, the metaprogramming in PLY obscures the plumbing -- what depends on what?
Also, I want to generate a graph of lexer states. I think the approach that Annex (and pygments) use is clearer than PLY.
Overall, it seems kind of ad hoc, but is very natural for the use case of mini-languages inside a language. Strings are a mini-language within your language. So are character classes.
How it is self-hosted.
You could just print out modules -- pydoc sort of does this, although the output is too busy.
Comparison to PLY and flex and so forth.
# TPE described in TPE syntax. This is analogous to a PEG described in PEG
# syntax. The difference is that TPE is described in terms of tokens.
#
# CAP words are tokens.

Start      <- Definition+ !.
Definition <- IDENTIFIER '<-' Choice

# These 5 are mutually recursive.
Choice     <- Sequence ('/' Sequence)*
Sequence   <- Prefix+
Prefix     <- (AND / NOT)? Suffix
Suffix     <- Primary (QUESTION / STAR / PLUS)?
Primary    <- TOKEN / DOT / IDENTIFIER !'<-' / '(' Choice ')'
Annex also contains these components:
Those components are used to lex and parse the CRE language.
They are bundled together because they are mostly implemented using each other, in a self-hosting fashion. Together the 3 parts are only a couple thousand lines of code.
The most widely applicable part is probably CREs, but the other components may be useful as well.
annex implements TPEs in a form similar to PLY (Python Lex-Yacc), although the algorithms are different. There is no code generation -- the parsers are compiled into data structures and then interpreted.
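The "compile to data structures, then interpret" approach can be sketched like this (hypothetical structures, not annex's actual ones): each grammar expression becomes a small tuple, and an interpreter tries to match it against a token list, returning the new position on success or None on failure.

```python
def match(expr, tokens, pos):
    """Interpret a compiled PEG-like expression over a token list."""
    op = expr[0]
    if op == 'tok':                      # match one token by type
        if pos < len(tokens) and tokens[pos][0] == expr[1]:
            return pos + 1
        return None
    if op == 'seq':                      # all sub-expressions, in order
        for sub in expr[1:]:
            pos = match(sub, tokens, pos)
            if pos is None:
                return None
        return pos
    if op == 'choice':                   # ordered choice, like PEG '/'
        for sub in expr[1:]:
            end = match(sub, tokens, pos)
            if end is not None:
                return end
        return None
    if op == 'star':                     # zero or more repetitions
        while True:
            end = match(expr[1], tokens, pos)
            if end is None:
                return pos
            pos = end
    raise ValueError('unknown op %r' % op)

# e.g. Choice <- Sequence ('/' Sequence)* might compile to:
choice = ('seq', ('tok', 'SEQ'),
          ('star', ('seq', ('tok', 'SLASH'), ('tok', 'SEQ'))))
tokens = [('SEQ', 'a'), ('SLASH', '/'), ('SEQ', 'b')]
end = match(choice, tokens, 0)           # end == 3: all tokens consumed
```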
The annex.Lexer class is used to implement both the TPE and CRE lexers.
The annex.Lexer abstraction is inspired by the Pygments lexer. A slight difference is that lexers meant to be fed into a parser (specified with TPE or otherwise) typically do more work than lexers meant for syntax highlighting.
It easily implements the CRE and TPE lexers, and can be used for many other programming languages. Most languages are actually composed of small sublanguages (e.g. for literal strings, character classes, etc.), and this lexer conveniently expresses that with a state machine. (This is distinct from any state machine implied by a regular expression.)
TPE is used to implement both the CRE language parser and TPE language itself. Like PEGs, it is self-describing.
In addition, I wanted to implement lexers for programming languages in Python. For performance, this is best done with regular expressions (which are executed in C), as opposed to a character-by-character loop in Python.
Lexers tend to require more complicated regular expressions than other applications do, pushing them further into "line noise" territory. So I decided to create a readable, self-documenting syntax.
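For contrast, Python's re.VERBOSE mode is the standard library's gesture at the same goal -- CRE's syntax is different, but the motivation of naming and commenting the parts of a pattern is shared. A small example:

```python
import re

# A number lexeme written readably: whitespace and comments are ignored
# under re.VERBOSE, and named groups label each part of the pattern.
NUMBER = re.compile(r'''
    (?P<int>   \d+          )   # integer part
    (?P<frac>  \.\d+        )?  # optional fractional part
    (?P<exp>   [eE][-+]?\d+ )?  # optional exponent
''', re.VERBOSE)

m = NUMBER.match('3.14e10')
```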
Last modified: 2013-01-27 10:36:20 -0800