Thursday, November 27, 2008

Parsing heredocs with PIRC

The PIR language allows you to write so-called heredocs, as known from Perl (and I believe the idea was stolen from some older languages). If you're not a Perl programmer (like me), you might wonder what the heck I'm talking about. So, let's start with a simple example (using PIR syntax):
$S0 = <<'CODE'
This is a heredoc string
So, using the special <<"CODE" is stored in register $S0. So, the string stored in $S0 is:
"\nThis is a heredoc string\nover\nmultiple\nlines\n"
Want to see for yourself? 
cd compilers/pirc
export LD_LIBRARY_PATH=../../blib/lib
./pirc <file>
Not too difficult, eh? Well... you're right, until you use multiple heredocs in, for instance, subroutine arguments. Many Parrot tests use Perl scripts to specify the input (code) and the expected output. That's done like this:
pir_output_is(<<'CODE', <<'OUTPUT');
.sub main
   say 42
PIR also allows multiple ("nested" if you want, but they're not symmetrically nested; a better word would be "overlapping") heredocs. The current PIR compiler (IMCC) cannot handle this, but the new PIR compiler (currently still under development) PIRC, can. In an attempt to maximize the "truck number" of PIRC (which is currently estimated to be 1), I'm trying to do as much documentation of PIRC as possible. In this article, I'll be discussing how heredoc parsing is done in PIRC.

Single heredoc parsing
Let's start out simple with a single heredoc string. The heredoc preprocessor is implemented in the file compilers/pirc/new/hdocprep.l. Yes, that's right, that's not a C file, but a Lex specification file. Using Lex (and it's probably Flex that you'll be running, a faster implementation of Lex), the .l file is converted into a C file.

Now, for this discussion I'll assume that you know a bit about using Flex. A particular (advanced) feature of Flex-generated scanners (or "lexer" if you want) is the use of states. Using states you can implement several scanners in a single specification. Based on the state that the lexer is in, it'll recognize a subset of the rules. If no rule is specified, then the built-in, default INITIAL state is assumed. States can be pushed and popped of a stack, also managed by the generated C file. Whenever a new state is pushed onto the stack, that becomes the active state, and all rules that are specific to that state will become active. Once the state is popped of the stack, the rules fall out of scope again, and the state at the top of the stack (after popping) will become active.

So, the trick that's used is, as soon as the heredoc marker is read (e.g. <<'CODE'), we enter a new state, which will read the input line by line. All rules of the previous state are no longer active, so strings such as "<<'OUTPUT'" won't be matched, but instead are stored in a string buffer. After reading each line, it will check whether the heredoc end marker was read. As soon as the end marker is read, the heredoc scanning state will be popped, and scanning will continue in the initial state.

Single heredoc strings as subroutine arguments
Consider the following PIR line:
foo(<<'CODE', 42, 'hi')
After reading <<'CODE', the scanner will continue scanning in the heredoc state. However, the rest of the line, containing ", 42, 'hi')" should be stored somewhere, as this is not part of the heredoc.  So, the "rest of the line" is scanned, stored in a buffer, and then the heredoc scanning can start. After reading the end marker of heredoc, the scanned heredoc is printed to the output, after which the "rest of the line" can be scanned. Not so hard, eh?

Scanning multiple heredocs
Things get more interesting when you need to scan input like the following:
foo(<<'CODE', 42, <<'OUTPUT', 'hi')
.sub main
 say 42
Scanning the first heredoc string works fine; the "rest of the line" is then:
", 42, <<'OUTPUT', 'hi')"
After scanning the first heredoc, we tell the lexer to start retrieving the next characters from the "rest of the line" buffer. When scanning this line buffer, we'll encounter the <<'OUTPUT' marker, which indicates another heredoc string. At this point, we need to re-save the "rest of the line", which will contain:
, 'hi')
At this point, we need to continue scanning from the input file, after the point we left off after scanning the first heredoc. So, off we go, again switching input buffers in the lexer, so that the contents of the second heredoc string are scanned. Again, scanning line by line, checking for the heredoc end marker. Again, once we find the end marker, we'll switch back to the "rest of line" buffer, and finish scanning that string. Once we read EOF on the line buffer, again we switch back to the input file, and the rest of the file is scanned as if nothing happened.

Heredoc preprocessing
The heredoc handling is implemented as a separate pass over the input file. This was done to keep the complexity of the lexer manageable. Although this does result in more I/O operations, having better readable code is, IMHO anyway, a more important goal than presumably faster code, that is hard to maintain. Currently, the output of the heredoc preprocessor is written to a temporary file, of which the PIR lexer is reading.

Including files
The PIR language has a C-preprocessor-like #include directive, spelled as ".include". Logically, it's part of the macro layer, as the include directive is replaced by the contents of the specified file. However, as an included file may contain heredoc strings as well, the whole invocation scheme of the PIR compiler becomes more complex, as that included file must be handled by the heredoc preprocessor as well. 

So, PIRC cheats a bit. Instead of implementing the .include directive in the macro layer, it is moved "forward" to the heredoc preprocessor. This way, the heredoc preprocessor converts all heredoc strings into normal strings (a process I like to call "flattening the string", as the multi-line string is now flat, meaning it's a one-line string), and stores the contents of all included files into the temporary file. PIRC's normal lexer only ever sees one single input file, and does not need to handle the .include directive.

This article discusses the implementation of the heredoc preprocessor. PIRC allows you to use multiple heredoc strings in a single PIR statement, which introduces some complexity to the lexer. For that reason, heredocs are processed in a separate pass over the input. Besides handling heredocs, the .include directive is handled in the heredoc preprocessor as well, to make life a bit easier for myself.