Kalami: Parsing

Next up in the presentation of the little toy language Kalami is parsing. I remember how we used SableCC during my undergraduate years, and I also remember how we simply couldn’t make it pretty no matter what measure we used! Perhaps it can be made better in the Java environment than we ever managed to, but after I saw how it is done in the Caml environment I never looked back: all of a sudden it became fun to write parsers and interpreters!

So, parsing is basically about how to turn a input text string into an in-memory representation in the form of an abstract syntax tree. I don’t want to go into a lot of background but a good source for slightly more details is Chapter 12 in the Ocaml Manual.

The files needed can be found in the ZIP file kalami-parsing.zip; out of these the most important are structure.ml, lexer.mll, and parser.mly.

Starting with structure.ml we first have a piece of Caml code defining what we would like the abstract syntax tree of a Kalami programme to look like:

type identifier = string
type number = int
type bound = int

type expression =
	ExprNumber of number
	| ExprVariable of identifier
	| ExprProduct of expression * expression
	| ExprDivision of expression * expression
	| ExprSum of expression * expression
	| ExprSubtraction of expression * expression
	| ExprPower of expression * expression

type condition =
	CndTrue
	| CndFalse
	| CndEqual of expression * expression
	| CndLessthan of expression * expression
	| CndNegation of condition
	| CndOr of condition * condition
	| CndAnd of condition * condition

type statement =
	StmtLet of identifier * expression * statement
	| StmtIf of condition * statement * statement
	| StmtGuess of identifier * statement
	| StmtGuessLowerBound of identifier * ...
	| StmtGuessLowerUpperBound of identifier * ...
	| StmtAccept
	| StmtReject

Essentially, the above code defines expressions, conditions, and statements as labelled tuples, such that when we for instance write “4*5” in Kalami then this can be represented in memory as expression ExprProduct(ExprNumber(4), ExprNumber(5)). We shall later see that in the end a Kalami programme is simply a statement.

By the way, note the elegance with which we can express an abstract syntax tree in Caml: were we to express the same in Java it would look a lot less intuitive, at least in my experience. This is one of the great strengths of Caml, of which there are plenty more when we start evaluating and analysing the abstract syntax tree in later posts.

Next up is the lexer responsible for the first layer of abstraction of the input, namely turning the raw string into a stream of tokens. Looking in lexer.mll we see that several regular expressions are used to abstract strings, from the very basic labelling of characters and keywords:

	| '('				{ LPAREN }
	| ')'				{ RPAREN }
	| '+'				{ PLUS }
	| '-'				{ MINUS }

	| "let"				{ LET }
	| "in"				{ IN }
	| "if"				{ IF }
	| "then"			{ THEN }

to the slightly more advanced regular expressions:

	| (['0'-'9']+ as i)		{ INT(int_of_string i) }

that also includes a native Caml instruction for turning a string into a number.

Besides this abstraction of strings as tokens not much happens in the lexer, perhaps apart from the little extra code needed for allowing nested comments, and the insistence that STRs, and in turn identifiers, start with a digit (an easy way to distinguish them from INTs).

However, in parser.mly we may use these tokens to put together a grammar. In particular we see from rule main that an input (now a string of tokens) is a Kalami programme if it can be parsed according to rule “stmt” and its sub-rules as given by:

id:
	STR				{ $1 }
;

expr:
	INT				{ ExprNumber($1) }
	| id				{ ExprVariable($1) }
	| expr MULT expr		{ ExprProduct($1, $3) }
	| expr DIV expr			{ ExprDivision($1, $3) }
	| expr PLUS expr		{ ExprSum($1, $3) }
	| expr MINUS expr		{ ExprSubtraction($1...) }
	| expr EXP expr			{ ExprPower($1, $3) }
	| LPAREN expr RPAREN		{ $2 }
;

cnd:
	expr EQ expr			{ CndEqual($1, $3) }
	| expr LT expr			{ CndLessthan($1, $3) }
	| NOT cnd			{ CndNegation($2) }
	| cnd OR cnd			{ CndOr($1, $3) }
	| cnd AND cnd			{ CndAnd($1, $3) }
;

stmt:
	LET id ASSIGN expr IN stmt	{ StmtLet($2, $4, $6) }
	| IF cnd THEN stmt OTHERWI...	{ StmtIf($2, $4, $6) }
	| GUESS id IN stmt		{ StmtGuess($2, $4) }
	| GUESS id FROM expr IN ...	{ StmtGuessLowerBo... }
	| GUESS id FROM expr TO ...	{ StmtGuessLowerUp... }
	| ACCEPT			{ StmtAccept }
	| REJECT			{ StmtReject }
;

Besides these rules the file also contains instructions defining the precedence of for example the mathematical operations as well as their associativity, as needed to help the parser solve ambiguity.

Having looked at the primary files for the parser we turn to the plumbing needed to put them together. File printing.ml simply takes a statement and recursively prints it on the screen, and file kalamiparsing.ml contains Caml code that reads the input, invokes the lexer and the parser, and prints the statement in case no error occurred:

open Printing

let _ =
	try
		let lexbuf = Lexing.from_channel stdin in
		let stmt = Parser.main Lexer.token lexbuf in

		print_string "\n*** Statement ***\n\n";
		print_stmt stmt;
		print_string "\n\n"

	with 
		Lexer.Eof ->
			print_string "End of file\n";
			exit 0
		| Parsing.Parse_error ->
			print_string "Parsing error\n";
			exit 1

The only thing missing now is how to compile everything. This is not completely straight-forward since the lexer and the parser are referencing each other, and as a result we must execute the compilation commands in the specific order as seen in file compile.sh. Doing so will produce the kalamiparsing binary which we can tell to either load a programme from a file through

./kalamiparsing < "inputfile"

or run “interactively” by simply executing

./kalamiparsing

and typing a programme directly into it (ending with an EOF signal by pressing control + d).

In the next post we’ll look at how we may evaluate an in-memory Kalami programme in order to find a satisfying guess; as mentioned previously this requires something slightly more involved than a trivial brute-force.

Leave a Reply

Your email address will not be published.