Section notes: Week 3

CS 164, Fall 2005

Follow-up from last time

Lexing approaches

CL lexing and regex tools

Strategies for the MJ lexer

MiniJava lexical structure is simple. You have words (which can be divided into keywords or identifiers by table lookup), numbers, operators and punctuation, comments, and white space. The only operator that uses more than one character is &&; the only non-word token types which cannot be distinguished from one lookahead character are the two types of comments (both start with /). It's easy to lex by hand; if you use a tool, make sure it saves you work.

The caveat about tools holds generally. If your tools get in your way, think about changing tools, or enhancing an existing tool, or just using a different tool for the hard part. For instance, if you ever read a flex book, you'll see mention of different lexer states; this is a hack to make it easy to process things like C comments. As another example, you'll find that you can sometimes get smaller, simpler state tables (for lexing or for parsing) by deferring a little bit of the work to a later stage: distinguishing identifiers from keywords by a table lookup, or making the grammar accept some illegal syntax which is caught in a post-processing step.

More Lisp-y goodness

You can look ahead one character in a Lisp stream using peek-char. However, it's helpful to spend time thinking about an abstraction which allows more than one character lookahead. The lstream.lisp file in the software directory shows you how to implement an augmented stream with row-number and col-number tracking and with arbirtrary lookahead. This file illustrates a few useful Lisp techniques you may not have seen in assignment 0: structures, macros, backquoted list builders, and multiple return values.

Regular expression exercises

Question for next time

The Nickle programming language (nickle.org) began life two decades ago when Keith Packard of HP decided he needed a system for arbitrary precision numeric computation. Nickle has an interesting rational number system: for display purposes, rational numbers may be written as int.fixed{repeat}. For example, the fraction 4/3 would be written (in decimal) as 1.{3}. Write a regular expression describing Nickle rational number tokens.

Representing rational numbers by repeating decimals takes a lot more space than representing rational numbers by relatively prime (numerator, denominator) pairs, like Common Lisp does. (For the mathematically inclined: how long can the repeating part be as a function of the denominator?) It's probably easiest to do rational arithmetic as in Common Lisp, but you could use something else for an integer representation. One option for something else is to think of each number as a finite state machine which produces an infinite (though eventually repeating) list of digits. Suppose that we were to use such a state machine representation:

Can you think how you might convert a Lisp rational number representation (in terms of a pair of integers) to and from a Nickle rational number string?