Section notes: Week 3

CS 164, Fall 2005

Follow-up from last time

What C lexical structure does cpp care about?
- C comments, strings, identifiers, commas, parens
- Current line number and file name
- Enough about numbers to handle #if
Outside #if lines, can handle ill-formed C input
- Doesn't complain on 1g2 in main text
- Means cpp can be used for more than C
What about precompiled headers?
- Included in Xcode (Apple's version of GCC 4), Visual C++
- Reduces I/O cost, even if lexing is free

Lexing approaches

Ad hoc
- Just break string based on white space, commas, etc.
- Example: C strtok function
- Recognize any keywords by table lookup
- Great for little tasks (e.g. config files)
Regular expression libraries
- Built into many languages (e.g. Perl, Python, Allegro CL)
- Regular expression -> NFA -> DFA -> compressed DFA (more next week)
Lexer generators
- Classic example: lex/flex (compiles to C code)
- What if there are multiple matching patterns? Disambiguate by longest match, order of declarations.
- Can attach semantic actions to different tokens (e.g. ignore comments, convert integers to internal format)
Lexing as top-down parsing
- Example: Henry Baker's Pragmatic Parsing in Common Lisp (Meta)
- Idea: Write a grammar for tokens in terms of characters
- Can use automated tools, often simple enough to do by hand

CL lexing and regex tools

Regex libraries
- Allegro CL regex package
- nregex (also included in SLIME)
- CL-PPCRE (portable Perl-compatible regular expressions for Common Lisp)
- Regex libraries from Dorai Sitaram, Michael Parker, Jeff Massung, ...
Lexing
- Pragmatic Parsing in Common Lisp describes the Meta system.
- Macros based on any old regex system (Cl-PPRCE example)
- Deflexer macros from David Bindel, Michael Parker, Jeff Massung, ...

Strategies for the MJ lexer

MiniJava lexical structure is simple. You have words (which can be divided into keywords or identifiers by table lookup), numbers, operators and punctuation, comments, and white space. The only operator that uses more than one character is &&; the only non-word token types which cannot be distinguished from one lookahead character are the two types of comments (both start with /). It's easy to lex by hand; if you use a tool, make sure it saves you work.

The caveat about tools holds generally. If your tools get in your way, think about changing tools, or enhancing an existing tool, or just using a different tool for the hard part. For instance, if you ever read a flex book, you'll see mention of different lexer states; this is a hack to make it easy to process things like C comments. As another example, you'll find that you can sometimes get smaller, simpler state tables (for lexing or for parsing) by deferring a little bit of the work to a later stage: distinguishing identifiers from keywords by a table lookup, or making the grammar accept some illegal syntax which is caught in a post-processing step.

More Lisp-y goodness

You can look ahead one character in a Lisp stream using peek-char. However, it's helpful to spend time thinking about an abstraction which allows more than one character lookahead. The lstream.lisp file in the software directory shows you how to implement an augmented stream with row-number and col-number tracking and with arbirtrary lookahead. This file illustrates a few useful Lisp techniques you may not have seen in assignment 0: structures, macros, backquoted list builders, and multiple return values.

Structures give you a place to store mutable state information. I use structures for simple object-oriented designs: a structure represents the internal state of the object, which may change over time.
Macros let you turn the language you've got into the language you wanted. For example, if you're processing a stream char-by-char, it's handy to have specialized support for loops to read and process characters; this is the purpose of the while-lstream macro in lstream.lisp. For an enormous treasury of macros, read On Lisp.
Backquotes let you build lists without wordy calls to cons and list. Backquotes are like ordinary quotes, except that they have an escape mechanism to turn evaluation back on. For example, `(a . ,(+ 1 1)) produces the dotted pair (a . 2).
Multiple return values give you a way to return auxilary information (like row and column number) that you won't always care about. Yes, you could also return everything you care about in a list; but with multiple returns, you can get the extra values if you want them, and if you don't care, you don't need to invest in extra unpacking of a returned structure.

Regular expression exercises

MiniJava identifiers?
Decimal numbers which can be written p/4 for some integer p?
String containing exactly three 'b's and any number of 'a's?
Positive C integer constants (remember that 0x starts a hex number and 0 starts an octal number)?

Question for next time

The Nickle programming language (nickle.org) began life two decades ago when Keith Packard of HP decided he needed a system for arbitrary precision numeric computation. Nickle has an interesting rational number system: for display purposes, rational numbers may be written as int.fixed{repeat}. For example, the fraction 4/3 would be written (in decimal) as 1.{3}. Write a regular expression describing Nickle rational number tokens.

Representing rational numbers by repeating decimals takes a lot more space than representing rational numbers by relatively prime (numerator, denominator) pairs, like Common Lisp does. (For the mathematically inclined: how long can the repeating part be as a function of the denominator?) It's probably easiest to do rational arithmetic as in Common Lisp, but you could use something else for an integer representation. One option for something else is to think of each number as a finite state machine which produces an infinite (though eventually repeating) list of digits. Suppose that we were to use such a state machine representation:

Can you write a state machine which will generate the string associated with the number 5/6?
Suppose you wanted to check whether a finite decimal string was a correctly-rounded approximation to 5/6. Can you write a finite state machine to recognize this? What if you wanted to recognize p/6 for any integer p? Hint: this is easiest to do with a non-deterministic machine; we'll soon see how to convert such nondeterministic finite automata (NFAs) to deterministic equivalents (DFAs).
There are multiple finite state machines that will generate the infinite strings associated with the same number. How might you check that these machines produce the same strings? That is, how can you check that 0.{6} and 0.{66} represent the same number?

Can you think how you might convert a Lisp rational number representation (in terms of a pair of integers) to and from a Nickle rational number string?