Section notes: Week 3
CS 164, Fall 2005
Follow-up from last time
- What C lexical structure does cpp care about?
- C comments, strings, identifiers, commas, parens
- Current line number and file name
- Enough about numbers to handle #if
- Outside #if lines, can handle ill-formed C input
- Doesn't complain on 1g2 in main text
- Means cpp can be used for more than C
- What about precompiled headers?
- Included in Xcode (Apple's version of GCC 4), Visual C++
- Reduces I/O cost, even if lexing is
free
Lexing approaches
- Ad hoc
- Just break string based on white space, commas, etc.
- Example: C strtok function
- Recognize any keywords by table lookup
- Great for little tasks (e.g. config files)
- Regular expression libraries
- Built into many languages (e.g. Perl, Python, Allegro CL)
- Regular expression -> NFA -> DFA -> compressed DFA
(more next week)
- Lexer generators
- Classic example: lex/flex (compiles to C code)
- What if there are multiple matching patterns?
Disambiguate by longest match, order of declarations.
- Can attach semantic actions to different tokens (e.g. ignore
comments, convert integers to internal format)
- Lexing as top-down parsing
- Example: Henry Baker's
Pragmatic Parsing in Common Lisp
(Meta)
- Idea: Write a grammar for tokens in terms of characters
- Can use automated tools, often simple enough to do by hand
CL lexing and regex tools
Strategies for the MJ lexer
MiniJava lexical structure is simple. You have words (which can be
divided into keywords or identifiers by table lookup), numbers,
operators and punctuation, comments, and white space. The only
operator that uses more than one character is &&; the only
non-word token types which cannot be distinguished from one
lookahead character are the two types of comments (both start with
/). It's easy to lex by hand
; if you use a tool, make sure it
saves you work.
The caveat about tools holds generally. If your tools get in your
way, think about changing tools, or enhancing an existing tool, or
just using a different tool for the hard
part. For instance,
if you ever read a flex book, you'll see mention of different lexer
states; this is a hack to make it easy to process things like C
comments. As another example, you'll find that you can sometimes
get smaller, simpler state tables (for lexing or for parsing) by
deferring a little bit of the work to a later stage: distinguishing
identifiers from keywords by a table lookup, or making the grammar
accept some illegal syntax which is caught in a post-processing step.
More Lisp-y goodness
You can look ahead one character in a Lisp stream using peek-char.
However, it's helpful to spend time thinking about an abstraction which
allows more than one character lookahead. The lstream.lisp file
in the software directory shows you how to implement an augmented stream
with row-number and col-number tracking and with arbirtrary lookahead.
This file illustrates a few useful Lisp techniques you may not have
seen in assignment 0: structures, macros, backquoted list builders, and
multiple return values.
-
Structures give you a place to store mutable state information.
I use structures for simple object-oriented designs: a structure
represents the internal state of the object, which may change over
time.
-
Macros let you turn the language you've got into the language
you wanted. For example, if you're processing a stream char-by-char,
it's handy to have specialized support for loops to read and process
characters; this is the purpose of the while-lstream macro
in lstream.lisp. For an enormous treasury of macros,
read On Lisp.
-
Backquotes let you build lists without wordy calls to
cons and list. Backquotes are like ordinary
quotes, except that they have an escape mechanism to turn
evaluation back on. For example, `(a . ,(+ 1 1))
produces the dotted pair (a . 2).
-
Multiple return values give you a way to return auxilary
information (like row and column number) that you won't always care
about. Yes, you could also return everything you care about in a list;
but with multiple returns, you can get the extra values if you want them,
and if you don't care, you don't need to invest in extra unpacking
of a returned structure.
Regular expression exercises
- MiniJava identifiers?
- Decimal numbers which can be written p/4 for some integer p?
- String containing exactly three 'b's and any number of 'a's?
- Positive C integer constants (remember that 0x starts a hex number and
0 starts an octal number)?
Question for next time
The Nickle programming language (nickle.org)
began life two decades ago when Keith Packard of HP decided he needed a
system for arbitrary precision numeric computation. Nickle has an
interesting rational number system: for display purposes, rational numbers
may be written as int.fixed{repeat}.
For example, the fraction 4/3 would be written (in decimal) as
1.{3}.
Write a regular expression describing Nickle rational number tokens.
Representing rational numbers by repeating decimals takes a lot more
space than representing rational numbers by relatively prime
(numerator, denominator) pairs, like Common Lisp does. (For the
mathematically inclined: how long can the repeating part be as a
function of the denominator?) It's probably easiest to do rational
arithmetic as in Common Lisp, but you could use something else for
an integer representation. One option for something else
is
to think of each number as a finite state machine which produces an
infinite (though eventually repeating) list of digits. Suppose that
we were to use such a state machine representation:
- Can you write a state machine which will generate the string
associated with the number 5/6?
- Suppose you wanted to check whether a finite decimal string was a
correctly-rounded approximation to 5/6. Can you write a finite
state machine to recognize this? What if you wanted to recognize
p/6 for any integer p? Hint: this is easiest to do with a
non-deterministic machine; we'll soon see how to convert such
nondeterministic finite automata (NFAs) to deterministic equivalents
(DFAs).
- There are multiple finite state machines that will generate
the infinite strings associated with the same number. How might
you check that these machines produce the same strings? That is,
how can you check that 0.{6} and 0.{66} represent
the same number?
Can you think how you might convert a Lisp rational number representation
(in terms of a pair of integers) to and from a Nickle rational number
string?