Chapter 2

2.1

It is actually unfair to talk about what a lexical analyzer will return without being able to talk about data structures. What could it mean for Java or any programming language to "return the stream" (bottom of page 17) FLOAT ID(match0) LPAREN ... STRING(0.0) ... REAL(0.0) etc.? It is not the case that the lexer should return, in sequence, the strings "FLOAT" ... "STRING(0.0)" ..., since this would hardly be much more useful than the source.

For your information, there is a Java class, java.io.StreamTokenizer (http://java.sun.com/products/jdk/1.1/docs/api/java.io.StreamTokenizer.html), which describes a lexical analysis program, customizable in various ways that will be recognizable to you if you write your own. It provides a method, nextToken, which returns an integer for a token type and sets one of two global variables, a string or a number. It is rather heavily biased toward Java/C source code lexical conventions, including hard-wiring in the /* */ and // conventions for comments.

What should be returned is some useful reference to a data structure. An entry in the symbol table is a good idea. That is, if the symbol table is implemented as a hash table, and the sample program is read in, then some hash table entry with printname "match0" and tag ID might be useful. Also associated with that hash table entry might be other information accumulated during lexical analysis, like the lines/columns where the identifier match0 occurs. Further processing might associate additional information with that entry, like the enclosing block in which match0 occurs. Indeed, there may be several distinct uses of the same name: variables in different scopes. These will ultimately have to be distinguished.

In Lisp, where symbols are well supported, it is fairly easy to talk about returning a sequence of symbols, one each time the lex program is called; or constructing a list of the symbols, or printing them to a disk file where they can be trivially read back in. Regular symbols are in fact references to hash table entries: they are allocated space in the Lisp symbol table.
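To make this concrete, here is a minimal sketch in Lisp of what such a symbol table might look like. The slot names and the helper intern-id are my own invention for illustration, not anything from the text:

(defstruct sym-entry
  printname    ; the identifier's text, e.g. "match0"
  tag          ; a token type, e.g. ID
  occurrences  ; list of (line . column) pairs accumulated by the lexer
  info)        ; whatever later phases attach, e.g. the enclosing block

(defvar *symtab* (make-hash-table :test #'equal))

(defun intern-id (name line col)
  ;; Return the entry for NAME, creating it on first sight,
  ;; and record this occurrence.
  (let ((entry (or (gethash name *symtab*)
                   (setf (gethash name *symtab*)
                         (make-sym-entry :printname name :tag 'ID)))))
    (push (cons line col) (sym-entry-occurrences entry))
    entry))

A lexer built this way returns the very same entry each time it sees match0, so later phases can hang scope or type information on that entry, and the several distinct uses of one name can eventually be split apart.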
2.2

One of the great successes of computer science theory that we encounter in CS164 is our ability to use formal language theory and the mathematics of finite state machines, also known as deterministic automata (DA), and regular expressions (RE), to describe important parts of programming languages. We first see this in this section on regular expressions. In reality, the work on DA and RE is more about teaching you an academic discipline and a way of thinking somewhat more rigorously than you might otherwise. The notion of a "language" is extremely powerful, and the intuitive notion is inadequate for "programming languages". On the other hand, the rigor associated with the mathematical definitions tends to be uncomfortable: if you were forced to write all programs strictly using these techniques, you would be very uncomfortable and would eventually find out that you have to violate some rules. More about this later.

A language is a set of strings. A string is a finite sequence of characters from a finite set known as an alphabet. These are languages: the set of all Pascal programs accepted by a particular compiler; the set of integers composed of decimal digits.

REGULAR EXPRESSIONS!

Regular expressions are explained on p 19. Note that the language denoted by regular expression R is apparently also denoted by (R) or ((R)) or (((R))) etc. The author neglected to mention the meta-symbols () used in ((a|b).a)* [middle of p 19]. This expression means: choose one of a or b, followed by an a; then concatenate any number of separate instances of this, including none.

Also, at the risk of being overly pedantic: if we use "." for concatenation, then the concatenation of two regular expressions M . N denotes the language of all the strings m . n for which m is in the language denoted by M and n is in the language denoted by N.

The regular expression epsilon denotes the language with one string, namely the empty string. There is only one such language. That's an odd statement to be able to make, isn't it? There is also a language with no strings at all, the empty SET, sometimes written as an O with a / through it. Let's call it NULL. There's only one of these as well. You can make a nice algebra system with these objects and the operator concatenation (.). Thus

  Epsilon . X = X
  X . Epsilon = X
  NULL . X = NULL

Think about that last one. NULL is like 0, Epsilon is like 1.

There is no logical reason to be so awful about the syntax of regular expressions; it is a historical mishmash. We could have a meta-language and use, say, Lisp. Strings such as "0" stand for themselves. The concatenation operator could be DOT. Thus (DOT "0" "1") is the string "01", and (DOT (DOT "0" "1") "2") is the string "012". We will be generous and allow (DOT "0" "1" "2") to be "012" also. Any time a binary (two-argument) operation like DOT can be naturally extended in this way, we will do so. We could do a bunch of other things with strings, as you can imagine; for the moment we don't need to see if two strings are equal, pick them apart, etc.

Let's move on to REs. Any string, say "01", is an RE meaning the language consisting of that one string. This is an "abuse of notation", allowing a string to be an RE also. If this hadn't previously confused you, there are several possibilities:
1. You figured it out already.
2. You are used to sloppy notation from other courses and this didn't catch your attention one way or the other.

Let NIL be the empty set. It is an RE representing the language with no strings in it. We know that "", a string, is also an RE, representing the language with exactly one string in it.

Let X and Y be REs.

(UNION X Y) is the Lisp notation for an RE for the union of the two languages. We will just make a Lisp list consisting of all the elements in X and in Y. Example: (UNION "012" "345") is ("012" "345"). The text uses the term Alternation and the | symbol for this (infix).

(DOT X Y) is the Lisp notation for concatenation. The text uses the term Concatenation and the . symbol for this (infix). Also (see figure 2.1) sometimes the . is left out. Examples:
  (DOT "012" "345") is "012345"
  (DOT '("012" "345") '("ab" "cd")) is ("012ab" "345ab" "012cd" "345cd")

(STAR X) is the Lisp notation for Kleene closure. The text uses * as a postfix (i.e., to the right) operator. (STAR X) is (UNION Epsilon X (DOT X X) (DOT X X X) ...).

Exercises for the reader: if X = ("0" "1"), describe each of these languages. In particular, how many elements (also known as "sentences in the language") are there?
  (DOT X X)
  (STAR X)
  (DOT NIL X)
  (DOT Epsilon X)
  (STAR Epsilon)
  (STAR NIL)

Here is a Lisp RE for binary numbers divisible by 2, as suggested by the text, page 19:
  (DOT (STAR (UNION "0" "1")) "0")
Here's a better Lisp RE for binary numbers divisible by 2. See if you can tell why this is better... try listing all sentences of length 3 in the RE above, and in the RE below:
  (UNION "0" (DOT "1" (STAR (UNION "0" "1")) "0"))
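Since these REs denote sets of strings, we can check the exercises by machine. Here is a minimal sketch (mine, not the text's) of UNION and DOT on finite languages represented as Lisp lists of strings; STAR has to be cut off at a length bound, since the true closure is an infinite set:

(defun lang-union (x y)
  (union x y :test #'string=))

(defun lang-dot (x y)
  ;; all concatenations m.n with m in language x and n in language y
  (loop for m in x
        append (loop for n in y
                     collect (concatenate 'string m n))))

(defun lang-star (x bound)
  ;; Epsilon plus X, X.X, X.X.X, ..., keeping only strings of
  ;; length <= bound so the computation terminates
  (let* ((x (remove "" x :test #'string=)) ; drop epsilon to guarantee progress
         (result '(""))
         (layer '("")))
    (loop
      (setf layer (remove-if (lambda (s) (> (length s) bound))
                             (lang-dot layer x)))
      (when (null layer) (return result))
      (setf result (lang-union result layer)))))

For instance, with X = ("0" "1"), (lang-dot X X) is ("00" "01" "10" "11"), four sentences, and (lang-star X 3) has fifteen: the empty string plus the fourteen binary strings of lengths 1 through 3.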
Here is another RE, describing a string which begins with "(*" and ends with "*)" and has any number of repetitions of "A" in the middle:
  (DOT "(*" (STAR "A") "*)")

It is possible to extend this representation to include others of the text's notations:
  [abcd] means (UNION "a" "b" "c" "d")
  X?     means (UNION Epsilon X)
  X+     means (DOT X (STAR X)) or (DOT (STAR X) X)

Unforgivably confusing, in my view, figure 2.1 adopts the notation that "." also stands for any single character except newline. That is,
  PERIOD = (UNION "a" "b" "c" ... "A" "B" ... "Z" ... "0" "1" ...)
every single-character string except the one containing the newline character. An RE description of a line of source code is (STAR PERIOD). If you want to include the newline, then (DOT (STAR PERIOD) "\n"), where \n is the [otherwise difficult to show on a page] newline character. Lisp calls this #\newline.

Let's use our language to describe alphanumeric characters:
  ALPHA    = (UNION "a" "b" ... "A" ... "Z")
  NUMER    = (UNION "0" ... "9")
  ALPHANUM = (UNION ALPHA NUMER)
and now we can define identifiers (a letter followed by any number of alphanumerics) as
  IDS = (DOT ALPHA (STAR ALPHANUM))

Can we write a program that, given the descriptions above, recognizes "FOO34B" as an identifier, but rejects "12BAR" and rejects ".com" as not legal? Sure. It would look like this. There is a string s and an RE x. There are two integers, index and end, which refer to locations in the string s; initially index = 0 and end = (length s).

(defvar s)      ; the string being matched
(defvar index)  ; current position in s
(defvar end)    ; (length s)

(defun matchitem (x)
  ;; x is a string: match it literally at the current position
  (cond ((> (+ index (length x)) end) nil)  ; not enough input left
        ((stringp x)
         (and (string= x (subseq s index (+ index (length x))))
              (incf index (length x))))))

(defun mymatch (x)  ; x is an RE
  (typecase x
    (list  ;; x is (union ...), (dot ...), (star ...)
     (ecase (car x)
       (dot   (every #'mymatch (cdr x)))
       (union (some  #'mymatch (cdr x)))  ; no backtracking: a failed
                                          ; alternative may leave index advanced
       (star  (not (do () ((not (mymatch (cadr x)))))))))  ; greedily match zero or more
    ;; it is not a list: match a string
    (t (matchitem x))))

That's about it. You can change these in various ways to allow for "wildcard" characters inside strings, etc., or to have special matching criteria like "matches only if (f e) is true" for some function f and position e (see myscript line 102 et seq.).

;;; Big change in approach.

Instead of writing a program to do matching, let's write a program that takes an RE as input, and WRITES A PROGRAM. This is a meta-programming system, and it is relatively easy in Lisp. Instructions to the meta-program:

Write a program FOO that matches (UNION "AB" "cd"). Answer:
  (defun FOO (s) (or (string= s "AB") (string= s "cd")))

Write a program BAR that matches (DOT P Q). Answer:
  write a program BAR1 that matches P (and advances an index into the string s past P);
  write a program BAR2 that matches Q (and advances an index into the string s past Q);
  write a program BAR3 that checks that the index is at the end of string s.
Now write a program that is, in effect,
  (defun BAR (s) (and (BAR1 s) (BAR2 s) (BAR3 s)))

Now we have a program that only and exactly matches the given RE. The program can be compiled, etc. We can even go a step further. We made our world simpler by "pre-parsing" the textbook REs like A.B* into (DOT A (STAR B)). Why not allow the input to the program-generating program to be strings like "A.B*"? Then we can go directly from an infix language for REs ... a specialized programming language all of its own ... into raw seething executable code. This is what the program LEX, and its Java equivalent JLex, and similar programs do.
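Here is a minimal sketch of such a program-writing program over our list REs. The name re->code is my invention, and this is in the spirit of what LEX does rather than how LEX actually works; the generated code uses the same globals s, index, and end as mymatch above:

(defun re->code (x)
  ;; return a Lisp expression that matches the RE x
  (if (stringp x)
      `(matchitem ,x)
      (ecase (car x)
        (dot   `(and ,@(mapcar #'re->code (cdr x))))
        (union `(or  ,@(mapcar #'re->code (cdr x))))
        (star  `(loop while ,(re->code (cadr x))
                      finally (return t))))))

For example, (re->code '(dot "(*" (star "A") "*)")) produces

  (AND (MATCHITEM "(*")
       (LOOP WHILE (MATCHITEM "A") FINALLY (RETURN T))
       (MATCHITEM "*)"))

and handing (compile nil `(lambda () ,(re->code ...))) to the Lisp compiler turns the RE into raw executable code: no RE is interpreted at run time.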
Now if we collect a set of REs to describe the tokens of our language, we can compile for each RE, and then for all of them, a special program to recognize each token type. Something like
  (LOOP (if eof (return 'done))
        (OR (id s) (number s) (operator s) ....))
will go through the string s and match things up. In reality, this is a totally useless program, because all it says is "yep, that string was described by that RE". We want to do useful things like: if we encounter a number-like string "123.45e10", we want to change it to a number; if we encounter an id-like string which is a keyword like "else", we want to make some kind of recognition of that; if we encounter "/*", we wish to skip through everything until we encounter a matching "*/". An augmented matching program can be written to do all of this. It is much more convenient to write ordinary programs for "augments" than to try to live within the finite automata framework. The way it generally works is that you have another program (this is called a parser) that calls a program with a fixed name, like "nextToken", which clambers over the input stream to isolate a token. Based on the augments, it does some computation to produce a useful object to return to the parser.

We promised to come back to an issue we raised earlier under the subject of convenience. In fact it is more than convenience: the DA can't do everything we need. If our language's lexical structure has comments that nest:
  /* this is a comment /* here it is nested */ and this is also within a comment */
and if it nests arbitrarily deeply, then a DA can't recognize this language. (Note: the way C handles this is wrong. It does not allow nesting!) Why? Essentially, DAs can't count arbitrarily high. They have a finite number of states, and this limits their computational power. We, however, can easily write a program with a counter to which we add 1 every time we see /* and from which we subtract 1 when we see */; when the counter gets back to 0 we are out of the comments.

So finite state machines are not such a great formalism for tying up all the loose ends, but they are an OK way of thinking about the recognition process at the level of tokens. It is a nice Lisp exercise to write FSM simulation programs. If we choose a neat representation and ignore data abstraction:

(defstruct (state (:type list))
  transitions
  final)

;; Here is a finite state machine example: a machine with 3 states,
;; which we will denote 0, 1, 2, stored in an array called Mach1.
;; This machine accepts cc*d and that's all.
(defvar Mach1 (make-array 3))

(setf (aref Mach1 0)  ; initial state
      (make-state
       :transitions '((#\c 1))  ;; if you read a c go to state 1
                                ;; if you read anything else it is an error
       :final nil))

(setf (aref Mach1 1)
      (make-state
       :transitions '((#\c 1)   ;; more c's keep you in state 1
                      (#\d 2))  ;; a d goes to the accepting state
       :final nil))

(setf (aref Mach1 2)  ;; accepting state; dead end, no way out
      (make-state
       :transitions '()
       :final t))

;; fsm simulates a deterministic finite state machine.
;; Given a state number 0, 1, 2, ..., it returns t for accept, nil for reject.
(defun fsm (state state-table input)
  (cond ((string= input "")
         (state-final (aref state-table state)))
        (t (let ((trans (assoc (elt input 0)
                               (state-transitions (aref state-table state)))))
             (and trans
                  (fsm (cadr trans) state-table (subseq input 1)))))))

;; that's all.
;; but see fsm.cl for much more ..........
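Trying the simulator out (these results assume the transition tables exactly as given above):

  (fsm 0 Mach1 "cd")    => T     ; one c, then d
  (fsm 0 Mach1 "cccd")  => T     ; several c's, then d
  (fsm 0 Mach1 "c")     => NIL   ; no final d
  (fsm 0 Mach1 "d")     => NIL   ; needs at least one c
  (fsm 0 Mach1 "cdc")   => NIL   ; nothing may follow the d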
What's all this about conversion of NFA to DA? Understand the idea of an NFA. NFAs expand upon the notation by allowing transitions from a state to SEVERAL states on the same input. It is also easy to mechanically convert an RE to an NFA. When it is all boiled down, we have a shorthand for what is usually a far larger, but equivalent, DA. Why is this interesting? If NFAs and REs were really the perfect tool for describing some task, we could reduce that task to entirely automated procedures. As it is, selecting tokens from an input stream is nearly characterized by these formalisms. Not perfectly, though.
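To make the NFA idea concrete, here is a sketch (my extension, not the text's; epsilon transitions are omitted for brevity) of how the fsm simulator above can be generalized: instead of one current state, carry the SET of states the machine might be in, and accept if any state in the set is final when the input runs out.

;; nfsm simulates a nondeterministic machine in the same representation
;; as before, except a state may list several transitions on one character.
(defun nfsm (states state-table input)
  ;; states is a list of state numbers the machine might be in
  (cond ((string= input "")
         (some (lambda (st) (state-final (aref state-table st))) states))
        (t (nfsm (remove-duplicates
                  (loop for st in states
                        append (loop for (ch target)
                                       in (state-transitions (aref state-table st))
                                     when (eql ch (elt input 0))
                                     collect target)))
                 state-table
                 (subseq input 1)))))

;; Start it with the set containing just the initial state, e.g.
;; (nfsm '(0) Mach1 "ccd"). On a deterministic table like Mach1 it
;; agrees with fsm; this subset-of-states idea is exactly what the
;; NFA-to-DA conversion caches into new DA states.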