Chapter 2

2.1

It is actually unfair to talk about what a lexical analyzer will return without being able to talk about data structures. What could it mean for Java or any programming language to "return the stream" (bottom of page 17) FLOAT ID(match0) LPAREN ... STRING(0.0) ... REAL(0.0) etc.? It is not the case that the lexer should return, in sequence, the strings "FLOAT" ... "STRING(0.0)" ..., since this would hardly be much more useful than the source.

For your information, there is a Java class, java.io.StreamTokenizer (http://java.sun.com/products/jdk/1.1/docs/api/java.io.StreamTokenizer.html), which describes a lexical analysis program, customizable in various ways that will be recognizable to you if you write your own. It provides a method, nextToken, which returns an integer for a token type and sets one of two global variables, a string or a number. It is rather heavily biased toward Java/C source code lexical conventions, including hard-wiring in the /* */ and // conventions for comments.

What should be returned is some useful reference to a data structure. An entry in the symbol table is a good idea. That is, if the symbol table is implemented as a hash table, and the sample program is read in, then some hash table entry with printname "match0" and tag ID might be useful. Also associated with that hash table entry might be other information accumulated during lexical analysis, like the lines/columns where the identifier match0 occurs. Further processing might associate additional information with that entry, like the enclosing block in which match0 occurs. Indeed, there may be several distinct uses of the same name: variables in different scopes. These will ultimately have to be distinguished.

In Lisp, where symbols are well supported, it is fairly easy to talk about returning a sequence of symbols, one each time the lex program is called; or constructing a list of the symbols, or printing them to a disk file where they can be trivially read back in. Regular symbols are in fact references to hash table entries: they are allocated space in the Lisp symbol table.
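To make this concrete, here is a minimal sketch in Lisp of what such a symbol table might look like. The slot names and the helper intern-id are my own invention for illustration, not anything from the text:

(defstruct sym-entry
  printname    ; the identifier's text, e.g. "match0"
  tag          ; a token type, e.g. ID
  occurrences  ; list of (line . column) pairs accumulated by the lexer
  info)        ; whatever later phases attach, e.g. the enclosing block

(defvar *symtab* (make-hash-table :test #'equal))

(defun intern-id (name line col)
  ;; Return the entry for NAME, creating it on first sight,
  ;; and record this occurrence.
  (let ((entry (or (gethash name *symtab*)
                   (setf (gethash name *symtab*)
                         (make-sym-entry :printname name :tag 'ID)))))
    (push (cons line col) (sym-entry-occurrences entry))
    entry))

A lexer built this way returns the very same entry each time it sees match0, so later phases can hang scope or type information on that entry, and the several distinct uses of one name can eventually be split apart.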
2.2

One of the great successes of computer science theory that we encounter in CS164 is our ability to use formal language theory and the mathematics of finite state machines, also known as deterministic automata (DA), and regular expressions (RE), to describe important parts of programming languages. We first see this in this section on regular expressions. In reality, the work on DA and RE is more about teaching you an academic discipline and a way of thinking somewhat more rigorously than you might otherwise. The notion of a "language" is extremely powerful, and the intuitive notion is inadequate for "programming languages". On the other hand, the rigor associated with the mathematical definitions tends to be uncomfortable: if you were forced to write all programs strictly using these techniques, you would be very uncomfortable and would eventually find out that you have to violate some rules. More about this later.

A language is a set of strings. A string is a finite sequence of characters from a finite set known as an alphabet. These are languages: the set of all Pascal programs accepted by a particular compiler; the set of integers composed of decimal digits.

REGULAR EXPRESSIONS!

Regular expressions are explained on p 19. Note that the language denoted by regular expression R is apparently also denoted by (R) or ((R)) or (((R))) etc. The author neglected to mention the meta-symbols () used in ((a|b).a)* [middle of p 19]. This expression means: choose one of a or b, followed by an a; then concatenate any number of separate instances of this, including none.

Also, at the risk of being overly pedantic: if we use "." for concatenation, then the concatenation of two regular expressions M . N denotes the language of all the strings m . n for which m is in the language denoted by M and n is in the language denoted by N.

The regular expression epsilon denotes the language with one string, namely the empty string. There is only one such language. That's an odd statement to be able to make, isn't it? There is also a language with no strings at all, the empty SET, sometimes written as an O with a / through it. Let's call it NULL. There's only one of these as well. You can make a nice algebra system with these objects and the operator concatenation (.). Thus

  Epsilon . X = X
  X . Epsilon = X
  NULL . X = NULL

Think about that last one. NULL is like 0, Epsilon is like 1.

There is no logical reason to be so awful about the syntax of regular expressions; it is a historical mishmash. We could have a meta-language and use, say, Lisp. Strings such as "0" stand for themselves. The concatenation operator could be DOT. Thus (DOT "0" "1") is the string "01", and (DOT (DOT "0" "1") "2") is the string "012". We will be generous and allow (DOT "0" "1" "2") to be "012" also. Any time a binary (two-argument) operation like DOT can be naturally extended in this way, we will do so. We could do a bunch of other things with strings, as you can imagine; for the moment we don't need to see if two strings are equal, pick them apart, etc.

Let's move on to REs. Any string, say "01", is an RE meaning the language consisting of that one string. This is an "abuse of notation", allowing a string to be an RE also. If this hadn't previously confused you, there are several possibilities:
1. You figured it out already.
2. You are used to sloppy notation from other courses and this didn't catch your attention one way or the other.

Let NIL be the empty set. It is an RE representing the language with no strings in it. We know that "", a string, is also an RE, representing the language with exactly one string in it.

Let X and Y be REs.

(UNION X Y) is the Lisp notation for an RE for the union of the two languages. We will just make a Lisp list consisting of all the elements in X and in Y. Example: (UNION "012" "345") is ("012" "345"). The text uses the term Alternation and the | symbol for this (infix).

(DOT X Y) is the Lisp notation for concatenation. The text uses the term Concatenation and the . symbol for this (infix). Also (see figure 2.1) sometimes the . is left out. Examples:
  (DOT "012" "345") is "012345"
  (DOT '("012" "345") '("ab" "cd")) is ("012ab" "345ab" "012cd" "345cd")

(STAR X) is the Lisp notation for Kleene closure. The text uses * as a postfix (i.e., to the right) operator. (STAR X) is (UNION Epsilon X (DOT X X) (DOT X X X) ...).

Exercises for the reader: if X = ("0" "1"), describe each of these languages. In particular, how many elements (also known as "sentences in the language") are there?
  (DOT X X)
  (STAR X)
  (DOT NIL X)
  (DOT Epsilon X)
  (STAR Epsilon)
  (STAR NIL)

Here is a Lisp RE for binary numbers divisible by 2, as suggested by the text, page 19:
  (DOT (STAR (UNION "0" "1")) "0")
Here's a better Lisp RE for binary numbers divisible by 2. See if you can tell why this is better... try listing all sentences of length 3 in the RE above, and in the RE below:
  (UNION "0" (DOT "1" (STAR (UNION "0" "1")) "0"))
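Since these REs denote sets of strings, we can check the exercises by machine. Here is a minimal sketch (mine, not the text's) of UNION and DOT on finite languages represented as Lisp lists of strings; STAR has to be cut off at a length bound, since the true closure is an infinite set:

(defun lang-union (x y)
  (union x y :test #'string=))

(defun lang-dot (x y)
  ;; all concatenations m.n with m in language x and n in language y
  (loop for m in x
        append (loop for n in y
                     collect (concatenate 'string m n))))

(defun lang-star (x bound)
  ;; Epsilon plus X, X.X, X.X.X, ..., keeping only strings of
  ;; length <= bound so the computation terminates
  (let* ((x (remove "" x :test #'string=)) ; drop epsilon to guarantee progress
         (result '(""))
         (layer '("")))
    (loop
      (setf layer (remove-if (lambda (s) (> (length s) bound))
                             (lang-dot layer x)))
      (when (null layer) (return result))
      (setf result (lang-union result layer)))))

For instance, with X = ("0" "1"), (lang-dot X X) is ("00" "01" "10" "11"), four sentences, and (lang-star X 3) has fifteen: the empty string plus the fourteen binary strings of lengths 1 through 3.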
Here is another RE, describing a string which begins with "(*" and ends with "*)" and has any number of repetitions of "A" in the middle:
  (DOT "(*" (STAR "A") "*)")

It is possible to extend this representation to include others of the text's notations:
  [abcd] means (UNION "a" "b" "c" "d")
  X?     means (UNION Epsilon X)
  X+     means (DOT X (STAR X)) or (DOT (STAR X) X)

Unforgivably confusing, in my view, figure 2.1 adopts the notation that "." also stands for any single character except newline. That is,
  PERIOD = (UNION "a" "b" "c" ... "A" "B" ... "Z" ... "0" "1" ...)
every single-character string except the one containing the newline character. An RE description of a line of source code is (STAR PERIOD). If you want to include the newline, then (DOT (STAR PERIOD) "\n"), where \n is the [otherwise difficult to show on a page] newline character. Lisp calls this #\newline.

Let's use our language to describe alphanumeric characters:
  ALPHA    = (UNION "a" "b" ... "A" ... "Z")
  NUMER    = (UNION "0" ... "9")
  ALPHANUM = (UNION ALPHA NUMER)
and now we can define identifiers (a letter followed by any number of alphanumerics) as
  IDS = (DOT ALPHA (STAR ALPHANUM))

Can we write a program that, given the descriptions above, recognizes "FOO34B" as an identifier, but rejects "12BAR" and rejects ".com" as not legal? Sure. It would look like this. There is a string s and an RE x. There are two integers, index and end, which refer to locations in the string s; initially index = 0 and end = (length s).

(defvar s)      ; the string being matched
(defvar index)  ; current position in s
(defvar end)    ; (length s)

(defun matchitem (x)
  ;; x is a string: match it literally at the current position
  (cond ((> (+ index (length x)) end) nil)  ; not enough input left
        ((stringp x)
         (and (string= x (subseq s index (+ index (length x))))
              (incf index (length x))))))

(defun mymatch (x)  ; x is an RE
  (typecase x
    (list  ;; x is (union ...), (dot ...), (star ...)
     (ecase (car x)
       (dot   (every #'mymatch (cdr x)))
       (union (some  #'mymatch (cdr x)))  ; no backtracking: a failed
                                          ; alternative may leave index advanced
       (star  (not (do () ((not (mymatch (cadr x)))))))))  ; greedily match zero or more
    ;; it is not a list: match a string
    (t (matchitem x))))

That's about it. You can change these in various ways to allow for "wildcard" characters inside strings, etc., or to have special matching criteria like "matches only if (f e) is true" for some function f and position e (see myscript line 102 et seq.).

;;; Big change in approach.

Instead of writing a program to do matching, let's write a program that takes an RE as input, and WRITES A PROGRAM. This is a meta-programming system, and it is relatively easy in Lisp. Instructions to the meta-program:

Write a program FOO that matches (UNION "AB" "cd"). Answer:
  (defun FOO (s) (or (string= s "AB") (string= s "cd")))

Write a program BAR that matches (DOT P Q). Answer:
  write a program BAR1 that matches P (and advances an index into the string s past P);
  write a program BAR2 that matches Q (and advances an index into the string s past Q);
  write a program BAR3 that checks that the index is at the end of string s.
Now write a program that is, in effect,
  (defun BAR (s) (and (BAR1 s) (BAR2 s) (BAR3 s)))

Now we have a program that only and exactly matches the given RE. The program can be compiled, etc. We can even go a step further. We made our world simpler by "pre-parsing" the textbook REs like A.B* into (DOT A (STAR B)). Why not allow the input to the program-generating program to be strings like "A.B*"? Then we can go directly from an infix language for REs ... a specialized programming language all of its own ... into raw seething executable code. This is what the program LEX, and its Java equivalent JLex, and similar programs do.
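Here is a minimal sketch of such a program-writing program over our list REs. The name re->code is my invention, and this is in the spirit of what LEX does rather than how LEX actually works; the generated code uses the same globals s, index, and end as mymatch above:

(defun re->code (x)
  ;; return a Lisp expression that matches the RE x
  (if (stringp x)
      `(matchitem ,x)
      (ecase (car x)
        (dot   `(and ,@(mapcar #'re->code (cdr x))))
        (union `(or  ,@(mapcar #'re->code (cdr x))))
        (star  `(loop while ,(re->code (cadr x))
                      finally (return t))))))

For example, (re->code '(dot "(*" (star "A") "*)")) produces

  (AND (MATCHITEM "(*")
       (LOOP WHILE (MATCHITEM "A") FINALLY (RETURN T))
       (MATCHITEM "*)"))

and handing (compile nil `(lambda () ,(re->code ...))) to the Lisp compiler turns the RE into raw executable code: no RE is interpreted at run time.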
Now if we collect a set of REs to describe the tokens of our language, we can compile for each RE, and then for all of them, a special program to recognize each token type. Something like
  (LOOP (if eof (return 'done))
        (OR (id s) (number s) (operator s) ....))
will go through the string s and match things up. In reality, this is a totally useless program, because all it says is "yep, that string was described by that RE". We want to do useful things like: if we encounter a number-like string "123.45e10", we want to change it to a number; if we encounter an id-like string which is a keyword like "else", we want to make some kind of recognition of that; if we encounter "/*", we wish to skip through everything until we encounter a matching "*/". An augmented matching program can be written to do all of this. It is much more convenient to write ordinary programs for "augments" than to try to live within the finite automata framework. The way it generally works is that you have another program (this is called a parser) that calls a program with a fixed name, like "nextToken", which clambers over the input stream to isolate a token. Based on the augments, it does some computation to produce a useful object to return to the parser.

We promised to come back to an issue we raised earlier under the subject of convenience. In fact it is more than convenience: the DA can't do everything we need. If our language's lexical structure has comments that nest:
  /* this is a comment /* here it is nested */ and this is also within a comment */
and if it nests arbitrarily deeply, then a DA can't recognize this language. (Note: the way C handles this is wrong. It does not allow nesting!) Why? Essentially, DAs can't count arbitrarily high. They have a finite number of states, and this limits their computational power. We, however, can easily write a program with a counter to which we add 1 every time we see /* and from which we subtract 1 when we see */; when the counter gets back to 0 we are out of the comments.

So finite state machines are not such a great formalism for tying up all the loose ends, but they are an OK way of thinking about the recognition process at the level of tokens. It is a nice Lisp exercise to write FSM simulation programs. If we choose a neat representation and ignore data abstraction:

(defstruct (state (:type list))
  transitions
  final)

;; Here is a finite state machine example: a machine with 3 states,
;; which we will denote 0, 1, 2, stored in an array called Mach1.
;; This machine accepts cc*d and that's all.
(defvar Mach1 (make-array 3))

(setf (aref Mach1 0)  ; initial state
      (make-state
       :transitions '((#\c 1))  ;; if you read a c go to state 1
                                ;; if you read anything else it is an error
       :final nil))

(setf (aref Mach1 1)
      (make-state
       :transitions '((#\c 1)   ;; more c's keep you in state 1
                      (#\d 2))  ;; a d goes to the accepting state
       :final nil))

(setf (aref Mach1 2)  ;; accepting state; dead end, no way out
      (make-state
       :transitions '()
       :final t))

;; fsm simulates a deterministic finite state machine.
;; Given a state number 0, 1, 2, ..., it returns t for accept, nil for reject.
(defun fsm (state state-table input)
  (cond ((string= input "")
         (state-final (aref state-table state)))
        (t (let ((trans (assoc (elt input 0)
                               (state-transitions (aref state-table state)))))
             (and trans
                  (fsm (cadr trans) state-table (subseq input 1)))))))

;; that's all.
;; but see fsm.cl for much more ..........
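Trying the simulator out (these results assume the transition tables exactly as given above):

  (fsm 0 Mach1 "cd")    => T     ; one c, then d
  (fsm 0 Mach1 "cccd")  => T     ; several c's, then d
  (fsm 0 Mach1 "c")     => NIL   ; no final d
  (fsm 0 Mach1 "d")     => NIL   ; needs at least one c
  (fsm 0 Mach1 "cdc")   => NIL   ; nothing may follow the d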
What's all this about conversion of NFA to DA? Understand the idea of an NFA. NFAs expand upon the notation by allowing transitions from a state to SEVERAL states on the same input. It is also easy to mechanically convert an RE to an NFA. When it is all boiled down, we have a shorthand for what is usually a far larger, but equivalent, DA. Why is this interesting? If NFAs and REs were really the perfect tool for describing some task, we could reduce that task to entirely automated procedures. As it is, selecting tokens from an input stream is nearly characterized by these formalisms. Not perfectly, though.
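To make the NFA idea concrete, here is a sketch (my extension, not the text's; epsilon transitions are omitted for brevity) of how the fsm simulator above can be generalized: instead of one current state, carry the SET of states the machine might be in, and accept if any state in the set is final when the input runs out.

;; nfsm simulates a nondeterministic machine in the same representation
;; as before, except a state may list several transitions on one character.
(defun nfsm (states state-table input)
  ;; states is a list of state numbers the machine might be in
  (cond ((string= input "")
         (some (lambda (st) (state-final (aref state-table st))) states))
        (t (nfsm (remove-duplicates
                  (loop for st in states
                        append (loop for (ch target)
                                       in (state-transitions (aref state-table st))
                                     when (eql ch (elt input 0))
                                     collect target)))
                 state-table
                 (subseq input 1)))))

;; Start it with the set containing just the initial state, e.g.
;; (nfsm '(0) Mach1 "ccd"). On a deterministic table like Mach1 it
;; agrees with fsm; this subset-of-states idea is exactly what the
;; NFA-to-DA conversion caches into new DA states.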