Navigation
Introduction
In this homework, you'll learn some things that we won't talk about in lecture: classes and methods dedicated to searching strings for selected patterns and for reading formatted input.
Warning: There's a lot of jargon for this homework. Sorry! Focus on the experiments and writing code and it should all come together.
B. Scanners
Intro to Scanners
If you're already familiar with scanners, you can skip straight to the tasks below.
The class java.util.Scanner
gives you a way to read
substrings of text, also called tokens, sequentially
from a stream of text that is furnished to the Scanner by its
constructor. Typically, the stream of text comes from a file or from
a terminal, but there are ways to convert any source of characters
into a stream that a Scanner
can process.
One of Scanner's constructors accepts an InputStream
—a
stream of bytes (8-bit characters). Since System.in
, which is
normally the standard input stream to your program, is a
kind of InputStream
(that is, its type is a subtype of
InputStream
), you can write
java.util.Scanner inp = new java.util.Scanner(System.in);
to get something that scans the input from your terminal. (Normally,
of course, you'd put import java.util.Scanner;
at the
beginning of your source file and just write Scanner
instead
of java.util.Scanner
).
One can also create Scanners from Strings (e.g. Scanner s = new
Scanner("hello");
), input files, and more.
The simplest way to use a Scanner
is treat the input stream
as a sequence of tokens separated by text that matches a
delimiter pattern. By default, the delimiter pattern
matches stretches of whitespace (blanks, tabs, newlines, carriage
returns). For example, consider the input below:
hello i am a half human half
horse
In this case, the tokens separated by our delimiter pattern are
"hello"
, "i"
, "am"
, "a"
,
"half"
, "human"
, "half"
, and
"horse"
.
Delimiter patterns can be arbitrary, for example if our delimiter pattern were stretches of ; and , then the string:
hello; i am just a, horse
Would yield the tokens "hello"
, " i am just a"
, and
" horse"
.
Open up ReadInts.java
and TestReadInts.java
. ReadInts
provides one
complete method printInts
and two incomplete methods
testReadInts
and testSmartReadInts
. TestReadInts
calls all three methods, testing as appropriate.
Experiment #1: TestReadInts
Try running TestReadInts
. Look at TestReadInts.java
, and you'll see
the contents of the input String inp
are printed out as you'd
expect by the test. Yay, the provided code works.
Next open up ReadInts.java
and look at the source of the printInts
method. You'll see that it uses the hasNext()
and
nextInt()
methods.
These methods, along with their most common brethren are described below.
inp.hasNext()
is true iff there is another token (that is, something other than a delimiter) before the end of the input.inp.next()
returns the next token, and advancesinp
past it.inp.hasNextInt()
is true iffinp.hasNext()
and the next token has the syntax of a (possibly signed) decimal numeral.inp.nextInt()
does ainp.next()
and then parses the token into anint
.- Likewise, there are
inp.hasNextDouble()
, which returnsdouble
values, and several other similarly named methods for other types. inp.hasNextInt(RADIX)
is true iff the next token exists and has the syntax of a (possibly signed) base-RADIX numeral.inp.nextInt(RADIX)
reads the next token as a base- RADIX numeral.inp.hasNextLine()
is true iff there is any more input.inp.nextLine()
returns the next line of input (that is, everything up to, but not including, the next end-of-line character, or the end of the input if there isnt an end-of-line at the end). It then positionsinp
past the end-of-line character. Thus, it differs from the othernext
methods in that it uses a different delimiter (end of line instead of whitespace).
Programming Tasks
Task 1: readInts()
When you run make check, you'll see that your code fails in readInts. Head to the readInts method, and you'll see that one line of code is missing.
Using the documentation for the ArrayList
class, figure out how to
modify the code so that the readInts method works correctly.
The TestReadInts
test feeds your
method a bad input, and actually expects an exception.
For those of
you who are particularly keen on the idea of testing exceptions, Junit
supports a less unwieldy syntax based on the @Rule tag that you're
free to use.
Task 2: smartReadInts()
Complete the smartReadInts method so that it works as described in the comments and TestReadInts. Use the readInts method as a guide.
C. Patterns
Your code for part 1 looked through input token by token, accepting tokens that were integers. How does Java know if a string represents an integer? As you might imagine, it looks for sequences containing only the digits 0123456789.
What if we want to match, for example, numbers that only contain numbers less than 5? Or five digit numbers only?
Java provides a faculty for this known as Pattern Matching, and
supports a rich syntax for specifying patterns. For example, numbers
less than 5 could be expressed by the pattern "[01234]". Five digit
numbers could be expressed by "[0-9][0-9][0-9][0-9][0-9]"
or "[0-9]{5}"
.
These patterns are sometimes referred to as regular expressions, though strictly speaking, they're a somewhat different animal than the formal notion of a regular expression that might be discussed, for example, in an upper-division CS class in your future.
The full pattern language is quite rich, and is documented under java.util.regex.Pattern in the on-line Java library documentation. Here are just a few that you might find particularly useful. You are not expected to read this entire list for this homework. However, you may find it useful moving forward in 61B.
- Most characters (letters, digits, most punctuation) simply match themselves.
- A period (".") matches any character other than (usually)
newline. To get "." to match newline as well, include
(?s)
at the beginning of your pattern. - A sequence such as "[abe]" denotes a character
class: in this case, "any of the (single) characters
a
,b
, ore
". As a shorthand, you can represent a range of characters with a hyphen, as in"[abd-qs-z]"
to meana
, b,d
throughq
, ands
throughz
. - A sequence such as
[^abe]
matches any single character other than those listed. - There are several useful two-character shorthands for certain
character classes.
\d
is short for[0-9]
,\s
is short for \t\n\r. Unfortunately, in order to put an actual\
in a string, you must double it. Thus, a pattern that matches any two-digit string would be written as the string literal"\\d\\d"
. [The Python language got around this problem with "raw strings" such asr"\s"
, and Perl got around it by having patterns be a part of the syntax, distinct from strings. The Java designers, however, apparently never saw the need.] - If
P
represents a pattern, thenP*
represents "0 or more repetitions of P". Thus,x*
matches the empty string, "x", "xx", "xxx", etc.[a-c]*
matches "", "a", "ab", "aa", "bac", "ccc", etc.*
has the strongest binding of any character (a la order of operations), soab*
means "onea
followed by 0 or moreb
s". To get the effect of "0 or moreab
s", use(ab)*
- Similarly, P+ means "1 or more Ps."
P?
denotes an optional P (0 or 1 Ps).- If
P
andQ
denote patterns, thenP|Q
denotes "aP
or aQ
". (P)
denotes the same thing as P. It also serves to define a group, a subpattern whose match you can retrieve later.(?:P)
also denotes the same thing as P, but it does not define a group that you can retrieve later (see hw5 for the full details).- Following a
*
,+
, or?
with a?
creates a "non- greedy" version, meaning that it matches as few characters as possible to make the match work. This affects what part of a string gets matched, but usually not whether a string gets matched. For example, if you are matching the string "1, 2, 3, 4" against the pattern string1"(\\d).*(\\d).*"
, the first group will match1
and second will match4
. But with the pattern string"(\\d).*?(\\d).*"
, the second group will match2
. - Boundary matchers match the empty string, but only
at certain places.
^
and$
match the beginning and end of a string, respectively (but see below).\G
matches the point at which the last match ended (this makes sense only inScanners
,Matchers
, or other kinds of things that have a notion of "the last thing that was matched").\b
matches a "word boundary", a place where a word begins or ends. - The sequence
(?m)
always matches the empty string, but has a side effect of causing^
and$
to match the beginnings and ends of lines as well as of entire strings. - The two-character escape sequences
\?
,\*
,\.
,\+
, etc., match the character after the backslash, ignoring their special significance. Thus, the patternwho\?
matches the string "who?", and would be written in a program as the string literal "who\?".
Experiment #2: Matching
Compile and run the Matching
class. This class allows you to
type in strings and patterns and see if the entire string matches the
pattern. If you include any groups (read ahead if you're
curious), it will also print those. For example:
$ java Matching
Alternately strings to match and patterns to
match against them. Use \ at the end of line
to enter multi-line strings or patterns (\
are removed, leaving newlines).
The program will indicate whether each pattern matches
the ENTIRE preceding string.
Enter QUIT to end the program.
String: 123456
Pattern: [0-9]{6}
Matches.
String: 123456
Pattern: [0-9]{5}
No match.
String: 12345
Pattern: [0-9]{6}
No match.
String: abdeffff
Pattern: ab(c|de)f+
Matches.
Group 1: 'de'
String: abbbbdefefgg*h
Pattern: a(b+)d(ef)+gg\*h
Matches.
Group 1: 'bbbb'
Group 2: 'ef'
String: QUIT
Use this class to experiment with how patterns work. Try writing patterns that match the following. Sample answers are given for each problem (drag the mouse over the white area after "Answer:" to see it).
- A single digit between 5 and 8. Answer: [5-8].
- Sequences of lower case letters. Answer: [a-z]+
- Sequences of lower case letters except the letter j. Answer: [a-i|k-z]+
- Sequences of characters that start with the uppercase letter A and end with the letter f. Answer: A.*f
- Sequences of three words separated by spaces, where a word is defined as a sequence of lower case letters. Answer: [a-z]+ +[a-z]+ +[a-z]
- Sequences of three words separated by spaces, and where group 1 corresponds to the second word. Answer: [a-z]+ +([a-z]+) +[a-z]
Programming task
Open P2Pattern.java
, and fill in the string with a pattern that
matches lists of non-negative (>= 0) integers in the notation we used
in homework
2 (e.g. (1, 2, 33, 1, 63)
). Each number should be followed
by a comma then any number of spaces.
Run P2PatternTester to verify that your pattern is correct.