CS 162 Lecture Notes
Prof. Alan Jay Smith

Topic: File Structure, I/O Optimization

+ File: a named collection of bits (usually stored on disk).
  + From the OS's standpoint, the file consists of a bunch of blocks stored on the device.
  + The programmer may actually see a different interface (bytes or records), but this doesn't matter to the file system (just pack bytes into blocks, and unpack them again on reading).
  + A file may have attributes and properties, e.g. name(s), protection, type (numeric, alphabetic, binary, C program, Fortran program, data, etc.), time of creation, time of last use, time of last modification, owner, length, link count, layout (format).
+ How do we (or can we) use a file?
  + Sequential: information is processed in order, one piece after the other. This is by far the most common mode: e.g. the editor writes out a new file, the compiler compiles it, etc.
  + Random access: can address any block in the file directly, without passing through its predecessors. E.g. the data set for demand paging, libraries, databases. Need to know which block we want (i.e. some sort of index or address is needed).
  + Keyed: search for blocks with particular values, e.g. a hash table, associative database, dictionary. Usually not provided by the operating system (but is provided in some IBM systems). Keyed access can be considered a form of random access.
+ Modern file and I/O systems must address four general problems:
  + 1. Disk management:
    + Efficient use of disk space
    + Fast access to files
    + File structures
    + Device use optimization
    + The user has a hardware-independent view of the disk. (Mostly, so does the OS.)
  + 2. Naming: how do users refer to files?
    + This concerns directories, links, etc.
  + 3. Protection: all users are not equal.
    + Want to protect users from each other.
    + Want to have files from various users on the same disk.
    + Want to permit controlled sharing.
  + 4. Reliability
    + Information must last safely for long periods of time.
+ Disk Management:
  + How should the blocks of the file be placed on the disk?
  + How should the map used to find and access the blocks look?
  + File descriptor: a data structure that gives the file attributes and contains the map which tells you where the blocks of your file are.
    + File descriptors are stored on disk along with the files (when the files are not open).
  + Some system, user and file characteristics:
    + Most files are small. In Unix, most files are very small: lots of files with a few commands in them, etc.
    + Much of the disk is allocated to large files.
    + Many of the I/O operations are made to large files.
    + Most (between 60% and 85%) of the I/Os are reads.
    + Most I/Os are sequential.
    + Thus the per-file cost must be low, but large files must have good performance.
+ File Block Layout and Access
  + Contiguous
  + Linked
  + Indexed or tree structured
  + Note: this is just standard data structures stuff, but on disk.
+ Contiguous allocation:
  + Allocate the file in a contiguous set of blocks or tracks.
  + Keep a free list of unused areas of the disk. When creating a file, make the user specify its length and allocate all the space at once. The descriptor contains the location and size.
  + Advantages:
    + Easy access, both sequential and random
    + Low overhead
    + Simple
    + Few seeks
    + Very good performance for sequential access.
  + Drawbacks:
    + Horrible fragmentation will make large files impossible.
    + Hard to predict needs at file creation time.
    + May overallocate.
    + Hard to enlarge files.
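  + As a minimal sketch of why contiguous allocation gives such cheap access (the struct and function names here are illustrative, not from any real system): the whole map is a (start, length) pair, and block lookup is a single addition.

        /* Contiguous allocation: the descriptor's "map" is just a
           start block and a length. */
        #include <assert.h>

        struct cont_file {
            unsigned start;    /* first disk block of the file */
            unsigned nblocks;  /* file length in blocks */
        };

        /* Map a logical block number within the file to a disk block.
           O(1), which is why both sequential and random access are easy. */
        unsigned cont_bmap(const struct cont_file *f, unsigned logical)
        {
            assert(logical < f->nblocks);
            return f->start + logical;
        }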
  + Can improve this scheme by permitting files to be allocated in extents: ask for a contiguous block; if it isn't enough, get another contiguous block.
    + Example: IBM OS/360 permits up to 16 extents. Extra space in the last extent can be released after the file is written.
+ Linked files: link the blocks of the file together as a linked list. In the file descriptor, just keep a pointer to the first block; in each block of the file, keep a pointer to the next block.
  + Advantages? Files can be extended, and there are no external fragmentation problems. Sequential access is easy: just chase links.
  + Drawbacks? Random access requires sequential access through the list. Lots of seeking, even in sequential access. Some overhead in each block for the link.
  + Example: TOPS-10, sort of. Alto, sort of.
+ (Simple) indexed files: the simplest approach is to just keep an array of block pointers for each file. The maximum file length must be declared when the file is created. Allocate an array to hold pointers to all the blocks, but don't allocate the blocks. Then fill in the pointers dynamically using a free list.
  + Advantages?
    + Not as much space is wasted by overpredicting; both sequential and random access are easy. We only waste space in the index.
  + Drawbacks?
    + May still have to set a maximum file size (can have an overflow scheme if the file is larger than the predicted maximum).
    + Blocks are probably allocated randomly over the disk surface, and there will be lots of seeks.
    + The index array may be large, and may require a large file descriptor.
+ Multi-level indexed files: the VAX Unix solution (version 4.4).
  + In general, any sort of multi-level tree structure. More specifically, we describe what Berkeley 4.3BSD Unix does (a lookup sketch appears at the end of this allocation discussion):
  + File descriptors: 15 block pointers. The first 12 point to data blocks, the next three to indirect, doubly-indirect, and triply-indirect blocks (256 pointers in each indirect block). The maximum file length is fixed, but large. Descriptor space isn't allocated until needed.
  + Advantages: simple, easy to implement, incremental expansion, easy access to small files. Good random access to blocks. Easy to insert a block in the middle of a file. Easy to append to a file. Small file map.
  + Drawbacks:
    + The indirect mechanism doesn't provide very efficient access to large files: 3 descriptor operations for each real operation. (When we "open" the file, we can keep the first level or two of the file descriptor around, so we don't have to read it each time.)
    + The file isn't generally allocated contiguously, so we have to seek between blocks.
+ Block allocation:
  + If all blocks are the same size, can use a bit map solution.
    + One bit per disk block.
    + Cache parts of the bit map in memory. Select a block at random (or not randomly) from the bitmap.
  + If blocks are of variable size, can use a free list.
    + This requires free storage area management: fragmentation and compaction.
  + In Unix, free blocks are grouped for efficiency: each block on the free list contains pointers to many free blocks, plus a pointer to the next list block. Thus there aren't many references involved in allocation or deallocation.
    + Block-by-block organization of the free list means that file data gets spread around the disk.
+ A more efficient solution (used in the DEMOS system built at Los Alamos):
  + Allocate groups of sequential blocks. Use the multi-level index scheme described above, but each pointer isn't to one block; it is to a sequence of blocks.
  + When we need another block for a file, we attempt to allocate the next physical block on the track (or cylinder).
    + If we can't do it sequentially, we try to do it nearby.
  + If we have detected a pattern of sequential writing, then we grab a bunch of blocks at a time (and release them if unused). (The size of the bunch will depend on how many sequential writes have occurred so far.)
  + Keep part of the disk unallocated always (as Unix does now); then the probability that we can find a sequential block to allocate is high.
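+ The promised lookup sketch: a hedged rendition of the multi-level index walk (12 direct pointers, then three levels of indirect blocks with 256 pointers each, as described above). read_ptr_block() is a hypothetical helper that reads one pointer out of an on-disk indirect block; the real 4.3BSD bmap() routine differs in detail.

        /* Multi-level index lookup: NDADDR direct pointers, then
           single, double, and triple indirect blocks, NINDIR pointers
           per block. */
        #define NDADDR 12
        #define NINDIR 256

        extern unsigned read_ptr_block(unsigned blkno, unsigned index);

        unsigned bmap(const unsigned direct[NDADDR],
                      const unsigned indir[3], unsigned long lbn)
        {
            if (lbn < NDADDR)
                return direct[lbn];        /* small files: one step */
            lbn -= NDADDR;

            unsigned long span = NINDIR;   /* blocks reachable at this level */
            for (int level = 0; level < 3; level++) {
                if (lbn < span) {
                    unsigned b = indir[level];
                    /* walk down level+1 pointer blocks */
                    for (unsigned long s = span / NINDIR; ; s /= NINDIR) {
                        b = read_ptr_block(b, (unsigned)(lbn / s % NINDIR));
                        if (s == 1)
                            break;
                    }
                    return b;
                }
                lbn -= span;
                span *= NINDIR;
            }
            return 0;                      /* beyond maximum file size */
        }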
+ I/O Optimization
+ Block size optimization
  + Small blocks:
    + mean small I/O buffers (I/O buffers are used for both reads and writes);
    + are quickly transferred;
    + require lots more transfers for a fixed amount of data;
    + have high overhead on disk: wasted bytes for every disk block (inter-record gaps, header bytes, ERC bytes);
    + need more entries in the file descriptor (inode) to point to blocks;
    + cause less internal fragmentation;
    + with random allocation, cause more seeks.
  + Optimal block sizes tend to range from 2K to 8K bytes, with the optimum increasing with improvements in technology.
  + Berkeley Unix uses 4K blocks (now 8K?). The basic (hardware) block size on the VAX is 512 bytes.
  + Berkeley Unix also uses fragments that are 1/4 the size of the logical block size.
+ Disk arm scheduling: in timesharing systems, it may sometimes be the case that there are several disk I/Os requested at the same time. (A sketch of two of these policies appears after the rotational scheduling discussion below.)
  + First come first served (FIFO, FCFS): may result in a lot of unnecessary disk arm motion under heavy loads.
    + (diagram omitted)
  + Shortest seek time first (SSTF): handle the nearest request first. This can reduce arm movement and result in greater overall disk efficiency, but some requests may have to wait a long time.
    + The problem is starvation. Imagine that the disk is heavily loaded, with 3 open files, two of them located near the center of the disk and the other near the edge. The disk can be kept fully busy servicing the first two files while ignoring the last one.
  + Scan: like an elevator. Move the arm in one direction, servicing requests, until there are no additional requests in that direction; then reverse direction and continue.
    + This algorithm doesn't get hung up in any one place for very long, and it works well under heavy load. But it may not get the shortest seek.
    + It also tends to neglect files at the periphery of the disk.
  + CScan (circular scan): like a one-way elevator; it moves only in one direction. When it finds no further requests in the scan direction, it returns immediately to the furthest request in the other direction and resumes the scan.
    + This treats all files (and tracks) equally, but has a somewhat higher mean access time than Scan.
  + SSTF has the best mean access time. Scan or CScan can be used if there is a danger of starvation.
  + Most of the time there aren't very many disk requests in the queue, so this isn't a terribly important decision.
  + Also, if contiguous allocation is used (as with OS/360), then seeks are seldom required.
+ Rotational scheduling
  + It is rare to have more than one request outstanding for a given cylinder. (This was more relevant when drums were used.)
  + SRLTF (shortest rotational latency first) works well.
  + But rotational scheduling can be useful for writing data, if we don't have to write back to the same location (log structured file system).
  + Rotational scheduling is hard using logical block addresses (LBAs), since you don't know the rotational position or the number of blocks per track.
  + Rotational and seek scheduling can be usefully combined (into shortest time to next block) if done in the onboard disk controller, which should know the angular and radial position.
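+ The promised scheduling sketch: toy C versions of SSTF and Scan selection over a queue of pending cylinder numbers. The function names are illustrative; real schedulers also track rotational position, request age, and more.

        #include <stdlib.h>

        /* SSTF: pick the request closest to the current head position. */
        int sstf_pick(const int *req, int n, int head)
        {
            int best = 0;
            for (int i = 1; i < n; i++)
                if (abs(req[i] - head) < abs(req[best] - head))
                    best = i;
            return best;  /* may starve far-away requests under load */
        }

        /* Scan: pick the nearest request in the current direction
           (dir is +1 or -1); the caller reverses direction when this
           returns -1 (nothing ahead of the head). */
        int scan_pick(const int *req, int n, int head, int dir)
        {
            int best = -1;
            for (int i = 0; i < n; i++)
                if ((req[i] - head) * dir >= 0 &&
                    (best < 0 || abs(req[i] - head) < abs(req[best] - head)))
                    best = i;
            return best;
        }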
+ Skip-sector or interleaved disk allocation
  + Imagine that you are reading the blocks of a file sequentially and quickly, and the file is allocated sequentially.
  + Usually, you will find that you try to read a block just after the start of that block has passed under the head.
  + The solution is to allocate file blocks to alternate disk blocks or sectors. Then we haven't yet passed a block when we want to read it.
  + Note that if all bits read are immediately placed into a semiconductor buffer, this is unnecessary.
+ Track offset for head and cylinder switching
  + It takes time to switch between heads on different tracks or cylinders. Thus we may want to skip several blocks when moving sequentially between tracks, to allow time for the head to be selected.
+ File placement
  + Seek distances will be minimized if commonly used files are located near the center of the disk.
    + Even better results are obtained if reference patterns are analyzed and files that are frequently referenced together are placed near each other.
  + The frequency of seeks, and queueing for disks, will be reduced if commonly used files (or files used at the same time) are located on different disks.
    + E.g. spread the paging data sets and operating system data sets over several disks.
+ Disk caching
  + Keep a cache of recently used disk blocks in main memory.
    + Recently read blocks are retained in the cache until replaced.
    + Writes go to the disk cache, and are later written back.
    + The cache would typically include the index blocks for an open file.
  + Also use the cache for read-ahead and write-behind.
  + Can load entire disk tracks into the cache at once.
  + This typically works quite well: hit ratios of 70-90%.
  + Can also do caching in the disk controller; most controllers these days have 64K-4MB of cache/buffer in the controller. This is mostly useful as a buffer, not a cache, since the main memory cache is so much larger.
+ Prefetching and data reorganization
  + Since disk blocks are often read (and written) sequentially, it can be very helpful to prefetch ahead of the current read point.
  + It is also therefore useful to make sure that the physical layout of the data reflects the logical organization of the data, i.e. that logically sequential blocks are also physically sequential. Thus it is useful to periodically reorganize the data on the disk.
+ Data replication
  + Frequently used data can be replicated at multiple locations on the disk.
  + This means that on writes, the extra copies must either be updated or invalidated.
  + ALIS (automatic locality-improving storage)
    + The best results are obtained when the techniques are combined: reorganize to make sequential, cluster, and replicate.
+ RAID
  + Observations:
    + Small disks are cheaper than large ones (due to economies of scale).
    + The failure rate is roughly constant per disk, independent of disk size.
    + Therefore, if we replace a few large disks with lots of small disks, the failure rate increases.
  + Solution:
    + Interleave the blocks of the file across a set of smaller disks, and add a parity disk.
    + Note that since we presume (a) only one disk fails at a time, and (b) we know which disk failed, we can reconstruct the failed disk (a reconstruction sketch follows).
    + Can do parity in two directions for extra reliability.
  + Advantage:
    + Improves read bandwidth.
  + Problem:
    + We have to write the parity disk on every write; it becomes a bottleneck.
    + A solution: interleave on a different basis than the number of disks. That means that the parity disk varies, and the bottleneck is spread around.
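  + A small sketch of the reconstruction argument (names are illustrative): since parity is the XOR of the data blocks, the failed disk's block is simply the XOR of all surviving blocks, parity included. This is also why we must know which disk failed.

        #include <stddef.h>

        /* Rebuild one block of the failed disk from the ndisks-1
           surviving blocks (data + parity). */
        void raid_reconstruct(unsigned char *out,
                              unsigned char *const surviving[],
                              size_t ndisks, size_t blksize)
        {
            for (size_t i = 0; i < blksize; i++) {
                unsigned char x = 0;
                for (size_t d = 0; d + 1 < ndisks; d++)
                    x ^= surviving[d][i];
                out[i] = x;    /* missing byte = XOR of the rest */
            }
        }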
+ Types of RAID:
  + RAID 0: ordinary disks (striping, no redundancy)
  + RAID 1: replication (mirroring)
  + RAID 4: parity disk in a fixed location
  + RAID 5: parity disk in a varying location

Topic: Directories and Other File System Topics

+ Naming:
  + How do users refer to their files?
  + How does the OS refer to the file itself?
  + How does the OS find the file, given its name?
+ The file descriptor is a data structure or record that describes the file.
  + The file descriptor information has to be stored on disk, so it will stay around even when the OS doesn't. (Note that we are assuming that disk contents are permanent.)
  + In Unix, all the descriptors are stored in a fixed-size array on disk. The descriptors also contain protection and accounting information.
  + A special area of the disk is used for this (the disk contains two parts: the fixed-size descriptor array, and the remainder, which is allocated for data and indirect blocks).
  + The size of the descriptor array is determined when the disk is initialized, and can't be changed. In Unix, the descriptor is called an inode (index node), and its index in the array is called its i-number. Internally, the OS uses the i-number to refer to the file.
  + IBM calls the roughly equivalent structure the volume table of contents (VTOC).
+ The inode is the focus of all file activity in Unix. There is a unique inode allocated for each file, including directories. An inode is "named" by its dev/i-number pair. (iget/iget.c)
+ Inode fields (a condensed struct sketch appears at the end of this discussion):
  + reference count (number of times open)
  + number of links to the file
  + owner's user id, owner's group id
  + number of bytes in the file
  + time last accessed, time last modified, time the inode itself last changed
  + disk block addresses and indirect blocks (discussed previously)
  + flags: inode is locked, file has been modified, some process is waiting on a lock
  + file mode (the type of file: character special, directory, block special, regular, symbolic link, socket)
    + A socket is an endpoint of a communication, referred to by a descriptor, just like a file or a pipe. Two processes can each create a socket and then connect those two endpoints to produce a reliable byte stream. (A pipe requires a common parent process; a socket does not, and the processes may be on different machines.)
  + (items below not in the text on 4.4BSD)
  + protection info: set user id on execution, set group id on execution, read, write, execute permissions, sticky bit? (check)
  + count of shared locks on the inode
  + count of exclusive locks on the inode
  + unique identifier
  + file system associated with this inode
  + quota structure controlling this file
+ When a file is open, its descriptor is kept in main memory. When the file is closed, the descriptor is stored back to disk.
  + There is usually a per-process table of open files.
    + In Unix, there is a process open file table, with one entry for each file opened. The integer index into that table is the handle for that open file. Multiple opens of the same file will get multiple entries. (Note that if a process forks, a given entry can be shared by several processes.)
    + (Standard-in is #0, standard-out is #1, and stderr is #2; these must be per process.)
  + Unix also has a system open file table, which points to the inode for the file (in the inode table). This table is system wide. It maps names to files.
  + There is also the inode table, a system-wide table holding active and recently used inodes.
  + The descriptor is kept in OS space, which is paged, so it may be necessary to take a page fault to get to the descriptor info.
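+ The promised struct sketch: a condensed rendition of the inode fields listed above, using portable types. Real 4.xBSD declarations differ in names, sizes, and layout.

        #include <stdint.h>
        #include <time.h>

        #define NDADDR 12   /* direct block pointers */
        #define NIADDR 3    /* single/double/triple indirect */

        struct inode_sketch {
            uint16_t mode;                /* file type + protection bits */
            int16_t  nlink;               /* number of hard links */
            uint32_t uid, gid;            /* owner and group */
            int64_t  size;                /* bytes in file */
            time_t   atime, mtime, ctime; /* accessed/modified/inode changed */
            int32_t  db[NDADDR];          /* direct disk block addresses */
            int32_t  ib[NIADDR];          /* indirect block addresses */
            uint32_t flags;               /* locked, modified, wanted, ... */
        };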
+ Users need a way of referencing the files that they leave around on disk. One approach is just to have users remember descriptor indexes, i.e. the user would have to remember something like the number of the descriptor. Unfortunately, that is not very user friendly.
+ Of course, users want to use text names to refer to files. Special disk structures called directories are used to tell which descriptor indices correspond to which names.
+ Approach #1: have a single directory for the whole disk. Use a special area of the disk to hold the directory.
  + The directory contains <name, descriptor index> pairs.
  + Problems:
    + If one user uses a name, no one else can.
    + If you can't remember the name of a file, you may have to look through a very long list.
    + Security problem: people can see your file names (which can be dangerous).
  + Old personal computers (pre-Windows) worked this way.
+ Approach #2: have a separate directory for each user (the TOPS-10 approach). This is still clumsy: names from a user's different projects get confused, and you still can't remember the names of files.
  + IBM's VM is similar to this. Files have a 3-part name, <name, type, location>, where the location is A, B, C, etc. (i.e. which disk). Very painful. (Also, file names are limited to 8 characters.)
+ Approach #3, the Unix approach: generalize the directory structure to a tree.
  + Directories are stored on disk just like regular files (i.e. a file descriptor with block pointers, etc.).
    + User programs can manipulate directories almost like any other file. Only special system programs may write directories.
  + Each directory contains <name, descriptor index> pairs. The file pointed to by the index may be another directory. Hence, we get a hierarchical tree structure. Names have slashes separating the levels of the tree.
  + There is one special directory, called the root. This directory has no name, and is the file pointed to by descriptor 2 (descriptors 0 and 1 have other special purposes).
    + Note that we need the root; otherwise we would have no way to reach any files. From the root, we can get anywhere in the file system.
  + The full file name is the path name, i.e. the full name from the root.
  + A directory consists of some number of blocks of DIRBLKSIZ bytes, where DIRBLKSIZ is chosen such that each block can be transferred to disk in a single atomic operation (e.g. 512 bytes on most machines).
    + Each directory block contains some number of directory entry structures, which are of variable length. Each directory entry has info at the front of it, containing its inode number, the length of the entry, and the length of the name contained in the entry. These are followed by the name, padded to a 4-byte boundary with null bytes. All names are guaranteed null terminated. (A struct sketch appears at the end of this discussion.)
  + Note that in Unix, a file name is not the name of a file; it is only a name by which the kernel can search for the file. The inode is really the "name" of the file.
  + Each pointer from a directory to a file is called a hard link.
    + In some systems, there is a distinction between a "branch" and a "link", where the link is a secondary access path, and the branch is the primary one (it goes with ownership).
  + You "erase" a file by removing a link to it. In reality, a count is kept of the number of links to a file; it is only really erased when the last link is removed.
    + To really erase a file, we put the blocks of the file on the free list.
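  + The promised struct sketch: a rendition of the variable-length directory entry just described, modeled loosely on 4.3BSD's struct direct. The fixed-size name array is a simplification; on disk, each entry is packed and occupies only d_reclen bytes.

        #include <stdint.h>

        #define MAXNAMLEN 255

        struct direct_sketch {
            uint32_t d_ino;                 /* inode number of the entry */
            uint16_t d_reclen;              /* total length of this record */
            uint16_t d_namlen;              /* length of the name */
            char     d_name[MAXNAMLEN + 1]; /* name, null terminated and
                                               null-padded to a 4-byte
                                               boundary on disk */
        };

        /* Step to the next entry within a directory block. */
        #define NEXT_ENTRY(dp) \
            ((struct direct_sketch *)((char *)(dp) + (dp)->d_reclen))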
+ Symbolic links
  + There are two ways to "link" to another directory or file. One is a direct pointer; in Unix, such links may not cross "file systems", i.e. may not point to another disk.
  + Alternatively, we can use symbolic links, in which instead of pointing to the file or directory, we store a symbolic name for that file or directory.
  + We need to be careful not to create cycles in the directory system; otherwise recursive operations on the file system will loop (e.g. cp -r). In Unix, this is solved by not permitting hard links to existing directories (except by the superuser).
+ Pros and cons of the tree-structured directory scheme
  + Can organize files in a logical manner. It is easy to find the file you're looking for, even if you don't exactly remember its name.
  + The "name" of the file is in fact a concatenation of the path from the root. Thus the name is actually quite long, and provides semantic information.
  + Can have duplicate names, if the paths to the files are different.
  + Can (assuming the protection scheme permits) give away access to a subdirectory and the files under it, without giving access to all files. (Note: Unix does not permit multiple hard links to a directory, unless done by the superuser.)
  + Access to a file requires reading only the relevant directories, not the entire list of files. (My list of files prints out to a half-inch-thick printout: 20,000 files.)
  + The structure is more complex to move around and maintain.
  + A file access may require that many directories be read, not just one.
  + It is very nice that directories and file descriptors are separate, and that directories are implemented just like files. This simplifies the implementation and management of the structure (we can write "normal" programs to manipulate them as files).
    + I.e. the file descriptors are things the user shouldn't have to touch, while directories can be treated as normal files.
+ Working directory: it is cumbersome to constantly have to specify the full path name for all files.
  + In Unix, there is one directory per process, called the working directory, which the system remembers.
  + This is not the same as the home directory, which is where you are at login time, and which is in effect the root of your personal file system.
  + Every user has a search path, which is a list of directories in which to look to resolve a file name. The first element is almost always the working directory. (A sketch of this lookup appears at the end of this discussion.)
  + "/" is an escape to allow full path names. I.e. most names are relative file names; ones beginning with "/" are full (complete) path names.
  + Note that in Unix, the search path is maintained by the shell. If any other program wants to do the same, it has to rebuild the facilities from scratch. This should really be in the OS. ("set path" in .cshrc or .login.)
    + My path is: (. ~/bin /usr/new /usr/ucb /bin /usr/bin /usr/local /usr/hosts ~/com)
  + Basically, we want to look in the working directory, then the system library directories.
    + We probably don't want a search strategy that actually searches more widely. If it did, it might find a file that wasn't really the target.
    + This is yet another example of locality.
  + There is a simple means to change the working directory: "cd". You can also refer to the directories of other users by prefacing their login names with "~".
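  + A minimal sketch of the search-path lookup the shell performs (illustrative code, not the actual shell's): try the name in each directory of the path, in order, until one resolves.

        #include <stdio.h>
        #include <unistd.h>

        /* Return 0 and fill 'full' with the resolved path on success,
           -1 if the name is not found in any path directory. */
        int search_path(const char *name, const char *path[], int npath,
                        char *full, size_t fullsz)
        {
            for (int i = 0; i < npath; i++) {
                snprintf(full, fullsz, "%s/%s", path[i], name);
                if (access(full, X_OK) == 0)   /* executable here? */
                    return 0;
            }
            return -1;
        }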
+ Operations on files
  + Open: put a file descriptor into your table of open files; those are the files that you can use. May require that locks be set and a user count be incremented. (If any locking is involved, may have to check for deadlock.)
  + Close: the inverse of open.
  + Create a file: sometimes done automatically by open.
  + Remove (rm) or erase: drop the link to the file. Put the blocks back on the free list if this is the last link.
  + Read: read a record from the file. (This usually means that there is an "access method", i.e. I/O code, which deals with the user in terms of records, and with the device in terms of physical blocks.)
  + Write: like read, but may also require disk space allocation.
  + Rename ("mv" or "move"): rename the file. Unix combines two different operations here. Rename would strictly involve changing the file name within the same directory; "move" moves the file from one directory to another. Unix does both with one command.
    + Note that mv also silently destroys the old file if one already exists with the new name (which could be considered a bug, not a feature).
  + Seek: move to a given location in the file.
  + Synch: write blocks of the file from the disk cache back to disk.
  + Change properties (e.g. protection info, owner).
  + Link: add a link to a file.
  + Lock & unlock: lock/unlock the file.
  + Partial erase (truncate).
  + Note that commands such as "copy", "cat", etc., are built out of the simpler commands listed above.
+ Pseudo files
  + We have commands such as "read" and "write" for files. We want to do similar things to devices (e.g. terminal, printer, etc.). There is no reason not to treat I/O devices as files, and we can do so. These are called "pseudo files".
+ File backup and recovery
  + The problem: we want to avoid losing files due to:
    + 1. System crashes (hardware or software)
      + a. Physical hard failure, usually a head crash.
      + b. Software failure.
      + c. General system failure while the file is open. (This is the most common problem and the one we are usually concerned with. Usually a power failure.)
    + 2. User errors
      + Want to be able to get files back after we have destroyed them (overwritten or erased). (Unix doesn't provide this.)
    + 3. Sabotage and malicious users.
  + Approaches:
    + Periodic full dump: periodically dump all of the files (and directories) to backup storage, such as tape. The system can be reloaded from the dump tape. Sometimes called a checkpoint dump.
      + Note: the system has to be shut down during dumping. Slow. Recovery is only back to the last dump, not up to date. Large amount of data: slow to dump, and a large number of tapes.
    + Incremental (periodic) dump: dump all modified files periodically, e.g. when the user logs out, or after a file is closed. Thus we can lose a file only while it is open.
      + Disadvantages: large quantities of data, and a long and involved recovery procedure.
  + One recovery problem is that after a software or hardware crash, some tables may be left in an inconsistent condition (e.g. the free list may be wrong). It is then also necessary to fix all the tables. Very system dependent.
  + There are several approaches to the problem of a crash while modifying a file:
    + Work on a copy of the file and swap it for the original when you close.
      + This is usually what an editor (e.g. vi) does.
      + What if the file is open by more than one person at the same time?
      + How do we make the "swap" atomic? (See below.)
    + Write a log of all changes to the file, so we can back up, if necessary (audit trail).
    + Write a list of changes to the file prior to modifying the file, so we can restart the list at the point at which a crash occurred (intentions list, or log-tape write-ahead protocol).
    + Keep multiple copies of the file. Update one, and then copy the update to the second (careful replacement).
    + Make a new copy of any part of the file as it is modified. Replace the old parts with the new parts when we close the file (differential file).
      + I.e. duplicate the file descriptor; update the new copy of the file descriptor as new copies of blocks are made; swap the new file descriptor for the old when the file is closed.
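  + A sketch of the copy-and-swap technique, and of one way the swap can be made atomic on Unix: write the new version to a temporary file, force it to disk, then rename(2) it over the original. rename is atomic, so a crash leaves either the whole old file or the whole new one, never a mixture. (The temp-file naming below is illustrative, and a careful implementation would also fsync the containing directory.)

        #include <stdio.h>
        #include <unistd.h>
        #include <fcntl.h>

        int replace_file(const char *path, const void *data, size_t len)
        {
            char tmp[4096];
            snprintf(tmp, sizeof tmp, "%s.tmp", path);

            int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0)
                return -1;
            if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
                close(fd);
                unlink(tmp);    /* abandon the partial copy */
                return -1;
            }
            close(fd);
            return rename(tmp, path);   /* the atomic "swap" */
        }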
Topic: Networks and Communication Protocols

+ Two trends:
  + Lots of small machines.
  + Lots of computers everywhere, and a need to communicate.
+ Problem: communication and cooperation are difficult.
  + How do people on the same project share files?
  + How does new software get distributed to all users?
  + How is electronic mail handled?
+ Solution: tie machines together with networks, and develop message protocols that allow communication and cooperation again.
+ Goal: ideally, we would like all the computers to look like one very large, unified system. We could share files, communicate, etc. as if it were one timesharing system.
  + But we will always be able to tell the difference, due to performance.
+ Wide area networks (WANs): networks that connect sites that are geographically far apart.
+ Local area networks (LANs): developed in the mid-70's to hook together personal computers. The most popular interconnection for LANs is Ethernet. LANs are used very differently than wide area networks.
+ Examples of networks:
  + ARPAnet (Defense Advanced Research Projects Agency): the first and most famous network, developed in the early 70's but still in use. Connected together large timesharing systems all over the country using leased phone lines. Provided mail, file transfer, and remote login.
    + The ARPANET used IMPs (Interface Message Processors) as routers and TIPs (Terminal Interface Processors) to connect from a terminal.
  + Usenet: developed in the late 70's and early 80's. Unix systems phone each other up to send mail and transfer files.
  + CSnet: developed to be a less expensive clone of the Arpanet, and to tie together CS departments.
  + BITNET: ties together mostly sites using IBM equipment, including a lot of physics laboratories.
  + VNET: IBM's internal corporate network. Has highly secure gateways connecting it to CSnet.
  + DECNET: DEC's network system. The name of a product; it also refers to DEC's corporate communication system.
  + Miscellaneous commercial nets, such as America Online, Prodigy, and Compuserve.
  + Internetworks: mechanisms for tying together many existing networks, such as the ARPAnet, Usenet, and LANs.
    + The Internetwork is the combination of most of the above; they are now widely interconnected.
+ Network hardware
  + LAN: usually Ethernet, which uses either a shared cable, or wires from each machine to a hub or switch.
  + WAN: point-to-point links (used by most early networks). Examples are leased phone lines (50k bits/sec on the ARPAnet), RS232 connections, T1 and T3 lines, regular phone calls, satellite links, etc.
+ Network topologies:
  + Fully connected: every site can talk directly to any other site (e.g. Usenet).
  + Partially connected: star and ring are the most popular. Intermediate nodes must forward messages.
  + Multi-access bus / broadcast (used by most LANs today): a single cable or group of cables connects many machines together. The best example is Ethernet (one wire). An alternative is radio broadcast.
+ Network performance parameters
  + Networks are usually characterized in terms of two performance parameters:
    + Latency: the minimum time to get the minimum amount of information between two sites.
      + Note the difference between transmission latency, which is the time for a given bit to get from one end to the other after the connection is set up, and set-up latency, which is the time to get the first bit there.
    + Bandwidth: once information is flowing, how many bits per second can be transmitted (i.e. the marginal cost per bit).
  + Note also cost.
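  + A worked example (with illustrative numbers, not from these notes): sending a 1 Mbyte file over a 10 Mbit/sec link with 50 msec of set-up latency takes about 0.05 + (8 x 10^6 bits)/(10^7 bits/sec) = 0.85 seconds, dominated by bandwidth; sending a 100-byte packet over the same link takes about 0.05 + 800/10^7 = 0.05008 seconds, dominated almost entirely by latency. Small transfers are latency-bound, large ones bandwidth-bound.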
+ Protocols:
  + These are the key to networks. A protocol is just an agreement among the parties on the network about how information will be transmitted between them and what the information format is. There are many different protocols for doing different things (e.g. mail, file transfer, remote login). Typically, protocols are built up in layers. Section 15.6 of the book lists the 7 ISO protocol layers.
  + Hierarchical protocols relate the layers within a given system; this can be a network system, operating system, etc.
  + Peer-to-peer protocols relate the same layer of different systems. They rely on the lower layers to actually communicate across machines.
+ ISO protocol layers
  + 1. Physical layer (the lowest): determines the electrical mechanisms for transmitting bits: voltages, delays, currents, etc.
  + 2. Data link layer: how to get packets between two directly connected components. Includes error detection and recovery from the physical layer.
  + 3. Network layer: responsible for providing connections (to nodes that are not directly connected) and routing packets. Takes care of addresses. (Takes care of route changes due to changing loads.)
  + 4. Transport layer: low-level access to the network. Breaks messages into packets, keeps packets in order, flow control, physical address generation. (Takes care of retransmission of lost or destroyed packets.)
  + 5. Session layer: process-to-process protocols.
  + 6. Presentation layer: resolves differences in formats between sites (e.g. character types, number representation, full/half duplex, etc.).
  + 7. Application layer: interacts with users. Supports electronic mail, distributed databases, etc.
+ Wide area networks are usually built up from interconnected local area networks (LANs). Local area networks are usually some type of broadcast network.
+ Broadcast networks: a single shared communication medium, with no central controller to allocate access to it.
  + The simplest scheme is the Aloha mechanism: just broadcast blindly, and use recovery protocols if a packet doesn't get through. This system has stability problems: you can't get more than 18% utilization of the channel (1/2e), and the system completely falls apart under heavy loads.
    + The Aloha system uses a satellite, so it has no choice: it can't listen first, since the delays are too long (1/4 second).
    + It can be improved with "slotted Aloha", in which messages occupy fixed slots; this doubles the usable bandwidth (to 1/e).
  + Ethernet (using a physical coax cable) adds two things.
    + The first is carrier sense: listen before broadcasting, defer until the channel is clear, then broadcast.
    + Also, listen while broadcasting. A collision can still happen if two stations start up at exactly the same time. If a collision is detected, jam the network so that everyone will know about the collision (don't waste time transmitting junk). Then wait a "random" interval and retry. If there are repeated collisions, wait longer and longer intervals. (A sketch of this backoff rule follows this discussion.)
    + This is called CSMA/CD (carrier sense multiple access, with collision detection).
  + Ethernet frame: destination address (6 bytes), source address (6 bytes), type (2 bytes), data (46-1500 bytes), frame check sequence (4 bytes).
  + Problems with basic (original) Ethernet:
    + Reliability: if any station jams the network, nobody can do anything; you can't even figure out who's doing it.
    + Fairness: there's no guarantee against starvation. People with real-time needs don't like this.
    + Bandwidth is limited to that of the cable (10 Mbits/sec).
    + The original Ethernet was limited to a site that could be connected by one cable (4000 feet).
      + Longer connections require a switch.
    + Security: it is relatively easy to listen to all traffic, and/or to tap the cable.
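  + The promised backoff sketch: the "wait longer and longer" rule is binary exponential backoff. After the k-th collision, wait a random number of slot times in [0, 2^k - 1], doubling the range each time. (The cap at 10 mirrors what real Ethernet interfaces do; the code itself is a toy.)

        #include <stdlib.h>

        /* How many slot times to wait after 'collisions' collisions. */
        unsigned backoff_slots(unsigned collisions)
        {
            if (collisions > 10)         /* cap the range */
                collisions = 10;
            unsigned range = 1u << collisions;  /* 2^k possible slots */
            return (unsigned)rand() % range;
        }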
+ More recent Ethernet designs:
  + Use a switch to route, rather than a shared cable.
  + Rates of 10 Mbit, 100 Mbit, and 1 Gbit/sec; 10 Gbit is under development.
  + Wireless Ethernet (802.11a, b, g).
  + These continue to use most of the Ethernet protocol: the frame format, timeouts, collision detection, and the software layers.
+ Ring networks: these are a type of broadcast network, with an additional protocol built on top of a ring-structured set of point-to-point links.
  + Normally, an electronic token (a special packet) circulates at high speed around the ring. If a station doesn't have anything to broadcast, it just retransmits everything it receives.
  + When ready to broadcast, a station waits until the token passes by. Instead of retransmitting the token, it sends its packet instead.
  + When the packet has been transmitted, the station puts the token back on the ring for the next station to use.
  + The packet loops all the way around and gets swallowed by the sender when it comes back again (the sender recognizes itself as the destination).
  + Problems with the ring system:
    + If any station dies, the token can't circulate, so the ring dies.
    + If the token is lost, the system dies.
    + Starvation is possible.
    + If a second token is accidentally created, the system can get messed up.
+ We can use an Ethernet or a ring network for the local net. We also need some way of constructing (structuring) a wide area net, i.e. of building an "internetwork".
+ Three methods for a link between two machines:
  + Circuit switching: like the telephone system; you have a circuit between the source and destination machines.
  + Packet switching: communications are broken into packets and sent piece by piece.
    + Note that packet switching can be used to build a virtual circuit: it looks like a circuit, but is actually packets on a shared medium.
  + Message switching: a virtual circuit exists for long enough to complete a message, and then the circuit is dropped. (Or a physical link can be used.)
+ Getting stuff where you want it.
  + Packets must be forwarded from machine to machine until they reach the destination. Machines that forward between networks are called gateways. The problem is how to get stuff where you want it.
  + Names vs. addresses vs. routes:
    + Name: a symbolic term for something: "Robert", or "ucbcory". Good for people to remember.
    + Address: where the thing is. In an internetwork situation, this usually consists of the number of the network, the number of the site on the network, the id of the host at the site, and sometimes a more specific host (e.g. a workstation). E.g. jones@chaos.netnode.berkeley.edu
    + Route: directions for how to get there from here (a sequence of hosts and links to pass through to reach the destination).
    + Sometimes the sender has to provide the route, e.g. in UUCP: hplabs!hp-pcd!hpcvc0!cliff. All each machine has to be able to do is remember its neighbors and forward messages. This is clumsy for users.
    + It's better if the hosts of the internetwork can figure out the routing for themselves. This involves a special protocol between the hosts to build routing tables. E.g. in the Internet, hosts send messages to their nearest neighbors and build up tables of the most direct paths from each host to each other host (fewest hops). (A toy update rule is sketched below.)
    + The difficulty with routing tables is that they get to be very large.
    + Note that routes can change dynamically; there can be more than one way to get from A to B. Note the danger of instability in routing if routes are changed for performance reasons.
  + In LANs, only the gateways have to worry about routing: all the other hosts just ship packets to the gateway, unless they are for a host on the local net.
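  + The toy update rule mentioned above, in the spirit of distance-vector routing (illustrative code, not any real router's): when a neighbor advertises its hop count to a destination, adopt the route through that neighbor if it is shorter.

        #define NHOSTS 64
        #define INFINITE_HOPS 9999

        struct route { int next_hop; int hops; };
        struct route table[NHOSTS];   /* one entry per destination */

        void init_routes(int self)
        {
            for (int d = 0; d < NHOSTS; d++) {
                table[d].next_hop = -1;
                table[d].hops = INFINITE_HOPS;
            }
            table[self].hops = 0;          /* we can reach ourselves */
            table[self].next_hop = self;
        }

        /* Neighbor 'n' advertises it can reach 'dst' in 'hops' hops. */
        void update_route(int dst, int n, int hops)
        {
            if (hops + 1 < table[dst].hops) {
                table[dst].hops = hops + 1;  /* +1 to reach the neighbor */
                table[dst].next_hop = n;
            }
        }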
+ Communications problems:
  + Packets can get lost:
    + Transmission errors.
    + The address is corrupted, and the packet circulates forever.
    + The contents of the packet are corrupted.
    + A host has all of its packet buffers full, so it has no place to put another incoming packet.
      + This can happen at an intermediate host if packets are arriving on a fast network but have to be forwarded onto a much slower network. This is called network congestion.
      + It can happen at the destination if the user process can't work fast enough to process all the packets as they arrive.
    + The receiver is down, and the sender sends anyway.
  + Packets can arrive out of order: if some hosts suddenly go down, or if routing tables change, packets might wander off into the network and come back much later. Most protocols include a time-to-live mechanism: after a certain time, packets are killed so that they don't wander endlessly.
+ Datagram protocols: used to deliver individual packets; the packets are not guaranteed to get through or to arrive in any particular order. This is useful for some applications, but not very many.
+ Most applications would like guarantees about delivery and order.
  + This is called a connection, and the protocols that implement it are called virtual circuit or transport protocols.
  + To do this, the sender and receiver must remember state about what has been happening.
  + Simple acknowledgement-based protocol (a sender-side sketch follows this list):
    + Store a serial number in each packet. The sender assigns serial numbers, incrementing for each packet.
    + The sender sends one or more packets.
    + The receiver sends an acknowledgement packet for each packet or group of packets.
    + The sender waits for the acknowledgement before sending the next (group of) packets (it must also save the old packets!).
    + If the sender doesn't receive an acknowledgement within a reasonable time, it assumes that the packet got lost and retransmits it.
    + Retransmission could result in the receiver getting two packets with the same serial number: it checks serial numbers and throws away duplicates and out-of-order packets.
    + The sender and receiver must negotiate about how far ahead the sender can send: otherwise the receiver might run out of buffer space and have to discard packets. This is called the flow control problem.
  + No matter what the virtual circuit mechanism, setting up the connection is complex and time-consuming. It's tricky to get two hosts to agree to communicate with each other and to get their state initialized correctly.
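  + The promised sender-side sketch (a stop-and-wait flavor of the protocol above; send_pkt() and recv_ack_within() are hypothetical transport hooks, not a real API). The receiver's half would compare serial numbers and discard duplicates and out-of-order packets.

        extern void send_pkt(unsigned serial, const void *data, int len);
        /* Returns nonzero and fills *acked if an ack arrives in time. */
        extern int  recv_ack_within(unsigned *acked, int millis);

        static unsigned next_serial = 0;

        void reliable_send(const void *data, int len)
        {
            unsigned serial = next_serial++;
            for (;;) {
                send_pkt(serial, data, len);  /* keep the packet saved */
                unsigned acked;
                if (recv_ack_within(&acked, 500) && acked == serial)
                    return;                   /* delivered */
                /* timeout or wrong ack: assume lost, retransmit */
            }
        }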
+ TCP/IP
  + TCP/IP is the collection of network protocols making up the Internet protocol suite.
  + History:
    + 1969: Arpanet with 4 nodes (UCLA, SRI, UCSB, Utah).
    + 1972: Arpanet demo (50 hosts).
    + mid-1970s: TCP developed, running on Unix (DEC PDP-11).
    + early 80s: Berkeley Unix runs TCP.
    + 1983: the Arpanet converts to TCP/IP. In use by Sun.
  + ISO levels:
    + Level 3, the network layer:
      + IP (Internet Protocol): provides host-to-host datagram delivery.
        + Provides packet routing, and insulates the higher levels from network-specific characteristics (e.g. packet size).
        + Fields of the IP packet header include: version, header length, total length, ID (the same for all fragments of a datagram), time-to-live, checksum, source address, destination address.
        + An IP address is 32 bits (longer in the newer version), broken into four 8-bit segments. Addresses are allocated in blocks.
      + ICMP (Internet Control Message Protocol): used by gateways and hosts to apprise other hosts of conditions related to their IP services (e.g. routing, congestion).
      + ARP (Address Resolution Protocol): maps an IP address to the associated Ethernet address (32 bits -> 48 bits).
      + RARP (Reverse ARP): maps an Ethernet address to the associated IP address.
    + Level 4, the transport layer:
      + TCP (Transmission Control Protocol): a connection-oriented, reliable, byte-stream protocol.
        + The TCP packet header includes: source port (identifies the process or service in the sender), destination port, sequence number (32 bits), acknowledgement number, control flags (SYN (connection request), ACK, RST (reset), FIN (end)), window (window size: the number of packets that will be accepted), checksum.
        + Provides the means to connect to a socket [IP address, port number].
        + Takes care of timeouts, retransmissions, and flow control.
        + Some well known ports: 20 and 21 (FTP), 23 (Telnet), 25 (SMTP).
      + UDP (User Datagram Protocol): an unacknowledged, transaction-oriented protocol parallel to TCP.
    + Levels 5-7, the session, presentation and application layers:
      + SMTP: Simple Mail Transfer Protocol.
      + DNS: Domain Name Service; maps names to addresses.
        + The top level is the Network Information Center (NIC) computers.
      + FTP: File Transfer Protocol.
      + Telnet: provides virtual terminal services.
+ The mechanisms described above form the basis for tying together distributed systems. So far, though, they've only been used for loose coupling:
  + Each machine is completely autonomous: separate accounting, separate file system, separate password file, etc.
  + Can send mail between machines.
  + Can transfer files between machines (but only with special commands).
  + Can execute commands remotely.
  + Can log in remotely.
+ Loose coupling like this is OK for a network with only a few machines spread all over the country, but not for a high-performance LAN where every user has a private machine.
+ What would we like in a distributed computer system?
  + A unified, transparent file system.
  + Unified, transparent computation: from any terminal, you can run on any machine, transparently. (You actually shouldn't care which machine you're running on.)
  + Load balancing, process migration, file migration.
  + Local area networks can more or less provide that now.
  + Wide area networks cannot provide this transparently, due to performance problems. It may be possible in the future.
+ Distributed file systems
  + Remote files appear to be local (except for performance).
  + Issues:
    + Failures: what happens when the remote system crashes?
    + Performance: remote is not the same as local. Can do some caching.
  + Sun's NFS (Network File System)
    + NFS permits the mounting of a remote file system as if it were local. Therefore, by using mount commands, one can set up a transparent distributed file system.
    + Caches file blocks and descriptors at both clients and servers.
      + Write-through caching: when a file is closed, all modified blocks are sent immediately to the server's disk. "Close" doesn't return until all the bytes are stored on disk.
      + Consistency is weak. A client polls periodically for changes to a file, and may use an old version until it polls. There may be simultaneous conflicting updates.
    + The server keeps no state about clients (except hints, for performance). Each read request carries enough information to do the entire operation, i.e. Read(i-number, position). (A sketch follows below.)
      + When the server crashes and restarts, it can start processing requests again immediately.
    + All requests are "idempotent", i.e. they can be repeated with no ill effects.
      So if a message may be lost, it can simply be resent (and the operation possibly redone) without harm.
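    + A sketch of why such a request is idempotent (illustrative code, not actual NFS): the request itself names the file and an absolute position, so the server needs no per-client state, and re-executing the same request returns the same bytes.

        /* read_from_disk() is a hypothetical server-side routine. */
        struct nfs_read_req {
            unsigned long fhandle;  /* identifies the file (i-number) */
            unsigned long offset;   /* absolute position, not "current" */
            unsigned      count;    /* bytes wanted */
        };

        extern int read_from_disk(unsigned long fhandle,
                                  unsigned long offset,
                                  void *buf, unsigned count);

        int serve_read(const struct nfs_read_req *r, void *buf)
        {
            /* No seek pointer, no open-file state: safe to re-execute. */
            return read_from_disk(r->fhandle, r->offset, buf, r->count);
        }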