CS 162 Lecture Notes
Prof. Alan Jay Smith

Topic: File Structure, I/O Optimization

+ File: a named collection of bits (usually stored on disk).
  + From the OS's standpoint, the file consists of a bunch of blocks stored on the device.
  + The programmer may actually see a different interface (bytes or records), but this doesn't matter to the file system (just pack bytes into blocks, and unpack them again on reading).
  + A file may have attributes and properties, e.g. name(s), protection, type (numeric, alphabetic, binary, C program, Fortran program, data, etc.), time of creation, time of last use, time of last modification, owner, length, link count, layout (format).
+ How do we (or can we) use a file?
  + Sequential: information is processed in order, one piece after the other. This is by far the most common mode: e.g. the editor writes out a new file, the compiler compiles it, etc.
  + Random access: can address any block in the file directly, without passing through its predecessors. E.g. the data set for demand paging, libraries, databases. Need to know which block we want (i.e. some sort of index or address is needed).
  + Keyed: search for blocks with particular values, e.g. a hash table, associative database, dictionary. Usually not provided by the operating system (but is provided in some IBM systems). Keyed access can be considered a form of random access.
+ Modern file and I/O systems must address four general problems:
  + 1. Disk management:
    + Efficient use of disk space
    + Fast access to files
    + File structures
    + Device use optimization
    + The user has a hardware-independent view of the disk. (Mostly, so does the OS.)
  + 2. Naming: how do users refer to files?
    + This concerns directories, links, etc.
  + 3. Protection: all users are not equal.
    + Want to protect users from each other.
    + Want to have files from various users on the same disk.
    + Want to permit controlled sharing.
  + 4. Reliability
    + Information must last safely for long periods of time.
+ Disk Management:
  + How should the blocks of the file be placed on the disk?
  + How should the map used to find and access the blocks look?
  + File descriptor: a data structure that gives the file attributes and contains the map which tells you where the blocks of your file are.
    + File descriptors are stored on disk along with the files (when the files are not open).
  + Some system, user and file characteristics:
    + Most files are small. In Unix, most files are very small: lots of files with a few commands in them, etc.
    + Much of the disk is allocated to large files.
    + Many of the I/O operations are made to large files.
    + Most (between 60% and 85%) of the I/Os are reads.
    + Most I/Os are sequential.
    + Thus the per-file cost must be low, but large files must have good performance.
+ File Block Layout and Access
  + Contiguous
  + Linked
  + Indexed or tree structured
  + Note: this is just standard data structures stuff, but on disk.
+ Contiguous allocation:
  + Allocate the file in a contiguous set of blocks or tracks.
  + Keep a free list of unused areas of the disk. When creating a file, make the user specify its length and allocate all the space at once. The descriptor contains the location and size.
  + Advantages:
    + Easy access, both sequential and random
    + Low overhead
    + Simple
    + Few seeks
    + Very good performance for sequential access.
  + Drawbacks:
    + Horrible fragmentation will make large files impossible.
    + Hard to predict needs at file creation time.
    + May overallocate.
    + Hard to enlarge files.
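  + As a minimal sketch of why contiguous allocation gives such cheap access (the struct and function names here are illustrative, not from any real system): the whole map is a (start, length) pair, and block lookup is a single addition.

        /* Contiguous allocation: the descriptor's "map" is just a
           start block and a length. */
        #include <assert.h>

        struct cont_file {
            unsigned start;    /* first disk block of the file */
            unsigned nblocks;  /* file length in blocks */
        };

        /* Map a logical block number within the file to a disk block.
           O(1), which is why both sequential and random access are easy. */
        unsigned cont_bmap(const struct cont_file *f, unsigned logical)
        {
            assert(logical < f->nblocks);
            return f->start + logical;
        }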
  + Can improve this scheme by permitting files to be allocated in extents: ask for a contiguous block; if it isn't enough, get another contiguous block.
    + Example: IBM OS/360 permits up to 16 extents. Extra space in the last extent can be released after the file is written.
+ Linked files: link the blocks of the file together as a linked list. In the file descriptor, just keep a pointer to the first block; in each block of the file, keep a pointer to the next block.
  + Advantages? Files can be extended, and there are no external fragmentation problems. Sequential access is easy: just chase links.
  + Drawbacks? Random access requires sequential access through the list. Lots of seeking, even in sequential access. Some overhead in each block for the link.
  + Example: TOPS-10, sort of. Alto, sort of.
+ (Simple) indexed files: the simplest approach is to just keep an array of block pointers for each file. The maximum file length must be declared when the file is created. Allocate an array to hold pointers to all the blocks, but don't allocate the blocks. Then fill in the pointers dynamically using a free list.
  + Advantages?
    + Not as much space is wasted by overpredicting; both sequential and random access are easy. We only waste space in the index.
  + Drawbacks?
    + May still have to set a maximum file size (can have an overflow scheme if the file is larger than the predicted maximum).
    + Blocks are probably allocated randomly over the disk surface, and there will be lots of seeks.
    + The index array may be large, and may require a large file descriptor.
+ Multi-level indexed files: the VAX Unix solution (version 4.4).
  + In general, any sort of multi-level tree structure. More specifically, we describe what Berkeley 4.3BSD Unix does (a lookup sketch appears at the end of this allocation discussion):
  + File descriptors: 15 block pointers. The first 12 point to data blocks, the next three to indirect, doubly-indirect, and triply-indirect blocks (256 pointers in each indirect block). The maximum file length is fixed, but large. Descriptor space isn't allocated until needed.
  + Advantages: simple, easy to implement, incremental expansion, easy access to small files. Good random access to blocks. Easy to insert a block in the middle of a file. Easy to append to a file. Small file map.
  + Drawbacks:
    + The indirect mechanism doesn't provide very efficient access to large files: 3 descriptor operations for each real operation. (When we "open" the file, we can keep the first level or two of the file descriptor around, so we don't have to read it each time.)
    + The file isn't generally allocated contiguously, so we have to seek between blocks.
+ Block allocation:
  + If all blocks are the same size, can use a bit map solution.
    + One bit per disk block.
    + Cache parts of the bit map in memory. Select a block at random (or not randomly) from the bitmap.
  + If blocks are of variable size, can use a free list.
    + This requires free storage area management: fragmentation and compaction.
  + In Unix, free blocks are grouped for efficiency: each block on the free list contains pointers to many free blocks, plus a pointer to the next list block. Thus there aren't many references involved in allocation or deallocation.
    + Block-by-block organization of the free list means that file data gets spread around the disk.
+ A more efficient solution (used in the DEMOS system built at Los Alamos):
  + Allocate groups of sequential blocks. Use the multi-level index scheme described above, but each pointer isn't to one block; it is to a sequence of blocks.
  + When we need another block for a file, we attempt to allocate the next physical block on the track (or cylinder).
    + If we can't do it sequentially, we try to do it nearby.
  + If we have detected a pattern of sequential writing, then we grab a bunch of blocks at a time (and release them if unused). (The size of the bunch will depend on how many sequential writes have occurred so far.)
  + Keep part of the disk unallocated always (as Unix does now); then the probability that we can find a sequential block to allocate is high.
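+ The promised lookup sketch: a hedged rendition of the multi-level index walk (12 direct pointers, then three levels of indirect blocks with 256 pointers each, as described above). read_ptr_block() is a hypothetical helper that reads one pointer out of an on-disk indirect block; the real 4.3BSD bmap() routine differs in detail.

        /* Multi-level index lookup: NDADDR direct pointers, then
           single, double, and triple indirect blocks, NINDIR pointers
           per block. */
        #define NDADDR 12
        #define NINDIR 256

        extern unsigned read_ptr_block(unsigned blkno, unsigned index);

        unsigned bmap(const unsigned direct[NDADDR],
                      const unsigned indir[3], unsigned long lbn)
        {
            if (lbn < NDADDR)
                return direct[lbn];        /* small files: one step */
            lbn -= NDADDR;

            unsigned long span = NINDIR;   /* blocks reachable at this level */
            for (int level = 0; level < 3; level++) {
                if (lbn < span) {
                    unsigned b = indir[level];
                    /* walk down level+1 pointer blocks */
                    for (unsigned long s = span / NINDIR; ; s /= NINDIR) {
                        b = read_ptr_block(b, (unsigned)(lbn / s % NINDIR));
                        if (s == 1)
                            break;
                    }
                    return b;
                }
                lbn -= span;
                span *= NINDIR;
            }
            return 0;                      /* beyond maximum file size */
        }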
+ I/O Optimization
+ Block size optimization
  + Small blocks:
    + mean small I/O buffers (I/O buffers are used for both reads and writes);
    + are quickly transferred;
    + require lots more transfers for a fixed amount of data;
    + have high overhead on disk: wasted bytes for every disk block (inter-record gaps, header bytes, ERC bytes);
    + need more entries in the file descriptor (inode) to point to blocks;
    + cause less internal fragmentation;
    + with random allocation, cause more seeks.
  + Optimal block sizes tend to range from 2K to 8K bytes, with the optimum increasing with improvements in technology.
  + Berkeley Unix uses 4K blocks (now 8K?). The basic (hardware) block size on the VAX is 512 bytes.
  + Berkeley Unix also uses fragments that are 1/4 the size of the logical block size.
+ Disk arm scheduling: in timesharing systems, it may sometimes be the case that there are several disk I/Os requested at the same time. (A sketch of two of these policies appears after the rotational scheduling discussion below.)
  + First come first served (FIFO, FCFS): may result in a lot of unnecessary disk arm motion under heavy loads.
    + (diagram omitted)
  + Shortest seek time first (SSTF): handle the nearest request first. This can reduce arm movement and result in greater overall disk efficiency, but some requests may have to wait a long time.
    + The problem is starvation. Imagine that the disk is heavily loaded, with 3 open files, two of them located near the center of the disk and the other near the edge. The disk can be kept fully busy servicing the first two files while ignoring the last one.
  + Scan: like an elevator. Move the arm in one direction, servicing requests, until there are no additional requests in that direction; then reverse direction and continue.
    + This algorithm doesn't get hung up in any one place for very long, and it works well under heavy load. But it may not get the shortest seek.
    + It also tends to neglect files at the periphery of the disk.
  + CScan (circular scan): like a one-way elevator; it moves only in one direction. When it finds no further requests in the scan direction, it returns immediately to the furthest request in the other direction and resumes the scan.
    + This treats all files (and tracks) equally, but has a somewhat higher mean access time than Scan.
  + SSTF has the best mean access time. Scan or CScan can be used if there is a danger of starvation.
  + Most of the time there aren't very many disk requests in the queue, so this isn't a terribly important decision.
  + Also, if contiguous allocation is used (as with OS/360), then seeks are seldom required.
+ Rotational scheduling
  + It is rare to have more than one request outstanding for a given cylinder. (This was more relevant when drums were used.)
  + SRLTF (shortest rotational latency first) works well.
  + But rotational scheduling can be useful for writing data, if we don't have to write back to the same location (log structured file system).
  + Rotational scheduling is hard using logical block addresses (LBAs), since you don't know the rotational position or the number of blocks per track.
  + Rotational and seek scheduling can be usefully combined (into shortest time to next block) if done in the onboard disk controller, which should know the angular and radial position.
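+ The promised scheduling sketch: toy C versions of SSTF and Scan selection over a queue of pending cylinder numbers. The function names are illustrative; real schedulers also track rotational position, request age, and more.

        #include <stdlib.h>

        /* SSTF: pick the request closest to the current head position. */
        int sstf_pick(const int *req, int n, int head)
        {
            int best = 0;
            for (int i = 1; i < n; i++)
                if (abs(req[i] - head) < abs(req[best] - head))
                    best = i;
            return best;  /* may starve far-away requests under load */
        }

        /* Scan: pick the nearest request in the current direction
           (dir is +1 or -1); the caller reverses direction when this
           returns -1 (nothing ahead of the head). */
        int scan_pick(const int *req, int n, int head, int dir)
        {
            int best = -1;
            for (int i = 0; i < n; i++)
                if ((req[i] - head) * dir >= 0 &&
                    (best < 0 || abs(req[i] - head) < abs(req[best] - head)))
                    best = i;
            return best;
        }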
+ Skip-sector or interleaved disk allocation
  + Imagine that you are reading the blocks of a file sequentially and quickly, and the file is allocated sequentially.
  + Usually, you will find that you try to read a block just after the start of that block has passed under the head.
  + The solution is to allocate file blocks to alternate disk blocks or sectors. Then we haven't yet passed a block when we want to read it.
  + Note that if all bits read are immediately placed into a semiconductor buffer, this is unnecessary.
+ Track offset for head and cylinder switching
  + It takes time to switch between heads on different tracks or cylinders. Thus we may want to skip several blocks when moving sequentially between tracks, to allow time for the head to be selected.
+ File placement
  + Seek distances will be minimized if commonly used files are located near the center of the disk.
    + Even better results are obtained if reference patterns are analyzed and files that are frequently referenced together are placed near each other.
  + The frequency of seeks, and queueing for disks, will be reduced if commonly used files (or files used at the same time) are located on different disks.
    + E.g. spread the paging data sets and operating system data sets over several disks.
+ Disk caching
  + Keep a cache of recently used disk blocks in main memory.
    + Recently read blocks are retained in the cache until replaced.
    + Writes go to the disk cache, and are later written back.
    + The cache would typically include the index blocks for an open file.
  + Also use the cache for read-ahead and write-behind.
  + Can load entire disk tracks into the cache at once.
  + This typically works quite well: hit ratios of 70-90%.
  + Can also do caching in the disk controller; most controllers these days have 64K-4MB of cache/buffer in the controller. This is mostly useful as a buffer, not a cache, since the main memory cache is so much larger.
+ Prefetching and data reorganization
  + Since disk blocks are often read (and written) sequentially, it can be very helpful to prefetch ahead of the current read point.
  + It is also therefore useful to make sure that the physical layout of the data reflects the logical organization of the data, i.e. that logically sequential blocks are also physically sequential. Thus it is useful to periodically reorganize the data on the disk.
+ Data replication
  + Frequently used data can be replicated at multiple locations on the disk.
  + This means that on writes, the extra copies must either be updated or invalidated.
  + ALIS (automatic locality-improving storage)
    + The best results are obtained when the techniques are combined: reorganize to make sequential, cluster, and replicate.
+ RAID
  + Observations:
    + Small disks are cheaper than large ones (due to economies of scale).
    + The failure rate is roughly constant per disk, independent of disk size.
    + Therefore, if we replace a few large disks with lots of small disks, the failure rate increases.
  + Solution:
    + Interleave the blocks of the file across a set of smaller disks, and add a parity disk.
    + Note that since we presume (a) only one disk fails at a time, and (b) we know which disk failed, we can reconstruct the failed disk (a reconstruction sketch follows).
    + Can do parity in two directions for extra reliability.
  + Advantage:
    + Improves read bandwidth.
  + Problem:
    + We have to write the parity disk on every write; it becomes a bottleneck.
    + A solution: interleave on a different basis than the number of disks. That means that the parity disk varies, and the bottleneck is spread around.
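  + A small sketch of the reconstruction argument (names are illustrative): since parity is the XOR of the data blocks, the failed disk's block is simply the XOR of all surviving blocks, parity included. This is also why we must know which disk failed.

        #include <stddef.h>

        /* Rebuild one block of the failed disk from the ndisks-1
           surviving blocks (data + parity). */
        void raid_reconstruct(unsigned char *out,
                              unsigned char *const surviving[],
                              size_t ndisks, size_t blksize)
        {
            for (size_t i = 0; i < blksize; i++) {
                unsigned char x = 0;
                for (size_t d = 0; d + 1 < ndisks; d++)
                    x ^= surviving[d][i];
                out[i] = x;    /* missing byte = XOR of the rest */
            }
        }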
+ Types of RAID:
  + RAID 0: ordinary disks (striping, no redundancy)
  + RAID 1: replication (mirroring)
  + RAID 4: parity disk in a fixed location
  + RAID 5: parity disk in a varying location

Topic: Directories and Other File System Topics

+ Naming:
  + How do users refer to their files?
  + How does the OS refer to the file itself?
  + How does the OS find the file, given its name?
+ The file descriptor is a data structure or record that describes the file.
  + The file descriptor information has to be stored on disk, so it will stay around even when the OS doesn't. (Note that we are assuming that disk contents are permanent.)
  + In Unix, all the descriptors are stored in a fixed-size array on disk. The descriptors also contain protection and accounting information.
  + A special area of the disk is used for this (the disk contains two parts: the fixed-size descriptor array, and the remainder, which is allocated for data and indirect blocks).
  + The size of the descriptor array is determined when the disk is initialized, and can't be changed. In Unix, the descriptor is called an inode (index node), and its index in the array is called its i-number. Internally, the OS uses the i-number to refer to the file.
  + IBM calls the roughly equivalent structure the volume table of contents (VTOC).
+ The inode is the focus of all file activity in Unix. There is a unique inode allocated for each file, including directories. An inode is "named" by its dev/i-number pair. (iget/iget.c)
+ Inode fields (a condensed struct sketch appears at the end of this discussion):
  + reference count (number of times open)
  + number of links to the file
  + owner's user id, owner's group id
  + number of bytes in the file
  + time last accessed, time last modified, time the inode itself last changed
  + disk block addresses and indirect blocks (discussed previously)
  + flags: inode is locked, file has been modified, some process is waiting on a lock
  + file mode (the type of file: character special, directory, block special, regular, symbolic link, socket)
    + A socket is an endpoint of a communication, referred to by a descriptor, just like a file or a pipe. Two processes can each create a socket and then connect those two endpoints to produce a reliable byte stream. (A pipe requires a common parent process; a socket does not, and the processes may be on different machines.)
  + (items below not in the text on 4.4BSD)
  + protection info: set user id on execution, set group id on execution, read, write, execute permissions, sticky bit? (check)
  + count of shared locks on the inode
  + count of exclusive locks on the inode
  + unique identifier
  + file system associated with this inode
  + quota structure controlling this file
+ When a file is open, its descriptor is kept in main memory. When the file is closed, the descriptor is stored back to disk.
  + There is usually a per-process table of open files.
    + In Unix, there is a process open file table, with one entry for each file opened. The integer index into that table is the handle for that open file. Multiple opens of the same file will get multiple entries. (Note that if a process forks, a given entry can be shared by several processes.)
    + (Standard-in is #0, standard-out is #1, and stderr is #2; these must be per process.)
  + Unix also has a system open file table, which points to the inode for the file (in the inode table). This table is system wide. It maps names to files.
  + There is also the inode table, a system-wide table holding active and recently used inodes.
  + The descriptor is kept in OS space, which is paged, so it may be necessary to take a page fault to get to the descriptor info.
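+ The promised struct sketch: a condensed rendition of the inode fields listed above, using portable types. Real 4.xBSD declarations differ in names, sizes, and layout.

        #include <stdint.h>
        #include <time.h>

        #define NDADDR 12   /* direct block pointers */
        #define NIADDR 3    /* single/double/triple indirect */

        struct inode_sketch {
            uint16_t mode;                /* file type + protection bits */
            int16_t  nlink;               /* number of hard links */
            uint32_t uid, gid;            /* owner and group */
            int64_t  size;                /* bytes in file */
            time_t   atime, mtime, ctime; /* accessed/modified/inode changed */
            int32_t  db[NDADDR];          /* direct disk block addresses */
            int32_t  ib[NIADDR];          /* indirect block addresses */
            uint32_t flags;               /* locked, modified, wanted, ... */
        };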
+ Users need a way of referencing the files that they leave around on disk. One approach is just to have users remember descriptor indexes, i.e. the user would have to remember something like the number of the descriptor. Unfortunately, that is not very user friendly.
+ Of course, users want to use text names to refer to files. Special disk structures called directories are used to tell which descriptor indices correspond to which names.
+ Approach #1: have a single directory for the whole disk. Use a special area of the disk to hold the directory.
  + The directory contains <name, descriptor index> pairs.
  + Problems:
    + If one user uses a name, no one else can.
    + If you can't remember the name of a file, you may have to look through a very long list.
    + Security problem: people can see your file names (which can be dangerous).
  + Old personal computers (pre-Windows) worked this way.
+ Approach #2: have a separate directory for each user (the TOPS-10 approach). This is still clumsy: names from a user's different projects get confused, and you still can't remember the names of files.
  + IBM's VM is similar to this. Files have a 3-part name, <name, type, location>, where the location is A, B, C, etc. (i.e. which disk). Very painful. (Also, file names are limited to 8 characters.)
+ Approach #3, the Unix approach: generalize the directory structure to a tree.
  + Directories are stored on disk just like regular files (i.e. a file descriptor with block pointers, etc.).
    + User programs can manipulate directories almost like any other file. Only special system programs may write directories.
  + Each directory contains <name, descriptor index> pairs. The file pointed to by the index may be another directory. Hence, we get a hierarchical tree structure. Names have slashes separating the levels of the tree.
  + There is one special directory, called the root. This directory has no name, and is the file pointed to by descriptor 2 (descriptors 0 and 1 have other special purposes).
    + Note that we need the root; otherwise we would have no way to reach any files. From the root, we can get anywhere in the file system.
  + The full file name is the path name, i.e. the full name from the root.
  + A directory consists of some number of blocks of DIRBLKSIZ bytes, where DIRBLKSIZ is chosen such that each block can be transferred to disk in a single atomic operation (e.g. 512 bytes on most machines).
    + Each directory block contains some number of directory entry structures, which are of variable length. Each directory entry has info at the front of it, containing its inode number, the length of the entry, and the length of the name contained in the entry. These are followed by the name, padded to a 4-byte boundary with null bytes. All names are guaranteed null terminated. (A struct sketch appears at the end of this discussion.)
  + Note that in Unix, a file name is not the name of a file; it is only a name by which the kernel can search for the file. The inode is really the "name" of the file.
  + Each pointer from a directory to a file is called a hard link.
    + In some systems, there is a distinction between a "branch" and a "link", where the link is a secondary access path, and the branch is the primary one (it goes with ownership).
  + You "erase" a file by removing a link to it. In reality, a count is kept of the number of links to a file; it is only really erased when the last link is removed.
    + To really erase a file, we put the blocks of the file on the free list.
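  + The promised struct sketch: a rendition of the variable-length directory entry just described, modeled loosely on 4.3BSD's struct direct. The fixed-size name array is a simplification; on disk, each entry is packed and occupies only d_reclen bytes.

        #include <stdint.h>

        #define MAXNAMLEN 255

        struct direct_sketch {
            uint32_t d_ino;                 /* inode number of the entry */
            uint16_t d_reclen;              /* total length of this record */
            uint16_t d_namlen;              /* length of the name */
            char     d_name[MAXNAMLEN + 1]; /* name, null terminated and
                                               null-padded to a 4-byte
                                               boundary on disk */
        };

        /* Step to the next entry within a directory block. */
        #define NEXT_ENTRY(dp) \
            ((struct direct_sketch *)((char *)(dp) + (dp)->d_reclen))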
+ Symbolic links
  + There are two ways to "link" to another directory or file. One is a direct pointer; in Unix, such links may not cross "file systems", i.e. may not point to another disk.
  + Alternatively, we can use symbolic links, in which instead of pointing to the file or directory, we store a symbolic name for that file or directory.
  + We need to be careful not to create cycles in the directory system; otherwise recursive operations on the file system will loop (e.g. cp -r). In Unix, this is solved by not permitting hard links to existing directories (except by the superuser).
+ Pros and cons of the tree-structured directory scheme
  + Can organize files in a logical manner. It is easy to find the file you're looking for, even if you don't exactly remember its name.
  + The "name" of the file is in fact a concatenation of the path from the root. Thus the name is actually quite long, and provides semantic information.
  + Can have duplicate names, if the paths to the files are different.
  + Can (assuming the protection scheme permits) give away access to a subdirectory and the files under it, without giving access to all files. (Note: Unix does not permit multiple hard links to a directory, unless done by the superuser.)
  + Access to a file requires reading only the relevant directories, not the entire list of files. (My list of files prints out to a half-inch-thick printout: 20,000 files.)
  + The structure is more complex to move around and maintain.
  + A file access may require that many directories be read, not just one.
  + It is very nice that directories and file descriptors are separate, and that directories are implemented just like files. This simplifies the implementation and management of the structure (we can write "normal" programs to manipulate them as files).
    + I.e. the file descriptors are things the user shouldn't have to touch, while directories can be treated as normal files.
+ Working directory: it is cumbersome to constantly have to specify the full path name for all files.
  + In Unix, there is one directory per process, called the working directory, which the system remembers.
  + This is not the same as the home directory, which is where you are at login time, and which is in effect the root of your personal file system.
  + Every user has a search path, which is a list of directories in which to look to resolve a file name. The first element is almost always the working directory. (A sketch of this lookup appears at the end of this discussion.)
  + "/" is an escape to allow full path names. I.e. most names are relative file names; ones beginning with "/" are full (complete) path names.
  + Note that in Unix, the search path is maintained by the shell. If any other program wants to do the same, it has to rebuild the facilities from scratch. This should really be in the OS. ("set path" in .cshrc or .login.)
    + My path is: (. ~/bin /usr/new /usr/ucb /bin /usr/bin /usr/local /usr/hosts ~/com)
  + Basically, we want to look in the working directory, then the system library directories.
    + We probably don't want a search strategy that actually searches more widely. If it did, it might find a file that wasn't really the target.
    + This is yet another example of locality.
  + There is a simple means to change the working directory: "cd". You can also refer to the directories of other users by prefacing their login names with "~".
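  + A minimal sketch of the search-path lookup the shell performs (illustrative code, not the actual shell's): try the name in each directory of the path, in order, until one resolves.

        #include <stdio.h>
        #include <unistd.h>

        /* Return 0 and fill 'full' with the resolved path on success,
           -1 if the name is not found in any path directory. */
        int search_path(const char *name, const char *path[], int npath,
                        char *full, size_t fullsz)
        {
            for (int i = 0; i < npath; i++) {
                snprintf(full, fullsz, "%s/%s", path[i], name);
                if (access(full, X_OK) == 0)   /* executable here? */
                    return 0;
            }
            return -1;
        }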
+ Operations on files
  + Open: put a file descriptor into your table of open files; those are the files that you can use. May require that locks be set and a user count be incremented. (If any locking is involved, may have to check for deadlock.)
  + Close: the inverse of open.
  + Create a file: sometimes done automatically by open.
  + Remove (rm) or erase: drop the link to the file. Put the blocks back on the free list if this is the last link.
  + Read: read a record from the file. (This usually means that there is an "access method", i.e. I/O code, which deals with the user in terms of records, and with the device in terms of physical blocks.)
  + Write: like read, but may also require disk space allocation.
  + Rename ("mv" or "move"): rename the file. Unix combines two different operations here. Rename would strictly involve changing the file name within the same directory; "move" moves the file from one directory to another. Unix does both with one command.
    + Note that mv also silently destroys the old file if one already exists with the new name (which could be considered a bug, not a feature).
  + Seek: move to a given location in the file.
  + Synch: write blocks of the file from the disk cache back to disk.
  + Change properties (e.g. protection info, owner).
  + Link: add a link to a file.
  + Lock & unlock: lock/unlock the file.
  + Partial erase (truncate).
  + Note that commands such as "copy", "cat", etc., are built out of the simpler commands listed above.
+ Pseudo files
  + We have commands such as "read" and "write" for files. We want to do similar things to devices (e.g. terminal, printer, etc.). There is no reason not to treat I/O devices as files, and we can do so. These are called "pseudo files".
+ File backup and recovery
  + The problem: we want to avoid losing files due to:
    + 1. System crashes (hardware or software)
      + a. Physical hard failure, usually a head crash.
      + b. Software failure.
      + c. General system failure while the file is open. (This is the most common problem and the one we are usually concerned with. Usually a power failure.)
    + 2. User errors
      + Want to be able to get files back after we have destroyed them (overwritten or erased). (Unix doesn't provide this.)
    + 3. Sabotage and malicious users.
  + Approaches:
    + Periodic full dump: periodically dump all of the files (and directories) to backup storage, such as tape. The system can be reloaded from the dump tape. Sometimes called a checkpoint dump.
      + Note: the system has to be shut down during dumping. Slow. Recovery is only back to the last dump, not up to date. Large amount of data: slow to dump, and a large number of tapes.
    + Incremental (periodic) dump: dump all modified files periodically, e.g. when the user logs out, or after a file is closed. Thus we can lose a file only while it is open.
      + Disadvantages: large quantities of data, and a long and involved recovery procedure.
  + One recovery problem is that after a software or hardware crash, some tables may be left in an inconsistent condition (e.g. the free list may be wrong). It is then also necessary to fix all the tables. Very system dependent.
  + There are several approaches to the problem of a crash while modifying a file:
    + Work on a copy of the file and swap it for the original when you close.
      + This is usually what an editor (e.g. vi) does.
      + What if the file is open by more than one person at the same time?
      + How do we make the "swap" atomic? (See below.)
    + Write a log of all changes to the file, so we can back up, if necessary (audit trail).
    + Write a list of changes to the file prior to modifying the file, so we can restart the list at the point at which a crash occurred (intentions list, or log-tape write-ahead protocol).
    + Keep multiple copies of the file. Update one, and then copy the update to the second (careful replacement).
    + Make a new copy of any part of the file as it is modified. Replace the old parts with the new parts when we close the file (differential file).
      + I.e. duplicate the file descriptor; update the new copy of the file descriptor as new copies of blocks are made; swap the new file descriptor for the old when the file is closed.
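  + A sketch of the copy-and-swap technique, and of one way the swap can be made atomic on Unix: write the new version to a temporary file, force it to disk, then rename(2) it over the original. rename is atomic, so a crash leaves either the whole old file or the whole new one, never a mixture. (The temp-file naming below is illustrative, and a careful implementation would also fsync the containing directory.)

        #include <stdio.h>
        #include <unistd.h>
        #include <fcntl.h>

        int replace_file(const char *path, const void *data, size_t len)
        {
            char tmp[4096];
            snprintf(tmp, sizeof tmp, "%s.tmp", path);

            int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0)
                return -1;
            if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
                close(fd);
                unlink(tmp);    /* abandon the partial copy */
                return -1;
            }
            close(fd);
            return rename(tmp, path);   /* the atomic "swap" */
        }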
Topic: Networks and Communication Protocols

+ Two trends:
  + Lots of small machines.
  + Lots of computers everywhere, and a need to communicate.
+ Problem: communication and cooperation are difficult.
  + How do people on the same project share files?
  + How does new software get distributed to all users?
  + How is electronic mail handled?
+ Solution: tie machines together with networks, and develop message protocols that allow communication and cooperation again.
+ Goal: ideally, we would like all the computers to look like one very large, unified system. We could share files, communicate, etc. as if it were one timesharing system.
  + But we will always be able to tell the difference, due to performance.
+ Wide area networks (WANs): networks that connect sites that are geographically far apart.
+ Local area networks (LANs): developed in the mid-70's to hook together personal computers. The most popular interconnection for LANs is Ethernet. LANs are used very differently than wide area networks.
+ Examples of networks:
  + ARPAnet (Defense Advanced Research Projects Agency): the first and most famous network, developed in the early 70's but still in use. Connected together large timesharing systems all over the country using leased phone lines. Provided mail, file transfer, and remote login.
    + The ARPANET used IMPs (Interface Message Processors) as routers and TIPs (Terminal Interface Processors) to connect from a terminal.
  + Usenet: developed in the late 70's and early 80's. Unix systems phone each other up to send mail and transfer files.
  + CSnet: developed to be a less expensive clone of the Arpanet, and to tie together CS departments.
  + BITNET: ties together mostly sites using IBM equipment, including a lot of physics laboratories.
  + VNET: IBM's internal corporate network. Has highly secure gateways connecting it to CSnet.
  + DECNET: DEC's network system. The name of a product; it also refers to DEC's corporate communication system.
  + Miscellaneous commercial nets, such as America Online, Prodigy, and Compuserve.
  + Internetworks: mechanisms for tying together many existing networks, such as the ARPAnet, Usenet, and LANs.
    + The Internetwork is the combination of most of the above; they are now widely interconnected.
+ Network hardware
  + LAN: usually Ethernet, which uses either a shared cable, or wires from each machine to a hub or switch.
  + WAN: point-to-point links (used by most early networks). Examples are leased phone lines (50k bits/sec on the ARPAnet), RS232 connections, T1 and T3 lines, regular phone calls, satellite links, etc.
+ Network topologies:
  + Fully connected: every site can talk directly to any other site (e.g. Usenet).
  + Partially connected: star and ring are the most popular. Intermediate nodes must forward messages.
  + Multi-access bus / broadcast (used by most LANs today): a single cable or group of cables connects many machines together. The best example is Ethernet (one wire). An alternative is radio broadcast.
+ Network performance parameters
  + Networks are usually characterized in terms of two performance parameters:
    + Latency: the minimum time to get the minimum amount of information between two sites.
      + Note the difference between transmission latency, which is the time for a given bit to get from one end to the other after the connection is set up, and set-up latency, which is the time to get the first bit there.
    + Bandwidth: once information is flowing, how many bits per second can be transmitted (i.e. the marginal cost per bit).
  + Note also cost.
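  + A worked example (with illustrative numbers, not from these notes): sending a 1 Mbyte file over a 10 Mbit/sec link with 50 msec of set-up latency takes about 0.05 + (8 x 10^6 bits)/(10^7 bits/sec) = 0.85 seconds, dominated by bandwidth; sending a 100-byte packet over the same link takes about 0.05 + 800/10^7 = 0.05008 seconds, dominated almost entirely by latency. Small transfers are latency-bound, large ones bandwidth-bound.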
+ Protocols:
  + These are the key to networks. A protocol is just an agreement among the parties on the network about how information will be transmitted between them and what the information format is. There are many different protocols for doing different things (e.g. mail, file transfer, remote login). Typically, protocols are built up in layers. Section 15.6 of the book lists the 7 ISO protocol layers.
  + Hierarchical protocols relate the layers within a given system; this can be a network system, operating system, etc.
  + Peer-to-peer protocols relate the same layer of different systems. They rely on the lower layers to actually communicate across machines.
+ ISO protocol layers
  + 1. Physical layer (the lowest): determines the electrical mechanisms for transmitting bits: voltages, delays, currents, etc.
  + 2. Data link layer: how to get packets between two directly connected components. Includes error detection and recovery from the physical layer.
  + 3. Network layer: responsible for providing connections (to nodes that are not directly connected) and routing packets. Takes care of addresses. (Takes care of route changes due to changing loads.)
  + 4. Transport layer: low-level access to the network. Breaks messages into packets, keeps packets in order, flow control, physical address generation. (Takes care of retransmission of lost or destroyed packets.)
  + 5. Session layer: process-to-process protocols.
  + 6. Presentation layer: resolves differences in formats between sites (e.g. character types, number representation, full/half duplex, etc.).
  + 7. Application layer: interacts with users. Supports electronic mail, distributed databases, etc.
+ Wide area networks are usually built up from interconnected local area networks (LANs). Local area networks are usually some type of broadcast network.
+ Broadcast networks: a single shared communication medium, with no central controller to allocate access to it.
  + The simplest scheme is the Aloha mechanism: just broadcast blindly, and use recovery protocols if a packet doesn't get through. This system has stability problems: you can't get more than 18% utilization of the channel (1/2e), and the system completely falls apart under heavy loads.
    + The Aloha system uses a satellite, so it has no choice: it can't listen first, since the delays are too long (1/4 second).
    + It can be improved with "slotted Aloha", in which messages occupy fixed slots; this doubles the usable bandwidth (to 1/e).
  + Ethernet (using a physical coax cable) adds two things.
    + The first is carrier sense: listen before broadcasting, defer until the channel is clear, then broadcast.
    + Also, listen while broadcasting. A collision can still happen if two stations start up at exactly the same time. If a collision is detected, jam the network so that everyone will know about the collision (don't waste time transmitting junk). Then wait a "random" interval and retry. If there are repeated collisions, wait longer and longer intervals. (A sketch of this backoff rule follows this discussion.)
    + This is called CSMA/CD (carrier sense multiple access, with collision detection).
  + Ethernet frame: destination address (6 bytes), source address (6 bytes), type (2 bytes), data (46-1500 bytes), frame check sequence (4 bytes).
  + Problems with basic (original) Ethernet:
    + Reliability: if any station jams the network, nobody can do anything; you can't even figure out who's doing it.
    + Fairness: there's no guarantee against starvation. People with real-time needs don't like this.
    + Bandwidth is limited to that of the cable (10 Mbits/sec).
    + The original Ethernet was limited to a site that could be connected by one cable (4000 feet).
      + Longer connections require a switch.
    + Security: it is relatively easy to listen to all traffic, and/or to tap the cable.
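  + The promised backoff sketch: the "wait longer and longer" rule is binary exponential backoff. After the k-th collision, wait a random number of slot times in [0, 2^k - 1], doubling the range each time. (The cap at 10 mirrors what real Ethernet interfaces do; the code itself is a toy.)

        #include <stdlib.h>

        /* How many slot times to wait after 'collisions' collisions. */
        unsigned backoff_slots(unsigned collisions)
        {
            if (collisions > 10)         /* cap the range */
                collisions = 10;
            unsigned range = 1u << collisions;  /* 2^k possible slots */
            return (unsigned)rand() % range;
        }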
+ More recent Ethernet designs:
  + Use a switch to route, rather than a shared cable.
  + Rates of 10 Mbit, 100 Mbit, and 1 Gbit/sec; 10 Gbit is under development.
  + Wireless Ethernet (802.11a, b, g).
  + These continue to use most of the Ethernet protocol: the frame format, timeouts, collision detection, and the software layers.
+ Ring networks: these are a type of broadcast network, with an additional protocol built on top of a ring-structured set of point-to-point links.
  + Normally, an electronic token (a special packet) circulates at high speed around the ring. If a station doesn't have anything to broadcast, it just retransmits everything it receives.
  + When ready to broadcast, a station waits until the token passes by. Instead of retransmitting the token, it sends its packet instead.
  + When the packet has been transmitted, the station puts the token back on the ring for the next station to use.
  + The packet loops all the way around and gets swallowed by the sender when it comes back again (the sender recognizes itself as the destination).
  + Problems with the ring system:
    + If any station dies, the token can't circulate, so the ring dies.
    + If the token is lost, the system dies.
    + Starvation is possible.
    + If a second token is accidentally created, the system can get messed up.
+ We can use an Ethernet or a ring network for the local net. We also need some way of constructing (structuring) a wide area net, i.e. of building an "internetwork".
+ Three methods for a link between two machines:
  + Circuit switching: like the telephone system; you have a circuit between the source and destination machines.
  + Packet switching: communications are broken into packets and sent piece by piece.
    + Note that packet switching can be used to build a virtual circuit: it looks like a circuit, but is actually packets on a shared medium.
  + Message switching: a virtual circuit exists for long enough to complete a message, and then the circuit is dropped. (Or a physical link can be used.)
+ Getting stuff where you want it.
  + Packets must be forwarded from machine to machine until they reach the destination. Machines that forward between networks are called gateways. The problem is how to get stuff where you want it.
  + Names vs. addresses vs. routes:
    + Name: a symbolic term for something: "Robert", or "ucbcory". Good for people to remember.
    + Address: where the thing is. In an internetwork situation, this usually consists of the number of the network, the number of the site on the network, the id of the host at the site, and sometimes a more specific host (e.g. a workstation). E.g. jones@chaos.netnode.berkeley.edu
    + Route: directions for how to get there from here (a sequence of hosts and links to pass through to reach the destination).
    + Sometimes the sender has to provide the route, e.g. in UUCP: hplabs!hp-pcd!hpcvc0!cliff. All each machine has to be able to do is remember its neighbors and forward messages. This is clumsy for users.
    + It's better if the hosts of the internetwork can figure out the routing for themselves. This involves a special protocol between the hosts to build routing tables. E.g. in the Internet, hosts send messages to their nearest neighbors and build up tables of the most direct paths from each host to each other host (fewest hops). (A toy update rule is sketched below.)
    + The difficulty with routing tables is that they get to be very large.
    + Note that routes can change dynamically; there can be more than one way to get from A to B. Note the danger of instability in routing if routes are changed for performance reasons.
  + In LANs, only the gateways have to worry about routing: all the other hosts just ship packets to the gateway, unless they are for a host on the local net.
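  + The toy update rule mentioned above, in the spirit of distance-vector routing (illustrative code, not any real router's): when a neighbor advertises its hop count to a destination, adopt the route through that neighbor if it is shorter.

        #define NHOSTS 64
        #define INFINITE_HOPS 9999

        struct route { int next_hop; int hops; };
        struct route table[NHOSTS];   /* one entry per destination */

        void init_routes(int self)
        {
            for (int d = 0; d < NHOSTS; d++) {
                table[d].next_hop = -1;
                table[d].hops = INFINITE_HOPS;
            }
            table[self].hops = 0;          /* we can reach ourselves */
            table[self].next_hop = self;
        }

        /* Neighbor 'n' advertises it can reach 'dst' in 'hops' hops. */
        void update_route(int dst, int n, int hops)
        {
            if (hops + 1 < table[dst].hops) {
                table[dst].hops = hops + 1;  /* +1 to reach the neighbor */
                table[dst].next_hop = n;
            }
        }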
+ Communications problems:
  + Packets can get lost:
    + Transmission errors.
    + The address is corrupted, and the packet circulates forever.
    + The contents of the packet are corrupted.
    + A host has all of its packet buffers full, so it has no place to put another incoming packet.
      + This can happen at an intermediate host if packets are arriving on a fast network but have to be forwarded onto a much slower network. This is called network congestion.
      + It can happen at the destination if the user process can't work fast enough to process all the packets as they arrive.
    + The receiver is down, and the sender sends anyway.
  + Packets can arrive out of order: if some hosts suddenly go down, or if routing tables change, packets might wander off into the network and come back much later. Most protocols include a time-to-live mechanism: after a certain time, packets are killed so that they don't wander endlessly.
+ Datagram protocols: used to deliver individual packets; the packets are not guaranteed to get through or to arrive in any particular order. This is useful for some applications, but not very many.
+ Most applications would like guarantees about delivery and order.
  + This is called a connection, and the protocols that implement it are called virtual circuit or transport protocols.
  + To do this, the sender and receiver must remember state about what has been happening.
  + Simple acknowledgement-based protocol (a sender-side sketch follows this list):
    + Store a serial number in each packet. The sender assigns serial numbers, incrementing for each packet.
    + The sender sends one or more packets.
    + The receiver sends an acknowledgement packet for each packet or group of packets.
    + The sender waits for the acknowledgement before sending the next (group of) packets (it must also save the old packets!).
    + If the sender doesn't receive an acknowledgement within a reasonable time, it assumes that the packet got lost and retransmits it.
    + Retransmission could result in the receiver getting two packets with the same serial number: it checks serial numbers and throws away duplicates and out-of-order packets.
    + The sender and receiver must negotiate about how far ahead the sender can send: otherwise the receiver might run out of buffer space and have to discard packets. This is called the flow control problem.
  + No matter what the virtual circuit mechanism, setting up the connection is complex and time-consuming. It's tricky to get two hosts to agree to communicate with each other and to get their state initialized correctly.
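  + The promised sender-side sketch (a stop-and-wait flavor of the protocol above; send_pkt() and recv_ack_within() are hypothetical transport hooks, not a real API). The receiver's half would compare serial numbers and discard duplicates and out-of-order packets.

        extern void send_pkt(unsigned serial, const void *data, int len);
        /* Returns nonzero and fills *acked if an ack arrives in time. */
        extern int  recv_ack_within(unsigned *acked, int millis);

        static unsigned next_serial = 0;

        void reliable_send(const void *data, int len)
        {
            unsigned serial = next_serial++;
            for (;;) {
                send_pkt(serial, data, len);  /* keep the packet saved */
                unsigned acked;
                if (recv_ack_within(&acked, 500) && acked == serial)
                    return;                   /* delivered */
                /* timeout or wrong ack: assume lost, retransmit */
            }
        }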
+ TCP/IP
  + TCP/IP is the collection of network protocols making up the Internet protocol suite.
  + History:
    + 1969: Arpanet with 4 nodes (UCLA, SRI, UCSB, Utah).
    + 1972: Arpanet demo (50 hosts).
    + mid-1970s: TCP developed, running on Unix (DEC PDP-11).
    + early 80s: Berkeley Unix runs TCP.
    + 1983: the Arpanet converts to TCP/IP. In use by Sun.
  + ISO levels:
    + Level 3, the network layer:
      + IP (Internet Protocol): provides host-to-host datagram delivery.
        + Provides packet routing, and insulates the higher levels from network-specific characteristics (e.g. packet size).
        + Fields of the IP packet header include: version, header length, total length, ID (the same for all fragments of a datagram), time-to-live, checksum, source address, destination address.
        + An IP address is 32 bits (longer in the newer version), broken into four 8-bit segments. Addresses are allocated in blocks.
      + ICMP (Internet Control Message Protocol): used by gateways and hosts to apprise other hosts of conditions related to their IP services (e.g. routing, congestion).
      + ARP (Address Resolution Protocol): maps an IP address to the associated Ethernet address (32 bits -> 48 bits).
      + RARP (Reverse ARP): maps an Ethernet address to the associated IP address.
    + Level 4, the transport layer:
      + TCP (Transmission Control Protocol): a connection-oriented, reliable, byte-stream protocol.
        + The TCP packet header includes: source port (identifies the process or service in the sender), destination port, sequence number (32 bits), acknowledgement number, control flags (SYN (connection request), ACK, RST (reset), FIN (end)), window (window size: the number of packets that will be accepted), checksum.
        + Provides the means to connect to a socket [IP address, port number].
        + Takes care of timeouts, retransmissions, and flow control.
        + Some well known ports: 20 and 21 (FTP), 23 (Telnet), 25 (SMTP).
      + UDP (User Datagram Protocol): an unacknowledged, transaction-oriented protocol parallel to TCP.
    + Levels 5-7, the session, presentation and application layers:
      + SMTP: Simple Mail Transfer Protocol.
      + DNS: Domain Name Service; maps names to addresses.
        + The top level is the Network Information Center (NIC) computers.
      + FTP: File Transfer Protocol.
      + Telnet: provides virtual terminal services.
+ The mechanisms described above form the basis for tying together distributed systems. So far, though, they've only been used for loose coupling:
  + Each machine is completely autonomous: separate accounting, separate file system, separate password file, etc.
  + Can send mail between machines.
  + Can transfer files between machines (but only with special commands).
  + Can execute commands remotely.
  + Can log in remotely.
+ Loose coupling like this is OK for a network with only a few machines spread all over the country, but not for a high-performance LAN where every user has a private machine.
+ What would we like in a distributed computer system?
  + A unified, transparent file system.
  + Unified, transparent computation: from any terminal, you can run on any machine, transparently. (You actually shouldn't care which machine you're running on.)
  + Load balancing, process migration, file migration.
  + Local area networks can more or less provide that now.
  + Wide area networks cannot provide this transparently, due to performance problems. It may be possible in the future.
+ Distributed file systems
  + Remote files appear to be local (except for performance).
  + Issues:
    + Failures: what happens when the remote system crashes?
    + Performance: remote is not the same as local. Can do some caching.
  + Sun's NFS (Network File System)
    + NFS permits the mounting of a remote file system as if it were local. Therefore, by using mount commands, one can set up a transparent distributed file system.
    + Caches file blocks and descriptors at both clients and servers.
      + Write-through caching: when a file is closed, all modified blocks are sent immediately to the server's disk. "Close" doesn't return until all the bytes are stored on disk.
      + Consistency is weak. A client polls periodically for changes to a file, and may use an old version until it polls. There may be simultaneous conflicting updates.
    + The server keeps no state about clients (except hints, for performance). Each read request carries enough information to do the entire operation, i.e. Read(i-number, position). (A sketch follows below.)
      + When the server crashes and restarts, it can start processing requests again immediately.
    + All requests are "idempotent", i.e. they can be repeated with no ill effects.
      So if a message may be lost, it can simply be resent (and the operation possibly redone) without harm.
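    + A sketch of why such a request is idempotent (illustrative code, not actual NFS): the request itself names the file and an absolute position, so the server needs no per-client state, and re-executing the same request returns the same bytes.

        /* read_from_disk() is a hypothetical server-side routine. */
        struct nfs_read_req {
            unsigned long fhandle;  /* identifies the file (i-number) */
            unsigned long offset;   /* absolute position, not "current" */
            unsigned      count;    /* bytes wanted */
        };

        extern int read_from_disk(unsigned long fhandle,
                                  unsigned long offset,
                                  void *buf, unsigned count);

        int serve_read(const struct nfs_read_req *r, void *buf)
        {
            /* No seek pointer, no open-file state: safe to re-execute. */
            return read_from_disk(r->fhandle, r->offset, buf, r->count);
        }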