CS 162 Lecture Notes
Prof. Alan Jay Smith

Topic: Sharing Main Memory -- Segmentation and Paging

+ How do we allocate memory to processes?
+ 1. Simple uniprogramming with a single segment per process:
  + One program in memory at a time. (Can actually multiprogram by swapping programs.)
  + Highest memory holds OS.
  + Process is allocated memory starting at 0 (or J), up to (or from) the OS area.
  + Process always loaded at 0.
  + Examples: early batch monitors where only one job ran at a time, and all it could do was wreck the OS, which would be rebooted by an operator. Many of today's personal computers also operate in a similar fashion.
  + Advantages
    + Low overhead
    + Simple
    + No need to do relocation. Always loaded at zero.
  + Disadvantages
    + No protection - process can overwrite OS
      + which means it can get complete control of the system
    + Multiprogramming requires swapping the entire process in and out
      + Overhead for swapping
      + Idle time while swapping
    + Process limited to size of memory
    + CTSS ("Compatible" Time Sharing System), and how the system swapped users completely.
    + No good way to share - only one process at a time (can't even overlap CPU and I/O, since only one process is in memory).
+ 2. Relocation - load program anywhere in memory.
  + Idea is to use the loader or linker to load the program at an arbitrary memory address.
  + Note that the program can't be moved (relocated) once loaded. (WHY??)
  + This scheme (#2) is essentially the same as #1, but the ability to load at any address will be used in #3 below.
+ 3. Simple multiprogramming with static software relocation, no protection, one segment per process:
  + Highest or lowest memory holds OS.
  + Processes allocated memory starting at 0 (or N), up to the OS area.
  + When a process is initially loaded, link it so that it can run in its allocated memory area.
  + Can have several programs in memory at once, each loaded at a different (non-overlapping) address.
  + Advantages:
    + Allows multiprogramming without swapping processes in and out.
    + Makes better use of memory.
    + Higher CPU utilization due to more efficient multiprogramming.
  + Disadvantages
    + No protection - jobs can read or write others.
    + External fragmentation
    + Overhead for variable size memory allocation.
    + Still limited to size of physical memory.
    + Hard to increase the amount of memory allocated.
    + Programs are statically loaded - they are tied to fixed locations in memory. They can't be moved or expanded. If swapped out, a program must be swapped back to the same location.
+ 4. Dynamic memory relocation: instead of changing the addresses of a program before it's loaded, change the address dynamically during every reference.
  + Figure of a processor and a memory box, with a memory relocation box in between.
  + There are many types of relocation - to be discussed.
  + Under dynamic relocation, each program-generated address (called a logical or virtual address) is translated in hardware to a physical, or real, address. This happens as part of each memory reference.
    + Virtual (logical) address is what the program generates.
      + Virtual address space is the set of (legal) virtual addresses the program can generate.
    + Physical (real) addresses - set of addresses in physical memory.
      + Physical address space of a program - set of physical addresses it can get to.
      + Physical address space of the machine - set of addresses in physical memory.
  + Dynamic relocation leads to two views of memory, called address spaces. We have the virtual address space and the real address space. Each process has its own virtual address space.
With static relocation we force the views to coincide. In some systems, there are several levels of mapping.
+ Several types of dynamic relocation.
+ Base & bounds relocation:
  + Two hardware registers: a base register for the process, and a bounds register that indicates the last valid address the process may generate.
  + If 0 <= virtual address <= bounds, then real address = base + virtual address. (A sketch of this check appears at the end of this section.)
  + In parallel with the bounds comparison, the real address is generated by adding the virtual address to the base register.
  + This is a form of translation.
  + Discuss why the comparison is done in parallel.
  + On each memory reference, the virtual address is compared against the bounds register.
+ Advantages:
  + Each process appears to have a completely private memory of size equal to the bounds register plus 1.
  + Processes are protected from each other.
  + No address relocation is necessary when a process is loaded.
  + Task switching is very cheap when done between processes in memory - just reload the processor registers.
    + Higher overhead to load a process from disk.
  + Compaction is possible.
+ Disadvantages:
  + Still limited to size of main memory.
  + External fragmentation (between processes)
  + Overhead for allocating variable size spaces in memory.
  + Sharing difficult - only possible if bases & bounds overlap.
  + Only one "segment" - i.e. one region of memory.
  + New, special hardware needed for relocation.
  + Time to do relocation (it isn't free).
+ OS must be able to change the value of the relocation registers (why?).
  + OS loads a new process and sets its base and bounds registers.
  + OS schedules a process, and sets the base and bounds registers.
  + When tasks are switched, must be able to swap the base, bounds and PC registers simultaneously.
  + These imply that the OS must run with base and bounds relocation turned off - otherwise, it would affect itself when running. (Or it would need its own set of base and bounds registers.)
  + Use of base and bounds is controlled by a status bit, usually in the PSW or SSW, or a similar control register.
+ Users must not be able to change the values of the base and bounds registers.
  + Otherwise, there is no protection between users. They can trash others or the OS.
+ Problem: how does the OS regain control once it has given it up?
  + OS is entered on a trap (including SVC) or interrupt.
  + When the OS is entered, use of base and bounds must be disabled. (I.e. the bit in the PSW is reset.)
  + Typically, the trap handler loads new control register values.
+ Base & bounds is cheap -- only 2 registers -- and fast -- the add and the compare can be done in parallel.
+ Examples: CRAY-1. IBM 7040/7090.
+ Can consider three types of systems using base and bounds registers:
  + Uniprogramming - single user region. Bring a user in, and run him.
  + Multiprogramming with Fixed Partitions (OS/MFT) - partition memory into fixed regions (may be different sizes). A user goes into a region of the given size.
    + Not very flexible.
    + IBM OS circa 1965-68
  + Multiprogramming with Variable Partitions (OS/MVT) - partitions are dynamically variable.
    + IBM OS circa 1967-72.
  + Note that we can do any of the three above schemes without base and bounds registers - just load programs into a region at the appropriate base address.
+ Task Switching
  + We can now switch between processes very cheaply - we don't have to reload memory, just change the contents of the process control block (which now holds the values of the base and bounds registers).
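To make the base-and-bounds check concrete, here is a minimal sketch in C. All names are illustrative; the bounds register is assumed to hold the last valid virtual address (so the usable size is bounds + 1), and real hardware does the add and the compare in parallel rather than sequentially.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative base-and-bounds relocation state for one process. */
typedef struct {
    uint32_t base;    /* physical address where the process begins */
    uint32_t bounds;  /* last valid virtual address */
} bb_regs;

/* Returns true and fills *pa on success; false means trap to the OS.
 * Since va is unsigned, the "VA >= 0" half of the check is implicit. */
bool bb_translate(const bb_regs *r, uint32_t va, uint32_t *pa)
{
    if (va > r->bounds)
        return false;        /* protection violation: trap */
    *pa = r->base + va;      /* relocation is just an add */
    return true;
}
```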
+ We can also run processes which are not in memory - how?
  + Find an empty area of memory in which to place the process - how?
  + Remove one or more processes from memory, if necessary, in order to find space. (I.e. copy the removed processes to space on disk.)
  + Copy the new process (from disk) into memory.
  + If only one process fits in memory, have to wait for the swap to take place.
  + If several processes fit in memory, can swap one while executing another.
+ 5. Multiple segments - Segmentation.
  + Divide the virtual address space into several "segments".
    + This is not the same as the "segments" of linkers and loaders.
  + Use a separate base and bound for each segment, and also add protection bits (read, write, execute) and a valid bit. (Also will want a dirty bit.)
  + Each address now consists of a segment and an offset within the segment.
  + Each memory reference indicates the segment and offset in one or more of three ways:
    + Top bits of the address select the segment, low bits the offset. This is the most common, and the best.
    + Or, the segment is selected implicitly by the instruction (e.g. code vs. data, stack vs. data, which base register is used, or 8086 prefixes).
    + Or, the instruction specifies directly or indirectly a base register for the segment.
  + Subprograms (procedures, functions, etc.) can be separate segments.
  + Segments typically are associated with logical partitions of your process address space - e.g. code, data, stack. Or, each module or procedure can be a separate segment.
  + Need either a segment table or segment registers to hold the base and bounds for each segment.
    + Draw picture of segment table, with segment table entries.
  + Memory mapping procedure consists of table lookup + add + compare.
  + Example: PDP-10 with high and low segments selected by the high-order address bit.
+ Address translation for segmentation
  + Have a segment table - maps segment number to [segment base address, segment length (limit), protection bits, valid bit, reference bit, dirty bit].
    + This info is in the Segment Table Entry (STE).
    + Diagram of segment table.
    + Segment descriptor
  + Need some hardware to automatically map virtual (segment number, word number) to real address. (A sketch appears at the end of this section.)
    + Real address = segment_table(segment #) + word number.
    + Invalid if word_number >= limit, where limit is the segment length. (Note that we do the test without adding the bound to both sides.)
    + Also the valid bit must be on, and the permission bits must permit the access.
    + Need more hardware to make it go fast (discuss later).
  + Have a Segment Table Base Register (STBR) point to the base of the segment table (for the hardware to use).
  + Alternate approach - if there are a small number of segments, can have segment registers - one register per segment.
    + Can also multiplex a small number of segment registers among a large number of segments (as with the x86 architecture).
+ Advantages
  + Each process has its own virtual address space.
  + Protection between address spaces.
  + Separate protection between segments (R/W/E).
  + Virtual space can be larger than physical memory.
  + Unused segments don't need to be loaded. Can load segments as needed.
    + An attempt to reference a missing segment is called a segment fault.
    + Discuss segment faults later.
  + Can share one or more segments.
    + Sharing is tricky - we'll talk about this later.
  + Segments can be placed anywhere in memory that they fit.
  + Memory compaction is easy.
  + Segment sizes can be changed independently.
+ Disadvantages
  + Each segment must be allocated contiguously.
  + Segment size < memory size.
  + External fragmentation.
  + Overhead of allocating memory.
  + Need hardware for address translation.
  + Overhead (time/hardware) of doing address translation.
  + More complicated.
  + Space for the segment table.
+ Note that segment tables are usually 1-1 with processes. A segment table defines a process's address space.
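Here is a sketch of the segment-table translation just described, assuming the common top-bits-select-segment scheme. The 4-bit segment field and all names are illustrative, and the limit field is taken to hold the segment length.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEG_BITS  4                  /* top bits of the VA pick the segment */
#define OFF_BITS  28                 /* remaining bits are the offset */
#define NSEGS     (1u << SEG_BITS)

typedef struct {            /* segment table entry (STE) */
    uint32_t base;          /* physical base of the segment */
    uint32_t limit;         /* segment length; valid offsets are 0..limit-1 */
    bool     valid;         /* segment present in memory */
    bool     r, w, x;       /* protection bits */
} ste;

/* Hardware does lookup + compare + add; returns false => trap to OS. */
bool seg_translate(const ste table[NSEGS], uint32_t va,
                   bool is_write, uint32_t *pa)
{
    uint32_t seg = va >> OFF_BITS;
    uint32_t off = va & ((1u << OFF_BITS) - 1);
    const ste *e = &table[seg];

    if (!e->valid)          return false;   /* segment fault */
    if (off >= e->limit)    return false;   /* bounds violation: compare
                                               offset to limit, no add needed */
    if (is_write && !e->w)  return false;   /* protection fault */
    *pa = e->base + off;
    return true;
}
```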
+ What would happen if all processes shared the same segment table?
  + Protection is a problem.
  + Have the same problem as before - now we have to allocate shared virtual instead of shared physical memory.
+ When we switch processes, we reload the STBR (segment table base register), which changes the address space.
+ Processes vs. Threads
  + A process is a single flow of control associated one to one with an address space.
  + A thread is a single flow of control. There may be several threads within an address space.
  + Threads are considered lightweight, because the overhead of creating a thread is usually much less than that of creating a process. The cost to communicate between threads in the same address space is very low. The cost to communicate between different address spaces is high (e.g. pipe, file).
  + Threads in one address space share code and data. Threads do not usually share stack - usually each has its own. They can synchronize using P and V without involving the operating system.
    + Processes do not normally share, so P&V must use the OS as an intermediary.
  + To use threads, usually have constructs like fork, join, signal, wait, broadcast.
+ Managing segments:
  + Keep a copy of the segment table in the process control block (or, if the block is too small, associated with it).
  + When creating a process, define its segments in the segment table/PCB.
  + When the process is assigned memory, figure out where each segment goes, and put the base and bounds into the segment table.
  + Need a memory map, which maps memory to segments. (The segment table maps segments to memory.) Also called a core map.
  + When switching contexts, save the segment table or a pointer to it in the old process's PCB; reload it from the new process's PCB.
  + When a process dies, return its segments to the free pool.
  + When there's no space to allocate a new segment:
    + Compact memory (move all segments, update bases) to get all free space together.
    + Or, swap one or more segments to disk to make space (must then check during context switching and bring segments back in before letting the process run).
  + To enlarge a segment:
    + See if the space above the segment is free. If so, just update the bound and use that space.
    + Or, move the segment above this one to disk, in order to make the memory free.
    + Or, move this segment to disk and bring it back into a larger hole (or, maybe just copy it to a larger hole).
    + Or, move it down, if there is space below.
+ Can load segments only when needed.
  + Segment Fault - an attempt to reference a segment which is not present.
    + Trap to OS.
    + Find space for the segment - replace another one, if necessary.
    + Load the segment (remove other segments to make space, if necessary).
    + Set the valid bit to 1, and update the other entries in the STE.
    + Make the process ready.
+ Paging: the goal is to make allocation and swapping easier, and to reduce memory fragmentation.
  + Make all chunks of virtual memory the same size; call them pages. Typical sizes range from 512 bytes to 16K bytes.
  + Divide real memory into page frames, which are the same size as pages.
    + I will frequently be sloppy and say "page" when I mean "page frame".
  + The virtual address typically now consists of N bits, partitioned as K (page number) and N-K (byte within page).
  + For each process, a page table defines the base address of each of that process' pages. Each page table entry contains bits for the real address of the page, protection, valid, reference, and dirty bits.
    + Diagram of page table - see figure.
  + A page table base register points to the base of the page table.
  + Translation process: the page number always comes directly from the (virtual) address.
Since the page size is a power of two, no comparison or addition is necessary - just do a table lookup and bit substitution.
  + Diagram of translation process.
  + No limit field is needed or used (just overflow to the next page).
+ We will need a table (page map or core map) or memory map telling us who owns which page frame in memory. It points back to any page table that points to this page.
+ Not all of a process' memory has to be loaded into real memory. If one attempts to reference a location not in memory, it is prevented by a page fault - this condition is detected by the valid bit.
  + Same as before with segment fault.
  + Page fault - a trap condition. Detected by the hardware when the valid bit is off.
    + Trap to OS (trap, not interrupt).
    + OS finds a page frame (somehow - discussed later),
    + gets the page (reads it from disk),
    + updates the page table,
    + makes the process ready.
+ Pages and paging are used to produce a physical partitioning of the process address space and memory. There usually isn't any relation between page boundaries and what is in a page.
+ Advantages
  + Easy to allocate: keep a free list of available page frames and grab the first one.
  + No external fragmentation.
  + When combined with segmentation (discussed later): non-contiguous allocation of segments.
  + Permits a process to have a virtual space much larger than its physical space.
  + Permits pages to be loaded as/when needed.
+ Disadvantages
  + Internal fragmentation: page size doesn't match up with information size. The larger the page, the worse this is.
  + Hardware for address translation.
  + Time for address translation.
  + Page faults may cause considerable overhead.
    + What happens when we have a page fault (missing page)? - to be discussed later.
    + Need for a page replacement algorithm.
    + We need algorithms to decide when to move pages into and out of memory (discussed later).
  + Table space: if pages are small, the table space could be substantial. In fact, this is a problem even for normal page sizes: consider a 32-bit address space with 1K pages - that is 2^22 (about four million) page table entries per process. What if the whole table has to be present at once?
    + 1. Partial solution: keep base and bounds for the page table, so only large processes have to have large tables.
    + 2. Usual solution: make the page table two level (see figure 6). Map the high order bits through the first table, and the lower order page number bits through the second table. (A sketch appears at the end of this section.)
      + The first level table can be called the page directory, or segment table (confusing usage). The second level table is usually called the page table.
    + 3. Put user page tables in OS virtual memory - then unneeded pages are not allocated.
      + Note that this yields a 2 level page table - an address is mapped through the OS page table and then the user page table.
    + 4. Make the page table a hash table (done by IBM and HP).
      + Called an inverted page table.
+ Efficiency of access: even small page tables are generally too large to load into fast memory in the relocation box. Instead, page tables are kept in main memory and the relocation box only has the page table's base address. It thus takes one overhead reference for every real memory reference. If the page table is two level, it requires two extra references.
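To make the two-level scheme concrete, here is a minimal sketch. The 10/10/12 bit split (4K pages) and all names are illustrative, not any particular machine's layout; note that translation is pure lookup plus bit substitution, with no add or compare.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12                      /* 4K pages */
#define LVL_BITS   10                      /* 10 bits per level */
#define LVL_MASK   ((1u << LVL_BITS) - 1)

typedef struct {
    uint32_t frame;                        /* physical page frame number */
    unsigned valid : 1, dirty : 1, referenced : 1, writable : 1;
} pte;

typedef struct {
    pte *tables[1u << LVL_BITS];           /* NULL => no table for this range */
} page_dir;

/* Walk the two-level table; returns 0 on a page fault (caller traps to OS). */
int pt_walk(const page_dir *dir, uint32_t va, uint32_t *pa)
{
    uint32_t d = (va >> (PAGE_SHIFT + LVL_BITS)) & LVL_MASK;  /* directory index */
    uint32_t t = (va >> PAGE_SHIFT) & LVL_MASK;               /* table index */

    pte *table = dir->tables[d];
    if (table == NULL)   return 0;         /* unallocated region */
    if (!table[t].valid) return 0;         /* page fault */
    *pa = (table[t].frame << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
    return 1;
}
```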
+ Where are the page tables?
  + Page tables are referred to with either real addresses or OS virtual addresses.
  + They cannot be put where users can get to them.
    + Otherwise, users could change the values, which would bypass protection.
  + Page table entries are usually real addresses (including the addresses of the first level page table, and the PTBR).
    + Could have OS virtual addresses in the entries, which means that another level of translation is needed.
+ Is the OS paged?
  + Yes - the advantages for users also apply to the OS.
+ Can page tables be paged out?
  + Sure - why not?
  + But if page tables are in the OS's virtual memory, and page tables have OS address space virtual addresses in them, then translation of a user virtual address also requires OS virtual address translation.
    + This might require a recursive page fault.
    + Means that OS page tables must be in real memory and use real addresses.
  + The alternative is to put page tables in "real memory" and use real addresses. (I.e. have V=R.)
+ What can't be paged out?
  + This is called ``wired down''.
  + The code that brings in pages.
  + Pages for critical parts of the operating system. (Handling a page fault takes time.)
  + Some interrupt and trap handlers, including the code that starts up a process.
  + OS page tables.
  + Sensitive real time routines.
  + Pages currently undergoing I/O (i.e. I/O buffers).
+ Note how effective paging is for protection - you can only reference parts of memory which appear in your page table(s). The only parts that appear are those that you have access to.
+ Paging and segmentation combined
  + Diagram of segment table / page table mapping. In the segment table entry, put protection bits (read, write, execute) and a valid bit.
  + Each segment is broken into one or more pages.
  + Segments correspond to logical units: code, data, stack. Segments vary in size and are often large. Protection can be associated with segments.
  + Pages are for the use of the OS; they are fixed-size to make it easy to manage memory.
  + Going from paging to P+S is like going from a single segment to multiple segments, except at a higher level. Instead of having a single page table, have many page tables with a base and bound for each. Call the stuff associated with each page table a segment. (A sketch of the combined translation appears at the end of this section.)
  + Advantages:
    + Provides 2 level mapping (as did the page directory and page table). Makes page table size manageable.
    + Provides both a physical unit of management (page) and a logical unit of management (segment).
    + Effectively produces two dimensional addressing: [segment, address within segment].
    + Can grow and shrink segments individually, without interfering with other segments. Just add pages (which can be anywhere in memory).
      + Segmentation with no compaction or fragmentation problem.
    + Bounds checks on segments are handled by having the page not be valid (quantized to page size).
    + No page table for a segment which doesn't exist.
    + Can share a segment and/or a page.
    + Protection at the level of page and/or segment.
  + Disadvantages
    + More complicated than either segmentation or paging.
    + Overhead of 2 level mapping (time and hardware).
    + Overhead of both schemes.
    + Usual internal fragmentation problem, but if the page size is small compared to most segments, then internal fragmentation is not too bad.
+ Paging vs. Segmentation
  + A page is fixed size, a physical unit of information, used only for memory management; it is not visible to the programmer.
  + A segment is a logical unit (usually), visible to the user, of arbitrary size.
  + Note that the user may see (be aware of) segmentation. The user should not be aware of paging.
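Here is a sketch of the combined translation: the STE holds the protection bits and a pointer to that segment's page table, and the segment bound is enforced, quantized to page size, by how many pages are allocated. The 10/10/12 field split and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                 /* 4K pages */
#define SEG_SHIFT  22                 /* top 10 bits select the segment */
#define PAGE_MASK  ((1u << (SEG_SHIFT - PAGE_SHIFT)) - 1)

typedef struct { uint32_t frame; bool valid; } pte;

typedef struct {                      /* segment table entry */
    pte     *ptable;                  /* this segment's page table */
    uint32_t npages;                  /* pages allocated: the quantized bound */
    bool     valid, r, w, x;          /* protection lives at the segment level */
} ste;

/* Two-level map: segment table entry -> page table -> frame. */
bool ps_translate(const ste *segtab, uint32_t va, bool is_write, uint32_t *pa)
{
    uint32_t seg  = va >> SEG_SHIFT;
    uint32_t page = (va >> PAGE_SHIFT) & PAGE_MASK;
    uint32_t off  = va & ((1u << PAGE_SHIFT) - 1);
    const ste *s  = &segtab[seg];

    if (!s->valid || (is_write && !s->w))
        return false;                              /* segment-level check */
    if (page >= s->npages || !s->ptable[page].valid)
        return false;                              /* bound or page fault */
    *pa = (s->ptable[page].frame << PAGE_SHIFT) | off;
    return true;
}
```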
+ Can share at two levels: a single page, or a single segment (whole page table). Diagram of shared pages or shared segments (shared page table).
+ Does the shared region have to be at the same address in each process?
  + No - as long as it can be found.
+ Can the shared region contain any absolute addresses (i.e. virtual addresses)?
  + Usually not - very dangerous - the addresses may not be the same in each process.
  + But it can contain relative addresses - e.g. offsets to certain registers or to the segment base. Such registers can be loaded by each process differently.
  + If an entire segment is shared, and addresses are relative to the start of the segment, we are okay.
+ Copy on write.
  + Share pages, but with 2 separate page tables. Both page tables point to the same pages.
  + The pages are made read only. On an attempt to write, a copy is made.
+ Problem: how does the operating system get information from user memory? E.g. I/O buffers, parameter blocks. Note that the user passes the OS a virtual address.
  + 1. Use real addresses - in some cases the OS just runs unmapped. Then all it has to do is read the tables and translate user addresses in software.
    + Note: addresses that are contiguous in the virtual address space may not be contiguous physically. Thus I/O operations may have to be split up into multiple blocks. Draw an example.
  + 2. Can specify (somehow) that the data addresses are to use the user page tables (would need special hardware).
    + Note that we therefore need two active PTBRs - a user PTBR and a system PTBR.
  + 3. Have OS page tables point to user pages.
  + 4. A few machines, most notably the VAX, make both system information and user information visible at once (but you can't touch system stuff unless running with a special kernel protection bit set). This makes life easy for the kernel, although it doesn't solve the I/O problem.
    + I.e. the OS is in everyone's address space.
+ VAX Addressing
  + Another example: VAX.
  + The address is 32 bits; the top two bits select the segment. Four base-bound pairs define the page tables (system, P0, P1, unused).
  + Pages are 512 bytes long.
  + Read-write protection information is contained in the page table entries, not in the segment table.
  + One segment contains operating system stuff; two contain stuff of the current user process.
  + Potential problem: page tables can get big. We don't want to have to allocate them contiguously, especially for large user processes. The solution is to use the system page table to map the user page tables, so the user page tables can be scattered:
    + System base-bounds pairs are physical addresses; the system tables must be contiguous.
    + User base-bounds pairs are virtual addresses in the system space. This allows the user page tables to be scattered in non-contiguous pages of physical memory.
    + The result is a two-level scheme.
    + This is an alternative to the normal two level scheme. If the normal two level scheme were used, and if page tables were paged, it would actually be a four level scheme.
+ Inverted Page Table
  + The idea is that the page table is organized as a hash table. Hash from the virtual address into a table with a number of entries larger than the physical memory size. (The page table is shared by all processes.)
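A sketch of an inverted page table lookup follows: one entry per physical frame, probed through a hash of (process id, virtual page number), with collisions chained through the frames. The hash-anchor structure and the hash function are illustrative, not the detail of any IBM or HP machine.

```c
#include <stdint.h>

#define NFRAMES (1u << 16)            /* one entry per physical frame */

typedef struct {
    uint32_t pid, vpn;                /* who owns this frame, which page */
    int32_t  next;                    /* next frame in hash chain, or -1 */
} ipte;

static ipte    frames[NFRAMES];
static int32_t hash_anchor[NFRAMES];  /* hash bucket -> first frame in chain */

static uint32_t ipt_hash(uint32_t pid, uint32_t vpn)
{
    return (pid * 2654435761u ^ vpn) % NFRAMES;   /* any mixing hash works */
}

void ipt_init(void)
{
    for (uint32_t b = 0; b < NFRAMES; b++)
        hash_anchor[b] = -1;          /* all chains start empty */
}

/* Returns the physical frame holding (pid, vpn), or -1 => page fault. */
int32_t ipt_lookup(uint32_t pid, uint32_t vpn)
{
    for (int32_t f = hash_anchor[ipt_hash(pid, vpn)]; f != -1;
         f = frames[f].next)
        if (frames[f].pid == pid && frames[f].vpn == vpn)
            return f;
    return -1;
}
```

Note the size tradeoff: the table grows with physical memory, not with the virtual address space, which is why the hash organization is attractive for large address spaces.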
+ Problem with segmentation and paging: the extra memory references to access the translation tables can slow programs down by a factor of two or three. There are obviously too many translations required to keep them all in special processor registers.
  + But for small machines (e.g. PDP-11), can have one register for every page in memory, since they can only address 64 Kbytes.
+ Solution: Translation Lookaside Buffer (TLB), also called
  + Translation Buffer (TB) (DEC), or
  + Directory Lookaside Table (DLAT) (IBM), or
  + Address Translation Cache (ATC) (Motorola).
+ A TLB is used to store a few of the translation table entries. It's very fast, but only remembers a small number of entries. On each memory reference: (draw picture, explain name)
  + First ask the TLB if it knows about the page. If so, the reference proceeds fast.
  + If the TLB has no info for the page, the translator must go through the page and segment tables to get the info. The reference takes a long time, but the info for this page is given to the TLB so it will know it for the next reference (the TLB must forget one of its current entries in order to record the new one).
+ TLB organization: picture of a black box. Virtual page number goes in, physical page location comes out. Similar to a cache.
+ So what the TLB does is:
  + Accept a virtual address.
  + See if the virtual address matches an entry in the TLB.
  + If so, return the real address.
  + If not, ask the translator to provide the real address.
    + The translator loads the new translation into the TLB, replacing an old one (usually one not used recently).
    + (Must replace an entry in the same set.)
+ Will the TLB work well if it holds only a few entries, and the program is very big?
  + Yes - due to the Principle of Locality (Peter Denning).
+ Principle of Locality
  + 1. Temporal Locality - Information that has been used recently is likely to continue to be used.
    + Alternate formulation - the information in use now consists mostly of the same information as was used recently.
  + 2. Spatial Locality - Information near the current locus of reference is also likely to be used in the near future.
  + Example - the top of your desk is a cache for your file cabinet. If the desk is messy, the stuff on top is likely to be what you need.
  + Explanation - code is either sequential or loops. Data used together is often clustered together (array elements, stack, etc.).
+ In practice, TLBs work quite well. Typically they find 96% to 99.9% of the translations in the TLB.
+ The TLB is just a memory with some comparators. Typical size of the memory: 16-512 entries. Each entry holds a virtual page number and the corresponding physical page number. How can the memory be organized to find an entry quickly?
  + One possibility: search the whole table associatively on every reference. Hard to do for more than 32 or 64 entries.
  + A better possibility: restrict the info for any given virtual page to fall into a subset of entries in the TLB. Then we only need to search that set. Called set associative. E.g. use the low-order bits of the virtual page number as the index to select the set. Real TLBs are either fully associative or set associative. If the size of the set is one, it is called direct mapped. (A sketch appears at the end of this section.)
  + Diagram of set associative TLB.
  + Replacement must be in the same set.
+ The translator is a piece of hardware that knows how to translate virtual to real addresses. It uses the PTBR to find the page table(s), and reads the page table to find the page.
+ TLBs are a lot like hash tables except simpler (they must be, to be implemented in hardware). Some hash functions are better than others.
  + Is it better to use low page number bits than high ones to select the set?
    + Low ones are best: if a large contiguous chunk of memory is being used, all pages will fall in different sets.
+ Must be careful to flush the TLB during each context switch. Why?
  + Otherwise, when we switch processes, we'll still be using the old translations from virtual to real, and will be addressing the wrong part of memory.
  + Alternative - can make a process identifier (PID) part of the virtual address. Have a Process Identifier Register (PIDR) which supplies that part of the address.
+ When we modify the page table, we must either flush the TLB or flush the entry that was modified.
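Closing out the TLB discussion, here is a sketch of a set-associative lookup. The set is chosen by the low-order virtual page number bits, as recommended above; sizes, names, and the way-0 eviction in the refill are illustrative (real hardware compares all ways in parallel and evicts a not-recently-used way).

```c
#include <stdbool.h>
#include <stdint.h>

#define NSETS 64
#define NWAYS 4

typedef struct {
    uint32_t vpn, pfn;                 /* virtual -> physical page number */
    bool     valid;
} tlb_entry;

static tlb_entry tlb[NSETS][NWAYS];

bool tlb_lookup(uint32_t vpn, uint32_t *pfn)
{
    tlb_entry *set = tlb[vpn % NSETS];        /* low VPN bits pick the set */
    for (int w = 0; w < NWAYS; w++)
        if (set[w].valid && set[w].vpn == vpn) {
            *pfn = set[w].pfn;
            return true;                      /* TLB hit: fast path */
        }
    return false;   /* miss: translator walks the tables, then refills */
}

/* Refill after a miss; replacement must stay within the same set. */
void tlb_insert(uint32_t vpn, uint32_t pfn)
{
    tlb_entry *set = tlb[vpn % NSETS];
    set[0] = (tlb_entry){ .vpn = vpn, .pfn = pfn, .valid = true };
}
```

A context-switch flush in this model is just clearing every valid bit (or, with a PIDR, the pid would become part of the match).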
Topic: Demand Paging, Thrashing, Working Sets

+ So far we have disentangled the programmer's view of memory from the system's view using a mapping mechanism. Each sees a different organization. This makes it easier for the OS to shuffle users around and simplifies memory sharing between users.
+ However, until now a user process had to be completely loaded into memory before it could run. (Sort of - we mentioned page faults and segment faults, but...) This is wasteful, since a process may only need a small amount of its total memory at any one time (locality). Virtual memory permits a process to run with only some of its virtual address space loaded into physical memory.
  + Virtual address space, translated to either (a) physical memory (small, fast) or (b) disk (backing store), which is large but slow.
  + Backing storage is typically disk.
+ The idea is to produce the illusion that the entire virtual address space is in main memory, when in fact it isn't.
+ More generally, we have a multi-level (2 level in this case) memory hierarchy. We want to have the cost of the slower and larger level, and the performance of the smaller and faster level.
  + Diagram of a memory hierarchy, showing access times.
+ The reason that this works is that most programs spend most of their time in only a small piece of the code.
+ Principle of Locality - there are two parts.
  + Temporal Locality - the same information is likely to be reused.
  + Spatial Locality - nearby information is also likely to be used in the near future.
  + (Idea invented (?) by Peter Denning.)
+ If not all of a process is loaded when it is running, what happens when it references a byte that is only in the backing store? Hardware and software cooperate to make things work anyway.
  + First, extend the page tables with an extra bit: ``present'', or ``valid''. If present isn't set, then a reference to the page results in a trap. This trap is given a special name, page fault.
  + Page fault - an attempt to reference a page which is not in memory.
  + Diagram of a Page Table Entry (show real address, protection bits, valid/present bit, dirty bit, reference bit).
  + Any page not in main memory right now has the ``present/valid'' bit cleared in its page table entry.
+ When a page fault occurs:
  + Trap to OS (why?).
  + Verify that the reference is to a valid page; if not, abend.
  + Find a page frame in which to put the page:
    + Find a page to replace, if there is no empty frame.
    + If dirty, find a place to put the replaced page on secondary storage (can reuse the previous location).
    + Remove the page (either copy it back or overwrite it).
    + Update the page table.
    + Update the map of secondary storage if necessary (to show where we put the page).
    + Update the memory (core) map.
    + Flush the TLB entry for the page that has been removed.
  + Operating system brings the page into memory:
    + Find the page on secondary storage.
    + Transfer it.
    + Update the page table (set the valid bit and the real address).
    + Update the map of the file system/disk to show that the page is now in memory (e.g. update the cache of inodes).
    + Update the core map (memory map).
  + The process resumes execution. (I.e. it goes on the ready list; it actually resumes when next scheduled.)
  + Note that all of these steps take time. We may switch to another process while the I/O is taking place.
+ Multiprogramming is supposed to overlap the fetch of a page (or I/O) for one process with the execution of another.
  + If no process is available to run (all doing I/O or page faults), this is called multiprogramming idle or page fetch idle.
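The fault-handling sequence above can be summarized in code. This is a compilable sketch, not real kernel code: every helper named here (find_free_frame, choose_victim, and so on) is a hypothetical stand-in for OS routines whose details vary by system.

```c
#include <stdbool.h>
#include <stdint.h>

/* All of these are hypothetical stand-ins for real kernel code. */
struct process;
extern bool is_legal_page(struct process *p, uint32_t vpn);
extern void abend(struct process *p);
extern int  find_free_frame(void);              /* -1 if none free */
extern int  choose_victim(void);                /* replacement algorithm */
extern bool frame_is_dirty(int frame);
extern void write_to_backing_store(int frame);  /* may reuse old disk slot */
extern void invalidate_old_mapping(int frame);  /* old PTE off + TLB flush */
extern void read_from_backing_store(struct process *p, uint32_t vpn, int frame);
extern void set_pte(struct process *p, uint32_t vpn, int frame);
extern void update_core_map(int frame, struct process *p, uint32_t vpn);
extern void make_ready(struct process *p);

void handle_page_fault(struct process *p, uint32_t vpn)
{
    if (!is_legal_page(p, vpn))
        abend(p);                          /* reference to an invalid page */

    int frame = find_free_frame();
    if (frame < 0) {                       /* no empty frame: replace one */
        frame = choose_victim();
        if (frame_is_dirty(frame))
            write_to_backing_store(frame); /* clean pages are just dropped */
        invalidate_old_mapping(frame);
    }
    read_from_backing_store(p, vpn, frame); /* blocks; others may run */
    set_pte(p, vpn, frame);                /* valid bit on, real address in */
    update_core_map(frame, p, vpn);        /* core map points back at the PTE */
    make_ready(p);                         /* resumes when next scheduled */
}
```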
+ Page out - to remove a page.
  + Page out a process - remove it from memory.
  + Page in a process - load its pages into memory.
+ Continuing (resuming) the process is very tricky, since the page fault may have occurred in the middle of an instruction. We don't want the user process to be aware that the page fault even happened.
  + Can the instruction just be skipped?
  + Suppose the instruction is restarted from the beginning?
    + How is the ``beginning'' located?
    + Even if the beginning is found, what about instructions with side effects, like MOV (SP)+, 10?
  + Without additional information from the hardware, it may be impossible to restart a process after a page fault. Machines that permit restarting must have hardware support to keep track of all the side effects so that they can be undone before restarting.
    + Early Apollo approach for the 68000: two processors, one just for handling page faults.
    + IBM 370 solution: execute long instructions twice.
  + If you think about this when designing the instruction set, it isn't too hard to make a machine support virtual memory. It's much harder to do after the fact.
+ How many page faults can occur in one instruction?
  + E.g. the instruction spans page boundaries, and each of two operands spans two pages. Could have a 2 level page table, with one page of the page table needed to point to each instruction & data page.
+ Once the hardware has provided basic capabilities for virtual memory, the OS must implement 3 algorithms:
  + Page fetch algorithm: when to bring pages into memory.
  + Page replacement algorithm: which page(s) should be thrown out, and when.
  + Page placement algorithm: where to put the page in memory.
+ Note that the page placement algorithm for main memory is irrelevant - memory is uniform. (But the CRAY has non-uniform memory access time. Also not irrelevant for other parts of the memory hierarchy.)
+ Page Fetch Algorithms:
  + Demand paging: start the process up with no pages loaded, and load a page when a page fault for it occurs, i.e. wait until it absolutely MUST be in memory. Almost all paging systems are like this.
  + Request paging: let the user say which pages are needed. What's wrong with this?
    + Users don't always know best, and aren't always impartial. They will overestimate their needs. (Maybe mention overlays here, although overlays are even more draconian than request paging.)
    + Still need demand paging, in case the user doesn't remember to bring in the right page.
  + Prefetching, or prepaging: bring a page into memory before it is referenced (e.g. when one page is referenced, bring in the next one, just in case).
    + The reasons for prepaging are (a) to bring in several pages at once, cutting the per page overhead, and (b) to eliminate the real time delay in waiting for the page - overlap computation and fetch.
    + The idea is to guess at which page will be needed. Hard to do effectively without a prophet; may spend a lot of time doing wasted work. If used at all, typically one block lookahead - i.e. the next page.
    + Seldom works.
  + Can also do "swapping" ("working set restoration"), whereby when you start a process, you swap in most or all of its pages, or at least all of the pages it was using the last time it was running. When it stops, you swap out its pages in a bunch, on contiguous tracks on disk.
    + Also called working set restoration.
  + Overlays - a technique by which the user divides his program into segments. The user issues commands to load and unload the segments from memory; these commands specify the location in memory where the segments are placed.
Overlays are used when there is no virtual memory, and the user is given a partition of real memory to work with.
+ Page Replacement Algorithms:
  + Random (RAND): pick any page at random.
  + FIFO: throw out the page that has been in memory the longest. The ideas are: (a) it's simple, and (b) the first page that was fetched is believed to be no longer needed.
  + LRU (least recently used): use the past to predict the future. Throw out the page that hasn't been used in the longest time. If there is locality, then this is presumably the best you can do.
  + MIN (or OPT): as always, the best algorithm arises if we can predict the future.
    + Throw out the page that won't be used for the longest time into the future. This requires a prophet, so it isn't practical, but it is good for comparison.
+ Real and Virtual Time
  + Virtual time is time as measured by a running process - it doesn't include time that the process is blocked (e.g. for a page fault or another reason). Often in units of memory references.
  + Real time - time as measured by the wall clock. Includes time that the process is blocked (including page faults).
+ How to evaluate paging algorithms:
  + What are the costs of page faults?
    + CPU overhead for the page fault - handler, dispatcher, I/O routines (e.g. 3000 instructions).
    + Possible CPU (multiprogramming) idle while the page arrives.
    + I/O busy while the page is transferred.
    + Main memory (or cache) interference while the page is transferred.
    + Real time delay to handle the page fault.
  + Two approaches (metrics) for evaluating paging algorithms:
    + Curve of page faults vs. amount of space used - preferable. (Called the "parachor curve".)
    + Space time product vs. amount of space. Want to minimize the STP. (Show curve.)
      + Space time product (STP) - the integral of the amount of space used by the program over the time it runs. Includes time for page faults. This is the real space time product.
      + The exact formula is $STP = \int_0^E m(t)\,dt$, where $E$ is the ending time of the program and $m(t)$ is the memory used by the program at (real) time $t$.
      + In discrete time, $STP = \sum_{i=0}^{R} m(i)\,(1 + f(i) \cdot PFT)$, where $R$ is the ending time of the program in discrete time (i.e. the number of memory references), $i$ is the $i$'th memory reference, $m(i)$ is the number of pages in memory at the $i$'th reference, $f(i)$ is an indicator function (0 if no page fault, 1 if a page fault), and $PFT$ is the page fault time.
      + The first term is the virtual space-time product; the second term adds in the time for page faults.
      + The space time product can be computed approximately from the page fault vs. space curve: $STP \approx (F + pft \cdot (\text{number of page faults})) \cdot \bar{n}$, where $F$ is the virtual running time of the program, $pft$ is the time for a page fault to be handled, and $\bar{n}$ is the mean space occupied by the program.
      + The space time product depends on the PFT, so it is technology dependent. It also doesn't take into account the fact that the machine may not be idle when the page is being fetched.
+ Example: try the reference string 4, 3, 2, 1, 4, 3, 5, 4, 3, 2, 1, 5. Assume there are three or four page frames of physical memory. Show the memory allocation state after each memory reference.
  + Do this for MIN, LRU, FIFO - see figures.
  + Note the anomaly for FIFO (Belady's anomaly) - we would like the miss ratio to decline with increasing memory size, but with FIFO it can increase. (A small simulation appears at the end of this section.)
+ Stack Algorithm - an algorithm which obeys the inclusion property: the set of pages in a memory of size N at time t is always a subset of the set of pages in a memory of size N+1 at time t. Such an algorithm obviously cannot have the miss ratio increase with memory size.
  + The stack is the list of pages in order of the size of memory which includes them.
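The FIFO anomaly on the reference string above can be reproduced with a few lines of C. This complete program simulates FIFO replacement with 3 and then 4 frames; it reports 9 faults with 3 frames and 10 with 4, i.e. adding memory makes FIFO worse on this string.

```c
#include <stdio.h>

/* Count FIFO page faults for a reference string with nframes frames. */
static int fifo_faults(const int *refs, int n, int nframes)
{
    int frames[16], head = 0, used = 0, faults = 0;
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) hit = 1;
        if (hit) continue;
        faults++;
        if (used < nframes)
            frames[used++] = refs[i];    /* free frame still available */
        else {
            frames[head] = refs[i];      /* evict the oldest resident page */
            head = (head + 1) % nframes;
        }
    }
    return faults;
}

int main(void)
{
    const int refs[] = {4, 3, 2, 1, 4, 3, 5, 4, 3, 2, 1, 5};
    int n = (int)(sizeof refs / sizeof refs[0]);
    for (int f = 3; f <= 4; f++)
        printf("FIFO, %d frames: %d faults\n", f, fifo_faults(refs, n, f));
    return 0;
}
```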
+ Implementing LRU: need some form of hardware support in order to keep track of which pages have been used recently.
  + Perfect LRU? Keep a register for each page, and store the system clock into that register on each memory reference. To replace a page, scan through all of them to find the one with the oldest clock. This is expensive if there are a lot of memory pages.
  + Or, could use a linked list to maintain an "LRU stack". Note that we can see (by inspection) that with LRU, the miss ratio will never increase with an increasing number of pages in memory.
  + In practice, almost nobody implements perfect LRU. (The CDC-Star did.) Instead, we settle for an approximation that is efficient. Just find an old page, not necessarily the oldest.
    + LRU is just an approximation anyway, so why not approximate a little more?
+ Use bit (reference bit) - a bit in the page table entry (usually cached in the TLB) that is set when the page is referenced. It is turned off under OS control.
+ Clock algorithm: keep a ``use'' bit for each page frame; the hardware sets the bit for the referenced page on every memory reference. Have a pointer pointing at the k'th page frame. When a fault occurs, look at the use bit of the page being pointed to. If it is on, turn it off, increment the pointer, and repeat. If it is off, replace the page in that page frame, and set use(k)=1. (Clock diagram; a sketch appears at the end of this section.)
  + Also called FINUFO - first in, not used, first out.
+ In effect, the use bit, when used with the clock algorithm, breaks the pages into two groups: those "in use" and those "not in use". We want to replace one of the latter.
  + What does it mean if the clock hand is sweeping very slowly?
  + What does it mean if the clock hand is sweeping very fast?
+ Some systems also use a ``dirty'' bit to give some extra preference to dirty pages. This is because it is more expensive to throw out dirty pages: clean ones need not be written to disk.
  + What are the tradeoffs here?
    + The cost of a page fault declines - lower probability of writing out a dirty block.
    + The probability of a fault increases - i.e. if clock was a good algorithm, and we mess with it, it should get worse.
+ How would Least Frequently Used replacement work?
  + It would be a disaster, since locality changes.
+ A per process replacement algorithm, or local page replacement algorithm, or per job replacement algorithm, allocates page frames to individual processes: a page fault in one process can only replace one of that process' frames. This relieves interference from other processes.
  + If all pages from all processes are lumped together by the replacement algorithm, then it is said to be a global replacement algorithm. Under this scheme, each process competes with all of the other processes for page frames.
  + If you are using a local replacement algorithm, you have partitioned memory among the jobs or processes.
  + Local algorithm:
    + Protects jobs from others which are badly behaved.
    + Hard to decide how much space to allocate to each process.
    + Allocation may be unreasonable.
  + Global algorithm:
    + Permits the memory allocation for a process to shift over time.
    + Permits the memory allocation to adapt to process needs.
    + Permits a badly behaved process to grab too much memory.
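Here is the clock (FINUFO) algorithm from above as a sketch. The frame count and names are illustrative; use_bit[] stands for the hardware-set reference bits, and the new page's bit is set on replacement, matching the description.

```c
#include <stdbool.h>

#define NFRAMES 1024

static bool use_bit[NFRAMES];   /* set by hardware on each reference */
static int  hand;               /* the clock pointer (the k'th frame) */

/* Sweep until a frame with use bit off is found; used frames get a
 * second chance.  Returns the frame number to replace. */
int clock_choose_victim(void)
{
    for (;;) {
        if (use_bit[hand]) {
            use_bit[hand] = false;        /* recently used: spare it */
            hand = (hand + 1) % NFRAMES;
        } else {
            int victim = hand;            /* not used since the last sweep */
            hand = (hand + 1) % NFRAMES;
            use_bit[victim] = true;       /* the incoming page counts as used */
            return victim;
        }
    }
}
```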
+ Thrashing: a situation in which the page fault rate is so high that the system spends most of its time either processing a page fault or waiting for a page to arrive.
  + Thrashing means that there is too much page fetch idle - time when the processor is idle waiting for a page to arrive.
  + Suppose there are many users, and that between them their processes are making frequent references to 50 pages, but memory has 40 pages.
    + Each time one page is brought in, another page, whose contents will soon be referenced, is thrown out.
    + Compute the average memory access time.
    + The system will spend all of its time reading and writing pages. It will be working very hard but not getting anything done.
    + The progress of the programs will make it look like the access time of memory is as slow as disk, rather than the disk being as fast as memory.
  + Plot of CPU utilization vs. level of multiprogramming.
  + Thrashing was a severe problem in early demand paging systems.
+ Thrashing occurs because the system doesn't know when it has taken on more work than it can handle. LRU mechanisms order pages in terms of last access, but don't give absolute numbers indicating pages that mustn't be thrown out.
  + What do humans do when thrashing? If flunking all courses at midterm time, drop one.
+ Solutions to Thrashing:
  + If a single process is too large for memory, there is nothing the OS can do. That process will simply thrash. (Buy more memory.)
  + If the problem arises because of the sum of several processes:
    + Figure out how much memory each process needs. Change scheduling priorities to run processes in groups whose memory needs can be satisfied.
    + Shed load.
    + Change the paging algorithm.
+ Working Sets are a solution proposed by Peter Denning. An informal definition is:
  + Working set = ``the set of pages that a process is working with, and which must thus be resident if the process is to avoid thrashing.''
  + The idea is to use the recent needs of a process to predict its future needs.
  + Formally, ``exactly that set of pages used in the preceding T virtual time units'' (T is usually given in units of memory references).
  + Choose T, the working set parameter. At any given time, all pages referenced by a process in its last T virtual time units of execution are considered to comprise its working set.
  + The Working Set Paging Algorithm keeps in memory exactly those pages used in the preceding T time units.
  + Minimum values for T are about 10,000 to 100,000 memory references.
+ A process will never be executed unless its working set is resident in main memory. Pages outside the working set may be discarded at any time.
  + Note that this requires a reservoir of unassigned page frames.
+ Working set paging requires that the sum of the sizes of the working sets of the jobs eligible to run (which we will call the balance set) be less than or equal to the amount of space available. We previously referred to the balance set as the jobs in the in-memory queue.
  + Some algorithm must be provided for moving processes into and out of the balance set. What happens if the balance set changes too frequently?
    + Still get thrashing.
  + As working sets change, corresponding changes will have to be made in the balance set.
+ Working set also has the advantage over LRU that it adjusts the amount of space in use according to what the process needs. LRU works with a fixed amount of space, even though a process' needs change.
+ How do we implement working set? Can it be done exactly?
  + One of the initial plans was to store some sort of a capacitor with each memory page. The capacitor would be charged on each reference, then would discharge slowly if the page wasn't referenced. T would be determined by the size of the capacitor. This wasn't actually implemented.
    + One problem is that we want separate working sets for each process, so the capacitor should only be allowed to discharge when that particular process executes.
    + What if a page is shared?
  + Actual solution: take advantage of the use bits.
    + The OS maintains an idle time value for each page: the amount of CPU time received by the process since the last access to the page.
    + Every once in a while, scan all pages of a process. For each use bit that is on, clear the page's idle time. For each use bit that is off, add the process' CPU time (since the last scan) to the idle time. Turn all use bits off during the scan.
    + Scans happen on the order of every few seconds (in Unix, the scan interval is on the order of a minute or more).
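The use-bit scan just described can be sketched as follows. All names are illustrative; the page count, the caller's bookkeeping of CPU time since the last scan, and the in_ws flag are assumptions of this sketch, not part of any particular system.

```c
#include <stdbool.h>
#include <stdint.h>

#define NPAGES 1024

struct wspage {
    bool     use;     /* set by hardware when the page is referenced */
    uint32_t idle;    /* process CPU time since the page's last reference */
    bool     in_ws;   /* page currently counted in the working set */
};

/* Run every few seconds of the process' CPU time; T is the working set
 * parameter.  Pages idle longer than T fall out of the working set and
 * become candidates for reclamation. */
void ws_scan(struct wspage pages[NPAGES],
             uint32_t cpu_since_last_scan, uint32_t T)
{
    for (int i = 0; i < NPAGES; i++) {
        if (pages[i].use)
            pages[i].idle = 0;                 /* referenced this interval */
        else
            pages[i].idle += cpu_since_last_scan;
        pages[i].use = false;                  /* reset for next interval */
        pages[i].in_ws = (pages[i].idle <= T);
    }
}
```

Note that idle time is charged in the process' own CPU time, which is what gives each process a separate working set, answering the per-process problem raised for the capacitor scheme.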
+ What is the overhead of sampling the reference bits regularly?
  + Assume a sample every 10,000 memory references, a 40 Mbyte memory with 4K pages (about 10,000 pages), and 5 instructions, with 10 memory references, to sample one bit. Then about 100,000 memory references are required just to record the use bits - ten times the sampling interval itself, which is why scans are done only every few seconds.
+ Other questions about working sets and memory management in general:
  + What should T be?
    + What if it's too large?
    + What if it's too small?
    + Plot STP vs. T, and page fault rate vs. T.
  + What algorithms should be used to determine which processes are in the balance set?
  + How much memory is needed in order to keep the CPU busy? Note that under working set methods the CPU may occasionally sit idle even though there are runnable processes.
  + (How do we compute working sets if pages are shared?)
+ Working Set Restoration
  + The idea is that when we remove a process from the in-memory queue, we know what its working set is.
  + When we run the process again (i.e. promote it to the in-memory queue), we can restore the working set to memory all at once.
  + Advantages:
    + Minimize CPU overhead.
    + Don't have to wait for each page fault -> all transfers at once.
    + Can optimize the layout when writing out, and can fetch from consecutive locations.
      + Or can just sort the fetches, so that the average latency is much smaller.
+ A problem with working set is that even the approximate implementation above has a lot of overhead. Instead, Opderbeck and Chu created an algorithm called
  + Page Fault Frequency - let X be the virtual time since the last page fault for this process.
    + At the time of a page fault, [if X > T, remove all pages (of the process) with the use bit off]. Then get a page frame for the new page, and turn off all reference bits for the process.
    + The idea was to make this a quick and easy way to implement working set. The idea is that as long as the process is faulting too often (