6.097: OPERATING SYSTEM ENGINEERING

Fall 2002

Lab 4

Hand out date: Wednesday October 16th

Due date: Monday, October 28th (Exercises 1-4)

Due date: Thursday, October 31st (Exercise 5)

Due date: Thursday, November 7th (Exercise 6)

0. Clarifications and Corrections

exercise 1 -- compile problems
The source tarball may not compile correctly when you first untar it and copy over your lab 3 source files. Nevertheless, the compilation proceeds far enough so that you can answer the questions. By the time you finish exercise 1, you should have fixed the problems.
exercise 2 -- 0x30 equals decimal 48
Be certain to note the "0x" in "0x30". This number is in hex.
exercise 4 -- exercise4_test1()
Perhaps contrary to your intuition, this test case should not generate a fault. Do you see why?

In addition to running exercise4_test1() as is, edit it and run it as shown below.
```
void
exercise4_test1 (void)
{
  sys_cputs ((char *)1);
}
```
exercise 6 -- order of sys_mem_remap and sys_mod_perm
The instructions suggested ordering the sys_mod_perm before sys_mem_remap. In fact, the parent should remap before changing its own permissions. (Why?)

1. Introduction

In this lab, you will fill out the interface that the kernel exports to the user process. The first major addition will be the system call interface. Then you will implement a sophisticated page fault handling mechanism, which will allow a user process to handle its own page faults. Finally, the user process will use these new interfaces to implement fork() and IPC.

The lab consists of a series of exercises which build on top of each other, culminating in the final exercise: implementing fork(). Note! In order to encourage you to pace yourself appropriately, we have broken up the lab into three parts.

The first part is due in a week and a half on October 28th. This part lays the framework for building user processes and making system calls so that user processes can begin to do interesting things.
The second part is due on Thursday, October 31st. This part implements user page fault handling. It is critical that this part work correctly for the last part.
The final part is due on Thursday, November 7th. This exercise instructs you to write the code for fork(). This part involves getting many details right so be sure to budget in debugging time for this exercise.

It will be essential to test each exercise as you progress, otherwise the final exercise might be unusually time consuming. Even doing this, expect the final exercise to be challenging--particularly getting its test cases to run.

2. Hand-in Procedure

Put your answers to the exercises in ossrc/lab4_answers.html (do not link to any external files).

When you are finished coding and answering questions, make a hand-in tarball.

athena% cd ~/6.097/ossrc
athena% gmake handin
.
.
.
athena% ls -l handin.tgz

Make this file web accessible then send email containing a URL in the Subject line. Your email will be processed automatically, so be certain to follow this format.
A fool-proof way to accomplish this is to use the following command:
```
athena% mhmail 6.097-handin@pdos.lcs.mit.edu -subject http://web.mit.edu/PATH/TO/handin.tgz -body empty
```

Each week you should repeat the procedure above, appending your new answers to lab4_answers.html. This will allow us to see your progress and give you incremental feedback.

3. Getting Started

You should now download the code for the lab. Some files are absent from this tarball and must be copied form your lab 3 solutions.

Be careful not to overwrite your lab 3 solutions.

athena% add gnu 6.097 sipb
athena% cd ~/6.097
athena% mv ossrc lab3-solutions
athena% wget http://pdos.lcs.mit.edu/6.097/labs/lab4.tar.gz
athena% gtar -zvxf lab4.tar.gz
drwxr-xr-x cates/wheel       0 Sep 15 19:34 2002 ossrc/
drwxr-xr-x cates/wheel       0 Sep 15 19:34 2002 ossrc/kern/
drwxr-xr-x cates/wheel       0 Sep 15 19:34 2002 ossrc/kern/inc/
-rw-r--r-- cates/wheel    2528 Sep 14 17:00 2002 ossrc/kern/inc/asm.h
.
.
.
athena% cp lab3-solutions/kern/locore.S ossrc/kern
athena% cp lab3-solutions/kern/trap.c ossrc/kern
athena% cp lab3-solutions/kern/pmap.c ossrc/kern
athena% cp lab3-solutions/kern/env.c ossrc/kern
athena% cp lab3-solutions/kern/sched.c ossrc/kern
athena% cp lab3-solutions/kern/init.c ossrc/kern

4. Exercises

Exercise 1: Building Environments (Cleanly)

The tar ball contains new code (in ossrc/user/) and new makefiles for building user environments in a clean manner. They supplant the crude way of writing/building user environment in lab 3.

In that lab, machine code snippets from lab3.S were copied into the user's address space at UTEXT. The problem is that the code snippets were compiled as part of the kernel, and so they were linked at the kernel's link address. Thus, the code in lab3.S was constrained to being position independent.

Questions:

Explain how the new build system eliminates this constraint.
To do this, run these commands and examine the output of "gmake all".
```
athena% cd ossrc
athena% gmake clean
athena% gmake all
```
In your answer, list the steps that the source files in the ossrc/user/ undergo. How are they incorporated into the kernel image? Why is the ossrc/user/ code permitted to be position dependent?
ossrc/user/kenv0 is known as the idle process and it serves a function integral to the correct operation of the kernel: it spins in a loop. This turns to be highly constructive because while it's spinning, interrupts are enabled. The idle process gives the kernel something to do when all other processes are not ready to be run (perhaps waiting on I/O) without blocking device interrupts.

ossrc/user/simple is called the init process. The kernel makes no assumptions on the actions of this process.

Modify the code in i386_init() to create the idle process and the init process. Make sure the idle process is created as __envs[0].
Now modify scheduler code. The scheduler should continue to perform round-robing scheduling. However, the idle process should only be run when no other processes are runnable. You should assume (in fact, assert in the code) that the idle process is always runnable. Be certain not to create starvation problems.

Exercise 2: System calls

In this exercise, you will implement system calls. System calls are the mechanism by which a user process calls a sub routine in the kernel. Here's the control flow which you will implement:

the user process moves the system call arguments into specific registers
- EAX -- system call number
- EDX -- first argument
- ECX -- second argument
- EBX -- third argument
the user process executes "int $0x30"
the kernel trap handler dispatches to trap()
trap() calls dispatch_syscall() when tf_trapno == 0x30.
dispatch_syscall() dispatches to the correct C system call function, based on the system call number.
On the return path, the kernel places the result of the system call in the EAX register (of the trapframe) before returning back to the user process.

Skeletal code implementing these actions exists in inc/syscall.h and syscall.c. You must finish the coding.

The system calls sys_getenvid(), sys_cputu() and sys_cputs have been implemented for you. Once you have finished coding the kernel system call path, you are ready to run your system. You do not have to implement any additional system calls at this point. The init process will print a banner to the screen. Do not continue until the init process runs correctly. The test code in the init process is located in __main() in the file user/simple/libos.c.

Exercise 3: IPC

In this exercise you will implement a simple IPC mechanism which will transfer a single 4-byte word of data from one environment to another. A key feature (which your implementation be required to fulfill) is that every IPC send corresponds to exactly one IPC receive: no IPCs are ever lost or received more than once.

This mechanism is implemented with three fields in the Env structure: env_ipc_value, env_ipc_from, and env_ipc_blocked; and two system calls: sys_ipc_send and sys_ipc_unblock. The comments in the source code explain these fields and functions.

Implement the system calls sys_ipc_unblock() and sys_ipc_send().
Implement the user wrappers for these: ipc_send() and ipc_read() (in user/simple/libos.c).
Test your code by editing i386_init() to create two instances of the init process. Then edit the code for the init process, and change it to run the exercise3() test. The two init processes should continually ping pong an increasing counter back and forth over IPC.

Before continuing, revert i386_init() so that it creates the init process only once.

Background: for exercise 4 and 5

In lab 3, you should have coded your kernel so that any page fault is handled by the page_fault_handler() routine.

void
page_fault_handler (struct Trapframe *tf)
{
  u_int va = rcr2 ();
#if 0
  u_int env_id = curenv ? curenv->env_id : -1;
  printf ("%%%% [0x%x] Page fault (code 0x%x) for VA 0x%x (env 0x%x)\n"
          "   eip = 0x%x, cs = 0x%x, eflags = 0x%x\n",
          tf->tf_trapno, 0xffff & tf->tf_err, va, env_id,
          tf->tf_eip, 0xffff & tf->tf_cs, tf->tf_eflags);
  /* Only traps from user mode push %ss:%esp */
  if (tf->tf_err & FEC_U)
    printf ("   esp = 0x%x, ss = 0x%x\n", tf->tf_esp, 0xffff & tf->tf_ss);
#endif
}

As you know from lab 3, in response to an exception, the processor pushes parameters on to the kernel stack.

In particular, a page fault causes an error code to be pushed. The FEC_U bit can be tested. If it's set, the fault was generated while the processor was in user mode (i.e., the bottom 2 bits of the %cs are cleared). Otherwise, it was the kernel that faulted.

Only faults that transfer control from the user to the kernel cause the %esp and %ss to be pushed at the time of the exception. When the kernel faults, the trapframe fields tf_esp and tf_ss will contain random garbage.

Additionally, the processor puts the faulting address in the %cr2 register.

Exercise 4a: Kernel page faults

In this exercise, you will write the code that handles kernel page faults. The description of what you must implement follows.

The kernel global variable page_fault_mode controls the response of page_fault_handler() to a kernel page fault. It takes on two values: PFM_NONE and PFM_KILL.

PFM_NONE -- Normally, the kernel is not expected to generate a page fault, and if it does, the kernel must have a bug. This state is represented by PFM_NONE. If the page_fault_handler fields a kernel page fault and page_fault_handler is set to this value, it should panic. Essentially, this is your "blue screen of death".

PFM_KILL -- In specific places in the kernel, such as within the sys_cputs function, the page_fault_mode is set to PFM_KILL while the user pointer is accessed. If a fault occurs the page_fault_handler() checks the page_fault_mode, and if it is set to PFM_KILL it "blames" the user process for the fault, kills it and continues. Technically the kernel faulted, but it faulted on behalf the user process, so the user process is killed.

You should now implement this page fault handling strategy.

Exercise 4b: `sys_cputs()` continued

There is another type of malicious user behavior that the kernel must prevent. The user process is not permitted to read any memory location greater than or equal to VA ULIM. So, logically, if the user passes an address past ULIM to sys_cputs(), the kernel should not print the memory at that address.

You should now modify sys_cputs() again to protect against this malicious user behavior. Hint: use the trup macro in mmu.h. That macro is part of the kernel's strategy to protect itself against pointers passed to it by the user process.

Questions:

How long approximately did it take you to do the first four exercises?

Exercise 5: User page faults

In this exercise, you will further extend the user process so that it can handles its own page faults in an analogous way to which the kernel handles its page faults. User page fault handling requires a careful contract between the kernel and the user process which is detailed below.

The user process uses the system call sys_set_pgfault_handler() to specify the address of its page fault handler routine and the address of its exception stack.

void
sys_set_pgfault_handler (u_int func, u_int xstacktop)
{
  curenv->env_pgfault_handler = func;
  curenv->env_xstacktop = xstacktop;
}

Aside: you might notice the strong analogy to the hardware exception mechanism. The kernel tells the hardware the handler routine by setting up the IDT and sets up the TSS (ts.ts_esp0 = KSTACKTOP) to point to the exception stack.

When the user process page faults, the follow steps occur:

The kernel fields the page fault in locore.S, which calls trap() and eventually page_fault_handler().
page_fault_handler() should test the FEC_U bit of the error code.

If the user process faulted, page_fault_handler() should propogate the fault to the user process, by pushing paramters on the user's exception stack:

                <---- curenv->env_xstacktop
   spare 1
   spare 2
   spare 3
   spare 4
   spare 5
   eip
   eflags
   esp
   error code
   va            <----- user %esp

However, if the user's %esp (at the time of the page fault) falls within the user's exception stack (with is 4096 bytes in size), then the user's %esp is taken to be the user's exception stack pointer. In this case, the exception paramters are pushed on the exception stack at the point of the current value of the user's %esp. This permits recursive user page faults.

                      Recursive user fault.

      (user %esp must be within the user's exception stack!)        

                <---- user %esp at time of page fault
   spare 1            
   spare 2
   spare 3
   spare 4
   spare 5
   eip
   eflags
   esp
   error code
   va            <----- user %esp

The kernel then iret's to the user code location curenv->env_pgfault_handler. (You must figure out how to arrange for this to happen.) From there the user process can handle the fault as it sees fit.
The user process saves the caller saved registers (EAX, ECX, EDX) in the spare slots (3, 4, and 5 -- see above) of the of the exception stack.
The user process calls into a C routine to handle the page fault.
The C routine may chose to print and error message and kill the process. Or it may handle the fault, and chose to continue running the process. If the latter choice is taken, the C routine returns to the assembly page fault handler.
The page fault handler restores the state of the registers to the state current at the time of the page fault. Thus, the user process resumes execution at the point of the fault, re-executing the faulting instruction.

Some code exists to do what has been described above. You should finish it. Among other things, this work will include implementing the sys_mem_alloc system call. Then you should run the test cases provided.

The test cases do not generate recursive user page faults (i.e., a page fault from within the user's page fault handler). You might wish to test this functionality yourself. It's worth spending extra effort now to get rid of any lurking bugs. Exercise 6 is difficult enough by itself.

Questions:

How long approximately did it take you to do this exercise?

Exercise 6: copy-on-write `fork()`

In this exercise, you will implement fork(). This function will build on top of the mechanisms you have implemented in the previous exercises.

fork() clones the parent process at the time it is called, creating a child process whose memory contents are identical to the parent. After the fork, any subsequent memory writes by the parent modify its memory only--not the child's memory. And similarly, subsequent writes by the child are not visible to the parent.

These two properties are the essence of fork(), so let's point them out again:

Property 1 -- Initially parent and child have identical memory contents
Property 2 -- Subsequent memory writes are only visible locally to the writer

A simple implementation of fork() is to copy all of parent's memory into the child. It's easy to see how it satisfies both properties 1 and 2. However, this implementation is inefficient. The text segment, for example, is identical for the parent and child, so it seems wasteful to make a second copy of it for the child.

Your implementation of fork() addresses these objections by employing virtual memory tricks. Instead, of copying all the parent's memory into the child, the parent just maps it all into the child. This is far more efficient than a copying implementation of fork(), because it takes far fewer CPU cycles to map memory (you just need to update the page table entries) than to copy it. It's trivial to see that property 1 is satisfied. However property 2 is not. When the child writes to memory, say to a global variable, the parent will see that write when it reads that global variable.

Property 2 will be satisfied by using well-known technique called copy-on-write. This copy-on-write technique is actually just a slight tweak to the mapping implementation of fork outlined in the paragraph above.

The parent will still map all of its memory into the child's address space. However, any pages that were previously writeable (PG_W) in the parent's address space, are marked PG_COW (and not PG_W), for both the parent and the child. (note: Property 1 is still satisfied at this point.) Then both the parent and child can be executed.

If either process writes to memory, it will fault (since PG_W bits are off). Then, the user level fault handler, handles the fault, and checks if the fault was to a PG_COW page. If so, the handler allocates a new page, copies the content of read-only page to it, and maps the new page with PG_W permission into the faulting memory location. The user fault handler then resumes execution at the instruction which caused the fault. When, the faulting instruction is re-executed it will not fault; it will write the memory successfully.

The key is that as on-demand writes to PG_COW pages happen, the user process clones the page. Since, new memory is allocated for the cloned pages, writes are only visible to the process doing the write. Hence, property 2 is satisfied.

To summarize, the strategy for implementing fork:

Map parent's memory into child. This satisfies property 1, without the need to resort to inefficent copying.
Perform copy-on-write to satisfy property 2. It is also efficient, since only those pages that are actually written to are copied, not all writeable pages.

Here's the control flow for fork():

The parent calls sys_env_alloc() to allocate a child environment. (The child's initial status is ENV_NOTRUNNABLE.)
For each writable page in its address space (below UTOP), the parent remaps the page copy-on-write into the address space of the child and then marks the page as copy-on-write in its own address space (i.e., adds PG_COW to the PTE's and removes PG_W).
The exception stack is excepted from this. It should be left writeable in the parent's address space. The exception stack is what implements the copy-on-write handling, so it shouldn't be subject to it.
The parent then allocates an exception stack for the child. The page should be mapped writeable in the child's address space.
Note: the exception stack is a violation of property 1. The parent and the child have their own. This is very important. Serious problems result from the parent and child sharing the same physical memory for their exception stacks.
The child is ready to run, so the parent sets its status to ENV_OK.

Here's the control flow for the user's exception handler:

kernel propogates user page faults to user's asm_pgfault_handler
asm_pgfault_handler calls pgfault_handler -- the user's C routine
pgfault_handler checks and realizes the fault is a write (check the FEC_WR bit of the error code) to a page marked PG_COW, so it calls handle_cow_fault.
handle_cow_fault handles the copy-on-write by:
1. allocating a new page
2. copying NBPG from va&~PGMASK to the new page
3. insert the new page with PG_W bit set at va&~PGMASK
returns back to pgfault_handler
returns back to asm_pgfault_handler
returns back to faulting user instruction

Ok, now you have to implement everything described above. To help you out, we've roughly outlined the work in a series of steps:

Start by implementing the system calls fork() needs: sys_mem_remap, sys_mod_perms, and sys_env_alloc. You'll need to add (some of) these system calls to kern/inc/syscall.h. Also, you'll notice that sys_mem_remap takes one more argument than is currently supported by the syscall mechanism. Change the syscall mechanism to pass this last argument in a register of your choosing.
Now implement fork(), asm_pgfault_handler(), pgfault_handler(), and handle_cow_fault() as described above (all are in simple/libos.c). Refer to the function dump_va_space() in that file. It demonstrates some useful techniques.
Debug your coding. You should probably do this by making a set of small test cases, each of which tests a very specific part of your new code.
When your code is ready for prime time, re-run the pingpong test. But now, instead of the kernel creating the init process twice, the init process will fork a child itself. Then the parent and the child will ping pong to each other.
A second more demanding test is the sieve test case. It implements the prime number seive of Erosathenes.
Now, that your code works, it's time to address that "detail" omitted from handle_cow_fault. Consider the following scenario:
1. fork() is executed
2. parent writes to a global variable, and so does COW fault
3. subsequently, child writes to the same global varible, and now it faults, because the corresponding PTE has the PG_COW bit set.
4. in handle_cow_fault, the child is the last process refering to the physical page in question (pp_refcnt == 1). In this case it is unnecessary to clone the page, the child should just use sys_mod_perm to change the perms from PG_COW to PG_W.
You should implement the optimization suggested in the above scenario.
re-run the ping pong and sieve test cases

Questions:

How long approximately did it take you to do this exercise?

This completes the lab.

Version: $Revision: 1.12 $. Last modified: $Date: 2002/11/07 06:43:50 $

0. Clarifications and Corrections

1. Introduction

2. Hand-in Procedure

3. Getting Started

4. Exercises

Exercise 1: Building Environments (Cleanly)

Exercise 2: System calls

Exercise 3: IPC

Background: for exercise 4 and 5

Exercise 4a: Kernel page faults

Exercise 4b: sys_cputs() continued

Exercise 5: User page faults

Exercise 6: copy-on-write fork()

Exercise 4b: `sys_cputs()` continued

Exercise 6: copy-on-write `fork()`